ALL Metrics
-
Views
-
Downloads
Get PDF
Get XML
Cite
Export
Track
Research Article

NG-Tax, a highly accurate and validated pipeline for analysis of 16S rRNA amplicons from complex biomes

[version 1; peer review: 2 approved with reservations, 1 not approved]
* Equal contributors
PUBLISHED 22 Jul 2016
Author details Author details
OPEN PEER REVIEW
REVIEWER STATUS

Abstract

Background
Massive high-throughput sequencing of short, hypervariable segments of the 16S ribosomal RNA (rRNA) gene has transformed the methodological landscape describing microbial diversity within and across complex biomes. However, several studies have shown that the methodology rather than the biological variation is responsible for the observed sample composition and distribution. This compromises true meta-analyses, although this fact is often disregarded.
Results
To facilitate true meta-analysis of microbiome studies, we developed NG-Tax, a pipeline for 16S rRNA gene amplicon sequence analysis that was validated with different mock communities and benchmarked against QIIME as the currently most frequently used pipeline. The microbial composition of 49 independently amplified mock samples was characterized by sequencing two variable 16S rRNA gene regions, V4 and V5-V6, in three separate sequencing runs on Illumina’s HiSeq2000 platform. This allowed evaluating important factors of technical bias in taxonomic classification: 1) run-to-run sequencing variation, 2) PCR–error, and 3) region/primer specific amplification bias. Despite the short read length (~140 nt) and all technical biases, the average specificity of the taxonomic assignment for the phylotypes included in the mock communities was 96%. On average 99.94% and 92.02% of the reads could be assigned to at least family or genus level, respectively, while assignment to ‘spurious genera’ represented on average only 0.02% of the reads per sample. Analysis of α- and β-diversity confirmed conclusions guided by biology rather than the aforementioned methodological aspects, which was not the case when samples were analysed using QIIME.
Conclusions
Different biological outcomes are commonly observed due to 16S rRNA region-specific performance. NG-Tax demonstrated high robustness against choice of region and other technical biases associated with 16S rRNA gene amplicon sequencing studies, diminishing their impact and providing accurate qualitative and quantitative representation of the true sample composition. This will improve comparability between studies and facilitate efforts towards standardization.

Keywords

16S rRNA amplicon analysis, microbial community analysis, microbial ecology, next-generation sequencing, bioinformatic pipeline

Background

Recent advances in massive high-throughput, short-amplicon sequencing are revolutionizing efforts to describe microbial diversity within and across complex biomes1. Cultivation-independent whole metagenome sequencing has received increasing attention in the functional characterization of individual communities. These efforts, however, remain relatively expensive on a per sample basis, and the richer but much more unstructured information content requires complex data modelling and analysis procedures2. Therefore targeted surveys for specific taxonomic marker genes, such as the 16S ribosomal RNA (rRNA) gene3,4, remain essential in many microbial ecological studies. These surveys rely on sequencing of short, PCR amplified, hypervariable subregions rather than of the full-length gene, mostly for reasons of throughput, sequence depth and cost-efficiency.

There have been great efforts to address the accuracy and reproducibility of findings from 16S rRNA gene amplicon sequencing studies through increased levels of standardization, and software pipelines provide comprehensive protocols to analyze microbial ecology datasets. However, these efforts have arguably enhanced replicability rather than reproducibility, by providing widely adopted defaults5. To this end, Drummond6 suggested that exact replication of an experiment (i.e., replicability) is less informative (although a necessary pre-requisite for any scientific endeavour) than the corroboration of findings by reproduction in different independent setups (i.e., reproducibility)7, because biological findings that are robust to independent methodologies are arguably more dependable than any single-track analysis5. This distinction is highly relevant for the field of microbial ecology, where replicability is often confused with reproducibility, which is apparent from many often non-interchangeable methodologies.

Accuracy can typically be evaluated by the addition of positive controls. Generally these are synthetic or mock communities (MCs) consisting of phylotypes that, ideally, are representative of the ecosystem of interest. MCs allow researchers to answer two essential questions concerning accuracy. 1) Do I retrieve the number of species I put in, and if so are they correctly assigned? 2) How well does the sequencing and data analysis procedure reproduce species relative abundances? Reproducibility can be evaluated by comparing separate sequencing runs and different primer pairs that cover distinct 16S rRNA gene regions. Although replicability is often achieved, accuracy has been shown to be challenging especially at higher taxonomic resolution such as at genus level8,9.

Central to all 16S rRNA gene amplicon studies are Operational Taxonomic Units (OTUs). These are often regarded as the synthetic proxy for microbial species and are typically clustered at 97% sequence similarity. However, the prokaryotic species definition remains a hotly debated topic without any satisfying solution so far1012. Moreover, the 97% sequence similarity threshold is based on the complete 16S rRNA gene (~1500 nt), and although sequence variability is not evenly distributed it is routinely applied to short reads of 100–500 nt. Different regions would therefore require their own species level cut-off. This combination of an ambiguous prokaryotic species definition and its application to short reads, is the foundation for many complications regarding ‘correct’ OTU clustering. Hence there is little consensus on key experimental choices such as primers, targeted variable regions and OTU picking/clustering algorithms. Each of these technical aspects generate biases, and different methods produce clearly distinct results, leading to a situation where results of current studies cannot be easily compared or extrapolated to other study designs. Therefore there is a strong need for standardization.

Historically, 16S rRNA gene sequences generated in a project were initially clustered de novo into OTUs at >97% sequence similarity using various clustering algorithms, mostly because available 16S rRNA gene reference databases were thought to provide insufficient coverage1316. Although new clustering algorithms that reduce the influence of clustering parameters, such as a hard cutoff for cluster similarity, have been specifically developed for amplicons17, cluster generation is context-dependent, i.e. different datasets generate different clusters, and different algorithms may produce different end-results5,18. Therefore, even though the same analysis framework is used, independent studies remain incomparable at OTU level. Consequently, reference-based OTU clustering has received increasing attention, due to the need for standardization, and because de-novo OTU clustering for very large datasets, such as those generated by Hiseq and Miseq sequencers has become computationally very intensive, unless greedy heuristics are employed which suffer from the problems described above. With reference-based OTU clustering, sequences are mapped to pre-clustered reference sets of curated 16S rRNA gene sequences, provided by dedicated databases such as the Ribosomal Database Project (RDP), Greengenes and SILVA1921. The consequence of this approach is that the ‘quality’ of the clustering of the reference set propagates to reference-picked OTUs. Clustering has limited robustness5,18,22, and unbalances in databases due to over- or under-representation of certain species as well as error hotspots that are not necessarily matched to the variable regions23, can potentially lead to a biased cluster formation, driven by non-biological factors. These effects have been previously ignored or underestimated in reference OTU picking protocols5.

Another essential experimental choice concerns the selection of targeted variable region, because it should represent the sequence variability encountered with the full-length gene. Despite several studies comparing the performance of diverse regions, sequence lengths, sequencing platforms and taxon assignment methodologies, both within and across laboratories2329, there still is no standard or consensus of best choices for variable regions. There are several factors that can lead to the commonly observed highly region-specific behaviour across datasets: 1) PCR bias of varying degrees23,27,30, 2) different regions are associated with different error profiles and different rates of chimera formation23,31, and 3) the actual variation contained in the sequence is dissimilar (e.g. some regions are not variable enough to differentiate between genera, while others are), which in turn can affect clustering5.

Apart from the use of a diverse range of primers and OTU picking protocols that can cause differences in results between studies and/or laboratories, sequencing error is a third important factor that defines data quality. Massive high throughput, short read length sequencing platforms have not been developed for amplicon sequencing but rather for whole genome sequencing, where sequence errors in individual reads is less important. However, in 16S rRNA gene amplicon sequencing every sequencing error could potentially lead to the false discovery of a new species. To avoid overestimation of microbial diversity, stringent quality filtering is therefore considered essential9.

Methodology rather than biology has often been shown to be the largest driver of variation in microbiome studies5,18,23,27,29,3234, and this aspect of amplicon sequencing is increasingly addressed in literature. Nevertheless, a satisfactory solution has not been found. To address the aforementioned challenges we have applied several recommendations from literature to validate high throughput, high-resolution microbiota profiling, using Illumina Hiseq2000 101nt paired end sequencing data as a test case. We implemented redundancy by sequencing two tandem variable 16S rRNA gene regions in parallel (V4 and V5-V6). To find optimal filtering settings and to empirically determine the noise floor, multiple standardized mock communities specifically designed to tackle issues associated with filtering parameter optimization were added to each sequencing run. To our knowledge a similar setup with multiple MCs and different variable regions has not been applied to datasets generated by the Illumina platform. This set-up has enabled us to develop NG-Tax, a pipeline that better accounts for error associated with a range of technical aspects of 16S rRNA gene amplicon sequencing. NG-Tax will improve comparability by removing technical bias and facilitate efforts towards standardization, by focusing on reproducibility as well as accuracy. To assess the performance we benchmarked the results obtained with NG-Tax with results obtained with QIIME35, a common pipeline used for the analysis of this type of data.

Results and discussion

NG-Tax layout

NG-Tax consists of three core elements, namely barcode-primer filtering, OTU-picking and taxonomic assignment (Figure 1).

fae3b7fa-a20a-431e-bb9b-ee0d422e7349_figure1.gif

Figure 1. NG-Tax flow-diagram.

The steps that are performed per sample are depicted in grey boxes. Optional steps are indicated with dashed lines.

Barcode-Primer filtering. In a first step, paired end libraries are combined, and only read pairs with perfectly matching primers and barcodes are retained. To this end, both primers are barcoded to facilitate identification of chimeras produced during library generation after pooling of individual PCR products.

OTU picking. For each sample an OTU table is created with the most abundant sequences, using a minimum user defined relative abundance threshold. In this particular study we employed a threshold of 0.1% minimum relative abundance. Lowering the threshold will lead to the acceptance of low abundant OTUs, with an increased probability of these OTUs being artifacts due to sequencing and PCR errors. Abundance thresholds are commonly used to remove spurious OTUs generated by sequencing and PCR errors8,36, but previous studies applied a fraction threshold defined by the complete dataset under study, thereby ignoring sample size heterogeneity which may lead to under-representation of asymmetrically distributed OTUs.

Commonly employed quality filtering parameters based on Phred score, such as minimum average Phred score, maximum number of ambiguous positions, maximum bad run length, trimming and minimum read length after quality trimming, are not utilized in NG-Tax because quality scores from the Illumina base caller have been shown to be of limited use for the identification of actual sequence errors for 16S rRNA gene amplicon studies9,37. Additionally, these quality scores only check for errors that occurred during sequencing, but do not account for other sources of error, such as PCR amplification, whereas quality filtering by abundance is sensitive to any source of error. Moreover, the application of global parameters (e.g. average Phred score) ignores that error is sequence-specific, and hence some sequences could be affected more than others. If a species specific amplicon is more prone to PCR or sequencing errors, the relative abundance of that particular species will be underestimated. To compensate for this potential bias, discarded reads are clustered to the OTUs with one mismatch.

Finally, all OTUs are subjected to non-reference based chimera checking according to the following principle: given three OTUs named A, B and C, C will be considered a chimera when the following conditions are satisfied: C and A 5’ reads are identical, C and B 3’ reads are identical and both OTUs, A and B, are at least twice as abundant as OTU C. A complete overview of the number of sequences retained in both pipelines, i.e. NG-Tax and QIIME, as well as the final number of OTUs, is provided in Dataset 1.

Taxonomic assignment. In the current version of NG-Tax, taxonomy is assigned to OTUs utilizing the uclust algorithm16 and the Silva_111_SSU Ref database, containing 731,863 unique full length 16S rRNA gene sequences. To ensure maximum resolution and avoid the risk of errors due to clustering-associated flaws (e.g. reference sequence error hotspots, overrepresentation of certain species and lack of robustness in cluster formation by clustering algorithms), we use the non-clustered database. To speed up the procedure by several orders of magnitude, 16S rRNA gene sequences from the reference database are trimmed to contain only the region amplified by the primers. For each OTU, a taxonomic assignment is retrieved at six different identity thresholds levels (100%, 98%, 97%, 95%, 92% and 90%) and at two taxonomic levels (genus and family). The final taxonomic label is determined by the assignments that show concordance at the highest taxonomic resolution.

Validation

Datasets

Our main objective was to develop a pipeline that accurately reproduces the synthetic MCs and also reduces the impact of experimental choices on the results. To achieve this goal, four synthetic communities of varying complexity were created, consisting of 16S rRNA gene amplicons of phylotypes (PTs) associated with the human GI-tract (Table 1). This specific setup limited the likelihood of overfitting to a particular OTU composition or distribution and allowed us to assess (1) the quantification potential, (2) noise floor and (3) the effect of richness and diversity on quality filtering parameters, thus ensuring a higher fidelity with biological samples than by using a single MC. As a reference, to assess the quality of the taxonomic classifications, full length sequences for all PTs were obtained through Sanger sequencing. Expected MCs were created by trimming the full length sequences to the sequenced region. MC1 and MC2 consisted of equimolar amounts of 17 and 55 PTs, respectively. MC3 contained 55 PTs in staggered concentrations typical for the human GI-tract, and MC4 included 50 PTs with relative abundances ranging between 0.001 and 2.49%. To account for pipetting errors, each of the four MCs was produced in triplicate. To design a pipeline that puts more focus on biology, these 12 MC templates were used to sequence the MCs with different conditions that cover most of the technical bias associated with 16S rRNA gene amplicon studies reported in literature. To this end, we 1) targeted either region V4 or region V5-V6, 2) used four PCR protocols differing in the number of PCR cycles and reaction volumes 3) PCR products were analysed in three different sequencing runs and in seven different libraries, and 4) two different library preparation protocols (with and without an extra amplification of 10 cycles) were applied (Table 2). In addition the sequencing depth ranged from 2363 to 335822 reads per sample (Dataset 1).

Table 1. Composition of the four MCs in relative abundance.

Taxonomy of 55 reference sequences used in this study was based on classification of the full length reference sequences according to the RDP classifier.

RDP Naive Bayesian rRNA Classifier Version 2.10, October 2014MC1MC2MC3MC 4
1 BacteriaActinobacteriaActinobacteriaActinomycetalesNocardiaceae Rhodococcus 1.821.160.001
2 BacteriaActinobacteriaActinobacteriaActinomycetalesMicrococcaceae Micrococcus 1.821.160.01
3 BacteriaActinobacteriaActinobacteriaBifidobacterialesBifidobacteriaceae Bifidobacterium 5.881.821.160.10
4 BacteriaActinobacteriaActinobacteriaBifidobacterialesBifidobacteriaceae Bifidobacterium 1.821.162.49
5 BacteriaActinobacteriaActinobacteriaBifidobacterialesBifidobacteriaceae Bifidobacterium 1.821.16
6 BacteriaActinobacteriaActinobacteriaBifidobacterialesBifidobacteriaceae Bifidobacterium 1.821.16
7 BacteriaActinobacteriaActinobacteriaBifidobacterialesBifidobacteriaceae Bifidobacterium 1.821.16
8 BacteriaActinobacteriaActinobacteriaBifidobacterialesBifidobacteriaceae Bifidobacterium 1.821.16
9 BacteriaActinobacteriaActinobacteriaBifidobacterialesBifidobacteriaceae Bifidobacterium 1.821.16
10 BacteriaActinobacteriaActinobacteriaBifidobacterialesBifidobacteriaceae Bifidobacterium 1.821.16
11 BacteriaBacteroidetesBacteroidiaBacteroidalesBacteroidaceae Bacteroides 1.821.162.49
12 BacteriaBacteroidetesBacteroidiaBacteroidalesBacteroidaceae Bacteroides 1.821.162.49
13 BacteriaBacteroidetesBacteroidiaBacteroidalesBacteroidaceae Bacteroides 1.821.162.49
14 BacteriaBacteroidetesBacteroidiaBacteroidalesBacteroidaceae Bacteroides 1.821.162.49
15 BacteriaBacteroidetesBacteroidiaBacteroidalesBacteroidaceae Bacteroides 1.824.072.49
16 BacteriaBacteroidetesBacteroidiaBacteroidalesBacteroidaceae Bacteroides 5.881.821.162.49
17 BacteriaBacteroidetesBacteroidiaBacteroidalesPorphyromonadaceae Parabacteroides 5.881.824.650.001
18 BacteriaBacteroidetesBacteroidiaBacteroidalesPrevotellaceae Prevotella 5.881.821.160.10
19 BacteriaBacteroidetesBacteroidiaBacteroidalesRikenellaceae Alistipes 5.881.828.140.01
20 BacteriaFirmicutesBacilliBacillalesBacillaceae 1 Bacillus 1.821.162.49
21 BacteriaFirmicutesBacilliBacillalesBacillaceae 1 Bacillus 1.821.162.49
22 BacteriaFirmicutesBacilliLactobacillalesCarnobacteriaceae Granulicatella 1.821.162.49
23 BacteriaFirmicutesBacilliLactobacillalesEnterococcaceae Enterococcus 1.821.162.49
24 BacteriaFirmicutesBacilliLactobacillalesLactobacillaceae Lactobacillus 1.821.162.49
25 BacteriaFirmicutesBacilliLactobacillalesLactobacillaceae Lactobacillus 1.821.162.49
26 BacteriaFirmicutesBacilliLactobacillalesLactobacillaceae Lactobacillus 1.821.162.49
27 BacteriaFirmicutesBacilliLactobacillalesStreptococcaceae Lactococcus 1.821.162.49
28 BacteriaFirmicutesBacilliLactobacillalesStreptococcaceae Streptococcus 5.881.821.162.49
29 BacteriaFirmicutesBacilliLactobacillalesStreptococcaceae Streptococcus 1.821.162.49
30 BacteriaFirmicutesClostridiaClostridialesClostridiaceae 1 Clostridium sensu
stricto
5.881.821.162.49
31 BacteriaFirmicutesClostridiaClostridialesLachnospiraceae Anaerostipes 5.881.821.162.49
32 BacteriaFirmicutesClostridiaClostridialesLachnospiraceae Blautia 5.881.825.810.001
33 BacteriaFirmicutesClostridiaClostridialesLachnospiraceae Dorea 5.881.821.162.49
34 BacteriaFirmicutesClostridiaClostridialesLachnospiraceae Lachnospiracea_
incertae_sedis
5.881.821.162.49
35 BacteriaFirmicutesClostridiaClostridialesLachnospiraceae Lachnospiracea_
incertae_sedis
5.881.821.162.49
36 BacteriaFirmicutesClostridiaClostridialesLachnospiraceae Lachnospiracea_
incertae_sedis
1.821.162.49
37 BacteriaFirmicutesClostridiaClostridialesLachnospiraceae Roseburia 5.881.828.720.01
38 BacteriaFirmicutesClostridiaClostridialesLachnospiraceae Ruminococcus2 1.821.162.49
39 BacteriaFirmicutesClostridiaClostridialesPeptostreptococcaceae Clostridium XI 5.881.821.162.49
40 BacteriaFirmicutesClostridiaClostridialesRuminococcaceae Clostridium IV 1.821.162.49
41 BacteriaFirmicutesClostridiaClostridialesRuminococcaceae Faecalibacterium 5.881.826.980.10
42 BacteriaFirmicutesNegativicutesSelenomonadalesVeillonellaceae Veillonella 1.821.162.49
43 BacteriaFusobacteriaFusobacteriiaFusobacterialesFusobacteriaceae Fusobacterium 1.821.742.49
44 BacteriaLentisphaeraeLentisphaeriaVictivallalesVictivallaceae Victivallis 1.820.582.49
45 BacteriaProteobacteriaGammaproteobacteriaEnterobacterialesEnterobacteriaceae Citrobacter 1.821.162.49
46 BacteriaProteobacteriaGammaproteobacteriaEnterobacterialesEnterobacteriaceae Enterobacter 1.821.162.49
47 BacteriaProteobacteriaGammaproteobacteriaEnterobacterialesEnterobacteriaceae Enterobacter 1.821.162.49
48 BacteriaProteobacteriaGammaproteobacteriaEnterobacterialesEnterobacteriaceae Enterobacter 1.821.162.49
49 BacteriaProteobacteriaGammaproteobacteriaEnterobacterialesEnterobacteriaceae Escherichia/
Shigella
5.881.824.072.49
50 BacteriaProteobacteriaGammaproteobacteriaEnterobacterialesEnterobacteriaceae Klebsiella 1.821.162.49
51 BacteriaProteobacteriaGammaproteobacteriaEnterobacterialesEnterobacteriaceae Salmonella 1.821.162.49
52 BacteriaProteobacteriaGammaproteobacteriaEnterobacterialesEnterobacteriaceae Serratia 1.821.162.49
53 BacteriaProteobacteriaGammaproteobacteriaPseudomonadalesPseudomonadaceae Pseudomonas 1.821.162.49
54 BacteriaProteobacteriaGammaproteobacteriaPseudomonadalesPseudomonadaceae Pseudomonas 1.821.162.49
55 BacteriaVerrucomicrobiaVerrucomicrobiaeVerrucomicrobialesVerrucomicrobiaceae Akkermansia 5.881.822.912.49

Table 2. The 49 sequenced MC samples.

NameMC
type
TemplateRegionRunLibraryExtra
amplification
PCR
cycles
PCR
volume
1 Mc.1.1.10111.1V5-V611Yes253x40µl
2 Mc.1.2.10111.2V5-V611Yes253x40µl
3 Mc.1.3.10111.3V5-V611Yes253x40µl
4 Mc.2.1.10122.1V5-V611Yes253x40µl
5 Mc.2.2.10122.2V5-V611Yes253x40µl
6 Mc.2.3.10122.3V5-V611Yes253x40µl
7 Mc.3.1.10133.1V5-V611Yes253x40µl
8 Mc.3.2.10133.2V5-V611Yes253x40µl
9 Mc.3.3.10133.3V5-V611Yes253x40µl
10 Mc.4.1.10144.1V5-V611Yes253x40µl
11 Mc.4.2.10144.2V5-V611Yes253x40µl
12 Mc.4.3.10144.3V5-V611Yes253x40µl
13 Mc.1.1.10211.1V422No253x40µl
14 Mc.1.2.10211.2V422No253x40µl
15 Mc.1.3.10211.3V422No253x40µl
16 Mc.2.1.10222.1V422No253x40µl
17 Mc.2.2.10222.2V422No253x40µl
18 Mc.2.2.dup.10222.2V422No253x40µl
19 Mc.2.2.30.10222.2V422No303x40µl
20 Mc.2.2.35.10222.2V422No353x40µl
21 Mc.2.2.1x.10222.2V422No251x100µl
22 Mc.2.2.1x.10222.2V422No251x100µl
23 Mc.2.3.10222.3V422No253x40µl
24 Mc.3.1.10233.1V422No253x40µl
25 Mc.3.2.10233.2V422No253x40µl
26 Mc.3.3.10233.3V422No253x40µl
27 Mc.4.1.10244.1V422No253x40µl
28 Mc.4.2.10244.2V422No253x40µl
29 Mc.4.3.10244.3V422No253x40µl
30 Mc.2.1.10322.1V433No253x40µl
31 Mc.2.2.10322.2V433No253x40µl
32 Mc.4.1.10344.1V433No253x40µl
33 Mc.4.2.10344.2V433No253x40µl
34 Mc.2.1.10422.1V434No253x40µl
35 Mc.2.2.10422.2V434No253x40µl
36 Mc.4.1.10444.1V434No253x40µl
37 Mc.4.2.10444.2V434No253x40µl
38 Mc.2.1.10522.1V435No253x40µl
39 Mc.2.2.10522.2V435No253x40µl
40 Mc.4.1.10544.1V435No253x40µl
41 Mc.4.2.10544.2V435No253x40µl
42 Mc.2.1.10622.1V436No253x40µl
43 Mc.2.2.10622.2V436No253x40µl
44 Mc.4.1.10644.1V436No253x40µl
45 Mc.4.2.10644.2V436No253x40µl
46 Mc.2.1.10722.1V437No253x40µl
47 Mc.2.2.10722.2V437No253x40µl
48 Mc.4.1.10744.1V437No253x40µl
49 Mc.4.2.10744.2V437No253x40µl

NG-Tax classification of short reads versus full length classification

To evaluate the accuracy and reproducibility of taxonomic classification using a low information content of ~140 nt compared to a maximal information content of ~1500 nt, we compared the NG-Tax classification of all 55 reference sequences used in this study trimmed to V4 and V5-V6, with a classification of the corresponding full length reference sequences using the RDP classifier (RDPc)24 (Figure 2). At family level, all three classifications (i.e. full length, V4 and V5-V6) were in complete concordance for all phylotypes. Correspondingly, the consistency at genus level was very high. Only a few phylotypes (nine and five for V4 and V5-V6 amplicons, respectively), that belong to poorly classified families such as Peptostreptococcaceae, Ruminococceae and Enterobacteriaceae, attained higher resolution using the full length sequences and RDPc. In turn, for Pseudobutyrivibrio (PT35), a higher resolution was attained with short reads due to the high specificity of the hypervariable regions, which can be overshadowed when using the full length sequence. Lastly, only two (PT51, PT52, V4) and one (PT51, V5-V6) assignments at genus level (both members of the Enterobacteriaceae) were incongruent between classification of the short and full length sequences. Overall, the V5-V6 amplicons outperformed the V4 amplicons because this region allowed for differentiation between members of the Enterobacteriaceae. The average taxonomic specificity (percentage of hits with an identical taxonomic label) for all reference phylotypes was 96% for both regions with an average of 1485 and 635 hits for regions V4 and V5-V6, respectively. The high specificity and high number of hits at very high identity thresholds, combined with the fact that the vast majority of V4 and V5-V6 based assignments matched, testifies for the reliability and quality of the assignments.

fae3b7fa-a20a-431e-bb9b-ee0d422e7349_figure2.gif

Figure 2. NG-Tax Assignment quality of the 55 MC phylotypes.

The x-axis shows taxonomic assignments by three methods: RDP full length/NG-Tax V5-V6 trimmed/NG-Tax V4 trimmed. If all three assignments are in agreement, that classification is shown. The y-axis depicts assignment specificity (the fraction of hits with an identical label) and the total number of hits supporting this taxonomic label.

Observed versus expected microbial profiles

To assess the ability to reproduce the expected composition of the MCs we benchmarked NG-Tax with QIIME, a commonly employed 16S rRNA gene amplicon analysis pipeline. The reproduction of MC compositional profiles generated by amplicon sequencing on Illumina platforms commonly suffers from a high fraction of poorly classified and spurious OTUs9. Using QIIME, up to 20% of OTUs per sample could not be assigned beyond the class level (Figure 3). In contrast, with NG-Tax we observed excellent reproduction of the expected profiles (Figure 4). An average of 92.02% of the reads could be assigned to genus level and 99.94% to at least family level. Spurious genera (Robinsoniella, Subdoligranulum, Cupriavidus, Ralstonia, Kluyvera and Pantoea) represented on average only 0.02% of the reads per sample compared to an average of 23% misclassified reads using QIIME38. One template, PT17 (Parabacteroides), attracted so much sequencing error in the V4 region that it was rendered undetectable although it was amplified by the primers (Supplementary Figure 1). Therefore, to test both pipelines without this sequencing anomaly, it was removed from the analysis.

fae3b7fa-a20a-431e-bb9b-ee0d422e7349_figure3.gif

Figure 3. Observed composition of all MCs compared with the expected ones (EXP) for both regions and each read separately obtained with QIIME.

fae3b7fa-a20a-431e-bb9b-ee0d422e7349_figure4.gif

Figure 4. Observed composition of all MCs compared with the expected ones (EXP) for both regions obtained with NG-tax.

Observed versus expected diversity

Richness and diversity measures are important for understanding community complexity and dynamics. Among these measures, α-diversity is defined as the diversity within a sample, which is often estimated based on the abundance distribution (evenness) and number (richness) of species, whereas β-diversity is defined as the partitioning of diversity among communities. The ability of researchers to quantify richness and diversity hinges on an accurate assessment of the composition of these communities39. For microbial communities, this has been particularly challenging, as none of the existing molecular microbial ecology methods normally capture more than a small proportion of the estimated total richness in most microbial communities40. For deep sequencing based approaches, filtering strategies that remove low-abundance reads make it impossible to apply richness estimation metrics such as the Chao1 index and the ACE coverage estimator, because low-abundance read counts are included in their calculations. Conversely, richness estimates based on unfiltered datasets are unlikely to be accurate, if many of the reads actually represent PCR and/or sequencing errors9. In contrast to purely OTU-based methods, divergence-based methods account for the fact that not all species within a sample are equally related to each other, considering two communities to be similar if they harbour the same phylogenetic lineages, even if the species representing those lineages in each of the communities are different. Consequently, these methods are more powerful than purely OTU-based methods, because similarity in 16S rRNA gene sequence often correlates with phenotypic similarity in key features such as metabolic capabilities. An added benefit is that small errors that are likely due to unfiltered sequencing errors, are punished less severely because OTUs that are only a few nt distant from each other due to error are still closely related using divergence based indices. Therefore, these indices probably provide a better estimate of the true diversity for data generated by high throughput next generation technology sequencers.

Because the focus of NG-Tax is to retain as much biological signal as possible while minimizing the impact of any technical choice, divergence-based α-diversity (Phylogenetic Diversity (PD)41) and β-diversity (Unifrac39) metrics were used to visualize the diversity within and between MCs (Figure 5). The results obtained with QIIME suffered from all of the previously described technological artifacts. The MCs clustered by primer pair instead of MC, and within each cluster the structure, i.e. the position of MCs relative to each other, was different. More importantly, the true biological variation depicted by the expected composition was reproduced by neither primer pair (Figure 5C). Based on these results not only the Principle Coordinates Analysis (PCoA) based conclusions would have been different for both primer pairs, but also the differences in taxonomic classification could lead to significant changes in identified biomarkers, in line with what has previously been observed by He and co-workers28. Here we show that replicability within a variable region was attained. The more important reproducibility, however, i.e. the corroboration of findings by reproduction in different independent setups that use e.g. different primers, was not. This is an important observation because biological findings should be insensitive to independent methodologies5. In line with the above, also the observed α-diversity (PD) was found highly inflated and the biological order was not reproduced (Figure 5D). In contrast, NG-Tax provided a clear separation of samples by MC type and their representative expected samples regardless of variable region, PCR protocol, sequencing run, library and sequencing depth. These results are remarkable, given the biases associated with each of these categories and the difference in resolution between the two regions (Figure 5A). Moreover, MC2, MC3 and MC4 were very similar, sharing the same genera and largely the same phylotypes, only differing in relative distribution (Table 1). Correspondingly, rarefaction curves for α-diversity (Figure 5B) showed excellent reproduction of the true diversity. A perfect overlap cannot be achieved since the expected MCs do not account for sequencing or PCR errors, and these factors cannot be completely removed from real sequencing data.

fae3b7fa-a20a-431e-bb9b-ee0d422e7349_figure5.gif

Figure 5.

A/C. PCoA using Weighted Unifrac of all sequenced and expected MCs as obtained after processing of data using NG-Tax (A) and QIIME (C). Darker colored triangles represent the expected composition while lighter colored circles represent sequenced samples. B/D. Rarefaction curves of PD for all MCs and their expected counterparts for NG-Tax (B) and QIIME (D). Dashed lines represent the expected composition while solid lines represent sequenced samples.

Dataset 1.Raw data of NG-Tax pipeline for analysis of 16S rRNA amplicons from complex biome.

Conclusions

An increasing number of studies have shown that the chosen methodology rather than the natural variance is responsible for the greatest variance in microbiome studies5,18,23,27,3234. Some authors raised their concern with comparing data generated using different strategies29, which basically suggests that true reproducibility (i.e. using different approaches and drawing the same biological conclusions) cannot be attained. This is an alarming observation since studies are often used to identity biomarker organisms, associated with certain host phenotypes (often comparing a diseased state to a healthy state), yet the use of different primers might show different biomarkers8,22,23,27,28,30. So far, neither currently available pipelines nor taxonomic classifiers have been able to efficiently reduce the noise in this type of data. Nevertheless, in properly de-noised datasets, taxonomical profiles, richness and diversity should be close to the expected values and the abundance of unassigned and poorly assigned reads should be low except when dealing with largely unexplored environments that are not sufficiently covered yet by the reference databases. At lower noise levels different variable regions should yield similar conclusions with small variations due to region specific resolution, and minor changes in the experiment should still deliver the same biological conclusions. Here we presented NG-Tax, an improved pipeline for 16S rRNA gene amplicon sequencing data, which continues to be a backbone in the analysis of microbial ecosystems. Several novel steps ensure much needed improved robustness against errors associated with technical aspects of these studies, such as PCR protocols, choice of 16S rRNA gene variable region and variable rates of sequencing error27,29,32. The commonly reported problems such as many un- or poorly classified OTUs, inflated richness and diversity, taxonomic profiles that do not match the expected ones, region dependent taxonomic classification and results being highly dependent on minor changes in the experimental setup have been tackled with NG-Tax. Despite the short read length (~140 nt) and all technical biases, the average taxonomic assignment specificity for the phylotypes included in the MCs was 96%. In addition, 92.02% of the reads could be assigned to at genus level and 99.94% to at least family. Spurious genera represented only 0.02% of the reads per sample. More importantly, rarefaction curves and PCoA plots confirmed improved performance of NG-Tax with respect to clustering based on biology rather than technical aspects, such as sequencing run, library or choice of 16S rRNA gene region. Therefore NG-Tax represents a method for 16S rRNA gene amplicon analysis with improved qualitative and quantitative representation of the true sample composition. Additionally, the high robustness against technical bias associated with 16S rRNA gene amplicon studies will improve comparability between studies and facilitate efforts towards standardization.

Methods

Primers

Primer pairs 515F (5’-GTGCCAGCMGCCGCGGTAA) - 806R (5’-GGACTACHVGGGTWTCTAAT) and BSF784F (5’-RGGATTAGATACCC) - 1064R (5’-CGACRRCCATGCANCACCT) have been previously reported for amplification of the V48 and V5- V627 regions of the bacterial 16S rRNA gene, respectively. They were selected based on 1) experimental validation, 2) taxonomic coverage of the relevant ecosystem (Supplementary Figure 2) and 4) adherence to specific rules associated with the sequencing platform, such as a maximum amplicon size of <500 nt. Unless noted otherwise all primers were ordered at Biolegio (Nijmegen, Netherlands).

Barcoding strategy

At the time of sequencing Illumina’s Hiseq2000 allowed for multiplexing of up to 12 samples per lane using an index or barcode read provided by Illumina. To achieve optimal sample throughput and phylogenetic depth, 70 primers containing a custom designed 8nt barcode were developed to combine with the Illumina barcodes to reach a maximum throughput of 12×70 samples per lane. Each set of 70 barcoded samples are referred to as “library”. Low diversity samples, such as 16S rRNA gene amplicons, can lead to problems with base calling due to overexposure of fluorescent labels. Therefore, the set of 70 barcodes was specifically designed to possess an equal base distribution over their complete length. Additionally, to avoid differential amplification, a two-base “linker” sequence that is not complementary to any 16S rRNA sequence at the corresponding position, from a database that contains 1132 phylotypes associated with the Human GI tract42, was inserted between the primer and barcode. The resulting set of 70 barcoded primers was checked for avoidance of secondary structure formation within or between primers (i.e., primer-dimers) or between barcodes and primers, using PrimerProspector43.

Mock communities

All MCs were mixed in triplicate to account for pipetting error. These MCs ranged from 17–55 species in both equimolar and staggered compositions. One MC contained members at very low abundances of 0.1, 0.01 and 0.001% (Table 2). Amplicons were generated either from cloned 16S rRNA gene amplicons, isolates available in the local culture collection of the Laboratory of Microbiology, Wageningen University, or strains ordered from DSMZ and cultured according to DSMZ recommendations, after which genomic DNA was isolated using the Genejet genomic DNA isolation kit (Thermo fisher scientific AG, Reinach, Zwitserland). A 16S rRNA gene specific PCR was performed using the universal primers 27F (5’-GTTTGATCCTGGCTCAG) - 1492R (5’-GGTTACCTTGTTACGACTT) to obtain full length amplicons of which size and concentration were checked on a 1% agarose gel and which were column purified and quantified with the Qubit 2.0 fluorometer, and dsDNA BR assay kit (Invitrogen, Eugene, USA). Amplicons were mixed in the MCs to obtain the specified relative abundances. High quality full length reference sequences of all MC members were obtained by Sanger sequencing at GATC Biotech AG (Constance, Germany) with sequencing primers 27F - 1492R in order to confirm their identity. The MCs were diluted 103-fold and subsequently used as templates in PCRs for the generation of barcoded PCR products.

Barcoded PCR

Unless noted otherwise, each sample was amplified in triplicate using Phusion hot start II high fidelity polymerase (Thermo fisher scientific AG), checked for correct size and concentration on a 1% agarose gel and subsequently combined and column-purified with the High pure PCR cleanup micro kit (Roche diagnostics, Mannheim, Germany). Forty μl PCR reactions contained 28.4 μL nucleotide free water (Promega, Madison, USA), 0.4 μL of 2 U/μl polymerase, 8 μL of 5× HF buffer, 0.8 μl of 10 μM stock solutions of each of the forward (515F) and reverse (806R) primers, 0.8 μL 10mM dNTPs (Promega) and 0.8 μL template DNA (103 × diluted 200 ng/μl stock). Reactions were held at 98°C for 30 s and amplification proceeding for 25 cycles at 98°C for 10 s, 50°C for 10 s, 72°C for 10 s and a final extension of 7 min at 72°C. Purified amplicons were quantified using Qubit. For primer pair BSF784F-1064R the thermal cycling conditions were identical to those detailed above except that the annealing temperature was 42°C. To quantify noise generated by the PCR protocol, several reactions were performed with 30 or 35 cycles and 1× 100μl reaction instead of pooling 40μl in triplicate (Table 2).

A composite sample for sequencing was created by combining equimolar amounts of amplicons from the individual samples, followed by gel purification and ethanol precipitation to remove any remaining contaminants. The resulting libraries were sent to GATC Biotech AG for sequencing on an Illumina Hiseq2000 instrument.

Sequence analysis with QIIME

We have used QIIME to benchmark NG-Tax. Illumina fastq files were de-multiplexed, quality filtered and analyzed using QIIME (v. 1.9)35 with closed reference OTU picking, using default settings and quality parameters as previously reported9.

NG-tax pipeline and user manual

The NG-tax pipeline, user manual and files and code to reproduce the presented results, are available for download at the ftp server http://systemsbiology.nl/NG-Tax/.

Abbreviations

rRNA: ribosomal RNA; MC: Mock Community; OTU: Operational Taxonomic Unit; PT: Phylotype; RDP: Ribosomal Database Project; RDPc: RDP classifier; PD: Phylogenetic Diversity; PCoA: Principle Coordinates Analysis

Data availability

F1000Research: Dataset 1. Raw data of NG-Tax pipeline for analysis of 16S rRNA amplicons from complex biome, 10.5256/f1000research.9227.d13012044

Sequence data have been deposited in the European Nucleotide Archive45, accession number [ENA:PRJEB11702]) http://www.ebi.ac.uk/ena/data/view/PRJEB11702 (amplicon sequencing data for all 49 samples) and [ENA:LN907729-LN907783]) (full length 16S rRNA gene sequences for all 55 PTs).

Comments on this article Comments (2)

Version 2
VERSION 2 PUBLISHED 23 Nov 2018
Revised
Version 1
VERSION 1 PUBLISHED 22 Jul 2016
Discussion is closed on this version, please comment on the latest version above.
  • Author Response 09 Feb 2018
    Javier Ramiro-Garcia, Laboratory of Systems and Synthetic Biology, Wageningen University, Wageningen, 6708 WE, The Netherlands
    09 Feb 2018
    Author Response
    A new version of NG-Tax can be downloaded from https://github.com/JavierRamiroGarcia/NG-Tax
    Competing Interests: No competing interests were disclosed.
  • Author Response 25 Jul 2016
    Javier Ramiro-Garcia, Wageningen University and Research Centre, The Netherlands
    25 Jul 2016
    Author Response
    The proper link to download the pipeline is:

    http://www.systemsbiology.nl/NG-Tax/

    We will correct the link in version 2 of the paper.
    Sorry for the inconveniences.

    Javier Ramiro-Garcia
    Competing Interests: No competing interests were disclosed.
  • Discussion is closed on this version, please comment on the latest version above.
Author details Author details
Competing interests
Grant information
Copyright
Download
 
Export To
metrics
Views Downloads
F1000Research - -
PubMed Central
Data from PMC are received and updated monthly.
- -
Citations
CITE
how to cite this article
Ramiro-Garcia J, Hermes GDA, Giatsis C et al. NG-Tax, a highly accurate and validated pipeline for analysis of 16S rRNA amplicons from complex biomes [version 1; peer review: 2 approved with reservations, 1 not approved] F1000Research 2016, 5:1791 (https://doi.org/10.12688/f1000research.9227.1)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
track
receive updates on this article
Track an article to receive email alerts on any updates to this article.

Open Peer Review

Current Reviewer Status: ?
Key to Reviewer Statuses VIEW
ApprovedThe paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approvedFundamental flaws in the paper seriously undermine the findings and conclusions
Version 1
VERSION 1
PUBLISHED 22 Jul 2016
Views
128
Cite
Reviewer Report 02 Aug 2016
Fiona Fouhy, Teagasc Food Research Centre, Fermoy, Ireland 
Approved with Reservations
VIEWS 128
This is a novel and important piece of research. Extensive research is being conducted using next generation sequencing but researchers are becoming increasingly aware that many factors such as PCR bias, region of the 16S rRNA gene targeted etc. can ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Fouhy F. Reviewer Report For: NG-Tax, a highly accurate and validated pipeline for analysis of 16S rRNA amplicons from complex biomes [version 1; peer review: 2 approved with reservations, 1 not approved]. F1000Research 2016, 5:1791 (https://doi.org/10.5256/f1000research.9931.r15389)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 02 Jan 2019
    Javier Ramiro-Garcia, Laboratory of Systems and Synthetic Biology, Wageningen University, Wageningen, 6708 WE, The Netherlands
    02 Jan 2019
    Author Response
    This is a novel and important piece of research. Extensive research is being conducted using next generation sequencing but researchers are becoming increasingly aware that many factors such as PCR ... Continue reading
COMMENTS ON THIS REPORT
  • Author Response 02 Jan 2019
    Javier Ramiro-Garcia, Laboratory of Systems and Synthetic Biology, Wageningen University, Wageningen, 6708 WE, The Netherlands
    02 Jan 2019
    Author Response
    This is a novel and important piece of research. Extensive research is being conducted using next generation sequencing but researchers are becoming increasingly aware that many factors such as PCR ... Continue reading
Views
260
Cite
Reviewer Report 02 Aug 2016
Julien Tremblay, Biomonitoring, Research Officer at the National Research Council Canada, Montreal, QC, Canada 
Not Approved
VIEWS 260
This paper describes a pipeline for processing 16S rRNA amplicon data. They implemented an experimental design in which they used data coming from three different HiSeq2000 runs using two variable regions (V4 and V5-V6). It is however not clear if ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Tremblay J. Reviewer Report For: NG-Tax, a highly accurate and validated pipeline for analysis of 16S rRNA amplicons from complex biomes [version 1; peer review: 2 approved with reservations, 1 not approved]. F1000Research 2016, 5:1791 (https://doi.org/10.5256/f1000research.9931.r15175)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 02 Jan 2019
    Javier Ramiro-Garcia, Laboratory of Systems and Synthetic Biology, Wageningen University, Wageningen, 6708 WE, The Netherlands
    02 Jan 2019
    Author Response
    This paper describes a pipeline for processing 16S rRNA amplicon data. They implemented an experimental design in which they used data coming from three different HiSeq2000 runs using two variable ... Continue reading
COMMENTS ON THIS REPORT
  • Author Response 02 Jan 2019
    Javier Ramiro-Garcia, Laboratory of Systems and Synthetic Biology, Wageningen University, Wageningen, 6708 WE, The Netherlands
    02 Jan 2019
    Author Response
    This paper describes a pipeline for processing 16S rRNA amplicon data. They implemented an experimental design in which they used data coming from three different HiSeq2000 runs using two variable ... Continue reading
Views
173
Cite
Reviewer Report 02 Aug 2016
Thomas S. B. Schmidt, Institute of Molecular Life Sciences, Institute for Molecular Life Sciences, University of Zurich Microbial Ecology, Computational Biology, Bioinformatics,  Zürich, Switzerland 
Approved with Reservations
VIEWS 173
In their manuscript, the authors introduce NG-Tax, an open-source software for the (meta-)analysis of 16S rRNA-based microbiome datasets. Their tool focuses on an important and so-far arguably understudied aspect of microbial ecology research: the integration of results across studies, in ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Schmidt TSB. Reviewer Report For: NG-Tax, a highly accurate and validated pipeline for analysis of 16S rRNA amplicons from complex biomes [version 1; peer review: 2 approved with reservations, 1 not approved]. F1000Research 2016, 5:1791 (https://doi.org/10.5256/f1000research.9931.r15177)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 02 Jan 2019
    Javier Ramiro-Garcia, Laboratory of Systems and Synthetic Biology, Wageningen University, Wageningen, 6708 WE, The Netherlands
    02 Jan 2019
    Author Response
    In their manuscript, the authors introduce NG-Tax, an open-source software for the (meta-)analysis of 16S rRNA-based microbiome datasets. Their tool focuses on an important and so-far arguably understudied aspect of ... Continue reading
COMMENTS ON THIS REPORT
  • Author Response 02 Jan 2019
    Javier Ramiro-Garcia, Laboratory of Systems and Synthetic Biology, Wageningen University, Wageningen, 6708 WE, The Netherlands
    02 Jan 2019
    Author Response
    In their manuscript, the authors introduce NG-Tax, an open-source software for the (meta-)analysis of 16S rRNA-based microbiome datasets. Their tool focuses on an important and so-far arguably understudied aspect of ... Continue reading

Comments on this article Comments (2)

Version 2
VERSION 2 PUBLISHED 23 Nov 2018
Revised
Version 1
VERSION 1 PUBLISHED 22 Jul 2016
Discussion is closed on this version, please comment on the latest version above.
  • Author Response 09 Feb 2018
    Javier Ramiro-Garcia, Laboratory of Systems and Synthetic Biology, Wageningen University, Wageningen, 6708 WE, The Netherlands
    09 Feb 2018
    Author Response
    A new version of NG-Tax can be downloaded from https://github.com/JavierRamiroGarcia/NG-Tax
    Competing Interests: No competing interests were disclosed.
  • Author Response 25 Jul 2016
    Javier Ramiro-Garcia, Wageningen University and Research Centre, The Netherlands
    25 Jul 2016
    Author Response
    The proper link to download the pipeline is:

    http://www.systemsbiology.nl/NG-Tax/

    We will correct the link in version 2 of the paper.
    Sorry for the inconveniences.

    Javier Ramiro-Garcia
    Competing Interests: No competing interests were disclosed.
  • Discussion is closed on this version, please comment on the latest version above.
Alongside their report, reviewers assign a status to the article:
Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations - A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions
Sign In
If you've forgotten your password, please enter your email address below and we'll send you instructions on how to reset your password.

The email address should be the one you originally registered with F1000.

Email address not valid, please try again

You registered with F1000 via Google, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Google account password, please click here.

You registered with F1000 via Facebook, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Facebook account password, please click here.

Code not correct, please try again
Email us for further assistance.
Server error, please try again.