Unraveling bacterial fingerprints of city subways from microbiome 16S gene profiles

Walker, Alejandro R.; Grimes, Tyler L.; Datta, Somnath; Datta, Susmita

doi:10.1186/s13062-018-0215-8

Research
Open access
Published: 22 May 2018

Biology Direct volume 13, Article number: 10 (2018)

Abstract

Background

Microbial communities can be location specific, and the abundance of species within locations can influence our ability to determine whether a sample belongs to one city or another. As part of the 2017 CAMDA MetaSUB Inter-City Challenge, next generation sequencing (NGS) data was generated from swab samples collected from subway stations in Boston, New York City (hereafter New York), and Sacramento. DNA was extracted and Illumina sequenced. Sequencing data was provided for all cities as part of the 2017 CAMDA contest challenge dataset.

Results

Principal component analysis (PCA) showed clear clustering of the samples for the three cities, with a substantial proportion of the variance explained by the first three components. We ran two different classifiers, and results were robust for error rate (< 6%) and accuracy (> 95%). The analysis of variance (ANOVA) demonstrated that, overall, bacterial composition across the three cities is significantly different. A similar conclusion was reached using a novel bootstrap-based test using diversity indices. Finally, a co-abundance association network analysis for the taxonomic ranks “order”, “family”, and “genus” found different patterns of bacterial networks for the three cities.

Conclusions

Bacterial fingerprints can be useful to predict sample provenance. In this work, provenance was predicted with over 95% accuracy. Association-based network analysis emphasized similarities between the closest cities sharing common bacterial composition. ANOVA showed different patterns of bacterial abundance among cities, and these findings strongly suggest that bacterial signatures across multiple cities are different. This work advocates a data analysis pipeline which could be followed in order to get biological insight from this data. However, the biological conclusions from this analysis are just an early indication from pilot microbiome data provided to us through the CAMDA 2017 challenge and will be subject to change as we get more complete data sets in the near future. This microbiome data can have potential applications in forensics, ecology, and other sciences.

Reviewers

This article was reviewed by Klas Udekwu, Alexandra Graf, and Rafal Mostowy.

Background

The advent of NGS technologies has had a tremendous effect on -omics applications. The reduction in costs since their introduction [1] has accelerated the use of this technology in metagenomics experiments [2, 3]. Phylogenetic survey analyses based on 16S gene diversity have been fundamental to the identification of bacterial varieties [4, 5, 6]. This sequencing revolution, in conjunction with high performance computing and recently developed computing tools, has had a vast impact on new 16S gene studies [5, 7]. The use of WGS data in microbiome experiments has been widely reported and has multiple advantages when compared with 16S amplicon data [8].

In this work, we focus on the MetaSUB Challenge dataset as part of the 2017 CAMDA competition. The MetaSUB International Consortium aims to create a longitudinal metagenomic map of mass-transit systems and other public spaces around the world. They partnered with CAMDA for an early release of microbiome data from Boston, New York, and Sacramento for the massive data analysis challenge. Swab samples collected from subway stations in these three cities were Illumina-sequenced at variable depths and provided for further analyses in compressed FASTQ format. The data set consisted of 141, 1572, and 18 samples from Boston, New York, and Sacramento, respectively (Table 1). Subsequent bioinformatics processing was conducted on the “HiPerGator” high performance cluster at the University of Florida. Sequence data files were uncompressed and quality filtered, and open-reference operational taxonomic units (OTUs) were picked using the QIIME pipeline [9]. After quality control, the effective number of samples included in this work was 134 in Boston, 777 in New York, and 18 in Sacramento (Table 1). OTUs were aggregated as counts and normalized for three taxonomic ranks. The selected ranks were “order”, “family”, and “genus”, based on the number of common levels across all three cities (see Fig. 1). A summary of the common levels for each taxonomic rank is also presented in Table 1.

Table 1 Sample counts per city, effective number of samples analyzed, and resulting number of common entries for each of the selected taxonomic ranks included in this work

Our motivation is to unravel the bacterial fingerprints of these three cities (their similarities and differences) using only common bacterial signatures within three taxonomic ranks. In particular, we consider four different statistical analyses; each is conducted across cities using a common taxonomic rank, and the analysis is repeated for each rank. The analyses include PCA, sample provenance prediction using classification techniques, differential abundance of bacteria across cities using ANOVA, and network analysis based on statistical association of bacterial signatures.

Results

Principal component analysis

First we describe the results of our PCA conducted on these samples. Table 2 presents a summary of the variability explained by the first three components. As seen in this summary, the total amount of variance explained by the first three principal components was consistently greater than 80% for all taxonomic ranks. Plots of principal components are presented in Fig. 2, sorted by taxonomic rank with “order” on the left and “genus” on the right. The top row illustrates bi-plots of components 1 and 2 with a remarkable clustering of the samples from the three cities. As seen in all three plots (A1, B1, and C1), the majority of variables within each taxonomic rank were highly correlated with the first principal component (being nearly parallel to the corresponding axis). On the other hand, as seen in plot A1, the “order” enterobacteriales showed a higher correlation with the second principal component. This might highlight a low importance of this “order” for Boston and New York. This was also concordant in plots B1 and C1 for “family” enterobacteriaceae and “genus” enterobacter, respectively. The second row in Fig. 2 presents three-dimensional (3D) plots of the first three components (A2, B2, and C2). The clustering of the cities is even more clear-cut in these 3D plots. These plots, along with the bi-plots, also support the premise that Boston and New York have similar bacterial patterns compared with Sacramento.

Table 2 Total amount of variance explained by principal components 1-3 for all three taxonomic ranks (“order”, “family”, and “genus”)

Classification analysis

Class prediction of city of origin was conducted using two different approaches. First, prediction of sample provenance was carried out using the Random Forest [10] classifier (RF). This classifier is well regarded for its superior theoretical and practical performance, and is robust to overfitting. The model was fitted for each taxonomic rank. The overall classification error rates were 3.01, 3.12, and 6.77% for “order”, “family”, and “genus”, respectively; note that RF calculates these rates internally by using the out-of-bag error of samples. Results for each city are presented in Table 3. The error rate for “genus” was somewhat elevated compared to the other two, perhaps as a consequence of having fewer features (10) compared to the other two (19 and 23). The classification error for New York samples was particularly low, probably because of the large amount of sequencing data available for this city. Sacramento also showed low classification errors even though the data set had only 18 samples for this city. However, as shown by our PCA, these samples had a distinctive bacterial signature compared to the other two cities, making them easier to identify by a classifier such as RF. Overall, the Boston samples were the hardest to distinguish, possibly due to their similarity with New York samples. Perhaps a larger representative sample from Boston would produce a better classifier.

Table 3 Random forest classification error of city across all taxonomic ranks “order”, “family”, and “genus”

The importance of each predictor can be measured based on the mean decrease in accuracy when the predictor is removed from the model; these results are presented in Fig. 3. In plot A, the top three “orders”, namely clostridiales, rhizobiales, and enterobacteriales are the most effective in predicting a city. Interestingly, in plot B, the top “families” belong to the same top “orders” from plot A. On the other hand, the top “genera” in plot C did not correspond to those in plots A and B.

The second approach we implemented was an Ensemble [11] classifier (EC), which is restricted to binary predictions. Results are presented (see Fig. 4) in terms of classification accuracy, sensitivity, specificity, and area under the curve (AUC). Ensemble results showed that prediction accuracy and sensitivity for the Boston-Sacramento (B-S) and New York-Sacramento (NY-S) pairs were consistently over 98% for all taxonomic ranks. It is interesting to note that the overall accuracy for the three-city classification system, shown in the previous paragraphs for the RF results, was only slightly worse. Accuracy and sensitivity results for the Boston-New York (B-NY) pair were lower: 92% and 60%, respectively, both at taxonomic rank “genus”. Specificity results were the best for B-NY and the worst for B-S for all ranks. AUC was generally greater than 95% across all three ranks, although it appeared to vary widely at taxonomic rank “genus”.
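To make the reported quantities concrete, the following minimal Python sketch computes accuracy, sensitivity, and specificity for one binary city pair from true and predicted labels. The function name, the labels, and the choice of which city counts as the "positive" class are illustrative assumptions, not taken from the paper; the authors' analysis was run in R.

```python
def binary_metrics(y_true, y_pred, positive):
    """Accuracy, sensitivity, and specificity for a two-class problem.

    `positive` names the class treated as positive (an assumption here;
    the paper does not state which city plays that role in each pair).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # fraction of positives recovered
        "specificity": tn / (tn + fp),  # fraction of negatives recovered
    }


# Hypothetical toy labels for a Boston (B) versus New York (NY) pair.
truth = ["B", "B", "B", "NY", "NY"]
predicted = ["B", "NY", "B", "NY", "NY"]
metrics = binary_metrics(truth, predicted, positive="B")
```

With a strongly unbalanced pair, accuracy can stay high while sensitivity for the smaller class drops, which is one way a pattern like the B-NY result (92% accuracy, 60% sensitivity) can arise.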

Differential abundance analysis

Analysis of variance for taxonomic rank “order” revealed that bacterial abundance is highly significantly different for most of the common levels across the three cities. Table 4 shows minimum, average, and maximum p-values, and counts for each “order” across the three cities, reported for the corresponding Tukey group after 5000 replications. It can also be inferred from Table 4 that city means for the first four orders were all significantly different across cities (group a-b-c), with a small percentage of the samples (< 5%) corresponding to Tukey’s group a-a-b. Additionally, the means of the top 11 orders were significantly different in all the replications; in a large number of replications (> 30%) they were counted as a-b-c, and in some others as a-a-b. The analysis also found a few features that were significantly different only in a small number of replications, demonstrating the effectiveness of the balanced ANOVA. These orders were sphingomonadales and rhodospirillales, with 324 and 649 significant cases, respectively.

Table 4 ANOVA results for taxonomic rank “order”. Results of Tukey’s multiple comparison test after 5000 replications: significant p-values (α = 0.01) were averaged and counted for each Tukey group (Boston-New York-Sacramento). In general terms, when comparing two cities, if the letters (‘a’, ‘b’, and ‘c’) are the same, we conclude that the city means are not significantly different; if the letters are different, we conclude that the city means are significantly different in terms of bacterial abundance. For example, for the “order” enterobacteriales the table shows the minimum, average, and maximum p-value out of 5000 replications. In 4967 of the 5000 replications the three city means were found to be significantly different (‘a’-‘b’-‘c’); in 30 replications the Boston and New York mean bacterial abundances were the same but Sacramento was different (‘a’-‘a’-‘b’); and in only 3 replications Boston and Sacramento were the same but New York was different (‘a’-‘b’-‘a’), as deemed by Tukey’s multiple comparison test. Taxonomic rank names (“order”) are presented in the same order for all groups (‘a’-‘b’-‘c’, ‘a’-‘a’-‘b’, ‘a’-‘b’-‘b’, ‘a’-‘b’-‘a’)

The effective number of species (S) found in all cities across the three taxonomic ranks is shown as a proportional-area Venn diagram in Fig. 1. The plot shows greater diversity in Sacramento compared with both Boston and New York for all taxonomic ranks; the diversity also increases as the taxonomic rank moves from “order” to “genus”. Mean species diversity (α_t) [12, 13] was calculated for all taxonomic ranks across cities (see eq. (5)) for two values of the weight modifier “q” (0.5 and 2.0). Results of a bootstrap-based test [14] (see Table 5) showed that mean species diversity (q = 0.5) was significantly different (α = 0.05) for taxonomic ranks “order” and “family”. For “genus”, the test for mean species diversity between the three cities was borderline significant. Results for the second weight modifier (q = 2) showed that the difference in mean species diversity was not significant for any taxonomic rank in our bootstrap analysis. These opposing results for the two values of the weight modifier can be interpreted as an over-inflated weight of low abundance species in the mean species diversity when q = 0.5, hence the number of times the sum of squares deviated from the real value was low. Conversely, when q = 2, high abundance species have a larger effect on the mean species diversity calculations.
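The role of the weight modifier q can be illustrated with the standard single-sample diversity of order q (the Hill number), D_q = (Σ p_i^q)^(1/(1−q)). The Python sketch below is only an illustration of how q shifts weight between rare and abundant taxa; it is not the mean species diversity α_t of eq. (5), which is defined elsewhere in the paper, and the function name and toy counts are assumptions.

```python
import math


def hill_number(counts, q):
    """Effective number of taxa of order q for one sample of counts."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    if abs(q - 1.0) < 1e-12:
        # Limit q -> 1: exponential of the Shannon entropy.
        return math.exp(-sum(p * math.log(p) for p in props))
    return sum(p ** q for p in props) ** (1.0 / (1.0 - q))


# Hypothetical community dominated by one taxon with three rare taxa.
uneven = [97, 1, 1, 1]
d_low_q = hill_number(uneven, 0.5)   # rare taxa weigh relatively more
d_high_q = hill_number(uneven, 2.0)  # dominated by the abundant taxon
```

For a perfectly even community both orders return the number of taxa; for an uneven one the q = 0.5 value is larger than the q = 2 value, because small q gives rare taxa more influence.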

Table 5 Bootstrap results (replications = 2000) for mean species diversity across all taxonomic ranks. Table shows p-values for two values of weight modifier (0.5, and 2)

Network analysis

Networks presented in Fig. 5 are purposely placed geographically, with west on the left and east on the right. The first row depicts the networks for each city for taxonomic rank “order”. Plots in the top row show the “orders” rhodobacteriales and bacteroidales (green) as highly connected nodes for the east-coast cities; these belong to the higher taxonomic rank “class” alphaproteobacteria and bacteroidia, respectively. Nodes in red are those “orders” found across all cities, all belonging to the “classes” alphaproteobacteria and gammaproteobacteria. Networks for taxonomic rank “family”, in the second row, show an interesting change across cities, with central nodes in red that are common between Boston and New York and nodes in green that are common between New York and Sacramento. The last row shows networks for taxonomic rank “genus”. In all cities we can identify a sub-structure with a hub node in green corresponding to the “genus” sphingobacterium. This central node shares four highly connected nodes (in red) in the east-coast cities, but the sub-structure loses complexity in Sacramento, where the number of connections for each node drops considerably compared with the other two cities. In general, we found that Boston and New York have more complex networks for all taxonomic ranks when compared with networks from Sacramento.

Discussion and conclusion

It has been well established that WGS metagenomics can fail to detect rare species, since their DNA is not sequenced with enough depth as a result of its rarity [15, 16]. Nevertheless, this was not an issue for this work, since our main objective was to determine the common bacterial signature of the three cities in the form of normalized counts of taxonomic ranks and to use this data to predict the source of origin of a specific sample. We present a set of tools that complement, rather than compete with, one another in characterizing the differential signatures in terms of common bacteria. Overall, the different analytical components of this work collectively conveyed the following consistent message: the bacterial signatures of common OTUs are city specific in terms of normalized counts for the three taxonomic ranks.

PCA findings showed that a large proportion of the variability (> 80%) is accounted for by the first three principal components for the three taxonomic ranks. Prediction of provenance based on bacterial fingerprints was also highly effective (classification error < 6%, and accuracy > 90%) for all classifiers tested, although the classifiers performed better for ranks “order” and “family” as a result of having more common predictors (19 and 23, respectively). ANOVA showed that the bacterial signature is city specific, with specific patterns of differentiation. While ANOVA showed differential bacterial patterns across cities, the effective number of species showed that Sacramento had the largest number of species. This may be the result of the warmer climate of Sacramento, which promotes bacterial growth and ecological diversity compared with the colder climates of Boston and New York, but we note that the result may be biased by uneven “wet lab” protocols for DNA extraction and sequencing and by very unequal city sample sizes, although we tried to deal with the latter issue by subsampling. Finally, network analysis showed that each city has a different overall bacterial network structure. A careful review of nodes from Boston and New York revealed common subnetwork structures sharing similar bacterial patterns, which is believed to be the result of geographic proximity and a common ecological niche for northeast coastal cities, contrasting with a southwestern city in California. Network analyses for future datasets with a more balanced design, and more standardized DNA extraction and sequencing protocols, might lead to interesting ecological perspectives regarding species that live in mutualism or symbiosis, and others that show patterns of competition.

The results presented in this work all support the conclusion that it is possible to capture the bacterial signal from samples collected in three cities using OTU counts from common bacteria. Nevertheless, the quality of the results and conclusions could be greatly improved if a review of the experimental design led to a more balanced number of samples for each city, combined with objective-specific protocols for DNA extraction and sequencing of the samples, which should ensure a more uniform sequencing depth and quality, especially across cities. As a closing remark, the authors emphasize that these analyses were conducted on preliminary data and that the results are a valuable source for planning future experiments and analyses.

Methods

For the 2017 meeting, CAMDA has partnered with the MetaSUB (Metagenomics & Metadesign of Subways & Urban Biomes) International Consortium (http://metasub.org/), which has provided microbiome data from three cities across the United States as part of the MetaSUB Inter-City Challenge.

Illumina next generation sequencing data was generated from swab DNA samples taken at subway stations in Boston, New York, and Sacramento. Data was provided in the form of FASTQ files for each sample, plus a supplementary dataset with information regarding swab places, sequencing technology, DNA extraction and amplification, sample names, etc. Quality control of the reads was conducted to improve taxonomic classification with QIIME. The raw OTUs generated with QIIME were aggregated for each sample to generate a matrix of OTU counts for the three cities. The subsequent statistical analyses were conducted on the basis of common OTUs, finding additional patterns in the relative abundance that were not as obvious as the presence of city-specific OTUs. Other aspects of biodiversity beyond what is apparent from Fig. 1 (such as the fact that Sacramento samples exhibited the most biodiversity) were not investigated further.

Sequencing data description

Boston sequencing data consisted of a total of 141 samples ranging from 1 Mbp to 11 Gbp single read Illumina data. The majority of the samples (117 Amplicon samples) were target sequenced after PCR amplification. Additionally, the rest of the samples (34) were whole genome shotgun (WGS) sequenced. Moreover, a small fraction of the amplicon samples did not effectively contribute to OTU counts, and hence they were removed from the analyses. Ultimately a total of 134 samples were included in further downstream analyses.

All 1572 New York samples were WGS, ranging from 0 Mbp to 19 Gbp of Illumina-sequence data. After quality control a subset of 777 samples effectively yielded OTU counts and were included in all subsequent analyses.

In the city of Sacramento, six locations were sampled three times each on different surfaces for a total of 18 WGS sequenced samples ranging from 2.8 to 3.4 Gbp. All the samples contained enough sequencing data after quality control to positively contribute to OTU counts, therefore all 18 samples were included in all the analyses.

Bioinformatics and data processing

Sequencing data from each city was uncompressed and quality filtered to ensure improved OTU picking. FASTQ files were filtered with FASTX-Toolkit [17] at variable Phred quality scores ranging from 35 to 39, with a variable minimum percentage of bases that must satisfy the chosen quality score ranging from 40 to 80. This filtering scheme was designed to reduce the size of the large FASTQ files without compromising the open-reference OTU picking, and to keep the computational burden in check. This strategy not only accomplished the latter goal but also removed the low quality FASTQ files, which were unusable for detecting any 16S gene signal. The reduced sample sizes and their distributions according to the taxonomic ranks are provided in Table 1. This quality control yielded sequencing data ranging from a few Mbp up to a maximum of 5 Gbp. It is noteworthy that we processed amplicon FASTQ files with the same approach. In the study we merged WGS (only the 16S region) and amplicon data in order to have a sufficient sample size. However, in order to establish the similarity of the data distributions for the two platforms, we implemented a Kolmogorov-Smirnov test of the equality of the distributions, comparing the data from both platforms for each one of the features or levels found for the three taxonomic ranks. The null hypothesis states that the empirical distribution of the normalized counts from the WGS data is not significantly different from the empirical distribution of the normalized counts for the amplicon data. No significant p-values were found in the Kolmogorov-Smirnov test (p-value_min = 0.2387 and p-value_max = 0.9945). These results indicate that the data from both platforms are similar enough to be used together for further downstream analyses.
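As an illustration of the platform comparison described above, the following Python sketch computes the two-sample Kolmogorov-Smirnov statistic for one feature, together with an approximate p-value. The p-value uses the large-sample Kolmogorov series, which is an assumption made for this sketch; the paper does not state which implementation or p-value method was used, and the function names and toy values are hypothetical.

```python
import math


def ks_statistic(x, y):
    """Largest gap between the two empirical distribution functions."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        # Step both samples past the current value so ties are handled.
        while i < n and xs[i] <= v:
            i += 1
        while j < m and ys[j] <= v:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_pvalue_asymptotic(d, n, m, terms=100):
    """Large-sample p-value approximation (an assumption of this sketch)."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        return 1.0  # the series is numerically 1 in this range
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


# Hypothetical normalized counts of one taxon on the two platforms.
wgs = [9.1, 9.4, 8.8, 9.9, 9.3, 9.0]
amplicon = [9.2, 9.5, 8.7, 9.6, 9.1, 9.8]
d = ks_statistic(wgs, amplicon)
p = ks_pvalue_asymptotic(d, len(wgs), len(amplicon))
```

In the setting of the paper this comparison would be repeated for every feature of each taxonomic rank, and a large p-value means the null hypothesis of equal distributions is not rejected.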

Filtered FASTQ files were converted to FASTA files with a “bash” script in order to standardize the description line for each sequence, making it acceptable for the QIIME pipeline. This step was required since we faced some incompatibility with FASTA files automatically generated by open-source converters. OTU picking was conducted with QIIME in open-reference mode. This strategy was preferred since our purpose is to effectively detect the 16S gene region from as many bacterial species as possible. The QIIME pipeline was run in three steps.

pick_open_reference_otus.py -o ./otus -i ./sample.fa -p ../parameters.txt -f -a -O 12

(1)

biom convert -i ./otus/otu_table.biom -o ./otus/from_biom.txt --to-tsv

(2)

assign_taxonomy.py -i ./pynast_aligned_seqs/aligned.fasta -m rdp

(3)

The first step was the open-reference OTU picking (1). The second was to convert the binary biom table into a text format output (2). The final step corresponds to assigning taxonomy values to all OTUs within the output table (3). OTU output counts were later aggregated at three taxonomic ranks as input data for further statistical analyses. In other words, OTUs that differ by mapping score but correspond to the same taxon at a given rank are added together and labeled with the taxon to which they belong.
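The aggregation step can be sketched in a few lines of Python. The sketch assumes Greengenes-style lineage strings with rank prefixes such as `o__`, `f__`, and `g__`, which is an assumption about the taxonomy output format rather than something stated in the paper; the function name and the toy OTU table are hypothetical.

```python
from collections import defaultdict

# Assumed rank prefixes (Greengenes-style lineage strings).
RANK_PREFIX = {"order": "o__", "family": "f__", "genus": "g__"}


def aggregate_by_rank(otu_counts, otu_taxonomy, rank):
    """Sum the counts of all OTUs that share the same taxon at `rank`."""
    prefix = RANK_PREFIX[rank]
    totals = defaultdict(int)
    for otu_id, count in otu_counts.items():
        lineage = otu_taxonomy.get(otu_id, "")
        for field in lineage.split(";"):
            field = field.strip()
            # Skip empty assignments such as a bare "g__".
            if field.startswith(prefix) and len(field) > len(prefix):
                totals[field[len(prefix):]] += count
                break
    return dict(totals)


# Hypothetical counts and lineages for one sample.
counts = {"otu1": 5, "otu2": 7, "otu3": 2}
taxonomy = {
    "otu1": "k__Bacteria; o__Enterobacteriales; f__Enterobacteriaceae; g__Enterobacter",
    "otu2": "k__Bacteria; o__Enterobacteriales; f__Enterobacteriaceae; g__",
    "otu3": "k__Bacteria; o__Clostridiales; f__Clostridiaceae; g__Clostridium",
}
by_order = aggregate_by_rank(counts, taxonomy, "order")
```

OTUs without an assignment at the requested rank are simply left out of the aggregated table in this sketch; how the original pipeline treated such OTUs is not described in this section.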

The chosen taxonomic ranks were “order”, “family”, and “genus”. Figure 1 presents a summary of aggregated OTUs for all the ranks. The selection of ranks was determined by the count of common levels within each rank. The raw data for each taxonomic rank was then normalized to log counts per million for each city before combining the cities into a single dataset. The normalization was based on the work of Law et al. [18] and is given in Formula (4). The transformed OTU proportions were calculated for each sample by

$$ y_{gi}=\log_2\left(\frac{r_{gi}+0.5}{N R_i+1}\times 10^{6}\right), $$

(4)

where $r_{gi}$ is the count of the $g$th OTU for sample $i$, $N$ is the number of OTU categories, and $R_i=\frac{1}{N}\sum_{g=1}^{N} r_{gi}$ is the mean number of mapped reads for the $i$th sample, so that $N R_i$ is the library size of sample $i$. This normalization scheme guarantees that the counts are bounded away from zero by 0.5, which makes the logarithm meaningful and reduces the variability of log-cpm for lowly expressed OTUs. Additionally, the library size was offset by 1. Together these guarantee that the ratio is strictly less than 1 and greater than zero.
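Formula (4) can be written as a short Python function. This is a minimal sketch of the transformation for a single sample, with a hypothetical function name and toy counts; the authors' own normalization was carried out in R.

```python
import math


def log_cpm(counts):
    """Apply Formula (4) to the OTU counts of one sample."""
    # N * R_i is the total number of mapped reads (library size).
    library_size = sum(counts)
    return [math.log2((r + 0.5) / (library_size + 1.0) * 1e6) for r in counts]


# Hypothetical sample with three OTU categories, one of them unobserved.
sample = [0, 10, 90]
values = log_cpm(sample)
```

Because of the 0.5 and 1 offsets, a zero count still yields a finite value, and the ordering of the counts within a sample is preserved after the transformation.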

Statistical analysis

The ensuing statistical analysis was conducted in multiple stages in R [19]. The first was a PCA, which showed that the normalized counts for all taxonomic ranks carry strong enough signals to group the samples by city of origin. The second was to build a statistical classifier, which can produce a well-defined rule (e.g., a machine) to predict the city of origin from the rank profiles of a sample. To this end, we used two well-regarded classifiers, both within the R environment, and compared the findings. In a third stage we conducted a differential abundance analysis using ANOVA and a novel bootstrap-based test using the alpha diversity indices. The final stage was a visual inspection of the co-abundance networks in order to assess how the bacterial abundances vary jointly across the cities.

Principal components analysis (PCA)

Unsupervised learning of normalized count data through principal component analysis was conducted on a taxonomic rank basis for “order”, “family”, and “genus”. The analysis was entirely conducted in R based on the correlation structure. Eigenvalues were extracted to calculate the variability in the dataset accounted for by each component. Two-dimensional PCA bi-plots and three-dimensional plots of the first three components were generated for each taxonomic rank and color-coded by city to better visualize patterns among samples from each location (Fig. 2).
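The link between eigenvalues and explained variability can be sketched as follows. The Python code below builds the correlation matrix of the features and extracts its eigenvalues with a plain cyclic Jacobi iteration; each eigenvalue divided by the sum of all eigenvalues is the proportion of variance explained by that component. The Jacobi routine, the function names, and the toy matrix are assumptions chosen to keep the sketch free of dependencies; the authors performed the analysis in R.

```python
import math


def correlation_matrix(rows):
    """Pearson correlation matrix of the columns (features) of `rows`."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(p)]
    # Constant columns (sd == 0) must be dropped before calling this.
    z = [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]
    return [[sum(zr[a] * zr[b] for zr in z) / (n - 1) for b in range(p)]
            for a in range(p)]


def jacobi_eigenvalues(sym, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix, largest first."""
    a = [row[:] for row in sym]
    p = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(p) for j in range(p) if i != j)
        if off < tol:
            break
        for i in range(p - 1):
            for j in range(i + 1, p):
                if abs(a[i][j]) < 1e-15:
                    continue
                # Rotation angle that zeroes the (i, j) entry.
                theta = 0.5 * math.atan2(2.0 * a[i][j], a[i][i] - a[j][j])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(p):  # rotate columns i and j
                    aki, akj = a[k][i], a[k][j]
                    a[k][i] = c * aki + s * akj
                    a[k][j] = -s * aki + c * akj
                for k in range(p):  # rotate rows i and j
                    aik, ajk = a[i][k], a[j][k]
                    a[i][k] = c * aik + s * ajk
                    a[j][k] = -s * aik + c * ajk
    return sorted((a[i][i] for i in range(p)), reverse=True)


def variance_explained(rows):
    """Proportion of variance per principal component (correlation PCA)."""
    eigenvalues = jacobi_eigenvalues(correlation_matrix(rows))
    total = sum(eigenvalues)  # equals the number of features
    return [e / total for e in eigenvalues]


# Hypothetical normalized counts: 4 samples (rows) by 3 taxa (columns).
toy = [[1.0, 2.0, 0.5], [2.0, 4.1, 0.1], [3.0, 5.9, 0.9], [4.0, 8.2, 0.3]]
explained = variance_explained(toy)
first_two = sum(explained[:2])
```

Summing the leading entries of `explained` gives cumulative figures of the kind reported in Table 2 for the first three components.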