SBDI Sativa curated 16S GTDB database

Version 6 2024-05-22, 10:50

Version 5 2022-10-14, 15:21

Version 4 2022-08-31, 15:22

Version 3 2021-11-10, 09:09

Version 2 2021-10-25, 13:56

Version 1 2021-08-05, 05:17

dataset

posted on 2021-11-10, 09:09 authored by Daniel Lundin, Anders Andersson

The data in this repository is the result of vetting 16S sequences from the GTDB database release R06-RS202 (https://gtdb.ecogenomic.org/; Parks et al. 2018) with the Sativa program (Kozlov et al. 2016).

Files for the DADA2 (Callahan et al. 2016) methods `assignTaxonomy` and `addSpecies` are available: gtdb-sbdi-sativa.r06rs202.assignTaxonomy.fna.gz and gtdb-sbdi-sativa.r06rs202.addSpecies.fna.gz.
There is also a FASTA file with the original GTDB sequence names: gtdb-sbdi-sativa.r06rs202.fna.gz.
All three files are gzipped FASTA files with 16S sequences. In the assignTaxonomy file, sequences are associated with taxonomy hierarchies from domain to genus, whereas the addSpecies file has sequence identities and species names.
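As an illustration of the two header styles, here is a minimal Python sketch for reading such files. It assumes the usual DADA2 conventions (a semicolon-separated rank list for `assignTaxonomy`, and an identifier followed by genus and species for `addSpecies`); the function names and the example records are hypothetical and not taken from the actual files.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def parse_assign_taxonomy_header(header):
    """Split 'Domain;Phylum;...;Genus;' into a list of ranks (assumed layout)."""
    return [rank for rank in header.strip().split(";") if rank]

def parse_add_species_header(header):
    """Split 'ID Genus species' into (id, species name) (assumed layout)."""
    seq_id, _, species = header.strip().partition(" ")
    return seq_id, species

# Made-up record for illustration only; real sequences are far longer.
example = io.StringIO(
    ">Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;"
    "Enterobacteriaceae;Escherichia;\n"
    "ACGTACGT\n"
)
for header, seq in read_fasta(example):
    print(parse_assign_taxonomy_header(header), len(seq))
```

For the real files, the same reader can be given a handle from `gzip.open(path, "rt")` instead of the in-memory example.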

Taxonomic annotation of 16S amplicons using this data is available as an optional argument to the nf-core/ampliseq Nextflow workflow from version 2.1: `--dada_ref_taxonomy sbdi-gtdb` (https://nf-co.re/ampliseq; Straub et al. 2020).

The data will be updated approximately once a year, when the GTDB database is updated.

Curation

After download, sequences longer than 2000 base pairs and sequences containing undetermined bases ('N') were removed. Subsequently, the sequences and their reverse complements were aligned to the archaeal and bacterial SSU profiles from Barrnap (https://github.com/tseemann/barrnap) with hmmalign from HMMER (Eddy 2011). Sequences aligning to fewer than 1000 bases of their respective profile in both the forward and the reverse-complement direction were removed. Among the sequences passing these filters, only the longest sequence from each genome was kept.
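The length and undetermined-base filters, and the choice of the longest sequence per genome, can be sketched in Python as follows. This is only an illustration under stated assumptions, not the published Perl script: the function names and the `(genome_id, seq_id, seq)` record layout are hypothetical, ties in length are resolved by keeping the first sequence seen, and the hmmalign profile-alignment step is left out entirely.

```python
def passes_basic_filters(seq, max_len=2000):
    """True if the sequence has at most max_len bases and contains no 'N'."""
    return len(seq) <= max_len and "N" not in seq.upper()

def longest_per_genome(records, max_len=2000):
    """Keep the longest passing sequence for each genome.

    records: iterable of (genome_id, seq_id, seq) tuples (hypothetical layout).
    Returns a dict mapping genome_id -> (seq_id, seq).
    """
    best = {}
    for genome_id, seq_id, seq in records:
        if not passes_basic_filters(seq, max_len):
            continue  # too long, or contains undetermined bases
        current = best.get(genome_id)
        # Assumption: on equal length, the first sequence seen is kept.
        if current is None or len(seq) > len(current[1]):
            best[genome_id] = (seq_id, seq)
    return best
```

In a fuller pipeline, the alignment-coverage check would sit between the basic filters and the per-genome selection.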
For each species, a maximum of 5 sequences was selected, prioritizing sequences from GTDB species-representative genomes and longer sequences over shorter ones. The remaining sequences were then analyzed with Sativa (Kozlov et al. 2016), and sequences misclassified at any rank from genus to phylum were removed. A Perl script for filtering sequences before and after the Sativa analysis can be found in the `scripts` folder in the GitHub repo: https://github.com/biodiversitydata-se/sbdi-gtdb. Run `perl select_seq_sativa.pl --h` for documentation.
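The per-species selection described above could look roughly like the following Python sketch. It is not a port of `select_seq_sativa.pl`; the record keys (`species`, `seq_id`, `is_representative`, `length`) and the function name are assumptions made for the example.

```python
from collections import defaultdict

def select_per_species(records, max_per_species=5):
    """Pick up to max_per_species sequences per species.

    records: iterable of dicts with the hypothetical keys
    'species', 'seq_id', 'is_representative' (bool) and 'length' (int).
    Sequences from species-representative genomes come first,
    then longer sequences before shorter ones.
    """
    by_species = defaultdict(list)
    for rec in records:
        by_species[rec["species"]].append(rec)

    selected = {}
    for species, recs in by_species.items():
        ranked = sorted(
            recs,
            key=lambda r: (not r["is_representative"], -r["length"]),
        )
        selected[species] = ranked[:max_per_species]
    return selected
```

Sorting on a `(not representative, -length)` key keeps both priorities in a single pass; the Sativa analysis itself would run afterwards on the selected sequences.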