Software Tool Article

BlobTools: Interrogation of genome assemblies

[version 1; peer review: 2 approved with reservations]
PUBLISHED 31 Jul 2017

Abstract

The goal of many genome sequencing projects is to provide a complete representation of a target genome (or genomes) as underpinning data for further analyses. However, it can be problematic to identify which sequences in an assembly truly derive from the target genome(s) and which are derived from associated microbiome or contaminant organisms. 
We present BlobTools, a modular command-line solution for visualisation, quality control and taxonomic partitioning of genome datasets. Using guanine+cytosine content of sequences, read coverage in sequencing libraries and taxonomy of sequence similarity matches, BlobTools can assist in primary partitioning of data, leading to improved assemblies, and screening of final assemblies for potential contaminants. 
Through simulated paired-end read datasets containing a mixture of metazoan and bacterial taxa, we illustrate the main BlobTools workflow and suggest useful parameters for taxonomic partitioning of low-complexity metagenome assemblies.

Keywords

Bioinformatics, visualisation, genome assembly, quality control, contaminant screening

Introduction

Advances in next generation sequencing technologies have generated vast amounts of data and knowledge (Goodwin et al., 2016). The decrease in cost per nucleotide has led to an increased application of these technologies to non-model organisms, life forms which have so far not been intensively studied by the research community. Genome-enabled science on these species can then illuminate novel processes and reveal the patterns of evolution. For non-model species, the luxury of large amounts of material from cultured isolates is often not possible, and research must progress from organisms sourced from the wild or from complex mixtures of species. DNA extracted from a sample may actually contain genomes from multiple organisms – food sources, host material, symbionts, pathogens, commensals and external contaminants – in addition to the target organism. In some cases, the associated genomes can be considered “contaminants”, while in others, they can provide insights into the biology of the target organism. In all cases they should be identified, isolated and investigated with care.

Interrogation of genome assemblies to assure single-taxon origin is an elemental step in the genome sequencing process. Failure to identify non-target sequence can lead to false conclusions regarding the biology of the target organism, such as metabolic potential and events of horizontal gene transfer (HGT) between species. Several reports of HGTs into eukaryotic genomes have later been shown to have been based on undetected contamination in assemblies. Identification of contamination can radically change the conclusions of a study, as shown for the starlet sea anemone Nematostella vectensis (Artamonova & Mushegian, 2013) and the tardigrade Hypsibius dujardini (Koutsovoulos et al., 2016). Importantly, undetected non-target sequence contamination of published genomes will pollute public sequence databases and promote propagation of annotation errors.

Reliable assignment of a DNA sequence from a new assembly to its species-of-origin, i.e. the association of the sequence ID with a unique, numerical identifier (TaxID) of the National Center for Biotechnology Information (NCBI) Taxonomy database (Federhen, 2012), is a non-trivial problem. Current contaminant screening pipelines are based on sequence similarity to sequences of known origin, sequence composition signatures such as k-mers, and/or shared coverage profiles across different datasets. Few are readily applicable to datasets of eukaryotic genomes of any size (Eren et al., 2015; Kumar et al., 2013; Mallet et al., 2017; Tennessen et al., 2016). Anvi’o (Eren et al., 2015) partitions assemblies by clustering sequences based on the output of CONCOCT (Alneberg et al., 2014). CONCOCT uses Gaussian mixture models to predict the cluster membership of sequences by considering sequence composition and coverage profiles. PhylOligo (Mallet et al., 2017) relies exclusively on sequence composition and performs iterative, partially supervised clustering of sequences based on sequence composition profiles. ProDeGe (Tennessen et al., 2016) uses a fully unsupervised method based on sequence similarity to databases and sequence composition to partition assemblies using principal component analysis (PCA). It should be noted that while taxonomic assignment based on higher order sequence composition (such as k-mers of length 4 or greater) is highly effective for bacterial sequences, its success has been limited for eukaryotic genomes, as the information content, represented by the number of coding bases, is lower, and sequence composition spectra often show multimodal distributions (Chor et al., 2009).

Existing contaminant screening pipelines also differ in the way results are presented. Anvi’o depicts assemblies through interactive plots with rich annotations of sequence composition features, coverages across datasets and taxonomic/binning results. PhylOligo offers heatmaps of hierarchical clusterings of sequences, tree visualisations, and t-SNE (t-Distributed Stochastic Neighbor Embedding) plots, where sequence composition clusterings have been reduced to two dimensions. ProDeGe displays sequences in interactive, three-dimensional k-mer PCA plots.

BlobPlots, or taxon-annotated GC-coverage plots (Kumar et al., 2013), are another contamination detection and data partitioning methodology. BlobPlots are two-dimensional scatter plots, in which sequences are represented by dots and coloured by taxonomic affiliation based on sequence similarity search results. For each sequence, the position on the Y-axis is determined by the base coverage of the sequence in the coverage library, a proxy for molarity of input DNA. The position on the X-axis is determined by the GC content, the proportion of G and C bases in the sequence, which can differ substantially between genomes.
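
As a minimal illustration of the X-axis value, the GC proportion of a sequence could be computed as in the Python 3 sketch below. This is not taken from the BlobTools source: the function name gc_proportion is hypothetical, and excluding ambiguous bases such as N from the denominator is an assumption made for the example.

```python
def gc_proportion(sequence):
    """Return the proportion of G and C among the A, C, G and T bases."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Assumption: ambiguous bases (e.g. N) do not count towards the total.
    return gc / total if total else 0.0


print(gc_proportion("GGCCAATT"))
```

A sequence made up only of ambiguous bases returns 0.0 here, which avoids a division by zero; a real tool might prefer to flag such sequences instead.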

Here, we present BlobTools, a modular command-line solution for the visualisation of genome assemblies as BlobPlots, and taxonomic interrogation for purposes of quality control. BlobTools is a complete reimplementation of the Blobology pipeline (Kumar et al., 2013) focussed on usability, improved taxonomic assignment of sequences based on custom user input, and support for coverage information based on multiple formats and sequencing libraries. We demonstrate the features of BlobTools using synthetic datasets, and offer guidelines for efficient adoption of BlobTools into genome assembly programmes.

Methods

Implementation

BlobTools is written in Python and consists of a main executable that allows the user to interact with the implemented modules (see Table 1). It offers a simple, modular command line interface which can easily be adapted to process multiple datasets simultaneously using GNU parallel (Tange, 2011). Inputs for BlobTools are standard file formats commonly created during the course of genome assembly projects. The primary processing in BlobTools constructs a BlobDB data structure based on user input. From this data structure, BlobTools generates easily interpretable, two-dimensional visualisations ready for publication, in conjunction with tabular output, enabling the user to partition sequences and paired-end (PE) reads contributing to them, for separate downstream processing. We present two recommended workflows, one targeted at de novo genome assembly projects in the absence of a reference genome (Figure 1A) and another for projects where a reference genome is available (Figure 1B).

Table 1. Tasks performed by BlobTools module.

BlobTools module | Task
create | Parsing of input files and creation of BlobTools (JSON) data structure, i.e. BlobDB
view | Generation of tabular output for manual inspection and subsequent partitioning of sequences in the assembly, input files for CONCOCT, and/or COV files based on a BlobDB
plot | Plotting of BlobPlots based on a BlobDB
covplot | Plotting of CovPlots based on a BlobDB and a COV file
seqfilter | Partitioning of sequences from a FASTA file based on a list of sequence IDs
bamfilter | Partitioning of paired-end reads from a BAM file based on a list of sequence IDs and their mapping behaviour
map2cov | Generation of a COV file (containing base and read coverage) based on a BAM/CAS file
taxify | Annotation of tabular sequence similarity search output (e.g. BLAST/Diamond output) with TaxIDs from a mapping file or generation of a BlobTools hits file based on custom user input

Figure 1. Two common BlobTools workflows for taxonomic interrogation of paired-end (PE) read datasets.

(A) Workflow A. Targeted at de novo genome assembly projects in the absence of a reference genome. 1: Creation of a BlobDB data structure based on input files. 2: Visualisation of assembly and generation of tabular output. 3: Partitioning of sequence IDs in assembly, based on user-defined parameters informed by the visualisations. 4: Partitioning of PE reads based on sequence IDs. (B) Workflow B. Targeted at projects where a reference genome is available. 1: Reads are mapped against the reference genome. 2: BAM file is processed to generate FASTQ files based on read mapping behaviour. 3: The FASTQ file of read pairs where neither read maps to the reference genome (UnUn) is assembled de novo and used in workflow A. 4: The partition of read pairs of the target taxon recovered from workflow A is assembled together with the other target taxon read pairs from step 2 and used in workflow A.

Taxonomy assignment

Taxonomy assignment in BlobTools is based on user-supplied, tab-separated-value (TSV) files composed of three columns: the input sequence ID, an NCBI TaxID, and a numerical score. We refer to these TSV files as ‘hits’ files below. They can be generated from the output of sequence similarity searches, such as BLAST (Camacho et al., 2009) or Diamond blastx (Buchfink et al., 2015) searches against public or reference databases, or the output of other contaminant identification tools. The BlobTools module taxify allows easy conversion of tabular file formats to BlobTools-compatible input, in addition to annotation of similarity search results based on NCBI TaxID mapping files, as available from UniProt and NCBI.
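
To make the three-column layout concrete, the following Python 3 sketch reads a ‘hits’ file into a dictionary keyed by sequence ID. It illustrates the format only and is not how BlobTools parses its input: the function name parse_hits, the contig names, and the decision to skip blank lines and lines starting with ‘#’ are assumptions.

```python
import io


def parse_hits(handle):
    """Read a three-column hits file: sequence ID, NCBI TaxID, score."""
    hits = {}
    for line in handle:
        line = line.rstrip("\n")
        # Assumption: blank lines and '#' comment lines carry no hits.
        if not line or line.startswith("#"):
            continue
        seq_id, tax_id, score = line.split("\t")[:3]
        hits.setdefault(seq_id, []).append((int(tax_id), float(score)))
    return hits


# Hypothetical contigs with hits to C. elegans (6239), E. coli (562)
# and P. aeruginosa (287).
example = io.StringIO(
    "contig_1\t6239\t250.0\n"
    "contig_1\t562\t80.5\n"
    "contig_2\t287\t1200\n"
)
hits = parse_hits(example)
print(hits)
```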

Based on these inputs, BlobTools assigns a single NCBI taxonomy for each sequence in the assembly, based on the highest scoring NCBI TaxID at the following taxonomic ranks: species, genus, family, order, phylum, and superkingdom. Score calculation can be controlled by the user through a minimal score threshold (--min_score) and a minimal difference in scores (--min_diff) between the best and second-best scoring taxonomy. In addition, three non-canonical taxonomic annotations are possible: ‘no-hit’, the suffix ‘-undef’ and ‘unresolved’. Sequences not assigned to any taxonomic group, or not passing the --min_score threshold, are labelled ‘no-hit’. If an NCBI TaxID has no explicit parent at a taxonomic rank, the suffix ‘-undef’ is appended to the next upper taxonomic rank for which one does exist. In cases where the score difference between the best and second-best hits is smaller than --min_diff, sequences are labelled as ‘unresolved’.
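
The interplay of the two thresholds can be sketched in Python 3 as follows, for one sequence at one taxonomic rank. This is a simplified reading of the rules described above, not the BlobTools implementation: the function name assign_taxon is hypothetical, scores are assumed to be already aggregated per taxon, and treating a score equal to the threshold as passing is an assumption.

```python
def assign_taxon(scores, min_score=0.0, min_diff=0.0):
    """Pick a label from a dict mapping taxon name -> score at one rank."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # No hits at all, or the best score fails the minimal score threshold.
    if not ranked or ranked[0][1] < min_score:
        return "no-hit"
    # Best and second-best taxon are too close to call.
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < min_diff:
        return "unresolved"
    return ranked[0][0]


print(assign_taxon({"Nematoda": 500, "Proteobacteria": 480}, min_diff=50))
```

The ‘-undef’ suffix is not covered by this sketch, since it depends on the structure of the NCBI taxonomy rather than on scores.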

Multiple ‘hits’ files can be provided as input. In this case, the behaviour of the taxonomy assignment process can be controlled further through ‘taxrules’. The highest scoring taxonomy can either be inferred across all files (‘bestsum’) or successively (‘bestsumorder’) in the order they were supplied as input. Under ‘bestsumorder’, only sequences that received no hits from one file are considered for taxonomic annotation in the next file, which allows the user to rank input file sources by the reliability of their scores.
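
The difference between the two taxrules can be illustrated with a short Python 3 sketch for a single sequence at a single rank. It is a simplification under stated assumptions, not BlobTools code: the helper sum_scores is hypothetical, hits are given as (taxon, score) pairs, and the --min_score and --min_diff thresholds are left out.

```python
from collections import defaultdict


def sum_scores(hits):
    """Sum scores per taxon; hits is a list of (taxon, score) pairs."""
    totals = defaultdict(float)
    for taxon, score in hits:
        totals[taxon] += score
    return dict(totals)


def bestsum(hits_per_file):
    """Pool hits from all files, then take the highest summed score."""
    pooled = [hit for hits in hits_per_file for hit in hits]
    totals = sum_scores(pooled)
    return max(totals, key=totals.get) if totals else "no-hit"


def bestsumorder(hits_per_file):
    """Use only the first file, in input order, that has any hit."""
    for hits in hits_per_file:
        if hits:
            totals = sum_scores(hits)
            return max(totals, key=totals.get)
    return "no-hit"


# Hypothetical hits for one contig from two searches, in input order.
blastn = [("Nematoda", 300.0)]
diamond = [("Proteobacteria", 250.0), ("Proteobacteria", 200.0)]
print(bestsum([blastn, diamond]), bestsumorder([blastn, diamond]))
```

In this example the pooled scores favour Proteobacteria (450 against 300), whereas ‘bestsumorder’ keeps the annotation from the first file because that file already produced a hit.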

The original blobology pipeline (Kumar et al., 2013) recommended the use of a single, best BLAST hit per sequence for taxonomy assignment. However, taxonomically mis-annotated sequences in databases (derived from inclusion of un-screened genome assemblies) can lead to erroneous taxonomic annotation. BlobTools mitigates this issue by accepting multiple hits per sequence and allocating taxonomy based on the highest sum of scores.

It should be noted that a definitive taxonomic placement for every sequence in the assembly is not required for successful taxonomic partitioning of sequences, since differential coverage and sequence composition profiles between the genomes are often sufficient.

Visualisations

In BlobTools, sequences are depicted as circles in BlobPlots (as opposed to dots in the blobology pipeline), with diameters proportional to sequence length. The scatter-plot is decorated with coverage and GC histograms for each taxonomic group, which are weighted by the total span (cumulative length) of sequences occupying each bin. A legend reflects the taxonomic affiliation of sequences and lists count, total span and N50 by taxonomic group. Taxonomic groups can be plotted at any taxonomic rank and colours are selected dynamically from a colour map. The number of taxonomic groups to be plotted can be controlled (--plotgroups, default is ‘7’) and remaining groups are binned into the category ‘others’. An example is shown in Figure 2A.
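
The three legend values can be derived from sequence lengths alone. The Python 3 sketch below shows one common way to compute them; the function names n50 and group_summary are hypothetical and the code is not taken from BlobTools.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold half the span."""
    ordered = sorted(lengths, reverse=True)
    half_span = sum(ordered) / 2.0
    running = 0
    for length in ordered:
        running += length
        if running >= half_span:
            return length
    return 0  # empty input


def group_summary(lengths):
    """Return (count, total span, N50) for one taxonomic group."""
    return len(lengths), sum(lengths), n50(lengths)


print(group_summary([10, 20, 30, 40]))
```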


Figure 2. Visualisations of the combined assembly of simulated sequencing libraries.

(A) BlobPlot of the assembly. Sequences in the assembly are depicted as circles, with diameter scaled proportional to sequence length and coloured by taxonomic annotation (at the rank of ’order’) based on BLASTn and Diamond blastx similarity search results provided in this order and using taxrule ’bestsumorder’. Circles are positioned on the X-axis based on their GC proportion and on the Y-axis based on the sum of coverage across both library A and library B. (B) ReadCovPlot of library A. (C) ReadCovPlot of library B. In ReadCovPlots, mapped reads are shown by taxonomic group at the rank of ’order’.

The power of differential coverage profiles across different sequencing libraries for partitioning sequences in an assembly prompted the development of CovPlots (Figure 3) (Koutsovoulos et al., 2016), which are analogous to BlobPlots, except that the GC-axis is substituted by the coverage-axis from another sequencing library. CovPlots can be used for the visualisation of patterns of differential coverage signatures between taxonomic groups in the assembly.


Figure 3. CovPlot of the combined assembly of simulated sequencing libraries.

Sequences in the assembly are depicted as circles, with diameter scaled proportional to sequence length and coloured by taxonomic annotation (at the rank of ’order’) based on BLASTn and Diamond blastx similarity search results provided in this order and using taxrule ’bestsumorder’. Circles are positioned on the X-axis based on coverage in library A and on the Y-axis based on coverage in library B. Parameters for partitioning the sequences in the assembly (which were applied to the tabular representation of the BlobDB) are indicated as dotted grey lines and text annotations in the scatter plot.

The modules for generating BlobPlots and CovPlots support additional input parameters controlling visualisation behaviour, including cumulative addition (--cumulative) or generation of separate plots for each taxonomic group (--multiplot), exclusion (--exclude) or relabelling (--relabel) of taxonomic groups, assignment of specific HEX colours to groups (--colour) or labelling sequences based on arbitrary, user defined categories (--catcolour). The latter could be, for instance, binned categories of RNAseq mappings to sequences in the assembly as shown in Koutsovoulos et al. (2016).

ReadCovPlots (Figure 2B and 2C) visualise the proportion of reads of a library that are unmapped or mapped, showing the percentage of mapped reads by taxonomic group, as barcharts. These can be of use for rapid taxonomic screening of multiple sequencing libraries within a single project. The underlying data of ReadCovPlots and additional metrics are written to tabular text files for custom analyses by the user.
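
The values behind such a bar chart amount to simple proportions. A minimal Python 3 sketch, assuming that read counts per taxonomic group and the total number of reads in the library are already known (the function name read_percentages and the counts are hypothetical):

```python
def read_percentages(mapped_by_group, total_reads):
    """Percentage of a library's reads per taxonomic group, plus unmapped."""
    mapped = sum(mapped_by_group.values())
    result = {
        group: 100.0 * count / total_reads
        for group, count in mapped_by_group.items()
    }
    result["unmapped"] = 100.0 * (total_reads - mapped) / total_reads
    return result


print(read_percentages({"Rhabditida": 600, "Enterobacterales": 300}, 1000))
```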

Support of multiple coverage libraries

BlobTools supports coverage input (BAM/CAS format) from multiple sequencing libraries. As these data formats contain more information than needed, BlobTools parses coverage information of sequences (normalised base coverage and read coverage) into COV files in TSV format. These files can be generated through the module map2cov prior to construction of a BlobDB.

Within the BlobDB data structure, base and read coverage information is stored for each sequence in the assembly. If more than one coverage file is supplied, BlobTools constructs an additional coverage library (‘cov_sum’) internally, containing the sum of coverages for each sequence across all coverage files. This internal coverage library is considered when extracting views or plotting visualisations.
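
The construction of ‘cov_sum’ can be pictured with the Python 3 sketch below. The nested-dictionary layout, the library names and the function name add_cov_sum are assumptions made for illustration and do not reflect the actual BlobDB schema.

```python
def add_cov_sum(coverage):
    """coverage maps library name -> {sequence ID -> base coverage}."""
    result = dict(coverage)
    # A summed library is only added when several libraries were supplied.
    if len(coverage) > 1:
        cov_sum = {}
        for per_sequence in coverage.values():
            for seq_id, value in per_sequence.items():
                cov_sum[seq_id] = cov_sum.get(seq_id, 0.0) + value
        result["cov_sum"] = cov_sum
    return result


libraries = {
    "library_A": {"contig_1": 50.0, "contig_2": 25.0},
    "library_B": {"contig_1": 25.0, "contig_2": 0.0},
}
print(add_cov_sum(libraries)["cov_sum"])
```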

Operation

System requirements for BlobTools include a UNIX based operating system, Python 2.7, and pip. An installation script is provided, which installs Python dependencies, downloads and processes a copy of the NCBI TaxDump, and downloads and compiles a copy of samtools (Li et al., 2009). Instructions for installation and execution of BlobTools can be found at https://github.com/DRL/blobtools.

Two common BlobTools workflows for taxonomic interrogation of paired-end (PE) read datasets are depicted in the flowchart in Figure 1. Workflow A is targeted at de novo genome assembly projects where there is no preexisting reference genome. Workflow B should be followed where a reference genome is available.

Workflow A (Figure 1A) proceeds through construction of a BlobDB data structure based on input files (step A1), visualisation of assembly and generation of tabular output (A2), partitioning of sequence IDs based on user-defined parameters informed by the visualisations (A3) and partitioning of PE reads based on sequence IDs (A4). It should be noted that while the BlobTools module create (step A1) supports multiple mapping formats, it is recommended that these are processed in advance using map2cov. Generation of tabular ‘hits’ files is simplified through the module taxify, which allows annotation of similarity search results based on TaxID mapping files or based on custom user input in tabular format.

BlobTools can process both PE and single-end read files. The module bamfilter in step A4 is only of relevance if PE read data is used, since single-end read data can easily be partitioned using GNU grep or other tools. The module bamfilter can be controlled with a list of sequence IDs to include or to exclude. Use of an exclusion list causes all sequence IDs, except those specified, to be included. In both cases it will output up to four interleaved FASTQ files depending on the actual mapping behaviour of the read pairs and whether the parameter --include_unmapped is provided. Possible mapping behaviours of read pairs are: both reads mapping to included sequences (included-included: InIn), one read mapping to an included sequence and the other being unmapped (InUn), and one read mapping to an included sequence and the other mapping to an excluded sequence (ExIn). If the --include_unmapped parameter is specified, the module also writes read pairs where neither read maps to the assembly (UnUn). The latter case can occur if the assembler used for generating the sequences did not make use of all reads in the dataset. The resulting partitioned PE read files can then be assembled separately and the workflow is repeated. Decisions concerning which PE read files to use are left to the discretion of the user. However, as a general rule, if target taxa have been sequenced at low coverage it might be preferable to be inclusive (using the InIn, ExIn, InUn and UnUn FASTQ files for assembly) and risk including non-target reads, rather than to be exclusive (using only InIn and InUn for assembly) and risk losing significant proportions of reads from target genomes.
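
The classification of read pairs described above can be sketched in Python 3 as follows. This illustrates the logic only and does not read BAM files. The function names are hypothetical, and the labels ‘ExEx’ and ‘ExUn’ are assumed names for pairs that touch no included sequence and are therefore not written in this sketch.

```python
def classify_pair(ref1, ref2, included):
    """Classify a read pair by the mapping behaviour of its two reads.

    ref1 and ref2 are the IDs of the sequences the reads map to, or None
    for an unmapped read; included is the set of sequence IDs to keep.
    """
    def state(ref):
        if ref is None:
            return "Un"
        return "In" if ref in included else "Ex"

    # Sorting alphabetically gives the labels InIn, InUn, ExIn and UnUn,
    # plus the assumed labels ExEx and ExUn.
    return "".join(sorted([state(ref1), state(ref2)]))


def is_written(category, include_unmapped=False):
    """Decide whether pairs of this category go to a FASTQ file."""
    written = {"InIn", "InUn", "ExIn"}
    if include_unmapped:
        written.add("UnUn")
    return category in written


print(classify_pair("contig_1", None, {"contig_1"}))
```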

Workflow B (Figure 1B) should be applied when a reference genome is available. Reads are mapped against the reference genome (B1) and the resulting BAM file is processed with the module bamfilter (B2) using the parameter --include_unmapped and without providing a list of sequences. This will result in three FASTQ files: InIn, InUn and UnUn. Since taxonomic origin of the InIn and InUn reads has been established through the mapping step, only the UnUn reads are assembled de novo (B3) and processed via workflow A. This decreases computational requirements substantially. If workflow A yields a PE read partition of the target organism, which will consist of parts of the organism’s genome not present in the reference, these reads can be used together with the InIn and InUn reads from step B2 to generate a new assembly (B4), which should be screened again via workflow A. This iterative procedure can easily be applied to projects studying highly variable species where segmental presence-absence is common and a reference genome is expanded (to form a pangenome) as new samples are sequenced, or holobiomes, where reference genomes of multiple taxa are expanded as new samples are added.

Use cases

A detailed description of the programs and commands used can be found in Supplementary File 1.

Data

To illustrate workflow A (Figure 1A), we simulated read libraries for the nematode Caenorhabditis elegans contaminated with other organisms (see Table 2). Library A contains C. elegans reads contaminated with reads from Escherichia coli, Homo sapiens chromosome 19 and H. sapiens mitochondrial (mtDNA) genome, mimicking a dataset where the target genome is contaminated with DNA from food (E. coli) and operator (H. sapiens). Library B is composed of C. elegans reads contaminated with Pseudomonas aeruginosa, mimicking a project where the metazoan target species is heavily colonised by a prokaryotic organism.

Table 2. Simulated read libraries.

Dataset | Reference genome | INSDC Accession | Coverage (X)
Library A | C. elegans N2 | GCA_000002985.3 | 50
Library A | E. coli str. K-12 substr. MG1655 | GCA_000801205 | 25
Library A | H. sapiens chr19 GRCh38.p10 | GCA_000001405.25 | 10
Library A | H. sapiens mtDNA GRCh38.p10 | GCA_000001405.25 | 250
Library B | C. elegans N2 | GCA_000002985.3 | 25
Library B | P. aeruginosa PAO1 | GCA_000006765.1 | 100

Taxonomic interrogation and partitioning of read pairs using BlobTools

We assembled both read datasets together and mapped each library individually against the assembly. We supplied the assembly to BlobTools, in addition to coverage information extracted from both BAM files and the results of sequence similarity searches.

To simulate cases where sequences of genomes in the assembly are not part of public sequence databases, we removed all sequences annotated under the taxonomic terms ‘Caenorhabditis elegans’, ‘Hominids’, ‘Escherichia’, ‘Pseudomonas’, and ‘Other sequences’ before conducting sequence similarity searches. The search results provided to BlobTools were a BLASTn megablast search against NCBI nt (-outfmt '6 qseqid staxids bitscore std' -max_target_seqs 1 -max_hsps 1 -evalue 1e-25) and Diamond blastx searches against UniProt Reference Proteomes (--outfmt 6 --sensitive --max-target-seqs 1 --evalue 1e-25), supplied in this order and using taxrule ‘bestsumorder’.

A BlobPlot (Figure 2A), ReadCovPlots (Figure 2B and C) and a CovPlot (Figure 3) were generated at the taxonomic rank of ’order’. A tabular view of the BlobDB was generated using the module view under the taxrule ’bestsumorder’ and for the taxonomic ranks of ’superkingdom’, ’phylum’, and ’order’. We partitioned sequences based on differential coverage and taxonomy annotation (Figure 3) using the tabular view and the UNIX tools GNU grep, GNU cut, and GNU awk. Subsequently, read pairs were partitioned based on mapping behaviour to these sequence partitions using the module bamfilter and read pairs where both reads mapped to included sequences (i. e. the InIn set) were assembled by taxonomic group.
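
The partitioning step was carried out with grep, cut and awk, but the idea can also be sketched in Python 3. The example below selects sequences annotated with a given order and additionally rescues ‘no-hit’ sequences whose coverage in both libraries reaches a threshold. The row keys (seq_id, order, cov_a, cov_b), the function name partition_ids and the threshold values are hypothetical; they are not the column names of the BlobTools view output nor the parameters shown in Figure 3.

```python
def partition_ids(rows, taxon, min_cov_a, min_cov_b):
    """Return IDs annotated as `taxon`, plus 'no-hit' sequences whose
    coverage in both libraries reaches the given thresholds."""
    selected = []
    for row in rows:
        annotated = row["order"] == taxon
        rescued = (
            row["order"] == "no-hit"
            and row["cov_a"] >= min_cov_a
            and row["cov_b"] >= min_cov_b
        )
        if annotated or rescued:
            selected.append(row["seq_id"])
    return selected


# Hypothetical rows from a tabular view of the assembly.
rows = [
    {"seq_id": "contig_1", "order": "Rhabditida", "cov_a": 48.0, "cov_b": 26.0},
    {"seq_id": "contig_2", "order": "no-hit", "cov_a": 51.0, "cov_b": 24.0},
    {"seq_id": "contig_3", "order": "no-hit", "cov_a": 0.0, "cov_b": 98.0},
    {"seq_id": "contig_4", "order": "Enterobacterales", "cov_a": 25.0, "cov_b": 0.0},
]
print(partition_ids(rows, "Rhabditida", min_cov_a=20.0, min_cov_b=10.0))
```

The resulting list of IDs is the kind of input that the modules seqfilter and bamfilter take for partitioning sequences and read pairs.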

We then generated BlobPlots for the four assemblies (named ‘rhabditida-BT’, ‘primates-BT’, ‘pseudomonadales-BT’ and ‘enterobacterales-BT’) (Figure 4). Coverage information was based on mapping of both simulated sequencing libraries against all four assemblies and sequences were coloured based on the genome-of-origin of the simulated reads mapping to them.


Figure 4. BlobPlots of assemblies by taxon after read partitioning using BlobTools.

Coverage was obtained by mapping original reads to assemblies. Sequences are taxonomically annotated with ’true’ taxonomy based on origin of simulated reads mapping to them. Sequences labelled as ’no-hit’ did not receive any reads mapped to them. (A) Assembly of partition of Rhabditida reads (’rhabditida-BT’). One P. aeruginosa sequence (span 4,886 nt) remains. (B) Assembly of partition of Primates reads (’primates-BT’). Five E. coli sequences (total span 3,838 nt) remain. (C) Assembly of partition of Pseudomonadales reads (’pseudomonadales-BT’). (D) Assembly of partition of Enterobacterales reads (’enterobacterales-BT’). One sequence of P. aeruginosa (span 254 nt) remains.

Evaluation of results

Cleaned assemblies were evaluated based on the count of simulated reads, by genome-of-origin, mapping to them (Table 3), and based on standard assembly metrics (Table 4).

Table 3. Percentages of reads (partitioned by taxonomic origin) mapped to sequences in each of the BlobTools-processed assemblies (suffix ‘-BT’).

*: Reads that did not map to any sequence are listed under ’Not Mapped’. Bold: Zero reads mapped.

Taxonomic origin of simulated reads | Mapping to rhabditida-BT (%) | Mapping to primates-BT (%) | Mapping to pseudomonadales-BT (%) | Mapping to enterobacterales-BT (%) | Not mapped (%)
C. elegans | 99.99 | 0.00 | 0.00 | 0.00 | 0.01
H. sapiens | 0.02 | 99.33 | 0.00 | 0.00 | 0.66
P. aeruginosa | 0.29 | 0.00 | 99.66 | 0.03 | 0.02
E. coli | 0.72 | 0.22 | 0.06 | 98.64 | 0.35

To account for assembly and mapping biases, the original simulated read sets were also assembled separately by taxon, yielding the assemblies CELEG-SIM (reads simulated from the C. elegans genome), HSAPI-SIM (reads simulated from H. sapiens chromosome 19 and mtDNA), PAERU-SIM (reads simulated from P. aeruginosa genome), and ECOLI-SIM (reads simulated from E. coli genome).

We evaluated the effect of parameters of similarity searches against public databases on taxonomic annotation using BlobTools (see Supplementary File 2). Since exhaustive searches against large databases require time and computing power we focussed on parameters that limit resource usage and control the number of returned results. Both BLASTn and Diamond blastx implement options that limit the number of target sequences (-max_target_seqs in BLASTn, --max-target-seqs in Diamond) and the number of high-scoring pairs (-max_hsps in BLASTn, --max-hsps in Diamond). The former is an early filter applied during primary search and excludes initial hits from later examination. The latter controls the number of high-scoring pairs (HSPs) reported between a query and a subject in the search. The BLAST-specific parameter -culling_limit controls the number of hits that can be allocated to a given region on the query. For this dataset, the best trade-off between false positive and false negative taxonomic annotations was achieved by combining a BLASTn search (-max_target_seqs 10 -evalue 1e-25) against NCBI nt with Diamond blastx searches (--evalue 1e-25 --max-target-seqs 1) against UniProt Reference Proteomes, in this order, using BlobTools taxrule ‘bestsumorder’. However, a much faster search with acceptable outcome was achieved by changing the BLASTn parameters to -max_target_seqs 1 -max_hsps 1.

Summary

We have presented the BlobTools pipeline and illustrated the main BlobTools workflow (Figure 1A) by successfully disentangling read pairs from two simulated datasets composed of metazoan and bacterial genomes. The small fraction of read pairs that received an erroneous taxonomic assignment or were left out during the partitioning step (Table 3) had little effect on the overall assembly success for each taxon (Table 4). The outcome could have been improved further by being more inclusive during the partitioning step of sequences (to decrease the number of unassigned read pairs), combined with a second round of BlobTools workflow A (to remove read pairs which were partitioned into the wrong taxonomic group).

Table 4. Metrics of reference genomes (suffix ‘-REF’), assemblies generated from simulated reads by taxon (suffix ‘-SIM’) and assemblies generated from reads partitioned using BlobTools pipeline (suffix ‘-BT’).

Metric | CELEG-REF | CELEG-SIM | rhabditida-BT | HSAPI-REF | HSAPI-SIM | primates-BT | PAERU-REF | PAERU-SIM | pseudomonadales-BT | ECOLI-REF | ECOLI-SIM | enterobacterales-BT
Span (b) | 100,286,401 | 95,970,640 | 95,964,660 | 58,634,185 | 50,765,888 | 50,660,776 | 6,264,404 | 6,221,846 | 6,215,193 | 4,636,831 | 4,561,104 | 4,534,517
Count | 7 | 5,536 | 5,616 | 2 | 12,700 | 12,504 | 1 | 58 | 52 | 1 | 87 | 66
N50 (b) | 17,493,829 | 51,178 | 50,209 | 58,617,616 | 8,186 | 8,200 | 6,264,404 | 333,929 | 426,963 | 4,636,831 | 148,391 | 151,538
GC (%) | 35.4 | 35.4 | 35.4 | 47.9 | 48.4 | 48.4 | 66.6 | 66.6 | 66.6 | 50.8 | 50.7 | 50.7
BUSCO (Complete, single copy in %) | 97.8 | 92.7 | 92.8 | 3.1 | 1.5 | 1.6 | 98.2 | 98.2 | 98.2 | 99.5 | 99.5 | 99.2
BUSCO (Complete, duplicated in %) | 0.6 | 0.4 | 0.4 | 0.1 | 0 | 0 | 0.2 | 0.4 | 0.4 | 0 | 0 | 0
BUSCO (Fragmented in %) | 0.8 | 5.5 | 4.6 | 0.6 | 1 | 1 | 0 | 0 | 0 | 0.4 | 0.4 | 0.6
BUSCO (Missing in %) | 0.8 | 1.4 | 2.2 | 96.2 | 97.5 | 97.4 | 1.6 | 1.4 | 1.4 | 0.1 | 0.1 | 0.2

The ease of interpretation of BlobPlots has favoured adoption by users, and the current implementation of BlobTools has been applied successfully to genome projects involving tardigrades (Koutsovoulos et al., 2016; Yoshida et al., 2017), mealybugs and their endosymbionts (Husnik & McCutcheon, 2016), ectoparasitic mites (Dong et al., 2017), diptera (Dikow et al., 2017), honeybees and their metagenomes (Gerth & Hurst, 2017), nematodes (Eves-van den Akker et al., 2016; Gawryluk et al., 2016; Slos et al., 2017; Szitenberg et al., 2017), bacteria (Fuller et al., 2017; Mellbye et al., 2017; Samad et al., 2016; Wang & Chandler, 2016), butterflies (Nowell et al., 2017), a fungal pathogen of barley (McGrann et al., 2016), and fungi (Compant et al., 2017).

BlobTools is a user-friendly and reliable solution for visualisation, quality control and taxonomic partitioning of genome datasets. Wider adoption of BlobTools screening by the research community will help control the influx of taxonomically mis-annotated sequences into public sequence databases and prevent inaccurate biological conclusions based on contaminated genome assemblies.

Software and data availability

BlobTools source code: https://github.com/DRL/blobtools

Archived source code as at time of publication: http://doi.org/10.5281/zenodo.833879 (Laetsch et al., 2017)

License: GNU-GPL

A walk through for all analyses in this study is deposited at https://github.com/DRL/blobtools_manuscript, together with additional code and resulting output files.

How to cite this article

Laetsch DR and Blaxter ML. BlobTools: Interrogation of genome assemblies [version 1; peer review: 2 approved with reservations] F1000Research 2017, 6:1287 (https://doi.org/10.12688/f1000research.12232.1)

Open Peer Review

Reviewer Report 27 Sep 2017
Richard M Leggett, Earlham Institute, Norwich, UK
Approved with Reservations
This paper describes BlobTools, an open source software package for partitioning of genomic data, principally for contamination control. It is a reimplementation of the Blobology pipeline previously described by one of the authors.

Leggett RM. Reviewer Report For: BlobTools: Interrogation of genome assemblies [version 1; peer review: 2 approved with reservations]. F1000Research 2017, 6:1287 (https://doi.org/10.5256/f1000research.13242.r25294)
Reviewer Report 11 Aug 2017
A. Murat Eren, Department of Medicine, University of Chicago, Chicago, IL, USA; Marine Biological Laboratory, Woods Hole, MA, USA
Approved with Reservations
The study by Laetsch and Blaxter describes the workflow of BlobTools, an open source software package for the curation of low-complexity metagenomic assemblies. The work is well-written and clear, and the efficacy of the tool have already been demonstrated by many ...
Eren AM. Reviewer Report For: BlobTools: Interrogation of genome assemblies [version 1; peer review: 2 approved with reservations]. F1000Research 2017, 6:1287 (https://doi.org/10.5256/f1000research.13242.r24671)
  • Author Response 21 Aug 2017
    Dom Laetsch, The James Hutton Institute, Dundee, DD2 5DA, UK
    21 Aug 2017
    Author Response
    Dear Murat,

    Let me first thank you for reviewing our manuscript.

    We completely agree with your comments and suggestions and will:
    • Expand on our description of the Anvi’o pipeline
    ...
