BBMerge – Accurate paired shotgun read merging via overlap

Brian Bushnell; Jonathan Rood; Esther Singer

doi:10.1371/journal.pone.0185056

Abstract

Merging paired-end shotgun reads generated on high-throughput sequencing platforms can substantially improve various subsequent bioinformatics processes, including genome assembly, binning, mapping, annotation, and clustering for taxonomic analysis. With the inexorable growth of sequence data volume and CPU core counts, the speed and scalability of read-processing tools become ever more important. The accuracy of shotgun read merging is crucial as well, as errors introduced by incorrect merging percolate through to reduce the quality of downstream analysis. Thus, we designed a new tool to maximize accuracy and minimize processing time, allowing the use of read merging on larger datasets, and in analyses highly sensitive to errors. We present BBMerge, a new merging tool for paired-end shotgun sequence data. We benchmark BBMerge by comparison with eight other widely used merging tools, assessing speed, accuracy and scalability. Evaluations of both synthetic and real-world datasets demonstrate that BBMerge produces merged shotgun reads with greater accuracy and at higher speed than any existing merging tool examined. BBMerge also provides the ability to merge non-overlapping shotgun read pairs by using k-mer frequency information to assemble the unsequenced gap between reads, achieving a significantly higher merge rate while maintaining or increasing accuracy.

Citation: Bushnell B, Rood J, Singer E (2017) BBMerge – Accurate paired shotgun read merging via overlap. PLoS ONE 12(10): e0185056. https://doi.org/10.1371/journal.pone.0185056

Editor: Patrick Jon Biggs, Massey University, NEW ZEALAND

Received: April 6, 2017; Accepted: September 6, 2017; Published: October 26, 2017

This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

Data Availability: Mock community data are available from http://genome.jgi.doe.gov/MeCorS/MeCorS.home.html. Synthetic data generated from the genome of Chlamydomonas reinhardtii (v3.0) is available at https://genome.jgi.doe.gov/Chlre3/Chlre3.home.html and ftp://ftp.jgi-psf.org/pub/JGI_data/Chlamy/v3.0/Chlre3.fasta.gz.

Funding: This work was conducted by the U.S. Department of Energy Joint Genome Institute, a DOE Office of Science User Facility, and is supported under Contract No. DE-AC02-05CH11231. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Many sequencing platforms, including Illumina and Ion Torrent (which together comprise the majority of sequencing capacity at many institutions), produce relatively short reads of tens to low hundreds of bases. Short read lengths result from the decline of signal intensity and integrity with each successive base during the sequencing process. To compensate for this, paired-end reads are generated by sequencing both end regions of a nucleic acid fragment [1].

Although many advances have been achieved using paired-end sequencing, there remain situations in which single, longer reads are preferable to paired shorter reads, such as de novo assembly contig-building, read binning or clustering, gene annotation, and small-variant calling. To address this need, several programs have been designed to merge paired short reads into single longer reads; however, most of these are designed to primarily merge 16S rRNA gene amplicon sequences rather than shotgun sequence data.

In this study, we describe BBMerge, a new overlap-based tool for merging short high-throughput shotgun sequencing reads. BBMerge allows simple adjustment of merging sensitivity to accurately and efficiently process large datasets from a variety of sequence types. We designed BBMerge to address common difficulties associated with paired-end shotgun read merging, i.e. reducing incorrect merge rates, increasing scalability, and handling non-overlapping pairs from longer fragments, which most tools cannot merge. BBMerge’s performance is compared to existing read merging tools that allow shotgun read input using both synthetic and real-world data from Chlamydomonas reinhardtii and a defined microbial community with bacterial and archaeal members (MBARC-26) [2], respectively.

Materials and methods

Synthetic and real-world sequence data

In order to evaluate merging performance, we used synthetically generated data from a eukaryotic genome to allow precise evaluation of merging accuracy as well as real-world shotgun metagenome data from a prokaryotic community. These two datasets include eukaryotic, bacterial and archaeal organisms with complete reference genomes spanning a large spectrum of %GC.

We synthetically generated 20 million reads based on the Chlamydomonas reinhardtii genome (v3.0), which was retrieved from the JGI Plant Genomics Resource Phytozome (ftp://ftp.jgi-psf.org/pub/JGI_data/Chlamy/v3.0/Chlre3.fasta.gz). Synthetic reads were generated using BBMap (https://sourceforge.net/projects/bbmap/) as follows: first, reference sequences were indexed (Table 1A). Second, synthetic reads were generated (Table 1B). Third, read headers were renamed according to their known insert size, to allow subsequent grading (Table 1C). Fourth, reads were decompressed and moved to ramdisk (Table 1D).

Table 1. Test data setup.

https://doi.org/10.1371/journal.pone.0185056.t001

The real-world data comprise shotgun metagenomic sequence data from MBARC-26, a microbial mock community consisting of 23 bacterial and 3 archaeal strains [3–10]. DNA extraction from MBARC-26, Illumina metagenome library creation, and shotgun sequencing were performed as described in [4], yielding 2 × 150 bp reads.

Reference genomes for MBARC-26 were retrieved from JGI’s IMG [11] and used for mapping as follows. First, reference genomes were indexed (Table 1E). Second, shotgun metagenome reads were mapped to the reference sequences using BBMap’s default settings (Table 1F) to (a) determine insert sizes and (b) remove reads that mapped with indels or that did not map in a properly paired orientation. This filtering step ensured the correct determination of the insert size for each read pair for subsequent grading: insert sizes of unpaired reads cannot be determined, and for reads mapped with indels the insert size calculated by mapping differs from that calculated by merging. Mapping was not necessary for the synthetic data as the true insert size was known a priori. The remaining shotgun metagenome reads were subsampled to 20 million read pairs (Table 1G).

Grading was performed using GradeMerge (Table 1H) to obtain the number of correctly and incorrectly merged reads. A merged read was considered correct if its length exactly matched the insert size indicated by its header. The reported percentage values and signal-to-noise ratio (SNR) are defined as:

$$C\% = 100 \times \frac{C}{P} \qquad (1)$$

$$I\% = 100 \times \frac{I}{P} \qquad (2)$$

$$\mathrm{SNR} = 10 \times \log_{10}\left(\frac{C}{I}\right) \qquad (3)$$

where:

a. C is the number of correctly merged reads.
b. I is the number of incorrectly merged reads.
c. C% is the percent of correctly merged reads.
d. I% is the percent of incorrectly merged reads.
e. P is the number of input read pairs.
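
As an illustration of the grading rule and the three metrics, the following Python sketch grades a list of merged reads. It is not GradeMerge itself: the function name `grade_merges` and the input layout (tuples of merged length and true insert size) are hypothetical, the toy counts are invented for the example, and the decibel form of the SNR, 10 × log10(C/I), is an assumption made for this sketch.

```python
import math


def grade_merges(merged, total_pairs):
    """Grade merged reads given (merged_length, true_insert_size) tuples.

    total_pairs is P, the number of input read pairs (merged or not).
    Returns (C%, I%, SNR); SNR is None when C or I is zero.
    """
    # A merge counts as correct only if its length equals the true insert size.
    correct = sum(1 for length, insert in merged if length == insert)  # C
    incorrect = len(merged) - correct                                  # I
    pct_correct = 100.0 * correct / total_pairs                        # C%
    pct_incorrect = 100.0 * incorrect / total_pairs                    # I%
    # Assumed SNR form: 10 * log10(C / I), undefined if either count is zero.
    snr = 10.0 * math.log10(correct / incorrect) if correct and incorrect else None
    return pct_correct, pct_incorrect, snr


# Hypothetical toy input: 100 correct merges and 1 incorrect merge from 200 pairs.
toy = [(250, 250)] * 100 + [(240, 250)]
print(grade_merges(toy, total_pairs=200))
```

Because both percentages are taken relative to P rather than to the number of merged reads, pairs that a tool leaves unmerged lower C% without raising I%.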

Assembly quality was evaluated using raw shotgun metagenomic reads from MBARC-26 subsampled to 20 million read pairs (Table 1I). To eliminate potential impact originating from pre-processing, reads were not filtered or trimmed. Reads were merged with each tool, then both the merged and unmerged output was passed to SPAdes v. 3.8.2 [12] for assembly in metagenome mode (Table 1J). Assembled contigs were compared to the metagenome reference using QUAST v. 4.2 [13] for evaluation (Table 1K). Global and local misassemblies as defined in [13] were combined and are reported as “total misassemblies”.

Paired-end read merging tools

All algorithms for read merging compared here (Table 2) are based on overlap detection [14–19], with the exception of leeHom [20] and BBMerge, which additionally use adapter-sequence detection; and COPE [18] and BBMerge, which additionally use k-mer counts in non-default modes. All tools were executed as described in S1 Table.

Table 2. Read merging tools compared in this study in alphabetical order.

https://doi.org/10.1371/journal.pone.0185056.t002

Although an effort was made to compare all available overlap-based read merging tools for a comprehensive evaluation in this study, the testing methodology precluded the use of PANDAseq [22], which cannot process reads with renamed headers. Eloper [23] was tested, but not included, as it was unable to produce fastq files or retain the original read headers.

Parameters and testing

Each program was tested for speed, accuracy, and scalability. All testing was executed on the NERSC Genepool cluster (http://www.nersc.gov/users/computational-systems/genepool/), using a 1 TB, 32-core node based on Intel Xeon E5-4650L CPUs @ 2.60 GHz. Reads and writes were all performed using a ramdisk to eliminate any impact of contention for the cluster’s shared file system.

Execution of merging tools was performed according to each program’s defaults, except as noted (S1 Table). For accuracy testing, each program was run multiple times; the single parameter that was identified to impact the respective tool’s sensitivity most was varied between runs (if available) (S2 Table). After each run, the resulting output was graded, i.e. each merged read’s length was compared to the true insert size noted in that read’s header.

Speed and scalability testing was executed using the Linux “time” command, e.g. “time bbmerge.sh <other options>”, with default parameters and varying numbers of threads. For BBMerge, three modes were included in this study: default, REM, and RSEM, as described in the sections on BBMerge overlap-detection and BBMerge k-mer-based modes. For COPE, two modes were included: default (M0), using simple overlap only, and M3, using k-mers to join non-overlapping pairs. COPE’s M1 mode was not found to differ substantially from M0, and M2 did not produce output, so neither is included. Speed tests were performed on both synthetic and real-world shotgun metagenome reads. Since no significant difference was found, we only report test results for the real-world metagenome data.

Results

BBMerge overlap-detection

Overlap-detection involves multiple heuristics, controlled by constants denoted C_i. These have already been optimized through extensive empirical testing and do not need to be adjusted by the user; they are only presented to describe the algorithm. For each read pair:

1. Read 2 is reverse-complemented, because read 1 and read 2 are produced from opposite strands of the initial DNA fragment.
2. Read 1 and read 2 are aligned in every possible offset.
   a. An “offset” is defined by the relative start position of the reads. For offset O = 0, each base number X_i of read 1 aligns to base number X_i of read 2. In general, each base X_i in read 1 aligns to base X_{i+O} in read 2.
   b. This alignment only counts matches and mismatches; indels are not allowed.
3. The standard mode for determining the offset is called “ratio mode”. For each offset, a ratio R is calculated:

$$R = \frac{B + C_0}{G} \qquad (4)$$

where B is the number of mismatches, G is the number of matches, and C₀ is a constant. An optional flag, “ouq”, allows B and G to be calculated using quality scores, but this is only helpful if the quality scores are accurate.

4. The two best (lowest) ratios, R₁ and R₂, are tracked throughout the process.
5. Once the alignments finish, R₁ and R₂ are examined to decide whether an alignment will be accepted (Fig 1A) or discarded (Fig 1B), using heuristics with different constants.
   a. If R₁>C₁, the alignment will be rejected as invalid.
   b. If R₁*C₂>R₂, the alignment will be rejected as ambiguous.
   c. If R₂<C₃, the alignment will be rejected as ambiguous.
   d. If G<max(C₄, V), the alignment will be rejected as having too short an overlap. V is derived from the sequence complexity of a given pair, decreasing as complexity increases.
   e. If S<C₆, the alignment will be rejected as too short. S is the insert size implied by the alignment.
   f. Otherwise, the best alignment will be reported for further consideration.
6. At extreme sensitivity settings, an additional algorithm, “flat mode”, is used. This mode determines the best overlap by minimizing the number of mismatching bases.
   a. At the “xstrict” and “ustrict” settings, the alignment is only accepted if the best offset from flat mode matches the best offset from ratio mode.
   b. At the “xloose” setting, an alignment produced by flat mode will be accepted if no alignment was produced by ratio mode.
   c. Otherwise, flat mode is not used.
7. If the pair has an alignment reported in step 5 or 6, it is subjected to further scrutiny.
   a. If the implied insert size is shorter than the read length, and adapter sequences have been specified, the non-overlapping portions of the reads are aligned to their respective expected adapter sequences. If they do not match, the alignment is rejected.
   b. The number of expected mismatches (E) in the overlap is calculated using quality scores. If B>E*C₅, the alignment is rejected.
   c. The probability (P) of the specific pattern of matches and mismatches is calculated. If P<C₆, the alignment is rejected.
8. If, at this point, the alignment has not been rejected, the read pair is merged to create a new read of size equal to the insert size implied by the overlap.
   a. The overlapping portions of the reads are represented in the resulting read as a consensus of the two parent sequences. Matching bases are assigned an increased quality score; for non-matching bases, the base with the higher quality score is used, and is assigned a quality score equal to the difference between the two parent qualities. Where both quality scores are equal and the bases mismatch, the resulting base is N.
   b. If only the tail ends of the reads overlap, the insert size (and thus the resulting read) is longer than the original read length. The merged read will be composed of the non-overlapping portion of read 1, the consensus of the overlapping sequence, and the non-overlapping portion of read 2, in that order.
   c. If the tail ends of the reads do not overlap, the insert size is shorter than the initial read length, and the non-overlapping portion is non-genomic sequencing adapter read-through. In this case the resulting read is trimmed to the insert size, and will be 100% consensus sequence.
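
The core of ratio mode (steps 1 to 5) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not BBMerge's implementation: the ratio is assumed to take the form (B + C₀)/G, all threshold values and parameter names (`c0`, `max_ratio`, `ambiguity_margin`, `min_overlap`, `min_insert`) are hypothetical stand-ins for the constants C_i, and quality scores, the complexity term V, flat mode, and the adapter and probability checks of step 7 are left out.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def count_matches(read1, read2_rc, offset):
    """Count matches (G) and mismatches (B) when read1[i] aligns to read2_rc[i + offset]."""
    good = bad = 0
    for i, base1 in enumerate(read1):
        j = i + offset
        if 0 <= j < len(read2_rc):
            base2 = read2_rc[j]
            if base1 == "N" or base2 == "N":
                continue  # uncalled bases count as neither match nor mismatch
            if base1 == base2:
                good += 1
            else:
                bad += 1
    return good, bad


def best_overlap(read1, read2, c0=0.5, max_ratio=0.1,
                 ambiguity_margin=2.0, min_overlap=8, min_insert=10):
    """Return the implied insert size, or None if the pair should not be merged."""
    read2_rc = reverse_complement(read2)                        # step 1
    scored = []
    for offset in range(-(len(read1) - 1), len(read2_rc)):      # step 2: every offset
        good, bad = count_matches(read1, read2_rc, offset)
        if good == 0:
            continue
        scored.append(((bad + c0) / good, offset, good))        # step 3 (assumed form)
    if not scored:
        return None
    scored.sort()                                               # step 4: lowest ratios first
    r1, offset, good = scored[0]
    r2 = scored[1][0] if len(scored) > 1 else float("inf")
    if r1 > max_ratio:                                          # step 5: invalid
        return None
    if r1 * ambiguity_margin > r2:                              # step 5: ambiguous
        return None
    if good < min_overlap:                                      # step 5: overlap too short
        return None
    insert_size = len(read2_rc) - offset
    return insert_size if insert_size >= min_insert else None


# Hypothetical 30 bp fragment sequenced as 2 x 20 bp, giving a 10 bp overlap.
fragment = "ACGTTGCAAGGCTTACCGTAGATCCATGGA"
read1 = fragment[:20]
read2 = reverse_complement(fragment[10:])
print(best_overlap(read1, read2))  # prints 30
```

Adding a constant to the mismatch count means that a short, perfect overlap scores worse than a long, perfect one, which is one plausible reason for including such a term; the ambiguity test then rejects pairs in which a second offset scores nearly as well as the best.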

Fig 1.

Merging scenarios in BBMerge modes: default (A-B), REM (C-E), and RSEM (F-H). The left column (Fig 1A, C, D, F) displays scenarios resulting in successfully merged reads, while the right column (Fig 1B, E, G, H) displays scenarios resulting in discarded, unmerged pairs.

https://doi.org/10.1371/journal.pone.0185056.g001

BBMerge k-mer-based modes

BBMerge has the ability to improve merging accuracy or merge non-overlapping reads using k-mer frequency information, if the sequencing depth is sufficient (Fig 2) and the library is randomly sheared. There are two k-mer-using modes described in this paper, REM and RSEM, which stand for “Require Extension Match” and “Require Strict Extension Match”. In each case, the default BBMerge algorithm is used with an additional k-mer-based extension step. To summarize: the input read file is processed once, to build a table of k-mer counts. The file is then processed a second time to perform merging. Steps performed during the merging phase for each read pair include:

1. The standard BBMerge algorithm is used to determine the insert size S₀ based purely on overlap (Fig 1A and 1B).
2. Each read is extended by a fixed length on the tail end only, using the Tadpole assembler (https://sourceforge.net/projects/bbmap/). When not specified, as in this study, extension defaults to 50 bp. Extension will stop prematurely if a branch k-mer is encountered, or k-mer depth drops below a set threshold, so extension may not reach the full length specified by the user.
   a. A “branch k-mer” is a k-mer with more than one possible next k-mer. Branch k-mers are identified based on BBMerge’s optional Tadpole-specific parameters.
   b. If extension completely fails such that neither read is extended by at least one base, insert size S₀ is used regardless of mode and subsequent steps are skipped.
3. If extension was successful, the BBMerge algorithm is applied to the extended reads to obtain a new insert size S₁.
4. In REM mode, the alignment is accepted if S₀ = S₁ (Fig 1C). If there is no S₀ because overlap failed in step 1, S₁ will be used (Fig 1D). If S₀ and S₁ exist and S₀ ≠ S₁, the alignment is rejected (Fig 1E).
5. In RSEM mode, the alignment is accepted only if S₀ = S₁ (Fig 1F). If S₀ ≠ S₁ (Fig 1G), or if there was no initial overlap detected (Fig 1H), the alignment is rejected.
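
The acceptance rules for the two modes reduce to a small decision function. The Python sketch below only encodes the logic described in the steps above; the function name `choose_insert_size` and the convention of using `None` for "no overlap found" or "rejected" are hypothetical and do not reflect BBMerge's internal representation.

```python
def choose_insert_size(s0, s1, extended, mode):
    """Decide the final insert size for one read pair.

    s0: insert size from plain overlap, or None if no overlap was found
    s1: insert size from the extended reads, or None if no overlap was found
    extended: True if at least one read was extended by at least one base
    mode: "REM" or "RSEM"
    Returns the accepted insert size, or None if the pair stays unmerged.
    """
    if not extended:
        # Extension failed completely: fall back to the overlap-only result.
        return s0
    if mode == "REM":
        if s0 is None:
            return s1                       # Fig 1D: merge relies on extension alone
        return s0 if s0 == s1 else None     # Fig 1C (accept) or Fig 1E (reject)
    if mode == "RSEM":
        if s0 is not None and s0 == s1:
            return s0                       # Fig 1F: both results agree
        return None                         # Fig 1G or 1H: reject
    raise ValueError(f"unknown mode: {mode}")


print(choose_insert_size(None, 340, True, "REM"))   # merged via extension only
print(choose_insert_size(None, 340, True, "RSEM"))  # rejected
```

The only difference between the modes is the treatment of pairs with no initial overlap, which is why RSEM can never merge more pairs than the default mode while REM can.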

Fig 2. Relationship between % merged reads and genome coverage.

https://doi.org/10.1371/journal.pone.0185056.g002

In practice, REM mode can produce merged reads from initially non-overlapping pairs, with insert sizes greater than the sum of the read lengths. RSEM will only produce merged reads shorter than the sum of the read lengths, which form a strict subset of the merged reads produced by BBMerge run in pure overlap mode. Requiring that the overlap after extension matches the initial overlap reduces false-positive merges caused by short repeats.

Although k-mer-based modes can increase accuracy and merge rates, read processing requires more time and memory in these modes. The additional memory requirement may render k-mer modes impractical on very large datasets. Though not evaluated in this study, BBMerge also has additional k-mer-related options, “ecct” and “kfilter”. “ecct” enables k-mer-based error correction of reads that initially fail to merge; if the reads still fail to merge after correction, the changes are rolled back. This can increase the merge rate in data with many sequencing errors. “kfilter” is a setting applied after a potential overlap is found; if the merged read contains any k-mers that were not already present at a specified depth in the original file, the overlap is assumed to be wrong and will be rejected. All k-mer-using modes use the same k-mer count table, so they can be enabled concurrently without using additional memory, and with little speed impact.
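
The idea behind “kfilter” can be illustrated with a toy k-mer count table. The sketch below is a minimal Python illustration, not BBMerge's code: the function names are hypothetical, the use of canonical k-mers (the lexicographically smaller of a k-mer and its reverse complement) is an assumption, and a plain in-memory `Counter` stands in for whatever count structure BBMerge actually uses.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement (assumed convention)."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)


def count_kmers(reads, k):
    """First pass: build a table of canonical k-mer counts from all input reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts


def passes_kfilter(merged, counts, k, min_depth=1):
    """Second pass: accept a merged read only if every k-mer it contains
    was seen at least min_depth times in the original reads."""
    return all(
        counts[canonical(merged[i:i + k])] >= min_depth
        for i in range(len(merged) - k + 1)
    )


# Hypothetical toy data with k = 4.
reads = ["ACGTAC", "GTACGG"]
table = count_kmers(reads, k=4)
print(passes_kfilter("ACGTACGG", table, k=4))  # every k-mer is supported by the reads
print(passes_kfilter("ACGTACTT", table, k=4))  # contains k-mers absent from the reads
```

A wrong overlap creates a junction that does not occur in the genome, so the k-mers spanning that junction tend to be absent from the count table; this is why the filter needs sufficient sequencing depth to be reliable.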

BBMerge threading

BBMerge uses both pipelined and parallel threads to achieve a high degree of scalability. Data is streamed from and to disk during execution, so that BBMerge’s memory requirements (in default overlap mode) are unrelated to the amount of input data. Data is read by one thread per file and packaged into lists of P read pairs each (P = 200 by default). These lists are added to an ArrayBlockingQueue, a data structure that allows safe concurrent read/write access. A number of parallel worker threads is spawned (controlled by the “t” flag). Each worker fetches a list of reads from the queue; if the queue is empty, it will block until a new list is added. The worker thread will then iterate through the list and attempt to merge each of the read pairs, tracking statistics in thread-local variables, and adding merged reads to a new list. The finished list of merged reads is added to an output ArrayBlockingQueue, which is being fed by all of the worker threads. An output thread pulls lists from this output queue, and writes the reads to disk. The worker threads finish when all reads have been processed. Finally, the master thread summarizes and prints the statistics from the worker threads. As a result, the worker threads do not interfere with memory used by any other thread except when pulling lists from the input queue, or sending lists to the output queue; this means shared memory is only mutated twice per P read pairs. Furthermore, P can be set to an arbitrarily high value on the command line (with the “readbufferlength” flag), so that distributing and gathering work has minimal negative impact on scalability. Most tools in the BBMap package share this threading design.
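
The reader, worker, and writer arrangement described above can be sketched with bounded queues. The following Python analogue is only a structural illustration: BBMerge is written in Java and uses ArrayBlockingQueue, whereas this sketch uses Python's `queue.Queue`; the function name `run_pipeline`, the `None` end-of-input marker, and the toy merge function are hypothetical. Python threads also do not run CPU-bound work in parallel, so the sketch shows the data flow rather than the speed-up.

```python
import queue
import threading


def run_pipeline(pairs, merge_fn, workers=4, batch_size=200, capacity=8):
    """Stream read pairs through reader -> workers -> writer using bounded queues."""
    in_q = queue.Queue(maxsize=capacity)    # lists of read pairs awaiting merging
    out_q = queue.Queue(maxsize=capacity)   # lists of merged reads awaiting output
    merged_out = []                         # stands in for the output file
    per_worker_counts = [0] * workers       # one statistics slot per worker

    def reader():
        # Package the input into lists of batch_size pairs (P in the text).
        for start in range(0, len(pairs), batch_size):
            in_q.put(pairs[start:start + batch_size])
        for _ in range(workers):
            in_q.put(None)                  # one end-of-input marker per worker

    def worker(slot):
        merged_count = 0                    # statistics stay local to the thread
        while True:
            batch = in_q.get()              # blocks while the queue is empty
            if batch is None:
                break
            merged = [m for m in map(merge_fn, batch) if m is not None]
            merged_count += len(merged)
            out_q.put(merged)               # shared state touched once per batch
        per_worker_counts[slot] = merged_count
        out_q.put(None)

    def writer():
        finished = 0
        while finished < workers:
            batch = out_q.get()
            if batch is None:
                finished += 1
            else:
                merged_out.extend(batch)

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    threads += [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The calling thread summarizes the per-worker statistics.
    return merged_out, sum(per_worker_counts)


# Toy "merge": even-numbered pairs merge, odd-numbered pairs do not.
merged, total = run_pipeline(list(range(1000)), lambda p: p if p % 2 == 0 else None)
print(len(merged), total)
```

Because each worker touches a shared queue only once when fetching a batch and once when handing it back, raising the batch size reduces synchronization overhead at the cost of coarser load balancing.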

Deployment and use

BBMerge is written in Java, with no other dependencies. It is distributed with both the source and precompiled class files, allowing simple deployment and use on any computer supporting Java, from Windows laptops to HPC Linux-based clusters. BBMerge is designed for production use, so to simplify pipeline integration, it supports a wide variety of input and output formats: fasta or fastq; interleaved or dual-file; raw or compressed; encoded in ASCII-33 or ASCII-64, with input format autodetection. It also provides alternative processing modes such as insert-size histogram generation, adapter-sequence detection, and overlap-based error-correction (without merging), allowing its use in situations when paired reads are preferred over merged reads.

Discussion

We tested BBMerge in three modes (default, REM, RSEM) and compared its merging performance with eight other read merging programs (Table 2) using synthetically generated reads from an algal genome, and real-world shotgun metagenomic reads from a prokaryotic mock community (MBARC-26). Merging performance was evaluated based on accuracy, speed and computing efficiency.

Accuracy of paired-end read merging

BBMerge outperformed all other tools in merging accuracy across the sensitivity curve, with the lowest rate of incorrectly merged reads for any given rate of correctly merged reads, though this difference was more pronounced in the synthetic (Fig 3A) compared to the real-world data (Fig 3B). Similarly, BBMerge resulted in the highest correct merge rate (Fig 3) of all non-k-mer-using tools.

Fig 3.

Comparison of merging accuracy by program using synthetic (A) and shotgun metagenome sequences (B). Correctly merged reads are defined as % of total input pairs. Program performance at default sensitivity is indicated by a triangle.

https://doi.org/10.1371/journal.pone.0185056.g003

Results from the three discussed k-mer-utilizing modes are clearly distinguishable from those of the purely overlap-based tools and modes (Fig 3). BBMerge’s RSEM mode substantially reduced the rate of incorrectly merged reads, while slightly reducing the rate of correctly merged reads. BBMerge’s REM mode, and COPE’s M3 mode, substantially increased correct merge rates compared to the programs’ default modes by merging initially non-overlapping reads (Fig 3). BBMerge-REM achieved the highest rate of correctly merged reads in the real-world data (77.5%) followed by COPE-M3 (62.1%), and COPE-M3 achieved the highest merge rate in the synthetic data (94.4%) followed by BBMerge-REM (93.8%). Stitch yielded 69.2% incorrectly and 0.8% correctly merged reads in the synthetic data, and 49.1% incorrectly and 0.64% correctly merged reads in the real-world data (S3 Table).

Speed and scalability of paired-end read merging

Merging speeds were evaluated using the real-world metagenome reads and programs set to default sensitivity. Multi-threaded programs were allowed to use all 32 available threads. Compared to the other merging tools, BBMerge and FLASH were substantially faster, although we found that USEARCH, PEAR, BBMerge REM/RSEM, and fastq-join can all merge large datasets within reasonable timescales (Fig 4). Based on the performance on our shotgun sequence datasets, XORRO, COPE, leeHom and Stitch were projected to require >1 day to process a 500 Gbp dataset.

Fig 4. Speed comparison by program of shotgun metagenome sequences.

https://doi.org/10.1371/journal.pone.0185056.g004

BBMerge variants, PEAR, and Stitch exhibited near-perfect scaling in these tests, and are expected to continue scaling past 32 threads if run on a system with more CPU cores (Fig 5). FLASH scaled linearly to 6 threads, at which point speed plateaued. leeHom scaled to a peak at 4 threads, after which speed slightly declined. USEARCH also reached a peak at ~4 threads, but did not scale as well; 4-threaded speed was only 150% of single-threaded speed, rather than an ideal 400%. Subsequently, USEARCH’s performance declined, ending at 85% of its peak speed at the maximum of 32 threads. Single-threaded programs (fastq-join, XORRO, and COPE) are each represented by a single point.
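
The scaling figures quoted above can also be read as parallel efficiency, that is, the achieved speed-up divided by the number of threads. The short Python sketch below applies this standard definition to the USEARCH numbers given in the text; the function name is hypothetical.

```python
def parallel_efficiency(speedup, threads):
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    return speedup / threads


# USEARCH at 4 threads ran at 150% of single-threaded speed, i.e. a 1.5x speed-up.
print(parallel_efficiency(1.5, 4))  # 0.375, compared with 1.0 for ideal 400% speed
```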

Fig 5. Scalability of each program, determined by measuring speed using various numbers of threads.

https://doi.org/10.1371/journal.pone.0185056.g005

Assembly quality following read merging

Assembly quality was evaluated with QUAST; we report here assembly continuity (NA50), genome completeness, misassemblies, and indels as defined in [13] (Table 3, S4 Table). Gurevich et al. [13] defined NA50 as the length at which the collection of all reference-aligned contigs of that length or longer contains at least half of the assembled bases. Merged reads were generally characterized by substantially improved assembly continuity compared to the raw data (Table 3, Fig 6A), with BBMerge-REM reaching a nearly two-fold increase in NA50 (119 kbp compared to 60 kbp). BBMerge-RSEM, BBMerge, USEARCH, and leeHom resulted in similar NA50 metrics (101–104 kbp). The NA50 achieved with the remaining programs ranged from 61 kbp (PEAR) to 98 kbp (COPE-M3), aside from Stitch at 5.6 kbp. The raw data resulted in a total misassembly count of 119. Only BBMerge-RSEM and BBMerge-REM reduced this count, to 115 and 117, respectively (Table 3, Fig 6B). The studied merge tools fell into 3 misassembly-count clusters: BBMerge variants and USEARCH ranged from 115 to 131; XORRO, fastq-join, COPE-M3, FLASH, leeHom, and COPE ranged from 158 to 294; and PEAR and Stitch resulted in 660 and 20,986 misassemblies, respectively.
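
The NA50 definition quoted above follows the same computation as N50, applied to reference-aligned pieces rather than raw contigs. The Python sketch below shows that computation on a list of lengths; it is an illustration only, not QUAST's code, the function name is hypothetical, and it assumes the lengths passed in are already those of the aligned blocks.

```python
def n50_of_lengths(lengths):
    """Return the length L such that pieces of length >= L hold at least half
    of the total bases; returns 0 for an empty list."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Hypothetical aligned block lengths (in kbp) totalling 100 kbp.
print(n50_of_lengths([10, 20, 30, 40]))  # 40 + 30 covers at least half, so 30
```

Because misassembled contigs are split at their breakpoints before this calculation, a tool that produces long but chimeric contigs is penalized in NA50 even if its plain N50 looks good.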

Fig 6. NA50 length and misassembly rates for a SPAdes assembly of each program’s output at default settings.

https://doi.org/10.1371/journal.pone.0185056.g006

Table 3. Assembly metrics reported by QUAST for SPAdes metagenomic assemblies.

https://doi.org/10.1371/journal.pone.0185056.t003

Indel rates are noted because they can induce frameshifts, which disrupt gene annotation. BBMerge variants and USEARCH clustered together closely, with rates ranging from 0.81 (BBMerge-REM) to 0.88 (USEARCH) indels per 100 kbp (Table 3). The other tools yielded rates ranging from 1.08 (XORRO) to 1.52 (COPE), except for Stitch (47.78 per 100 kbp). The raw data yielded 1.13 indels per 100 kbp. The fraction of reference bases covered by assemblies exhibited a narrow range from 83.9% (COPE-M3) to 85.2% (FLASH), aside from Stitch at 68.4% (Table 3). All tools except PEAR, COPE-M3, and Stitch exceeded the 84.5% genome coverage of the raw read assembly. BBMerge-REM outperformed BBMerge in every assembly metric, but COPE-M3’s performance relative to COPE was more nuanced: COPE-M3 had a greater NA50 and fewer misassemblies and indels, but a 1.2% lower genome recovery than COPE.

Conclusion

Correctly merged shotgun reads can improve the performance of applications that benefit from longer reads, yet erroneously merged reads can create serious issues due to the introduction of new errors, a concern that is not present for other common preprocessing steps such as quality-trimming. Even at a low rate, the addition of incorrectly merged reads can cause misassemblies and reduced assembly contiguity compared to unmerged or correctly merged data (Fig 6). It is this possibility of introducing new errors that renders merging especially sensitive to accuracy.

Since BBMerge has been developed primarily as a tool to aid in clustering and de novo assembly of shotgun metagenome sequence data, minimizing the false-positive merge rate has been considered paramount. Our data indicate that BBMerge successfully minimized the false-positive rate when merging shotgun reads from synthetic and real-world datasets, and was able to improve assembly quality by increasing continuity while reducing the number of misassemblies. Its ability to achieve maximal accuracy while scaling near-linearly to reach the highest speed of the compared software makes BBMerge a promising tool for improving the assembly of large datasets such as shotgun metagenomes.

Supporting information

S1 Table. Program command lines.

Non-default parameters are stated in bold letters.

https://doi.org/10.1371/journal.pone.0185056.s001

(DOC)

S2 Table. Program sensitivity parameters.

Default settings are stated in bold letters.

https://doi.org/10.1371/journal.pone.0185056.s002

(DOCX)

S3 Table.

Number of correctly and incorrectly merged read pairs, and Signal-Noise Ratio (SNR), from the synthetic (A) and real-world (B) shotgun datasets by program and sensitivity. All numbers are out of 20,000,000 input read pairs. Defaults are in bold.

https://doi.org/10.1371/journal.pone.0185056.s003

(DOCX)

S4 Table. Assembly report by program.

https://doi.org/10.1371/journal.pone.0185056.s004

(DOC)

Acknowledgments

We thank Bill Andreopoulos, Alex Copeland, Robert Egan, Bryce Foster, Douglas Jacobsen, Elmar Pruesse, Adam Rivers, Axel Visel, Zhong Wang, and Tanja Woyke for valuable comments and suggestions. This work was conducted by the U.S. Department of Energy Joint Genome Institute, a DOE Office of Science User Facility, and is supported under Contract No. DE-AC02-05CH11231.

References

1. Berka J, Chen Z, Egholm M, Godwin BC. Paired end sequencing. US Patent Office; 2009.
2. Singer E, Andreopoulos B, Bowers RM, Lee J, Deshpande S, Chiniquy J, et al. Next generation sequencing data of a defined microbial mock community. Scientific Data. 2016;3: 160081. pmid:27673566
3. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. Initial sequencing and analysis of the human genome. Nature. Nature Publishing Group; 2001;409: 860–921. pmid:11237011
4. Singer E, Andreopoulos B, Bowers RM, Lee J, Deshpande S, Chiniquy J, et al. Next generation sequencing data of a defined microbial mock community. Scientific Data.
5. Ng P, Wei C-L, Sung W-K, Chiu KP, Lipovich L, Ang CC, et al. Gene identification signature (GIS) analysis for transcriptome characterization and genome annotation. Nat Meth. 2005;2: 105–111. pmid:15782207
6. Shendure J, Porreca GJ, Reppas NB, Lin X. Accurate multiplex polony sequencing of an evolved bacterial genome. Science. 2005;309: 1728–1732. pmid:16081699
7. Dunn JJ, McCorkle SR, Everett L, Anderson CW. Paired-end genomic signature tags: a method for the functional analysis of genomes and epigenomes. Genet Eng (NY). 2007;28: 159–173.
8. Korbel JO, Urban AE, Affourtit JP, Godwin B, Grubert F. Paired-end mapping reveals extensive structural variation in the human genome. Science. 2007;318: 420–426. pmid:17901297
9. Chen J, Kim YC, Jung YC, Xuan Z, Dworkin G, Zhang Y, et al. Scanning the human genome at kilobase resolution. Genome Research. 2008;18: 751–762. pmid:18292219
10. Holt RA, Jones SJM. The new paradigm of flow cell sequencing. Genome Research. 2008;18: 839–846. pmid:18519653
11. Markowitz VM, Ivanova NN, Szeto E, Palaniappan K, Chu K, Dalevi D, et al. IMG/M: a data management and analysis system for metagenomes. Nucleic Acids Research. 2007;36: D534–D538. pmid:17932063
12. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology. 2012;19: 455–477. pmid:22506599
13. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29: 1072–1075. pmid:23422339
14. Magoc T, Salzberg SL. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics. 2011;27: 2957–2963. pmid:21903629
15. Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010;26: 2460–2461. pmid:20709691
16. Dickson RJ, Gloor GB. XORRO: Rapid Paired-End Read Overlapper. arXiv. 2013;1304.4620.
17. Zhang J, Kobert K, Flouri T, Stamatakis A. PEAR: a fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics. 2014;30: 614–620. pmid:24142950
18. Liu B, Yuan J, Yiu SM, Li Z, Xie Y, Chen Y, et al. COPE: an accurate k-mer-based pair-end reads connection tool to facilitate genome assembly. Bioinformatics. 2012;28: 2870–2874. pmid:23044551
19. Aronesty E. Comparison of sequencing utility programs. The Open Bioinformatics Journal. 2013.
20. Renaud G, Stenzel U, Kelso J. leeHom: adaptor trimming and merging for Illumina sequencing reads. Nucleic Acids Research. 2014;42: e141–e141. pmid:25100869
21. Brown CT, Davis-Richardson AG, Giongo A, Gano KA, Crabb DB, Mukherjee N, et al. Gut Microbiome Metagenomics Analysis Suggests a Functional Model for the Development of Autoimmunity for Type 1 Diabetes. Roop RM, editor. PLoS ONE. 2011;6: e25792. pmid:22043294
22. Masella AP, Bartram AK, Truszkowski JM, Brown DG, Neufeld JD. PANDAseq: PAired-eND Assembler for Illumina sequences. BMC Bioinformatics. BioMed Central Ltd; 2012;13: 1–7.
23. Silver DH, Ben-Elazar S, Bogoslavsky A, Yanai I. ELOPER: elongation of paired-end reads as a pre-processing tool for improved de novo genome assembly. Bioinformatics. 2013;29: 1455–1457. pmid:23603334
