Short read fragment assembly of bacterial genomes

Mark J. Chaisson (1) and Pavel A. Pevzner (2,3)

  1. Bioinformatics Program, University of California San Diego, La Jolla, California 92093, USA
  2. Department of Computer Science and Engineering, University of California San Diego, La Jolla, California 92093, USA

Abstract

In the last year, high-throughput sequencing technologies have progressed from proof-of-concept to production quality. While these methods produce high-quality reads, they have yet to produce reads comparable in length to Sanger-based sequencing. Current fragment assembly algorithms have been implemented and optimized for mate-paired Sanger-based reads, and thus do not perform well on short reads produced by short read technologies. We present a new Eulerian assembler that generates nearly optimal short read assemblies of bacterial genomes and describe an approach to assemble reads in the case of the popular hybrid protocol when short and long Sanger-based reads are combined.

Footnotes

  • 3 Corresponding author. E-mail ppevzner{at}cs.ucsd.edu; fax (858) 534-7029.

  • Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.7088808

  • 4 Most reads in this study are shorter than 120 bases.

  • 5 The paper by Pevzner et al. (1989) illustrates some potential advantages of the Eulerian path approach over the “overlap-layout-consensus” approach for fragment assembly. For example, while the study by Pevzner et al. (1989) described a simple algorithm for constructing the SBH repeat graph, it is not immediately clear how to generalize the approaches in the studies by Huang et al. (2003) and Jaffe et al. (2003) for efficient construction of the repeat graph even in the simple case of SBH “reads.”

  • 6 E. coli is 4,639,675 bp long, S. pneumoniae is 2,160,837 bp long, and BAC is 173,427 bp long (chromosome 6, bases 30537344–30710771).

  • 7 As described above, EULER-SR does not try to estimate the multiplicities of tandem repeats and misses three copies of tandem repeats in E. coli (approximately 1600, 1000, and 300 nucleotides) and one copy of a tandem repeat in S. pneumoniae (∼500 nt).

  • 8 The surprising deterioration of N50 statistics for 10-kb spacing (as compared with 2.5-kb spacing) reflects ambiguities in mapping longer paths between mate-pairs in highly tangled de Bruijn graphs.

  • 9 123, 72, 31, 9, and 3 mate-pairs map to 3, 4, 5, 6, and more than 6 edges, respectively.

  • 10 Also, in cases where coverage across a tandem repeat is low, the repeat may be collapsed into a single copy.

  • Received August 28, 2007.
  • Accepted October 16, 2007.