ARACHNE: A Whole-Genome Shotgun Assembler

  1. Serafim Batzoglou (1,2,3),
  2. David B. Jaffe (2,3,4),
  3. Ken Stanley (2),
  4. Jonathan Butler (2),
  5. Sante Gnerre (2),
  6. Evan Mauceli (2),
  7. Bonnie Berger (1,5),
  8. Jill P. Mesirov (2), and
  9. Eric S. Lander (2,6,7)

  (1) Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
  (2) Whitehead Institute/MIT Center for Genome Research, Cambridge, Massachusetts 02141, USA
  (4) Department of Mathematics and Statistics, University of Nebraska, Lincoln, Nebraska 68588, USA
  (5) Department of Mathematics, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
  (6) Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA

Abstract

We describe a new computer system, called ARACHNE, for assembling genome sequence using paired-end whole-genome shotgun reads. ARACHNE has several key features, including an efficient and sensitive procedure for finding read overlaps, a procedure for scoring overlaps that achieves high accuracy by correcting errors before assembly, read merger based on forward-reverse links, and detection of repeat contigs by forward-reverse link inconsistency. To test ARACHNE, we created simulated reads providing ∼10-fold coverage of the genomes of H. influenzae, S. cerevisiae, and D. melanogaster, as well as human chromosomes 21 and 22. The assemblies of these simulated reads yielded nearly complete coverage of the respective genomes, with a small number of contigs joined into a smaller number of supercontigs (or scaffolds). For example, analysis of the D. melanogaster genome yielded ∼98% coverage with an N50 contig length of 324 kb and an N50 supercontig length of 5143 kb. The assembly accuracy was high, although not perfect: small errors occurred at a frequency of roughly 1 per 1 Mb (typically, deletion of ∼1 kb in size), with a very small number of other misassemblies. The assembly was rapid: the Drosophila assembly required only 21 hours on a single 667 MHz processor and used 8.4 Gb of memory.
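
The N50 figures quoted above summarize how contiguous an assembly is. As a minimal sketch, assuming the common definition (the largest length L such that contigs of length at least L together contain at least half of the assembled bases), N50 could be computed as below. The abstract does not state which N50 variant ARACHNE reports, and the function name `n50` and the toy contig lengths are hypothetical, chosen only for illustration; they are not data from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or supercontig) lengths.

    Assumed definition: sort lengths from longest to shortest and return
    the length at which the running total first reaches half of the sum.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare doubled running sum to avoid fractional halves.
        if 2 * running >= total:
            return length


# Toy contig lengths in kb (made up for illustration, not from the paper).
toy_contigs_kb = [400, 324, 300, 150, 100, 50]
print(n50(toy_contigs_kb))
```

Some authors instead measure the halfway point against the known genome size rather than the assembled length; under that variant, `total` would be replaced by the genome size.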

Footnotes

  • 3 These authors contributed equally to this work.

  • 7 Corresponding author.

  • E-MAIL lander@wi.mit.edu; FAX (617) 252-1933.

  • Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.208902.
