Genome-wide nucleotide-level mammalian ancestor reconstruction

  Benedict Paten (1,4), Javier Herrero (2), Stephen Fitzgerald (2), Kathryn Beal (2), Paul Flicek (2), Ian Holmes (3), and Ewan Birney (2,4)

  1 Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, California 95064, USA;
  2 EMBL European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom;
  3 Department of Bioengineering, University of California Berkeley, Berkeley, California 94720, USA

Abstract

Recently, attention has turned to the problem of reconstructing complete ancestral sequences from large multiple alignments. Successful generation of these genome-wide reconstructions will facilitate a greater knowledge of the events that have driven evolution. We present a new evolutionary alignment modeler, called “Ortheus,” for inferring the evolutionary history of a multiple alignment, in terms of both substitutions and, importantly, insertions and deletions. Based on a multiple sequence probabilistic transducer model of the type proposed by Holmes, Ortheus uses efficient stochastic graph-based dynamic programming methods. Unlike other methods, Ortheus does not rely on a single fixed alignment from which to work. Ortheus is also more scalable than previous methods while being fast, stable, and open source. Large-scale simulations show that Ortheus performs close to optimally on a deep mammalian phylogeny. Simulations also indicate that significant proportions of errors due to insertions and deletions can be avoided by not assuming a fixed alignment. We additionally use a challenging hold-out cross-validation procedure to test the method; using the reconstructions to predict extant sequence bases, we demonstrate significant improvements over using closest extant neighbor sequences. Accompanying this paper, a new, public, and genome-wide set of Ortheus ancestor alignments provides an intriguing new resource for evolutionary studies in mammals. As a first piece of analysis, we attempt to recover “fossilized” ancestral pseudogenes. We confidently find 31 cases in which the ancestral sequence had a more complete sequence than any of the extant sequences.

Footnotes

  • 4 Corresponding authors.

    4 E-mail benedict{at}soe.ucsc.edu; fax (831) 459-1809.

    4 E-mail birney{at}ebi.ac.uk; fax +44-(0)1223-494-468.

  • [Supplemental material is available online at www.genome.org. The source code for Ortheus is freely available at http://www.ebi.ac.uk/~bjp/ortheus/, and the genome-wide alignments are freely available from Ensembl (http://www.ensembl.org/).]

  • Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.076521.108.

    • Received January 23, 2008.
    • Accepted September 9, 2008.
  • Freely available online through the Genome Research Open Access option.
