mGene: Accurate SVM-based gene finding with an application to nematode genomes

  Gabriele Schweikert [1,2,3], Alexander Zien [1,4,5], Georg Zeller [1,3,5], Jonas Behr [1], Christoph Dieterich [3,6], Cheng Soon Ong [1,2,7], Petra Philips [1], Fabio De Bona [1], Lisa Hartmann [1], Anja Bohlen [1], Nina Krüger [1], Sören Sonnenburg [4,1], and Gunnar Rätsch [1,8]

  [1] Friedrich Miescher Laboratory, Max Planck Society, Tübingen 72076, Germany
  [2] Max Planck Institute for Biological Cybernetics, Tübingen 72076, Germany
  [3] Max Planck Institute for Developmental Biology, Tübingen 72076, Germany
  [4] Fraunhofer Institute FIRST.IDA, Berlin 12489, Germany
  [5] These authors contributed equally to this work.
  [6] Present address: Max Delbrück Center for Molecular Medicine, Berlin 13125, Germany
  [7] Present address: Department of Computer Science, ETH, Zürich 8092, Switzerland

    Abstract

    We present a highly accurate gene-prediction system for eukaryotic genomes, called mGene. It combines in an unprecedented manner the flexibility of generalized hidden Markov models (gHMMs) with the predictive power of modern machine learning methods, such as Support Vector Machines (SVMs). Its excellent performance was proved in an objective competition based on the genome of the nematode Caenorhabditis elegans. Considering the average of sensitivity and specificity, the developmental version of mGene exhibited the best prediction performance on nucleotide, exon, and transcript level for ab initio and multiple-genome gene-prediction tasks. The fully developed version shows superior performance in 10 out of 12 evaluation criteria compared with the other participating gene finders, including Fgenesh++ and Augustus. An in-depth analysis of mGene's genome-wide predictions revealed that ≈2200 predicted genes were not contained in the current genome annotation. Testing a subset of 57 of these genes by RT-PCR and sequencing, we confirmed expression for 24 (42%) of them. mGene missed 300 annotated genes, out of which 205 were unconfirmed. RT-PCR testing of 24 of these genes resulted in a success rate of merely 8%. These findings suggest that even the gene catalog of a well-studied organism such as C. elegans can be substantially improved by mGene's predictions. We also provide gene predictions for the four nematodes C. briggsae, C. brenneri, C. japonica, and C. remanei. Comparing the resulting proteomes among these organisms and to the known protein universe, we identified many species-specific gene inventions. In a quality assessment of several available annotations for these genomes, we find that mGene's predictions are most accurate.
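The ranking criterion named in the abstract, the average of sensitivity and specificity, can be illustrated with a short sketch. The abstract does not define these terms, so the code assumes the convention common in gene-finding evaluations: sensitivity is the fraction of annotated features that were predicted, and specificity is the fraction of predicted features that are correct (called precision elsewhere). The function names and the counts in the usage example are hypothetical and are not taken from the paper.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of annotated features (nucleotides, exons, transcripts) recovered."""
    total = tp + fn
    return tp / total if total else 0.0


def specificity(tp: int, fp: int) -> float:
    """Fraction of predicted features that are correct.

    Assumption: this follows the gene-finding convention, in which
    "specificity" means TP / (TP + FP), i.e. precision.
    """
    total = tp + fp
    return tp / total if total else 0.0


def mean_sn_sp(tp: int, fp: int, fn: int) -> float:
    """Average of sensitivity and specificity for one evaluation level."""
    return (sensitivity(tp, fn) + specificity(tp, fp)) / 2


if __name__ == "__main__":
    # Hypothetical exon-level counts, for illustration only.
    tp, fp, fn = 90, 10, 30
    print(f"Sn = {sensitivity(tp, fn):.3f}")
    print(f"Sp = {specificity(tp, fp):.3f}")
    print(f"(Sn + Sp) / 2 = {mean_sn_sp(tp, fp, fn):.3f}")
```

The same three functions apply unchanged at the nucleotide, exon, and transcript level; only the definition of a matching feature differs between levels.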

    Footnotes

    • 8 Corresponding author.

      E-mail: Gunnar.Raetsch{at}tuebingen.mpg.de; fax: 49-7071-601-801.

    • 9 More generally, generative methods model the joint probability Pr(Y, X) of hidden states Y and observations X, whereas discriminative techniques directly model the conditional probability Pr(Y | X) of hidden states given observations (Ng and Jordan 2002).

    • 10 We investigated this for Augustus by evaluating gene predictions prepared by the maintainer (downloaded from http://augustus.gobics.de/predictions/caenorhabditis/abinitio/). We found the ab initio prediction performance to be unchanged since the nGASP competition (data not shown).

    • [Supplemental material is available online at http://www.genome.org. mGene is available as source code under the GNU General Public License from the project website http://mgene.org and as a Galaxy-based web server at http://mgene.org/web. Moreover, the gene predictions have been included in the WormBase annotation available at http://wormbase.org and on the project website.]

    • Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.090597.108.

      • Received December 18, 2008.
      • Accepted June 23, 2009.
    OPEN ACCESS ARTICLE
