ProbCons: Probabilistic consistency-based multiple sequence alignment

  1. Chuong B. Do¹,
  2. Mahathi S.P. Mahabhashyam¹,
  3. Michael Brudno¹, and
  4. Serafim Batzoglou¹,²

  ¹ Department of Computer Science, Stanford University, Stanford, California 94305, USA

Abstract

To study gene evolution across a wide range of organisms, biologists need accurate tools for multiple sequence alignment of protein families. Obtaining accurate alignments, however, is a difficult computational problem because of not only the high computational cost but also the lack of proper objective functions for measuring alignment quality. In this paper, we introduce probabilistic consistency, a novel scoring function for multiple sequence comparisons. We present ProbCons, a practical tool for progressive protein multiple sequence alignment based on probabilistic consistency, and evaluate its performance on several standard alignment benchmark data sets. On the BAliBASE, SABmark, and PREFAB benchmark alignment databases, ProbCons achieves statistically significant improvement over other leading methods while maintaining practical speed. ProbCons is publicly available as a Web resource.

Footnotes

  • [Supplemental material is available online at www.genome.org. Source code and executables are available as public domain software at http://probcons.stanford.edu.]

  • Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2821705.

  • 3 Previous results on the BAliBASE 2.01 benchmark alignments database reported in an abstract (Do et al. 2004), which correspond to the ProbCons-ext program, differ slightly from those shown in the text. These small differences are attributable to (1) a change in the methods used for extracting BAliBASE core blocks as suggested by Robert C. Edgar (pers. comm.), and (2) minor changes in the HMM model and training procedure for the current version of ProbCons.

  • 4 The results for the nw-ns-i script from MAFFT on the PREFAB database given in Edgar (2004) contain an editing error (R.C. Edgar, pers. comm.); the values shown here are correct. Interestingly, although MAFFT achieves a slightly higher overall average SP score than MUSCLE, a Friedman rank test indicates that MUSCLE consistently produces better alignments than MAFFT (see Table 4).

  • 5 The numbers reported for the Align-m aligner are similar to those given in Edgar (2004), but differ from the results reported in Van Walle et al. (2004). The primary reason for this difference is that the averages in the latter study were computed across all SABmark pairwise alignments; this fails to account for dependencies within each subset, so the weight of each subset scales quadratically with the number of sequences present. We avoid this by averaging pairwise alignment scores within each subset before averaging all subset scores.

  • 6 While a ROC analysis would better characterize aligner performance, properly defining sensitivity and specificity measures for alignment accuracy involves subtle issues regarding the alignability of particular positions in sequences. Furthermore, the appropriate manner for adjusting program parameters so as to observe the sensitivity/specificity trade-off for the expected accuracy alignment algorithm is also an open problem. We leave these questions for future work.

  • 2 Corresponding author. E-mail serafim{at}cs.stanford.edu; fax (650) 725-1449.

    • Accepted November 29, 2004.
    • Received May 24, 2004.
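
The averaging issue described in footnote 5 can be illustrated with a minimal sketch. The subset names and scores below are hypothetical and are not taken from SABmark; the function names are likewise illustrative. The sketch only contrasts pooling every pairwise score against averaging within each subset first, which is the scheme the footnote says was used here.

```python
from statistics import mean


def pooled_average(subsets):
    """Average over every pairwise score, ignoring subset boundaries.

    A subset with n sequences contributes n*(n-1)/2 scores, so its
    weight in the result grows quadratically with n.
    """
    all_scores = [score for scores in subsets.values() for score in scores]
    return mean(all_scores)


def per_subset_average(subsets):
    """Average within each subset first, then average the subset means.

    Every subset carries equal weight regardless of its size.
    """
    return mean(mean(scores) for scores in subsets.values())


# Hypothetical pairwise scores (assumed values, for illustration only).
example = {
    "subset_a": [0.9] * 45,  # 10 sequences -> 45 pairwise alignments
    "subset_b": [0.3] * 3,   # 3 sequences -> 3 pairwise alignments
}

print("pooled:    ", round(pooled_average(example), 4))
print("per-subset:", round(per_subset_average(example), 4))
```

In this toy input the larger subset dominates the pooled figure, while the per-subset figure treats both subsets equally.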