Optimal Haplotype Block-Free Selection of Tagging SNPs for Genome-Wide Association Studies

  Bjarni V. Halldórsson (1),
  Vineet Bafna (1,4),
  Ross Lippert (1,5),
  Russell Schwartz (1,6),
  Francisco M. De La Vega (2),
  Andrew G. Clark (3), and
  Sorin Istrail (1,7)

  (1) Celera/Applied Biosystems, Rockville, Maryland 20850, USA
  (2) Applied Biosystems, Foster City, California 94404, USA
  (3) Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York 14853, USA

Abstract

It is widely hoped that the study of sequence variation in the human genome will provide a means of elucidating the genetic component of complex diseases and variable drug responses. A major stumbling block to the successful design and execution of genome-wide disease association studies using single-nucleotide polymorphisms (SNPs) and linkage disequilibrium is the enormous number of SNPs in the human genome. This results in unacceptably high costs for exhaustive genotyping and presents a challenging problem of statistical inference. Here, we present a new method for optimally selecting minimum informative subsets of SNPs, also known as “tagging” SNPs, that is efficient for genome-wide selection. We contrast this method to published methods including haplotype block tagging, that is, grouping SNPs into segments of low haplotype diversity and typing a subset of the SNPs that can discriminate all common haplotypes within the blocks. Because our method does not rely on a predefined haplotype block structure and makes use of the weaker correlations that occur across neighboring blocks, it can be effectively applied across chromosomal regions with both high and low local linkage disequilibrium. We show that the number of tagging SNPs selected is substantially smaller than previously reported using block-based approaches and that selecting tagging SNPs optimally can result in a two- to threefold savings over selecting random SNPs.
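
To make the idea of block-free tag selection concrete, the following is a minimal illustrative sketch, not the optimal algorithm presented in the article. It assumes phased biallelic haplotypes encoded as strings of "0" and "1", uses pairwise r² as the measure of how well one SNP predicts another, and swaps in a simple greedy set-cover heuristic. The function names, the `window` neighborhood size, and the `threshold` value are all hypothetical choices made for illustration.

```python
def r_squared(a, b):
    """Squared correlation (r^2) between two biallelic SNP columns of 0/1 alleles."""
    n = len(a)
    pa = sum(a) / n
    pb = sum(b) / n
    # A monomorphic SNP carries no information about any other SNP.
    if pa in (0.0, 1.0) or pb in (0.0, 1.0):
        return 0.0
    pab = sum(x & y for x, y in zip(a, b)) / n
    d = pab - pa * pb  # linkage disequilibrium coefficient D
    return d * d / (pa * (1 - pa) * pb * (1 - pb))


def greedy_tags(haplotypes, window=5, threshold=0.8):
    """Greedy (not optimal) tag selection without predefined haplotype blocks.

    Each SNP may cover any SNP within `window` positions on either side,
    provided their r^2 is at least `threshold`. Both parameters are
    illustrative assumptions, not values taken from the article.
    """
    n_snps = len(haplotypes[0])
    cols = [[int(h[j]) for h in haplotypes] for j in range(n_snps)]

    # covers[i] = set of SNPs that SNP i could stand in for (always includes i).
    covers = []
    for i in range(n_snps):
        lo, hi = max(0, i - window), min(n_snps, i + window + 1)
        covers.append({
            j for j in range(lo, hi)
            if j == i or r_squared(cols[i], cols[j]) >= threshold
        })

    uncovered = set(range(n_snps))
    tags = []
    while uncovered:
        # Pick the SNP covering the most uncovered SNPs; break ties by lowest index.
        best = max(range(n_snps), key=lambda i: (len(covers[i] & uncovered), -i))
        tags.append(best)
        uncovered -= covers[best]
    return sorted(tags)


if __name__ == "__main__":
    # Toy data: SNPs 0, 1, and 2 are perfectly correlated; SNP 3 is independent.
    haps = ["0010", "0011", "1100", "1101"]
    print(greedy_tags(haps))
```

Because the neighborhood is a sliding window rather than a fixed partition, correlations that span what a block-based method would treat as a boundary can still be exploited. A greedy cover of this kind carries no optimality guarantee, so it may select more tags than an exact method would.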

Footnotes

  • [Supplemental material is available online at www.genome.org.]

  • Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2570004.

  • 4 Present address: University of California, San Diego, Computer Science & Engineering, La Jolla, CA 92093, USA

  • 5 Present address: Department of Mathematics, MIT, Cambridge, MA 02139-4307, USA

  • 6 Present address: Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA

  • 7 Corresponding author. E-mail: Sorin.Istrail@celera.com; fax: (240) 453-3324.

    • Accepted June 8, 2004.
    • Received March 13, 2004.