Calling amplified haplotypes in next generation tumor sequence data

  1. Itsik Pe'er2,6
  1. Department of Biomedical Informatics, Columbia University, New York, New York 10032, USA;
  2. Department of Computer Science, Columbia University, New York, New York 10027, USA;
  3. Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, Massachusetts 02115, USA;
  4. Medical and Population Genetics Program, The Broad Institute of Harvard and MIT, Cambridge, Massachusetts 02142, USA;
  5. Department of Genetics, Case Western Reserve University School of Medicine, Cleveland, Ohio 44106, USA

    Abstract

    During tumor initiation and progression, cancer cells acquire a selective advantage, allowing them to outcompete their normal counterparts. Identification of the genetic changes that underlie these tumor-acquired traits can provide deeper insights into the biology of tumorigenesis. Regions of copy number alteration and germline DNA variants are among the elements subject to selection during tumor evolution. Integrated examination of inherited variation and somatic alterations holds the potential to reveal specific nucleotide alleles that a tumor “prefers” to have amplified. Next-generation sequencing of tumor and matched normal tissues provides a high-resolution platform to identify and analyze such somatic amplicons. Within an amplicon, an informative (e.g., heterozygous) site whose read counts deviate from a 1:1 allelic ratio may suggest selection of the overrepresented allele. A naive approach examines the reads for each heterozygous site in isolation; however, this ignores the valuable linkage information available across sites. We therefore present a novel hidden Markov model-based method, Haplotype Amplification in Tumor Sequences (HATS), that analyzes tumor and normal sequence data, along with training data for phasing purposes, to infer amplified alleles and haplotypes in regions of copy number gain. Our method is designed to handle rare variants and biases in read data. We assess the performance of HATS first on simulated amplified regions generated at varying copy number and coverage levels, and then on amplicons in real data. We demonstrate that HATS infers the amplified alleles more accurately than does the naive approach, especially at low to intermediate coverage levels and in cases (including high coverage) with stromal contamination or allelic bias.
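    The naive per-site baseline mentioned above can be illustrated with a minimal sketch. This is not HATS, and the abstract does not say which statistical test the naive approach uses; the exact two-sided binomial test against a 1:1 ratio, the function names, and the `alpha` threshold below are all assumptions made for illustration.

```python
from math import comb


def binomial_two_sided_p(k, n):
    """Exact two-sided p-value for k successes in n trials under p = 0.5.

    Because the null distribution is symmetric, the two-sided p-value is
    twice the upper tail beyond the larger of the two allele counts.
    """
    if n == 0:
        return 1.0
    hi = max(k, n - k)
    tail = sum(comb(n, i) for i in range(hi, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def naive_call(ref_reads, alt_reads, alpha=0.05):
    """Call the amplified allele at ONE heterozygous site, in isolation.

    Hypothetical helper: returns ("ref" | "alt" | None, p_value). None means
    the read counts are compatible with a 1:1 allelic ratio at level alpha.
    """
    n = ref_reads + alt_reads
    p = binomial_two_sided_p(ref_reads, n)
    if p >= alpha or ref_reads == alt_reads:
        return None, p
    return ("ref" if ref_reads > alt_reads else "alt"), p


if __name__ == "__main__":
    # Illustrative (made-up) read counts at three heterozygous sites.
    for ref, alt in [(30, 10), (6, 4), (2, 18)]:
        call, p = naive_call(ref, alt)
        print(f"ref={ref} alt={alt} call={call} p={p:.4f}")
```

    With only ten reads split 6:4, the site cannot be called at all, which is the low-coverage regime where a per-site test has little power and where linkage information from neighboring sites becomes useful.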

    Footnotes

    • Received February 22, 2011.
    • Accepted November 1, 2011.