Calling amplified haplotypes in next generation tumor sequence data

Ninad Dewal; Yang Hu; Matthew L. Freedman; Thomas LaFramboise; Itsik Pe'er

doi:10.1101/gr.122564.111


¹Department of Biomedical Informatics, Columbia University, New York, New York 10032, USA;
²Department of Computer Science, Columbia University, New York, New York 10027, USA;
³Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, Massachusetts 02115, USA;
⁴Medical and Population Genetics Program, The Broad Institute of Harvard and MIT, Cambridge, Massachusetts 02142, USA;
⁵Department of Genetics, Case Western Reserve University School of Medicine, Cleveland, Ohio 44106, USA

Abstract

During tumor initiation and progression, cancer cells acquire a selective advantage, allowing them to outcompete their normal counterparts. Identification of the genetic changes that underlie these tumor-acquired traits can provide deeper insights into the biology of tumorigenesis. Regions of copy number alteration and germline DNA variants are among the elements subject to selection during tumor evolution. Integrated examination of inherited variation and somatic alterations holds the potential to reveal specific nucleotide alleles that a tumor “prefers” to have amplified. Next-generation sequencing of tumor and matched normal tissues provides a high-resolution platform to identify and analyze such somatic amplicons. Within an amplicon, an informative (e.g., heterozygous) site whose allelic read ratio deviates from 1:1 may suggest selection of the overrepresented allele. A naive approach examines the reads for each heterozygous site in isolation; however, this ignores the valuable linkage information available across sites. We therefore present a novel hidden Markov model-based method, Haplotype Amplification in Tumor Sequences (HATS), that analyzes tumor and normal sequence data, along with training data for phasing purposes, to infer amplified alleles and haplotypes in regions of copy number gain. Our method is designed to handle rare variants and biases in read data. We assess the performance of HATS first on simulated amplified regions generated at varying copy number and coverage levels, and then on amplicons in real data. We demonstrate that HATS infers the amplified alleles more accurately than does the naive approach, especially at low to intermediate coverage levels and in cases (including high coverage) with stromal contamination or allelic bias.
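To make the per-site baseline concrete, the following is a minimal sketch of one plausible reading of the naive approach: each heterozygous site is tested in isolation for deviation from a 1:1 allelic read ratio with an exact two-sided binomial test, and the majority allele is called as amplified when the deviation is significant. The abstract does not specify the test used, so the function names, the `alpha` threshold, and the choice of a binomial test are illustrative assumptions, not the authors' implementation. HATS itself, which links sites through a hidden Markov model, is not reproduced here.

```python
from math import comb


def binomial_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely
    than the observed count k out of n trials.
    """
    if n == 0:
        return 1.0  # no reads, no evidence of imbalance
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-9)))


def naive_amplified_allele(ref_reads: int, alt_reads: int, alpha: float = 0.05):
    """Call the amplified allele at ONE heterozygous site, ignoring its neighbors.

    Returns (call, p_value), where call is "ref", "alt", or None when the
    read counts are consistent with a 1:1 ratio. The alpha threshold is an
    assumed, illustrative cutoff.
    """
    n = ref_reads + alt_reads
    p_value = binomial_two_sided_p(ref_reads, n)
    if p_value >= alpha or ref_reads == alt_reads:
        return None, p_value
    return ("ref" if ref_reads > alt_reads else "alt"), p_value


if __name__ == "__main__":
    # Hypothetical (ref, alt) read counts at three heterozygous sites.
    for ref, alt in [(30, 10), (6, 4), (10, 30)]:
        call, p = naive_amplified_allele(ref, alt)
        print(f"ref={ref} alt={alt} call={call} p={p:.4g}")
```

Because every site is judged on its own counts, a site with few reads (such as the hypothetical 6:4 case) yields no call even when neighboring sites on the same haplotype show a clear imbalance. That is the linkage information the naive approach discards.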

Footnotes

⁶Corresponding author.

E-mail itsik{at}cs.columbia.edu.
[Supplemental material is available for this article.]
Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.122564.111 .