Detection of long repeat expansions from PCR-free whole-genome sequence data

  1. Michael A. Eberle (1,18)
  1. Illumina Incorporated, San Diego, California 92122, USA;
  2. Department of Neurology, Brain Center Rudolf Magnus, University Medical Center Utrecht, Utrecht University, 3584 CX Utrecht, The Netherlands;
  3. Illumina Limited, Chesterford Research Park, Little Chesterford, Nr Saffron Walden, Essex, CB10 1XL, United Kingdom;
  4. Repositive Limited, Future Business Centre, Cambridge CB4 2HY, United Kingdom;
  5. Department of Neuroscience, Mayo Clinic, Jacksonville, Florida 32224, USA;
  6. New York Genome Center, New York, New York 10013, USA;
  7. SURFsara, 1098 XG Amsterdam, The Netherlands;
  8. Academic Unit of Neurology, Trinity College Dublin, Trinity Biomedical Sciences Institute, Dublin 2, Republic of Ireland;
  9. Department of Neurology, Beaumont Hospital, Dublin 9, Republic of Ireland;
  10. Department of Basic and Clinical Neuroscience, Maurice Wohl Clinical Neuroscience Institute, King's College London, London SE5 9RX, United Kingdom;
  11. Department of Molecular Neuroscience, UCL Institute of Neurology, London WC1N 3BG, United Kingdom;
  12. University of Southampton, Southampton SO17 1BJ, United Kingdom;
  13. Sheffield Institute for Translational Neuroscience, University of Sheffield, Sheffield S10 2HQ, United Kingdom;
  14. Columbia University, New York, New York 10032, USA;
  15. Hereditary Disease Foundation, New York, New York 10032, USA;
  16. The US–Venezuela Collaborative Research Group;
  17. Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
  18. These authors contributed equally to this work.

  • Corresponding authors: meberle{at}illumina.com, J.H.Veldink{at}umcutrecht.nl
  • Abstract

    Identifying large expansions of short tandem repeats (STRs), such as those that cause amyotrophic lateral sclerosis (ALS) and fragile X syndrome, is challenging for short-read whole-genome sequencing (WGS) data. A solution to this problem is an important step toward integrating WGS into precision medicine. We developed a software tool called ExpansionHunter that, using PCR-free WGS short-read data, can genotype repeats at the locus of interest, even if the expanded repeat is larger than the read length. We applied our algorithm to WGS data from 3001 ALS patients who have been tested for the presence of the C9orf72 repeat expansion with repeat-primed PCR (RP-PCR). Compared against this truth data, ExpansionHunter correctly classified all (212/212, 95% CI [0.98, 1.00]) of the expanded samples as either expansions (208) or potential expansions (4). Additionally, 99.9% (2786/2789, 95% CI [0.997, 1.00]) of the wild-type samples were correctly classified as wild type by this method with the remaining three samples identified as possible expansions. We further applied our algorithm to a set of 152 samples in which every sample had one of eight different pathogenic repeat expansions, including those associated with fragile X syndrome, Friedreich's ataxia, and Huntington's disease, and correctly flagged all but one of the known repeat expansions. Thus, ExpansionHunter can be used to accurately detect known pathogenic repeat expansions and provides researchers with a tool that can be used to identify new pathogenic repeat expansions.
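    The classification rates quoted in the abstract (212/212 expanded samples and 2786/2789 wild-type samples) can be checked with a short calculation. The sketch below assumes an exact (Clopper-Pearson) binomial interval; this page does not state which interval method the authors used, so that choice is an assumption, and the function names are illustrative only. Under this assumption the lower bounds come out close to the reported 0.98 and 0.997.

```python
import math

def binom_pmf(i, n, p):
    """P(X = i) for X ~ Binomial(n, p), 0 < p < 1, computed in log space."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
               + i * math.log(p) + (n - i) * math.log1p(-p))
    return math.exp(log_pmf)

def upper_tail(k, n, p):
    """P(X >= k); increases as p increases."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

def lower_tail(k, n, p):
    """P(X <= k); decreases as p increases."""
    return sum(binom_pmf(i, n, p) for i in range(0, k + 1))

def bisect(f, target, increasing, iters=60):
    """Solve f(p) = target for p in (0, 1), given that f is monotone."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid   # solution lies to the right of mid
        else:
            hi = mid   # solution lies to the left of mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes in n trials.

    Assumption: the interval method is chosen for illustration; it is not
    confirmed as the one used in the article.
    """
    lower = 0.0 if k == 0 else bisect(lambda p: upper_tail(k, n, p), alpha / 2, True)
    upper = 1.0 if k == n else bisect(lambda p: lower_tail(k, n, p), alpha / 2, False)
    return lower, upper

# Counts taken from the abstract: (correctly classified, total).
for label, k, n in [("expanded", 212, 212), ("wild type", 2786, 2789)]:
    lower, upper = clopper_pearson(k, n)
    print(f"{label}: {k}/{n} = {k / n:.4f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

    When every sample is classified correctly (k = n), the lower bound reduces to the closed form (alpha/2)^(1/n), which gives a quick sanity check on the bisection.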

    Footnotes

    • [Supplemental material is available for this article.]

    • Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.225672.117.

    • Freely available online through the Genome Research Open Access option.

    • Received June 1, 2017.
    • Accepted August 28, 2017.

    This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/.
