Published December 11, 2019 | Version 1.1
Dataset Open

Data for Bovine breed-specific augmented reference graphs facilitate accurate sequence read mapping and unbiased variant discovery

  • 1. Animal Genomics ETH Zürich

Description

Description of the datasets

Data are organized as folders and compressed with tar.gz.

There are two compressed data folders: data, used for the cattle genome graph experiments, and data_human, used for the human genome graph experiments.

Cattle genome graphs experiments

First, extract the archive with the command tar -xvzf data.tar.gz. After extraction, the data folder is organized as follows:

  • Utilities: contains the bovine ARS-UCD 1.2 FASTA reference with its accompanying index.
  • Bin: contains the software used in the paper (vg, liftover, vcf2diploid).
  • Part1: data for the analysis in the variant prioritization section, further subdivided into:
    • vcf_sim: variant files from four animals in each breed, used to simulate reads.
    • reads_sim: simulated short reads used for read mapping.
    • vcf_freq: variants augmented to the graphs, filtered based on allele frequency.
  • Part2: data for the analysis in the section on graph mapping with breed-filtered variants, further subdivided into:
    • vcf_breed: variant files used for graph construction.
  • Part3: data for the analysis in the consensus genome section, further subdivided into:
    • read_sims: simulated reads as in Part1, but with coordinates lifted over to the new consensus genomes.
    • reference: contains the original reference and the consensus references.
    • vcf_consensus: contains the major-allele variants used to construct the consensus genomes.
  • Part4: data for the analysis in the section on whole-genome graph construction and variant genotyping, further subdivided into:
    • vcf_construct: variants on chromosomes 1-29 from 82 Brown Swiss animals, used to construct the BSW whole-genome graph.
    • BSW_graph: whole-genome Brown Swiss graph with its three accompanying indexes (xg, gcsa, and gbwt).
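The extraction step and folder layout described above can be checked programmatically. The following is a minimal Python sketch, not part of the original dataset: the function names and the EXPECTED_LAYOUT table are hypothetical, and the exact spelling and case of the folder names on disk are assumed to match the list above.

```python
import tarfile
from pathlib import Path

# Folder names copied from the description above. Their exact spelling and
# case inside the archive are an assumption and may need adjusting.
EXPECTED_LAYOUT = {
    "Utilities": [],
    "Bin": [],
    "Part1": ["vcf_sim", "reads_sim", "vcf_freq"],
    "Part2": ["vcf_breed"],
    "Part3": ["read_sims", "reference", "vcf_consensus"],
    "Part4": ["vcf_construct", "BSW_graph"],
}


def extract_archive(archive, dest="."):
    """Extract a gzip-compressed tar archive, like `tar -xzf archive -C dest`."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)


def missing_folders(root, layout=EXPECTED_LAYOUT):
    """Return the relative paths from `layout` that are not directories under `root`."""
    root = Path(root)
    missing = []
    for top, subfolders in layout.items():
        candidates = [Path(top)] + [Path(top) / sub for sub in subfolders]
        for rel in candidates:
            if not (root / rel).is_dir():
                missing.append(rel.as_posix())
    return missing
```

After calling extract_archive("data.tar.gz"), an empty list from missing_folders("data") would indicate that every folder named in the description is present; any returned path points to a name that differs on disk.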

Human genome graphs experiments

First, extract the data_human archive with the command tar -xvzf data_hum.tar.gz. After extraction, the data folder is organized as follows:

  • reference: the g1k_v37 reference used as a graph backbone
  • vcf_sim: variant files from four individuals in each population used to simulate reads
  • reads_sim: simulated short reads used for read mapping
  • vcf_freq: variants augmented to the graphs, filtered based on allele frequency
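Because the archive is large, it can be useful to inspect its folder names before extracting anything. The sketch below lists the top levels of a tar.gz archive using Python's standard tarfile module; the function name top_level_entries and the depth parameter are hypothetical helpers introduced only for illustration.

```python
import tarfile


def top_level_entries(archive, depth=2):
    """Return sorted unique member paths, truncated to `depth` path components."""
    seen = set()
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar:
            # Drop empty and "." components such as a leading "./".
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            if parts:
                seen.add("/".join(parts[:depth]))
    return sorted(seen)
```

Calling top_level_entries("data_hum.tar.gz") reads through the compressed stream without writing files to disk, so the folder names listed above can be compared with what the archive actually contains.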

Files

Files (35.4 GB)

  • md5:61174b9dcfeba5e0fd4378e8b8fb7e2e (24.8 GB)
  • md5:59f7cd71bbfc0661a24844b5e8433410 (10.6 GB)
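The md5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal Python sketch using the standard hashlib module; the function names are hypothetical, and because the listing does not state which checksum belongs to which archive, the pairing in any call is an assumption to be checked against the download page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file's md5 with a listed value such as 'md5:<hex digest>'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, matches_listing("data.tar.gz", "md5:61174b9dcfeba5e0fd4378e8b8fb7e2e") returns True only if the local file hashes to that value, assuming that checksum is the one belonging to data.tar.gz.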