Published December 11, 2019 | Version 1.1
Dataset
Open
Data for Bovine breed-specific augmented reference graphs facilitate accurate sequence read mapping and unbiased variant discovery
Description
Description of the datasets
The data are organized in folders and compressed as tar.gz archives.
There are two compressed data folders: data, which was used for the cattle genome graph experiments, and data_human, which was used for the human genome graph experiments.
Cattle genome graphs experiments
First, extract the archive with the command tar -xvzf data.tar.gz. After extraction, the data folder is organized as follows:
- Utilities: contains the bovine ARS-UCD 1.2 FASTA reference with its accompanying index.
- Bin: contains the software used in the paper (vg, liftover, vcf2diploid).
- Part1: data for the analysis in the variant prioritization section, further subdivided into:
  - vcf_sim: variant files from four animals of each breed, used to simulate reads
  - reads_sim: simulated short reads used for read mapping
  - vcf_freq: variants used to augment the graphs, filtered by allele frequency
- Part2: data for the analysis in the section on graph mapping with breed-filtered variants, further subdivided into:
  - vcf_breed: variant files used for graph construction
- Part3: data for the analysis in the consensus genome section, further subdivided into:
  - read_sims: simulated reads as in Part1, but with coordinates lifted over to the new consensus genomes
  - reference: contains the original reference and the consensus references
  - vcf_consensus: contains the major-allele variants used to construct the consensus genomes
- Part4: data for the analysis in the section on whole-genome graph construction and variant genotyping, further subdivided into:
  - vcf_construct: variants on chromosomes 1-29 from 82 Brown Swiss animals, used to construct the BSW whole-genome graph
  - BSW_graph: whole-genome Brown Swiss graph with its three accompanying indexes (xg, gcsa, and gbwt)
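After extraction, the layout described above can be checked programmatically. The following is a minimal sketch and not part of the dataset: the helper names extract_archive and missing_folders are hypothetical, and the folder names are assumed to match the list above in spelling and case, which may differ in the actual archive.

```python
import tarfile
from pathlib import Path

# Top-level folders as listed in the description above (spelling and case
# are assumptions; adjust them to what the archive actually contains).
EXPECTED_CATTLE = ["Utilities", "Bin", "Part1", "Part2", "Part3", "Part4"]


def extract_archive(archive, dest="."):
    """Extract a .tar.gz archive into dest (comparable to tar -xzf)."""
    with tarfile.open(archive, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: reject unsafe member paths.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)


def missing_folders(root, expected=EXPECTED_CATTLE):
    """Return the expected sub-folders that are absent under root."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]
```

Calling extract_archive("data.tar.gz") and then missing_folders("data") returns an empty list when every expected folder is present, and otherwise the names that were not found.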
Human genome graphs experiments
First, extract the data_human archive with the command tar -xvzf data_hum.tar.gz. After extraction, the data folder is organized as follows:
- reference: the g1k_v37 reference used as a graph backbone
- vcf_sim: variant files from four individuals in each population used to simulate reads
- reads_sim: simulated short reads used for read mapping
- vcf_freq: variants used to augment the graphs, filtered by allele frequency
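Both experiments include a vcf_freq folder of variants filtered by allele frequency. For readers unfamiliar with that step, here is a minimal sketch of such a filter in plain Python. It is not the authors' pipeline: it assumes each record carries an AF entry in the INFO column, and the function name and threshold parameter are hypothetical.

```python
def filter_by_allele_frequency(vcf_lines, min_af):
    """Yield header lines and records whose INFO AF value is >= min_af.

    Assumes an AF=<float> entry in the INFO column (8th tab-separated
    field). For multi-allelic records only the first AF value is used.
    Records without a usable AF value are dropped.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep header lines unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # malformed record, skip it
        info = dict(
            item.split("=", 1) for item in fields[7].split(";") if "=" in item
        )
        try:
            af = float(info["AF"].split(",")[0])
        except (KeyError, ValueError):
            continue
        if af >= min_af:
            yield line
```

The function works on any iterable of text lines, so it can be fed an open, uncompressed VCF file handle or a list of strings.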
Files
(35.4 GB)
Checksum | Size |
---|---|
md5:61174b9dcfeba5e0fd4378e8b8fb7e2e | 24.8 GB |
md5:59f7cd71bbfc0661a24844b5e8433410 | 10.6 GB |
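Because the archives are large, it may be worth verifying a download against the MD5 checksums listed above before extracting it. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and which checksum belongs to which archive should be taken from the file listing itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 with an expected value, with or without 'md5:'."""
    expected = expected.strip().lower().split(":", 1)[-1]
    return md5_of_file(path) == expected
```

Pass the path of the downloaded archive and the checksum string from its row. A result of False suggests an incomplete or corrupted download.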