Published July 25, 2022 | Version 1.0
Dataset Open

Prop3D-20

  • 1. University of Virginia

Description

Prop3D-20 is a protein structure dataset that combines 3D atomic coordinates with biophysical and evolutionary properties for every atom in every "cleaned" domain structure from **20 CATH [1] Homologous Superfamilies**. Domain structures are "cleaned" by adding missing residues with MODELLER [2], adding missing atoms with SCWRL4 [3], and protonating and energy minimizing (a simple de-bump) with PDB2PQR [4]. The hierarchical data format (HDF) file mirrors the CATH hierarchy and includes atomic-level features, residue-level features, residue-residue contacts, and pre-calculated train (~80%) / test (~10%) / validation (~10%) splits for each superfamily, derived from CATH's sequence identity clusters (e.g. S35 for 35% sequence identity).

This dataset was originally stored in the Highly Scalable Data Service ([HSDS](http://www.github.com/HDFGroup/hsds)) and was exported to this raw .h5 file as a backup. We recommend loading the data into HSDS for use with [h5pyd](http://www.github.com/HDFGroup/h5pyd), but the .h5 file can also be opened with [h5py](http://www.github.com/h5py/h5py).
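As a minimal sketch of opening the raw .h5 file with h5py, the snippet below lists the groups and datasets in the first few levels of the file. The helper name `list_groups` and the file path in the usage comment are hypothetical, and nothing here assumes a particular group layout; the README describes the actual organization.

```python
import h5py


def list_groups(path, max_depth=2):
    """Return HDF5 object paths down to max_depth levels below the root."""
    found = []

    def walk(group, depth):
        # Descend level by level so a very large file is not traversed in full.
        for _, item in group.items():
            found.append(item.name)
            if isinstance(item, h5py.Group) and depth < max_depth:
                walk(item, depth + 1)

    with h5py.File(path, "r") as f:  # open read-only
        walk(f, 1)
    return found


# Usage (hypothetical path; substitute the downloaded file name):
# for name in list_groups("path/to/downloaded_file.h5", max_depth=2):
#     print(name)
```

Limiting the depth keeps the listing short; increase `max_depth` to explore further down the hierarchy.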

Please see the README attached to this dataset to learn how to use the dataset and how it is organized.

For more information on setting up HSDS and/or recreating this dataset, please see [http://www.github.com/bouralab/Prop3D/README.md](http://www.github.com/bouralab/Prop3D/README.md).

References

1. Sillitoe I, Bordin N, Dawson N, Waman VP, Ashford P, Scholes HM, Pang CSM, Woodridge L, Rauer C, Sen N, Abbasian M, Le Cornu S, Lam SD, Berka K, Varekova IH, Svobodova R, Lees J, Orengo CA. CATH: increased structural coverage of functional space. Nucleic Acids Res. 2021 Jan 8;49(D1):D266-D273. doi: 10.1093/nar/gkaa1079. PMID: 33237325; PMCID: PMC7778904.

2. Webb B, Sali A. Comparative Protein Structure Modeling Using MODELLER. Curr Protoc Bioinformatics. 2016 Jun 20;54:5.6.1-5.6.37. doi: 10.1002/cpbi.3. PMID: 27322406; PMCID: PMC5031415.

3. Krivov GG, Shapovalov MV, Dunbrack RL Jr. Improved prediction of protein side-chain conformations with SCWRL4. Proteins. 2009 Dec;77(4):778-95. doi: 10.1002/prot.22488. PMID: 19603484; PMCID: PMC2885146.

4. Jurrus E, Engel D, Star K, Monson K, Brandi J, Felberg LE, Brookes DH, Wilson L, Chen J, Liles K, Chun M, Li P, Gohara DW, Dolinsky T, Konecny R, Koes DR, Nielsen JE, Head-Gordon T, Geng W, Krasny R, Wei GW, Holst MJ, McCammon JA, Baker NA. Improvements to the APBS biomolecular solvation software suite. Protein Sci. 2018 Jan;27(1):112-128. doi: 10.1002/pro.3280. Epub 2017 Oct 24. PMID: 28836357; PMCID: PMC5734301.

Files

README.md

Files (40.8 GB)

Checksum Size
md5:79739646c03d8777c6b9dc618f61e80e 40.8 GB
md5:80d1662ea562a5597828ef447c617a0d 15.3 kB
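Because the main file is large, it may be worth checking a download against the md5 values listed above. The following is a sketch using only the Python standard library; the helper name `md5_of_file` and the path in the usage comment are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so a multi-gigabyte file never has to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage (hypothetical path; compare against the md5 shown for that file):
# print(md5_of_file("path/to/downloaded_file.h5"))
```

The listing prefixes each checksum with `md5:`, so compare only the hexadecimal part with the function's return value.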

Additional details

Related works

Is described by
Dataset: https://www.wikidata.org/wiki/Q108040542 (URL)