Data Note

High quality, small molecule-activity datasets for kinase research

[version 1; peer review: 2 approved]
PUBLISHED 14 Jun 2016

This article is included in the Cheminformatics gateway.

This article is included in the Data: Use and Reuse collection.

Abstract

Kinases regulate cell growth, movement, and death. Deregulated kinase activity is a frequent cause of disease. The therapeutic potential of kinase inhibitors has led to large amounts of published structure activity relationship (SAR) data. Bioactivity databases such as the Kinase Knowledgebase (KKB), WOMBAT, GOSTAR, and ChEMBL provide researchers with quantitative data characterizing the activity of compounds across many biological assays. The KKB, for example, contains over 1.8M kinase structure-activity data points reported in peer-reviewed journals and patents. In the spirit of fostering methods development and validation worldwide, we have extracted and have made available from the KKB 258K structure activity data points and 76K associated unique chemical structures across eight kinase targets. These data are freely available for download within this data note.

Keywords

Kinase, SAR, Bioactivity Database, Dataset, Drug Discovery, Bioactive Molecules, Kinase Knowledgebase, KKB

Introduction

Since their discovery in 1975 by Cohen et al.1, kinases have become one of the most established drug target families, second only to G-protein-coupled receptors (GPCRs). Most progress in kinase research has occurred in the last 25 years, including the discovery of many new kinases2,3, identification of new isoforms of pre-existing kinases4,5, elucidation of new biological pathways, and identification of many new kinase-disease associations6,7. While kinases are well-validated anti-cancer targets8–11, kinase inhibitors have also been pursued in cardiovascular12, autoimmune13, inflammatory skin and bowel14, neurodegenerative15, and renal disease programs16. Most small-molecule kinase inhibitors target the ATP binding site of the kinase catalytic domain11. The ATP binding region of the catalytic domain is highly conserved among protein kinases, which has important consequences for drug development. Achieving selectivity of a small molecule inhibitor against kinase off-targets to avoid adverse reactions can be a major hurdle. However, the cross reactivity of many chemotypes can also open opportunities to focus on other closely related kinases. Despite the high degree of conservation in the ATP binding site, reasonably selective inhibitors with favorable pharmacological properties can be developed17. It is now common in discovery programs to profile inhibitors against an extensive set of kinase targets18. These kinase-profiling efforts have generated valuable data, providing insight into selectivity and promiscuity of clinical inhibitors19–21.

Medicinal chemists can benefit significantly from well-curated databases documenting chemical structure(s) with an experimentally measured biological activity. These structure-activity relationship (SAR) databases help researchers better understand drug-target interactions, which can assist in the design of potent and selective chemical inhibitors22–25. A well-populated, editable, easy-to-search and flexible SAR database is an integral part of the modern drug design process26. SAR databases provide elementary insights to researchers, including:

  • (a) Target druggability: known small molecule binders are required to categorize a protein as druggable. High-affinity and non-promiscuous inhibitors are particularly valuable for establishing druggability, which can be further validated using structural biology information. In many cases druggability can be inferred for new targets using homology models27, where similarities can be mapped via sequences, pathways or functions. Examples include the Target Informatics Platform (TIP)28 and Modbase29.

  • (b) Scaffold selectivity: the golden principle that applies is “less selective scaffolds have more undesirable side effects.” Prior knowledge of selectivity profiles can help in making informed decisions on which chemotypes to pursue at the start of discovery programs30. Organizing data by scaffold enables classic SAR analysis in which side-chain moieties can be evaluated and considered or avoided in lead optimization31.

  • (c) Clinical molecules: it can be very helpful to see whether the scaffold(s) or derivatives under study also occur in launched drugs. This enables medicinal chemists to associate therapeutic classes with active scaffolds.

  • (d) Development and validation of computational methods: well-curated datasets are very helpful in the development and refinement of computational methodologies. With a common set of data, computational researchers can also compare and contrast methods, providing additional validation32.

  • (e) Virtual screening: high-quality, well-curated, standardized and annotated datasets are required to build predictive models for virtual screening, as we have shown previously for Kinase Knowledgebase (KKB) data33.

Kinase Knowledgebase (KKB)

The KKB is a database of biological activity data, structure-activity relationships, and chemical synthesis data focused on protein kinases. Since its inception in 2001, the KKB has grown steadily through quarterly updates. With more than two decades of high quality SAR data, the KKB represents one of the first kinase target specific databases of biological activity and chemical synthesis data from curated scientific literature and patents. The KKB contains a large number of kinase structure-activity data points (>1.8M) reported in peer-reviewed journals and in patents. The data have been curated from over 150 different journals reporting kinase inhibitors with activity data, with leading contributions from J Med Chem, Bioorg Med Chem, Bioorg Med Chem Lett and Euro J Med Chem. In addition, the KKB contains data curated from WO, EP and US patents and patent applications. The scientific information is curated from the published text using a combination of automatic and manual efforts.

A summary of the first quarter release for 2016 (Q1-2016) is reported in Table 1. With the Q1-2016 KKB release, there are a total of 506 unique kinase targets with over 682K unique small molecules. A listing of a few “hot” kinase targets with their inhibitors (data points) is reported in Table 2.

Table 1. Eidogen-Sertanty Kinase Knowledgebase.

Summary Statistics – Q1 2016 Release.

Articles covered: 2,780
Patents and patent applications covered: 6,346
Total number of bioactivity data points: 1,775,368
Total number of unique molecules: 682,289
Total number of unique molecules with assay data: 337,491
Total number of assay protocols: 32,462

Table 2. Eidogen-Sertanty Kinase Knowledgebase.

Data Points for Selected Targets – Q1 2016 Release.

Kinase Classification | Family | Target Name | All SAR Data Points | All IC50 Data Points | Unique Assay Molecules | All SAR Data Points | All IC50 Data Points | Unique Assay Molecules
Non-Receptor Tyrosine Kinases Abl ABL1 1475048432177423718361098
Csk CSK 37921448450548266146
Fak FAK/PTK2 1031140673863288013061300
JakA JAK3 295508778114561327605440
Src SRC 219368289448034251473747
LCK 23819105146090784381214
FYN 312587315128117
Syk SYK 3942617549167741037484268
ZAP70 595129981013522
Tec ITK 101313690219721983113
Receptor Tyrosine Kinases EGFR EGFR 342931468465931973190683321
ERBB2 1118251991756798841151803
Eph EPHA2 29357652231201
FGFR FGFR1 1958283944149878133451622
InsR INSR 460712931032920422395
Met MET 27032104069308514725261983
PDGFR PDGFRB 140585889238854262653983
FLT3/FLK2 13082397428301022443862268
KIT 1499151532527704033392747
Tie TEK 914243062300312215611360
Trk NTRK1/TRKA 8199320729251743814563
VEGFR KDR/FLK1 5599124821138992031791196541
FLT1 996342511116864432197
CMGC Kinases CDK CDK2 33878126951041153441119667
CDK5 8227304817141833
GSK GSK3B 22950776669922013519832
MAPK MAPK14 360671607714270654123732787
MAPK1 1128630733081272510641085
MAPK10 572516151610964823
MAPK8 622518031523880285393
MAPK11 1162196100000
AGC Kinases AKT AKT1 1460163335794697030642831
DMPK ROCK1 9135205231051894065
PKB PDPK1 9569376526421486844
PKC PRKCA 10670352825885477669510
PRKCE 375914941032211
CAMK Kinases CAMKL CHEK1 137245192520231402201130
MAPKAPK MAPKAPK2 11041407337471311649637
MAPKAPK3 2138518299000
Other Protein Kinases AUR AURKA 22646790470341128474382
IKK IKBKB 76282978314636783144
CHUK/IKBKA 2938999764296148147
PLK PLK1 91813223348029861364888
STE MAP2K1 6340255120451651573655
TKL ILK 36018017258125380
RAF1 11302505833781956885581
BRAF 26349121698983672624422106
Other Non-Protein Kinases
Lipid Kinases PIK3/PIK3CG 299251343810899352517581217
PIK3CA 361681641812448339213101219
Nucleotide Kinases TK1 11063013392416533193
ADK 1924931723669252240

Kinase inhibitors are biologically active small molecules and their activity refers to experimentally measured data on a given kinase target (in enzyme or in cell based assays), using predefined experimental protocols. After curation and standardization, these measured values together with related information are indexed in the KKB. Each inhibitor entered in the KKB carries unique identifiers such as:

  • (a) Chemical information and biological information: unique structure IDs (MR_ID) are assigned based on unique canonical SMILES. In addition, hand-drawn Cartesian coordinates are captured. Chemical compounds are associated with calculated chemical and physical properties.

  • (b) Biological target and assay protocol: biological targets are annotated by EntrezGeneID, UniProt ID, and HUGO approved names. An assay protocol includes detailed information pertaining to the experiments performed to measure the biological activity of the compound. Each protocol has a descriptive title and a unique set of keywords. Assays are categorized by assay format (biochemical, cell-based, etc.) following standards set forth by BioAssay Ontology (BAO)34,35. Kinase targets are classified as protein or non-protein kinases, and protein kinases are further classified by the typical domain-based classification into group, family, etc. We are in the process of mapping KKB targets to the Drug Target Ontology (DTO), which is in development.

  • (c) Experimental bioactivity screening results: a bioactivity data point is a defined result/endpoint of a specified small molecule compound tested in a biological assay. The assay is defined as in (b); the result types/endpoints captured include IC50, Ki, and Kd, and the vast majority of those for biochemical and cell-based assays correspond to BAO definitions.

  • (d) Source reference: bibliographic information and unique identifiers for the journal articles and patents from which information related to the molecules was extracted include PubMedID, DOI, and standardized patent numbers. For journals, the KKB provides title, author names, journal name, volume, issue, and page numbers. For patents, the title, patent or patent application number (along with family members), inventor names, assignee names, publication data and priority numbers are provided.
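The identifiers listed in (a) to (d) can be pictured as a single record type. The following is a minimal Python sketch, not the KKB schema: the class name, field names, the nanomolar unit, and the example values (including the made-up MR_ID and the placeholder SMILES) are assumptions chosen for illustration; only the EGFR EntrezGeneID and UniProt ID are taken from Table 4.

```python
from dataclasses import dataclass
from typing import Optional

# Endpoint types named in the text; anything else is rejected in this sketch.
ALLOWED_ENDPOINTS = {"IC50", "Ki", "Kd"}


@dataclass(frozen=True)
class BioactivityPoint:
    """One structure-activity data point (hypothetical field names)."""
    mr_id: str                    # unique structure ID (MR_ID)
    canonical_smiles: str         # structure the MR_ID is derived from
    entrez_gene_id: int           # target annotation
    uniprot_id: str               # target annotation
    assay_format: str             # e.g. "biochemical" or "cell-based"
    endpoint: str                 # "IC50", "Ki" or "Kd"
    value_nm: float               # assumed unit: nanomolar
    pubmed_id: Optional[str] = None      # source reference (journal)
    patent_number: Optional[str] = None  # source reference (patent)

    def __post_init__(self):
        # Basic sanity checks on the endpoint type and the measured value.
        if self.endpoint not in ALLOWED_ENDPOINTS:
            raise ValueError(f"unsupported endpoint: {self.endpoint}")
        if self.value_nm <= 0:
            raise ValueError("activity value must be positive")


example = BioactivityPoint(
    mr_id="MR_0000001",           # made-up identifier
    canonical_smiles="c1ccccc1",  # placeholder structure (benzene)
    entrez_gene_id=1956,          # EGFR (Table 4)
    uniprot_id="P00533",          # EGFR (Table 4)
    assay_format="biochemical",
    endpoint="IC50",
    value_nm=25.0,                # invented value for illustration
)
```

Keeping the record immutable (`frozen=True`) is a design choice for this sketch: curated data points are read far more often than they are edited.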

A disease type can be related to multiple kinase groups, and several diseases can arise from a common set of kinase groups (Table 3)6. In the KKB, kinases are classified as protein or non-protein kinases, with sub-categories such as carbohydrate and lipid kinases and the typical protein kinase groups (such as CMGC, CAMK, TK, TKL, RGC, AGC), and further into sub-groups such as families. DTO provides a functional and phylogenetic classification of kinase domains to facilitate navigation of kinase drug targets. DTO is developed as part of the Illuminating the Druggable Genome (IDG) project. Here we make datasets freely available to the research community, including in support of efforts such as IDG. We also offer to run our predictive models built using KKB data to support prioritization of drug targets.

Table 3. Kinase-disease association in top therapeutic segments.

Disease Class | Kinase Group
Cancer | AGC; atypical; CAMK; CK1; CMGC; RGC; STE; TK; TKL
Diabetes | AGC; CMGC; TK
Cardiovascular | AGC; CAMK; CMGC; TKL
Hypertension | AGC; CAMK; RGC
Neurodegeneration | AGC; CAMK; CMGC; CK1
Inflammation | CMGC; STE; TKL
Immunity | AGC; TK
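Table 3 is read from disease class to kinase group, but the reverse lookup (which disease classes a given kinase group is associated with) is often the more useful question. A minimal Python sketch of that inversion follows; the dictionary is transcribed from Table 3, while the function and variable names are hypothetical.

```python
from collections import defaultdict

# Disease class -> kinase groups, transcribed from Table 3.
DISEASE_TO_GROUPS = {
    "Cancer": ["AGC", "atypical", "CAMK", "CK1", "CMGC", "RGC", "STE", "TK", "TKL"],
    "Diabetes": ["AGC", "CMGC", "TK"],
    "Cardiovascular": ["AGC", "CAMK", "CMGC", "TKL"],
    "Hypertension": ["AGC", "CAMK", "RGC"],
    "Neurodegeneration": ["AGC", "CAMK", "CMGC", "CK1"],
    "Inflammation": ["CMGC", "STE", "TKL"],
    "Immunity": ["AGC", "TK"],
}


def invert(mapping):
    """Return kinase group -> list of disease classes, in table order."""
    inverted = defaultdict(list)
    for disease, groups in mapping.items():
        for group in groups:
            inverted[group].append(disease)
    return dict(inverted)


GROUP_TO_DISEASES = invert(DISEASE_TO_GROUPS)
print(GROUP_TO_DISEASES["TK"])
```

With the table as transcribed, the TK group maps to Cancer, Diabetes and Immunity, and AGC appears in six of the seven disease classes.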

Kinase inhibitor datasets

The wealth of kinase inhibitor data presents opportunities for analysis as a whole or by integrating such data into various computational platforms to support development and validation of hypotheses of kinase inhibition. Several years ago, Eidogen-Sertanty made available 3880 pIC50 data points across three kinase targets (ABL1, SRC, and AURKA; the validation sets) to foster algorithm development and validation worldwide. With this data note, data for eight additional targets from therapeutically important classes, namely EGFR, CDK2, ROCK2, MAPK14 and the PI3K class I catalytic subunits PIK3CA, PIK3CB, PIK3CD and PIK3CG (Table 4), have been made available to further foster worldwide development, validation, and collaborative interaction (see the KB_SAR_DATA_F1000.txt and KB_SAR_DATA_F1000.sdf files). These data total ~258K data points (structures with standard results/endpoints such as IC50, Ki or Kd) and ~76K unique chemical structures. The data points were exported from the KKB and cover 1044 articles and 942 patents.
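The earlier validation sets were expressed as pIC50, whereas the endpoints named here are IC50, Ki and Kd. By the standard definition, pIC50 is the negative base-10 logarithm of the IC50 in mol/L. The Python sketch below assumes the IC50 is given in nanomolar; the unit used in the released files is not stated here, so that choice and the function name are assumptions.

```python
import math


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar (assumed unit) to pIC50.

    pIC50 = -log10(IC50 in mol/L) = 9 - log10(IC50 in nM)
    """
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


# A 1 µM (1000 nM) inhibitor has pIC50 6; a 1 nM inhibitor has pIC50 9.
for ic50 in (1000.0, 1.0):
    print(ic50, round(ic50_nm_to_pic50(ic50), 2))
```

Working on the logarithmic scale makes potencies spanning several orders of magnitude comparable, which is one reason modelling datasets are often distributed as pIC50.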

Table 4. Important aspects about the selected targets.

Kinase | Approved Name | Class | Diseases Associated | Entrez GeneID | UniProt ID
EGFR* | Epidermal Growth Factor Receptor | Receptor Tyrosine Kinase | NSCLC, Medullary Thyroid Cancer, Breast Cancer, Neonatal Inflammatory Skin and Bowel Disease | 1956 | P00533
CDK2 | Cyclin-Dependent Kinase 2 | Serine/Threonine Kinase | Angiomyoma, Carbuncle | 1017 | P24941
ROCK2 | Rho-Associated, Coiled-Coil Containing Protein Kinase 2 | Serine/Threonine Kinase | Colorectal Cancer, Penile Disease, Hepatocellular Carcinoma | 9475 | O75116
MAPK14 | Mitogen-Activated Protein Kinase 14 | Serine/Threonine Kinase | Acquired Hyperkeratosis, Prostate Transitional Cell Carcinoma, Immunity-related Diseases | 1432 | Q16539
PIK3CA | Phosphatidylinositol-4,5-Bisphosphate 3-Kinase, Catalytic Subunit Alpha | Lipid Kinase | Colorectal Cancer, Actinic Keratosis | 5290 | P42336
PIK3CB | Phosphatidylinositol-4,5-Bisphosphate 3-Kinase, Catalytic Subunit Beta | Lipid Kinase | - | 5291 | P42338
PIK3CD | Phosphatidylinositol-4,5-Bisphosphate 3-Kinase, Catalytic Subunit Delta | Lipid Kinase | Immunodeficiency 14, Activated PIK3-Delta Syndrome | 5293 | O00329
PIK3CG | Phosphatidylinositol-4,5-Bisphosphate 3-Kinase, Catalytic Subunit Gamma | Lipid Kinase | Lichen Nitidus | 5294 | P48736

*Afatinib, Erlotinib, Gefitinib, Lapatinib, Osimertinib, Vandetanib are US-FDA approved kinase inhibitors with EGFR as one of the valid targets.

The datasets cover a broad range of biochemical and cell-based studies investigating kinase inhibition, and they represent a diverse collection of pharmaceutically active scaffolds. These scaffolds can be easily examined for selectivity and specificity across the eight kinase targets. Additionally, they can be used to infer novel target-inhibitor relationships for kinases and compounds not included in these subsets.

Bibliographic information is reported in the files ArticleInfo_F1000.txt and PatentInfo_F1000.txt. Experimental procedures, along with target metadata including EntrezGeneIDs, assay format/type (biochemical/enzyme, cell-based, etc.), keywords, species, and the cell lines used in cell-based assays, are stored in AssayProtocols_F1000 (txt and xml attached).

The largest contribution to the KKB validation sets comes from EGFR, with ~54K inhibitor molecules. This is followed by ~43K inhibitors for MAPK14; CDK2 and PIK3CA each have ~39K inhibitors. Figure 1 depicts the data point distribution for each kinase in the attached subset. Moreover, 84% of the data are from biochemical enzyme-based assay experiments and 16% are from cell-based assays (Figure 2). The data point measures include IC50, Ki and Kd (Figure 3).

Figure 1. Data point distributions for each kinase.

Figure 2. Data points share for each assay type.

Figure 3. Data points in various assay measures.
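The per-assay-type and per-endpoint shares summarized in Figures 2 and 3 amount to counting data points by one attribute. A minimal Python sketch of that tally follows; the record keys (`assay_format`, `endpoint`) and the four toy rows are invented for illustration and are not taken from the released files.

```python
from collections import Counter


def share_by(records, key):
    """Return the percentage of records falling into each value of `key`."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    return {value: 100.0 * n / total for value, n in counts.items()}


# Toy data points (hypothetical), standing in for rows of the SAR export.
toy_points = [
    {"assay_format": "biochemical", "endpoint": "IC50"},
    {"assay_format": "biochemical", "endpoint": "Ki"},
    {"assay_format": "biochemical", "endpoint": "IC50"},
    {"assay_format": "cell-based", "endpoint": "IC50"},
]

print(share_by(toy_points, "assay_format"))
print(share_by(toy_points, "endpoint"))
```

The same function serves both figures because only the grouping key changes.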

Analysis of the ~76K unique molecules for selectivity against the targets reveals that ~64K inhibit only one of the eight kinases extracted (Figure 4). Approximately 5K molecules show activity against two kinase targets, and ~3K molecules show activity against three kinases. A total of 79 molecules in the subset have some activity against all eight kinase targets.

Figure 4. Selectivity profile for data points.
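The selectivity profile described above can be sketched as a two-step count: collect the set of targets seen for each molecule, then tally molecules by the size of that set. The Python example below is an illustration under stated assumptions, not the authors' procedure: it treats any recorded (molecule, target) pair as activity, applies no potency threshold, and uses invented molecule identifiers.

```python
from collections import Counter, defaultdict


def selectivity_profile(pairs):
    """Map number of distinct targets -> number of molecules with that count.

    `pairs` is an iterable of (molecule_id, target_name) tuples; repeated
    measurements of the same pair are counted once.
    """
    targets_by_molecule = defaultdict(set)
    for molecule_id, target in pairs:
        targets_by_molecule[molecule_id].add(target)
    return Counter(len(targets) for targets in targets_by_molecule.values())


# Toy (molecule, target) pairs; identifiers are hypothetical.
toy_pairs = [
    ("m1", "EGFR"), ("m1", "EGFR"),   # duplicate measurement, one target
    ("m2", "EGFR"), ("m2", "CDK2"),   # two targets
    ("m3", "ROCK2"),                  # one target
]

profile = selectivity_profile(toy_pairs)
print(sorted(profile.items()))
```

In a real analysis one would likely add an activity cutoff before counting a pair, since a measured but very weak value is not meaningful inhibition.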

Dataset 1. High quality, small molecule-activity for kinase research. Raw data behind the analyses described in the Data Note are included.
The file 'Datasets legends' contains descriptions for each dataset.

Conclusions

The KKB is available in various formats such as SQL, SDF and IJC (Instant JChem), delivered as quarterly updates. Two mobile apps, iKinase and iKinasePro25, are also available for download; they enable basic search access to KKB content, including kinase inhibitor structures, biological data and references/patents. Simple substructure and exact structure search access to the KKB is also available. We have extracted from the KKB ~258K structure-activity data points and ~76K associated unique chemical structures across eight kinase targets and made these data freely available for download within this data note to foster algorithm development and validation worldwide.

Data availability

F1000Research: Dataset 1. High quality, small molecule-activity for kinase research, 10.5256/f1000research.8950.d12459136

how to cite this article
Sharma R, Schürer SC and Muskal SM. High quality, small molecule-activity datasets for kinase research [version 1; peer review: 2 approved] F1000Research 2016, 5(Chem Inf Sci):1366 (https://doi.org/10.12688/f1000research.8950.1)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Open Peer Review

Reviewer Report 08 Jul 2016
Sorin Avram, Department of Computational Chemistry, Institute of Chemistry Timisoara of the Romanian Academy (ICT), Timișoara, Romania 
Approved
The paper describes Kinase Knowledgebase (KKB), i.e., a database containing structure-activity data on kinases. The current data note briefly presents the KKB Q1 2016 Release and the appended eight kinase data sets, which are made hereby publicly available. 

Avram S. Reviewer Report For: High quality, small molecule-activity datasets for kinase research [version 1; peer review: 2 approved]. F1000Research 2016, 5(Chem Inf Sci):1366 (https://doi.org/10.5256/f1000research.9629.r14358)
Reviewer Report 08 Jul 2016
George Nicola, Computational Biology, University of California at San Diego, San Diego, CA, USA 
Approved
This article describes an overview of current kinase-related databases of significance, with particular focus on the contents of the Kinase Knowledgebase (KKB). The KKB has the largest repository of high-quality kinase activity data. Providing access to over ¼ million data ...
Nicola G. Reviewer Report For: High quality, small molecule-activity datasets for kinase research [version 1; peer review: 2 approved]. F1000Research 2016, 5(Chem Inf Sci):1366 (https://doi.org/10.5256/f1000research.9629.r14833)