
Predicting drug sensitivity of cancer cells based on DNA methylation levels

Abstract

Cancer cell lines, which are cell cultures derived from tumor samples, represent one of the least expensive and most studied preclinical models for drug development. Accurately predicting drug responses for a given cell line based on molecular features may help to optimize drug-development pipelines and explain mechanisms behind treatment responses. In this study, we focus on DNA methylation profiles as one type of molecular feature that is known to drive tumorigenesis and modulate treatment responses. Using genome-wide DNA methylation profiles from 987 cell lines in the Genomics of Drug Sensitivity in Cancer database, we applied machine-learning algorithms to evaluate the potential to predict cytotoxic responses for eight anti-cancer drugs. We compared the performance of five classification algorithms and four regression algorithms representing diverse methodologies, including tree-, probability-, kernel-, ensemble-, and distance-based approaches. We artificially subsampled the data to varying degrees, aiming to understand whether training based on relatively extreme outcomes would yield improved performance. When using classification or regression algorithms to predict discrete or continuous responses, respectively, we consistently observed excellent predictive performance when the training and test sets both consisted of cell-line data. Classification algorithms performed best when we trained the models using cell lines with relatively extreme drug-response values, attaining area-under-the-receiver-operating-characteristic-curve values as high as 0.97. The regression algorithms performed best when we trained the models using the full range of drug-response values, although this depended on the performance metrics we used. Finally, we used patient data from The Cancer Genome Atlas to evaluate the feasibility of classifying clinical responses for human tumors based on models derived from cell lines. Generally, the algorithms were unable to identify patterns that predicted patient responses reliably; however, predictions by the Random Forests algorithm were significantly correlated with Temozolomide responses for low-grade gliomas.

Introduction

Cancers are complex, dynamic diseases characterized by aberrant cellular processes such as excessive proliferation, resistance to apoptosis, and genomic instability [1]. Tumors are caused by somatic variations, which can affect individual nucleotides or larger segments of DNA [2]. Dysregulation of cellular function can also be caused by epigenetic modifications, including aberrant DNA methylation [3]. One goal of cancer research is to advance precision medicine through identifying genomic and epigenomic features that influence treatment outcomes in individuals [4]. In this context, therapeutic decisions have the potential to be guided by molecular signatures.

Cancer cell lines are cell cultures derived from tumor samples. They represent one of the least expensive and most studied preclinical models [5]. Drug screening in cell lines can be used to prioritize candidate drugs for testing in humans. In performing a screen, researchers calculate IC50 values, which quantify the amount of drug necessary to induce a biological response in half of the cells tested for a given experiment [6]. Drugs with a relatively high potency (corresponding to low log-transformed IC50 values) are generally considered to be the strongest candidates for use in humans, although patient safety must also be evaluated. After a candidate drug has been identified, researchers may seek to identify molecular markers associated with those responses, comparing cell lines that respond to the drug against those that do not. Such markers might be useful for elucidating drug mechanisms or eventually predicting clinical responses in patients [7].

Over the past decade, researchers have catalogued the molecular profiles of more than a thousand cancer cell lines and their responses to hundreds of drugs [8–10]. These resources have been made publicly available, thus providing an opportunity for researchers to identify molecular signatures that predict drug responses in a preclinical setting. In addition, recent efforts to catalog molecular profiles in human tumors have resulted in massive collections of publicly available molecular data for tumor samples [11–13]. Such data can be used to validate findings from preclinical studies and assess our ability to classify cancer patients into groups that will most likely benefit from a certain treatment [14].

Many computational methods have been proposed to predict anticancer drug sensitivity based on genetic, genomic, or epigenomic features of cancer cell lines. The most common approach is to generate a drug-specific model, which is independently trained using molecular observations and drug-response data from cell lines tested with each drug individually. Linear-regression-based, drug-specific models have been developed using gene expression data [7, 8, 15] or a combination of gene expression data and other genomic data types, such as copy number alterations and DNA methylation [16]. Non-linear models using a single data type or multiple data types have also been proposed, including artificial neural networks, random forests, support vector machines (SVM), kernel regression, latent and Bayesian approaches, attractor landscape analysis of network dynamics, unsupervised pathway activity models, and recommender systems [17–34]. Transfer-learning techniques have also been proposed to improve drug-response prediction performance for one type of cancer by incorporating data from other types of cancer [35]. Drug response information has also been modeled in combination with chemical drug properties using elastic net regression, support vector machines, regularized matrix factorization, and manifold learning [36–40].

Most recent cell-line studies have emphasized the potential to predict drug responses based on gene-expression profiles [17, 41–44]. Technologies for profiling gene-expression levels are widely available and reflect the downstream effects of genomic and epigenomic aberrations. However, gene-expression profiles may be difficult to apply in the clinic because of the instability of RNA [45]. Moreover, gene-expression data are generated using a wide range of technologies (e.g., different types of oligonucleotide microarrays and RNA-sequencing), and are preprocessed using diverse algorithms. Thus, it is often difficult to combine datasets from multiple sources (e.g., preclinical and tumor data). In this study, we focus on DNA methylation profiles, using cell-line data from the Genomics of Drug Sensitivity in Cancer (GDSC) database [7] in combination with tumor data from The Cancer Genome Atlas (TCGA) [46]. These projects used the same technology to quantify methylation levels, and the GDSC team created a version of the methylation data that had been normalized in a consistent manner, thus enabling us to perform a more systematic evaluation of whether DNA methylation levels can predict drug responses.

DNA methylation is an epigenetic mechanism that controls gene-expression levels. The addition of a methyl group to DNA may lead to changes in DNA stability, chromatin structure, and DNA-protein interactions. Hypermethylation of CpG islands in promoter regions of DNA has been acknowledged as an important means of gene inactivation, and its occurrence has been detected in almost all types of human tumors [47]. Similar to genetic alterations, methylation changes to DNA may alter a gene’s behavior. However, hypermethylation can be reversed with the use of targeted therapy [48], making it an attractive target for anticancer therapy [49, 50].

In some cases, DNA methylation levels for a single gene may control cellular responses for a given drug. For example, MGMT hypermethylation predicts temozolomide responses in glioblastomas [51], and BRCA1 hypermethylation predicts responses to poly ADP ribose polymerase inhibitors in breast carcinomas [52]. However, in many cases, drug responses are likely influenced by the combined effects of many genes interacting in the context of signaling pathways [53]. Accordingly, to maximize our ability to predict drug responses, it is critical to account for this complexity.

In this study, we use DNA methylation profiles from preclinical samples to model drug responses for eight anti-cancer drugs. We compare the performance of five classification algorithms and four regression algorithms that encompass a diverse range of methodologies, including tree-based, probability-based, kernel-based, ensemble-based, and distance-based approaches. We use classical algorithms as a way to establish a performance baseline against which other algorithms might be compared when working with DNA methylation profiles. For regression, we predict IC50 values directly. For classification, we use discretized IC50 values. For both types of algorithm, we artificially subsample the data to varying degrees to evaluate whether training models based on relatively extreme outcomes would yield improved performance; we assess our ability to predict drug responses using as few as 10% of the cell lines (those with the most extreme IC50 values). An underlying motivation of this approach was to decrease data-generation costs. For example, if it could be shown that generating data for relatively few (extreme) responders performs as well as or better than generating data for responders across the full range of response values, cost savings may result. Perhaps surprisingly, the classification algorithms performed best when only 10–20% of the cell lines were used. The regression algorithms performed best when we trained the models using the full range of drug-response values, although this depended on the performance metrics we used. Finally, we derived classification models from the cell-line data and predicted drug responses for TCGA patients. In most cases, the models failed to generalize effectively; however, predictions by the Random Forests algorithm were significantly correlated with Temozolomide responses for low-grade gliomas.

Methods

The GDSC database contains data for human cell lines derived from common and rare types of adult and childhood cancers. GDSC provides multiple types of molecular data for these cell lines in addition to response values for 265 anti-cancer drugs. In this work, we used database version GDSC1, which includes data for 987 cell lines curated between 2010 and 2015 [7]. Drug responses were measured as the natural log of the fitted IC50 value. The more sensitive the cell line, the lower the IC50 value for any given drug. We developed machine-learning models of drug response using DNA methylation data from GDSC1 that had been preprocessed and summarized as gene-level beta values [7]; these values ranged between 0 and 1 (higher values indicated relatively high methylation for a given gene). We used all available methylation regions, represented by gene-level summarized values, as input to the classification and regression algorithms.

For external validation, we used DNA methylation data and clinical drug-response values from TCGA. We selected eight drugs that were administered to TCGA patients and present in GDSC: Gefitinib, Cisplatin, Docetaxel, Doxorubicin, Etoposide, Gemcitabine, Paclitaxel, and Temozolomide. These drugs represent a variety of molecular mechanisms, including DNA crosslinking, microtubule stabilization, and pyrimidine antimetabolite activity. Aside from Gefitinib, which we used for model optimization on GDSC data, these drugs were associated with the largest number of patient drug-response values in TCGA [54]. GDSC provides DNA methylation values for 6,035 TCGA samples that had been preprocessed using the same pipeline as the GDSC samples. We obtained drug-response data for TCGA patients from [55].

Cell lines with missing IC50 values were excluded on a per-drug basis; thus, sample sizes differed across the drugs. We applied Z-score normalization on a per-gene basis across all samples in GDSC and TCGA. Next, we used ComBat [56] to adjust for systematic differences between the two datasets (GDSC and TCGA); we also specified cell type as a covariate to adjust for methylation patterns associated with this factor.
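
As a rough illustration of this preprocessing step, the per-gene standardization and batch adjustment could be carried out with the sva package along the following lines; the object names (meth, dataset, cell_type) are ours for illustration and are not taken from the original analysis code.

library(sva)

# Assumed inputs: `meth` is a genes-by-samples matrix of gene-level methylation
# beta values for the combined GDSC and TCGA samples; `dataset` labels each
# sample's source ("GDSC" or "TCGA"); `cell_type` gives its cancer/cell type.

# Z-score normalization on a per-gene basis (rows are genes, columns are samples).
meth_z <- t(scale(t(meth)))

# ComBat adjustment: treat the dataset of origin as the batch and supply
# cell type as a covariate so that cell-type-associated methylation is preserved.
mod <- model.matrix(~ cell_type)
meth_adj <- ComBat(dat = meth_z, batch = dataset, mod = mod)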

We started with a classification analysis. Classification algorithms are widely available, and their predictions are intuitive to interpret—they assign probabilities to each sample for each class. To enable classification for the GDSC cell lines, we discretized the IC50 values into "low" and "high" values. However, the choice of a threshold for distinguishing low and high values was necessarily arbitrary. Initially, we used the median IC50 value across all cell lines as a threshold. However, cell lines with an IC50 just above or below this threshold naturally showed very little difference in their drug responses, even though they were assigned to different classes. In contrast, cell lines with extreme IC50 values (far from the threshold) had much more distinct drug responses. To investigate the effects of using a threshold to discretize the IC50 values for classification, we used subsampling. We created 10 different scenarios that included increasing percentages of the overall data. First, we sorted the samples by IC50 value in ascending order. For the first scenario, we evaluated cell lines with the 5% lowest and 5% highest IC50 values (10% of the total data). In the next scenario, we evaluated cell lines with the 10% lowest and 10% highest IC50 values (20% of the total data), and so on. The last scenario included all the data, where the lowest 50% were considered to have low IC50 values and the highest 50% were considered to have high values (S1 Fig). For the regression analysis, we followed a similar process for subsampling but retained the continuous nature of the IC50 values.
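
A minimal sketch of this subsampling scheme for a single drug is shown below; the function and object names (make_scenario, ic50) are hypothetical, and the code simply illustrates the percentile-based selection described above.

make_scenario <- function(ic50, pct) {
  # `ic50` is a named vector of log-IC50 values for the cell lines tested with one drug.
  ord  <- order(ic50)                                   # sort cell lines by IC50, ascending
  n    <- floor(length(ic50) * pct)                     # cell lines kept on each side
  low  <- ord[seq_len(n)]                               # lowest IC50 values -> "low" class
  high <- ord[seq(length(ic50) - n + 1, length(ic50))]  # highest IC50 values -> "high" class
  data.frame(cell_line = names(ic50)[c(low, high)],
             ic50      = ic50[c(low, high)],
             class     = rep(c("low", "high"), each = n))
}

scenario_5  <- make_scenario(ic50, 0.05)  # +-5% scenario (10% of the data)
scenario_50 <- make_scenario(ic50, 0.50)  # +-50% scenario (all data, median split)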

For both classification and regression, we used the Random Forests (tree-based) [57], Support Vector Machines (kernel-based) [58], Gradient Boosting Machines (ensemble-based) [59], and k-Nearest Neighbors (distance-based) [60] algorithms. We used the Naïve Bayes (probability-based) [61] algorithm for classification but not for regression because this algorithm is only designed for classification analyses. We performed the analyses using the R programming language [62] and RStudio (https://rstudio.com). The machine-learning algorithms were implemented in the following R packages: mlr [63], e1071 [64], xgboost [65], randomForest [66], and kknn [67].

Using the GDSC cell-line data, we sought to select the best hyperparameters for each algorithm via nested cross validation. We used the mlr package [63] to randomly assign the cell lines to 10 outer folds and 5 inner folds (per outer fold). For each combination of algorithm and data-subsampling scenario, we evaluated the performance of all hyperparameter combinations (Table 1) using the inner folds; we used MMCE (Mean Misclassification Error) [68] for classification and MSE (Mean Squared Error) [69] for regression as evaluation metrics in the inner folds (the defaults in mlr). For the outer-fold predictions, we assessed performance for predicting drug responses using several performance metrics. This enabled us to evaluate how consistently the algorithms performed. For the classification analysis, we used accuracy (1 - MMCE), area under the receiver operating characteristic curve (AUC) [70], F1 measure [71], Matthews correlation coefficient (MCC) [72], recall, and specificity. For the regression analysis, we used Mean Absolute Error (MAE), Root Mean Square Error (RMSE) [69], the R-squared coefficient of determination [73], and Spearman’s rank correlation coefficient (SCC) [74].
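
The nested cross-validation procedure can be expressed with mlr roughly as in the sketch below. The hyperparameter values, the gdsc data frame, and the choice of the SVM learner are illustrative assumptions; Table 1 lists the grids we actually evaluated.

library(mlr)

# Assumed input: `gdsc` is a data frame of gene-level methylation values plus a
# discretized drug-response column `response` ("low"/"high") for one drug and scenario.
task <- makeClassifTask(data = gdsc, target = "response")

# Illustrative hyperparameter grid for the kernel-based learner.
ps <- makeParamSet(
  makeDiscreteParam("cost",   values = c(0.1, 1, 10)),
  makeDiscreteParam("kernel", values = c("linear", "radial"))
)

inner <- makeResampleDesc("CV", iters = 5)   # inner folds: hyperparameter tuning
outer <- makeResampleDesc("CV", iters = 10)  # outer folds: performance estimation

lrn <- makeTuneWrapper(
  makeLearner("classif.svm", predict.type = "prob"),
  resampling = inner, par.set = ps,
  control = makeTuneControlGrid(), measures = mmce
)

res <- resample(lrn, task, resampling = outer,
                measures = list(acc, auc, f1, mcc, tpr, tnr))
res$aggr  # aggregated outer-fold metrics (accuracy, AUC, F1, MCC, recall, specificity)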

Table 1. Descriptions of the algorithms we tested and hyperparameters that we evaluated via nested cross validation.

Hyperparameter optimization was performed for all tested algorithms. All parameter combinations for each algorithm were evaluated via nested cross validation; optimal combinations were then used for outer-fold predictions.

https://doi.org/10.1371/journal.pone.0238757.t001

After assessing the algorithms separately for the classification and regression approaches, we evaluated the predictive ability of these two types of tasks against one another. We calculated the Spearman correlation coefficient as a nonparametric measure of the concordance between the predicted probabilities (classification algorithms) and predicted IC50 values (regression algorithms).

For the classification and regression analyses, we used feature selection to identify genes deemed to be most informative. We performed an information-gain analysis, assigning an importance score to each feature (gene). More specifically, we estimated the relative importance of each gene based on the conditional entropy of the class variable with respect to that gene. Entropy measures the amount of randomness in the information. Thus, higher information gain implies lower entropy. This analysis was implemented using the FSelectorRcpp package [75]. To assess the functional relevance of the top-ranked genes, we used a gene-set overlap technique implemented in the Molecular Signatures Database 3.0 [76]. As candidate gene sets, we included the C2 (curated gene sets), C4 (computational gene sets), and C6 (oncogenic signature gene sets) collections. We used a False Discovery Rate q-value threshold of 0.05.
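
The information-gain ranking can be reproduced in outline with FSelectorRcpp, as in the sketch below; the data frame name (train) and the response column are assumptions for illustration.

library(FSelectorRcpp)

# Assumed input: `train` contains gene-level methylation values plus a discretized
# `response` column for one drug and subsampling scenario.
ig <- information_gain(response ~ ., data = train)

# The 20 top-ranked genes (higher information gain = more informative).
head(ig[order(ig$importance, decreasing = TRUE), ], 20)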

For additional validation, we trained classification models based on discretized drug responses in the GDSC cell lines and then predicted patient drug responses using tumor data from TCGA. These patient responses were based on clinical data, having no direct relation to IC50 values. Because the patient-response values were categorical in nature, we only performed classification for these data. We used nested cross validation to perform hyperparameter optimization using the GDSC (training) data. To evaluate the relationship between the predicted labels and actual clinical responses, we calculated Spearman’s rank correlation coefficient and a corresponding p-value for each combination of algorithm and data-subsampling scenario; then we used the Benjamini-Hochberg False Discovery Rate to adjust for multiple tests [77].
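
A minimal sketch of this evaluation step is shown below; the vectors pred_prob and clinical are assumed stand-ins for the model’s predicted response probabilities and the ordinal clinical responses (coded 1 for progressive disease through 4 for complete response).

# Spearman correlation between predicted probabilities and ordinal clinical responses.
ct <- cor.test(pred_prob, clinical, method = "spearman")
ct$estimate  # Spearman's rho
ct$p.value

# After collecting one p-value per combination of algorithm, drug, and scenario:
# fdr <- p.adjust(p_values, method = "BH")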

Results

First, using data from 987 cell lines, we applied machine-learning algorithms to evaluate the potential to predict cytotoxic responses based on genome-wide DNA methylation profiles. Second, we examined which genes were most predictive of these responses. Finally, we evaluated the feasibility of predicting clinical responses in humans based on models derived from cell-line data.

Classification analysis using cell-line data

We collected DNA methylation data and IC50 response values for eight drugs from the GDSC repository. In our initial analysis, we aimed to predict categories (classes) of drug sensitivity. These categories indicated whether each cell line exhibited a relatively "low" or "high" IC50 value for each drug (lower IC50 values indicate greater drug sensitivity). This categorization facilitated a simplified yet intuitive interpretation of the treatment outcomes and enabled us to use classification algorithms, which have been implemented for a broader range of algorithmic methodologies than regression algorithms.

Before performing classification, we categorized each cell line on a per-drug basis, according to whether its IC50 value was greater than the median across all cell lines. One limitation of categorizing the cell lines in this way was that cell lines just above or below the median threshold showed a relatively small difference in IC50 values, even though they were assigned to different classes. Generally, IC50 values did not follow a multimodal distribution (Fig 1). Therefore, we evaluated whether classification performance could be improved by excluding cell lines with an IC50 value relatively close to the median, even though this would reduce the amount of data available for training and testing. We evaluated ten scenarios that varied the number of cell lines used. In the most extreme scenario, we used methylation data for cell lines with the 5% lowest and 5% highest IC50 values. In describing these subsampling scenarios, we use a notation that indicates the percentage of samples on each side of the distribution as well as the algorithm type. For example, when we analyzed the samples with the 5% highest and 5% lowest IC50 values and employed a classification algorithm, we indicate this using "+-5%c". The equivalent scenario for regression is represented as "+-5%r".

Fig 1. Histograms for each drug based on drug response (IC50 values) for the GDSC dataset.

The black line represents the median IC50 value across all available cell lines for each drug.

https://doi.org/10.1371/journal.pone.0238757.g001

We evaluated the performance of five classification algorithms using six performance metrics (see Methods). In addition, we optimized hyperparameters via nested cross validation; Table 1 lists the hyperparameters we evaluated. Initially, we evaluated Gefitinib, an EGFR inhibitor. Overall, the algorithms performed best when relatively few cell lines (+-5%c and +-10%c) were used to train and test the models, attaining area-under-the-receiver-operating-characteristic-curve (AUC) and classification-accuracy values as high as 0.93 and 0.84, respectively (Table 2). This pattern was consistent across all five algorithms and all six metrics that we evaluated (Fig 2). However, the SVM algorithm consistently achieved higher classification performance than the other algorithms for this drug.

Fig 2. Gefitinib classification results across six metrics.

These "spider" graphs illustrate how each classification algorithm performed in each subsampling scenario via cross validation on the GDSC cell-line data. Results that are further away from the center represent higher metric values (relatively better performance) than results closer to it. These metrics are accuracy (ACC), specificity, recall, Matthews correlation coefficient (MCC), F1 score (F1) and area under the receiver operating characteristic curve (AUC). Scenarios that used relatively few cell lines—but those with the most extreme IC50 values—performed best for all algorithms. Specific metric values may be found in Table 2.

https://doi.org/10.1371/journal.pone.0238757.g002

Table 2. Classification results for all subsampling scenarios and algorithms for Gefitinib.

https://doi.org/10.1371/journal.pone.0238757.t002

When evaluating the seven remaining drugs, we continued to see a trend in which using a relatively small proportion of the data resulted in better classification performance. For Cisplatin, Docetaxel, Doxorubicin, and Etoposide, the best performance was attained for +-5%c and +-10%c, and the best-performing algorithms were always SVM or Random Forests (RF) (S1–S7 Tables). In contrast, for Gemcitabine, the highest AUC value (0.82) was obtained for +-15%c (SVM algorithm). For Paclitaxel, the Random Forests algorithm performed best for +-10%c (AUC = 0.75). The overall highest AUC value was attained for Docetaxel (0.97, +-10%c, Random Forests and SVM). S2–S8 Figs illustrate these results across all algorithms, metrics, and drugs and show that generally the top-performing algorithms were consistent across all metrics, although these patterns were less consistent in scenarios where the highest AUC values were lower than 0.80.

To further analyze combinations of subsampling scenarios and classification algorithms, we ranked the AUC values for all combinations and for each drug (where the lowest rank was considered best and represented the highest AUC value). Subsequently, we calculated the average AUC rank across all drugs. The best performance was attained for +-10%c (SVM) and +-10%c (Random Forests), achieving average ranks of 4.75 and 5.13, respectively (Table 3). When we evaluated the minimum, mean, and maximum AUC values for each combination of drug and algorithm, Docetaxel attained the best overall performance (Table 4).
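
The ranking procedure amounts to ranking each scenario/algorithm combination within each drug and then averaging the ranks across drugs, sketched below with an assumed AUC matrix (auc_mat) whose rows are combinations and whose columns are drugs.

# Rank combinations within each drug (rank 1 = highest AUC), then average across drugs.
rank_mat <- apply(-auc_mat, 2, rank)
avg_rank <- rowMeans(rank_mat)
sort(avg_rank)[1:5]  # combinations with the best (lowest) average rank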

Table 3. Summary of AUC values across all combinations of subsampling scenario and algorithm.

We ranked the AUC values for each combination and then calculated the average rank across the combinations (lower ranks imply better performance). In addition, this table lists the minimum, maximum, and standard deviation of the AUC values across the combinations.

https://doi.org/10.1371/journal.pone.0238757.t003

Table 4. Minimum, mean and maximum AUC value for each combination of drug and algorithm, averaged across all subsampling scenarios.

https://doi.org/10.1371/journal.pone.0238757.t004

Regression analysis using cell-line data

We performed a regression analysis using the same DNA methylation data but with continuous IC50 response values for the same eight drugs. For this analysis, we applied four regression algorithms and evaluated their performance using nested cross validation and four performance metrics (RMSE, MAE, R-squared, and SCC). As with the classification analysis, we performed data subsampling to evaluate the effects of using relatively extreme IC50 values. For Gefitinib and the MAE and RMSE metrics, all algorithms performed best when all cell lines were used to train and test the models, attaining RMSE values as low as 0.95 (lower is better; see Table 5). However, for the R-squared and SCC metrics, the +-5%r subsampling scenario resulted in the best performance in some cases. Typically, the magnitude of the differences between the original and predicted IC50 values was larger toward the extremes, resulting in relatively high MAE and RMSE values when middle values were excluded. In contrast, SCC is a rank-based metric, and the algorithms struggled most to differentiate between IC50 values toward the middle of the distribution. We observed similar patterns for the other seven drugs (S8–S14 Tables).

Table 5. Regression results for all combinations of subsampling scenarios and algorithms for Gefitinib.

https://doi.org/10.1371/journal.pone.0238757.t005

The SVM and Random Forests algorithms performed best for every combination of drug and performance metric (Fig 3). Furthermore, predictive performance was highly consistent across all metrics (S9–S15 Figs). When evaluating the mean ranked RMSE values (where the lowest rank was considered best and represented the lowest RMSE value), the RF and SVM algorithms and the +-50%r scenarios performed best (Table 6), and predictions for Temozolomide were more accurate overall than those for other drugs (Table 7).

Fig 3. Gefitinib regression results across four metrics.

These "spider" graphs illustrate how each regression algorithm performed in each subsampling scenario via cross validation on the GDSC cell-line data. Results that are further away from the center represent higher metric values (relatively better performance) than results closer to it. These metrics are RMSE (Root Mean Square Error), MAE (Mean Absolute Error), R-squared and Spearman correlation coefficient. Scenarios that used all cell lines performed best for all algorithms. Specific metric values may be found in Table 5.

https://doi.org/10.1371/journal.pone.0238757.g003

Table 6. Average RMSE rank for all combinations of subsampling scenarios and algorithms.

RMSE values were ranked for each drug and were then averaged. Lower ranks imply a better result. We also include standard deviation and the minimum and maximum RMSE values.

https://doi.org/10.1371/journal.pone.0238757.t006

Table 7. Minimum, mean and maximum RMSE value for each drug and algorithm combination, averaged across all subsampling scenarios.

https://doi.org/10.1371/journal.pone.0238757.t007

Classification and regression evaluation

As a way to compare the predictions of the classification and regression algorithms, we used SCC as a nonparametric measure. For the classification algorithms, we calculated the SCC between the probabilistic predictions that these algorithms produced and the original IC50 values. For the regression algorithms, we used the SCC values that quantified the correlation between the predicted and actual IC50 values. Then, for each combination of subsampling scenario and drug, we compared the SCC values from the classification and regression versions of the same algorithm against each other (Fig 4). These coefficients were strongly correlated with each other, illustrating that the classification and regression algorithms typically ranked the cell lines similarly in relation to the original IC50 values.

Fig 4. Spearman correlation coefficient results for classification algorithms (predicted probabilities) and regression algorithms (predicted IC50 values).

For the classification analyses, we calculated the Spearman correlation coefficient between the predicted probabilities and the original IC50 values. These are represented on the x-axis. The y-axis represents the Spearman coefficients from the regression analyses. Each dot reflects results for a particular combination of drug, subsampling scenario, and algorithm.

https://doi.org/10.1371/journal.pone.0238757.g004

Informative genes for predicting cell-line responses

The DNA methylation assays target CpG islands associated with genes across the genome. After identifying analysis scenarios that resulted in optimal performance for classification and regression, we used feature ranking to identify genes that were most informative in these scenarios. For the classification analysis, we focused on the +-5%c scenario. For the regression analysis, we focused on the +-50%r scenario. Table 8 lists the 20 top-ranked genes for Gefitinib. The CTGF gene was ranked 1st for the classification analysis and 13th for the regression analysis. The CTGF protein plays important roles in signaling pathways that control tissue remodeling via cellular adhesion, extracellular matrix deposition, and myofibroblast activation [78]; these processes are known to influence tumorigenesis and may alter drug responses [79]. For example, EGFR is expressed in many head and neck squamous cell carcinomas and non-small cell lung carcinomas, yet many of these patients do not respond to Gefitinib treatment [80]. This lack of response has been associated with a loss of cell-cell adhesion, elongation of cells, and tumor-cell invasion of the extracellular matrix [81–83]. F11R was ranked 2nd in importance for the classification analysis and 17th for the regression analysis. The protein encoded by this gene is a junctional adhesion molecule that regulates the integrity of tight junctions and permeability [84]. Although these associations provide some support for our feature-ranking results and suggest that adhesion processes are important to Gefitinib responses, none of the other top-20 genes overlapped between the classification and regression analyses. This lack of agreement is not surprising. First, even though the Random Forests algorithm uses a similar methodology for classification and regression, different genes may still be selected for each task: we used data for thousands of genes, and different genes may exhibit similar methylation patterns, so the algorithms may choose different (correlated) genes by random chance. Second, the algorithms optimized against different objective functions for classification versus regression; even small differences in how the algorithms prioritized genes could lead to large differences in the gene ranks. However, the SVM and RF models represent multivariate patterns; thus, known cancer genes may alter drug responses in combination with the genes identified via our univariate feature-selection approach, even if they are not among the top-ranked genes.

Table 8. Most informative genes for predicting cell-line responses for Gefitinib.

We used an information-gain analysis to rank genes based on their association with Gefitinib drug response; higher scores indicate greater informativeness. Genomic coordinates are based on build 37 of the human genome.

https://doi.org/10.1371/journal.pone.0238757.t008

S15–S21 Tables indicate the top-20 ranked genes for the other seven drugs. To gain insight regarding the roles that these genes might play in drug responses, we identified gene sets (e.g., pathways, oncogenic signatures) that significantly overlapped with these genes (S22 and S23 Tables). For the classification analysis, we identified significant gene sets for five drugs (Gefitinib, Cisplatin, Docetaxel, Doxorubicin, Etoposide). Many of these gene sets are associated with cell differentiation, cell-cell communication, and drug resistance; however, these mechanisms did not always align with the respective drugs or target proteins that we expected based on the drugs’ known mechanisms. We observed similar patterns for the regression analysis. Two perhaps notable findings are that 1) a gene set associated with EGFR overexpression was associated with Gefitinib responses (this drug targets EGFR) and 2) a gene set associated with Gefitinib resistance was associated with Cisplatin responses; it has been shown that Cisplatin’s ability to induce cell death depends in part on EGFR signaling in some cases [85].

Using methylation profiles from cell lines to predict tumor/patient drug responses

The above analyses used methylation profiles to predict drug responses in cell lines. Via cross validation, we showed that high levels of predictive accuracy are attainable using this approach. We also found that subsampled datasets with more extreme IC50 values yielded the best classification results and that the SVM and Random Forests algorithms typically produced the most accurate results. Next we evaluated whether this performance would hold true in a translational-medicine context. The GDSC repository provides methylation profiles for 6,035 tumors from TCGA; these data had been preprocessed using the same methodology as the GDSC samples, thus enabling easier integration and reducing technical biases. For 1,638 TCGA patients, clinical drug-response information was available. These data indicate clinical outcomes over the course of the patients’ treatment by physicians (not as part of clinical trials). In many cases, drug-response values for multiple drugs were recorded for a given patient. Each response value was categorized as "clinical progressive disease," "stable disease," "partial response," or "complete response". These respective categories represent increasing levels of response to a given drug.
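
Because these four categories are ordered, they can be encoded as an ordinal variable before computing rank correlations; a minimal sketch (with an assumed data frame tcga_response) follows.

response_levels <- c("clinical progressive disease", "stable disease",
                     "partial response", "complete response")
clinical <- factor(tcga_response$response, levels = response_levels, ordered = TRUE)
as.numeric(clinical)  # 1 = progressive disease ... 4 = complete response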

We trained the SVM and Random Forests classification algorithms on the full GDSC dataset and predicted drug-response categories for each TCGA patient for which methylation and drug-response data were available. Based on our cross-validation results from the GDSC analysis, we focused on the +-5%c and +-10%c scenarios. For each TCGA test sample, our models generated a probabilistic prediction indicating whether that patient would respond to a given drug. We compared these predictions against the ordinal clinical responses for each combination of subsampling scenario (+-5%c and +-10%c), drug, and algorithm (SVM and RF); we calculated the SCC and a corresponding p-value for each comparison and adjusted for multiple tests. Generally, the predictions exhibited low correlation with clinical responses (Table 9). However, the predictions for lower-grade glioma patients who had been treated with Temozolomide were relatively strongly correlated with clinical responses (rho = 0.372; FDR = 0.014), though this result was specific to the Random Forests algorithm and the +-5%c scenario (Fig 5). Temozolomide, an oral alkylating agent, is commonly used to treat lower-grade glioma patients and may reduce seizures and improve prognosis [10].

Fig 5. Predicting patient drug response from cell-line methylation profiles for Temozolomide (n = 85).

For each TCGA test sample, we used classification models from the GDSC data (+-5%c Random Forest) to generate probabilistic predictions of drug response.

https://doi.org/10.1371/journal.pone.0238757.g005

Table 9. Correlation between predicted drug responses based on GDSC cell lines and recorded clinical responses in TCGA patients for selected combinations of subsampling scenarios and algorithms across all drugs.

We treated the clinical drug responses as an ordinal variable and used the Spearman rank correlation coefficient to assess the extent to which the predicted responses correlated with the clinical responses.

https://doi.org/10.1371/journal.pone.0238757.t009

Discussion

In an ideal setting, patient data would be used to train predictive models for clinical drug responses directly, as these data may accurately reflect tumor behavior in patients. Environmental factors, the tumor microenvironment, co-existing conditions, and a variety of other factors can affect a tumor’s behavior in ways that may not be accounted for in preclinical studies. However, acquiring drug-response data directly from human patients may require conducting many experimental tests on a given patient, which could be unethical, harmful, and subject to many confounding factors. In addition, patients are typically assigned standard-of-care protocols based on their specific cancer type. As a result, experimental drug-response data for large patient cohorts are scarcely available. An alternative approach is to use preclinical samples to identify molecular signatures of drug response and later use those signatures to predict clinical drug responses in patients.

Cell lines serve as preclinical models for drug development. Being able to accurately predict drug responses for a given cell line based on molecular features may help in optimizing drug-development pipelines and explain mechanisms behind treatment responses. We focused on DNA methylation profiles as one type of molecular feature that is known to drive tumorigenesis and modulate treatment responses [47]. When using classification or regression algorithms to predict discrete or continuous responses, respectively, we consistently observed excellent predictive performance when the training and test sets both consisted of cell-line data. Although conventional wisdom advises against discretizing a continuous response variable where possible, due to the loss of information, we wished to evaluate the potential to make effective predictions in this scenario, in part because clinical treatment responses are sometimes represented as discrete values.

Of note, this study focuses primarily on evaluating the effect of subsampling on model performance rather than on introducing new algorithms. Using subsampling, we observed that classification performance generally improved as more extreme examples were used for training and testing, whereas the opposite was often true for the regression analyses. This suggests that during regression, the algorithms benefitted from seeing examples across a diverse range of IC50 values for a given drug, whereas the classification algorithms were confounded by seeing examples with relatively similar drug responses, even though sample sizes were smaller. However, again we note that the regression results often differed depending on the evaluation metric used. These results have potential financial implications: if researchers can identify cell lines that are extreme responders for a particular drug, they may only need to generate costly molecular profiles for those cell lines. Future research may elucidate whether this finding generalizes to other types of molecular data and other drugs.

Previous efforts to associate DNA methylation levels with drug responses include work from Shen et al. (2007) [86], who quantified methylation for 32 CpG islands in the NCI-60 cell lines, creating a sensitivity database for ~30,000 drugs and identifying biomarkers that predict drug sensitivity. In contrast, our work uses microarray data to quantify methylation levels for thousands of genes across 987 cell lines but for fewer drugs. Rather than searching for individual genes that predict drug sensitivity, we constructed predictive models that represent patterns spanning as many as thousands of genes. Such an approach may better represent complex interactions among genes and thus yield improved predictive power, but a tradeoff is reduced model interpretability. We sought to shed some insight into the biological mechanisms that influence drug responses via feature selection, but methods for deriving such insights from genome-wide data are still in their infancy. Recent work using mathematical optimization models shows promise as a way to integrate molecular data from cell lines with drug-sensitivity information to infer resistance mechanisms [87, 88].

A variety of computational methods have been proposed to predict drug responses for cell lines based on molecular data. Classical algorithms like decision trees and support vector machines have been used to predict the clinical efficacy of anti-cancer drugs and to classify drug responses [44, 89–93]. Neural networks [36] and deep neural networks [43] have been used to predict drug response based on genomic profiles from cell lines. Other techniques have included elastic net regression [44, 92, 94], linear ridge regression [45], and LASSO regression [54]. Alternative approaches based on computational linear algebra or network structures have also been applied to infer drug response in cell lines; these include matrix factorization [95], matrix completion [96], and link prediction [97] methods. Finally, a community-based competition assessed the ability to predict therapeutic responses in cell lines using 44 regression-based algorithms [17]. In our study, we used diverse algorithms, but our primary focus was data subsampling and evaluating the potential to make accurate predictions of drug response in cell lines using relatively extreme responders, rather than introducing new algorithms.

We attempted to predict clinical responses for patients from TCGA, but the accuracy of these predictions was typically poor. Integrating datasets can introduce batch effects [98] and other systematic biases; we attempted to mitigate these biases using data that had been preprocessed identically for GDSC and TCGA and using an empirical Bayesian method. However, subtle differences in the way biological samples are handled and processed in the lab can make generalization difficult to achieve. Furthermore, inherent differences between cell lines and tumors may confound such predictions. Cell lines are grown in a controlled environment, and the cells are relatively homogeneous, whereas tumor samples are a heterogeneous milieu of cells. In addition, TCGA tumor responses were based on clinical observations, so there was no direct mapping between these measurements and IC50 values for the cell lines. Furthermore, our approach to quantifying predictive performance was different for the GDSC cross-validation analysis compared to the TCGA training/testing analysis. In the former, the class variable represented two possible outcomes (response and non-response). In the latter, the class variable was ordinal. Yet another challenge was that we used cell lines from all available cell types in GDSC. Better accuracy might be attained when training and testing on a single cell type; however, larger sample sizes would be necessary.

Our study has additional limitations that could be addressed in future research. For one, we focused on DNA methylation profiles in isolation, but other types of molecular features likely modulate treatment responses. A number of cell-line studies have used gene-expression profiles to predict drug responses, and future studies could evaluate the potential benefits of incorporating more than one type of molecular feature into response-prediction models. The treatment-response data were often imbalanced, meaning that not all response classes included similar numbers of patients. Hence, additional work could analyze the effect of class imbalance on model performance. Finally, we adjusted the methylation data for dataset and cell type using an empirical Bayesian framework. However, as few as 2–3 samples were available for some of the cell types, so the correction method may have had difficulty adjusting based on such small numbers of examples.

Conclusion

We applied machine-learning algorithms to predict cytotoxic responses for eight anti-cancer drugs using genome-wide DNA methylation profiles from 987 cell lines from the Genomics of Drug Sensitivity in Cancer (GDSC) database. We then compared the performance of the classification and regression algorithms and evaluated the effect of sample size on model performance by artificially subsampling the data to varying degrees. The classification algorithms performed best when relatively few cell lines were used to train and test the models, attaining AUC values as high as 0.97. In contrast, the regression algorithms typically performed best when all cell lines were used to train and test the models, though this result depended on the evaluation metric used. For additional validation, we evaluated our ability to train a model based on drug responses in the GDSC cell lines and then accurately predict patient drug responses using data from The Cancer Genome Atlas (TCGA). Because patient-response values are categorical in nature, we only performed classification for these data. In most cases, classification algorithms trained on the full GDSC dataset to predict drug-response categories for TCGA patients were unable to identify patterns in the cell-line methylation data that translated to patient responses.

Supporting information

S1 Fig. Example of subsampling process.

When performing classification, we discretized drug-response (IC50) values. To evaluate alternative thresholds for discretization, we performed a subsampling analysis. In Scenario 1, illustrated above, we considered the cell lines with the lowest and highest 5% of IC50 values. In Scenario 2, we considered the cell lines with the lowest and highest 10% of IC50 values. Each scenario used 10% more data than the previous scenario (5% on each side). This pattern continued until all data were considered in the analysis.

https://doi.org/10.1371/journal.pone.0238757.s001

(TIF)

S2 Fig. Graphs for Cisplatin classification analysis.

The graphs compare different scenarios ranked in order of best result. GDSC cell-line data were used to generate ten subsampling scenarios, which we then tested via nested cross validation. Scenarios that are further away from the center represent higher metric values than scenarios closer to it. The evaluated metrics for each algorithm are accuracy (ACC), specificity, recall, Matthews correlation coefficient (MCC), F1 score (F1) and area under the receiver operating characteristic curve (AUC).

https://doi.org/10.1371/journal.pone.0238757.s002

(TIF)

S3 Fig. Graphs for Docetaxel classification analysis.

The graphs compare different scenarios ranked in order of best result. GDSC cell-line data were used to generate ten subsampling scenarios, which we then tested via nested cross validation. Scenarios that are further away from the center represent higher metric values than scenarios closer to it. The evaluated metrics for each algorithm are accuracy (ACC), specificity, recall, Matthews correlation coefficient (MCC), F1 score (F1) and area under the receiver operating characteristic curve (AUC).

https://doi.org/10.1371/journal.pone.0238757.s003

(TIF)

S4 Fig. Graphs for Doxorubicin classification analysis.

The graphs compare different scenarios ranked in order of best result. GDSC cell-line data were used to generate ten subsampling scenarios, which we then tested via nested cross validation. Scenarios that are further away from the center represent higher metric values than scenarios closer to it. The evaluated metrics for each algorithm are accuracy (ACC), specificity, recall, Matthews correlation coefficient (MCC), F1 score (F1) and area under the receiver operating characteristic curve (AUC).

https://doi.org/10.1371/journal.pone.0238757.s004

(TIF)

S5 Fig. Graphs for Etoposide classification analysis.

The graphs compare different scenarios ranked in order of best result. GDSC cell-line data were used to generate ten subsampling scenarios, which we then tested via nested cross validation. Scenarios that are further away from the center represent higher metric values than scenarios closer to it. The evaluated metrics for each algorithm are accuracy (ACC), specificity, recall, Matthews correlation coefficient (MCC), F1 score (F1) and area under the receiver operating characteristic curve (AUC).

https://doi.org/10.1371/journal.pone.0238757.s005

(TIF)

S6 Fig. Graphs for Gemcitabine classification analysis.

The graphs compare different scenarios ranked in order of best result. GDSC cell-line data were used to generate ten subsampling scenarios, which we then tested via nested cross validation. Scenarios that are further away from the center represent higher metric values than scenarios closer to it. The evaluated metrics for each algorithm are accuracy (ACC), specificity, recall, Matthews correlation coefficient (MCC), F1 score (F1) and area under the receiver operating characteristic curve (AUC).

https://doi.org/10.1371/journal.pone.0238757.s006

(TIF)

S7 Fig. Graphs for Paclitaxel classification analysis.

The graphs compare different scenarios ranked in order of best result. GDSC cell-line data were used to generate ten subsampling scenarios, which we then tested via nested cross validation. Scenarios that are further away from the center represent higher metric values than scenarios closer to it. The evaluated metrics for each algorithm are accuracy (ACC), specificity, recall, Matthews correlation coefficient (MCC), F1 score (F1) and area under the receiver operating characteristic curve (AUC).

https://doi.org/10.1371/journal.pone.0238757.s007

(TIF)

S8 Fig. Graphs for Temozolomide classification analysis.

The graphs compare different scenarios ranked in order of best result. GDSC cell-line data were used to generate ten subsampling scenarios, which we then tested via nested cross validation. Scenarios that are further away from the center represent higher metric values than scenarios closer to it. The evaluated metrics for each algorithm are accuracy (ACC), specificity, recall, Matthews correlation coefficient (MCC), F1 score (F1) and area under the receiver operating characteristic curve (AUC).

https://doi.org/10.1371/journal.pone.0238757.s008

(TIF)

S9 Fig. Graphs for Cisplatin regression analysis.

We used DNA methylation data from cell lines to predict continuous IC50 response values using four regression algorithms. We evaluated the algorithms’ performance via nested cross validation for ten subsampling scenarios. Graphs illustrate performance for these scenarios, ranked in order of relative performance for four metrics: RMSE (Root Mean Square Error), MAE (Mean Absolute Error), R-squared and Spearman correlation coefficient. Scenarios further away from the center represent relatively low metric values (and thus better performance). Scenarios that used all cell lines performed best for all algorithms.

https://doi.org/10.1371/journal.pone.0238757.s009

(TIF)

S10 Fig. Graphs for Docetaxel regression analysis.

We used DNA methylation data from cell lines to predict continuous IC50 response values using four regression algorithms. We evaluated the algorithms’ performance via nested cross validation for ten subsampling scenarios. Graphs illustrate performance for these scenarios, ranked in order of relative performance for four metrics: RMSE (Root Mean Square Error), MAE (Mean Absolute Error), R-squared and Spearman correlation coefficient. Scenarios further away from the center represent relatively low metric values (and thus better performance). Scenarios that used all cell lines performed best for all algorithms.

https://doi.org/10.1371/journal.pone.0238757.s010

(TIF)

S11 Fig. Graphs for Doxorubicin regression analysis.

We used DNA methylation data from cell lines to predict continuous IC50 response values using four regression algorithms. We evaluated the algorithms’ performance via nested cross validation for ten subsampling scenarios. Graphs illustrate performance for these scenarios, ranked in order of relative performance for four metrics: RMSE (Root Mean Square Error), MAE (Mean Absolute Error), R-squared and Spearman correlation coefficient. Scenarios further away from the center represent relatively low metric values (and thus better performance). Scenarios that used all cell lines performed best for all algorithms.

https://doi.org/10.1371/journal.pone.0238757.s011

(TIF)

S12 Fig. Graphs for Etoposide regression analysis.

We used DNA methylation data from cell lines to predict continuous IC50 response values using four regression algorithms. We evaluated the algorithms’ performance via nested cross validation for ten subsampling scenarios. Graphs illustrate performance for these scenarios, ranked in order of relative performance for four metrics: RMSE (Root Mean Square Error), MAE (Mean Absolute Error), R-squared and Spearman correlation coefficient. Scenarios further away from the center represent relatively low metric values (and thus better performance). Scenarios that used all cell lines performed best for all algorithms.

https://doi.org/10.1371/journal.pone.0238757.s012

(TIF)

S13 Fig. Graphs for Gemcitabine regression analysis.

We used DNA methylation data from cell lines to predict continuous IC50 response values using four regression algorithms. We evaluated the algorithms’ performance via nested cross validation for ten subsampling scenarios. Graphs illustrate performance for these scenarios, ranked in order of relative performance for four metrics: RMSE (Root Mean Square Error), MAE (Mean Absolute Error), R-squared and Spearman correlation coefficient. Scenarios further away from the center represent relatively low metric values (and thus better performance). Scenarios that used all cell lines performed best for all algorithms.

https://doi.org/10.1371/journal.pone.0238757.s013

(TIF)

S14 Fig. Graphs for Paclitaxel regression analysis.

We used DNA methylation data from cell lines to predict continuous IC50 response values using four regression algorithms. We evaluated the algorithms’ performance via nested cross validation for ten subsampling scenarios. Graphs illustrate performance for these scenarios, ranked in order of relative performance for four metrics: RMSE (Root Mean Square Error), MAE (Mean Absolute Error), R-squared and Spearman correlation coefficient. Scenarios further away from the center represent relatively low metric values (and thus better performance). Scenarios that used all cell lines performed best for all algorithms.

https://doi.org/10.1371/journal.pone.0238757.s014

(TIF)

S15 Fig. Graphs for Temozolomide regression analysis.

We used DNA methylation data from cell lines to predict continuous IC50 response values using four regression algorithms. We evaluated the algorithms’ performance via nested cross validation for ten subsampling scenarios. Graphs illustrate performance for these scenarios, ranked in order of relative performance for four metrics: RMSE (Root Mean Square Error), MAE (Mean Absolute Error), R-squared and Spearman correlation coefficient. Scenarios further away from the center represent relatively low metric values (and thus better performance). Scenarios that used all cell lines performed best for all algorithms.

https://doi.org/10.1371/journal.pone.0238757.s015

(TIF)

S1 Table. Classification results for all combinations of subsampling scenarios and algorithms for Cisplatin.

Bold font indicates the best-performing combination for each metric.

https://doi.org/10.1371/journal.pone.0238757.s016

(DOCX)

S2 Table. Classification results for all combinations of subsampling scenarios and algorithms for Docetaxel.

Bold font indicates the best-performing combination for each metric.

https://doi.org/10.1371/journal.pone.0238757.s017

(DOCX)

S3 Table. Classification results for all combinations of subsampling scenarios and algorithms for Doxorubicin.

Bold font indicates the best-performing combination for each metric.

https://doi.org/10.1371/journal.pone.0238757.s018

(DOCX)

S4 Table. Classification results for all combinations of subsampling scenarios and algorithms for Etoposide.

Bold font indicates the best-performing combination for each metric.

https://doi.org/10.1371/journal.pone.0238757.s019

(DOCX)

S5 Table. Classification results for all combinations of subsampling scenarios and algorithms for Gemcitabine.

Bold font indicates the best-performing combination for each metric.

https://doi.org/10.1371/journal.pone.0238757.s020

(DOCX)

S6 Table. Classification results for all combinations of subsampling scenarios and algorithms for Paclitaxel.

Bold font indicates the best-performing combination for each metric.

https://doi.org/10.1371/journal.pone.0238757.s021

(DOCX)

S7 Table. Classification results for all combinations of subsampling scenarios and algorithms for Temozolomide.

Bold font indicates the best-performing combination for each metric.

https://doi.org/10.1371/journal.pone.0238757.s022

(DOCX)

S8 Table. Regression results for all combinations of subsampling scenarios and algorithms for Cisplatin.

Bold font indicates the best-performing combination for each metric.

https://doi.org/10.1371/journal.pone.0238757.s023

(DOCX)

S9 Table. Regression results for all combinations of subsampling scenarios and algorithms for Docetaxel.

Bold font indicates the best-performing combination for each metric.

https://doi.org/10.1371/journal.pone.0238757.s024

(DOCX)

S10 Table. Regression results for all combinations of subsampling scenarios and algorithms for Doxorubicin.

Bold font indicates the best-performing combination for each metric.

https://doi.org/10.1371/journal.pone.0238757.s025

(DOCX)

S11 Table. Regression results for all combinations of subsampling scenarios and algorithms for Etoposide.

Bold font indicates the best-performing combination for each metric.

https://doi.org/10.1371/journal.pone.0238757.s026

(DOCX)

S12 Table. Regression results for all combinations of subsampling scenarios and algorithms for Gemcitabine.

Bold font indicates the best-performing combination for each metric.

https://doi.org/10.1371/journal.pone.0238757.s027

(DOCX)

S13 Table. Regression results for all combinations of subsampling scenarios and algorithms for Paclitaxel.

Bold font indicates the best-performing combination for each metric.

https://doi.org/10.1371/journal.pone.0238757.s028

(DOCX)

S14 Table. Regression results for all combinations of subsampling scenarios and algorithms for Temozolomide.

Bold font indicates the best-performing combination for each metric.

https://doi.org/10.1371/journal.pone.0238757.s029

(DOCX)

S15 Table. Informative genes for predicting cell-line responses for Cisplatin.

We used feature selection to identify informative genes for Cisplatin drug-response prediction. Genomic coordinates are based on build 37 of the human genome. We used information gain to rank the genes; a higher score indicates a more informative gene.

https://doi.org/10.1371/journal.pone.0238757.s030

(DOCX)

S16 Table. Informative genes for predicting cell-line responses for Docetaxel.

We used feature selection to identify informative genes for Docetaxel drug-response prediction. Genomic coordinates are based on build 37 of the human genome. We used information gain to rank the genes; a higher score indicates a more informative gene.

https://doi.org/10.1371/journal.pone.0238757.s031

(DOCX)

S17 Table. Informative genes for predicting cell-line responses for Doxorubicin.

We used feature selection to identify informative genes for Doxorubicin drug-response prediction. Genomic coordinates are based on build 37 of the human genome. We used information gain to rank the genes; a higher score indicates a more informative gene.

https://doi.org/10.1371/journal.pone.0238757.s032

(DOCX)

S18 Table. Informative genes for predicting cell-line responses for Etoposide.

We used feature selection to identify informative genes for Etoposide drug-response prediction. Genomic coordinates are based on build 37 of the human genome. We used information gain to rank the genes; a higher score indicates a more informative gene.

https://doi.org/10.1371/journal.pone.0238757.s033

(DOCX)

S19 Table. Informative genes for predicting cell-line responses for Gemcitabine.

We used feature selection to identify informative genes for Gemcitabine drug-response prediction. Genomic coordinates are based on build 37 of the human genome. We used information gain to rank the genes; a higher score indicates a more informative gene.

https://doi.org/10.1371/journal.pone.0238757.s034

(DOCX)

S20 Table. Informative genes for predicting cell-line responses for Paclitaxel.

We used feature selection to identify informative genes for Paclitaxel drug-response prediction. Genomic coordinates are based on build 37 of the human genome. We used information gain to rank the genes; a higher score indicates a more informative gene.

https://doi.org/10.1371/journal.pone.0238757.s035

(DOCX)

S21 Table. Informative genes for predicting cell-line responses for Temozolomide.

We used feature selection to identify informative genes for Temozolomide drug-response prediction. Genomic coordinates are based on build 37 of the human genome. We used information gain to rank the genes; a higher score indicates a more informative gene. A minimal sketch illustrating an information-gain calculation appears below.

https://doi.org/10.1371/journal.pone.0238757.s036

(DOCX)
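
The captions above state that genes were ranked by information gain. As an illustration only (simulated data, not the exact pipeline used for these tables), the following base-R sketch discretizes one gene's methylation beta values and computes its information gain with respect to a binary drug-response label; in practice, a dedicated package such as FSelectorRcpp can apply this calculation across all genes.

```r
# Minimal sketch (simulated data): information gain of a single gene's
# methylation beta values with respect to a binary drug-response label.
# Information gain = H(response) - H(response | binned methylation);
# a higher score indicates a more informative gene.

entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]                          # ignore empty classes to avoid 0 * log(0)
  -sum(p * log2(p))
}

information_gain <- function(feature, labels, bins = 5) {
  binned <- cut(feature, breaks = bins)              # discretize beta values
  groups <- split(labels, binned, drop = TRUE)       # drop empty bins
  h_cond <- sum(sapply(groups, function(grp)
    length(grp) / length(labels) * entropy(grp)))    # weighted conditional entropy
  entropy(labels) - h_cond
}

set.seed(1)
response    <- factor(rep(c("sensitive", "resistant"), each = 50))   # hypothetical labels
methylation <- ifelse(response == "sensitive",
                      rbeta(100, 2, 5), rbeta(100, 5, 2))            # hypothetical beta values

information_gain(methylation, response)
```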

S22 Table. Gene-set analysis for the classification analysis.

We used a statistical overrepresentation test to identify protein classes associated with the top-20 ranked genes in the feature-selection analysis.

https://doi.org/10.1371/journal.pone.0238757.s037

(DOCX)

S23 Table. Gene-set evaluation using GSEA for the regression analysis.

We used a statistical overrepresentation test to identify protein classes associated with the top-20 ranked genes in the feature-selection analysis. A minimal sketch illustrating this type of calculation appears below.

https://doi.org/10.1371/journal.pone.0238757.s038

(DOCX)
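
S22 and S23 Tables summarize statistical overrepresentation tests applied to the top-20 ranked genes. The following base-R sketch illustrates the kind of calculation involved, using a one-sided hypergeometric test for each protein class followed by Benjamini-Hochberg adjustment. All counts and class names are hypothetical, and this is not necessarily the exact procedure or software used to generate the tables.

```r
# Minimal sketch (hypothetical counts): one-sided hypergeometric test for
# overrepresentation of protein classes among the top-20 ranked genes,
# with Benjamini-Hochberg adjustment across classes.

overrepresentation_p <- function(overlap, list_size, class_size, background_size) {
  # P(X >= overlap) when list_size genes are drawn from a background in which
  # class_size genes belong to the protein class of interest.
  phyper(overlap - 1, class_size, background_size - class_size, list_size,
         lower.tail = FALSE)
}

classes <- data.frame(                        # hypothetical example classes
  class      = c("cell adhesion molecule", "protein kinase", "transcription factor"),
  overlap    = c(4, 2, 1),                    # members among the top 20 genes
  class_size = c(300, 600, 1500)              # members in the background
)

classes$p_value <- overrepresentation_p(classes$overlap,
                                        list_size       = 20,
                                        class_size      = classes$class_size,
                                        background_size = 20000)
classes$fdr <- p.adjust(classes$p_value, method = "BH")
classes
```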

Acknowledgments

The authors express gratitude to the patients who donated specimens and data to GDSC and TCGA, as well as to those who curated the data and made them publicly available. We used computing resources from the Fulton Supercomputing Laboratory at Brigham Young University to perform these analyses.
