
iCDA-CGR: Identification of circRNA-disease associations based on Chaos Game Representation

  • Kai Zheng ,

    Roles Conceptualization, Methodology, Software, Writing – original draft

    ‡ These authors share first authorship on this work.

    Affiliation School of Computer Science and Engineering, Central South University, Changsha, China

  • Zhu-Hong You ,

    Roles Funding acquisition, Project administration, Writing – review & editing

    zhuhongyou@ms.xjb.ac.cn (ZHY); leiwang@ms.xjb.ac.cn (LW)

    ‡ These authors share first authorship on this work.

    Affiliation Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, China

  • Jian-Qiang Li,

    Roles Formal analysis, Resources

    Affiliation College of Computer and Software Engineering, Shenzhen University, Shenzhen, China

  • Lei Wang ,

    Roles Formal analysis, Funding acquisition, Writing – review & editing

    zhuhongyou@ms.xjb.ac.cn (ZHY); leiwang@ms.xjb.ac.cn (LW)

    Affiliations Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, China, College of Information Science and Engineering, Zaozhuang University, Zaozhuang, China

  • Zhen-Hao Guo,

    Roles Data curation, Investigation, Validation

    Affiliation Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, China

  • Yu-An Huang

    Roles Data curation, Investigation, Validation

    Affiliation Department of Computing, Hong Kong Polytechnic University, Hung Hom, Hong Kong, China

Abstract

Recent research has found that tumor cell invasion, proliferation, and other biological processes are controlled by circular RNA (circRNA). Understanding the association between circRNAs and diseases is an important way to explore the pathogenesis of complex diseases and promote disease-targeted therapy. Most methods based on the analysis of high-throughput expression data, such as k-mer and PSSM, tend to assume that functionally similar nucleic acids lack direct linear homology; they disregard positional information and quantify only nonlinear sequence relationships. However, in many complex diseases, the nonlinear sequence relationship between pathogenic and ordinary nucleic acids differs little. Analyzing positional information can therefore help to predict the complex associations between circRNAs and diseases. To fill this gap, we propose a new method, named iCDA-CGR, to predict circRNA-disease associations. In particular, we introduce circRNA sequence information and, for the first time in a circRNA-disease prediction model, quantify the nonlinear sequence relationship of circRNAs with Chaos Game Representation (CGR) technology based on biological sequence position information. In the cross-validation experiment, our method achieved an AUC of 0.8533, which was significantly higher than that of other existing methods. In validation on independent data sets including circ2Disease, circRNADisease and CRDD, the prediction accuracy of iCDA-CGR reached 95.18%, 90.64% and 95.89%, respectively. Moreover, in the case studies, 19 of the top 30 circRNA-disease associations predicted by iCDA-CGR on the circR2Disease dataset were confirmed by newly published literature. These results demonstrate that iCDA-CGR is robust and stable, and can provide highly credible candidates for biological experiments.

Author summary

Understanding the association between circRNAs and diseases is an important step in exploring the pathogenesis of complex diseases and promoting disease-targeted therapy. Computational methods help to discover potential disease-related circRNAs. Based on the analysis of the positional information of biological sequences, the iCDA-CGR model is proposed to predict circRNA-disease associations by integrating multi-source information, including circRNA sequence information, gene-circRNA association information, circRNA-disease association information and disease semantic information. In particular, the positional information of circRNA sequences is introduced into a circRNA-disease association prediction model for the first time. The promising results on cross-validation and independent data sets demonstrate the effectiveness of the proposed model. We further carried out case studies, and 19 of the top 30 associations scored by the proposed model were confirmed by recent experimental reports. The results show that the iCDA-CGR model can effectively predict potential circRNA-disease associations and provide highly reliable candidates for biological experiments, thus helping to further understand the mechanisms of complex diseases.

Introduction

Circular RNA (circRNA) is a type of non-coding RNA without a 5' end cap or a 3' end poly(A) tail [1]. Since the discovery of circRNA in RNA viruses 40 years ago, more than 100,000 circRNAs have been found in cells [2]. With the rapid development of RNA sequencing (RNA-seq) technology and bioinformatics, more and more studies have shown that circRNA plays an important role in many cell activities, including affecting arteriosclerosis, participating in the regulation of mRNA expression and regulating alternative splicing [3–8]. In addition, some evidence suggests that certain diseases may be related to abnormal expression of circRNA. Zhou et al. found that miR-141 is suppressed by circRNA_010567 through targeting TGF-beta1 to promote myocardial fibrosis [9]. Meanwhile, Liang et al. discovered that breast cancer proliferation and progression can be promoted by circ-ABCB10 through sponging miR-1271 [10]. Many scholars believe that many circRNAs can be used as tumor markers and therapeutic targets in clinical applications [11]. For these reasons, confirming potential circRNA-disease associations has gradually become a research hotspot in recent years. However, high experimental cost and long experimental cycles prevent traditional experimental methods from verifying the association between circRNAs and diseases on a large scale. Computational methods have emerged to address this problem [12–16].

In recent years, in order to unify the standards of experimentally obtained circRNAs, many databases have been established, such as circBase, CIRCpedia, deepBase, CircNet and circRNADb [17–21]. These databases provide essential biological information about circRNAs, such as sequencing data and gene targets. In addition, several databases collect circRNAs that have been shown to be associated with various diseases, including CircR2Disease, circRNADisease, circFunBase, and Circ2Disease [22–25]. These databases provide data support for selecting candidate circRNA-disease associations by computational methods. For example, Xiao et al. proposed a weighted dual-manifold regularized computational model named MRLDC, which integrates the geometric information and intrinsic diversity of the circRNA and disease feature spaces [26]. Although this method achieved good results, only 331 associations were available for training the model, and such a small number of training samples may lead to insufficient robustness. In addition, MRLDC only describes the behavior information in the circRNA-disease association network, and cannot directly and accurately measure circRNA similarity and disease similarity from the attributes of circRNAs and diseases. Fan et al. proposed a computational model based on the KATZ measure for human circRNA-disease association prediction (KATZHCDA) using a heterogeneous network [27]. This model likewise lacks sufficient training samples: it used 275 circRNAs, 36 diseases, and 312 associations. Although KATZHCDA uses circRNA expression profile information, its performance is still limited. Compared with the above two models, GHICD and RWRHCD have relatively sufficient training samples: they used 541 circRNAs, 83 diseases, and 592 associations [28]. It is worth noting that although these two models achieved reasonable results and used the circRNA-gene association network to describe the attribute information of circRNAs, their accuracy is still limited because the association network formed by circRNAs and genes is very sparse.

The above analysis shows that although current computational models have achieved good results, they also have some defects. First, the training data used by current models is limited, which affects their robustness. The lack of training data also limits coverage: the number of potential associations that each of these models can predict is around 10,000. Secondly, they are mainly based on a single data description method and do not integrate the behavior information and attribute information of circRNAs and diseases in the network to comprehensively define their features, which limits prediction performance. Finally, they do not take circRNA sequence information into account and cannot accurately measure circRNA similarity. Therefore, to address the drawbacks of current computational models, we propose the iCDA-CGR model to identify CircRNA-Disease Associations based on Chaos Game Representation. By introducing the circFunBase database and sequence information, the problems of limited model coverage and limited predictive performance are addressed. iCDA-CGR integrates multi-source information, including circRNA sequence information, gene-circRNA association information, circRNA-disease association information and disease semantic information. In particular, iCDA-CGR extracts biological sequence position information and quantifies the nonlinear sequence relationship of circRNAs using Chaos Game Representation (CGR) technology [29]. Specifically, iCDA-CGR first computes the disease semantic similarity and the disease Gaussian interaction profile kernel (GAS) similarity and combines them to construct the disease fusional similarity. Secondly, the method quantifies positional and nonlinear sequence information through CGR and calculates the similarity between circRNAs with the Pearson correlation coefficient. Thirdly, the circRNA sequence-based similarity, the circRNA gene-based similarity and the circRNA GAS similarity are integrated into the circRNA fusional similarity. Fourthly, feature descriptors are formed from the circRNA fusional similarity and the disease fusional similarity. Finally, iCDA-CGR feeds the feature descriptors into a support vector machine to predict potential circRNA-disease associations. The workflow of iCDA-CGR is shown in Fig 1. We verify the reliability of the method with five-fold cross-validation on the CircR2Disease database. The average area under the curve (AUC) of our method is 85.14% and the prediction accuracy is 81.12%.

Our source code and data can be downloaded from GitHub (https://github.com/look0012/iCDA-CGR). The repository contains the datasets, the algorithm code and the models. For the convenience of readers, we also provide an easy-to-use version in which the user only needs to enter the circRNA and disease names in the provided code to perform a prediction. The list of circRNAs and diseases is included in the published files, and users can consult it to find the associations they need. This version contains two models, trained on circR2Disease and CircFunBase respectively. iCDA-CGR (circR2Disease) can predict 46,825 unconfirmed associations, and iCDA-CGR (CircFunBase) can provide predictive scores for approximately 170,000 unconfirmed associations. We hope that these improvements will better serve circRNA researchers and help to advance the field.

Fig 1. The workflow of iCDA-CGR model to predict potential circRNA-disease associations.

https://doi.org/10.1371/journal.pcbi.1007872.g001

Methods

Data sets

Benchmark database of circRNA-disease associations.

In the past year, a number of benchmark databases have been proposed for collecting circRNA-disease associations, such as circR2Disease, circRNADisease, circFunBase, and Circ2Disease, which contain experimentally validated associations between diseases and circRNAs [22–24]. In this article, circR2Disease and circFunBase are used as the benchmark data sets. The detailed description is as follows:

circR2Disease. To evaluate the reliability of our method, the widely used benchmark set circR2Disease was selected. The dataset was preprocessed to remove duplicate entries and non-human circRNA-disease associations. Specifically, after removing the circRNAs whose gene symbol could not be found, we obtained 612 confirmed circRNA-disease associations involving 533 circRNAs and 89 diseases, as shown in Table 1. The base dataset circR2Disease can be defined as:

$$Z^{1} = Z^{1}_{p} \cup Z^{1}_{n} \tag{1}$$

where $Z^{1}_{p}$ is the positive subset constructed from the 612 confirmed circRNA-disease associations and $Z^{1}_{n}$ is a negative subset containing 612 associations selected from the unconfirmed pairs among all 47437 possible pairs between the diseases and circRNAs (533 × 89). ∪ is the union of set theory. The known circRNA-disease associations and their names obtained from the circR2Disease database can be seen in S1–S3 Tables.

Table 1. Data distribution of the benchmark set circR2Disease and circFunBase of circRNA-disease association.

https://doi.org/10.1371/journal.pcbi.1007872.t001

circFunBase. CircFunBase is a database that provides high-quality functional circRNA resources, but few models have used it. To address the limited coverage of current models, we also performed experiments on this dataset. After removing circRNAs that did not match the gene symbols, 2984 confirmed circRNA-disease associations were obtained, involving 2597 circRNAs and 67 diseases, as shown in Table 1. The benchmark database circFunBase can be defined as:

$$Z^{2} = Z^{2}_{p} \cup Z^{2}_{n} \tag{2}$$

where $Z^{2}_{p}$ is the positive subset constructed from the 2984 confirmed circRNA-disease associations and $Z^{2}_{n}$ is a negative subset containing 2984 associations selected from all 168031 unconfirmed associations between diseases and circRNAs.

CircRNAs and their sequence information.

Sequence information and gene symbol information for circRNAs are provided by many public databases, such as circBase, CIRCpedia, deepBase, CircNet and circRNADb [17–21]. To construct a more complete circRNA sequence dataset, we downloaded circRNA sequence information from the circBase database, which is accessible free of charge via the web server http://www.circbase.org/.

Related work

Chaos Game Representation (CGR).

Chaos Game Representation (CGR) is an iterative mapping technique for processing sequences [29]. Its first advantage is that the original sequence can be completely recovered from the coordinates, which means that no information is lost in the mapping. Secondly, each sequence has a unique mapping, which means that positional information is preserved. For these reasons, CGR is suitable for transforming nucleotide sequences. The position $P_i$ is computed as:

$$P_i = P_{i-1} + \nu\,(g_i - P_{i-1}), \qquad i = 1, \ldots, n_{seq} \tag{3}$$

where ν is the nucleotide contribution factor, which we set to 0.5, and $g_i$ is the nucleotide position factor, that is, the corner assigned to the i-th nucleotide: A, C, G and T correspond to (0,0), (0,1), (1,1) and (1,0), respectively. $n_{seq}$ is the length of the sequence and $P_0 = (0.5, 0.5)$.
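The iteration described above can be illustrated with a minimal Python sketch. It assumes the standard CGR update in which each point moves a fraction ν of the way from the previous point toward the corner of the current nucleotide; the function names `cgr_points` and `cgr_decode`, and the mapping of RNA "U" to the "T" corner, are illustrative assumptions and are not taken from the iCDA-CGR code.

```python
# Corner coordinates for each nucleotide, as listed in the text.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq, nu=0.5, start=(0.5, 0.5)):
    """Return the CGR coordinates P_1..P_n of a nucleotide sequence."""
    x, y = start
    points = []
    # Assumption: RNA 'U' is mapped to the 'T' corner.
    for base in seq.upper().replace("U", "T"):
        gx, gy = CORNERS[base]
        # Move a fraction nu of the way from the previous point to the corner.
        x += nu * (gx - x)
        y += nu * (gy - y)
        points.append((x, y))
    return points


def cgr_decode(point, length):
    """Recover a sequence of the given length from its last CGR point (nu = 0.5)."""
    inverse = {corner: base for base, corner in CORNERS.items()}
    x, y = point
    bases = []
    for _ in range(length):
        # The quadrant of the current point identifies the last nucleotide.
        gx = 1.0 if x > 0.5 else 0.0
        gy = 1.0 if y > 0.5 else 0.0
        bases.append(inverse[(gx, gy)])
        # Invert the update P_i = 0.5 * (P_{i-1} + g_i).
        x, y = 2 * x - gx, 2 * y - gy
    return "".join(reversed(bases))


points = cgr_points("ACGT")
```

For "ACGT" the final point is (0.78125, 0.40625), and `cgr_decode` applied to that point recovers the sequence, which illustrates the lossless property mentioned above. With double-precision floats the decoding is only exact for short sequences (roughly 50 nucleotides).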

Similarity between diseases

Disease semantic similarity.

The Medical Subject Headings (MeSH) database categorizes diseases rigorously, which helps to calculate the semantic similarity of diseases. It can be downloaded from https://www.nlm.nih.gov/ [30]. Based on semantic information from the MeSH database, a disease can be expressed as a directed acyclic graph (DAG). The nodes in the DAG represent diseases, and the edges represent their relationships. The more pathologically similar two diseases are, the larger the part of their DAGs they share. Wang et al. [31] proposed a method that has been widely used to calculate the semantic similarity of diseases in recent years. We defined a model for calculating disease contribution values as follows:

$$D(r) = -\log\frac{n(DAGs(r))}{n(disease)} \tag{4}$$

where n(DAGs(r)) is the number of DAGs that include disease r and n(disease) is the number of all diseases. The semantic similarity score of disease d(i) and disease d(j) is then described as follows:

$$S^{D}_{sem}(d(i), d(j)) = \frac{\sum_{t \in N_{d(i)} \cap N_{d(j)}} 2\,D(t)}{\sum_{t \in N_{d(i)}} D(t) + \sum_{t \in N_{d(j)}} D(t)} \tag{5}$$

where $N_{d(i)}$ is defined as the set of all diseases that appear in the DAG of disease d(i).
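As a rough illustration of this kind of semantic similarity, the sketch below assumes that the contribution value of a term is the negative logarithm of the fraction of disease DAGs that contain it, and that the similarity of two diseases is the contribution shared by their DAGs divided by their total contribution. The exact form used by iCDA-CGR is not recoverable from this text, so the formulas, the toy DAGs and the names `contribution_values` and `semantic_similarity` are all assumptions.

```python
import math


def contribution_values(dags):
    """dags maps each disease to the set of terms in its DAG (itself included)."""
    n_disease = len(dags)
    counts = {}
    for terms in dags.values():
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    # Terms that occur in fewer DAGs are more specific and contribute more.
    return {term: -math.log(c / n_disease) for term, c in counts.items()}


def semantic_similarity(dags, di, dj):
    """Shared contribution of two DAGs divided by their total contribution."""
    contrib = contribution_values(dags)
    shared = dags[di] & dags[dj]
    numerator = 2.0 * sum(contrib[t] for t in shared)
    denominator = (sum(contrib[t] for t in dags[di])
                   + sum(contrib[t] for t in dags[dj]))
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical toy DAGs: each disease lists itself and its ancestors.
toy_dags = {
    "neoplasms": {"neoplasms"},
    "breast neoplasms": {"breast neoplasms", "neoplasms"},
    "lung neoplasms": {"lung neoplasms", "neoplasms"},
    "heart diseases": {"heart diseases"},
}
score = semantic_similarity(toy_dags, "breast neoplasms", "lung neoplasms")
```

In this toy example the two neoplasm subtypes share only the general term "neoplasms", so their score lies strictly between 0 and 1, while diseases with disjoint DAGs score 0.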

Disease GAS similarity.

Many studies have applied the Gaussian interaction profile kernel (GAS) to measure the similarity between diseases, based on the assumption that pathologically similar diseases tend to be associated with functionally similar circRNAs. In this study, the GAS similarity $S^{D}_{GAS}$ is used to describe the disease similarity information as follows:

$$S^{D}_{GAS}(d(i), d(j)) = \exp\left(-\tau_d \left\| A_{cd}(d(i)) - A_{cd}(d(j)) \right\|^{2}\right) \tag{6}$$

where

$$\tau_d = \frac{1}{\frac{1}{m}\sum_{i=1}^{m} \left\| A_{cd}(d(i)) \right\|^{2}} \tag{7}$$

$$A_{cd} = \left(t_{i,j}\right)_{n \times m} \tag{8}$$

Here τd is the width parameter of the kernel function, and m and n are the numbers of diseases and circRNAs, respectively. The association adjacency matrix Acd represents the positive subset Zp: if circRNA c(i) and disease d(j) are associated, the element ti,j is set to 1, otherwise 0. Acd(d(i)) is the association profile of disease d(i), for which we use the i-th column vector of the adjacency matrix.
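The following sketch shows how a Gaussian interaction profile kernel can be computed from a binary adjacency matrix. It assumes the commonly used form in which the width parameter is the reciprocal of the mean squared norm of the profiles; the toy matrix and the names `columns` and `gas_similarity` are hypothetical and are not taken from the published code.

```python
import math


def columns(matrix):
    """Return the column vectors of a row-major matrix."""
    return [list(col) for col in zip(*matrix)]


def gas_similarity(profiles):
    """Gaussian interaction profile kernel similarity between binary profiles."""
    count = len(profiles)
    mean_sq_norm = sum(sum(v * v for v in p) for p in profiles) / count
    if mean_sq_norm == 0:
        raise ValueError("all association profiles are empty")
    # Assumed normalisation: width = 1 / mean squared norm of the profiles.
    tau = 1.0 / mean_sq_norm

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    return [[math.exp(-tau * sq_dist(p, q)) for q in profiles] for p in profiles]


# Hypothetical adjacency matrix A_cd: 4 circRNAs (rows) x 3 diseases (columns).
A_cd = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
]
disease_gas = gas_similarity(columns(A_cd))  # profiles are columns of A_cd
circrna_gas = gas_similarity(A_cd)           # profiles are rows of A_cd
```

The same function serves both entity types: disease profiles are the columns of the adjacency matrix and circRNA profiles are its rows. In the toy matrix, the first two diseases share one circRNA and therefore score higher with each other than either does with the third disease.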

Disease fusional similarity.

By analyzing disease similarity from multiple perspectives, we obtain two similarity matrices: the semantic similarity matrix and the GAS similarity matrix. However, the semantic similarity cannot be calculated for a disease that does not have its own DAG. To compensate for this deficiency, we fuse the two matrices as in previous research [32–34]. The disease fusional similarity SD between diseases d(i) and d(j) is defined as follows, and the final disease similarity matrix can be seen in S4 Table.

(9)

Similarity between circRNAs

CircRNA gene-based similarity.

Previous research has found that circular RNA regulates the activity of RNA polymerase and promotes the transcription of its parental genes. Moreover, RNAs that affect the same human disease tend to have similar functions [35–37]. In this work, we downloaded gene-circRNA association information from the circR2Disease database. The circRNA gene-based similarity matrix was constructed as follows: (10) where the elements of the resulting matrix are the functional similarity scores between circRNAs. The association adjacency matrix Acg represents the associations between genes and circRNAs: if a gene target and a circRNA are associated, the corresponding element of Acg is set to 1, otherwise 0. The GAS similarity matrix of the genes is constructed from the association adjacency matrix Acg, and T is the transpose operator.

CircRNA GAS similarity.

Many studies have used the Gaussian interaction profile kernel (GAS) to measure the similarity between biomolecules [38], because RNAs that affect the same human disease tend to have similar functions [35–37]. In this study, the GAS similarity $S^{C}_{GAS}$ is used to describe the circRNA similarity information as follows:

$$S^{C}_{GAS}(c(i), c(j)) = \exp\left(-\tau_c \left\| A_{cd}(c(i)) - A_{cd}(c(j)) \right\|^{2}\right) \tag{11}$$

$$\tau_c = \frac{1}{\frac{1}{n}\sum_{i=1}^{n} \left\| A_{cd}(c(i)) \right\|^{2}} \tag{12}$$

where $S^{C}_{GAS}(c(i), c(j))$ is the GAS similarity value between circRNAs c(i) and c(j). The i-th row vector of the adjacency matrix Acd is defined as the association profile Acd(c(i)) of circRNA c(i), which is a vector describing the relationship between circRNA c(i) and all diseases. τc is the width parameter and n is the number of circRNAs.

circRNA sequence-based similarity.

Existing sequence alignment algorithms quantify either positional information or nonlinear information, and few algorithms that combine both have been proposed. We therefore propose a new CGR-based method that quantifies both positional and nonlinear information and measures the similarity between circRNAs with the Pearson correlation coefficient. The specific calculation process is as follows.

Firstly, the CGR space is divided into $N_g$ grids ($N_g = 2^{s} \times 2^{s}$, $s = 3$), as shown in Fig 2. The grids can be represented as in formula 13.

Fig 2. (A) The CGR of hsa_circ_0005931, plotted with the average coordinates for each cell of the 8 × 8 grid. (B) The matrix of hsa_circ_0005931 nucleotide probabilities for the chaos game representation.

https://doi.org/10.1371/journal.pcbi.1007872.g002

(13)

Secondly, the abscissa point.x and ordinate point.y in each grid are accumulated respectively to quantify position information.

(14)(15)

Thirdly, we calculate the z-scores of each grid Zi to quantify nonlinear information.

(16)(17)

Finally, each grid is described by three attributes, which we fuse to construct the descriptor descriptors(c(i)); the sequence similarity is then determined by the Pearson correlation coefficient, where c(i) represents the i-th circRNA. The workflow is shown in Fig 3. (18) (19) where Cov(descriptors(c(i)), descriptors(c(j))) is the covariance between the descriptors of c(i) and c(j), and D(descriptors(c(i))) is the variance of descriptors(c(i)). The size of the circRNA sequence similarity matrix is n×n. All sequence information used in this article was downloaded from circBase [17].
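The four steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the published implementation: the three attributes of a grid are taken to be the accumulated x coordinates, the accumulated y coordinates and the z-score of the number of points in the grid, and the descriptor is their concatenation. The names `cgr_points`, `grid_descriptor` and `pearson` and the example sequences are hypothetical.

```python
import math

CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq, nu=0.5):
    """CGR coordinates of a nucleotide sequence, starting from (0.5, 0.5)."""
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper().replace("U", "T"):
        gx, gy = CORNERS[base]
        x, y = x + nu * (gx - x), y + nu * (gy - y)
        points.append((x, y))
    return points


def grid_descriptor(seq, s=3):
    """Concatenate per-grid x sums, y sums and count z-scores (assumed layout)."""
    side = 2 ** s
    n_grids = side * side
    sum_x = [0.0] * n_grids
    sum_y = [0.0] * n_grids
    count = [0] * n_grids
    for x, y in cgr_points(seq):
        col = min(int(x * side), side - 1)
        row = min(int(y * side), side - 1)
        k = row * side + col
        sum_x[k] += x  # positional information: accumulated abscissa
        sum_y[k] += y  # positional information: accumulated ordinate
        count[k] += 1
    mean = sum(count) / n_grids
    std = math.sqrt(sum((c - mean) ** 2 for c in count) / n_grids)
    # Nonlinear information: z-score of the point count in each grid.
    z = [(c - mean) / std if std > 0 else 0.0 for c in count]
    return sum_x + sum_y + z


def pearson(u, v):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0
    return cov / math.sqrt(var_u * var_v)


# Hypothetical sequences, used only to exercise the functions.
desc_a = grid_descriptor("ACGTTGCAACGTAGCTAGCTAGGATCC")
desc_b = grid_descriptor("ACGTTGCAACGTAGCTAGCTAGGTTCC")
similarity = pearson(desc_a, desc_b)
```

With s = 3 each sequence yields a descriptor of 3 × 64 = 192 values regardless of its length, which is what allows sequences of different lengths to be compared by a single correlation coefficient.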

Fig 3. The workflow of circRNA sequence-based similarity.

https://doi.org/10.1371/journal.pcbi.1007872.g003

CircRNA fusional similarity.

By analyzing the characteristics of circRNAs from different perspectives, we obtain three similarity matrices: the gene-based similarity (formula 10), the GAS similarity (formula 11), and the sequence-based similarity (formula 19). Since the two adjacency matrices Acd and Acg are sparse, the gene-based and GAS similarities obtained by collaborative filtering show no significant difference in value and cannot effectively distinguish circRNAs. To overcome the small differences between circRNAs caused by the lack of data, we describe circRNAs from an additional perspective to make the representation more informative. To this end, the sequence similarity is introduced. However, some circRNAs lack sequence information corresponding to the experiment, so the similarity information is completed by combining the three matrices. The fusional similarity SC is defined as follows, and the final circRNA similarity matrix can be seen in S5 Table.

(20)

Prediction of association between circRNA and disease by SVM

Support Vector Machines (SVM), introduced in 1963 by Vapnik et al., have demonstrated many unique advantages in solving small-sample, nonlinear and high-dimensional pattern recognition problems. Because the number of training samples used in iCDA-CGR is small, SVM is selected to build a model for predicting potential circRNA-disease associations. Prediction is divided into three main steps: 1. Construct positive and negative sample sets; 2. Form the association descriptors based on the characteristics of the circRNA and disease; 3. Train models based on the descriptors to predict potential circRNA-disease associations. Each step is described in detail below.

Firstly, we built the positive and negative sample sets. Specifically, the 612 experimentally supported circRNA-disease pairs in circR2Disease were chosen as positive samples. Meanwhile, we randomly selected the same number of associations without experimental support as negative samples.

Secondly, the association descriptors were formed from the characteristics of the circRNA and the disease. We calculated the semantic similarity and the GAS similarity of the diseases separately and integrated them into a matrix SD. The similarity of disease d(id) to all diseases including itself (the id-th row of the matrix SD) is used as the characteristic descriptor of the disease:

$$SD(d(i_d)) = (v_1, v_2, v_3, \ldots, v_m) \tag{21}$$

where SD(d(id)) represents the id-th row of the matrix SD, and v1 is the similarity value of d(id) and d(1). The size of SD(d(id)) is 1×m. Likewise, we calculated the gene-based similarity, the GAS similarity and the sequence-based similarity of the circRNAs separately to form the circRNA fusional similarity SC. The similarity of circRNA c(ic) to all circRNAs including itself (the ic-th row of the matrix SC) is used as the characteristic descriptor of the circRNA:

$$SC(c(i_c)) = (w_1, w_2, w_3, \ldots, w_n) \tag{22}$$

where SC(c(ic)) represents the ic-th row of the matrix SC, and w1 is the similarity value between c(ic) and c(1). The size of SC(c(ic)) is 1×n. Each circRNA-disease sample is then defined as a 622-dimensional association descriptor (m + n = 89 + 533 on circR2Disease) that concatenates SD(d(id)) and SC(c(ic)):

$$F = (f_1, f_2, f_3, \ldots, f_m, f_{m+1}, f_{m+2}, f_{m+3}, \ldots, f_{m+n}) \tag{23}$$

where (f1,f2,f3,…,fm) is the id-th row of the disease fusional similarity SD, and (fm+1,fm+2,fm+3,…,fm+n) is the ic-th row of the circRNA fusional similarity SC.

Finally, a support vector machine (SVM) is trained on the samples to build the predictive model. More specifically, we first set the labels of the training set: samples in Zp are labeled 1 and samples in Zn are labeled 0. We then feed the training data into the SVM to obtain the prediction model. The higher the predicted score of a circRNA-disease pair, the more likely it is to be a candidate for a potential association.
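The first two steps, sample construction and descriptor formation, can be sketched as below. The similarity matrices, the index pairs and the function name `build_samples` are hypothetical; the sketch only assumes what the text states, namely that negatives are drawn at random from unconfirmed pairs in equal number to the positives and that each descriptor concatenates one row of SD with one row of SC.

```python
import random


def build_samples(SD, SC, positives, seed=0):
    """Build balanced descriptors and labels from fused similarity matrices.

    SD: m x m disease similarity, SC: n x n circRNA similarity,
    positives: iterable of (circRNA index, disease index) confirmed pairs.
    """
    rng = random.Random(seed)
    n, m = len(SC), len(SD)
    pos = set(positives)
    unconfirmed = [(c, d) for c in range(n) for d in range(m) if (c, d) not in pos]
    # Draw as many unconfirmed pairs as there are confirmed ones.
    negatives = rng.sample(unconfirmed, len(pos))
    X, y = [], []
    for label, pairs in ((1, sorted(pos)), (0, negatives)):
        for c, d in pairs:
            # Descriptor: disease row (length m) followed by circRNA row (length n).
            X.append(list(SD[d]) + list(SC[c]))
            y.append(label)
    return X, y


# Hypothetical toy matrices: 2 diseases and 3 circRNAs.
SD = [[1.0, 0.2],
      [0.2, 1.0]]
SC = [[1.0, 0.5, 0.1],
      [0.5, 1.0, 0.3],
      [0.1, 0.3, 1.0]]
X, y = build_samples(SD, SC, positives=[(0, 0), (2, 1)])
```

The resulting `X` and `y` could then be passed to any SVM implementation, for example the `SVC` class in scikit-learn; the kernel and hyperparameters used by iCDA-CGR are not stated in this section, so none are assumed here.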

Results

Performance Evaluation

The five-fold cross-validation (5-CV).

In this work, five-fold cross-validation (5-CV) is used to evaluate the effectiveness of iCDA-CGR in predicting disease-related circRNAs. We split the base dataset Z evenly into five parts:

$$Z = Z_1 \cup Z_2 \cup Z_3 \cup Z_4 \cup Z_5, \qquad Z_i \cap Z_j = \varnothing \ (i \neq j) \tag{24}$$

where ∅ is the empty set, and ∪ and ∩ are the union and intersection of set theory. The subsets Zi, Zp and Zn can be defined as:

$$Z_i = Z^{i}_{p} \cup Z^{i}_{n}, \qquad Z_p = \bigcup_{i=1}^{5} Z^{i}_{p}, \qquad Z_n = \bigcup_{i=1}^{5} Z^{i}_{n} \tag{25}$$

The relationship between the i-th positive subset and the i-th negative subset can be expressed as:

$$n(Z^{i}_{p}) = n(Z^{i}_{n}) \tag{26}$$

where $n(Z^{i}_{p})$ is the number of samples in the i-th positive subset and $n(Z^{i}_{n})$ is the number of samples in the i-th negative subset. In iCDA-CGR, four of the positive and negative subsets are used as the training set and the remaining one as the test set. The procedure is repeated five times so that each subset serves as the test set once, and the five results are averaged to obtain the final estimate.
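A minimal sketch of such a balanced five-fold split is given below. The function name `five_fold_splits` and the placeholder sample identifiers are hypothetical; the sketch only assumes that positives and negatives are shuffled and dealt into the folds separately, so that every fold contains the same number of each.

```python
import random


def five_fold_splits(positives, negatives, k=5, seed=0):
    """Yield (train, test) pairs with positives and negatives split separately."""
    rng = random.Random(seed)
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    # Fold i takes every k-th positive and every k-th negative sample.
    folds = [pos[i::k] + neg[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test


# Placeholder identifiers standing in for 612 positive and 612 negative samples.
positives = [("pos", i) for i in range(612)]
negatives = [("neg", i) for i in range(612)]
splits = list(five_fold_splits(positives, negatives))
```

With 612 samples per class, each test fold holds 122 or 123 positives and the same number of negatives, and every sample appears in exactly one test fold.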

Evaluation criteria.

Four evaluation criteria were introduced to assess the performance of iCDA-CGR. Accu. is the ratio of the number of samples correctly classified by the classifier to the total number of samples:

$$Accu. = \frac{TP + TN}{TP + TN + FP + FN} \tag{27}$$

where TP and FP are the numbers of true positive and false positive samples, respectively, and TN and FN are the numbers of true negative and false negative samples, respectively. Sen. is the ratio of the number of positive samples correctly classified by the classifier to the total number of positive samples:

$$Sen. = \frac{TP}{TP + FN} \tag{28}$$

Prec. is the ratio of the number of positive samples correctly classified by the classifier to the sum of true positive and false positive samples:

$$Prec. = \frac{TP}{TP + FP} \tag{29}$$

F1 is a comprehensive evaluation index of Sen. and Prec.:

$$F1 = \frac{2 \times Prec. \times Sen.}{Prec. + Sen.} \tag{30}$$
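The four criteria can be computed directly from the confusion-matrix counts, as in the short sketch below. The function name `classification_metrics` and the counts in the example are hypothetical and do not correspond to any result reported in this article.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, precision and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)  # also called recall
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts for one test fold.
metrics = classification_metrics(tp=8, fp=2, tn=7, fn=3)
```

For these counts the accuracy is 15/20 = 0.75, the sensitivity is 8/11, the precision is 0.8 and F1 is 16/21. The sketch does not guard against empty denominators, which would occur if a fold contained no positive samples or the classifier predicted no positives.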

Assessment of prediction ability

To evaluate the capabilities of the model, we performed experiments on the circR2Disease and circFunBase datasets. The five-fold cross-validation results on the circR2Disease dataset are summarized in Table 2. iCDA-CGR obtained an average prediction AUC of 0.8533 ± 0.0249. The AUCs of the five folds are 0.8923 (fold 1), 0.8252 (fold 2), 0.8390 (fold 3), 0.8723 (fold 4) and 0.8385 (fold 5), as shown in Fig 4. iCDA-CGR obtained an average prediction AUPR of 0.7584 ± 0.0351. The AUPRs of the five folds are 0.8240 (fold 1), 0.7463 (fold 2), 0.7187 (fold 3), 0.7566 (fold 4) and 0.7465 (fold 5), as shown in Fig 5. The average accuracy, sensitivity, precision and F1-score are 81.95%, 88.08%, 78.46% and 82.97%, respectively, as shown in Table 2.

Fig 4. ROC curves performed by iCDA-CGR on circR2Disease dataset.

https://doi.org/10.1371/journal.pcbi.1007872.g004

Fig 5. PR curves performed by iCDA-CGR on circR2Disease dataset.

https://doi.org/10.1371/journal.pcbi.1007872.g005

Table 2. The five-fold cross-validation results performed by iCDA-CGR on circR2Disease dataset.

https://doi.org/10.1371/journal.pcbi.1007872.t002

On the circFunBase dataset, the mean and standard deviation over the five folds are reported as the experimental results of the five-fold cross-validation, which are summarized in Table 3. iCDA-CGR obtained an average prediction AUC of 0.8049 ± 0.0169. The AUCs of the five folds are 0.7820 (fold 1), 0.8316 (fold 2), 0.8104 (fold 3), 0.7926 (fold 4) and 0.8080 (fold 5), as shown in Fig 6. The AUPRs of the five folds are 0.7276 (fold 1), 0.8037 (fold 2), 0.7816 (fold 3), 0.7437 (fold 4) and 0.7727 (fold 5), as shown in Fig 7. The average accuracy, precision, sensitivity and F1-score are 78.03%, 79.96%, 74.94% and 77.31%, respectively, as shown in Table 3.

Fig 6. ROC curves performed by iCDA-CGR on circFunBase dataset.

https://doi.org/10.1371/journal.pcbi.1007872.g006

Fig 7. PR curves performed by iCDA-CGR on circFunBase dataset.

https://doi.org/10.1371/journal.pcbi.1007872.g007

Table 3. The five-fold cross-validation results performed by iCDA-CGR on circFunBase dataset.

https://doi.org/10.1371/journal.pcbi.1007872.t003

Comparison among different classifiers

In the above experiments, iCDA-CGR obtained reliable results. To justify the choice of classifier, we compared the support vector machine (SVM) with random forest (RF), decision tree (DT) and k-nearest neighbor (KNN) on the benchmark database circR2Disease.

The support vector machine (SVM) is a binary classification model. Its purpose is to find a hyperplane that separates the samples. The separation principle is to maximize the margin, which is finally transformed into a convex quadratic programming problem. The decision tree (DT) adopts a top-down recursive method. The basic idea is to construct a tree along which entropy, as measured by information entropy, declines fastest, so that the entropy at each leaf node is 0. The random forest (RF) is an ensemble learning method of the bagging type. By combining multiple weak classifiers, whose outputs are voted on or averaged, the whole model achieves higher accuracy and better generalization. The main idea of the k-nearest neighbor (KNN) algorithm is that if most of the k nearest samples in the feature space belong to a certain category, then the sample also belongs to this category and shares the characteristics of the samples in it.

In Table 4, we compare the results of the support vector machine with those of the other three classifiers on the circR2Disease database. The accuracies of the four classifiers are 82.44% (support vector machine), 76.32% (k-nearest neighbor), 70.61% (random forest) and 73.06% (decision tree). Their AUCs are 0.8645 (support vector machine), 0.8479 (k-nearest neighbor), 0.7927 (random forest) and 0.7281 (decision tree), as shown in Fig 8.

Fig 8. The ROCs of four different classifiers which are support vector machines, decision tree, random forest and k-nearest neighbor on circR2Disease dataset.

https://doi.org/10.1371/journal.pcbi.1007872.g008

Table 4. Performance comparison among four different classifiers which are k-nearest neighbor, random forest, decision tree and support vector machine.

https://doi.org/10.1371/journal.pcbi.1007872.t004

Comparison with related models

To further evaluate the reliability of iCDA-CGR, we compared it to five related prediction models: KATZHCDA, GHICD, RWRHCD, CD-LNLP and ICFCDA. The details of the comparison are summarized in Table 5. From the table, we can see that KATZHCDA, GHICD, RWRHCD and our model iCDA-CGR are all based on circR2Disease data set and use the five-fold cross-validation method, so iCDA-CGR can be directly compared with these three models. In terms of AUC scores reflecting the overall performance of the model, KATZHCDA, GHICD and RWRHCD achieved 0.7936, 0.7290 and 0.6660 respectively, while the proposed model iCDA-CGR achieved 0.8533. The results show that iCDA-CGR is significantly better than these methods.

Table 5. Performance comparison (AUC scores) among six different prediction models: iCDA-CGR, KATZHCDA, GHICD, RWRHCD, CD-LNLP and ICFCDA.

https://doi.org/10.1371/journal.pcbi.1007872.t005

In the last two rows of Table 5, we list the performance of CD-LNLP and ICFCDA, which are 0.9007 and 0.9460, respectively. However, because the dataset or assessment methods used by these two models are inconsistent with the proposed model, we cannot directly compare them, so they are used as a reference for model performance. The specific reasons that cannot be directly compared are as follows:

CD-LNLP uses the circ2Disease database instead of the more commonly used circR2Disease database. Because the data sources differ, the trained models are evaluated against different criteria. Furthermore, CD-LNLP uses leave-one-out cross-validation (LOOCV) to evaluate model performance instead of the more commonly used five-fold cross-validation (5-CV). According to previous work, for the same model and data, LOOCV scores are usually higher than 5-CV scores [39]. Therefore, CD-LNLP cannot be directly compared with the proposed model.

ICFCDA uses the circR2Disease database, but it removes more data as noise. The training data of ICFCDA include 212 associations between 200 circRNAs and 42 diseases. The predicted coverage of this model is 7976 associations, which is 17.25% of the coverage of iCDA-CGR. This filtering strengthens the model's performance but sacrifices its coverage. In addition, ICFCDA also uses LOOCV. Therefore, ICFCDA cannot be directly compared with the proposed model.
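The difference between the two evaluation protocols discussed above can be made concrete with a short Python sketch. LOOCV is the special case of k-fold cross-validation in which the number of folds equals the number of samples, so each model is trained on all but one sample. The function name, the shuffling seed and the sample count of 10 are illustrative assumptions.

```python
import random


def k_fold_splits(n_samples, n_folds, seed=0):
    """Return a list of (train_indices, test_indices) pairs for k-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds disjoint folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


n = 10  # hypothetical number of labelled samples
five_fold = k_fold_splits(n, 5)  # 5-CV: each model sees 80% of the data
loocv = k_fold_splits(n, n)      # LOOCV: each model sees all but one sample
```

Under LOOCV every model is trained on a larger share of the data than under 5-CV, which is one plausible reason why scores from the two protocols should not be compared directly.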

In summary, the proposed model has superior performance and coverage, which indicates that CGR-based sequence extraction technology and characterization of intrinsic structure and circRNA-disease association information could effectively improve the reliability of prediction.

Case study

To verify the performance of the model in predicting potential associations based on confirmed associations, we carried out a case study. To be specific, we define the training samples and test samples as follows: (31)

In the validation, the confirmed associations Z1 between circRNAs and diseases provided by the circR2Disease database were selected as the training set Z1train. Meanwhile, all the possible associations were selected as the test set Z1test. The sizes of Z1train and Z1test are 1224 and 46213, respectively. Here, we verified the 30 associations with the highest scores. Among them, 19 pairs were confirmed by published literature, as shown in Table 6.

(32)
Table 6. The top 30 circRNA-disease associations predicted from the known associations in circR2Disease.

https://doi.org/10.1371/journal.pcbi.1007872.t006
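The case-study procedure (score every pair outside the training set, then inspect the highest-scoring ones) can be sketched in Python as follows. The function name, the toy identifiers and the toy scores are hypothetical; in the actual model the score would come from the trained SVM. The final lines check that the reported test-set size is consistent with the 533 circRNAs and 89 diseases listed in S3 and S4 Tables: 533 × 89 − 1224 = 46213.

```python
from itertools import product


def rank_candidates(circrnas, diseases, training_pairs, score_fn, top_k=30):
    """Rank all circRNA-disease pairs outside the training set by score."""
    training = set(training_pairs)
    candidates = [p for p in product(circrnas, diseases) if p not in training]
    ranked = sorted(candidates, key=score_fn, reverse=True)
    return ranked[:top_k]


# Hypothetical toy example: 3 circRNAs, 2 diseases, 2 training pairs.
toy_training = [("c1", "d1"), ("c2", "d2")]
toy_scores = {("c1", "d2"): 0.9, ("c2", "d1"): 0.2,
              ("c3", "d1"): 0.7, ("c3", "d2"): 0.4}
toy_top = rank_candidates(["c1", "c2", "c3"], ["d1", "d2"], toy_training,
                          score_fn=lambda pair: toy_scores[pair], top_k=2)

# Consistency check on the sizes reported for circR2Disease.
n_circrnas, n_diseases, n_training = 533, 89, 1224
n_candidates = n_circrnas * n_diseases - n_training
```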

Similar to the definition above, the confirmed associations provided by the circFunBase database were selected as the training set Z2train. At the same time, all possible associations were selected as the test set Z2test. The sizes of Z2train and Z2test are 5968 and 168031, respectively. Here, we verified the 30 associations with the highest scores. Among them, 17 pairs were confirmed by published literature, as shown in Table 7.

Table 7. The top 30 circRNA-disease associations predicted from the known associations in circFunBase.

https://doi.org/10.1371/journal.pcbi.1007872.t007

Performance on independent data set

The results above indicate that this method is reliable for circRNA-disease association prediction. To further support this conclusion, we verified the method on three other databases (CRDD, circRNADisease and Circ2Disease). Because each database is incomplete, it is not possible to identify all potential circRNA-disease associations. We therefore assume that the associations in a database are the only known, experimentally verified associations, and treat the rest as unknown. The training samples and test samples are described as follows: (33) where the two sets defined there are the training set and test set for the independent data sets, respectively, and Zdatabase represents an independent data set, such as CRDD, circRNADisease or Circ2Disease. In this experiment, iCDA-CGR was used to construct the prediction model from the base dataset Z1. Since the diseases and circRNAs differ between data sources, the intersection of the set of all possible associations CUZ1 with the independent data set Zdatabase is used as the test set. It can be seen from Table 8 that the proposed method obtained prediction accuracies of 95.18% (Circ2Disease), 90.64% (circRNADisease) and 95.89% (CRDD) on the three databases, respectively. In addition, we repeated the experiment with circFunBase as the base dataset. The training samples and test samples are described as follows: (34)

With circFunBase as the base dataset, Table 8 shows that the proposed method obtained prediction accuracies of 63.26% (Circ2Disease), 73.43% (circRNADisease) and 72.72% (CRDD) on the three databases, respectively. The experiment shows that iCDA-CGR has strong generalization ability.

Table 8. Predictive results of iCDA-CGR on three other databases.

https://doi.org/10.1371/journal.pcbi.1007872.t008
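The construction of the independent test set can be sketched in Python with plain set operations. This sketch rests on two assumptions that the text does not state explicitly: that CUZ1 denotes the pairs not contained in the base dataset Z1, and that the percentages in Table 8 are the fraction of test associations that the model predicts as positive. All names and toy data are hypothetical.

```python
from itertools import product


def independent_test_set(all_pairs, base_pairs, database_pairs):
    """Pairs recorded in the independent database but absent from the base set."""
    unlabelled = set(all_pairs) - set(base_pairs)  # assumed meaning of CUZ1
    return unlabelled & set(database_pairs)


def recovered_fraction(test_pairs, predict_fn):
    """Fraction of test pairs that the model predicts as associated."""
    if not test_pairs:
        return 0.0
    hits = sum(1 for pair in test_pairs if predict_fn(pair))
    return hits / len(test_pairs)


# Hypothetical toy example.
toy_all = list(product(["c1", "c2", "c3"], ["d1", "d2"]))
toy_base = {("c1", "d1"), ("c2", "d2")}
toy_database = {("c1", "d1"), ("c1", "d2"), ("c3", "d1"), ("c3", "d2")}
toy_test = independent_test_set(toy_all, toy_base, toy_database)

# Stand-in for the trained classifier: a fixed set of positive predictions.
toy_positive = {("c1", "d2"), ("c3", "d1")}
toy_fraction = recovered_fraction(toy_test, lambda pair: pair in toy_positive)
```

In the toy example, the pair ("c1", "d1") is excluded from the test set because it already belongs to the base dataset, so the model is evaluated only on associations it has not seen during training.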

Discussion

In this study, we proposed the computational model iCDA-CGR, which quantifies positional and nonlinear sequence information to identify circRNA-disease associations. This model integrates circRNA sequence information, gene-circRNA association information, circRNA-disease association information and disease semantic information, and predicts the final results with an SVM classifier. In particular, for the first time in a circRNA-disease prediction model, we introduce circRNA sequence information, extracting the positional information of the biological sequence and quantifying its nonlinear relationships by Chaos Game Representation. The model achieved outstanding results in five-fold cross-validation, in comparisons with other methods and on independent data sets. Furthermore, 19 of the top 30 circRNA-disease associations predicted in the case study were confirmed by recently published literature. With the addition of sequence information, iCDA-CGR exhibited strong reliability and stability in predicting potential circRNA-disease associations. These experimental results indicate that the sequence information has sufficient coverage relative to nucleic acids, and iCDA-CGR has great potential for nucleic acid function analysis.

Supporting information

S1 Table. Data distribution of the benchmark set circR2Disease and circFunBase of circRNA-disease association.

https://doi.org/10.1371/journal.pcbi.1007872.s001

(XLSX)

S2 Table. Known circRNA-disease associations obtained from circR2Disease database.

https://doi.org/10.1371/journal.pcbi.1007872.s002

(XLSX)

S3 Table. Names of 533 circRNAs involved in known circRNA-disease associations obtained from circR2Disease database.

https://doi.org/10.1371/journal.pcbi.1007872.s003

(XLSX)

S4 Table. Names of 89 diseases involved in known circRNA-disease associations obtained from circR2Disease database.

https://doi.org/10.1371/journal.pcbi.1007872.s004

(XLSX)

S5 Table. The final disease similarity matrix.

https://doi.org/10.1371/journal.pcbi.1007872.s005

(XLSX)

S6 Table. The final circRNA similarity matrix.

https://doi.org/10.1371/journal.pcbi.1007872.s006

(XLSX)

References

  1. Ashwal-Fluss R, Meyer M, Pamudurti NR, Ivanov A, Bartok O, Hanan M, et al. circRNA biogenesis competes with pre-mRNA splicing. Molecular cell. 2014;56(1):55–66. pmid:25242144
  2. Zheng L-L, Li J-H, Wu J, Sun W-J, Liu S, Wang Z-L, et al. deepBase v2.0: identification, expression, evolution and function of small RNAs, LncRNAs and circular RNAs from deep-sequencing data. Nucleic acids research. 2015;44(D1):D196–D202. pmid:26590255
  3. Du WW, Fang L, Yang W, Wu N, Awan FM, Yang Z, et al. Induction of tumor apoptosis through a circular RNA enhancing Foxo3 activity. Cell death and differentiation. 2017;24(2):357. pmid:27886165
  4. Armakola M, Higgins MJ, Figley MD, Barmada SJ, Scarborough EA, Diaz Z, et al. Inhibition of RNA lariat debranching enzyme suppresses TDP-43 toxicity in ALS disease models. Nature genetics. 2012;44(12):1302. pmid:23104007
  5. Li Z, Huang C, Bao C, Chen L, Lin M, Wang X, et al. Exon-intron circular RNAs regulate transcription in the nucleus. Nature structural & molecular biology. 2015;22(3):256.
  6. Zhang Y, Zhang X-O, Chen T, Xiang J-F, Yin Q-F, Xing Y-H, et al. Circular intronic long noncoding RNAs. Molecular cell. 2013;51(6):792–806. pmid:24035497
  7. Xu H, Guo S, Li W, Yu P. The circular RNA Cdr1as, via miR-7 and its targets, regulates insulin transcription and secretion in islet cells. Scientific reports. 2015;5:12453. pmid:26211738
  8. Li F, Zhang L, Li W, Deng J, Zheng J, An M, et al. Circular RNA ITCH has inhibitory effect on ESCC by suppressing the Wnt/β-catenin pathway. Oncotarget. 2015;6(8):6001. pmid:25749389
  9. Zhou B, Yu J-W. A novel identified circular RNA, circRNA_010567, promotes myocardial fibrosis via suppressing miR-141 by targeting TGF-β1. Biochemical and biophysical research communications. 2017;487(4):769–75. pmid:28412345
  10. Liang H-F, Zhang X-Z, Liu B-G, Jia G-T, Li W-L. Circular RNA circ-ABCB10 promotes breast cancer proliferation and progression through sponging miR-1271. American journal of cancer research. 2017;7(7):1566. pmid:28744405
  11. Li P, Chen S, Chen H, Mo X, Li T, Shao Y, et al. Using circular RNA as a novel type of biomarker in the screening of gastric cancer. Clinica Chimica Acta. 2015;444:132–6.
  12. Wang L, Yan X, Liu M-L, Song K-J, Sun X-F, Pan W-W. Prediction of RNA-protein interactions by combining deep convolutional neural network with feature selection ensemble method. Journal of theoretical biology. 2019;461:230–8. pmid:30321541
  13. Wang L, You Z-H, Chen X, Li Y-M, Dong Y-N, Li L-P, et al. LMTRDA: Using logistic model tree to predict MiRNA-disease associations by fusing multi-source information of sequences and similarities. PLoS computational biology. 2019;15(3):e1006865. pmid:30917115
  14. Wang L, Wang H-F, Liu S-R, Yan X, Song K-J. Predicting Protein-Protein Interactions from Matrix-Based Protein Sequence Using Convolution Neural Network and Feature-Selective Rotation Forest. Scientific reports. 2019;9(1):9848. pmid:31285519
  15. Zheng K, You Z-H, Wang L, Zhou Y, Li L-P, Li Z-W. MLMDA: a machine learning approach to predict and validate MicroRNA–disease associations by integrating of heterogenous information sources. Journal of translational medicine. 2019;17(1):1–14.
  16. Zheng K, You Z-H, Wang L, Li Y-R, Wang Y-B, Jiang H-J, editors. MISSIM: Improved miRNA-Disease Association Prediction Model Based on Chaos Game Representation and Broad Learning System. International Conference on Intelligent Computing; 2019: Springer.
  17. Glažar P, Papavasileiou P, Rajewsky N. circBase: a database for circular RNAs. RNA. 2014;20(11):1666–70. pmid:25234927
  18. Dong R, Ma X-K, Li G-W, Yang L. CIRCpedia v2: an updated database for comprehensive circular RNA annotation and expression comparison. Genomics, proteomics & bioinformatics. 2018;16(4):226–33.
  19. Yang J-H, Shao P, Zhou H, Chen Y-Q, Qu L-H. deepBase: a database for deeply annotating and mining deep sequencing data. Nucleic acids research. 2009;38(suppl_1):D123–D30.
  20. Liu Y-C, Li J-R, Sun C-H, Andrews E, Chao R-F, Lin F-M, et al. CircNet: a database of circular RNAs derived from transcriptome sequencing data. Nucleic acids research. 2015;44(D1):D209–D15. pmid:26450965
  21. Chen X, Han P, Zhou T, Guo X, Song X, Li Y. circRNADb: a comprehensive database for human circular RNAs with protein-coding annotations. Scientific reports. 2016;6:34985. pmid:27725737
  22. Fan C, Lei X, Fang Z, Jiang Q, Wu F-X. CircR2Disease: a manually curated database for experimentally supported circular RNAs associated with various diseases. Database. 2018;2018.
  23. Zhao Z, Wang K, Wu F, Wang W, Zhang K, Hu H, et al. circRNA disease: a manually curated database of experimentally supported circRNA-disease associations. Cell death & disease. 2018;9(5):475.
  24. Yao D, Zhang L, Zheng M, Sun X, Lu Y, Liu P. Circ2Disease: A manually curated database of experimentally validated circRNAs in human disease. Scientific reports. 2018;8(1):11018. pmid:30030469
  25. Meng X, Hu D, Zhang P, Chen Q, Chen M. CircFunBase: a database for functional circular RNAs. Database. 2019;2019.
  26. Xiao Q, Luo J, Dai J. Computational prediction of human disease-associated circRNAs based on manifold regularization Learning framework. IEEE journal of biomedical and health informatics. 2019.
  27. Fan C, Lei X, Wu F-X. Prediction of CircRNA-Disease Associations Using KATZ Model Based on Heterogeneous Networks. International journal of biological sciences. 2018;14(14):1950. pmid:30585259
  28. Lei X, Fang Z, Chen L, Wu F-X. PWCDA: Path Weighted Method for Predicting circRNA-Disease Associations. International journal of molecular sciences. 2018;19(11):3410.
  29. Jeffrey HJ. Chaos game representation of gene structure. Nucleic acids research. 1990;18(8):2163–70. pmid:2336393
  30. Lord PW, Stevens RD, Brass A, Goble CA. Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics. 2003;19(10):1275–83. pmid:12835272
  31. Wang JZ, Du Z, Payattakool R, Yu PS, Chen C-F. A new method to measure the semantic similarity of GO terms. Bioinformatics. 2007;23(10):1274–81. pmid:17344234
  32. Xuan P, Han K, Guo M, Guo Y, Li J, Ding J, et al. Prediction of microRNAs associated with human diseases based on weighted k most similar neighbors. PloS one. 2013;8(8):e70204. pmid:23950912
  33. Zheng K, Wang L, You Z-H. CGMDA: An Approach to Predict and Validate MicroRNA-Disease Associations by Utilizing Chaos Game Representation and LightGBM. IEEE Access. 2019;7:133314–23.
  34. Zheng K, You Z-H, Wang L, Zhou Y, Li L-P, Li Z-W. DBMDA: A Unified Embedding for Sequence-Based miRNA Similarity Measure with Applications to Predict and Validate miRNA-Disease Associations. Molecular Therapy-Nucleic Acids. 2020;19:602–11. pmid:31931344
  35. Zhong Y, Du Y, Yang X, Mo Y, Fan C, Xiong F, et al. Circular RNAs function as ceRNAs to regulate and control human cancer progression. Molecular cancer. 2018;17(1):79. pmid:29626935
  36. Huang Y-A, Chan KC, You Z-H. Constructing prediction models from expression profiles for large scale lncRNA–miRNA interaction profiling. Bioinformatics. 2018;34(5):812–9. pmid:29069317
  37. Liu Y, Zeng X, He Z, Zou Q. Inferring microRNA-disease associations by random walk on a heterogeneous network with multiple data sources. IEEE/ACM transactions on computational biology and bioinformatics. 2016;14(4):905–15. pmid:27076459
  38. van Laarhoven T, Nabuurs SB, Marchiori E. Gaussian interaction profile kernels for predicting drug–target interaction. Bioinformatics. 2011;27(21):3036–43. pmid:21893517
  39. Xiao Q, Luo J, Dai J. Computational prediction of human disease-associated circRNAs based on manifold regularization learning framework. IEEE journal of biomedical and health informatics. 2019;23(6):2661–9. pmid:30629521