Skip to main content
Advertisement
  • Loading metrics

Adaptive multi-view multi-label learning for identifying disease-associated candidate miRNAs

Abstract

Increasing evidence has indicated that microRNAs(miRNAs) play vital roles in various pathological processes and thus are closely related with many complex human diseases. The identification of potential disease-related miRNAs offers new opportunities to understand disease etiology and pathogenesis. Although there have been numerous computational methods proposed to predict reliable miRNA-disease associations, they suffer from various limitations that affect the prediction accuracy and their applicability. In this study, we develop a novel method to discover disease-related candidate miRNAs based on Adaptive Multi-View Multi-Label learning(AMVML). Specifically, considering the inherent noise existed in the current dataset, we propose to learn a new affinity graph adaptively for both diseases and miRNAs from multiple similarity profiles. We then simultaneously update the miRNA-disease association predicted from both spaces based on multi-label learning. In particular, we prove the convergence of AMVML theoretically and the corresponding analysis indicates that it has a fast convergence rate. To comprehensively illustrate the prediction performance of our method, we compared AMVML with four state-of-the-art methods under different validation frameworks. As a result, our method achieved comparable performance under various evaluation metrics, which suggests that our method is capable of discovering greater number of true miRNA-disease associations. The case study conducted on thyroid neoplasms further identified a potential diagnostic biomarker. Together, the experimental results confirms the utility of our method and we anticipate that our method could serve as a reliable and efficient tool for uncovering novel disease-related miRNAs.

Author summary

MiRNAs are a class of small non-coding RNAs that are associated with a variety of complex biological processes. Increasing studies have shown that miRNAs have close relationships with many human diseases. The prediction of the associations between miRNAs and diseases has thus become a hot topic. Although traditional experimental methods are reliable, they could only identify a limited number of associations as they are in general time-consuming and expensive. Consequently, great efforts have been made to effectively predict reliable disease-related miRNAs based on computational methods. In this study, we develop a novel method to discover potential miRNA-disease associations based on Adaptive Multi-View Multi-Label learning. Considering the inherent noise existed in the current dataset, we propose to learn a new affinity graph adaptively for both diseases and miRNAs from multiple biological data source, including miRNA sequence similarity, miRNA functional similarity and Gaussian interaction profile kernel similarity. Notably, our method is applicable to diseases without any known associated miRNAs and also obtains satisfactory results. The case study conducted on thyroid neoplasms further confirms the prediction reliability of the proposed method. Overall, results show that our method can predict the potential associations between miRNAs and diseases effectively.

Introduction

MiRNAs are a group of short non-coding RNAs that mediate post-transcriptional gene silencing[1]. Accumulating evidence has proved that miRNAs play crucial roles in a variety of cancer-related pathways. Therefore, the identification of miRNA-disease associations can shed new light on understanding possible pathogenesis of diseases.

To compensate for the limitations of experiment-based approaches, a great number of computational models have been proposed to identify potential disease-related miRNAs in recent years[2]. Under the assumption that functionally similar miRNAs tend to be associated with phenotypically similar diseases, Jiang et al. prioritized the entire microRNAome for over a thousand diseases by constructing an integrated phenome-microRNAome network[3]. Chen et al. measured the global network similarity and inferred potential miRNA-disease interactions based on random walk with restart[4]. Shi et al. adopted a similar idea and further integrated the protein-protein interactions into the prediction process[5]. Chen et al. proposed a novel heterogeneous graph inference method by iteratively updating the association probability[6, 7]. Liu et al. constructed a heterogeneous network in which they integrated the miRNA-target gene and miRNA-lncRNA associations[8]. Specifically, the methods introduced above mainly predicted disease-related miRNAs by applying random walk algorithms to the reconstructed similarity networks[9]. Another family of prediction methods was generally based on network topological characteristics and also achieved remarkable performance. For instance, Zou et al. computed the similarity score based on walks of different lengths between the miRNA and disease nodes[10]. Sun et al. exploited the potential disease-related miRNAs based on known miRNA-disease network topological similarity[11]. You et al. proposed to measure the association score for a miRNA-disease pair by calculating the accumulative contributions from all paths between them[12]. Li et al. used DeepWalk to enhance the existing associations through a topology-based similarity measure[13]. Chen et al. computed the association possibility between a disease node and a miRNA node in the corresponding graphlet interaction isomers[14]. Although effective, these methods are sensitive to the change of the network topological structures, which might affect the prediction accuracy. Alternatively, prediction methods that were based on semi-supervised learning as well as supervised learning have been well developed. Xiao et al. introduced a graph regularized non-negative matrix factorization to effectively discover sparse miRNA-disease associations[15]. Both Chen et al. and Yu et al. adopted matrix completion to recover the potential missing miRNA-disease associations[16, 17]. Zeng et al. used a derivative algorithm structural perturbation method to estimate the link predictability with structural consistency as the indicator[18]. Chen et al. used an ensemble model where a sequence of weak learners were trained to collectively obtain a predicted association score[19]. Recently, we reconstructed the miRNA and disease similarity matrices based on global linear neighborhoods and then applied label propagation to predict potential associations between diseases and miRNAs[20, 21]. Chen et al. extracted novel feature vectors for both miRNAs and diseases to train a random forest classifier for the prediction task[22].

Although great efforts have been made to efficiently uncover potential miRNA-disease associations, most existing computational approaches still suffer from several limitations. Specifically, the inherent noise in the current datasets resulted in incomplete and sparse similarity matrices and thus inevitably affected the prediction accuracies of these methods. Moreover, the integration of multiple biological data sources in calculating the similarity matrices for both miRNAs and diseases was generally performed by averaging the input similarity information, which might lead to suboptimal results. Lastly, the predicted association scores from miRNA space and disease space were often updated separately during the learning process. To solve these problems, in this paper, we propose a novel Adaptive Multi-View Multi-Label(AMVML) learning framework to infer disease-related miRNAs. In particular, our method adaptively learns a new affinity graph for miRNAs and diseases respectively from multiple data sources (i.e. miRNA sequence similarity, Gaussian interaction profile kernel similarity and so on). In addition, we unify the optimization process for both disease space and miRNA space based on multi-label learning. The experimental results under several different evaluation metrics clearly demonstrate the superior performance of our method over previous methods. We further carry out a case study on thyroid cancer to identify potential prognostic biomarkers.

Materials and methods

Human miRNA-disease associations

The known human miRNA-disease associations were retrieved from HMDD v2.0 database[23]. HMDD is a database for experimentally supported human miRNA and disease associations that were manually collected from all the miRNA-related publications in PubMed. Each entry in HMDD contains four items, i.e. miRNA name, disease name, experimental evidence for the miRNA-disease association and the publication PubMed ID. To keep consistent of data from different sources, we also downloaded the annotation information of 4796 human miRNAs released on March 2018 from miRBase[24]. We then downloaded the latest MeSH descriptors from the National Library of Medicine(https://www.nlm.nih.gov/) and only retained items from Category C for diseases, which resulted in 11572 unique items. After mapping the miRNA names and disease names involved in each association with miRBase records and MeSH descriptors, we finally obtained 6088 associations between 328 diseases and 550 miRNAs for subsequent analysis(S1 File). Specifically, we classified the 328 diseases based on the Diseases Categories provided in MeSH. For diseases belonging to multiple categories, we increased the count by one for each category accordingly. As a result (Fig 1, S2 File), we can see that most diseases recorded in HMDD were cancers. For convenience, we used a binary matrix Y ∈ ℝ328×550 to represent the miRNA-disease associations. For a given disease i and miRNA j, Yij = 1 if i is related to j, and Yij = 0 otherwise.

thumbnail
Fig 1. The type and distribution of the 328 diseases recorded in HMDD v2.0.

https://doi.org/10.1371/journal.pcbi.1006931.g001

Disease semantic similarity

As described in [25], the disease semantic similarity can be calculated based on Directed Acyclic Graphs (DAGs). Specifically, for a given disease d, its DAG is composed of three items, i.e. DAG = (d, T(d), E(d)), where T(d) represents d itself together with all its ancestor nodes, and E(d) contains all direct links connecting the parent nodes to child nodes. The contribution Dd(t) of a disease t in a DAGd to the semantics of disease d was defined as follows: (1)

The semantic similarity score between two diseases i and j can then be calculated by: (2)

Moreover, the similarity between a given disease d and a group of diseases Dt = {dt1, dt2,…, dtk} was defined by: (3)

Finally, we obtained the semantic similarities for each disease pair according to (Eq 2). We denoted the semantic similarity matrix as AD(1)∈ ℝ328×328 where represents the semantic similarity between disease i and disease j(S3 File).

MiRNA similarity measures

In this subsection, to comprehensively characterize similarities between miRNAs, we adopt three measures using different biological data sources for subsequent predictions[26].

MiRNA sequence similarity.

The sequence information of all miRNAs were downloaded from miRBase[24]. We then used the "pairwiseAlignment" function in R package Biostrings to calculate a similarity score for each miRNA pair based on their entire mature sequences with a gap opening penalty of 5 and a gap extension penalty of 2. Moreover, to generate a substitution matrix for sequence alignment, we set the match score to 1 and the mismatch score to -1. Finally, the sequence similarity score obtained for each miRNA pair was further normalized to the range [0, 1] by the following equation: (4) where Scoremin and Scoremax represent the minimum and maximum similarity score of all miRNA pairs. For simplicity, we use AM(1)∈ ℝ328×550 to denote the sequence similarity matrix where represents the sequence similarity between miRNA i and miRNA j(S4 File).

MiRNA functional similarity.

To measure the functional similarity for each miRNA pair, we followed the same pipeline presented in [25]. Let DTi and DTj denote the disease sets related to miRNA i and j, respectively, the functional similarity is calculated as follows: (5) where S(dt, DT) is the same as that defined in (Eq 3). We use AM(2) ∈ ℝ550×550 to denote the miRNA functional similarity matrix and represents the functional similarity between miRNA i and miRNA j(S5 File).

MiRNA semantic similarity.

As stated in previous section, the miRNA functional similarities could be obtained based on the overlap of miRNA-related diseases[15]. However, it is not applicable to miRNAs without any known associated diseases. Therefore, we here propose to use miRNA target information and the Gene Ontology (GO) annotations to better describe the miRNA semantic similarities. To this end, we first downloaded the experimentally-verified miRNA-gene interactions from mirTarBase[27], which contains 380639 interactions between 2599 miRNAs and 15064 genes. For each miRNA pair in our analysis, we maintained their target gene lists and then calculated the semantic similarity between the two corresponding gene groups by using the "clusterSim" function in the R package GOSemSim[28]. Specifically, the GO annotations were retrieved from the Bioconductor package "org.Hs.eg.db" and "BMA" method was used for combining semantic similarity scores of multiple GO terms[29]. Similarly, we used AM(3)∈ ℝ550×550 to denote the miRNA semantic similarity matrix where represents the semantic similarity between miRNA i and miRNA j(S6 File).

Gaussian interaction profile kernel similarity

Gaussian interaction profile kernel similarity has been widely used in previous studies and proved effective in measuring both miRNA and disease similarities. For a given miRNA i or disease j, its interaction profile IP(mi) or IP(dj) was a binary vector extracted from the i-th row or the j-th column of the association matrix Y. The kernel similarity between two miRNAs or two diseases could then be computed by: (6) (7) where βm and βd are defined as follows: (8) (9) where β'm and β'd are two parameters controlling the kernel bandwidth. As a result, we used AM(4)∈ ℝ550×550 and AD(2)∈ ℝ328×328 to represent the obtained Gaussian interaction profile similarity matrices for miRNAs and diseases, respectively.

Adaptive multi-view multi-label learning for miRNA-disease association prediction

We summarize the notations used throughout this paper. Given a matrix M, Mij and Mi represent its ij-th element and i-th row, respectively. The transpose of M is denoted by MT. Tr(M) denotes the trace of M and the Frobenius norm of M is represented as ||M||F. For a similarity matrix S, its Laplacian matrix LS is defined as , where DS is a diagonal matrix with its i-th diagonal element equal to j(Sij + Sji)/2.

Graph-based multi-label learning.

Multi-label learning refers to the problems where an instance can be assigned to more than one category[30]. The graph-based multi-label learning framework is characterized by simultaneously exploiting the inherent correlations among multiple labels and the label consistency over the graph[31]. As a matter of fact, since each miRNA or disease could be associated with multiple diseases or miRNAs, this learning framework can be directly applied to solve the miRNA-disease association prediction problem by defining its objective function as follows: (10) where L and C are the normalized Laplacian matrices corresponding to the similarity matrices of miRNAs and diseases, respectively. μ and v are two non-negative trade-off parameters. By differentiating the objective function with respect to F, (Eq 10) can be efficiently solved by a Sylvester equation.

AMVML.

The graph-based multi-label learning provides us a unified framework to collaboratively update the prediction results from miRNA space and disease space, which successfully solves the last issue mentioned above. However, we still suffer from the inherent noises in the existing datasets as well as the lack of appropriate methods to integrate datasets from multiple biological data sources. To conquer these limitations, we propose a new objective function which can adaptively learns new affinity graphs for miRNAs and diseases given similarity information obtained from multiple views, respectively(Fig 2). Moreover, instead of explicitly assigning weights for each view, our method can also perform self-conducted weight learning during the optimization process[32]. Specifically, let n be the number of views in miRNA space and AM(1), AM(2),…, AM(n) be the corresponding miRNA similarity matrix of each view, where AM(u) ∈ ℝp×p (1≤un). Similarly, let m be the number of views in disease space and AD(1), AD(2),…, AD(m) be the corresponding disease similarity matrix of each view, where AD(v) ∈ ℝq×q (1≤vm). p and q are the number of miRNAs and diseases, respectively. Our objective is to obtain the predicted association matrix F as well as two optimal similarity matrices SD and SM by considering multiple input views in both disease and miRNA spaces simultaneously, which is formulated as follows: (11) where LSD and LSM are the corresponding Laplacian matrices for SD and SM, respectively. 1 is a column vector with all elements equal to 1. and are weight parameters for the v-th view of disease similarity and u-th view of miRNA similarity defined by: (12) (13)

thumbnail
Fig 2. The overall workflow of AMVML.

Our method first learns two new affinity graphs for miRNAs and diseases from multiple sources of biological datasets, respectively. It then updates the miRNA-disease association information simultaneously based on multi-label learning framework. It repeats this process until the algorithm converges and finally outputs the prediction results.

https://doi.org/10.1371/journal.pcbi.1006931.g002

As can be seen from Eqs (12) and (13), the two weight factors and are seamlessly coupled with ||SD—AD(v)||F and ||SM—AM(u)||F, respectively. Intuitively, if the v-th or u-th view is good, then ||SD—AD(v)||F or ||SM—AM(u)||F should be small and thus the learnt or will be assigned a larger weight accordingly. As a result, and are updated adaptively in terms of the quality of the corresponding view during each iteration, which essentially makes the optimization of our objective function a self-weighted learning process. In the next part, we propose an efficient algorithm to solve (Eq 11).

Optimization.

It is difficult to directly solve (Eq 11) as it involves three variables. Therefore, we iteratively optimize one variable by fixing the others[33].

  1. Solving SD. When SM and F are fixed, (Eq 11) becomes: (14)
    (Eq 14) can be further transformed into: (15)
    Since (Eq 15) is independent for different i, we can optimize each row separately: (16)
    Denoting zi as a vector with its j-th element and similarly for SDi and , (Eq 16) could be rewritten as: (17)
    Problem (17) can be solved efficiently by an iterative algorithm[34].
  2. Solving SM. Similarly, when SD and F are fixed, (Eq 11) becomes: (18)
    By following the same optimization process for SD, we can derive the solution for problem (18) as follows: (19)
  3. Solving F. When SD and SM are fixed, (Eq 11) degenerates to: (20)
    Taking the derivative of (Eq 20) with respect to F and setting it to zero, we could obtain: (21)
    (Eq 21) is a Sylvester equation and could be solved directly. We summarized the overall procedure in Algorithm 1. Besides, the datasets and source code of AMVML are freely available at https://github.com/alcs417/AMVML.

Algorithm 1: AMVML

Input: miRNA similarity matrices of n views {AM(1), AM(2),…, AM(n)}, disease similarity matrices of m views {AD(1), AD(2),…, AD(m)}, known association matrix Y ∈ ℝq×p, parameter α and β.

Output: Predicted association matrix F.

1. Initialize the weights of each view for both miRNAs and diseases by .

2. Repeat:

3.  Repeat:

4.   Update SD by solving problem ().

5.   Update SM by solving problem ().

6.   Update F with ().

7.  Until convergence

8. Update according to Eqs () and ()

9. Until convergence

10. Return SD, SM and F

Theoretical convergence analysis.

We prove the convergence of Algorithm 1 by the following theorem.

Theorem 1. The iterative optimization process in Algorithm 1 can monotonically decrease the objective function value of (Eq 11) until convergence.

Proof. By fixing SM and F, the optimization for SD in (Eq 17) is a quadratic programming problem[35]. Specifically, the Hessian matrix of the Lagrange function of (Eq 17) is positive definite, i.e. 2I ∈ ℝq×q. Therefore, we arrive at: (22)

Similarly, we could prove that the optimization for SM by fixing the other variables is also convex and we could obtain that: (23)

By fixing SD and SM, the optimization function for updating F is also convex[36]. Moreover, since βLSM+I is positive definite and βLSD is positive semi-definite, the eigenvalues σ1, σ2,…,σl of βLSM+I and ξ1, ξ2,…,ξk of βLSD satisfy the inequality σij> 0 (i = 1,…,l; j = 1,…,k), which guarantees that there is a unique solution to (Eq 21). As a result, we could obtain: (24)

As demonstrated in the above analysis, Algorithm 1 can monotonically decrease the objective function value of (Eq 11) in each iteration until it converges.

Results

Performance evaluation

To systematically evaluate the performance of our method and illustrate its superiority over existing alternatives, we compared AMVML with fourstate-of-the-art methods, i.e. IMCMDA[37], SPMMDA[38], PBMDA[12] and EGBMMDA[19] under several evaluation metrics. All these methods have been proved effective in predicting reliable disease-associated miRNAs. First of all, we adopted the global Leave-One-Out Cross-Validation(LOOCV) and five-fold cross-validation to test the general prediction performance. Specifically, in the framework of global LOOCV, each known miRNA-disease association was selected as a test sample while the remaining associations were considered as training samples. For five-fold cross-validation, all known miRNA-disease associations were randomly divided into five subsets and each subset was chosen as the test samples. Besides, the five-fold cross-validation was repeated 10 times to eliminate the potential bias caused by the sample division. The prediction performance was illustrated by Receiver Operating Characteristic(ROC) curve and the accuracy was quantified by the Area Under the ROC Curve(AUC). As shown in Fig 3, AMVML achieved the highest accuracy among all methods in both global LOOCV and five-fold cross-validation.

thumbnail
Fig 3. The comparison results between our method and the other four methods in terms of (A) global LOOCV; (B) five-fold cross-validation.

https://doi.org/10.1371/journal.pcbi.1006931.g003

Next, we employed another evaluation metric called Leave-One-Disease-Out Cross-Validation(LODOCV) to verify the prediction performance when no prior information is available. Specifically, for each disease d, we removed all known miRNAs associated with d and carried out predictions based on miRNA association information of the other diseases. Since there are no known associations for each tested disease in advance, LODOCV is more difficult than global LOOCV and five-fold cross-validation. We calculated an AUC value for each disease in LODOCV and thus obtained a vector consisting of 328 AUC values for each method. We then demonstrated the comparison results by density plots (Fig 4A). As a result, the AUC values obtained by our method mainly concentrated over the interval [0.9, 1], indicating a better performance than that of the other methods in terms of LODOCV. Wilcoxon signed-rank test further confirmed the statistical significance of the comparison results (Table 1).

thumbnail
Fig 4.

(A) The comparison results between our method and the other four methods in terms of LODOCV. As shown in the density plot, AMVML lies in the rightmost position of the figure, indicating a superior performance over the other methods; (B) The number of identified true positives in the latest HMDD by all methods.

https://doi.org/10.1371/journal.pcbi.1006931.g004

thumbnail
Table 1. Statistical significance of differences in performance between AMVML and the other four methods in terms of LODOCV.

P-values were calculated by Wilcoxon signed rank test.

https://doi.org/10.1371/journal.pcbi.1006931.t001

Lastly, we conducted experiments on real datasets to further demonstrate the prediction ability of our method. To this end, we first downloaded the older version of HMDD (v1.0) which contains 1474 known associations between 129 diseases and 280 miRNAs after filtering (S7 File). Compared to HMDD v1.0, there were 4614 (i.e. 6088–1474) new miRNA-disease associations, 199 (i.e. 328–129) new diseases and 270 (i.e. 550–280) new miRNAs involved in HMDD v2.0. In particular, among the 4614 newly recorded associations in HMDD v2.0, 2445 associations were related with miRNAs and diseases already existed in HMDD v1.0, while 2169 associations were related with either new miRNAs or new diseases only contained in HMDD v2.0. Moreover, the degree distribution of miRNAs as well as that of diseases for the 4614 associations indicating that only a minority of these associations were related with highly connected miRNAs and diseases (S1 Fig). We then applied each method on HMDD v1.0 and validated the prediction results by the 4614 associations newly added in HMDD v2.0. Therefore, for each method, the greater the number of true positives predicted, the better the performance. Specifically, we compared the number of true positives in the top-N miRNAs predicted by each method with N ranging from 10 to 50 and an interval of 10. As exhibited in Fig 4B, AMVML obtained greater number of validated disease-associated miRNAs than the other methods. Similar results were also obtained with increased N and larger intervals (S2 Fig). Taken together, the experimental results under various evaluation metrics proved the effectiveness of our method.

Parameter analysis

There were two trade-off parameters α and β in our method which balance the learned similarity matrices and the predicted association matrix. Generally, since our objective function is a minimization problem, setting a large value to α or β indicates a large impact of the label consistency between diseases or miRNAs on the learned disease or miRNA similarity matrix. To show a reasonable searching range of these two parameters as well as a general trend of the prediction performance affected by varying their values, in this subsection, we analyzed their influences on the prediction accuracy in terms of five-fold cross-validation (Fig 5A). Similar trends were also observed in global LOOCV. In particular, when β was fixed, the smaller the α, the better the performance. In contrast, when α was fixed, the performance varied in a "U" shape with the change of β. We can see that the proposed method reached the best performance when both α and β were equal to 1e-4.

thumbnail
Fig 5.

(A) The influence of the two parameters alpha and beta on the prediction accuracy of five-foldcross-validation; (B) The convergence rate of AMVML.

https://doi.org/10.1371/journal.pcbi.1006931.g005

Convergence speed in practice

As described in previous section, we have theoretically proved the convergence of our algorithm. Here we investigated the convergence rate of our method by analyzing the variations of the objective function value in (Eq 11) with respect to the number of iterations. As demonstrated in Fig 5B, the objective function value reached a steady state within 5 iterations, indicating a fast convergence rate of our method.

Case study

In this section, we conducted a case study on thyroid neoplasms to identify potential miRNA biomarkers for this disease. The overall prediction results and the differential expression analysis for several other diseases were also provided on Github (https://github.com/alcs417/AMVML). Thyroid cancer is the most common endocrine cancer and its incidence rate has increased rapidly over recent years[39]. We first downloaded the miRNA expression profiles together with the clinical information of thyroid carcinoma from GDC data portal (https://portal.gdc.cancer.gov/projects/TCGA-THCA). Concretely, the downloaded data contained 506 tumors samples and 59 normal samples and each sample measured the expression level of 1881 miRNAs. We then applied our method on the given disease to obtain the top-10 predicted miRNAs(Table 2). Specifically, we evaluated the classification power of these miRNAs in differentiating tumor samples from normal samples according to their expression profile and the results of five-fold cross-validation illustrated that they could achieve a mean classification accuracy of 0.983(S3 Fig). Next, we calculated for each miRNA the fold-change as well as the statistical significance of differential expression using the R package edgeR (Table 2)[40]. Besides, we searched in another two databases dbDEMC and miR2Disease to see if the predicted miRNAs were also recorded in them[41, 42]. dbDEMC is an integrated database that designed to store and display differentially expressed miRNAs in human cancers detected by high-throughout methods while miR2Disease is a manually curated database providing information about miRNA deregulation in various human diseases. As a result, the expression level of the top predicted miRNA hsa-mir-181a-2 was significantly altered between tumor samples and normal samples (log2 fold-change > 1 and adjusted p-value< 0.05), which is consistent with the records in both db2DEMC and miR2Disease. Therefore, we further checked whether this miRNA could serve as a potential biomarker for thyroid cancer. Specifically, we carried out one-way ANOVA test to validate whether its expression level at different tumor stages also significantly altered. The tumor stages of all patients were retrieved from the clinical information and there were six pathologic stages after filtering. As expected, the expression level of hsa-mir-181a-2 varied significantly among different stages (Fig 6A). Furthermore, the Kaplan-Meier survival analysis confirmed that the survival rates of patients were also significantly related with its expression level (Fig 6B)[43]. Taken together, our results provided new evidence for the functional role of hsa-mir-181a-2 in the development of thyroid cancer.

thumbnail
Fig 6.

(A) The expression profiles of hsa-mir-181a-2 at different tumor stages. P-value was calculated by one-way ANOVA test; (B) Kaplan-Meier survival analysis using hsa-mir-181a-2 in thyroid cancer. The patients were divided into two groups ’High’ and ’Low’ according to their expression level of hsa-mir-181a-2 against the sample mean. As observed, patients with lower expression levels were at lower risk level.

https://doi.org/10.1371/journal.pcbi.1006931.g006

thumbnail
Table 2. Top-10 miRNAs predicted to be associated with thyroid neoplasms by AMVML.

https://doi.org/10.1371/journal.pcbi.1006931.t002

Discussion

Identification of disease-associated miRNAs has drawn much attention during the past decade and it still remains a challenging task. In this study, we proposed a novel computational framework to effectively uncover the potential links between miRNAs and diseases. Our method integrated datasets from multiple sources and adaptively learned two new similarity graphs. Specifically, instead of assigning predetermined weight values to each input similarity matrix, the proposed method automatically updated the view weights according to the reliability of each view. It is also worth mentioning that our method could be easily extended if there are new data sources available. Besides, our method could simultaneously update the prediction results from both disease space and miRNA space. The convergence of our method has been proved both theoretically and experimentally. To demonstrate the utility of our method, we compared AMVML with five state-of-the-art methods and the experimental results confirmed the superiority of our method. We then applied our method on thyroid cancer and found that hsa-mir-181a-2 could be a potential prognostic biomarker. Notably, our method is not limited to discover miRNAs for which an association is already known between its paralogous miRNA and the same disease. In essence, as a semi-supervised learning model, our method could fully take advantage of the limited number of known miRNA-disease associations together with multiple sources of biological datasets to reliably predict novel associations. Therefore, we anticipate that our method could serve as an effective tool for miRNA-disease association prediction.

The superior performance of our model can be largely attributed to the following two reasons. First, the consensus similarity matrices obtained from multiple biological datasets based on multi-view learning for both miRNAs and diseases are more robust to outliers and noises compared to existing methods. Second, the graph-based multi-label learning unified the two prediction spaces into one optimization framework, which enhances the inherent correlations between miRNAs and diseases from the label-consistency point of view. Nevertheless, our method still has some limitations. Specifically, there are two parameters α and β in the objective function that need to be tuned in advance, and it is a non-trivial task to find the best combination of the two parameters. In addition, although our method can adaptively learn a new affinity graph from different data sources, the integration of unreliable similarity matrices might weaken the overall prediction accuracy.

Supporting information

S1 Fig. The degree distribution of (A): Diseases and (B): miRNAs.

https://doi.org/10.1371/journal.pcbi.1006931.s001

(EPS)

S2 Fig. The number of identified true positives in HMDD v2.0 by all methods with N ranging from (A) 200 to 1000; and (B) 2000 to 10000.

https://doi.org/10.1371/journal.pcbi.1006931.s002

(EPS)

S3 Fig. Five-fold cross-validation results obtained by taking the top 10 predicted miRNAs as features.

https://doi.org/10.1371/journal.pcbi.1006931.s003

(EPS)

S1 File. The 6088 miRNA-disease associations obtained from HMDD v2.0.

https://doi.org/10.1371/journal.pcbi.1006931.s004

(XLSX)

S2 File. The number of diseases in each disease typecontained in HMDD v2.0.

https://doi.org/10.1371/journal.pcbi.1006931.s005

(XLSX)

S3 File. Disease semantic similarity matrix.

https://doi.org/10.1371/journal.pcbi.1006931.s006

(XLSX)

S4 File. MiRNA sequence similarity matrix.

https://doi.org/10.1371/journal.pcbi.1006931.s007

(XLSX)

S5 File. MiRNA functional similarity matrix.

https://doi.org/10.1371/journal.pcbi.1006931.s008

(XLSX)

S6 File. MiRNA semantic similarity matrix.

https://doi.org/10.1371/journal.pcbi.1006931.s009

(XLSX)

S7 File. The 1474 miRNA-disease associations obtained from HMDD v1.0.

https://doi.org/10.1371/journal.pcbi.1006931.s010

(XLSX)

References

  1. 1. Lin SB, Gregory RI. MicroRNA biogenesis pathways in cancer. Nat Rev Cancer. 2015;15(6):321–33. pmid:25998712
  2. 2. Zeng X, Zhang X, Zou Q. Integrative approaches for predicting microRNA function and prioritizing disease-related microRNA using biological interaction networks. Brief Bioinform. 2016;17(2):193–203. pmid:26059461
  3. 3. Jiang Q, Hao Y, Wang G, Juan L, Zhang T, Teng M, et al. Prioritization of disease microRNAs through a human phenome-microRNAome network. BMC Syst Biol. 2010;4 Suppl 1:S2.
  4. 4. Chen X, Liu MX, Yan GY. RWRMDA: predicting novel human microRNA-disease associations. Mol Biosyst. 2012;8(10):2792–8. pmid:22875290
  5. 5. Shi H, Xu J, Zhang G, Xu L, Li C, Wang L, et al. Walking the interactome to identify human miRNA-disease associations through the functional link between miRNA targets and disease genes. BMC Syst Biol. 2013;7:101. pmid:24103777
  6. 6. Chen X, Yan CC, Zhang X, You ZH, Huang YA, Yan GY. HGIMDA: Heterogeneous graph inference for miRNA-disease association prediction. Oncotarget. 2016;7(40):65257–69. pmid:27533456
  7. 7. Chen X, Yin J, Qu J, Huang L. MDHGI: Matrix Decomposition and Heterogeneous Graph Inference for miRNA-disease association prediction. PLoS Comput Biol. 2018;14(8):e1006418. pmid:30142158
  8. 8. Liu YS, Zeng XX, He ZY, Zou Q. Inferring MicroRNA-Disease Associations by Random Walk on a Heterogeneous Network with Multiple Data Sources. IEEE-ACM Trans Comput Biol Bioinform. 2017;14(4):905–15.
  9. 9. Xuan P, Han K, Guo Y, Li J, Li X, Zhong Y, et al. Prediction of potential disease-associated microRNAs based on random walk. Bioinformatics. 2015;31(11):1805–15. pmid:25618864
  10. 10. Zou Q, Li J, Hong Q, Lin Z, Wu Y, Shi H, et al. Prediction of MicroRNA-Disease Associations Based on Social Network Analysis Methods. Biomed Res Int. 2015;2015:810514. pmid:26273645
  11. 11. Sun D, Li A, Feng H, Wang M. NTSMDA: prediction of miRNA-disease associations by integrating network topological similarity. Mol Biosyst. 2016;12(7):2224–32. pmid:27153230
  12. 12. You ZH, Huang ZA, Zhu Z, Yan GY, Li ZW, Wen Z, et al. PBMDA: A novel and effective path-based computational model for miRNA-disease association prediction. PLoS Comput Biol. 2017;13(3):e1005455. pmid:28339468
  13. 13. Li GH, Luo JW, Xiao Q, Liang C, Ding PJ, Cao BW. Predicting MicroRNA-Disease Associations Using Network Topological Similarity Based on DeepWalk. Ieee Access. 2017;5:24032–9.
  14. 14. Chen X, Guan NN, Li JQ, Yan GY. GIMDA: Graphlet interaction-based MiRNA-disease association prediction. J Cell Mol Med. 2018;22(3):1548–61. pmid:29272076
  15. 15. Xiao Q, Luo JW, Liang C, Cai J, Ding PJ. A graph regularized non-negative matrix factorization method for identifying microRNA-disease associations. Bioinformatics. 2018;34(2):239–48.
  16. 16. Chen X, Wang L, Qu J, Guan NN, Li JQ. Predicting miRNA-disease association based on inductive matrix completion. Bioinformatics. 2018;34(24):4256–65. pmid:29939227
  17. 17. Yu SP, Liang C, Xiao Q, Li GH, Ding PJ, Luo JW. MCLPMDA: A novel method for miRNA-disease association prediction based on matrix completion and label propagation. J Cell Mol Med. 2018.
  18. 18. Zeng X, Liu L, Lu L, Zou Q. Prediction of potential disease-associated microRNAs using structural perturbation method. Bioinformatics. 2018;34(14):2425–32. pmid:29490018
  19. 19. Chen X, Huang L, Xie D, Zhao Q. EGBMMDA: Extreme Gradient Boosting Machine for MiRNA-Disease Association prediction. Cell Death Dis. 2018;9(1):3. pmid:29305594
  20. 20. Yu SP, Liang C, Xiao Q, Li GH, Ding PJ, Luo JW. GLNMDA: a novel method for miRNA-disease association prediction based on global linear neighborhoods. RNA Biol. 2018;15(9):1215–27. pmid:30244645
  21. 21. Qu Y, Zhang HX, Lyu C, Liang C. LLCMDA: A Novel Method for Predicting miRNA Gene and Disease Relationship Based on Locality-Constrained Linear Coding. Frontiers in genetics. 2018;9.
  22. 22. Chen X, Wang CC, Yin J, You ZH. Novel Human miRNA-Disease Association Inference Based on Random Forest. Mol Ther Nucleic Acids. 2018;13:568–79. pmid:30439645
  23. 23. Li Y, Qiu C, Tu J, Geng B, Yang J, Jiang T, et al. HMDD v2.0: a database for experimentally supported human microRNA and disease associations. Nucleic Acids Res. 2014;42(Database issue):D1070–4. pmid:24194601
  24. 24. Kozomara A, Griffiths-Jones S. miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Res. 2014;42(Database issue):D68–73. pmid:24275495
  25. 25. Wang D, Wang J, Lu M, Song F, Cui Q. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics. 2010;26(13):1644–50. pmid:20439255
  26. 26. Zou Q, Li J, Song L, Zeng X, Wang G. Similarity computation strategies in the microRNA-disease network: a survey. Briefings in functional genomics. 2016;15(1):55–64. pmid:26134276
  27. 27. Chou CH, Shrestha S, Yang CD, Chang NW, Lin YL, Liao KW, et al. miRTarBase update 2018: a resource for experimentally validated microRNA-target interactions. Nucleic Acids Res. 2018;46(D1):D296–D302. pmid:29126174
  28. 28. Yu GC, Li F, Qin YD, Bo XC, Wu YB, Wang SQ. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics. 2010;26(7):976–8. pmid:20179076
  29. 29. Wang JZ, Du Z, Payattakool R, Yu PS, Chen CF. A new method to measure the semantic similarity of GO terms. Bioinformatics. 2007;23(10):1274–81. pmid:17344234
  30. 30. Chen G, Song Y, Wang F, Zhang C, editors. Semi-supervised Multi-label Learning by Solving a Sylvester Equation. Siam International Conference on Data Mining, SDM 2008, April 24–26, 2008, Atlanta, Georgia, Usa; 2008.
  31. 31. Zha ZJ, Mei T, Wang J, Wang Z, Hua XS. Graph-based semi-supervised learning with multiple labels. J Vis Commun Image Represent. 2009;20(2):97–103.
  32. 32. Nie FP, Cai GH, Li J, Li XL. Auto-Weighted Multi-View Learning for Image Clustering and Semi-Supervised Classification. Ieee Transactions on Image Processing. 2018;27(3):1501–11. pmid:28945592
  33. 33. Nie F, Li J, Li X. Self-weighted Multiview Clustering with Multiple Graphs. The Twenty-Sixth International Joint Conference on Artificial Intelligence; Melbourne, Australia 2017. p. 2564–70.
  34. 34. Huang J, Nie FP, Huang H, editors. A new simplex sparse learning model to measure data similarity for clustering. The twenty-fourth International Conference on Aritificial Intelligence; 2015; Buenos Aires, Argentina.
  35. 35. Zhu L, Huang Z, Li ZH, Xie L, Shen HT. Exploring Auxiliary Context: Discrete Semantic Transfer Hashing for Scalable Image Retrieval. Ieee Transactions on Neural Networks and Learning Systems. 2018;29(11):5264–76. pmid:29994644
  36. 36. Zhang HX, Cao LL, Gao S. A locality correlation preserving support vector machine. Pattern Recogn. 2014;47(9):3168–78.
  37. 37. Chen X, Wang L, Qu J, Guan NN, Li JQ. Predicting miRNA-disease association based on inductive matrix completion. Bioinformatics. 2018.
  38. 38. Zeng XX, Liu L, Lu LY, Zou Q. Prediction of potential disease-associated microRNAs using structural perturbation method. Bioinformatics. 2018;34(14):2425–32. pmid:29490018
  39. 39. Wang TS, Sosa JA. Thyroid surgery for differentiated thyroid cancer—recent advances and future directions. Nature Reviews Endocrinology. 2018;14(11):670–83.
  40. 40. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40. pmid:19910308
  41. 41. Yang Z, Wu L, Wang A, Tang W, Zhao Y, Zhao H, et al. dbDEMC 2.0: updated database of differentially expressed miRNAs in human cancers. Nucleic Acids Res. 2017;45(D1):D812–D8. pmid:27899556
  42. 42. Jiang Q, Wang Y, Hao Y, Juan L, Teng M, Zhang X, et al. miR2Disease: a manually curated database for microRNA deregulation in human disease. Nucleic Acids Res. 2009;37(Database issue):D98–104. pmid:18927107
  43. 43. Goel MK, Khanna P, Kishore J. Understanding survival analysis: Kaplan-Meier estimate. Int J Ayurveda Res. 2010;1(4):274–8. pmid:21455458