MCLPMDA: A novel method for miRNA‐disease association prediction based on matrix completion and label propagation

Abstract MiRNAs are a class of small non‐coding RNAs that are involved in the development and progression of various complex diseases. Great efforts have recently been made to discover potential associations between miRNAs and diseases. As experimental methods are generally expensive and time‐consuming, a large number of computational models have been developed to predict reliable disease‐related miRNAs. However, the inherent noise and incompleteness of the existing biological datasets have inevitably limited the prediction accuracy of current computational models. To address this issue, we propose a novel method for miRNA‐disease association prediction based on matrix completion and label propagation. Specifically, our method first reconstructs a new miRNA/disease similarity matrix with a matrix completion algorithm based on known experimentally verified miRNA‐disease associations and then uses the label propagation algorithm to predict disease‐related miRNAs. As a result, MCLPMDA achieved comparable performance under different evaluation metrics and was capable of discovering a greater number of true miRNA‐disease associations. Moreover, a case study conducted on Breast Neoplasms further confirmed the prediction reliability of the proposed method. Taken together, the experimental results demonstrate that MCLPMDA can serve as an effective and reliable tool for miRNA‐disease association prediction.

carcinoma. 7 Besides, plenty of studies have indicated that miRNA mutations or misexpression are closely related to various human cancers and thus miRNAs could act as tumour suppressors and oncogenes. [8][9][10] Therefore, predicting potential miRNA-disease associations makes an important contribution to understanding the molecular mechanisms of disease pathogenesis and to improving treatment.
Traditional experimental methods such as qRT-PCR 11 and microarray profiling 12 have been adopted to identify miRNA-disease associations. Although reliable, experiment-based methods are generally expensive and time-consuming. With the rapid development of biotechnology, a vast amount of publicly available RNA-related datasets have been released, which provides great opportunities for uncovering potential associations between diseases and miRNAs computationally. Recently, considerable efforts have been made to discover disease-associated miRNAs based on the assumption that miRNAs with similar functions tend to be associated with similar diseases. 13 Jiang et al. constructed a human phenome-miRNAome functional association network and proposed the first computational model to infer candidate disease-related miRNAs based on a hypergeometric distribution scoring system. By testing the proposed model on 270 known experimentally verified miRNA-disease associations, they achieved an accuracy of 0.758 in leave-one-out cross validation (LOOCV). 14 They further proposed a weighted network-based method to improve the calculation of the concordance score between a specific miRNA and a given disease, and achieved an area under the receiver operating characteristic curve (AUC) value of 0.80 in global LOOCV. 15 Nevertheless, the high false-positive rate in miRNA target predictions severely limited the efficacy of Jiang's methods. By incorporating miRNA-target interactions, disease-gene associations and protein-protein interactions, Shi et al. introduced a modified random walk with restart (RWR) algorithm to identify miRNA-disease associations. Their approach achieved satisfactory performance in identifying known cancer-related miRNAs for nine human cancers, with AUC values of 0.713 and 0.913 in the LOOCV framework. 16 Similarly, Mørk et al.
presented a miRNA-Protein-Disease network by integrating known miRNA-protein associations and disease-protein interactions to infer potential miRNAs associated with each investigated disease. 17 Later, Xu et al. utilized known disease-related protein-coding genes to prioritize miRNA-disease associations according to context-dependent miRNA-target interactions and obtained an average overall prediction accuracy of 0.887 in cross-validation tests. 18 In contrast to previous methods, Xu's method does not depend on known disease-related miRNAs. However, their method also suffers from the high false-positive and false-negative rates present in the predicted miRNA-target interactions. By integrating known associations, disease semantic similarity, miRNA functional similarity and Gaussian interaction profile kernel similarity, Chen et al. calculated a within-score and a between-score to obtain a final confidence score for miRNA-disease associations. Specifically, they obtained an AUC value of 0.8031 in LOOCV, which demonstrated their improvement. 19 Considering that there are only very few known miRNA-disease associations and that many associations are "missing" in the known training database, Chen et al. introduced the concepts of "super-miRNA" and "super-disease" to enhance the miRNA similarity and disease similarity measures used to infer disease-related miRNAs. 20 Specifically, their method could be applied to new diseases without any known associated miRNAs as well as new miRNAs without any known associated diseases. As a result, their method achieved reliable performance with AUCs of 0.9032, 0.8323 and 0.8970 in global LOOCV, local LOOCV and 5-fold cross validation respectively.
In addition, machine learning-based methods for predicting miRNA-disease associations have attracted widespread attention. [21][22][23][24][25][26] Chen et al. proposed a computational model based on heterogeneous graph inference for miRNA-disease association prediction (HGIMDA) by integrating miRNA functional similarity, disease semantic similarity, Gaussian interaction profile kernel similarity and experimentally validated miRNA-disease associations into a heterogeneous network. 21 Concretely, HGIMDA adopted an iterative process to find the optimal solutions based on global network similarity information, which led to superior performance over local network similarity-
Recently, several path-based methods taking advantage of network topological structures have been proposed to predict miRNA-disease associations. Sun et al. proposed a method called NTSMDA, which only utilizes the miRNA-disease network topological similarity to predict disease-associated miRNAs 27 and achieved an AUC of 0.894 in the LOOCV experiment. Chen et al. devised a method, GIMDA, based on graphlet interaction, which was applied to analyse the relevance between two points. 28 The AUCs of GIMDA in global LOOCV, local LOOCV and 5-fold cross validation turned out to be 0.9006, 0.8455 and 0.8927 respectively. However, as NTSMDA and GIMDA strongly depend on network topological structure, they cannot be applied to diseases without any known associated miR-
Although existing methods have made great contributions to uncovering disease-related miRNAs, there is still room for improvement. Therefore, in this paper, we develop a novel method for miRNA-disease association prediction based on Matrix Completion and Label Propagation (MCLPMDA). An important innovation of MCLPMDA is that it leverages a matrix completion algorithm to address the sparsity and incompleteness of the similarity data, which greatly improves the prediction accuracy.
To demonstrate the effectiveness of our proposed method, we apply different evaluation metrics to comprehensively measure the prediction performance. In addition, we compare our method with four state-of-the-art methods and the results indicate that our method could achieve comparable performance. Moreover, the results of case study on Breast Neoplasms (BN) further verify the reliability and robustness of MCLPMDA.
Together, all the results demonstrate that MCLPMDA can serve as an effective tool for discovering miRNA-disease associations.

| Human miRNA-disease associations
MiRNA-disease associations were downloaded directly from the HMDD v2.0 which contains 5340 experimentally verified links between 495 miRNAs and 383 diseases. 31 We used an adjacency matrix DM to describe the obtained miRNA-disease associations.
Concretely, the element DM(i,j) is 1 if disease d(i) is verified to be associated with miRNA m(j), and 0 otherwise. Therefore, the i-th row of DM is a binary vector representing the associations between disease d(i) and each miRNA, while the j-th column of DM represents the associations between miRNA m(j) and each disease.
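The construction of DM can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the association list below is a hypothetical toy input, not the HMDD v2.0 data, and the index dictionaries are helper names introduced here.

```python
import numpy as np

# Hypothetical toy association list of (disease, miRNA) pairs; the real
# input would be the verified links downloaded from HMDD v2.0.
pairs = [
    ("Breast Neoplasms", "hsa-mir-21"),
    ("Breast Neoplasms", "hsa-mir-125a"),
    ("Lung Neoplasms", "hsa-mir-21"),
]

# Fix an ordering of diseases (rows) and miRNAs (columns).
diseases = sorted({d for d, _ in pairs})
mirnas = sorted({m for _, m in pairs})
d_index = {d: i for i, d in enumerate(diseases)}
m_index = {m: j for j, m in enumerate(mirnas)}

# DM(i, j) = 1 if disease d(i) is associated with miRNA m(j), else 0.
DM = np.zeros((len(diseases), len(mirnas)), dtype=int)
for d, m in pairs:
    DM[d_index[d], m_index[m]] = 1

# Row i is the miRNA profile of disease i; column j is the disease profile of miRNA j.
print(DM)
```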

| MiRNA functional similarity
MiRNA functional similarity scores were computed based on the assumption that functionally similar miRNAs are more likely to be associated with phenotypically similar diseases. 32,33 In this paper, we downloaded the miRNA functional similarity scores directly from http://www.cuilab.cn/files/images/cuilab/misim.zip. We used the matrix FM to denote the obtained miRNA functional similarity network, in which FM(i,j) indicates the similarity between miRNA m(i) and miRNA m(j).

| Disease semantic similarity
The MeSH database (http://www.ncbi.nlm.nih.gov/) is a strict system for disease classification and a credible resource for studying the relationships between different diseases. A disease D can be described as a directed acyclic graph, DAG = (D, T(D), E(D)), where T(D) represents both node D and its ancestor nodes, and E(D) represents all direct edges connecting parent nodes to child nodes. The contribution of disease d to the semantic value of disease D can be calculated as follows:

$$D_D(d) = \begin{cases} 1, & d = D \\ \max\{\Delta \cdot D_D(d') \mid d' \in \text{children of } d\}, & d \neq D \end{cases} \tag{1}$$

Here, Δ is the semantic contribution factor and we set Δ = 0.5 in this paper. For disease D, the contribution of itself is 1, while the contribution of another disease d_j decreases as the distance between D and d_j increases. Hence, the semantic value of disease D can be calculated according to the contributions of its ancestor diseases and of disease D itself:

$$DV(D) = \sum_{d \in T(D)} D_D(d) \tag{2}$$

Then, the semantic similarity between disease d_i and disease d_j can be calculated as follows:

$$SD_{ij} = \frac{\sum_{t \in T(d_i) \cap T(d_j)} \left( D_{d_i}(t) + D_{d_j}(t) \right)}{DV(d_i) + DV(d_j)} \tag{3}$$

According to Equation (3), we can construct an overall disease semantic similarity matrix SD, where SD_ij represents the semantic similarity between disease d_i and disease d_j.
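A minimal sketch of this DAG-based similarity, assuming the usual reading in which a term contributes 1 to itself and each step towards an ancestor multiplies the contribution by Δ (taking the largest value when several paths exist). The toy hierarchy and the function names are hypothetical and are not taken from MeSH or from the authors' code.

```python
def contributions(disease, parents, delta=0.5):
    """Return {t: contribution of t to `disease`} for the disease and its ancestors."""
    contrib = {disease: 1.0}          # a disease contributes 1 to itself
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            value = delta * contrib[node]      # one more step up the DAG
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value        # keep the best (shortest) path
                frontier.append(parent)
    return contrib


def semantic_similarity(a, b, parents, delta=0.5):
    """Shared-ancestor contributions divided by the two semantic values."""
    ca = contributions(a, parents, delta)
    cb = contributions(b, parents, delta)
    shared = set(ca) & set(cb)
    numerator = sum(ca[t] + cb[t] for t in shared)
    return numerator / (sum(ca.values()) + sum(cb.values()))


# Hypothetical toy hierarchy, written as child -> list of parents.
parents = {"A": ["root"], "B": ["root"], "C": ["A"]}
print(semantic_similarity("A", "B", parents))  # siblings share only the root
print(semantic_similarity("C", "A", parents))  # child and parent share more
```

Siblings that share only a distant root score lower than a child and its parent, which matches the intuition that contributions decay with distance in the DAG.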

| Gaussian interaction profile kernel similarity for miRNAs and diseases
Based on the assumption that functionally similar miRNAs tend to be associated with similar diseases and vice versa, we first constructed the Gaussian interaction profile kernel similarity for miRNAs. 34 Specifically, a binary vector M(i) representing the i-th column of the adjacency matrix DM is considered as the interaction profile of miRNA m(i). The Gaussian kernel similarity between miRNA m(i) and m(j) can then be calculated as follows:

$$GM(m(i), m(j)) = \exp\left(-\gamma_m \|M(i) - M(j)\|^2\right) \tag{4}$$

where γ_m is a parameter that controls the kernel bandwidth and can be obtained by the following formula:

$$\gamma_m = \delta_m \Big/ \left( \frac{1}{nm} \sum_{i=1}^{nm} \|M(i)\|^2 \right) \tag{5}$$

where δ_m is a new bandwidth parameter and nm denotes the number of all the miRNAs. Similarly, the Gaussian interaction profile kernel similarity between disease d(i) and d(j) is calculated by:

$$GD(d(i), d(j)) = \exp\left(-\gamma_d \|D(i) - D(j)\|^2\right) \tag{6}$$

$$\gamma_d = \delta_d \Big/ \left( \frac{1}{nd} \sum_{i=1}^{nd} \|D(i)\|^2 \right) \tag{7}$$

where D(i) is the i-th row of DM and nd denotes the number of all the diseases. For simplicity, δ_m and δ_d were set to 1 according to previous studies. 32,34-36
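The kernel computation can be sketched with NumPy as follows. The function name `gip_kernel` and the toy matrix are assumptions made for illustration; the sketch also assumes that at least one association exists, so the average squared profile norm is non-zero.

```python
import numpy as np

def gip_kernel(profiles, delta=1.0):
    """Gaussian interaction profile kernel; each row of `profiles` is one profile."""
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    # Bandwidth: delta divided by the mean squared norm of the profiles.
    gamma = delta / sq_norms.mean()
    # Pairwise squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

# Toy adjacency matrix (rows: diseases, columns: miRNAs), hypothetical data.
DM = np.array([[1, 1, 0],
               [1, 0, 0],
               [0, 0, 1]])

GD = gip_kernel(DM)      # disease profiles are the rows of DM
GM = gip_kernel(DM.T)    # miRNA profiles are the columns of DM
```

Transposing DM is all that is needed to switch between the disease kernel and the miRNA kernel, since the two differ only in which axis supplies the profiles.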

| MCLPMDA
As mentioned above, due to the inherent noise in the current datasets, the obtained miRNA functional similarity matrix and disease semantic similarity matrix might be sparse and incomplete, which has greatly limited the prediction accuracy of existing methods. In this work, we developed a novel method named MCLPMDA to predict miRNA-disease associations based on matrix completion and label propagation. MCLPMDA can be divided into three steps. First, we construct a new miRNA similarity matrix CM as well as a disease similarity matrix CD using a matrix completion algorithm. Second, we combine the two constructed similarity matrices with existing similarity information for miRNAs and diseases respectively. Third, we run the label propagation algorithm in both miRNA space and disease space to obtain the final prediction results. The overall workflow of MCLPMDA is illustrated in Figure 1.

| Matrix completion for miRNA and disease
FIGURE 1 Flowchart of potential disease-miRNA association prediction based on the computational model of MCLPMDA (panels: matrix completion; Gaussian interaction profile kernel similarity for miRNAs; Gaussian interaction profile kernel similarity for diseases). Our algorithm mainly consists of three steps: (1) we construct a new miRNA similarity matrix as well as a disease similarity matrix based on the matrix completion algorithm; (2) the two reconstructed similarity matrices are combined with the Gaussian interaction profile kernel similarity for miRNAs and diseases respectively; (3) the label propagation algorithm is conducted in both miRNA space and disease space to obtain the final prediction results.

Available data are often far from perfect, meaning that part of a dataset may be incorrect or missing. 37 An incomplete data matrix D can be decomposed into two parts. The first part is a linear combination of D, which is a low-rank matrix and is essentially a projection of the noisy data D into a more refined or informative lower-dimensional space. The second part is a noise data matrix separated from the original data matrix D.
According to the above statement, D can be decomposed as follows:

$$D = DR + N \tag{8}$$

Equation (8) has infinitely many solutions. However, as we want R to be low-rank and N to be sparse, we impose the nuclear norm (trace norm) on R and adopt the ℓ2,1 norm to characterize the error term N. Specifically, we can obtain a low-rank recovery matrix by solving the following convex optimization problem:

$$\min_{R,N} \|R\|_* + \omega \|N\|_{2,1} \quad \text{s.t.} \quad D = DR + N \tag{9}$$

where $\|N\|_{2,1} = \sum_j \sqrt{\sum_i N_{ij}^2}$ is the noise regularization term and ω is the positive weighting parameter that balances the low-rank matrix R against the sparse matrix N. 38 After obtaining the minimizer (R*, N*), we can use DR* (or D − N*) as the low-rank recovery matrix CD.
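The optimization problem (9) is convex and can be solved in various ways, for example, by the accelerated proximal gradient method (APG), 39 the Singular Value Thresholding algorithm (SVT), 40 the Augmented Lagrange Multiplier method (ALM) 41 and the dual approach. 42 In this work, we adopt the ALM method due to its efficiency. According to ALM, Equation (9) can be converted to the following equivalent problem:

$$\min_{J,R,N} \|J\|_* + \omega \|N\|_{2,1} \quad \text{s.t.} \quad D = DR + N,\; R = J \tag{10}$$

We further adopted the Inexact ALM method to transform Equation (10) into an unconstrained problem, and then minimized this problem using the augmented Lagrange function defined as follows:

$$L = \|J\|_* + \omega \|N\|_{2,1} + \mathrm{tr}\left(Y_1^T (D - DR - N)\right) + \mathrm{tr}\left(Y_2^T (R - J)\right) + \frac{\mu}{2}\left(\|D - DR - N\|_F^2 + \|R - J\|_F^2\right) \tag{11}$$

where μ > 0 is the penalty parameter. Equation (11) can be minimized with respect to J, R and N in turn, by fixing the other variables, and then updating the Lagrange multipliers Y_1 and Y_2. Concretely, we can fix the other variables and update J by the following rule:

$$J = \arg\min_J \frac{1}{\mu}\|J\|_* + \frac{1}{2}\left\|J - \left(R + \frac{Y_2}{\mu}\right)\right\|_F^2 \tag{12}$$

It is worth noting that Equation (12) has a closed-form solution, which is given by the Singular Value Thresholding (SVT) operator. 40 Similarly, we can update R and N by fixing the others according to Equations (13) and (14) respectively:

$$R = \left(I + D^T D\right)^{-1}\left(D^T D - D^T N + J + \frac{D^T Y_1 - Y_2}{\mu}\right) \tag{13}$$

$$N = \arg\min_N \frac{\omega}{\mu}\|N\|_{2,1} + \frac{1}{2}\left\|N - \left(D - DR + \frac{Y_1}{\mu}\right)\right\|_F^2 \tag{14}$$

Equation (14) can be solved by the following lemma 43 : Let Q be a given matrix with columns q_i. If the optimal solution to

$$\min_W \lambda \|W\|_{2,1} + \frac{1}{2}\|W - Q\|_F^2$$

is W*, then the i-th column of W* is:

$$W^*(:, i) = \begin{cases} \dfrac{\|q_i\|_2 - \lambda}{\|q_i\|_2}\, q_i, & \|q_i\|_2 > \lambda \\ 0, & \text{otherwise} \end{cases}$$

After J, R and N are updated, we can update the multipliers as follows:

$$Y_1 = Y_1 + \mu (D - DR - N), \qquad Y_2 = Y_2 + \mu (R - J)$$

The convergence condition is $\|D - DR - N\|_\infty < \varepsilon$ and $\|R - J\|_\infty < \varepsilon$, where ε is a very small number (set as 1 × 10^−8 in this paper).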
Finally, after the convergence condition is reached, we could get the pure data matrix R* and noise data matrix N* and then calculate a complete data matrix by D × R*. The procedure to solve Equation (9) is outlined in Algorithm 1. According to Algorithm 1, by replacing the input data matrix D with disease semantic similarity matrix SD as well as miRNA functional similarity matrix FM, we could obtain two refined similarity matrices CD and CM respectively.
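The alternating scheme described above can be sketched with NumPy as follows. This is a hedged illustration, not the authors' released code: the closed-form updates are the standard ones obtained by minimizing an augmented Lagrangian of this kind one variable at a time, and the default values of `omega`, `mu`, `rho` and `mu_max` (including the growth rule for the penalty parameter) are assumptions introduced here. The function names are hypothetical.

```python
import numpy as np

def svt(Q, tau):
    """Singular value thresholding: shrink the singular values of Q by tau."""
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def l21_shrink(Q, lam):
    """Column-wise shrinkage solving min_W lam*||W||_{2,1} + 0.5*||W - Q||_F^2."""
    norms = np.linalg.norm(Q, axis=0)
    scale = np.maximum(norms - lam, 0.0) / np.maximum(norms, 1e-12)
    return Q * scale

def low_rank_completion(D, omega=0.1, mu=1e-4, mu_max=1e10, rho=1.1,
                        eps=1e-8, max_iter=1000):
    """Inexact ALM sketch for min ||R||_* + omega*||N||_{2,1} s.t. D = DR + N."""
    D = np.asarray(D, dtype=float)
    n = D.shape[1]
    R = np.zeros((n, n))
    J = np.zeros((n, n))
    N = np.zeros_like(D)
    Y1 = np.zeros_like(D)          # multiplier for D = DR + N
    Y2 = np.zeros((n, n))          # multiplier for R = J
    DtD = D.T @ D
    inv = np.linalg.inv(np.eye(n) + DtD)   # reused in every R update
    for _ in range(max_iter):
        J = svt(R + Y2 / mu, 1.0 / mu)
        R = inv @ (DtD - D.T @ N + J + (D.T @ Y1 - Y2) / mu)
        N = l21_shrink(D - D @ R + Y1 / mu, omega / mu)
        res1 = D - D @ R - N
        res2 = R - J
        Y1 = Y1 + mu * res1
        Y2 = Y2 + mu * res2
        mu = min(rho * mu, mu_max)  # assumed penalty growth schedule
        if max(np.abs(res1).max(), np.abs(res2).max()) < eps:
            break
    return D @ R, R, N              # completed matrix, coefficients, noise

# Usage on a small hypothetical similarity matrix.
rng = np.random.default_rng(0)
A = rng.random((6, 2))
D_toy = A @ A.T
C_toy, R_toy, N_toy = low_rank_completion(D_toy)
```

Passing the disease semantic similarity matrix SD or the miRNA functional similarity matrix FM as `D` would yield the refined matrices that the text calls CD and CM.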

Repeat:
1. Fix the others and update J by Equation (12).
2. Fix the others and update R by Equation (13).
3. Fix the others and update N by Equation (14).
4. Update the multipliers Y1 and Y2.
Until the convergence condition is reached.

In Equations (16) and (17), GD and GM represent the Gaussian interaction profile kernel similarity for diseases and miRNAs respectively. The final disease similarity matrix FDS and the final miRNA similarity matrix FMS obtained by Equations (16) and (17) are then used to infer miRNA-disease associations by label propagation.

| Label Propagation
Label propagation is a semi-supervised learning method by propagat- where t is the time step and Y_{t+1} represents the iteration results after t + 1 steps of label propagation. α ∈ [0, 1] is a hyper-parameter that balances retaining the information from a data point's neighbours against retaining its initial label information, and Y is a binary matrix encoding the initial label information of data points against each class. 44 Equation (18)  Equation (18) to update the label of each data object until convergence. Therefore, we can predict miRNA-disease associations from both disease space and miRNA space based on the label propagation algorithm: where F_D and F_M represent the prediction results from disease space and miRNA space respectively. The final association score is calculated by: where β is a hyper-parameter balancing the prediction results from disease space and miRNA space (β was simply set to 0.5 in this paper). The overall procedure of MCLPMDA is summarized in Algorithm 2. The source code of MCLPMDA can be freely downloaded at https://github.com/ShengPengYu/MCLPMDA.
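As a rough illustration of this step, the sketch below uses a commonly used form of label propagation, Y_{t+1} = α·W·Y_t + (1 − α)·Y_0, where W is a row-normalized similarity matrix. The exact update rule, the normalization, which term receives α, and which space receives β are all assumptions here, since they cannot be read off the text; the function names are hypothetical as well.

```python
import numpy as np

def row_normalize(S):
    """Scale each row of a similarity matrix to sum to 1 (assumed normalization)."""
    S = np.asarray(S, dtype=float)
    sums = S.sum(axis=1, keepdims=True)
    return S / np.where(sums == 0, 1.0, sums)

def label_propagation(S, Y0, alpha=0.5, tol=1e-6, max_iter=1000):
    """Iterate Y <- alpha * W @ Y + (1 - alpha) * Y0 until the labels stop changing."""
    W = row_normalize(S)
    Y0 = np.asarray(Y0, dtype=float)
    Y = Y0.copy()
    for _ in range(max_iter):
        Y_next = alpha * W @ Y + (1.0 - alpha) * Y0
        if np.abs(Y_next - Y).max() < tol:
            return Y_next
        Y = Y_next
    return Y

def predict(FDS, FMS, DM, alpha=0.5, beta=0.5):
    """Combine disease-space and miRNA-space propagation (assumed weighting)."""
    F_D = label_propagation(FDS, DM, alpha)        # diseases x miRNAs
    F_M = label_propagation(FMS, DM.T, alpha).T    # propagate over miRNAs, transpose back
    return beta * F_D + (1.0 - beta) * F_M

# Toy usage with hypothetical similarity matrices.
DM = np.array([[1, 0, 1],
               [0, 1, 0]])
FDS = np.array([[1.0, 0.5],
                [0.5, 1.0]])
FMS = np.ones((3, 3))
F = predict(FDS, FMS, DM)
```

With α strictly below 1 and a row-normalized W, the iteration is a contraction, so it converges regardless of the starting labels; α = 1 would drop the initial labels entirely and is best avoided in this sketch.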
Output: Predicted association matrix F.
1. Input FM to Algorithm 1 and obtain the complete miRNA similarity matrix CM.
2. Input SD to Algorithm 1 and obtain the complete disease similarity matrix CD.
3. Integrate similarity information to get FDS and FMS according to Equations (16) and (17).

Predict from miRNA space and disease space:
Repeat:

| Performance evaluation
In this section, we employed four different evaluation metrics to comprehensively evaluate the performance of MCLPMDA. We first implemented global LOOCV and 5-fold cross validation to verify the general prediction ability of our method based on the experimentally verified miRNA-disease associations from the HMDD v2.0 database. 31 Specifically, global LOOCV selected each known miRNA-disease association in turn as a test sample, and the rest of the associations were considered as training samples. 46
Next, we adopted another evaluation metric called leave one disease out cross validation (LODOCV) to test the ability of our method to predict for diseases without any known associated miRNAs.
Specifically, for each disease, we removed all its associated miRNAs and then prioritized all the candidate miRNAs using the information of other disease-related miRNAs. As there is no prior association information for the disease investigated, LODOCV is considerably more stringent compared with the cross-validation frameworks mentioned above and can thus better evaluate the risk of overfitting.
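The LODOCV protocol can be sketched as below. The helper names, the rank-based AUC and the averaging of per-disease AUCs are assumptions made for illustration; `predict_fn` stands for any hypothetical scoring function that maps a training association matrix to a score matrix of the same shape. In a faithful run, any similarity derived from the associations would also be recomputed from the training matrix inside `predict_fn`.

```python
import numpy as np

def auc_from_scores(scores, labels):
    """Rank-based AUC (Mann-Whitney U statistic); tied scores share average ranks."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(scores)
    sorted_scores = scores[order]
    ranks = np.empty(len(scores))
    i = 0
    while i < len(scores):
        j = i
        while j + 1 < len(scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0   # average 1-based rank
        i = j + 1
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def lodocv(DM, predict_fn):
    """Hide all miRNAs of one disease at a time and score that disease's row."""
    DM = np.asarray(DM)
    aucs = []
    for i in range(DM.shape[0]):
        if DM[i].sum() in (0, DM.shape[1]):
            continue                      # AUC is undefined without both classes
        train = DM.copy()
        train[i, :] = 0                   # remove every known association of disease i
        scores = predict_fn(train)[i]
        aucs.append(auc_from_scores(scores, DM[i]))
    return float(np.mean(aucs))

# Toy usage with a hypothetical baseline that scores miRNAs by their popularity.
DM = np.array([[1, 1, 0],
               [1, 0, 0],
               [0, 1, 1]])
popularity = lambda train: np.tile(train.sum(axis=0), (train.shape[0], 1)).astype(float)
mean_auc = lodocv(DM, popularity)
```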
Finally, AUC value was used to evaluate the performance of all methods in LODOCV framework. As shown in Figure 4

| Case study
In this section, we conducted a case study on BN to further validate the effectiveness of MCLPMDA. BN is the most common malignancy in women, accounting for >40 000 deaths each year. 49 Data have shown that the number of affected people is climbing, and a forecast deemed that there will be nearly 3.
As shown in Table 2, only hsa-mir-449a was not confirmed by our
To carry out a thorough analysis of the top predicted miRNAs, we first identified the differentially expressed miRNAs using the R package edgeR. 55 Concretely, edgeR calculates the log2 fold change and the statistical significance of differential expression for each miRNA. It also provides P-values adjusted for multiple testing by controlling the false discovery rate (FDR). As a result, 29 of the 50 miRNAs were differentially expressed (adjusted P-value <0.05 and |logFC| >1, Table 2). We then tested whether these top predicted miRNAs could be used as features to classify normal samples and tumour samples. The support vector machine from the R package e1071 was adopted to perform the classification analysis.
The radial basis function was chosen as the kernel function, and the best values of the two parameters cost (C) and gamma (γ) were obtained by a grid-search approach using cross-validation. Finally, the classification accuracy was evaluated by fivefold cross-validation. We found that the top 6 miRNAs could achieve a mean classification accuracy of 0.969 (Figure 6), which demonstrates the classification power of the top prioritized miRNAs.
Next, we focused on hsa-mir-125a, the top predicted miRNA for BN. We checked whether its expression level was significantly  I, IA, IB, II, IIA, IIB, III, IIIA, IIIB, IIIC, IV and X. The one-way ANOVA test was performed with the R built-in function "aov". As a result, we obtained a P-value of 5.68e-3 (Figure 7A), indicating that its expression level was significantly altered among different stages. Besides, we performed a Kaplan-Meier survival analysis to examine its potential diagnostic power using the R package survival. Notably, different expression levels of hsa-mir-125a led to significantly different survival rates (Figure 7B).
Taken together, the analysis results verified that hsa-mir-125a could serve as a potential biomarker for BN.

| DISCUSSION
Traditional experimental methods are generally time-consuming and cannot be scaled to large datasets. Fortunately, the accumulating amount of data from multiple sources has created great opportunities to identify miRNA-disease associations

TABLE 2 Top 50 predicted miRNAs associated with Breast Neoplasms based on known associations in HMDD. The first column records the top 1-50 predicted miRNAs; the second column records the corresponding evidence in two databases; the third column records the log2 fold change; and the fourth column records the adjusted P-values of the significance of differential expression for each miRNA.

The success of our model can be mainly attributed to the following two reasons. First, matrix completion was adopted to refine the miRNA functional similarity matrix and the disease semantic similarity matrix, which greatly alleviated the influence of the inherent noise in the current datasets. Second, the label propagation process ensured that the labels of candidate miRNAs were reliably updated based on the reconstructed similarity matrices.
Nevertheless, the performance of our model can still be improved. In particular, more data sources such as miRNA target information and miRNA sequence similarities could be incorporated to further elevate the prediction accuracy. Besides, adaptive weights should be assigned instead of equal weights when combining the refined similarity matrices with existing similarity information for both miRNAs and diseases.