Improved Pre-miRNAs Identification Through Mutual Information of Pre-miRNA Sequences and Structures

Fu, Xiangzheng; Zhu, Wen; Cai, Lijun; Liao, Bo; Peng, Lihong; Chen, Yifan; Yang, Jialiang

doi:10.3389/fgene.2019.00119

ORIGINAL RESEARCH article

Front. Genet., 25 February 2019

Sec. Computational Genomics

Volume 10 - 2019 | https://doi.org/10.3389/fgene.2019.00119

This article is part of the Research Topic Machine Learning Techniques on Gene Function Prediction


Xiangzheng Fu¹

Wen Zhu²

Lijun Cai¹*

Bo Liao¹,²*

Lihong Peng³

Yifan Chen¹

Jialiang Yang²,⁴*

¹College of Information Science and Engineering, Hunan University, Changsha, China
²School of Mathematics and Statistics, Hainan Normal University, Haikou, China
³School of Computer Science, Hunan University of Technology, Zhuzhou, China
⁴Department of Genetics and Genomic Sciences, Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, United States

Playing critical roles as post-transcriptional regulators, microRNAs (miRNAs) are a family of short non-coding RNAs that are derived from longer transcripts called precursor miRNAs (pre-miRNAs). Experimental methods to identify pre-miRNAs are expensive and time-consuming, which presents the need for computational alternatives. In recent years, the accuracy of computational methods to predict pre-miRNAs has been increasing significantly. However, there are still several drawbacks. First, these methods usually only consider base frequencies or sequence information while ignoring the information between bases. Second, feature extraction methods based on secondary structures usually only consider the global characteristics while ignoring the mutual influence of the local structures. Third, methods integrating high-dimensional feature information are computationally inefficient. In this study, we have proposed a novel mutual information-based feature representation algorithm for pre-miRNA sequences and secondary structures, which is capable of capturing the interactions between sequence bases and local features of the RNA secondary structure. In addition, the feature space is smaller than that of most popular methods, which makes our method computationally more efficient than the competitors. Finally, we applied these features to train a support vector machine model to predict pre-miRNAs and compared the results with other popular predictors. As a result, our method outperforms others based on both 5-fold cross-validation and the Jackknife test.

Introduction

Derived from hairpin precursors (pre-miRNAs), mature microRNAs (miRNAs) belong to a family of non-coding RNAs (ncRNAs) that play significant roles as post-transcriptional regulators (Lei and Sun, 2014). For example, hypothalamic stem cells partially control aging rate through extracellular miRNAs (Zhang et al., 2017). MiRNAs are formed by cleavage of pre-miRNAs by enzymes. Discovery of miRNAs relies on predictive models for characteristic features from pre-miRNAs. However, the short length of miRNA genes and the lack of pronounced sequence features complicate this task (Lopes et al., 2016). In addition, miRNAs are involved in many important biological processes, including plant development, signal transduction, and protein degradation (Zhang et al., 2006; Pritchard et al., 2012). Due to their intimate relevance to miRNA biogenesis and small interfering RNA design, pre-miRNA prediction has recently become a hot topic in miRNA research. However, traditional experimental methods like ChIP-sequencing are expensive and time-consuming (Bentwich, 2005; Li et al., 2013; Liao et al., 2014; Peng et al., 2017). In the post-genome era, a large number of genome sequences have become available, which provides an opportunity for large scale pre-miRNA identification by computational techniques (Li et al., 2010).

In recent years, many computational methods have been proposed to identify pre-miRNAs, most of which are based on machine learning (ML) algorithms or statistical models. The ML-based methods usually model pre-miRNA identification as a binary classification problem to discriminate real and pseudo-pre-miRNAs. Widely used ML-based algorithms include support vector machines (SVMs) (Xue et al., 2005; Helvik et al., 2007; Huang et al., 2007; Wang Y. et al., 2011; Lei and Sun, 2014; Lopes et al., 2014; Wei et al., 2014; Liu et al., 2015b; Khan et al., 2017), back-propagation and self-organizing map (SOM) neural networks (Stegmayer et al., 2016; Zhao et al., 2017), linear genetic programming (Markus and Carsten, 2007), hidden Markov models (Agarwal et al., 2010), random forest (RF) (Jiang et al., 2007; Kandaswamy et al., 2011; Lin et al., 2011), covariant discrimination (Chou and Shen, 2007; Lopes et al., 2014), Naive Bayes (Lopes et al., 2014), and deep learning (Mathelier and Carbone, 2010). For example, Yousef et al. (2006) and Peng et al. (2018) used a Bayesian classifier for pre-miRNA recognition, which has demonstrated effectiveness in recognizing pre-miRNAs in the genomes of different species. Xue et al. (2005) proposed a triplet-SVM predictor to identify pre-miRNA hairpin structural features, whose prediction performance was improved by 10% in a later method using an RF-based MiPred classifier (Jiang et al., 2007). In addition, Stegmayer et al. (2016) proposed a deepSOM predictor to address the imbalance between positive and negative pre-miRNA samples.

It is known that the performance of ML-based methods is highly associated with the extraction of features (Liao et al., 2015b; Zhang and Wang, 2017; Ren et al., 2018). Typical feature representation methods include secondary structure and sequence information-based methods (Wei et al., 2016; Saçar Demirci and Allmer, 2017; Yousef et al., 2017). For example, Xue et al. (2005) proposed a 32-dimensional feature of triplet sequences containing secondary structure information to better express pre-miRNA sequences. Jiang et al. (2007) performed random sequence rearrangement, which is useful in obtaining the energy characteristics of pre-miRNA sequences. However, this method is quite slow. In addition, Wei et al. (2014) and Chen et al. (2016) extended the features proposed by Xue et al. (2005) into 98-dimensional pre-miRNA features, which resulted in a better pre-miRNA prediction accuracy. Most pre-miRNAs have the characteristic stem–loop hairpin structure (Xue et al., 2005); thus, the secondary structure is an important feature used in computational methods. Recently, Liu et al. proposed several methods for predicting pre-miRNAs on the basis of the secondary structure, namely, iMiRNA-PseDPC (Liu et al., 2016), iMcRNA-PseSSC (Liu et al., 2015b), miRNA-dis (Liu et al., 2015a), and deKmer (Liu et al., 2015c). Some researchers (Khan et al., 2017; Yousef et al., 2017) have increased the dimensionality of features by combining multi-source features to improve the accuracy of pre-miRNA prediction. As the feature dimension increases, considerable redundant information and noise are also incorporated, which may reduce the prediction accuracy and slow down the algorithm. Thus, it is usually necessary to perform feature selection to remove irrelevant or redundant features. An excellent feature selection method can effectively reduce the running time for training the model and improve the performance of the prediction (Wang X. et al., 2011; Wang Y. et al., 2011). To further facilitate computational processes, several bioinformatics toolkits have been developed to generate numerical sequence feature information (Liu et al., 2015d).

Developing an effective feature representation algorithm for pre-miRNA sequences is a challenging task. Existing methods have several drawbacks, so their features may not be sufficiently informative to distinguish between pre-miRNAs and non-pre-miRNAs. First, even excellent feature extraction methods usually only consider the frequency or sequence information of the bases of pre-miRNA sequences while ignoring the interaction between two bases. Second, feature extraction methods based on secondary structures usually only consider the global characteristics while ignoring the mutual influence of the local characteristics of structures. Third, methods that combine multi-source feature information and integrate feature selection algorithms to reduce dimensionality (Khan et al., 2017; Yousef et al., 2017) are inefficient in computational time.

As a useful measure to compare profile information based on their entropy, mutual information (MI) has been extensively applied in computational and bioinformatics studies. For instance, MI profiles were used as genomic signatures to reveal phylogenetic relationships between genomic sequences (Bauer et al., 2008), as a metric of phylogenetic profile similarity (Date and Marcotte, 2003), and for predicting drug-target interactions (Ding et al., 2017) and gene essentiality (Nigatu et al., 2017). Inspired by previous studies (Date and Marcotte, 2003; Bauer et al., 2008; Ding et al., 2017; Nigatu et al., 2017; Zhang and Wang, 2018), we proposed a novel MI-based feature representation algorithm for sequences and secondary structures of pre-miRNAs. Specifically, we used entropy and MI to calculate the interdependence between bases, and calculated the 3-gram MI and 2-gram MI of the sequences and secondary structures as feature vectors, respectively. Due to the nature of MI in representing profile dependency, our method is capable of catching the interactions between sequence bases and local features of the secondary structure, which is critical to pre-miRNA prediction. In addition, we combined the MI feature with the minimum free energy (MFE) feature of pre-miRNA, one of the most widely used features for RNA study and constructed a total of 55-dimensional features. Since the feature space is smaller than that of most popular methods, our method is computationally more efficient than the competitors while keeping most important information for pre-miRNA prediction. Our method was evaluated on a stringent benchmark dataset by a jackknife test and compared with a few canonical methods.

Materials and Methods

Framework of the Proposed Method

We illustrated in Figure 1 the overall framework of our method, which consists of two main steps, namely, feature extraction and pre-miRNA prediction. In the feature extraction step, the initial pre-miRNA sequences were first extracted from the raw data. Secondly, homology bias was avoided by using the CD-HIT software (Li and Godzik, 2006) (with threshold value 0.8), and the samples with similarity greater than the threshold in the initial dataset were filtered out. The remaining data was used as the benchmark dataset for this study. After that, the secondary structures of the sequences in the benchmark dataset were predicted by the software RNAfold (Hofacker, 2003). Finally, the primary sequence features based on mutual information (PSFMI), secondary structure features based on mutual information (SSFMI), and MFE features were retrieved, respectively for samples in the benchmark dataset. In the pre-miRNA prediction step, the generated features were fed into an SVM classifier to generate a training model, which was employed to predict pre-miRNAs.

FIGURE 1

Figure 1. The overall framework of the proposed method for predicting pre-miRNAs.

Datasets

Balanced Dataset

Our balanced benchmark datasets for pre-miRNA identification consist of real Homo sapiens pre-miRNAs as the positive set and one of two pseudo pre-miRNA subsets as the negative set. The resulting benchmark datasets, named S₁ and S₂, can be formulated as:

\begin{array}{l} S_{1} = S^{+} \cup S_{xue(1612)}^{-} \\ S_{2} = S^{+} \cup S_{wei(1612)}^{-} \end{array}

The positive set S⁺ contains a total of 1,612 samples, which were selected from the 1,872 reported Homo sapiens pre-miRNA entries downloaded from miRBase (20th Edition) (Kozomara and Griffiths-Jones, 2011); pre-miRNAs sharing more than 80% sequence similarity were removed using the CD-HIT software (Li and Godzik, 2006) to reduce redundancy and avoid bias. The negative set $S_{xue(1612)}^{-}$ contains 1,612 pseudo miRNAs, which were selected from the 8,494 pre-miRNA-like hairpins $S_{xue}^{-}$ (Xue et al., 2005); the negative set $S_{wei(1612)}^{-}$ contains 1,612 pseudo miRNAs, which were selected from the 14,250 pre-miRNA-like hairpins $S_{wei}^{-}$ (Wei et al., 2014).

In addition, we selected 88 new pre-miRNA sequences from a later version (e.g., miRBase 22) as positive samples, and selected 88 samples from $S_{wei}^{-}$ as negative samples to construct a benchmark dataset for independent testing, named S₃. The benchmark dataset S₃ can be formulated as:

\begin{array}{l} S_{3} = S_{miR22}^{+} \cup S_{wei(88)}^{-} \end{array}

Imbalanced Dataset

To evaluate the performance of our approach on imbalanced data, we constructed two imbalanced benchmark datasets, named S₄ and S₅. The benchmark datasets S₄ and S₅ can be formulated as:

\begin{array}{l} S_{4} = S^{+} \cup S_{wei}^{-} \\ S_{5} = S_{microPred}^{+} \cup S_{microPred}^{-} \end{array}

Specifically, S₄ consists of S⁺ (positive samples) and $S_{wei}^{-}$ (negative samples) with a ratio of ~1:8.8 (1,612:14,250). S₅ was adopted from microPred (Batuwita and Palade, 2009) and contains 691 non-redundant human pre-miRNAs from miRBase release 12 and 8,494 pseudo hairpins.

To evaluate performance on other species, we retrieved the virus pre-miRNA sequence dataset from the study of Gudyś et al. (2013). As with the other datasets, we removed pre-miRNAs sharing more than 80% sequence similarity using the CD-HIT software. As a result, we constructed a virus dataset, S₆, which contains 232 positive samples and 232 negative samples. The benchmark dataset S₆ can be formulated as:

\begin{array}{l} S_{6} = S_{virus}^{+} \cup S_{virus}^{-} \end{array}

where $S_{virus}^{+}$ denotes the positive samples and $S_{virus}^{-}$ denotes the negative samples, both obtained from the study of Gudyś et al. (2013).

Classification Algorithm and Optimization

We selected SVM to classify the samples. Specifically, the publicly available support vector machine library (LIBSVM) was applied to the benchmark data with our feature representation. The LIBSVM toolkit can be downloaded freely at http://www.csie.ntu.edu.tw/~cjlin/libsvm. We integrated this toolbox in the Matrix Laboratory (MATLAB) workspace to build the prediction system. We selected the radial basis function as the kernel function, and a grid search based on 10-fold cross-validation was used to optimize the kernel parameter γ and the penalty parameter C. The optimal parameters were found to be C = 65,536 and γ = 10⁻⁴.
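The grid search with k-fold cross-validation described above can be sketched as follows. This is a minimal illustration, not the authors' MATLAB/LIBSVM code: the function names, the fold assignment, and the `cv_score` callback (which stands in for training an RBF-kernel SVM on the training indices and returning a score on the held-out indices) are all assumptions, and the candidate grids are left to the caller because the text does not state which values were searched.

```python
import itertools
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def grid_search(n_samples, c_grid, gamma_grid, cv_score, k=10):
    """Return (mean_score, C, gamma) for the best-scoring parameter pair.

    cv_score(train_idx, test_idx, C, gamma) is a hypothetical callback that
    trains a classifier on train_idx and returns its score on test_idx.
    """
    folds = k_fold_indices(n_samples, k)
    best = None
    for C, gamma in itertools.product(c_grid, gamma_grid):
        scores = []
        for i, test_idx in enumerate(folds):
            # All folds except the i-th form the training set.
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(cv_score(train_idx, test_idx, C, gamma))
        mean_score = sum(scores) / k
        if best is None or mean_score > best[0]:
            best = (mean_score, C, gamma)
    return best
```

In practice the callback would wrap an SVM library call; an exponentially spaced grid (powers of 2 or 10) is a common choice for C and γ.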

Features Extraction

Primary Sequence Features Based on Mutual Information (PSFMI)

Recently, it has been shown that local continuous primary sequence characteristics are crucial for pre-miRNA prediction (Bonnet et al., 2004). As one of the important characteristics, n-grams are often used in feature mapping (Liu and Wong, 2003). Let S be a given pre-miRNA sequence (consisting of four characters: A, U, C, and G) with length L. An n-gram is then a contiguous subsequence of length n in S.

Figure 2 shows the calculation process for the 2-gram and 3-gram PSFMI feature representations. Any two or three consecutive bases in the pre-miRNA sequence, regardless of the order of the bases, are treated as a 2-gram or a 3-gram, respectively. For example, as shown in Figure 2, the count of the 2-gram “GA” is 3 and the count of the 2-gram “UG” is 4. Similarly, a 3-gram consists of three consecutive bases; for example, the count of the 3-gram “GGU” is 2.

FIGURE 2

Figure 2. The 2-gram and 3-gram feature representation.
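The order-independent n-gram counting described above can be sketched in a few lines of Python. The function names are illustrative, and treating an n-gram as the sorted string of its symbols is an assumption about how "regardless of the order of the bases" is implemented. Under that reading, a four-letter alphabet yields 10 distinct 2-grams and 20 distinct 3-grams.

```python
from collections import Counter
from itertools import combinations_with_replacement


def unordered_ngrams(alphabet, n):
    """List every distinct n-gram when symbol order inside the n-gram is ignored."""
    return ["".join(c) for c in combinations_with_replacement(sorted(alphabet), n)]


def count_unordered_ngrams(seq, n):
    """Slide a window of length n over seq and count windows by sorted content."""
    return Counter("".join(sorted(seq[i:i + n])) for i in range(len(seq) - n + 1))
```

For instance, the toy sequence "GAAG" has the windows "GA", "AA", and "AG"; "GA" and "AG" fall into the same unordered 2-gram, so it is counted twice.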

In this study, we used entropy and mutual information (MI) to calculate the interdependence between two bases on a given pre-miRNA sequence. Specifically, we calculated the 3-gram MI and the 2-gram MI as the feature vector for a given pre-miRNA sequence. The 3-tuple MI for 3-gram is calculated as:

\begin{array}{rcl} \mathrm{MI}(x, y, z) = \mathrm{MI}(x, y) - \mathrm{MI}(x, y \mid z) & (1) \end{array}

where x, y, and z are three adjacent bases. The mutual information MI(x, y) and the conditional mutual information MI(x, y|z) are calculated as follows:

\begin{array}{rcl} \mathrm{MI}(x, y \mid z) = H(x \mid z) - H(x \mid y, z) & (2) \end{array}

\begin{array}{rcl} \mathrm{MI}(x, y) = p(x, y) \log\left(\frac{p(x, y)}{p(x)\, p(y)}\right) & (3) \end{array}

\begin{array}{rcl} \mathrm{MI}(x, y) = \mathrm{MI}(y, x) & (4) \end{array}

Here, H(x), H(x|z), and H(x|y, z) are calculated as follows:

\begin{array}{rcl} H(x) = p(x) \log(p(x)) & (5) \end{array}

\begin{array}{rcl} H(x \mid z) = - \frac{p(x, z)}{p(z)} \log\left(\frac{p(x, z)}{p(z)}\right) & (6) \end{array}

\begin{array}{rcl} H(x \mid y, z) = - \frac{p(x, y, z)}{p(y, z)} \log\left(\frac{p(x, y, z)}{p(y, z)}\right) & (7) \end{array}

where p(x) denotes the frequency of x appearing in a pre-miRNA sequence, p(x, y) denotes the frequency of x and y appearing in 2-grams, and p(x, y, z) denotes the frequency of x, y, and z appearing in 3-grams in a pre-miRNA sequence. p(x), p(x, y), and p(x, y, z) are calculated by Equations (8)–(10):

\begin{array}{rcl} p(x) = \frac{N_{x} + \varepsilon}{L} & (8) \end{array}

\begin{array}{rcl} p(x, y) = \frac{N_{xy} + \varepsilon}{L - 1} & (9) \end{array}

\begin{array}{rcl} p(x, y, z) = \frac{N_{xyz} + \varepsilon}{L - 2} & (10) \end{array}

where N_x is the number of occurrences of base x in the pre-miRNA sequence, N_xy and N_xyz are the corresponding counts of 2-grams and 3-grams, and L is the length of the pre-miRNA sequence. In Equations (8)–(10), ε is a very small positive real number that does not affect the final score; it keeps every frequency strictly positive, which avoids zero denominators and undefined logarithms in Equations (3)–(7).

According to Equations (1)–(10), a given pre-miRNA sequence can be expressed as 30 mutual information values [20 3-tuple values MI(x, y, z) and 10 2-tuple values MI(x, y)]. In addition, we calculated the frequency of each of the four bases in the pre-miRNA sequence. Therefore, the pre-miRNA sequence can be expressed as 20 + 10 + 4 = 34 features, as determined using our proposed mutual information method.
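To make the feature construction concrete, the following Python sketch evaluates Equations (1)–(3) and (6)–(10) for every unordered 3-gram and 2-gram and appends the symbol frequencies. It is an illustration under stated assumptions rather than the authors' implementation: the value of ε, the function names, and the choice to assign the roles x, y, z by sorting the symbols of each n-gram are not specified in the text.

```python
import math
from collections import Counter
from itertools import combinations_with_replacement

EPS = 1e-6  # assumed value for the small constant ε; the text does not give one


def _counts(seq, n):
    """Count length-n windows, ignoring the order of symbols inside a window."""
    return Counter(tuple(sorted(seq[i:i + n])) for i in range(len(seq) - n + 1))


def mi_features(seq, alphabet="ACGU"):
    """Return 3-gram MI values, then 2-gram MI values, then symbol frequencies."""
    L = len(seq)  # the sequence must contain at least three symbols
    c1, c2, c3 = Counter(seq), _counts(seq, 2), _counts(seq, 3)

    def p1(x):  # Equation (8)
        return (c1[x] + EPS) / L

    def p2(x, y):  # Equation (9)
        return (c2[tuple(sorted((x, y)))] + EPS) / (L - 1)

    def p3(x, y, z):  # Equation (10)
        return (c3[tuple(sorted((x, y, z)))] + EPS) / (L - 2)

    def mi2(x, y):  # Equation (3)
        return p2(x, y) * math.log(p2(x, y) / (p1(x) * p1(y)))

    def mi3(x, y, z):  # Equations (1), (2), (6), and (7)
        r_xz = p2(x, z) / p1(z)
        r_xyz = p3(x, y, z) / p2(y, z)
        h_x_given_z = -r_xz * math.log(r_xz)
        h_x_given_yz = -r_xyz * math.log(r_xyz)
        return mi2(x, y) - (h_x_given_z - h_x_given_yz)

    symbols = sorted(alphabet)
    triples = combinations_with_replacement(symbols, 3)
    pairs = combinations_with_replacement(symbols, 2)
    return ([mi3(*t) for t in triples] + [mi2(*p) for p in pairs]
            + [c1[s] / L for s in symbols])
```

With the default alphabet the returned list has 20 + 10 + 4 = 34 entries. Passing a three-symbol alphabet such as "()." gives 10 + 6 + 3 = 19 entries, since the same equations apply to any symbol sequence.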

Secondary Structure Features Based on Mutual Information (SSFMI)

It has been shown that the structure of pre-miRNA can provide insights into biological functions. Pre-miRNA structural information can be predicted by RNAfold (Hofacker, 2003) software from sequences and is frequently used as features by machine-learning algorithms. Figure 3 shows the pre-miRNA secondary structure of miRNA hsa-mir-302f, which was obtained using the algorithm in Mathews et al. (1999).

FIGURE 3

Figure 3. The pre-miRNA secondary structure of miRNA hsa-mir-302f.

The pre-miRNA secondary structure is represented as a sequence of three symbols: a left parenthesis, a right parenthesis, and a dot. In other words, each nucleotide has one of two states, paired or unpaired, which are represented by a parenthesis “(” or “)” and a dot “.”, respectively. An open parenthesis “(” indicates a paired nucleotide located toward the 5′ end that pairs with a nucleotide toward the 3′ end, which is represented by the corresponding close parenthesis “)”. The secondary structure of the pre-miRNA sequence is thus composed of unpaired bases and the base pairs A–U and C–G. After such treatment, the secondary structure of the pre-miRNA sequence can be converted into a linear sequence.

A given pre-miRNA sequence S is converted to a pre-miRNA secondary structure sequence by using the RNAfold software. The length of the sequence is denoted by L, and the mutual information of the secondary structure sequence n-gram is calculated by Equations (1) and (3). The calculation process is similar to that for the PSFMI. Figure 2 shows the calculation process for the 2-gram and 3-gram SSFMI feature representations.

According to Equations (1)–(10), the pre-miRNA secondary structure sequence can be expressed as 16 mutual information values [10 3-tuple values MI(x, y, z) and 6 2-tuple values MI(x, y)]. Similarly, the frequencies of the three symbols that appear in the sequence of secondary structure elements were calculated. Another significant feature is the number of base pairs in the pre-miRNA sequence. Given the presence of the G–U wobble pair in the hairpin loop structure (secondary structure) of the pre-miRNA, the G–U pair is also counted in the base pairing.

Therefore, the secondary structure features can be expressed as 10 + 6 + 3 + 1 = 20 features, as determined using our proposed mutual information method.
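The three symbol frequencies and the base-pair count can be read directly from a dot-bracket string, as in the sketch below. The function name is illustrative, and returning the raw number of pairs (rather than a length-normalized value) is an assumption, since the text does not say how this feature is scaled before normalization.

```python
def structure_summary(dot_bracket):
    """Return symbol frequencies and the number of base pairs of a dot-bracket string."""
    n_open = dot_bracket.count("(")
    if n_open != dot_bracket.count(")"):
        raise ValueError("unbalanced dot-bracket string")
    length = len(dot_bracket)
    freqs = {symbol: dot_bracket.count(symbol) / length for symbol in "()."}
    # Every "(" is matched by exactly one ")", so the number of pairs equals n_open.
    return freqs, n_open
```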

In addition, studies have shown that real pre-miRNA sequences are generally more stable than randomly generated pseudo-pre-miRNAs and therefore have lower MFE. Structural energy features are therefore often used to characterize pre-miRNA sequences during feature extraction. Since RNAfold reports the MFE value of the predicted secondary structure together with the structure itself, we used this value directly as the MFE feature.

In summary, we extracted a total of 55 [34 (PSFMI) + 20 (SSFMI) + 1 (MFE)] features: the 34-dimensional feature was obtained by applying the PSFMI method to the pre-miRNA sequence, the 20-dimensional feature was obtained by applying the SSFMI method to the pre-miRNA secondary structure, and the remaining feature is the MFE value calculated by the RNAfold software. Since the values of the individual features are distributed over different ranges, we normalized each feature to [−1, 1] using the MATLAB function mapminmax (MATLAB 2014b), and obtained the final 55-dimensional feature data set for model training.
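Per-feature min-max scaling to [−1, 1] follows the linear map y = (y_max − y_min)(x − x_min)/(x_max − x_min) + y_min. The Python sketch below applies it column by column to a list of feature rows. It is a stand-in for illustration, not MATLAB's mapminmax itself; in particular, mapping a constant column to the midpoint of the target range is an assumption made here to avoid division by zero.

```python
def minmax_scale_columns(rows, lo=-1.0, hi=1.0):
    """Scale each column of a row-major feature matrix linearly to [lo, hi]."""
    scaled_cols = []
    for col in zip(*rows):
        cmin, cmax = min(col), max(col)
        if cmax == cmin:
            # Assumed handling of constant features: place them at the midpoint.
            scaled_cols.append([(lo + hi) / 2.0] * len(col))
        else:
            scaled_cols.append([(hi - lo) * (v - cmin) / (cmax - cmin) + lo for v in col])
    return [list(row) for row in zip(*scaled_cols)]
```

When a separate test set is used, the minimum and maximum would normally be taken from the training data only and reused for the test data.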

Measurements

In statistical prediction experiments, three cross-validation methods are often used to test the effectiveness of a prediction algorithm including independent dataset test, K-fold validation test and the Jackknife validation test. Among them, the Jackknife test is considered to be the most rigorous and objective method of verification. In the field of pre-miRNA prediction, the Jackknife tests are often used to verify the predictive performance of different algorithms. In the Jackknife test, each pre-miRNA sequence was individually selected as a test sample, and the remaining pre-miRNA sequences were used as training samples, and the test sample categories were predicted from the model trained by the training samples. Therefore, we adopted the Jackknife test in this study.
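The Jackknife (leave-one-out) procedure described above can be written generically, independent of the classifier. In the sketch below, `fit` and `predict` are placeholder callables; the nearest-centroid classifier on one-dimensional toy data is included only so that the example runs on its own, and is not the SVM used in this study.

```python
def jackknife_predictions(samples, labels, fit, predict):
    """Hold out each sample once, train on the rest, and predict the held-out one."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        predictions.append(predict(model, samples[i]))
    return predictions


def fit_centroids(xs, ys):
    """Toy classifier: store the mean of the 1-D samples of each class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_centroid(model, x):
    """Assign x to the class whose centroid is closest."""
    return min(model, key=lambda y: abs(model[y] - x))
```

Because one model is trained per sample, the cost grows linearly with the dataset size, which is why K-fold validation is often preferred on larger datasets.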

To comprehensively evaluate the performance of the pre-miRNA prediction method, several indicators were used in this paper. The receiver operating characteristic (ROC) curve was plotted based on specificity (SP) and sensitivity (SE). The area under the ROC curve (AUC) and the average area under the precision-recall curve (AUPR) are both used as evaluation metrics. The AUC provides a measure of classifier performance; the larger the AUC, the better the classifier. However, for class-imbalanced problems, AUPR is more suitable than AUC because it penalizes false positives more heavily. In addition, the Matthews correlation coefficient (MCC) was used to evaluate the prediction performance. The MCC accounts for true and false positives and negatives and is usually regarded as a balanced measure that can be used even if the classes are of different sizes. The sensitivity (SE), specificity (SP), precision (PR), F₁-score, accuracy (ACC), and MCC are defined as follows:

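\begin{array}{rcl} \mathrm{SE} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} & (11) \end{array}

\begin{array}{rcl} \mathrm{SP} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}} & (12) \end{array}

\begin{array}{rcl} \mathrm{PR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} & (13) \end{array}

\begin{array}{rcl} F_{1}\text{-score} = 2 \times \frac{\mathrm{SE} \times \mathrm{PR}}{\mathrm{SE} + \mathrm{PR}} & (14) \end{array}

\begin{array}{rcl} \mathrm{ACC} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{FP} + \mathrm{TN} + \mathrm{FN}} & (15) \end{array}

\begin{array}{rcl} \mathrm{MCC} = \frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TP} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}} & (16) \end{array}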

Where TP, TN, FP, and FN denote the number of true positives, true negatives, false positives and false negatives, respectively.
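Equations (11)–(16) translate directly into code. The following Python sketch computes all six measures from the four confusion-matrix counts; the function name is illustrative, and the sketch assumes that no denominator is zero (that is, both classes occur among the true labels and among the predictions).

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute SE, SP, PR, F1, ACC, and MCC as in Equations (11)-(16)."""
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    pr = tp / (tp + fp)
    f1 = 2 * se * pr / (se + pr)
    acc = (tp + tn) / (tp + fp + tn + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return {"SE": se, "SP": sp, "PR": pr, "F1": f1, "ACC": acc, "MCC": mcc}
```

As a hypothetical example, TP = 90, TN = 80, FP = 20, and FN = 10 give SE = 0.90, SP = 0.80, ACC = 0.85, and MCC ≈ 0.70.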

Results and Discussion

Performance of Different Features

According to the feature extraction algorithm proposed in this paper, the corresponding 55 features (including PSFMI, SSFMI, and MFE) were extracted for each real and pseudo pre-miRNA (positive and negative samples) in the benchmark dataset. To evaluate these features in more detail, they were subdivided into four subsets according to the different feature types, namely, the PSFMI, SSFMI, PSFMI + MFE, and SSFMI + MFE feature sets. To assess the importance of each feature subset, predictive models were constructed on the basis of the different feature subsets of the benchmark dataset. The Jackknife test was used to evaluate the performance of the predictive models.

Table 1 presents a comparison of the performances of the predictive models based on the different feature subsets and combinations thereof. As demonstrated in Table 1, the predictive model based on the feature subset SSFMI is better than that based on the feature subset PSFMI. The predictive model based on SSFMI achieves 80.21% sensitivity, 88.34% specificity, the Matthews coefficient of 0.688, and prediction accuracy of 84.27%. The predictive model based on the mutual information of pre-miRNA secondary structure is better than that based on the sequence-based mutual information. The performances of the predictive models based on the PSFMI + MFE and SSFMI + MFE feature sets are significantly improved compared with those based on the independent feature subsets (i.e., PSFMI and SSFMI feature sets). In terms of accuracy, the performance of the PSFMI + MFE model is 13.24% better than that of the PSFMI model, whereas the performance of the SSFMI + MFE based model is 1.31% better than that of the SSFMI model. The experimental results show that the combination of MFE features should be considered to increase prediction accuracy.

TABLE 1

Table 1. The performance of different features on benchmark dataset (Jackknife test evaluation).

We also compare the AUROC of four feature combinations obtained by Jackknife cross-validation on benchmark dataset S₁, shown in Figure 4. We can draw the same conclusion that the prediction model based on feature subset SSFMI is better than the prediction model based on feature subset PSFMI, and the combination of MFE features can improve the accuracy of prediction.

FIGURE 4

Figure 4. The AUROC comparison of four feature combinations through the Jackknife cross-validation.

Feature Importance Analysis

To explore the extent to which the features in the feature set affect the classification, we analyzed the importance of each feature in the feature set. To quantitatively measure the importance of each feature, we introduced the metric information gain (IG) (Deng et al., 2011; Uǧuz, 2011). IG scores are widely used in the analysis of feature importance of biological sequences (Wei et al., 2014, 2017; Chen et al., 2016). The higher the value of IG, the more important the feature is for the classifier. Table 2 presents the IG scores of 55 features. As shown in Table 2, although the 4 highest IG values all belong to PSFMIs, the 10 lowest IG values also belong to PSFMIs, indicating that the IG values of the PSFMIs are unevenly distributed and have large differences. The average IG value of PSFMI features is 0.5761, whereas the average IG value of SSFMI is 0.7489, further confirming that the secondary structure characteristics of pre-miRNA have a greater influence on the classification results than the primary sequence characteristics. The experimental findings are also consistent with the feature importance analysis.
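Information gain is the reduction in class-label entropy obtained by conditioning on a feature: IG = H(Y) − Σ_v p(v) H(Y | v). Because the features here are continuous, they must be discretized first. The Python sketch below uses equal-width binning, which is an assumption for illustration; the text does not state which discretization or software was used to obtain the reported IG scores.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, n_bins=2):
    """IG of one continuous feature after equal-width binning (assumed scheme)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    groups = {}
    for v, y in zip(values, labels):
        # The maximum value is clamped into the last bin.
        b = 0 if width == 0 else min(int((v - lo) / width), n_bins - 1)
        groups.setdefault(b, []).append(y)
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

On a balanced two-class problem, a feature that separates the classes perfectly reaches the maximum of 1 bit, while a feature unrelated to the labels scores close to 0.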

TABLE 2

Table 2. Importance of the relatively specific features in the proposed features set.

Effect of Different Kernel Functions

To justify the choice of SVM kernel function for our algorithm, we ran another set of experiments on the benchmark dataset using the Jackknife test. Four kernel functions were tested: linear, polynomial, radial basis function (RBF), and sigmoid. The results of these experiments are shown in Table 3. The SVM classifier with the RBF kernel outperformed all other classifiers in ACC, MCC, and AUC. Therefore, in this study, we chose the SVM classifier with the RBF kernel.

TABLE 3

Table 3. Comparison of performance of different kernel functions on the benchmark dataset S₁ (Jackknife test evaluation).

Performance on Balanced Dataset

We compared the ACC, SE, SP, MCC, and AUC achieved on the benchmark datasets S₁ and S₂ by our predictor with those of the following methods: iMiRNA-SSF (Chen et al., 2016), miRNAPre (Wei et al., 2014), Triplet-SVM (Xue et al., 2005), iMcRNA-PseSSC (Liu et al., 2015b), and iMiRNA-PseDPC (Liu et al., 2016). A brief introduction to these methods is given in Table 4. As can be seen from Table 4, both the iMcRNA-PseSSC (Liu et al., 2015b) and iMiRNA-PseDPC (Liu et al., 2016) methods require parameters, and the iMiRNA-PseDPC (Liu et al., 2016) method has the largest feature dimension.

TABLE 4

Table 4. A brief introduction to the state-of-the-art predictors.

The performance of the different methods on the benchmark datasets S₁ and S₂ under the Jackknife test is shown in Tables 5 and 6, respectively. For a fair comparison, the performances of these methods were taken from other studies with best-tuned parameters (Liu et al., 2015b, 2016). Table 5 shows that our method outperforms previous methods on most of the evaluation metrics used. Among the evaluated methods, our method achieves the best predictive performance on four metrics: AUC (96.54%), ACC (90.60%), MCC (0.813), and SP (92.62%). The respective ACC and MCC of our method are 1.51% and 0.051 higher than those of the previously best-performing predictor iMiRNA-SSF (Chen et al., 2016) (ACC = 88.09% and MCC = 0.762). The AUC of our method is 1.57% higher than that of the previously best-performing predictor iMiRNA-PseDPC (Liu et al., 2016) (AUC = 94.97%). In addition, we incorporated the negative samples from the study of Wei et al. (2014) to construct a second benchmark dataset, S₂, and compared the prediction performance of our method with that of 5 other popular methods using the Jackknife test (see Table 6). As can be seen, our method achieves the best predictive performance on 4 (out of 5) metrics, namely AUC (95.04%), ACC (88.00%), MCC (0.760), and specificity (88.71%), and is slightly worse than iMiRNA-PseDPC in sensitivity.

TABLE 5

Table 5. Results of the proposed method and state-of-the-art predictors on benchmark dataset S₁ (Jackknife test evaluation).

TABLE 6

Table 6. Results of the proposed method and state-of-the-art predictors on benchmark dataset S₂ (Jackknife test evaluation).

To further compare the performance of our method with other methods on independent testing, we chose the S₁ dataset as the training set and the S₃ dataset as the test set. Table 7 shows that our method achieves the highest ACC (70.45%) and MCC (0.412) in the independent test. The iMiRNA-PseDPC (Liu et al., 2016) method has an AUC value of 81.69%, which is the best AUC value among all methods. The AUC of our method (AUC = 75.54%) is comparable to that of the iMcRNA-PseSSC (Liu et al., 2015b) method (AUC = 75.81%). iMiRNA-PseDPC uses as many as 725 feature dimensions, far exceeding the 55 dimensions of our method, and the time overhead of our method is lower than that of iMiRNA-PseDPC.

TABLE 7

Table 7. Comparing the proposed method with other state-of-the-art predictors on an independent dataset S₃.

Performance on Imbalanced Dataset

We then tested our method on S₄ and S₅ together with four other state-of-the-art methods: miRNAPre (Wei et al., 2014), Triplet-SVM (Xue et al., 2005), iMcRNA-PseSSC (Liu et al., 2015b), and iMiRNA-PseDPC (Liu et al., 2016). Performance was evaluated using 5-fold cross-validation, and the results are summarized in Table 8. On the dataset S₅, our method performs best on all three evaluation metrics: AUC (0.9589), F₁ score (0.7813), and AUPR (0.8525). On the dataset S₄, our method ranks first on F₁ score (0.7084) and second on AUC and AUPR. For a better view, Figures 5 and 6 plot the ROC curves and the precision-recall curves, respectively, of our method on S₄ and S₅ for all five folds.

Table 8. Five-fold cross-validation prediction performance of the proposed method and four state-of-the-art predictors on the imbalanced benchmark datasets S₄ and S₅.

Figure 5. The ROC curves of our method on the imbalanced benchmark datasets S₄ and S₅ via 5-fold cross-validation.

Figure 6. The precision-recall curves of our method on the imbalanced benchmark datasets S₄ and S₅ via 5-fold cross-validation.
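
For readers less familiar with the three metrics used on the imbalanced datasets, the sketch below computes AUC, F₁ score, and an AUPR estimate in plain Python. It is an illustration under stated assumptions rather than the evaluation code of this study: AUPR is approximated here by average precision (one common step-wise estimator), the scores are assumed to contain no ties for that estimator, and the labels, scores, and the 0.5 decision threshold are hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def average_precision(y_true, scores):
    """Step-wise AUPR estimate: mean precision at the rank of each positive sample."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Hypothetical imbalanced toy data: 3 positives, 5 negatives.
labels = [1, 0, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.6, 0.1]
preds = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

auc = roc_auc(labels, scores)
f1 = f1_score(labels, preds)
aupr = average_precision(labels, scores)
```

F₁ score and AUPR ignore true negatives, which is why they are more informative than ACC when negative samples greatly outnumber positive ones.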

Performance on Other Species

We then compared our method with four state-of-the-art methods on the benchmark dataset S₆ using the jackknife test. Table 9 shows that our method outperforms all other methods, with an ACC of 92.59%, an MCC of 0.852, and an AUC of 98.07%. These results indicate that our method also performs well on other species.

Table 9. Comparing the proposed method and state-of-the-art predictors on the benchmark dataset S₆ (using the Jackknife test).

Case Study

Earlier releases of the miRBase database (e.g., miRBase 20) may contain false-positive pre-miRNAs that are removed in later releases (e.g., miRBase 22). Such removed entries are usually recorded in the file “miRNA.dead.” If miRBase 20 is used as the benchmark data, a good method should therefore predict these false-positive pre-miRNAs to be negative (i.e., not pre-miRNAs). This is the case for our method: Table 10 lists 8 false-positive pre-miRNAs that our method predicts to be negative, in which the columns “ID” and “Accession” give the ID and the accession number of the pre-miRNA sequences in miRBase 22, respectively.

Table 10. False-positive pre-miRNAs predicted to be negative by our method.

Running Time

In this study, we used an SVM model to predict pre-miRNAs. The time complexity of training our SVM model is $O(N_{S}^{3} + N_{S}^{2} l + N_{S} d l)$ (Burges, 1998), where $l$ is the number of training points, $N_{S}$ is the number of support vectors (SVs), and $d$ is the dimension of the input data.
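
To make the role of the feature dimension $d$ in this bound concrete, the short sketch below evaluates the three terms for the 55-dimensional features of our method and the 725-dimensional features of iMiRNA-PseDPC. The values chosen for $N_S$ and $l$ are hypothetical, constant factors are dropped, and $N_S$ is held fixed across feature sets for simplicity, although in practice it depends on the features and the kernel parameters.

```python
def svm_training_cost(n_sv, n_train, dim):
    """Operation-count estimate N_S^3 + N_S^2 * l + N_S * d * l (constant factors dropped)."""
    return n_sv ** 3 + n_sv ** 2 * n_train + n_sv * dim * n_train

# Hypothetical sizes, chosen only for illustration.
N_SV = 500      # assumed number of support vectors
N_TRAIN = 3000  # assumed number of training points

cost_55 = svm_training_cost(N_SV, N_TRAIN, 55)    # 55-dimensional features
cost_725 = svm_training_cost(N_SV, N_TRAIN, 725)  # 725-dimensional features
ratio = cost_725 / cost_55
```

Only the last term grows with $d$, so under these assumed sizes the estimate roughly doubles when moving from 55 to 725 dimensions; the first two terms depend on $N_S$ and $l$ alone.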

To further compare our method with its competitors, we measured the running time of each method on the S₆ dataset on the same platform. The experiments were carried out on a computer with an Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00 GHz, 16 GB of memory, and a Windows OS. Detailed running times are shown in Table 11. Our method achieves a better running time while also maintaining good accuracy.

Table 11. The running time (in seconds) of different methods on benchmark dataset S₆ using the Jackknife test, where C and γ represent the penalty coefficient of the SVM model and the parameter of the RBF kernel function, respectively.

Conclusions

Pre-miRNA prediction is one of the hot topics in miRNA research (Yue et al., 2014; Cheng et al., 2015; Liao et al., 2015a; Luo et al., 2017, 2018; Peng et al., 2017, 2018; Xiao et al., 2017; Fu et al., 2018). In recent years, machine learning-based pre-miRNA prediction methods have made great progress. However, most existing methods extract global features of the sequence and ignore the interactions between sequence bases, and their structural features do not capture the local characteristics of the pre-miRNA secondary structure. For this reason, this paper computes mutual information on the pre-miRNA sequence and on the secondary structure separately, to extract features that capture the interactions between sequence bases and the local characteristics of the secondary structure. The extracted features are then input to a support vector machine classifier for prediction.

The experimental results show that, compared with existing methods, the proposed method improves the sensitivity and specificity of pre-miRNA prediction. In addition, since the feature space of our method has only 55 dimensions, fewer than that of most state-of-the-art methods, our feature construction is also efficient when plugged into canonical classification methods such as SVM. In summary, our method extracts effective features of pre-miRNAs and predicts reliable candidate pre-miRNAs for further experimental validation.

Author Contributions

XF, JY, YC, BL, and LC conceived the concept of the work. XF, WZ, and LP performed the experiments. XF and JY wrote the paper.

Funding

This study is supported by the Program for National Natural Science Foundation of China (Grant Nos. 61863010, 61873076, 61370171, 61300128, 61472127, 61572178, 61672214, and 61772192), and the Natural Science Foundation of Hunan, China (Grant Nos. 2018JJ2461 and 2018JJ3570).

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Agarwal, S., Vaz, C., Bhattacharya, A., and Srinivasan, A. (2010). Prediction of novel precursor miRNAs using a context-sensitive hidden Markov model (CSHMM). BMC Bioinformatics 11(Suppl.1):S29. doi: 10.1186/1471-2105-11-S1-S29

Batuwita, R., and Palade, V. (2009). microPred: effective classification of pre-miRNAs for human miRNA gene prediction. Bioinformatics 25, 989–995. doi: 10.1093/bioinformatics/btp107

Bauer, M., Schuster, S. M., and Sayood, K. (2008). The average mutual information profile as a genomic signature. BMC Bioinformatics 9:48. doi: 10.1186/1471-2105-9-48

Bentwich, I. (2005). Prediction and validation of microRNAs and their targets. FEBS Lett. 579, 5904–5910. doi: 10.1016/j.febslet.2005.09.040

Bonnet, E., Wuyts, J., Rouzé, P., and Van de Peer, Y. (2004). Evidence that microRNA precursors, unlike other non-coding RNAs, have lower folding free energies than random sequences. Bioinformatics 20, 2911–2917. doi: 10.1093/bioinformatics/bth374

Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining Knowl. Disc. 2, 121–167. doi: 10.1023/a:1009715923555