Review of machine learning methods for RNA secondary structure prediction

Abstract

Secondary structure plays an important role in determining the function of noncoding RNAs. Hence, identifying RNA secondary structures is of great value to research. Computational prediction is a mainstream approach for predicting RNA secondary structure. Unfortunately, even though new methods have been proposed over the past 40 years, the performance of computational prediction methods has stagnated in the last decade. Recently, with the increasing availability of RNA structure data, new methods based on machine learning (ML) technologies, especially deep learning, have alleviated the issue. In this review, we provide a comprehensive overview of RNA secondary structure prediction methods based on ML technologies and a tabularized summary of the most important methods in this field. The current pending challenges in the field of RNA secondary structure prediction and future trends are also discussed.

Introduction

For a long time after its discovery, RNA was regarded solely as a message carrier between DNA and protein. However, we are now beginning to understand its important roles, as increasing numbers of noncoding RNAs (ncRNAs) are being discovered [1]. According to the latest report, less than 2% of the human genome comprises protein-coding regions [2]. The majority of the remaining genome portions encode ncRNAs [3], which are involved in translation, catalysis, RNA stability, RNA modification, gene expression regulation, protein synthesis, and protein degradation [4–9]. The enormous importance of ncRNAs in various human diseases, such as cancer, diabetes, and atherosclerosis [6,10], is also being recognized.

ncRNA molecules often fold into higher-order structures, and functionally important ncRNA structures are typically conserved during evolution. Similar to protein, the ncRNA function is usually closely related to its structure. Currently, a wide variety of ncRNA sequences are publicly available, and their numbers keep dramatically increasing [11]. By contrast, most of their structures remain unknown, which hinders the inference of their function. Hence, efficient determination of ncRNA structure is of great value to research.

Unlike the global folding of protein driven by hydrophobic forces, the RNA folding process is hierarchical [12] (Fig 1). Specifically, the RNA secondary structure, composed of base pairs, forms rapidly from linear RNA (primary structure), with a large energy loss, while the formation of a complex tertiary structure (or 3D structure) is usually much slower [13]. The RNA secondary structure is very stable and abundant in the cell, which is important for ncRNA function [14,15]. Even without the knowledge of the higher-order structure, RNA secondary structure is sufficient to infer function and for other practical applications [15].

Fig 1.

RNA primary (left), secondary (middle), and tertiary structures (right). The RNA folding process is hierarchical, i.e., the RNA secondary structure forms rapidly from linear RNA (primary structure) with a large energy loss, while the formation of a complex tertiary structure is usually much slower.

https://doi.org/10.1371/journal.pcbi.1009291.g001

Computational predictions are mainstream approaches for identifying RNA secondary structure. A number of prediction methods have been developed since the 1970s. Most of these methods attempt to identify a structure with a minimum free energy (MFE), in agreement with the hypothesis that an RNA molecule is likely to exist in an MFE state, just like protein [16]. Many prominent software applications have been developed incorporating these methods [17–19]. However, in the last 10 years, neither the accuracy nor the speed of prediction has improved significantly. An alternative approach, the machine learning (ML)-based methodology, was proposed to improve the predictions of RNA secondary structure. However, such methods did not receive much attention because of their limited accuracy, which stemmed mainly from the small size of the training datasets and the limitations of simple ML models. As a result of the recent explosion of RNA sequence data and the improvement of ML techniques, especially deep learning (DL) techniques, the latest ML-based methods surpass the current mainstream methods in terms of accuracy and applicability. We believe that these ML-based methods will inspire the next generation of prediction tools in the near future.

In this paper, we provide a comprehensive overview of ML-based methods for RNA secondary structure prediction, with a thorough discussion of their advantages and disadvantages. We also provide a tabularized summary (Table 1) of the most important models in the field, and a perspective on the future promising directions, with a special emphasis on DL models. Although several review papers have been published on the topic of RNA secondary structure prediction [20–22], reviews with an emphasis on ML techniques are lacking. We believe that this review will enable researchers to understand the key issues that remain to be solved and facilitate further advances in predicting the RNA secondary structures based on ML.

Table 1. Summary of the ML-based RNA secondary structure prediction methods.

https://doi.org/10.1371/journal.pcbi.1009291.t001

RNA secondary structure: The basics

The RNA molecule is an ordered sequence of nucleotides, each containing 1 of the 4 bases: adenine (A), cytosine (C), guanine (G), and uracil (U), arranged in the 5′ to 3′ direction. Pairing (via hydrogen bonds) of these 4 bases within an RNA molecule gives rise to the secondary structure. Typically, each base pairs with at most one other base. The canonical base pairs include the Watson–Crick base pairs (A–U and G–C) and the wobble base pair (G–U). These base pairs often result in the formation of a nested structure, in which multiple stacked base pairs form a helix, and one or more unpaired bases form a loop.

It has to be noted that 3 kinds of special base pairs [23] commonly occur in native RNA secondary structures, i.e., noncanonical base pairs, base triples, and pseudoknots. Noncanonical base pairs are base pairs other than A–U, G–C, and G–U, and they make up 40% of all base pairs in structured RNAs [24]. Base triples are clusters of 3 interacting bases [25], which can stabilize many RNA tertiary interactions [26]. Base triples also occur widely in RNA structures. A pseudoknot [27] occurs when bases in different loops pair with each other, forming a nonnested structure between 2 bases that are located apart from each other. Pseudoknots represent a small fraction of base pairs in known RNA secondary structures but often play an important role in RNA function [28].

Typically, the secondary structure of an RNA molecule with a length n can be regarded as:

  1. A set of base pairs {(i,j), 1≤i<j≤n}, where (i,j) indicates a base pair formed between the i-th and j-th nucleotide in the RNA sequence; or a set of compatible helices [28].
  2. A contact table (CT table), i.e., a square matrix, with elements in the i-th row and j-th column representing the interaction between the i-th and j-th nucleotides in the RNA sequence.
  3. A graph, where nodes represent nucleotides and edges represent base pairing relationships.
  4. A labeled sequence with the length n, e.g., “dot-parenthesis” notation, with matching parentheses for paired bases and dots for unpaired bases.
  5. A parse tree derived from context-free grammars, of which the leaf nodes comprise the RNA sequence [29].

The above definitions form the basis of both traditional and ML-based RNA secondary structure prediction methods.
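To make the relationship between these representations concrete, the following is a minimal Python sketch that converts a dot-parenthesis string into a set of base pairs and a square 0/1 matrix, and back again. The function names are illustrative assumptions, positions are 1-based as in the definition of (i,j) above, and only nested structures (no pseudoknots or base triples) are handled.

```python
def dot_bracket_to_pairs(structure):
    """Return a sorted list of 1-based (i, j) pairs, i < j, from dot-parenthesis notation."""
    stack = []   # positions of '(' that are still waiting for a partner
    pairs = []
    for pos, symbol in enumerate(structure, start=1):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % pos)
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError("unexpected symbol %r at position %d" % (symbol, pos))
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])
    return sorted(pairs)


def pairs_to_matrix(pairs, n):
    """Return an n x n symmetric 0/1 matrix with 1 where two nucleotides are paired."""
    matrix = [[0] * n for _ in range(n)]
    for i, j in pairs:
        matrix[i - 1][j - 1] = 1
        matrix[j - 1][i - 1] = 1
    return matrix


def pairs_to_dot_bracket(pairs, n):
    """Inverse of dot_bracket_to_pairs; valid only for nested structures."""
    chars = ["."] * n
    for i, j in pairs:
        chars[i - 1] = "("
        chars[j - 1] = ")"
    return "".join(chars)
```

For example, `dot_bracket_to_pairs("((..))")` yields the pairs (1,6) and (2,5). The matrix form is the one that can also hold nonnested pairs, which plain dot-parenthesis notation cannot express.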

Traditional methods of detecting or predicting RNA secondary structure

RNA structure determination is a fast-evolving topic. Many different methods have emerged in the last 20 years. They can be divided into 2 categories, i.e., wet lab experimental approaches and computational predicting approaches.

Wet lab experiments

X-ray crystallography [30] and nuclear magnetic resonance (NMR) [31] are the most accurate approaches for determining RNA structures, both of which can offer structural information at a single base pair resolution. However, both methods are characterized by high experimental cost and low throughput, limiting their wide usage. In addition, RNA molecules are highly unstable and difficult to crystallize. Although many methods have been developed to infer the state of nucleotides (paired or unpaired) in an RNA molecule using enzymatic [32,33] or chemical probes [34,35] coupled with next-generation sequencing [36,37], most of them can only be used to capture the RNA secondary structure in vitro. The obtained structure may differ markedly from the in vivo conformation. In fact, to date, the structure of only a very small percentage (<0.001%) of known ncRNAs has been determined experimentally [38]. Hence, predicting the RNA secondary structure using computational methods is an important alternative to wet lab–based approaches.

Traditional computational methods

Comparative sequence analysis [39,40] is the most accurate computational method for determining the RNA secondary structure. This method is based on the assumption that the RNA secondary structure is evolutionarily conserved to a greater extent than the RNA sequence. This method usually finds the base pairs that covary to maintain Watson–Crick and wobble base pairs (compensatory mutations) [41] of a given sequence using a set of homologous sequences. Han and Kim [42] designed the first comparative sequence analysis algorithm based on the phylogenetic comparative analysis. This algorithm predicts a common secondary structure conserved in the given homologous sequence set with a high time complexity (O(n³), n being the RNA sequence length). To reduce the running time, Tahi and colleagues [43] implemented another algorithm, DCfold, with time complexity O(n² log n). DCfold searches for helices based on their lengths and mutation rates using a “divide and conquer” approach. Comparative sequence analysis can also predict the structures with pseudoknots [44–46]; however, the accuracy is very limited. In addition, comparative sequence analysis can be combined with score-based methods [47–50], e.g., RNAalifold [48], KnetFold [49], and ILM [47]. One great limitation of this method is that it requires a large set of homologous sequences. However, only thousands of RNA families are currently known [51], which makes it impossible to obtain homologous sequences for all RNAs. Therefore, most methods for RNA secondary structure prediction are score based, where only a single RNA sequence is required as the input.

Score-based methods are the most widely used methods and have dominated the field of RNA secondary structure prediction in the last 4 decades. These methods assume that the native RNA structure is a structure with a minimum/maximum total score, depending on the hypothesis of RNA folding mechanism or its simplification. Hence, the problem of RNA secondary structure prediction is transformed into an optimization problem. Since the RNA secondary structure can be recursively broken down into smaller elements with independent score contributions, the dynamic programming (DP) algorithm is often employed to identify the optimal structure. Evaluation of the score for structure elements requires a score scheme of many parameters. Nussinov and Jacobson [52] proposed the first, and also the simplest, DP algorithm for finding the maximum-matching structure. The authors proposed to assign one point to each matched base pair and assumed that the native structure is the structure with the maximum score among all the possible conformations. Zuker and Stiegler [53] proposed a more realistic scoring scheme based on free energy, the nearest neighbor model (NN model) [54–57]. It is based on Tinoco’s hypothesis (see Section 4.1) [58]. The NN model can be used for the calculation of energy changes of any structure of a given RNA molecule, and the DP algorithm can also be employed to efficiently find the MFE structure. A number of slightly different variations of this method were also proposed [59–62]. For predicting the structure with noncanonical base pairs, some other score schemes were employed as scoring functions, such as nucleotide cyclic motifs score system [63–65] or equilibrium partition function [66]. In addition, several score-based methods were developed to predict RNA secondary structures with pseudoknots [67–71], where the structure search scope or input RNA length is limited or the types of pseudoknots are restricted to lower the time complexity in general.
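The maximum-matching idea can be illustrated with a short Python sketch of a Nussinov-style DP. It is a simplified illustration rather than the original authors' implementation: the restriction to canonical pairs, the `min_loop` parameter (minimum number of unpaired bases enclosed by a pair), and the function name are assumptions made here. Each accepted pair scores one point, and the three nested loops give the O(n³) running time.

```python
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq, min_loop=3):
    """Return (max number of nested pairs, sorted 1-based pair list) for seq."""
    n = len(seq)
    if n == 0:
        return 0, []
    # best[i][j] = maximum number of pairs within seq[i..j] (0-based, inclusive)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]                  # case 1: j stays unpaired
            for k in range(i, j - min_loop):        # case 2: j pairs with k
                if (seq[k], seq[j]) in CANONICAL:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score

    # Traceback: recover one optimal structure from the filled table.
    pairs = []
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in CANONICAL:
                left = best[i][k - 1] if k > i else 0
                if best[i][j] == left + 1 + best[k + 1][j - 1]:
                    pairs.append((k + 1, j + 1))    # report 1-based positions
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return best[0][n - 1], sorted(pairs)
```

Replacing the constant score of 1 per pair with energy contributions of structural elements, and maximization with minimization, leads toward the free energy–based formulation, although real MFE algorithms use a more detailed decomposition than this sketch.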

However, the folding mechanism hypotheses of score-based methods do not always hold, e.g., the RNA molecule often folds into locally stable structural domains. Furthermore, almost all score-based methods use virtually the same DP algorithm to find the best-scoring structures. The running time of this DP algorithm is usually O(n³) (where n is the RNA sequence length), even when the special base pairs and weak interactions are neglected. Hence, the computational cost is not acceptable, especially when analyzing an RNA molecule longer than 1,000 nucleotides. Moreover, predicting the special base pairs in RNA structures is still a difficult task. Since an RNA structure with special base pairs is not a nested structure in general, score-based methods have to employ sophisticated algorithms to capture these special base pairs at the cost of higher time complexity. Even so, the performance of these methods needs to be further improved.

In fact, it is extremely difficult to fully understand the RNA folding mechanism. ML methods, in contrast, are data driven and require no knowledge of such mechanism. These methods can learn the underlying folding patterns from large amounts of training data. In the last few decades, ML methods have been used for many aspects of RNA secondary structure prediction methods to improve the prediction performance (see Section 4). However, they did not replace the mainstream score-based methods with respect to accuracy and generalization. This situation has been changing in the last 2 years because of the development of ML techniques, especially DL.

ML-based methods

The ML-based methods for RNA secondary structure prediction can generally be divided into 3 categories (S1 Fig) according to the subprocess that ML participates in, i.e., score scheme based on ML, preprocessing and postprocessing based on ML, and prediction process based on ML. All the ML-based methods in these 3 categories trained their models in a supervised way [72]. These models learn functions that map inputs (features) to outputs by adjusting model parameters based on the known input–output pairs. Many of them employ free energy parameters, encoded RNA sequences, sequence patterns, or evolutionary information as key features, and their outputs can be classification labels (such as paired or unpaired) or continuous values (such as free energy). When a new input is fed to the trained model, the model can classify a corresponding label or predict a corresponding value [72].

Score scheme based on ML

Early ML-based methods usually train an ML model that can generate a new score scheme (Fig 2) and replace the score scheme used in the traditional methods. According to the meaning of the score, ML-based score schemes can be further divided into 3 categories (S1 Fig), i.e., the free energy parameter-refining approach, weighted approach, and probabilistic approach. Although ML-based methods are used for parameter estimation in the score schemes to improve the prediction accuracy, the structure prediction is still formulated as an optimization problem, where the estimated parameters are used for the evaluation of the scores of possible conformations.

Fig 2. Framework for RNA secondary structure prediction methods with ML-based score schemes. Wet lab data, RNA sequence data, or RNA structure data can be employed to train an ML model to obtain a score scheme.

Using this score scheme, an RNA secondary structure can be predicted using a traditional score-based approach from a single RNA sequence.

https://doi.org/10.1371/journal.pcbi.1009291.g002

Free energy parameter refining based on ML.

Considering the score schemes, the free energy–focused approach is the most popular approach. Ever since Tinoco and colleagues [58] put forward their hypothesis for free energy calculation (that the free energy of a secondary structure is the sum of the free energy values of its elements), many studies have been devoting their efforts to the problem of assigning free energy values to the elements of RNA molecules. Turner’s NN model [57] is the most popular approach and provides a considerably accurate approximation of the RNA free energy. However, the multiple thermodynamic parameters of the NN model have to be based on a large number of optimal melting experiments. These experiments are time and labor consuming [17,19], however, and not all free energy changes in structural elements can be measured because of the associated technical difficulties.
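The additivity hypothesis described above can be written compactly as follows. The notation is introduced here for illustration and is not taken from the cited works: S denotes a secondary structure of a sequence x, E(S) the set of its structural elements (e.g., stacked pairs and loops), and Ω(x) the set of admissible structures of x.

```latex
\Delta G(S) = \sum_{e \in E(S)} \Delta G(e),
\qquad
S_{\mathrm{MFE}} = \operatorname*{arg\,min}_{S \in \Omega(x)} \Delta G(S)
```

Because each term depends only on a single element, the minimization decomposes over substructures, which is what makes the DP search for the MFE structure possible.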

Some ML techniques were adopted to refine the parameters in the energy model. These techniques can employ subtle models to estimate the scores for a richer and more accurate feature representation using known thermodynamic data or RNA secondary structure data. Xia and colleagues [56] first trained a linear regression model using known thermodynamic data to infer some of the thermodynamic parameters and expanded the NN model into a more accurate model, i.e., the INN-HB model. This model provides an improved fit for the known experimental data. A disadvantage of this approach, however, is that the parameters for some structural elements are fixed before other parameters are calculated, which limits the range of possibilities considered for the overall parameter set. To overcome this problem, Andronescu and colleagues [73] proposed a constraint generation approach to estimate free energy parameters. This method uses different types of constraints to ensure that the energies of reference structures are low relative to the alternatives for the same sequence. Trained on large sets of structural and thermodynamic data, this method achieves 7% higher F-measure than the standard Turner parameters. Subsequently, the authors further modified the method and proposed a loss-augmented max-margin constraint generation model and Boltzmann-likelihood model using a larger dataset [74]. The constraints imposed on parameters ensure that the more inaccurate the structure, the greater the margin between its free energy and that of the reference structure in the training set.
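Since accuracy gains such as the one above are reported as F-measure, a minimal sketch of how this metric is commonly computed over base pairs may be useful. It assumes exact matching of predicted and reference pairs (some benchmarks may tolerate small positional shifts, which is not modeled here), and the function name is hypothetical.

```python
def pair_f_measure(predicted, reference):
    """Return (precision, sensitivity, F-measure) for two collections of base pairs."""
    # Normalize each pair to (smaller, larger) so (9, 1) and (1, 9) are the same pair.
    pred = {tuple(sorted(p)) for p in predicted}
    ref = {tuple(sorted(p)) for p in reference}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    sensitivity = true_positives / len(ref) if ref else 0.0
    if precision + sensitivity == 0:
        return precision, sensitivity, 0.0
    # F-measure is the harmonic mean of precision and sensitivity.
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return precision, sensitivity, f_measure
```

With 2 of 3 predicted pairs correct against a 3-pair reference, precision, sensitivity, and F-measure all equal 2/3.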

Of note, the parameters determined by the above free energy parameter-refining approaches are thermodynamic and can be used directly in the algorithms embedded by the same energy model, such as miRNA target prediction [75] and RNA folding kinetics simulation [76].

Weighted approaches based on ML.

While ML-based free energy parameter approaches successfully improved the accuracy of the RNA secondary structure prediction, the energy model is still far from ideal. Actually, the above methods for the estimation of ML-based parameters can only substitute for some wet lab experiments geared toward obtaining the energy parameters. However, it is entirely possible to obtain an improved score scheme independent of free energy based on ML techniques. Several weighted approaches were proposed that consider the parameters of RNA structure elements as weights instead of free energy changes. By removing the thermodynamic meaning, the weighted approach can utilize ML models to determine thousands of weights for more comprehensive RNA structure elements instead of obtaining a few energy parameters from a large number of wet lab experiments.

By combining a discriminative structured-prediction learning framework with an online learning algorithm, Zakov and colleagues [77] greatly increased the number of weights to approximately 70,000 by examining more types of structural elements with more numerous sequential contexts, using thousands of training datasets. Based on these weights, the ContextFold tool was proposed, marking a significant improvement in the prediction accuracy [77]. Akiyama and colleagues [78] integrated the thermodynamic approach with a structured support vector machine (SSVM) to obtain a large number of weights for detailed structure elements, using L1 regularization to relieve overfitting. Then, MXfold was built by combining ML-based weights with experimentally determined thermodynamic parameters, achieving better performance than a model based on thermodynamic parameters or ML-based weights alone. Most recently, MXfold2 [79] was proposed by Sato and colleagues. They trained a fairly deep neural network using the max-margin framework with thermodynamic regularization, which keeps the folding scores predicted by MXfold2 as close as possible to the free energy calculated from the thermodynamic parameters. This method showed robust prediction in both sequence-wise and family-wise cross-validation. These studies suggest that ML-based weights can fill the gaps in the thermodynamic parameter approach.

An advantage of the weighted approach is that it decouples structure prediction from energy estimation, which is potentially beneficial for both tasks. However, learned weights are not explainable, partly because of the black-box attribute of ML algorithms. Hence, the obtained scores cannot be used to compute the partition function, base pair binding probabilities, or centroid structures, etc.
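The core idea of the weighted approach, scoring a structure as a weighted sum of feature counts and learning the weights from examples, can be sketched in a few lines of Python. This is a deliberately simplified illustration: the two feature types are hypothetical, and the update shown is a plain structured-perceptron step, which is swapped in here for simplicity and is not the online or max-margin learner used by the tools cited above.

```python
from collections import Counter


def extract_features(seq, pairs):
    """Count simple, hypothetical features of a structure given as 1-based pairs."""
    pair_set = set(pairs)
    feats = Counter()
    for i, j in pairs:
        # Feature 1: identity of the two paired bases.
        feats[("pair", seq[i - 1] + seq[j - 1])] += 1
        # Feature 2: a pair stacked directly on the next inner pair.
        if (i + 1, j - 1) in pair_set:
            feats[("stack", seq[i - 1] + seq[j - 1], seq[i] + seq[j - 2])] += 1
    return feats


def score(weights, feats):
    """Score of a structure: weighted sum of its feature counts."""
    return sum(weights.get(name, 0.0) * count for name, count in feats.items())


def perceptron_update(weights, gold_feats, predicted_feats, rate=1.0):
    """Move weights toward the reference structure and away from a wrong prediction."""
    for name in set(gold_feats) | set(predicted_feats):
        delta = gold_feats.get(name, 0) - predicted_feats.get(name, 0)
        if delta:
            weights[name] = weights.get(name, 0.0) + rate * delta
    return weights
```

After one update on a reference structure and a competing wrong structure, the reference scores higher under the learned weights. The weights carry no thermodynamic meaning, which is the trade-off discussed above.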

Probabilistic approaches based on ML.

Stochastic context-free grammars (SCFGs) are an important alternative for predicting RNA structures [29,80–84]. SCFGs specify formal grammar rules and induce a joint probability distribution over possible RNA structures for a given sequence. In particular, an SCFG model specifies a probability parameter for each production rule in the grammar and thus assigns a probability to each sequence it derives. The probability parameters are learned from datasets of RNA sequences annotated using known secondary structures, without the need for external laboratory experiments [83].
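As a rough illustration of how rule probabilities relate to annotated data, the sketch below estimates them by relative frequency from rule usage counts and then scores a derivation as the product of its rule probabilities (computed in log space). The rule names and counts in the usage note are hypothetical placeholders, and this simple counting estimate applies only when the derivation of each training structure is known; otherwise an iterative procedure such as EM is needed.

```python
import math
from collections import defaultdict


def estimate_rule_probabilities(rule_counts):
    """Normalize counts of (lhs, rhs) rules so probabilities sum to 1 per left-hand side."""
    totals = defaultdict(float)
    for (lhs, _rhs), count in rule_counts.items():
        totals[lhs] += count
    return {rule: count / totals[rule[0]] for rule, count in rule_counts.items()}


def derivation_log_probability(derivation, probabilities):
    """Log-probability of a derivation given as a list of applied (lhs, rhs) rules."""
    # Rules with zero estimated probability would need smoothing before taking logs.
    return sum(math.log(probabilities[rule]) for rule in derivation)
```

For instance, with toy counts of 3 and 1 for two rules sharing the left-hand side S, the estimated probabilities are 0.75 and 0.25.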

Sakakibara and colleagues [29] first applied SCFGs to tRNA secondary structure prediction. The probability parameters in their SCFG model were learned using an expectation–maximization (EM) method. Knudsen and Hein [82] improved the SCFG model by combining the evolutionary information, and, subsequently, the robust and practical tool Pfold [81] was developed. Sato and colleagues [85] proposed a nonparametric Bayesian extension of SCFGs with the hierarchical Dirichlet process to find an optimal RNA grammar from the training dataset. Using another ML model, the conditional log-linear model (CLLM), Do and colleagues [86] identified probability parameters that are most likely to discriminate correct structures from incorrect ones. CLLM is a flexible probabilistic ML model that generalizes upon SCFGs; the parameters are easily estimated, and arbitrary features can be incorporated in the model. The resulting tool, CONTRAfold, has achieved the highest single-sequence prediction accuracy to date among the currently available probabilistic models. However, CLLM is very slow, which prevents its application to large training sets, and the estimated parameters have no intrinsic biological meaning. Finally, to take full advantage of the substantial numbers of RNA sequences with unknown structures, Yonemoto and colleagues [87] proposed a semi-supervised learning algorithm to obtain probability parameters in a probabilistic model that combines SCFG and a conditional random field.

However, the probabilistic approach cannot replace MFE methods for secondary structure prediction, as the accuracy of the currently best SCFG has yet to match that of the best free energy–based models. In addition, SCFG cannot describe all RNA structures, e.g., a structure containing special base pairs.

Preprocessing and postprocessing based on ML

ML can also be used in preprocessing, for selecting an appropriate prediction method or a group of appropriate parameters (Figs 3 and S1). A tool based on a support vector machine (SVM) was proposed by Hor and colleagues [88] for selecting the prediction method, based on the notion that different RNA sequences have different features and different prediction methods work best with specific RNA species. In another study, Zhu and colleagues [89] assumed that different RNA sequences follow different folding rules. The authors consequently proposed an SCFG model to identify the most probable folding rules before RNA secondary structure prediction.

Fig 3. Framework for RNA secondary structure prediction methods with ML-based preprocessing or postprocessing.

In RNA secondary structure prediction, ML models (trained by sequence data, in green) can be also used in pretreatment for selecting an appropriate prediction method or a group of appropriate parameters; ML models (trained by structure data, in brown) also can provide a means of determining the most likely structures among the outcomes.

https://doi.org/10.1371/journal.pcbi.1009291.g003

Since different prediction methods return several different structures, the ML model can provide a means of determining the most likely structures among the outcomes (Figs 3 and S1). Drawing on graph theory, Haynes and colleagues [90] used trees to represent RNA graphical structures (edges as helices, and vertices as loops or bulges). They then trained a multilayer perceptron (MLP) model to distinguish whether a structure is RNA-like or not, using graphical invariants as input features. Assuming that a larger secondary structure is formed upon bonding of 2 smaller secondary RNA structures, Koessler and colleagues [91] also used an MLP model to predict the RNA-like probability of a structure using a special feature vector extracted from the merged trees.

Predicting process based on ML

ML techniques are also directly used to predict RNA secondary structure in an end-to-end fashion or combined with other algorithms as constraints, base state detector, or structure selector. The general framework is shown in Figs 4 and S1.

Fig 4. Framework for the RNA secondary structure prediction methods with ML-based prediction process.

ML models (trained by wet lab, RNA sequence, or RNA structure data) are directly used to predict RNA secondary structures in an end-to-end way or followed by a filter or optimizer to obtain the optimal RNA secondary structure.

https://doi.org/10.1371/journal.pcbi.1009291.g004

End-to-end approach.

To the best of our knowledge, the ML technique was first introduced into the RNA secondary structure predicting process by Takefuji and colleagues [92]. The authors built on Nussinov and Jacobson’s hypothesis (see Section 3.2) [52] and attempted to obtain a near-maximum independent set (MIS) from an adjacent graph (where the vertices represent base pairs, and the edges connect the incompatible vertices) using a system composed of m interactional neurons (m is the number of edges). Liu and colleagues [93] enhanced Takefuji’s work by considering the energy contribution of possible base pairs, and a Hopfield neural network (HNN) was employed to obtain MIS. However, HNN is easily trapped in local minima, limiting the accuracy of this method. To avoid this problem, Steeg and Evan [94] made use of the mean field theory (MFT) networks to identify the optimal structure, which was coupled with a sophisticated objective function with additional biological constraints. The inputs into the MFT networks are the 4 types of bases in an RNA sequence encoded in a one-hot fashion, and the output is in a format similar to a CT table. Subsequently, Apolloni and colleagues [95] further developed Steeg’s method, especially with respect to the computation speed, so that it could be applied to slightly longer RNA sequences. In addition, this model uses mean field approximation to update the node in both the learning phase and the instant resolution phase. In another study, Qasim and colleagues [96] modified Takefuji’s work by building a novel MLP model to obtain MIS. This model contains h neurons in the hidden layer, whose activation function is based on Kolmogorov’s theorem (h is the number of possible base pairs in an RNA sequence).
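The independent-set formulation can be made concrete with a small Python sketch that builds such a graph: every candidate base pair is a vertex, and an edge joins two pairs that cannot coexist. The definition of incompatibility used here (two pairs share a nucleotide or cross each other), the canonical-pair restriction, and the `min_loop` parameter are assumptions for illustration and may differ from the encodings used in the cited works.

```python
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def candidate_pairs(seq, min_loop=3):
    """All 1-based (i, j) with i < j, a canonical pair, and more than min_loop bases apart."""
    n = len(seq)
    return [(i, j)
            for i in range(1, n + 1)
            for j in range(i + min_loop + 1, n + 1)
            if (seq[i - 1], seq[j - 1]) in CANONICAL]


def incompatible(p, q):
    """True if two pairs (each given as i < j) share a nucleotide or cross."""
    (i, j), (k, l) = sorted((p, q))      # order so that i <= k
    if len({i, j, k, l}) < 4:
        return True                      # a base would pair twice
    return i < k < j < l                 # crossing (nonnested) arrangement


def conflict_graph(pairs):
    """Edges between every two incompatible candidate pairs."""
    return [(a, b)
            for idx, a in enumerate(pairs)
            for b in pairs[idx + 1:]
            if incompatible(a, b)]


def is_independent_set(structure):
    """A valid nested structure has no edge between any two of its pairs."""
    return not conflict_graph(list(structure))
```

Under these assumptions, a nested secondary structure corresponds to an independent set in the graph, and the maximum-matching structure corresponds to a maximum independent set.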

However, because of the relatively poor performance of the above ML models and a small amount of the available data, ML-based RNA secondary structure prediction models can only process tRNAs, with relatively low accuracy. Currently, the use of DL techniques is rising rapidly, and they are dramatically changing these circumstances. Singh and colleagues [97] proposed the first end-to-end DL model, SPOT-RNA, to predict RNA secondary structure. SPOT-RNA treats the RNA secondary structure as a CT table and employs an ensemble of ultradeep hybrid networks of ResNets and 2D-BLSTMs for the prediction. Of these, the former captures the contextual information from the whole sequence, and the latter is effective for the propagation of long-range sequence dependencies in RNA structure. Transfer learning is used to train SPOT-RNA to effectively utilize limited sample numbers. SPOT-RNA showed superior performance with several RNA benchmark datasets, greatly outperforming the best score-based methods and SCFG-based methods. Recently, the SPOT-RNA2 model [98] was proposed by the same research group. This model employed evolution-derived sequence profiles and mutational coupling as inputs and outperformed SPOT-RNA for all types of base pairs using the same transfer learning approach. E2Efold is another DL model for RNA secondary structure prediction, proposed by Chen and colleagues [99]. It integrates 2 coupled parts, i.e., a transformer-based deep model that encodes sequence information, and a multilayer network based on an unrolled algorithm that gradually enforces the constraints and restricts the output space.

In addition to the encoded RNA sequences being used as the input, other information can also be incorporated into the DL model. Calonaci and colleagues [100] trained an ensemble model based on a combination of SHAPE data, co-evolutionary data (DCA), and RNA sequence data. Their model consists of a convolutional neural network (CNN) subnetwork and an MLP subnetwork to predict penalties based on SHAPE and DCA data, respectively, with an RNAfold [17] module to generate structures using RNA sequences and penalties.

Hybrid approach.

Alternatively, ML can be combined with other methods for a hybrid approach for RNA secondary structure prediction. Consequently, the ML model is usually considered as a scoring machine, mapping a score to each (pair of) base(s) in an RNA sequence, whose output is then passed to an independent filter to identify a reasonable structure.

Bindewald and Shapiro [49] combined an ML model and a filter to predict the consensus structure for a group of aligned RNAs. The authors chose a hierarchical network of k-nearest neighbor models to predict the possibility score for each pair of alignment columns and defined the filter by a set of rules derived from native RNA structures. Considering structure prediction as a sequence-labeling question, Lu and colleagues [101] and Wu and colleagues [102] employed a more powerful DL model, Bi-LSTM, to predict the state of each base in an RNA sequence, using a similar rule-based filter to deal with conflicting pairing. Unlike in the above studies, Bi-LSTM was used as a structure filter in DpacoRNA [103], and a parallel ant colony optimization method was used to predict the most probable structures. Another type of ML-based hybrid approach combines ML models and optimization methods. Liu’s group [104] used a CNN model to predict the status distribution of each base in an RNA sequence, and a DP algorithm was employed to find the maximum probability structure. The same group [105] also used the Bi-LSTM model instead and another optimization algorithm, similar to that used in [106]. Instead of developing a new optimizer, Willmott and colleagues [107] utilized an existing SHAPE-directed method (SDM) [108] as the optimizer, which can predict optimal structure from SHAPE data, and trained a Bi-LSTM model to generate SHAPE-like data (i.e., determine the state of each nucleotide) of an RNA sequence as the inputs of SDM.
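The division of labor in a hybrid method, with a model producing scores and an independent filter turning them into a valid structure, can be sketched as follows. The filter below is a hypothetical greedy one written for illustration; its rules (score threshold, canonical pairs only, minimum loop length, one partner per base, no crossing pairs) are assumptions and do not reproduce the rule sets of any of the cited tools. The scores are assumed to come from some upstream ML model.

```python
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def greedy_filter(seq, pair_scores, threshold=0.5, min_loop=3):
    """Select a nested set of 1-based base pairs from {(i, j): score} predictions."""
    chosen = []
    used = set()
    # Visit candidate pairs from the highest score down (ties broken by position).
    for (i, j), s in sorted(pair_scores.items(), key=lambda kv: (-kv[1], kv[0])):
        if s < threshold:
            break                                   # all remaining scores are lower
        if i > j:
            i, j = j, i
        if j - i <= min_loop:
            continue                                # hairpin loop would be too short
        if (seq[i - 1], seq[j - 1]) not in CANONICAL:
            continue                                # not a canonical base pair
        if i in used or j in used:
            continue                                # base already has a partner
        if any(a < i < b < j or i < a < j < b for a, b in chosen):
            continue                                # would cross an accepted pair
        chosen.append((i, j))
        used.update((i, j))
    return sorted(chosen)
```

Because the filter is applied after training, the model is never penalized for the errors the filter introduces or corrects, which is one way to picture the mismatch between the training objective and the overall system objective.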

Compared with the end-to-end approach, the hybrid approach performs relatively poorly, perhaps because the training objective of the ML part is not aligned with the objective of the overall system. Most hybrid methods are trained and tested on small-scale datasets. Hence, their ability to generalize requires further verification.

Discussions

It is well known that transcript abundance helps to identify transcripts of interest under different conditions, while the RNA structure helps to explain how these transcripts function. An excellent RNA structure prediction method is not only important for inferring RNA function, but also relates to many downstream studies, including ncRNA detection [109–111], folding dynamics simulations [112], hybridization stability assessment [113], and oligonucleotide [114,115] or drug design [116–120]. It is worth noting that RNA secondary structure prediction is also a useful tool for studying viruses, such as the SARS-CoV-2 virus responsible for the current pandemic [121,122].

The advantages of ML-based methods

Compared with comparative sequence analysis and traditional score-based methods, ML-based methods have some advantages. First, ML-based methods do not necessarily rely on the biological mechanism, which is usually difficult to thoroughly understand. Instead, they can utilize the information contained in various types of data, and, therefore, the performance limitations caused by the mechanism hypothesis can be circumvented. ML-based methods can also be easily coupled with known biological mechanisms. Further, in terms of prediction performance, where a large amount of data is available, models with little or no knowledge of biological mechanisms usually perform better than mechanism-dependent ones. This also suggests that the assumed mechanism of RNA folding may be incomplete or inaccurate. Second, in contrast to traditional score-based methods, the end-to-end DL methods do not need to consider the difficulties caused by base matching rules. Traditional score-based methods employ sophisticated algorithms to satisfy base matching rules at the cost of high time complexity. However, without the constraint of these rules, end-to-end models [97] can be trained on and predict all the base pairs in RNA structures, regardless of whether the base pairs are associated with secondary or tertiary interactions. Third, compared with traditional methods, the ML-based methods are considerably flexible. The inputs of ML-based models can be either one-dimensional or multidimensional, extracted features or encoded bases, and homogeneous data or heterogeneous data, and the outputs can be CT tables, labeled sequences, nucleotide states, or free energy values. In addition, the construction of the ML models is diverse, from simple Hopfield networks to complex ensemble deep neural networks. Fourth, once the model training is completed, the ML-based end-to-end prediction methods run very fast. Unlike DP algorithms, the time complexity of ML models is independent of the input scale, which is advantageous when dealing with long RNAs.
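To make the flexibility of inputs and outputs concrete, the following sketch shows one possible input encoding (one-hot encoded bases) and two possible output representations (a labeled sequence and simplified CT-style rows) for the same structure. The function names, the label alphabet, and the three-column row layout are assumptions chosen for illustration and do not correspond to the exact formats of any cited method.

```python
from typing import Dict, List, Tuple

BASES = "ACGU"


def one_hot(seq: str) -> List[List[int]]:
    """Encode each base as a 4-element indicator vector (unknown bases become all zeros)."""
    return [[int(base == b) for b in BASES] for base in seq.upper()]


def to_labels(n: int, pairs: List[Tuple[int, int]]) -> str:
    """Label each position as '(' (pairs downstream), ')' (pairs upstream), or '.' (unpaired)."""
    labels = ["."] * n
    for i, j in pairs:
        lo, hi = min(i, j), max(i, j)
        labels[lo], labels[hi] = "(", ")"
    return "".join(labels)


def to_ct_rows(seq: str, pairs: List[Tuple[int, int]]) -> List[Tuple[int, str, int]]:
    """Return simplified CT-style rows: (1-based index, base, 1-based partner or 0 if unpaired)."""
    partner: Dict[int, int] = {}
    for i, j in pairs:
        partner[i] = j
        partner[j] = i
    return [(idx + 1, base, partner.get(idx, -1) + 1) for idx, base in enumerate(seq)]


example_seq = "GGAAAACC"
example_structure = [(0, 7), (1, 6)]  # 0-based pairs, made up for illustration
encoded = one_hot(example_seq)
labels = to_labels(len(example_seq), example_structure)
ct_rows = to_ct_rows(example_seq, example_structure)
```

A labeled sequence with only three symbols cannot represent crossing pairs unambiguously, which is one reason why some models output a full pairing matrix instead.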

Datasets and their impacts on ML-based methods

Today, many public RNA structure databases and other related datasets are available online, which provide abundant data for model training. Generally, these databases can be classified into 2 types, i.e., comprehensive databases and dedicated databases. A comprehensive database often consists of RNA structures with different conformations and from different RNA species, for example, RNA Strand (4,666 RNAs available) [123], RCSB Protein Data Bank (PDB, 4,962 RNAs available) [124], and bpRNA-1m (102,348 RNAs available) [125]. Some of these databases (e.g., PDB) collect tertiary structures obtained by wet lab experiments, while others (e.g., bpRNA-1m) contain structures obtained by comparative sequence analysis, which are less accurate than those obtained by wet lab experiments. Dedicated databases generally involve only a single RNA species (tRNA [126], rRNA [127], or tmRNA databases [128]) or a single type of RNA structure (such as loop [129], pseudoknot [130], or noncanonical base pair [131]). Based on these dedicated databases, some public benchmark datasets were established, such as ArchiveII [132] and RNAStralign [133]. These datasets are generally composed of tens of thousands of RNAs in different RNA species (rRNA, tRNA, SRP, tmRNA, etc.). In addition, other databases used in ML-based methods are Rfam [51] and NNDB [57], which provide RNA family information and thermodynamic parameters, respectively.

Data are extremely important for building ML-based RNA secondary structure prediction models, especially DL-based models with a large number of parameters. One of the reasons that the recent DL-based methods [79,97,99] outperform the traditional ML-based models is the improvement in the quality and quantity of the training sets. It is worth noting that the performance of DL-based methods may be overestimated because of the similarity between the training and test sets. Most studies only ensured that the RNAs in the test sets were not too similar to those in the training sets (typically using 80% sequence similarity [134] as a cutoff), but RNAs from the same families were not explicitly excluded from the test sets. Because sequences and structures within the same RNA family are similar, the performance measured on such test sets is better than would be obtained on truly unseen RNAs [79,97].
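One way to avoid this kind of leakage is to split the data by family rather than by individual RNA. The sketch below assumes that each record carries a family annotation (for example, an Rfam family identifier); the function name, the record layout, and the default test fraction are hypothetical and do not reproduce the protocol of any cited study.

```python
import random
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def split_by_family(
    records: List[Tuple[str, str]], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[str], List[str], Set[str]]:
    """Split (rna_id, family) records so that no family appears in both sets.

    Returns (train_ids, test_ids, test_families).
    """
    by_family: Dict[str, List[str]] = defaultdict(list)
    for rna_id, family in records:
        by_family[family].append(rna_id)

    # Shuffle whole families, not individual RNAs.
    families = sorted(by_family)
    random.Random(seed).shuffle(families)

    # Hold out at least one family for testing.
    n_test = max(1, round(len(families) * test_fraction))
    test_families = set(families[:n_test])

    train_ids = [r for f in families[n_test:] for r in by_family[f]]
    test_ids = [r for f in families[:n_test] for r in by_family[f]]
    return train_ids, test_ids, test_families


# Made-up records for illustration.
example_records = [
    ("r1", "tRNA"), ("r2", "tRNA"), ("r3", "5S_rRNA"),
    ("r4", "SRP"), ("r5", "tmRNA"), ("r6", "16S_rRNA"),
]
train_ids, test_ids, held_out = split_by_family(example_records, test_fraction=0.4)
```

A family-wise split can be combined with a sequence similarity cutoff; the two filters address different sources of overlap between training and test data.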

Another issue that may affect the model performance is the imbalance of RNA families in training sets, e.g., thousands of 16S rRNAs but only a small number of telomerases occur in one dataset. When the length of the input RNA is comparable, trained models tend to perform better on the RNA species that are more prevalent in the training set [99]. How to deal with imbalanced data is an active topic in the ML community. The authors of [99] adopted an up-sampling strategy to balance the RNAs in different families, which further improved their model performance.
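A simple form of up-sampling is sketched below: members of each under-represented family are resampled with replacement until every family has as many training examples as the largest one. This is a generic illustration under that assumption, with hypothetical names and record layout; it is not necessarily the scheme used in [99].

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (rna_id, family)


def upsample_families(records: List[Record], seed: int = 0) -> List[Record]:
    """Resample minority families with replacement up to the size of the largest family."""
    if not records:
        return []
    rng = random.Random(seed)

    by_family: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        by_family[record[1]].append(record)

    target = max(len(members) for members in by_family.values())

    balanced: List[Record] = []
    for family in sorted(by_family):
        members = by_family[family]
        balanced.extend(members)  # keep every original record once
        balanced.extend(rng.choices(members, k=target - len(members)))  # add duplicates
    return balanced


# Made-up, imbalanced training set for illustration.
example_training_set = [
    ("a1", "16S_rRNA"), ("a2", "16S_rRNA"), ("a3", "16S_rRNA"),
    ("b1", "telomerase"),
]
balanced_set = upsample_families(example_training_set)
```

Up-sampling duplicates existing examples and therefore adds no new structural information; it only changes how often each family contributes to the training loss.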

Generally, improved predictive ability is associated with relatively large ML models, which require large amounts of data for parameter training. Although a large amount of RNA structure data in various formats is available, it is insufficient for training large-scale DL models, especially with respect to high-accuracy data. Hence, how to effectively utilize the limited data and cope with the overfitting of large-scale DL models are important issues that remain to be resolved.

Current pending challenges

Enormous progress has been made toward predicting RNA secondary structure by using ML-based methods. These methods are state of the art when considering most indices of prediction performance. However, some issues still require resolving.

First, the accuracy of prediction should be improved further. Sato and colleagues [79] used the RNAs in newly discovered RNA families to form an independent test set (not used in any of the tested methods), and based on this dataset, a rigorous test was performed among 6 of the most accurate RNA secondary structure prediction methods. The test results showed that, among these methods, the highest positive predictive value (PPV) was 0.636 (achieved by TORNADO [84]), the highest sensitivity was 0.720 (achieved by RNAfold [17]), and the highest F value was 0.632 (achieved by MXfold2 [79]). Using another independent dataset collected from PDB, Singh and colleagues [98] performed a comprehensive comparison among 27 well-known RNA secondary structure prediction methods. Their results showed that, when homologous sequences were available, the highest F value and sensitivity achieved were 0.774 and 0.727, respectively (both by SPOT-RNA2). These results objectively show that there is still much room for improvement in RNA secondary structure prediction. Moreover, many traditional methods neglect special base pairs to avoid a large number of false positives or to limit computational complexity [71,135]. While some methods can predict RNA secondary structures containing pseudoknots [46] or noncanonical base pairs [63], none of them can predict both. Although the recently proposed ML-based methods can predict all kinds of special base pairs, their accuracy on these special base pairs is still limited.
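For reference, the three indices quoted above are computed from the sets of predicted and reference base pairs: PPV is the fraction of predicted pairs that are correct, sensitivity is the fraction of reference pairs that are recovered, and the F value is the harmonic mean of the two. The sketch below uses exact matching of pairs; some benchmarks also accept a predicted pair that is shifted by one position, which is not implemented here. The function name and the example pairs are made up for illustration.

```python
from typing import Iterable, Tuple

Pair = Tuple[int, int]


def pair_metrics(predicted: Iterable[Pair], reference: Iterable[Pair]) -> Tuple[float, float, float]:
    """Return (PPV, sensitivity, F value) for exact base-pair matches."""
    # Normalize each pair so that (i, j) and (j, i) are treated as the same pair.
    pred = {tuple(sorted(p)) for p in predicted}
    ref = {tuple(sorted(p)) for p in reference}

    true_positives = len(pred & ref)
    ppv = true_positives / len(pred) if pred else 0.0
    sensitivity = true_positives / len(ref) if ref else 0.0
    if ppv + sensitivity == 0:
        return ppv, sensitivity, 0.0
    f_value = 2 * ppv * sensitivity / (ppv + sensitivity)
    return ppv, sensitivity, f_value


# Made-up example: 2 of 4 predicted pairs are correct, 2 of 3 reference pairs are found.
example_reference = [(0, 7), (1, 6), (8, 14)]
example_predicted = [(0, 7), (6, 1), (9, 13), (10, 12)]
example_ppv, example_sensitivity, example_f = pair_metrics(example_predicted, example_reference)
```

Because the F value combines both indices, a method can reach the highest PPV or the highest sensitivity without reaching the highest F value, as in the comparison reported in [79].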

The RNA sequence length limitation is another intractable issue, which becomes quite problematic with the recently discovered long (1,000 to 10,000 nt) ncRNAs [136]. Although ML-based methods do not suffer from high time complexity as most score-based methods do, they are unable to effectively capture the long-range interactions within such long RNA sequences. In addition, training an ML model with such large-scale inputs consumes a huge amount of computational resources and is often unrealistic.

For ML-based RNA secondary structure prediction models, overfitting is a very important issue [84], especially for DL-based models with a large number of parameters. Overfitted models perform well on test RNAs similar to those in the training data but generalize poorly to dissimilar ones. They seem to only memorize the secondary structures of RNAs in the training data, rather than actually learn the folding mechanism from them. A result in [79] showed that E2Efold [99] outperformed many traditional methods on the ArchiveII dataset but performed poorly on RNAs from newly discovered RNA families. This suggests that E2Efold might suffer from heavy overfitting. Similarly, another paper [137] reported that the F score of ContextFold decreased by 24% when it was tested on a set of RNAs structurally dissimilar to the training set. Although most DL-based methods employ techniques to alleviate overfitting (such as using regularization [100], enlarging the dataset [97], adding constraints [99], or incorporating Turner’s nearest neighbor free energy parameters), concerns about overfitting remain.

Finally, the folding mechanisms need further exploration. Traditional RNA secondary structure prediction is based on different RNA folding mechanism hypotheses (S1 Table), while data-driven ML-based methods can learn such mechanisms implicitly from known data based on different RNA sequences or sequence features. However, to the best of our knowledge, few folding mechanisms have been revealed from the established ML-based models, although great advances have been made in terms of prediction accuracy. Part of the reason is that the interpretability [138] of DL models is still a challenge today.

Future trends of development

Currently, RNA secondary structure prediction is successfully shifting toward ML-based approaches, away from traditional score-based methods, and DL will surely continue to improve the prediction performance. A carefully designed DL model architecture is a prerequisite to this end. Since DL models are being rapidly developed in the natural language processing and image processing fields, reusing mature DL building blocks from these fields, or combining them, constitutes a feasible way to generate an excellent DL model for RNA secondary structure prediction.

Further, using a DL model to predict the free energy parameters is an inevitable trend for more accurate energy estimation as additional wet lab experimental data become available. However, these parameters may not improve RNA secondary structure prediction accuracy because they have to be combined with traditional score-based methods. On the other hand, combining an ML-based method with an optimization method is a promising approach for improving prediction performance.

Conclusions

RNA structure is one of the central pieces of information for understanding biological processes, and determining RNA secondary structure will continue to be a hot topic in the computation and biology fields. In this review, we focused on ML-based methods, which are involved in many aspects of RNA secondary structure prediction. ML techniques have greatly improved the performance of prediction methods, including accuracy, applicability, and running speed. However, to thoroughly resolve the RNA secondary structure prediction problem, more refined ML models are still needed. At the moment, ML-based methods cannot be used as substitutes for wet lab experiments for obtaining high-resolution structures. Nonetheless, the advent of DL technologies and high-performance hardware will foster a new generation of RNA secondary structure prediction tools with improved accuracy and running speed.

Supporting information

S1 Fig. Classification of ML-based RNA secondary structure prediction methods. According to the subprocess that ML participates in, the ML-based RNA secondary structure prediction methods were classified into 3 categories, i.e., score scheme based on ML (containing 3 subcategories: free energy–refining approach, weighted approach, and probabilistic approach), preprocessing and postprocessing based on ML (containing 2 subcategories: preprocessing and postprocessing), and prediction process based on ML (containing 2 subcategories: end-to-end approach and hybrid approach).

https://doi.org/10.1371/journal.pcbi.1009291.s001

(TIF)

S1 Table. Comparison of RNA secondary structure prediction methods.

https://doi.org/10.1371/journal.pcbi.1009291.s002

(DOCX)

References

  1. 1. Fu Y, Xu ZZ, Lu ZJ, Zhao S, Mathews DH. Discovery of Novel ncRNA Sequences in Multiple Genome Alignments on the Basis of Conserved and Stable Secondary Structures. PLoS ONE. 2015;10(6):e0130200. pmid:26075601.
  2. 2. Consortium TEP. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57–74. pmid:22955616.
  3. 3. Consortium TF. The transcriptional landscape of the mammalian genome. Science. 2006;311(5768):1713. pmid:16556825.
  4. 4. Doudna JA, Cech TR. The chemical repertoire of natural ribozymes. Nature. 2002;418(6894):222–8. pmid:12110898.
  5. 5. Higgs PG, Lehman N. The RNA World: molecular cooperation at the origins of life. Nat Rev Genet. 2015;16(1):7–17. pmid:25385129.
  6. 6. Mortimer SA, Kidwell MA, Doudna JA. Insights into RNA structure and function from genome-wide studies. Nat Rev Genet. 2014;15(7):469–79. pmid:24821474.
  7. 7. Meister G, Tuschl T. Mechanisms of gene silencing by double-stranded RNA. Nature. 2004;431(7006):343–9. pmid:15372041.
  8. 8. Serganov A, Nudler E. A Decade of Riboswitches. Cell. 2013;152(1–2):17–24. pmid:23332744.
  9. 9. Wu L, Belasco JG. Let me count the ways: Mechanisms of gene regulation by miRNAs and siRNAs. Mol Cell. 2008;29(1):1–7. pmid:18206964.
  10. 10. Zou Q, Li J, Hong Q, Lin Z, Wu Y, Shi H, et al. Prediction of MicroRNA-Disease Associations Based on Social Network Analysis Methods. Biomed Res Int. 2015;2015:810514. pmid:26273645.
  11. 11. Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ, et al. Big Data: Astronomical or Genomical? PLoS Biol. 2015;13(7):e1002195. pmid:26151137.
  12. 12. Tinoco I, Bustamante C. How RNA folds. J Mol Biol. 1999;293(2):271–81. pmid:10550208.
  13. 13. Celander DW, Cech TR. Visualizing the higher order folding of a catalytic RNA molecule. Science. 1991;251(4992):401–7. pmid:1989074.
  14. 14. Zarrinkar PP, Williamson JR. Kinetic Intermediates in RNA Folding. Science. 1994;265(5174):918–24. pmid:8052848.
  15. 15. Chen SJ, Tan ZJ, Cao S, Zhang WB. The Statistical Mechanics of RNA Folding. Phys Ther. 2006;35(3):106–17.
  16. 16. Anfinsen CB. Principles that govern the folding of protein chains. Science. 1973;181(4096):223–30. pmid:4124164.
  17. 17. Lorenz R, Bernhart SH, Siederdissen CHZ, Tafer H, Flamm C, Stadler PF, et al. ViennaRNA Package 2.0. Algorithms Mol Biol. 2011;6:26. pmid:22115189.
  18. 18. Zuker M. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 2003;31(13):3406–15. pmid:12824337.
  19. 19. Bellaousov S, Reuter JS, Seetin MG, Mathews DH. RNAstructure: web servers for RNA secondary structure prediction and analysis. Nucleic Acids Res. 2013;41(W1):W471–W4. pmid:23620284.
  20. 20. Condon A, editor Problems on RNA Secondary Structure Prediction and Design. 30th International Colloquium on Automata, Languages and Programming (ICALP 2003); 2003; Berlin, Heidelberg: Springer Berlin Heidelberg.
  21. 21. Fallmann J, Will S, Engelhardt J, Grüning B, Backofen R, Stadler PF. Recent advances in RNA folding. J Biotechnol. 2017;261:97–104. pmid:28690134.
  22. 22. Seetin MG, Mathews DH. RNA structure prediction: an overview of methods. Methods Mol Biol. 2012;905:99–122. pmid:22736001.
  23. 23. Zhao Y, Wang J, Zeng C, Xiao Y. Evaluation of RNA secondary structure prediction for both base-pairing and topology. Biophysics Reports. 2018;4(3):123–32. pmid:20699301.
  24. 24. Leontis NB, Westhof E. Geometric nomenclature and classification of RNA base pairs. RNA. 2001;7(4):499–512. pmid:11345429.
  25. 25. Abu Almakarem AS, Petrov AI, Stombaugh J, Zirbel CL, Leontis NB. Comprehensive survey and geometric classification of base triples in RNA structures. Nucleic Acids Res. 2012;40(4):1407–23. pmid:22053086.
  26. 26. Doherty EA, Batey RT, Masquida B, Doudna JA. A universal mode of helix packing in RNA. Nat Struct Biol. 2001;8(4):339–43. pmid:11276255.
  27. 27. van Batenburg FHD, Gultyaev AP, Pleij CWA. PseudoBase: structural information on RNA pseudoknots. Nucleic Acids Res. 2001;29(1):194–5. pmid:11125088.
  28. 28. Staple DW, Butcher SE. Pseudoknots: RNA structures with diverse functions. PLoS Biol. 2005;3(6):e213. pmid:15941360.
  29. 29. Sakakibara Y, Brown M, Hughey R, Mian IS, Sjölander K, Underwood RC, et al. Stochastic context-free grammars for tRNA modeling. Nucleic Acids Res. 1994;22(23):5112–20. pmid:7800507.
  30. 30. Westhof E. Twenty years of RNA crystallography. RNA. 2015;21(4):486–7. pmid:25780106.
  31. 31. Fürtig B, Richter C, Wöhnert J, Schwalbe H. NMR Spectroscopy of RNA. ChemBioChem. 2003;4(10):936–62. pmid:14523911.
  32. 32. Kertesz M, Wan Y, Mazor E, Rinn JL, Nutter RC, Chang HY, et al. Genome-wide measurement of RNA secondary structure in yeast. Nature. 2010;467(7311):103–7. pmid:20811459.
  33. 33. Underwood JG, Uzilov AV, Katzman S, Onodera CS, Mainzer JE, Mathews DH, et al. FragSeq: transcriptome-wide RNA structure probing using high-throughput sequencing. Nat Methods. 2010;7(12):995–1001. pmid:21057495.
  34. 34. Tijerina P, Mohr S, Russell R. DMS footprinting of structured RNAs and RNA-protein complexes. Nat Protoc. 2007;2(10):2608–23. pmid:17948004.
  35. 35. Wilkinson KA, Merino EJ, Weeks KM. Selective 2′-hydroxyl acylation analyzed by primer extension (SHAPE): quantitative RNA structure analysis at single nucleotide resolution. Nat Protoc. 2006;1(3):1610–6. pmid:17406453.
  36. 36. Bevilacqua PC, Ritchey LE, Su Z, Assmann SM. Genome-Wide Analysis of RNA Secondary Structure. Annu Rev Genet. 2016;50:235–66. pmid:27648642.
  37. 37. Tian S, Das R. RNA structure through multidimensional chemical mapping. Q Rev Biophys. 2016;49:e7. pmid:27266715.
  38. 38. Consortium TR. RNAcentral: a comprehensive database of non-coding RNA sequences. Nucleic Acids Res. 2017;45(D1):D128–D34. pmid:27794554.
  39. 39. Gutell RR, Lee JC, Cannone JJ. The accuracy of ribosomal RNA comparative structure models. Curr Opin Struct Biol. 2002;12(3):301–10. pmid:12127448.
  40. 40. Madison JT, Everett GA, Kung H. Nucleotide Sequence of a Yeast Tyrosine Transfer RNA. Science. 1966;153(3735):531–4. pmid:5938777.
  41. 41. Gutell RR, Weiser B, Woese CR, Noller HF. Comparative anatomy of 16-S-like ribosomal RNA. Prog Nucleic Acid Res Mol Biol. 1985;32:155–216. pmid:3911275.
  42. 42. Han K, Kim HJ. Prediction of common folding structures of homologous RNAs. Nucleic Acids Res. 1993;21(5):1251–7. pmid:7681944.
  43. 43. Tahi F, Gouy M, Regnier M. Automatic RNA secondary structure prediction with a comparative approach. Comput Chem. 2002;26(5):521–30. pmid:12144180.
  44. 44. Tahi F, Engelen S, Regnier M. A fast algorithm for RNA secondary structure prediction including pseudoknots. Third IEEE Symposium on Bioinformatics and Bioengineering. 2003:11–7.
  45. 45. Engelen S, Tahi F. Tfold: efficient in silico prediction of non-coding RNA secondary structures. Nucleic Acids Res. 2010;38(7):2453–66. pmid:20047957.
  46. 46. Bellaousov S, Mathews DH. ProbKnot: fast prediction of RNA secondary structure including pseudoknots. RNA. 2010;16(10):1870–80. pmid:20699301.
  47. 47. Ruan J, Stormo GD, Zhang W. An iterated loop matching approach to the prediction of RNA secondary structures with pseudoknots. Bioinformatics. 2004;20(1):58–66. pmid:14693809.
  48. 48. Hofacker IL, Fekete M, Flamm C, Huynen MA, Rauscher S, Stolorz PE, et al. Automatic detection of conserved RNA structure elements in complete RNA virus genomes. Nucleic Acids Res. 1998;26(16):3825–36. pmid:9685502.
  49. 49. Bindewald E, Shapiro BA. RNA secondary structure prediction from sequence alignments using a network of k-nearest neighbor classifiers. RNA. 2006;12(3):342–52. pmid:16495232.
  50. 50. Legendre A, Angel E, Tahi F. Bi-objective integer programming for RNA secondary structure prediction with pseudoknots. BMC Bioinformatics. 2018;19(1):13. pmid:29334887.
  51. 51. Burge SW, Daub J, Eberhardt R, Tate J, Barquist L, Nawrocki EP, et al. Rfam 11.0: 10 years of RNA families. Nucleic Acids Res. 2013;41(D1):D226–D32. WOS:000312893300031. pmid:23125362
  52. 52. Nussinov R, Jacobson AB. Fast algorithm for predicting the secondary structure of single-stranded RNA. Proc Natl Acad Sci U S A. 1980;77(11):6309–13. pmid:6161375.
  53. 53. Zuker M, Stiegler P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 1981;9(1):133–48. pmid:6163133.
  54. 54. Mathews DH, Sabina J, Zuker M, Turner DH. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J Mol Biol. 1999;288(5):911–40. pmid:10329189.
  55. 55. Andronescu M, Condon A, Turner DH, Mathews DH. The determination of RNA folding nearest neighbor parameters. Methods Mol Biol. 2014;1097:45–70. pmid:24639154.
  56. 56. Xia TB, SantaLucia J, Burkard ME, Kierzek R, Schroeder SJ, Jiao XQ, et al. Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs. Biochemistry. 1998;37(42):14719–35. pmid:9778347.
  57. 57. Turner DH, Mathews DH. NNDB: the nearest neighbor parameter database for predicting stability of nucleic acid secondary structure. Nucleic Acids Res. 2010;38(Database issue):D280–2. pmid:19880381.
  58. 58. Tinoco I Jr., Uhlenbeck OC, Levine MD. Estimation of secondary structure in ribonucleic acids. Nature. 1971;230(5293):362–7. pmid:4927725.
  59. 59. Wuchty S, Fontana W, Hofacker IL, Schuster P. Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers. 1999;49(2):145–65. pmid:10070264.
  60. 60. Reuter JS, Mathews DH. RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics. 2010;11:129. pmid:20230624.
  61. 61. Gultyaev AP, van Batenburg FH, Pleij CW. The computer simulation of RNA folding pathways using a genetic algorithm. J Mol Biol. 1995;250(1):37–51. pmid:7541471.
  62. 62. Huang L, Zhang H, Deng D, Zhao K, Liu K, Hendrix DA, et al. LinearFold: linear-time approximate RNA folding by 5′-to-3′ dynamic programming and beam search. Bioinformatics. 2019;35(14):i295–i304. pmid:31510672.
  63. 63. Parisien M, Major F. The MC-Fold and MC-Sym pipeline infers RNA structure from sequence data. Nature. 2008;452(7183):51–5. pmid:18322526.
  64. 64. Honer zu Siederdissen C, Bernhart SH, Stadler PF, Hofacker IL. A folding algorithm for extended RNA secondary structures. Bioinformatics. 2011;27(13):i129–36. pmid:21685061.
  65. 65. Dallaire P, Major F. Exploring Alternative RNA Structure Sets Using MC-Flashfold and db2cm. Methods Mol Biol. 2016;1490:237–51. pmid:27665603.
  66. 66. Sloma MF, Mathews DH. Base pair probability estimates improve the prediction accuracy of RNA non-canonical base pairs. PLoS Comput Biol. 2017;13(11):e1005827. pmid:29107980.
  67. 67. Poolsap U, Kato Y, Akutsu T. Prediction of RNA secondary structure with pseudoknots using integer programming. BMC Bioinformatics. 2009;10. pmid:19133123.
  68. 68. Bon M, Micheletti C, Orland H. McGenus: a Monte Carlo algorithm to predict RNA secondary structures with pseudoknots. Nucleic Acids Res. 2013;41(3):1895–900. pmid:23248008.
  69. 69. Reeder J, Giegerich R. Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics. BMC Bioinformatics. 2004;5. pmid:14718068.
  70. 70. Dirks RM, Pierce NA. A partition function algorithm for nucleic acid secondary structure including pseudoknots. J Comput Chem. 2003;24(13):1664–77. pmid:12926009.
  71. 71. Rivas E, Eddy SR. A dynamic programming algorithm for RNA structure prediction including pseudoknots. J Mol Biol. 1999;285(5):2053–68. pmid:9925784.
  72. 72. Jordan MI, Mitchell TM. Machine learning: Trends, perspectives, and prospects. Science. 2015;349(6245):255–60. pmid:26185243.
  73. 73. Andronescu M, Condon A, Hoos HH, Mathews DH, Murphy KP. Efficient parameter estimation for RNA secondary structure prediction. Bioinformatics. 2007;23(13):i19–i28. pmid:17646296.
  74. 74. Andronescu M, Condon A, Hoos HH, Mathews DH, Murphy KP. Computational approaches for RNA energy parameter estimation. RNA. 2010;16(12):2304–18. pmid:20940338.
  75. 75. Rehmsmeier M, Steffen P, Hochsmann M, Giegerich R. Fast and effective prediction of microRNA/target duplexes. RNA. 2004;10(10):1507–17. pmid:15383676.
  76. 76. Tang X, Thomas S, Tapia L, Giedroc DP, Amato NM. Simulating RNA folding kinetics on approximated energy landscapes. J Mol Biol. 2008;381(4):1055–67. pmid:18639245.
  77. 77. Zakov S, Goldberg Y, Elhadad M, Ziv-Ukelson M. Rich parameterization improves RNA structure prediction. J Comput Biol. 2011;18(11):1525–42. pmid:22035327.
  78. 78. Akiyama M, Sato K, Sakakibara Y. A max-margin training of RNA secondary structure prediction integrated with the thermodynamic model. J Bioinform Comput Biol. 2018;16(6):1840025. pmid:30616476.
  79. 79. Sato K, Akiyama M, Sakakibara Y. RNA secondary structure prediction using deep learning with thermodynamic integration. Nat Commun. 2021;12(1):941. pmid:33574226.
  80. 80. Woodson SA. Recent insights on RNA folding mechanisms from catalytic RNA. Cell Mol Life Sci. 2000;57(5):796–808. pmid:10892344.
  81. 81. Knudsen B, Hein J. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res. 2003;31(13):3423–8. pmid:12824339.
  82. 82. Knudsen B, Hein J. RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics. 1999;15(6):446–54. pmid:10383470.
  83. 83. Dowell RD, Eddy SR. Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinformatics. 2004;5:71. pmid:15180907.
  84. 84. Rivas E, Lang R, Eddy SR. A range of complex probabilistic models for RNA secondary structure prediction that includes the nearest-neighbor model and more. RNA. 2012;18(2):193–212. pmid:22194308.
  85. 85. Sato K, Hamada M, Mituyama T, Asai K, Sakakibara Y. A non-parametric Bayesian approach for predicting RNA secondary structures. J Bioinform Comput Biol. 2010;8(4):727–42. WOS:000271458900024.
  86. 86. Do CB, Woods DA, Batzoglou S. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics. 2006;22(14):e90–e8. pmid:16873527.
  87. 87. Yonemoto H, Asai K, Hamada M. A semi-supervised learning approach for RNA secondary structure prediction. Comput Biol Chem. 2015;57:72–9. pmid:25748534.
  88. 88. Hor C-Y, Yang C-B, Chang C-H, Tseng C-T, Chen H-H. A Tool Preference Choice Method for RNA Secondary Structure Prediction by SVM with Statistical Tests. Evol Bioinformatics Online. 2013;9:163–84. pmid:23641141.
  89. 89. Zhu Y, Xie ZY, Li YZ, Zhu M, Chen YPP. Research on folding diversity in statistical learning methods for RNA secondary structure prediction. Int J Biol Sci. 2018;14(8):872–82. pmid:29989089.
  90. 90. Haynes T, Knisley D, Knisley J. Using a neural network to identify secondary RNA structures quantified by graphical invariants. Match Commun Math Comput Chem. 2008;60(2):277–90. WOS:000259765200002.
  91. 91. Koessler DR, Knisley DJ, Knisley J, Haynes T. A predictive model for secondary RNA structure using graph theory and a neural network. BMC Bioinformatics. 2010;11(Suppl 6):S21. pmid:20946605.
  92. 92. Takefuji Y, Chen LL, Lee KC, Huffman J. Parallel algorithms for finding a near-maximum independent set of a circle graph. IEEE Trans Neural Netw. 1990;1(3):263–7. pmid:18282845.
  93. 93. Liu Q, Ye X, Zhang Y. A Hopfield Neural Network based algorithm for RNA secondary structure prediction. 1st International Multi Symposium on Computer and Computational Sciences; Hangzhou, China: IEEE; 2006.
  94. 94. Steeg EW. Neural networks, adaptive optimization, and RNA secondary structure prediction. Artificial intelligence and molecular biology. 1993:121–60.
  95. 95. Apolloni B, Torto LL, Morpurgo A, Zanaboni AM. RNA Secondary Structure Prediction by MFT Neural Networks 2003.
  96. 96. Qasim R, Kauser N, Jilani T. Secondary Structure Prediction of RNA using Machine Learning Method. Int J Comput Appl. 2011;10(6):0975–8887.
  97. 97. Singh J, Hanson J, Paliwal K, Zhou YQ. SPOT-RNA: RNA Secondary Structure Prediction using an Ensemble of Two-dimensional Deep Neural Networks and Transfer Learning. Nat Commun. 2019;10(1):1–13. pmid:30602773.
  98. 98. Singh J, Paliwal K, Zhang T, Singh J, Litfin T, Zhou Y. Improved RNA Secondary Structure and Tertiary Base-pairing Prediction Using Evolutionary Profile, Mutational Coupling and Two-dimensional Transfer Learning. Bioinformatics. 2021. Epub 2021/03/12. pmid:33704363.
  99. 99. Chen X, Li Y, Umarov R, Gao X, Song L. RNA Secondary Structure Prediction By Learning Unrolled Algorithms. International Conference on Learning Representations. 2020.
  100. 100. Calonaci N, Jones A, Cuturello F, Sattler M, Bussi G. Machine learning a model for RNA structure prediction. 2020;2(4):lqaa090. pmid:33575634.
  101. 101. Lu W, Tang Y, Wu H, Huang H, Fu Q, Qiu J, et al. Predicting RNA secondary structure via adaptive deep recurrent neural networks with energy-based filter. BMC Bioinformatics. 2019;20(Suppl 25):684. pmid:31874602.
  102. 102. Wu H, Tang Y, Lu W, Chen C, Huang H, Fu Q, editors. RNA Secondary Structure Prediction Based on Long Short-Term Memory Model. 14th International Conference on Intelligent Computing (ICIC); 2018; Wuhan, China.
  103. 103. Quan L, Cai L, Chen Y, Mei J, Sun X, Lyu Q. Developing parallel ant colonies filtered by deep learned constrains for predicting RNA secondary structure with pseudo-knots. Neurocomputing. 2020;384:104–14. WOS:000513853600009.
  104. 104. Zhang H, Zhang C, Li Z, Li C, Wei X, Zhang B, et al. A New Method of RNA Secondary Structure Prediction Based on Convolutional Neural Network and Dynamic Programming. Front Genet. 2019;10:467. pmid:31191603.
  105. 105. Wang L, Liu Y, Zhong X, Liu H, Lu C, Li C, et al. DMfold: A Novel Method to Predict RNA Secondary Structure With Pseudoknots Based on Deep Learning and Improved Base Pair Maximization Principle. Front Genet. 2019;10:143. pmid:30886627.
  106. 106. Liu Y, Zhao Q, Zhang H, Xu R, Li Y, Wei L. A New Method to Predict RNA Secondary Structure Based on RNA Folding Simulation. IEEE/ACM Trans Comput Biol Bioinform. 2016;13(5):990–5. pmid:26552091.
  107. 107. Willmott D, Murrugarra D, Ye Q. Improving RNA secondary structure prediction via state inference with deep recurrent neural networks. Comput Math Biophys. 2020;8:36–50.
  108. 108. Deigan KE, Li TW, Mathews DH, Weeks KM. Accurate SHAPE-directed RNA structure determination. Proc Natl Acad Sci U S A. 2009;106(1):97–102. pmid:19109441.
  109. 109. Gruber AR, Findeiss S, Washietl S, Hofacker IL, Stadler PF. RNAZ 2.0: Improved Noncoding RNA Detection. Biocomputing. 2010;15:69–79. pmid:19908359.
  110. Washietl S, Will S, Hendrix DA, Goff LA, Rinn JL, Berger B, et al. Computational analysis of noncoding RNAs. Wiley Interdiscip Rev RNA. 2012;3(6):759–78. pmid:22991327.
  111. Moulton V. Tracking down noncoding RNAs. Proc Natl Acad Sci U S A. 2005;102(7):2269–70. pmid:15703286.
  112. Wolfinger MT, Svrcek-Seiler WA, Flamm C, Hofacker IL, Stadler PF. Efficient computation of RNA folding dynamics. J Phys A Math Gen. 2004;37(17):4731–41. WOS:000221482800006.
  113. Rouillard JM, Zuker M, Gulari E. OligoArray 2.0: design of oligonucleotide probes for DNA microarrays using a thermodynamic approach. Nucleic Acids Res. 2003;31(12):3057–62. pmid:12799432.
  114. Lu ZJ, Mathews DH. Efficient siRNA selection using hybridization thermodynamics. Nucleic Acids Res. 2008;36(2):640–7. pmid:18073195.
  115. Tafer H, Ameres SL, Obernosterer G, Gebeshuber CA, Schroeder R, Martinez J, et al. The impact of target site accessibility on the design of effective siRNAs. Nat Biotechnol. 2008;26(5):578–83. pmid:18438400.
  116. Sazani P, Gemignani F, Kang SH, Maier MA, Manoharan M, Persmark M, et al. Systemically delivered antisense oligomers upregulate gene expression in mouse tissues. Nat Biotechnol. 2002;20(12):1228–33. pmid:12426578.
  117. Childs-Disney JL, Wu M, Pushechnikov A, Aminova O, Disney MD. A small molecule microarray platform to select RNA internal loop-ligand interactions. ACS Chem Biol. 2007;2(11):745–54. pmid:17975888.
  118. Palde PB, Ofori LO, Gareiss PC, Lerea J, Miller BL. Strategies for Recognition of Stem-Loop RNA Structures by Synthetic Ligands: Application to the HIV-1 Frameshift Stimulatory Sequence. J Med Chem. 2010;53(16):6018–27. pmid:20672840.
  119. Castanotto D, Rossi JJ. The promises and pitfalls of RNA-interference-based therapeutics. Nature. 2009;457(7228):426–33. pmid:19158789.
  120. Gareiss PC, Sobczak K, McNaughton BR, Palde PB, Thornton CA, Miller BL. Dynamic Combinatorial Selection of Molecules Capable of Inhibiting the (CUG) Repeat RNA-MBNL1 Interaction In Vitro: Discovery of Lead Compounds Targeting Myotonic Dystrophy (DM1). J Am Chem Soc. 2008;130(48):16254–61. pmid:18998634.
  121. Tavares RdCA, Mahadeshwar G, Wan H, Huston NC, Pyle AM. The global and local distribution of RNA structure throughout the SARS-CoV-2 genome. J Virol. 2020;95(6):e02190–20. pmid:33268519.
  122. Vandelli A, Monti M, Milanetti E, Armaos A, Rupert J, Zacco E, et al. Structural analysis of SARS-CoV-2 and predictions of the human interactome. Nucleic Acids Res. 2020;48(20):11270–83. pmid:33068416.
  123. Andronescu M, Bereg V, Hoos HH, Condon A. RNA STRAND: the RNA secondary structure and statistical analysis database. BMC Bioinformatics. 2008;9:340. pmid:18700982.
  124. Burley SK, Bhikadiya C, Bi CX, Bittrich S, Chen L, Crichlow GV, et al. RCSB Protein Data Bank: powerful new tools for exploring 3D structures of biological macromolecules for basic and applied research and education in fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences. Nucleic Acids Res. 2021;49(D1):D437–D51. pmid:33211854.
  125. Danaee P, Rouches M, Wiley M, Deng D, Huang L, Hendrix D. bpRNA: large-scale automated annotation and analysis of RNA secondary structure. Nucleic Acids Res. 2018;46(11):5381–94. pmid:29746666.
  126. Juhling F, Morl M, Hartmann RK, Sprinzl M, Stadler PF, Putz J. tRNAdb 2009: compilation of tRNA sequences and tRNA genes. Nucleic Acids Res. 2009;37:D159–D62. pmid:18957446.
  127. Gutell RR. Collection of small subunit (16S- and 16S-like) ribosomal RNA structures. Nucleic Acids Res. 1993;21(13):3051–4. pmid:8332526; PubMed Central PMCID: PMC7524024.
  128. Zwieb C, Gorodkin J, Knudsen B, Burks J, Wower J. tmRDB (tmRNA database). Nucleic Acids Res. 2003;31(1):446–7. pmid:12520048.
  129. Richardson KE, Kirkpatrick CC, Znosko BM. RNA CoSSMos 2.0: an improved searchable database of secondary structure motifs in RNA three-dimensional structures. Database-Oxford. 2020:baz153. pmid:31950189.
  130. Korunes KL, Myers RB, Hardy R, Noor MAF. PseudoBase: a genomic visualization and exploration resource for the Drosophila pseudoobscura subgroup. Fly. 2021;15(1):38–44. pmid:33319644.
  131. Nagaswamy U, Larios-Sanz M, Hury J, Collins S, Zhang ZD, Zhao Q, et al. NCIR: a database of non-canonical interactions in known RNA structures. Nucleic Acids Res. 2002;30(1):395–7. doi:10.1093/nar/30.1.395. pmid:11752347.
  132. Sloma MF, Mathews DH. Exact calculation of loop formation probability identifies folding motifs in RNA secondary structures. RNA. 2016;22(12):1808–18. pmid:27852924.
  133. Tan Z, Fu YH, Sharma G, Mathews DH. TurboFold II: RNA structural alignment and secondary structure prediction informed by multiple homologs. Nucleic Acids Res. 2017;45(20):11570–81. pmid:29036420.
  134. Fu LM, Niu BF, Zhu ZW, Wu ST, Li WZ. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150–2. pmid:23060610.
  135. Lyngso RB, Pedersen CN. RNA pseudoknot prediction in energy-based models. J Comput Biol. 2000;7(3–4):409–27. pmid:11108471.
  136. Johnsson P, Lipovich L, Grander D, Morris KV. Evolutionary conservation of long non-coding RNAs; sequence, structure, function. Biochim Biophys Acta. 2014;1840(3):1063–71. pmid:24184936.
  137. Rivas E. The four ingredients of single-sequence RNA secondary structure prediction. A unifying perspective. RNA Biol. 2013;10(7):1185–96. pmid:23695796.
  138. Carvalho DV, Pereira EM, Cardoso JS. Machine Learning Interpretability: A Survey on Methods and Metrics. Electronics-Switz. 2019;8(8). WOS:000483554300063.
  139. Apolloni B, Lotorto L, Morpurgo A, Zanaboni A. RNA Secondary Structure Prediction by MFT Neural Networks. Psychol Forsch. 2003:143–8.