Abstract

We employ Rayleigh entropy maximization to develop a novel scheme for identifying intrinsically disordered proteins (IDPs) that requires computing only five features for each residue of a protein sequence, namely, the Shannon entropy, the topological entropy, and the weighted average values of three propensities. Furthermore, our scheme is a linear classification method, so its decision curves are simpler to compute, more robust, and require fewer learning samples. Simulation results for our scheme and for several existing schemes demonstrate its effectiveness.

1. Introduction

Accurately identifying intrinsically disordered proteins (IDPs), which contain at least one region that lacks a unique 3D structure and instead adopts a dynamic conformational ensemble [1, 2], is vital for more effective drug design, better protein expression, and functional annotation. This is because some of these intrinsically disordered proteins have been confirmed to be involved in some of the most important regulatory functions in the cell [3] and to have a great impact on diseases such as Alzheimer’s disease, Parkinson’s disease, and certain types of cancer [4]. It is essential to investigate IDPs through computation on the amino acid sequence of a protein [4], because disordered protein regions are often difficult to purify and crystallize [5], which makes identifying them with experimental approaches problematic. Furthermore, experimental approaches to identifying disordered protein regions are usually both expensive and time-consuming [4].

Many IDP identification schemes have been proposed in the past decades, and they can be roughly classified into two categories. (1) The first category exploits amino acid propensity scales of protein sequences, with examples including FoldIndex [6], GlobPlot [7], IUPred [8], and FoldUnfold [9]. These schemes utilize the amino acid propensity scales to compute parameters such as the ratio of the mean net charge to the mean hydropathy, the relative propensity of an amino acid residue, and the expected number of interresidue contacts. These schemes are simple but, in general, not accurate enough [10]. (2) The second category employs machine learning techniques, with examples including the PONDR® predictors [11], RONN [12], DISOPRED2 [13], BVDEA [4], and DisPSSMP [14]. Many of these schemes are based on artificial neural networks or support vector machines (SVMs), which in general require computing a large number of features of a given protein sequence. Computing these features can be expensive and time-consuming. More recently, MetaPrDOS [15] and the Meta-Disorder predictor [16], which combine several different predictors and trade them off to yield an optimal decision, have also been reported.

In this paper, we employ Rayleigh entropy maximization to develop a novel IDP identification scheme which requires computing only five features for each residue of a protein sequence, namely, the Shannon entropy, the topological entropy, and the weighted average values of three propensities. In contrast with most existing schemes, which need to compute no fewer than 30 features for each residue of a protein sequence, our scheme greatly reduces the computational complexity while achieving a similar performance. Furthermore, our scheme is based on a linear classification method, so its decision curves are simpler, more robust, and require fewer learning samples to compute. Our scheme is first trained and tested on the dataset DIS803 with 10-fold cross-validation. The dataset DIS803 is comprised of 803 protein sequences with 1254 disordered regions and 1343 ordered regions, which include 92423 disordered and 315503 ordered residues. As a comparison, we run our scheme together with some existing schemes, such as PONDR [11], FoldIndex [6], DISOPRED2 [13], RONN [12], and DISPRO [17], on the datasets PU159 and R80, which are comprised of 239 protein sequences with 183 disordered regions and 231 ordered regions and which together contain 18111 disordered and 46477 ordered residues. The simulation results suggest that only our scheme, BVDEA [4], and DisPSSMP [14] have PE (probability excess) values exceeding 0.5 for both datasets PU159 and R80. Our scheme is at least as accurate as BVDEA [4] and DisPSSMP [14] and requires computing only 5 features for each residue of a protein sequence, while DisPSSMP [14] and BVDEA [4] need to compute 188 and 120 features for each residue, respectively. In addition, both BVDEA [4] and DisPSSMP [14] are based on nonlinear classification methods, which require computing complex decision curves that are less robust in general.

2. A Brief Review of Some Notations

In a protein sequence, complexity measures in how many different ways a sequence can be rearranged [18]. It has been demonstrated that low-complexity regions are more likely to be disordered than ordered [12]. The Shannon entropy and the topological entropy are two parameters that measure the complexity of a sequence. To begin with, let us first recall some notation.

Given a protein sequence $w = w_{1} w_{2} \cdots w_{N}$ of length $N$, the Shannon entropy is
$$H_{\mathrm{sh}}(w) = -\sum_{i=1}^{20} p_{i} \log_{2} p_{i}, \quad (1)$$
where $p_{i}$ for $i = 1, 2, \ldots, 20$ is defined as
$$p_{i} = \frac{n_{i}}{N}, \quad (2)$$
with $n_{i}$ being the number of occurrences of the amino acid $A_{i}$ in $w$ and $\{A_{1}, A_{2}, \ldots, A_{20}\}$ being an ordered set of the 20 amino acid symbols.
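As an illustration of (1) and (2), the following Python sketch computes the Shannon entropy of a residue window; the alphabet string, the function name, and the base-2 logarithm are our own choices, consistent with but not prescribed by the text.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # an ordered set of the 20 amino acid symbols

def shannon_entropy(window: str) -> float:
    """Shannon entropy of a residue window, as in (1)-(2)."""
    N = len(window)
    counts = Counter(window)
    entropy = 0.0
    for aa in AMINO_ACIDS:
        p = counts.get(aa, 0) / N        # p_i = n_i / N, as in (2)
        if p > 0:
            entropy -= p * math.log2(p)  # accumulate -p_i * log2(p_i)
    return entropy

print(shannon_entropy("ILVFWYAG"))  # a short, purely hypothetical window
```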

The complexity function $c_{w}(n)$, representing the total number of different $n$-length subwords of $w$, is defined as [19]
$$c_{w}(n) = \bigl|\{ u : u \text{ is a subword of } w \text{ with } |u| = n \}\bigr|, \quad (3)$$
where a subword of length $n$ is any $n$ consecutive symbols of $w$ and $|w|$ denotes the length of $w$. For example, for a given sequence $v$, the subwords of length 2 are the pairs of consecutive symbols of $v$, and $c_{v}(2)$ is the number of distinct such pairs. Given a finite protein sequence $w$ of length $|w|$ over an alphabet of $K$ symbols, let $n$ be the unique integer satisfying
$$K^{n} + n - 1 \le |w| < K^{n+1} + (n+1) - 1,$$
and let $w_{1}^{K^{n}+n-1}$ denote the first $K^{n} + n - 1$ consecutive symbols of $w$; that is, $w_{1}^{K^{n}+n-1} = w_{1} w_{2} \cdots w_{K^{n}+n-1}$.

The topological entropy of $w$ is
$$H_{\mathrm{top}}(w) = \frac{\log_{K} c_{w_{1}^{K^{n}+n-1}}(n)}{n},$$
where $c$ is defined in (3). Thus, we have $H_{\mathrm{top}}(w) = 1$ when the subwords of $w_{1}^{K^{n}+n-1}$ run over all the possible subwords of length $n$. On the other hand, when $w$ is a repetition sequence comprising a single letter, $c_{w_{1}^{K^{n}+n-1}}(n) = 1$, which suggests $H_{\mathrm{top}}(w) = 0$. Similar to [19], we also compute the average of the topological entropy of $w$ as
$$\bar{H}_{\mathrm{top}}(w) = \frac{1}{|w| - (K^{n} + n - 1) + 1} \sum_{i=1}^{|w| - (K^{n} + n - 1) + 1} \frac{\log_{K} c_{w_{i}^{\,i+K^{n}+n-2}}(n)}{n}, \quad (7)$$
where $w_{i}^{\,i+K^{n}+n-2}$ denotes the $K^{n} + n - 1$ consecutive symbols of $w$ starting at the $i$th position.
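The sketch below implements the complexity function (3) and the averaged topological entropy (7) for a sequence over a $K$-letter alphabet. The subwindow length $K^{n} + n - 1$ and the choice of $n$ follow the definitions above; the function names and the assumption that $|w| \ge K$ are our own.

```python
import math

def complexity(w: str, n: int) -> int:
    """c_w(n): number of distinct subwords of length n, as in (3)."""
    return len({w[i:i + n] for i in range(len(w) - n + 1)})

def _subword_order(length: int, K: int) -> int:
    """The unique n with K^n + n - 1 <= length < K^(n+1) + n (assumes length >= K)."""
    n = 1
    while K ** (n + 1) + n <= length:
        n += 1
    return n

def average_topological_entropy(w: str, K: int) -> float:
    """Average of the topological entropy over all shifted subwindows, as in (7)."""
    n = _subword_order(len(w), K)
    m = K ** n + n - 1                      # subwindow length K^n + n - 1
    values = [math.log(complexity(w[i:i + m], n), K) / n
              for i in range(len(w) - m + 1)]
    return sum(values) / len(values)

# A window of length 35 over the binary alphabet (K = 2) gives n = 4;
# the string below is purely illustrative.
print(average_topological_entropy("01101001100101101001011001101001011", K=2))
```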

The Rayleigh entropy maximization [20] of $X = [x_{1}, x_{2}, \ldots, x_{M}]$, where $M$ represents the total number of all the samples and $x_{i}$ ($i = 1, 2, \ldots, M$) represents the features of the $i$th sample, is to compute the projection direction $\alpha$ which optimizes the cost function
$$J(\alpha) = \frac{\alpha^{T} S_{b} \alpha}{\alpha^{T} S_{w} \alpha}. \quad (8)$$
$S_{b}$ and $S_{w}$ in (8) are, respectively, defined as
$$S_{b} = (m_{1} - m_{2})(m_{1} - m_{2})^{T}, \quad (9)$$
$$S_{w} = \sum_{k=1}^{2} \sum_{x \in X_{k}} (x - m_{k})(x - m_{k})^{T}, \quad (10)$$
with
$$m_{k} = \frac{1}{M_{k}} \sum_{x \in X_{k}} x, \quad k = 1, 2, \quad (11)$$
where $M_{k}$ is the number of samples in the $k$th class and $X_{k}$ is the set of samples in the $k$th class.

Using the Lagrange method, the optimal $\alpha^{*}$ and the corresponding optimal projection of $x_{i}$ on the direction of $\alpha^{*}$ are given as
$$\alpha^{*} = S_{w}^{-1}(m_{1} - m_{2}), \quad (12)$$
$$y_{i} = (\alpha^{*})^{T} x_{i}, \quad i = 1, 2, \ldots, M. \quad (13)$$
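A minimal NumPy sketch of (9)-(13) is given below. It treats the Rayleigh entropy maximization as the standard two-class Fisher criterion; the small ridge term used to keep $S_{w}$ invertible is our own numerical safeguard and is not part of the text.

```python
import numpy as np

def rayleigh_direction(X1: np.ndarray, X2: np.ndarray, ridge: float = 1e-8) -> np.ndarray:
    """Optimal projection direction alpha* = S_w^{-1} (m1 - m2), as in (9)-(12).

    X1, X2: feature matrices of the two classes, one sample per column."""
    m1 = X1.mean(axis=1)                  # class means, as in (11)
    m2 = X2.mean(axis=1)
    D1 = X1 - m1[:, None]
    D2 = X2 - m2[:, None]
    Sw = D1 @ D1.T + D2 @ D2.T            # within-class scatter, as in (10)
    Sw += ridge * np.eye(Sw.shape[0])     # numerical safeguard (assumption)
    return np.linalg.solve(Sw, m1 - m2)   # alpha* in (12)

def project(alpha: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Projections y_i = alpha^T x_i of all columns of X, as in (13)."""
    return alpha @ X
```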

3. The Computation of the Optimal Projection Direction

In this section, we compute the Shannon entropy and the topological entropy for the dataset DIS803 from DisProt [21] (http://www.disprot.org/). Then, choosing the Remark 465, Deleage/Roux, and Bfactor(2STD) propensities provided by the GlobPlot NAR paper [7] (http://globplot.embl.de/html/propensities.html), we compute the weighted average values of these propensities for the dataset DIS803. Finally, utilizing the computed Shannon entropy, topological entropy, and weighted average values of the three propensities of the dataset DIS803, we derive the optimal projection direction by maximizing the cost function defined in (8). The procedure proceeds as follows.

(1) Let $s$ be a protein sequence. We choose a window of length $L$ to extract $L$ consecutive residues from $s$ and denote the extracted segment by $w$, so that $|w| = L$. Using (1), we can compute the Shannon entropy of $w$. To compute the topological entropy of $w$, we first map $w$ to a binary sequence as follows: the bulky hydrophobic (I, L, V) and aromatic (F, W, Y) amino acid residues defined in [10] are mapped to 1, and the remaining residues are mapped to 0. We use $\hat{w}$ to represent the mapped sequence of $w$. Table 1 lists all the amino acid residues and their corresponding mapping values. Then, utilizing (7) with $K = 2$, we compute the average topological entropy of $\hat{w}$ as
$$\bar{H}_{\mathrm{top}}(\hat{w}) = \frac{1}{L - (2^{n} + n - 1) + 1} \sum_{i=1}^{L - (2^{n} + n - 1) + 1} \frac{\log_{2} c_{\hat{w}_{i}^{\,i+2^{n}+n-2}}(n)}{n}, \quad (14)$$
where the parameter $n$ here satisfies $2^{n} + n - 1 \le L < 2^{n+1} + n$ and $\hat{w}_{i}^{\,i+2^{n}+n-2}$ denotes the $2^{n} + n - 1$ consecutive symbols of $\hat{w}$ starting at the $i$th position, that is, $\hat{w}_{i} \hat{w}_{i+1} \cdots \hat{w}_{i+2^{n}+n-2}$. For example, Table 1 maps a given protein sequence to a binary sequence
$$\hat{w} = \hat{w}_{1} \hat{w}_{2} \cdots \hat{w}_{L}; \quad (15)$$
counting the distinct length-$n$ subwords of $\hat{w}$ gives $c_{\hat{w}}(n)$, and substituting it into (14) yields the topological entropy of $\hat{w}$.

(2) For this protein sequence of length $L$, we also compute the weighted average values of the Remark 465, Deleage/Roux, and Bfactor(2STD) propensities defined in the GlobPlot NAR paper [7]:
$$\bar{P}_{j}(w) = \frac{\sum_{i=1}^{L} \lambda_{i} P_{j}(w_{i})}{\sum_{i=1}^{L} \lambda_{i}}, \quad j = 1, 2, 3, \quad (18)$$
where $P_{j}(w_{i})$ with $i = 1, 2, \ldots, L$ represents the value of the $j$th propensity at residue $w_{i}$; the $j$th propensity with $j = 1, 2, 3$ denotes the Remark 465, Deleage/Roux, and Bfactor(2STD) propensities, respectively; and the weights $\lambda_{i}$ in (18) are identical to the sum function of the GlobPlot NAR paper [7]. For example, substituting the Remark 465 propensity values of the residues of the sequence in (15) into (18) gives $\bar{P}_{1} = 0.1551$; the values $\bar{P}_{2}$ and $\bar{P}_{3}$ corresponding to the Deleage/Roux and Bfactor(2STD) propensities are obtained in the same way.

(3) For a general protein sequence $s$ of length $N$, we use a sliding window of length $L$ ($L \le N$) to extract the $L$ consecutive residues $w^{(i)}$ associated with the $i$th residue. For this sliced $w^{(i)}$, we compute the Shannon entropy, the topological entropy, and $\bar{P}_{j}(w^{(i)})$ for $j = 1, 2, 3$ defined in (18). Define a vector $x_{i}$ to be
$$x_{i} = \bigl[ H_{\mathrm{sh}}(w^{(i)}),\ \bar{H}_{\mathrm{top}}(\hat{w}^{(i)}),\ \bar{P}_{1}(w^{(i)}),\ \bar{P}_{2}(w^{(i)}),\ \bar{P}_{3}(w^{(i)}) \bigr]^{T}. \quad (19)$$
Thus, we can compute the feature matrix of the protein sequence of length $N$ as
$$X = [x_{1}, x_{2}, \ldots, x_{N}], \quad (20)$$
where the vector $x_{i}$ for $i = 1, 2, \ldots, N$ is defined in (19). For the protein sequence of (15), for instance, we choose a window of size $L$ and compute the feature vectors $x_{10}$ and $x_{30}$ of its 10th and 30th residues according to (19).

(4) Utilizing 10-fold cross-validation [22], we randomly divide the dataset DIS803 into ten subsets of approximately equal size. The protocol uses nine subsets as the training dataset to build a model and the remaining tenth subset for testing. Using the training dataset of the 10-fold cross-validation [22], we can compute the feature matrix
$$X_{\mathrm{train}} = [X^{(1)}, X^{(2)}, \ldots, X^{(T)}], \quad (23)$$
where $T$ is the total number of protein sequences in the training dataset and $X^{(t)}$, defined as in (20) for $t = 1, 2, \ldots, T$, is the feature matrix of the $t$th protein sequence, whose length is $N_{t}$. We then divide all the residues of the training dataset obtained from DIS803 through the 10-fold cross-validation described above into two disjoint subsets: one comprised of all the disordered residues and the other of all the ordered residues of the training dataset.
Let $M_{1}$ and $M_{2}$, respectively, denote the number of all the disordered and all the ordered residues of the training dataset, and let $X_{1}$ and $X_{2}$, respectively, represent the feature matrices, built as in (23), corresponding to all the disordered and all the ordered residues of the training dataset. From (11), it follows that
$$m_{1} = \frac{1}{M_{1}} \sum_{i=1}^{M_{1}} x_{i}^{(1)}, \qquad m_{2} = \frac{1}{M_{2}} \sum_{i=1}^{M_{2}} x_{i}^{(2)},$$
where $x_{i}^{(1)}$ and $x_{i}^{(2)}$ represent the $i$th column vectors of $X_{1}$ and $X_{2}$, respectively. Using $m_{1}$ and $m_{2}$, $S_{b}$ in (9) can be calculated as
$$S_{b} = (m_{1} - m_{2})(m_{1} - m_{2})^{T},$$
and $S_{w}$ is computed from (10) over the same two subsets. From (12), the projection direction is
$$\alpha^{*} = S_{w}^{-1}(m_{1} - m_{2}).$$
The projection of every feature vector can then be computed by (13). Finally, using a linear search over the projected values, we obtain the classification threshold.
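To make this procedure concrete, the following Python sketch assembles the five-feature vectors of (19)-(20) with a sliding window and chooses a classification threshold by a linear search over the projected training values. It reuses shannon_entropy, average_topological_entropy, and rayleigh_direction from the sketches above; the dictionary-based propensity tables, the unweighted propensity averages (instead of the GlobPlot weighting in (18)), the window handling at the sequence ends, and the sensitivity-plus-specificity search criterion are all our own simplifying assumptions.

```python
import numpy as np

# Assumes shannon_entropy, average_topological_entropy, and rayleigh_direction
# from the earlier sketches are in scope.

HYDROPHOBIC_AROMATIC = set("ILVFWY")  # residues mapped to 1 (cf. Table 1)

def binary_map(window: str) -> str:
    """Map a residue window to the 0/1 sequence used for topological entropy."""
    return "".join("1" if aa in HYDROPHOBIC_AROMATIC else "0" for aa in window)

def feature_vector(window: str, propensities) -> np.ndarray:
    """Five features of (19); propensities is a list of three dicts
    mapping each residue to its propensity value (an assumed format)."""
    averages = [np.mean([table.get(aa, 0.0) for aa in window]) for table in propensities]
    return np.array([shannon_entropy(window),
                     average_topological_entropy(binary_map(window), K=2),
                     *averages])

def feature_matrix(seq: str, L: int, propensities) -> np.ndarray:
    """Feature matrix with one column per window position, cf. (20)."""
    cols = [feature_vector(seq[i:i + L], propensities)
            for i in range(len(seq) - L + 1)]
    return np.stack(cols, axis=1)

def find_threshold(y_dis: np.ndarray, y_ord: np.ndarray) -> float:
    """Linear search for the threshold maximizing sensitivity + specificity,
    assuming larger projections indicate disorder (flip alpha otherwise)."""
    best_t, best_score = 0.0, -np.inf
    for t in np.sort(np.concatenate([y_dis, y_ord])):
        score = np.mean(y_dis >= t) + np.mean(y_ord < t)
        if score > best_score:
            best_t, best_score = t, score
    return best_t
```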

4. The Simulation Results

We employ the Rayleigh entropy maximization presented in the previous sections to develop an IDP identification scheme which requires computing only five features for each residue of a protein sequence, namely, the Shannon entropy, the topological entropy, and the weighted average values of three propensities. In contrast, most existing schemes, such as PONDR [11], DISOPRED2 [13], RONN [12], DISPRO [17], BVDEA [4], and DisPSSMP [14], demand computing no fewer than 30 features for IDP identification. Furthermore, our scheme is based on a linear classification method whose simple decision curves are more robust and require fewer learning samples to compute.

In order to train and test our scheme, the sequences in the dataset DIS803, which is comprised of 803 protein sequences, are randomly split into ten subsets of approximately equal size to conduct a 10-fold cross-validation. The results of our scheme with different window sizes are shown in Table 2, where Sens., Spec., PE, and MCC abbreviate sensitivity, specificity, probability excess, and Matthews’ correlation coefficient, respectively. In addition, the values of the probability excess and Matthews’ correlation coefficient for different window sizes are shown in Figure 1. When the window size is larger than 35, these values level off. Thus, we present our results with a window size of 35 in the subsequent simulations.
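For reference, the per-residue measures reported in Table 2 can be computed from confusion-matrix counts as in the sketch below; reading PE as sensitivity + specificity − 1 and using the standard MCC expression are our interpretations of the abbreviations, with disordered residues treated as the positive class.

```python
import math

def evaluation_metrics(tp: int, tn: int, fp: int, fn: int):
    """Sens., Spec., PE, and MCC from per-residue counts
    (disordered residues are the positive class)."""
    sens = tp / (tp + fn)                 # sensitivity
    spec = tn / (tn + fp)                 # specificity
    pe = sens + spec - 1                  # probability excess (assumed definition)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sens, spec, pe, mcc
```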

As a comparison, we run our scheme together with some of the best known schemes, such as PONDR [11], FoldIndex [6], DISOPRED2 [13], RONN [12], and DISPRO [17], on the datasets PU159 and R80, which are comprised of 239 protein sequences with 183 disordered regions and 231 ordered regions. Dataset PU159 consists of P80 and U79 [23], where P80 and U79 contain 80 completely ordered and 79 completely disordered proteins, respectively, and both are taken from the PONDR® web site [23, 24]. Dataset R80 is from RONN [12] and contains 80 proteins with 183 disordered regions and 151 ordered regions.

Considering the classification method used, we use DISREM as the abbreviation of our scheme. The simulation results listed in Tables 3 and 4 show that the IDP identification accuracy of our scheme is approximately as high as that of BVDEA [4] and DisPSSMP [14], whose performance exceeds that of the other schemes mentioned above on the datasets PU159 and R80. Tables 3 and 4 also suggest that only our scheme, BVDEA [4], and DisPSSMP [14] have PE (probability excess) values exceeding 0.5 for both datasets PU159 and R80. To achieve these PE values, our scheme requires computing only 5 features for each residue, while DisPSSMP [14] and BVDEA [4] demand computing 188 and 120 features for each residue of a protein sequence, respectively.

Furthermore, unlike the nonlinear classifiers of DisPSSMP [14] and BVDEA [4], which require computing complex decision curves, our scheme is based on Rayleigh entropy maximization, a linear classification method. Its decision curves are therefore simpler to compute, more robust, and require fewer learning samples than those of DisPSSMP [14] and BVDEA [4].

5. Conclusions

In this paper, we compute the Shannon entropy, the topological entropy, and the weighted average values of three propensities to develop a criterion based on Rayleigh entropy maximization for predicting the intrinsically disordered regions of a protein. Compared with several existing schemes, our scheme is at least as accurate as the best-performing of them. In particular, in contrast with schemes that require computing no fewer than 30 features, our scheme relies on computing only five features per residue.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.