iLoc-Euk: A Multi-Label Classifier for Predicting the Subcellular Localization of Singleplex and Multiplex Eukaryotic Proteins

Kuo-Chen Chou; Zhi-Cheng Wu; Xuan Xiao

doi:10.1371/journal.pone.0018258

Abstract

Predicting protein subcellular localization is an important and difficult problem, particularly when query proteins may have the multiplex character, i.e., simultaneously exist at, or move between, two or more different subcellular location sites. Most of the existing protein subcellular location predictors can only be used to deal with single-location or “singleplex” proteins. Actually, multiple-location or “multiplex” proteins should not be ignored because they usually possess some unique biological functions worthy of our special notice. By introducing the “multi-labeled learning” and “accumulation-layer scale”, a new predictor, called iLoc-Euk, has been developed that can be used to deal with systems containing both singleplex and multiplex proteins. As a demonstration, the jackknife cross-validation was performed with iLoc-Euk on a benchmark dataset of eukaryotic proteins classified into the following 22 location sites: (1) acrosome, (2) cell membrane, (3) cell wall, (4) centriole, (5) chloroplast, (6) cyanelle, (7) cytoplasm, (8) cytoskeleton, (9) endoplasmic reticulum, (10) endosome, (11) extracellular, (12) Golgi apparatus, (13) hydrogenosome, (14) lysosome, (15) melanosome, (16) microsome, (17) mitochondrion, (18) nucleus, (19) peroxisome, (20) spindle pole body, (21) synapse, and (22) vacuole, where none of the proteins included has pairwise sequence identity to any other in the same subset. The overall success rate thus obtained by iLoc-Euk was 79%, which is significantly higher than that by any of the existing predictors that also have the capacity to deal with such a complicated and stringent system. As a user-friendly web-server, iLoc-Euk is freely accessible to the public at the web-site http://icpr.jci.edu.cn/bioinfo/iLoc-Euk.
It is anticipated that iLoc-Euk may become a useful bioinformatics tool for Molecular Cell Biology, Proteomics, System Biology, and Drug Development. Also, its novel approach will further stimulate the development of predicting other protein attributes.

Citation: Chou K-C, Wu Z-C, Xiao X (2011) iLoc-Euk: A Multi-Label Classifier for Predicting the Subcellular Localization of Singleplex and Multiplex Eukaryotic Proteins. PLoS ONE 6(3): e18258. https://doi.org/10.1371/journal.pone.0018258

Editor: Christian Schönbach, Kyushu Institute of Technology, Japan

Received: December 17, 2010; Accepted: February 24, 2011; Published: March 30, 2011

Copyright: © 2011 Chou et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was supported by the grants from the National Natural Science Foundation of China (No. 60961003), the Key Project of Chinese Ministry of Education (No. 210116), the Province National Natural Science Foundation of Jiangxi (2009GZS0064), the Department of Education of Jiangxi Province (No. GJJ09271), and the plan for training youth scientists (stars of Jing-Gang) of Jiangxi Province. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Knowledge of the subcellular location of proteins is important as can be viewed from the following four aspects. (1) It can provide useful insights or clues about their functions; particularly, one of the fundamental goals in cell biology and proteomics is to identify the functions of proteins in the context of compartments that organize them in the cellular environment. (2) It can indicate how and in what kind of cellular environments the proteins interact with each other and with other molecules; this is especially important for the in-depth study of protein-protein interaction (PPI), one of the currently hot topics in proteomics. (3) It can help our understanding of the intricate pathways that regulate biological processes at the cellular level [1], [2] and hence it is indispensable for many studies in system biology. (4) It is very useful for identifying and prioritizing drug targets [3] during the process of drug development.

Although knowledge of protein subcellular localization can be acquired by conducting various biochemical experiments, relying on experiments alone is both time-consuming and costly. In particular, recent advances in large-scale genome sequencing have generated a huge number of protein sequences. For example, in 1986 the Swiss-Prot [4] database contained only 3,939 protein sequence entries, but now the number has jumped to 521,016 according to the release 2010_10 on 05-Oct-2010 by the UniProtKB/Swiss-Prot at http://www.expasy.org/sprot/relnotes/relstat.html, meaning that the number of protein sequence entries now is more than 132 times the number from about 24 years ago.

Facing the avalanche of protein sequences generated in the post-genomic age, it is highly desirable to develop computational methods for timely and effectively identifying various biological features of newly found proteins [5], [6], [7], [8], and particularly to develop user-friendly web-servers in this regard [9], [10]. In this study, we focus on the topic of protein subcellular localization.

Actually, the problem of predicting protein subcellular localization has been addressed by many previous investigators: during the past 19 years or so, a series of methods have been developed on this topic (see, e.g., [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24] as well as a long list of references cited in two comprehensive review articles [25], [26]). These methods each had their own advantages and indeed played a role in stimulating the development of this area, although they also each had their own limitations.

The development of protein subcellular localization prediction has generally followed two trends. One is to extract more useful information from protein sequences via different approaches or models, progressing from the model of targeting or leader sequences [11], to the amino acid composition [13], [27], to the amino acid pair composition [28], to the various modes [21], [29], [30], [31], [32], [33], [34], [35], [36] of pseudo amino acid composition [37], and to the higher-level forms of pseudo amino acid composition that incorporate functional domain information [38], gene ontology information [39], and sequential evolution information [40]. The other trend is to enhance the power of practical application by enlarging the coverage scope, from covering only 2 subcellular location sites [12], to 5 location sites [13], to 12 location sites [14], [28], and to 22 location sites [40].

Most of these existing methods were established based on the assumption that a protein resides at one, and only one, subcellular location (see, e.g., [13], [15], [28], [41], [42], [43], [44]). Such an assumption is valid only for the single-location or “singleplex” proteins but not for multiple-location or “multiplex” proteins that may simultaneously reside at, or move between, two or more different subcellular locations. Proteins with multiple location sites or dynamic features of this kind are particularly interesting because they may have some unique biological functions worthy of our special notice [2], [3]. In particular, as pointed out by Millar et al. [45], recent evidence has indicated that an increasing number of proteins have multiple locations in the cell.

Recently, a powerful predictor called Euk-mPLoc 2.0 [40] was developed that can be used to predict the subcellular localization of eukaryotic proteins among their 22 location sites, in which some of the proteins may belong to two or more subcellular locations. However, Euk-mPLoc 2.0 has the following shortcomings. (1) Only the integer numbers 0 and 1 were used to reflect the GO (gene ontology) [46], [47] information in formulating protein samples; this might cause some loss of information and limit the prediction quality. (2) The prediction of multiple locations was controlled through an optimal threshold factor (see Eq.48 of [26]); it would be more natural if we could find a more intuitive approach to deal with such a problem. (3) Although a web-server for Euk-mPLoc has been established at http://www.csbio.sjtu.edu.cn/bioinf/euk-multi-2/, only one query protein sequence at a time is allowed when using the web-server to conduct prediction; for the convenience of users in handling many query protein sequences, such a rigid limit should be relaxed.

The present study was initiated in an attempt to develop a new and more powerful predictor by addressing the above three problems.

Methods

Consider a query protein sequence as formulated by

$$\mathbf{P} = \mathrm{R}_1 \mathrm{R}_2 \mathrm{R}_3 \cdots \mathrm{R}_L \tag{1}$$

where $\mathrm{R}_1$ represents the 1st residue of the protein $\mathbf{P}$, $\mathrm{R}_2$ the 2nd residue, …, $\mathrm{R}_L$ the $L$-th residue, and they each belong to one of the 20 native amino acids. How can we use its sequence information to predict which subcellular location(s) the protein belongs to? The most straightforward method to address this problem is to use sequence-similarity-search-based tools, such as BLAST [48], [49], to search a protein database for those proteins with high sequence similarity to the query protein $\mathbf{P}$. Subsequently, the subcellular location annotations of the proteins thus found are used to deduce the subcellular location(s) of $\mathbf{P}$. Unfortunately, this kind of straightforward and intuitive approach fails to work when the query protein $\mathbf{P}$ does not have significant sequence similarity to any location-known proteins.

Thus, various non-sequential or discrete models to represent protein samples were proposed in the hope of establishing some sort of correlation or clustering through which the prediction could be more effectively carried out.

The simplest discrete model used to represent a protein sample is its amino acid (AA) composition or AAC [50]. According to the AAC-discrete model, the protein $\mathbf{P}$ of Eq.1 can be formulated by [51]

$$\mathbf{P} = [f_1 \;\; f_2 \;\; \cdots \;\; f_{20}]^{\mathbf{T}} \tag{2}$$

where $f_u\ (u = 1, 2, \ldots, 20)$ are the normalized occurrence frequencies of the 20 native amino acids in protein $\mathbf{P}$, and $\mathbf{T}$ the transposing operator. Many methods for predicting protein subcellular localization were based on the AAC-discrete model (see, e.g., [12], [13], [14], [27]). However, as we can see from Eq.2, if the AAC model is used to represent the protein $\mathbf{P}$, all its sequence-order effects are lost, and hence the prediction quality might be limited.
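As a minimal sketch of the AAC-discrete model, the following Python function computes the 20 normalized occurrence frequencies for a sequence given in one-letter code. The function name `aac`, the alphabetical ordering of the residues, and the choice to skip non-standard characters are assumptions made for illustration; they are not specified in the text.

```python
from collections import Counter

# The 20 native amino acids in one-letter code (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(sequence: str) -> list[float]:
    """Return the 20 normalized occurrence frequencies of a protein sequence."""
    # Count only the 20 native residues; other characters are ignored.
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no native amino acid residues")
    return [counts[aa] / total for aa in AMINO_ACIDS]

# Two sequences with the same composition but a different residue order
# map to the same vector, which is the loss of sequence-order information.
same = aac("ACCD") == aac("DCAC")
```

Here `same` evaluates to `True`, which illustrates why a representation based on composition alone cannot distinguish proteins that differ only in residue order.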

To avoid completely losing the sequence-order information, the pseudo amino acid composition (PseAAC) was proposed to represent the sample of a protein, as formulated by [37]

$$\mathbf{P} = [p_1 \;\; p_2 \;\; \cdots \;\; p_{20} \;\; p_{20+1} \;\; \cdots \;\; p_{20+\lambda}]^{\mathbf{T}} \tag{3}$$

where the first 20 elements are associated with the 20 amino acid components of the protein, while the additional $\lambda$ factors are used to incorporate some sequence-order information via a series of rank-different correlation factors along a protein chain.

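Actually, the PseAAC for a protein $\mathbf{P}$ can be generally formulated as

$$\mathbf{P} = [\psi_1 \;\; \psi_2 \;\; \cdots \;\; \psi_u \;\; \cdots \;\; \psi_\Omega]^{\mathbf{T}} \tag{4}$$

where the subscript $\Omega$ is an integer, and its value as well as the components $\psi_1$, $\psi_2$, … will depend on how to extract the desired information from the amino acid sequence of $\mathbf{P}$ (cf. Eq.1). The form of Eq.4 can cover the PseAAC as originally formulated in [37]; i.e., when

$$\Omega = 20 + \lambda, \qquad \psi_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & (1 \le u \le 20) \\[2ex] \dfrac{w\,\theta_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & (20+1 \le u \le 20+\lambda) \end{cases} \tag{5}$$

we immediately obtain the formulation of PseAAC as originally given in [37], where the meanings of $f_i$, $\theta_j$, and $w$ were clearly elaborated and hence there is no need to repeat them here.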
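To make the special case concrete, here is a small Python sketch that assembles a PseAAC vector from precomputed ingredients. It assumes the usual form of the original PseAAC, in which the 20 composition frequencies and the λ weighted correlation factors are divided by a common sum so that all components add up to 1. How the correlation factors themselves are computed is described in [37] and is not reproduced here; the function name `pse_aac` and the numerical values below are placeholders for illustration only.

```python
def pse_aac(f: list[float], theta: list[float], w: float) -> list[float]:
    """Assemble a (20 + lambda)-dimensional PseAAC vector.

    f     : 20 amino acid occurrence frequencies
    theta : lambda sequence-order correlation factors (computed elsewhere)
    w     : weight factor for the sequence-order part
    """
    if len(f) != 20:
        raise ValueError("f must contain exactly 20 frequencies")
    # Common denominator shared by both parts of the vector.
    denom = sum(f) + w * sum(theta)
    composition_part = [fu / denom for fu in f]
    order_part = [w * t / denom for t in theta]
    return composition_part + order_part

# Placeholder inputs: uniform composition and two correlation factors.
psi = pse_aac([0.05] * 20, [0.2, 0.4], w=0.5)
```

With λ = 2 the resulting vector has 22 components, and its entries sum to 1 by construction.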

To develop a powerful method for statistically predicting protein subcellular localization, one of the most important things is to find a formulation that reflects the core and essential features of protein samples that are closely correlated with their subcellular localization. However, this is by no means easy because features of this kind are usually deeply hidden or “buried” in piles of complicated sequences. To deal with this problem, let us consider the following approaches via the general form of PseAAC (Eq.4).

1. GO (Gene Ontology) Formulation

The GO database [46] was established according to molecular function, biological process, and cellular component. Accordingly, protein samples defined in a GO database space would be clustered in a way that better reflects their subcellular locations [26], [52]. However, in order to incorporate more information, instead of only using 0 and 1 elements as done in [40], here let us use a different approach as described below.

Step 1.

Compression and reorganization of the existing GO numbers. The GO database (version 740 released 30 July 2009) contains many GO numbers. However, these numbers do not increase successively and in order. For easier handling, a reorganization and compression procedure was applied to renumber them. For example, after such a procedure, the original GO numbers GO:0000001, GO:0000002, GO:0000003, GO:0000009, GO:0000011, GO:0000012, GO:0000015, …, GO:0090204 would become GO_compress: 0000001, GO_compress: 0000002, GO_compress: 0000003, GO_compress: 0000004, GO_compress: 0000005, GO_compress: 0000006, GO_compress: 0000007, …, GO_compress: 0011118, respectively. The GO database obtained through such a treatment is called the GO_compress database, which contains 11,118 numbers increasing successively from 1 to the last one.
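The renumbering in Step 1 can be sketched as follows, assuming it amounts to ranking the existing GO numbers in ascending order. The function name `build_go_compress` is hypothetical, and the input here is only the seven example GO numbers listed above rather than the full database.

```python
def build_go_compress(go_ids):
    """Map each GO number to its 1-based rank among the sorted GO numbers."""
    # Sort by the numeric part after "GO:" and drop duplicates.
    ordered = sorted(set(go_ids), key=lambda g: int(g.split(":")[1]))
    return {go: rank for rank, go in enumerate(ordered, start=1)}

original = [
    "GO:0000001", "GO:0000002", "GO:0000003", "GO:0000009",
    "GO:0000011", "GO:0000012", "GO:0000015",
]
table = build_go_compress(original)

# Format one entry in the zero-padded style used in the text.
label = f"GO_compress: {table['GO:0000009']:07d}"
print(label)  # GO_compress: 0000004
```

A dictionary keeps the lookup from an original GO number to its compressed index constant-time, which is convenient when many annotations have to be converted in Step 5.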

Step 2.

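Using Eq.4 with $\Omega = 11118$, the protein $\mathbf{P}$ can be formulated as

$$\mathbf{P}_{\mathrm{GO}} = [\psi_1^{G} \;\; \psi_2^{G} \;\; \cdots \;\; \psi_u^{G} \;\; \cdots \;\; \psi_{11118}^{G}]^{\mathbf{T}} \tag{6}$$

where $\psi_u^{G}\ (u = 1, 2, \ldots, 11118)$ are defined via the following steps.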

Step 3.

Use BLAST [53] to search the Swiss-Prot database (version 55.3) for the homologous proteins of the protein $\mathbf{P}$, with the expect value for the BLAST parameter.

Step 4.

Those proteins which have pairwise sequence identity with the protein $\mathbf{P}$ are collected into a set, $\mathbb{S}_{\mathbf{P}}^{\mathrm{hom}}$, called the “homology set” of $\mathbf{P}$. All the elements in $\mathbb{S}_{\mathbf{P}}^{\mathrm{hom}}$ can be deemed the “representative proteins” of $\mathbf{P}$, sharing some similar attributes such as structural conformations and biological functions [54], [55], [56]. Because they were retrieved from the Swiss-Prot database, these representative proteins must each have their own accession numbers.

Step 5.

Search the GO database at http://www.ebi.ac.uk/GOA/ to find the corresponding GO number(s) [57] for each of the accession numbers collected in Step 4, followed by converting the GO numbers thus obtained to their GO_compress numbers as described in Step 1. (Note that the relationships between the UniProtKB/Swiss-Prot protein entries and the GO numbers may be one-to-many, “reflecting the biological reality that a particular protein may function in several processes, contain domains that carry out diverse molecular functions, and participate in multiple alternative interactions with other proteins, organelles or locations in the cell” [46]. For example, the UniProtKB/Swiss-Prot protein entry “P01040” corresponds to three GO numbers, i.e., “GO:0004866”, “GO:0004869”, and “GO:0005622”.)

Step 6.

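Thus, the elements in Eq.6 are given by

$$\psi_u^{G} = \frac{\sum_{k=1}^{N_{\mathbf{P}}^{\mathrm{hom}}} \delta(u,k)}{N_{\mathbf{P}}^{\mathrm{hom}}} \qquad (u = 1, 2, \ldots, 11118) \tag{7}$$

where $N_{\mathbf{P}}^{\mathrm{hom}}$ is the number of representative proteins in $\mathbb{S}_{\mathbf{P}}^{\mathrm{hom}}$, and

$$\delta(u,k) = \begin{cases} 1, & \text{if the } k\text{-th representative protein hits the } u\text{-th GO\_compress number} \\ 0, & \text{otherwise} \end{cases} \tag{8}$$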

As we can see from Eq.7, the GO formulation derived from the above steps consists of 11,118 real numbers rather than only the elements 0 and 1 as in the GO formulation adopted in [40].
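The GO formulation can be illustrated with a short Python sketch. It reads Eq.7 as follows, which is an assumption about the intended definition: the u-th element is the fraction of representative proteins in the homology set that are annotated with the u-th GO_compress number. The function name `go_vector`, the toy dimension of 7, and the annotation sets are hypothetical; the real vector has 11,118 elements.

```python
def go_vector(homolog_go_sets, dim):
    """Build a real-valued GO vector from the homology set's annotations.

    homolog_go_sets : one set of 1-based GO_compress numbers per
                      representative protein
    dim             : number of GO_compress numbers (11118 in the text)
    """
    n = len(homolog_go_sets)
    if n == 0:
        # Empty homology set: the formulation degenerates to a naught vector.
        return [0.0] * dim
    vec = [0.0] * dim
    for hits in homolog_go_sets:
        for u in hits:
            vec[u - 1] += 1.0  # this representative protein hits number u
    return [v / n for v in vec]

# Toy example: four representative proteins, seven GO_compress numbers.
homologs = [{1, 4}, {4}, {4, 7}, {2}]
vec = go_vector(homologs, dim=7)
print(vec)  # [0.25, 0.25, 0.0, 0.75, 0.0, 0.0, 0.25]
```

In this toy case three of the four representative proteins carry GO_compress number 4, so the fourth element is 0.75 rather than the value 1 that a purely binary encoding would record.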

Note that the GO formulation of Eq.6 may become a naught vector or meaningless under either of the following situations: (1) the protein $\mathbf{P}$ does not have significant homology to any protein in the Swiss-Prot database, meaning that the homology set $\mathbb{S}_{\mathbf{P}}^{\mathrm{hom}}$ is empty; (2) its representative proteins do not contain any useful GO information for statistical prediction based on a given training dataset.

Under such a circumstance, let us consider using the sequential evolution formulation to represent the protein $\mathbf{P}$, as described below.

2. SeqEvo (Sequential Evolution) Formulation

Biology is a natural science with a historic dimension. All biological species have developed continuously starting out from a very limited number of ancestral species. This is true for protein sequences as well [56]. Their evolution involves changes of single residues, insertions and deletions of several residues [58], gene doubling, and gene fusion. With these changes accumulated over a long period of time, many similarities between initial and resultant amino acid sequences are gradually eliminated, but the corresponding proteins may still share many common attributes, such as having basically the same biological function and residing in the same subcellular location.

To incorporate the sequential evolution information into the PseAAC of Eq.4, here let us use the information of the PSSM (Position-Specific Scoring Matrix) [53], as described below.