A New Avenue for Classification and Prediction of Olive Cultivars Using Supervised and Unsupervised Algorithms

Amir H. Beiki; Saba Saboor; Mansour Ebrahimi

doi:10.1371/journal.pone.0044164

Abstract

Various methods have been used to identify cultivars of olive trees; here we used different bioinformatics algorithms to propose new tools for classifying 10 olive cultivars based on RAPD and ISSR genetic marker datasets generated from PCR reactions. Five RAPD markers (OPA0a21, OPD16a, OP01a1, OPD16a1 and OPA0a8) and five ISSR markers (UBC841a4, UBC868a7, UBC841a14, U12BC807a and UBC810a13) were selected as the most important markers by all attribute weighting models. K-Medoids unsupervised clustering run on the SVM dataset assigned every olive cultivar to the right class. All 176 trees induced by decision tree models were meaningful, and the UBC841a4 attribute clearly distinguished between foreign and domestic olive cultivars with 100% accuracy. Predictive machine learning algorithms (SVM and Naïve Bayes) were also able to predict the right class of olive cultivars with 100% accuracy. For the first time, our results show that data mining techniques can be used effectively to distinguish between plant cultivars, and that the machine learning based systems proposed in this study can predict new olive cultivars with the best possible accuracy.

Citation: Beiki AH, Saboor S, Ebrahimi M (2012) A New Avenue for Classification and Prediction of Olive Cultivars Using Supervised and Unsupervised Algorithms. PLoS ONE 7(9): e44164. https://doi.org/10.1371/journal.pone.0044164

Editor: Jérémie Bourdon, Université de Nantes, France

Received: June 10, 2012; Accepted: July 30, 2012; Published: September 5, 2012

Copyright: © Ebrahimi et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: No current external funding sources for this study.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Olive (Olea europaea L.) was domesticated by 5800 B.P. [1], probably in both the eastern and western Mediterranean basin [2]–[4]. Archaeological findings indicate that olive cultivation in Iran dates back more than 2000 years [5]. Until recent years, cultivar identification was based on morphological and agronomic traits. However, the recognition of olive cultivars based on phenotypic characters is often problematic, especially at the early stages of tree development [6]. This has led to great confusion and uncertainty about the current status of olive germplasm in many countries. The ability to discriminate and predict olive cultivars is important for successful breeding programs and improved management of genetic resources [7]. With the development of PCR-based DNA markers such as RAPD [8], SSR [9], AFLP [10] and SNP [11], marker technology today offers powerful tools to analyze the plant genome. These markers have enabled the identification of genes and genomic regions associated with the expression of qualitative and quantitative traits and have led to a better understanding of the complex genomes of various plants. The use of molecular markers to manage olive germplasm is particularly advantageous because the olive has an exceptionally long juvenile period [12]. Recently, bioinformatics and data mining applications have been widely used to interpret information from biological data [13]–[16].

The main goal of this work was to construct a molecular database based on RAPD and ISSR markers for olive cultivars and to find specific molecular markers that quickly distinguish between Iranian and foreign olive tree cultivars.

Materials and Methods

Genomic DNA of five Iranian and five foreign olive (Olea europaea L.) cultivars was isolated by the Mini prep method from freshly harvested young leaves of five plants of each cultivar, grown in the IKIU fields of Qazvin University (with permission from the head of the School of Agriculture, Qazvin University, Iran; the cultivars have not been designated as protected or endangered species). To eliminate the effects of impurity, only these ten cultivars, which were officially certified by administrative bodies as pure and the most reliable, were chosen for lab experiments. A total of 14 primers ((AG)₈T, (AG)₈C, (GA)₈T, (GA)₈C, (GA)₈A, (CA)₈G, (AG)₈CT, (AG)₈CC, (AG)₈CA, (GA)₈CC, (GA)₈CCY, (AC)₈YA, (GA)₈A and (GGAGA)₃) for inter-simple sequence repeat-polymerase chain reaction (ISSR-PCR) and 14 primers (5′-GTGATCGCAG-3′, 5′-CAATCGCCGT-3′, 5′-GTTTCGCTCC-3′, 5′-AAGACCCCTC-3′, 5′-GGTGACTGTG-3′, 5′-TCTGTGCCAC-3′, 5′-TCGGCGGTTC-3′, 5′-CCGAATTCCC-3′, 5′-CACAGAGGGA-3′, 5′-GTGACGTAGG-3′, 5′-TGAGCGGACA-3′, 5′-CATCCGTGCT-3′, 5′-CCTGGGCTTC-3′, and 5′-GTCCCGTTCA-3′) for random amplified polymorphic DNA (RAPD) were used in the study (Table 1).

Table 1. Names and sequences of the ISSR and RAPD markers.

https://doi.org/10.1371/journal.pone.0044164.t001

ISSR-PCR was conducted in a reaction volume of 15 µl containing 30 ng template DNA, 0.2 µmol/L primer, 200 µmol/L each dNTP, 10 mmol/L Tris-Cl (pH 8.3), 50 mmol/L KCl, 2.0 mmol/L MgCl2, and 1 U of Taq polymerase. PCR amplification conditions were: initial denaturation at 94°C for 5 min; 40 cycles of denaturation at 94°C for 1 min, annealing at 50°C for 1 min, and extension at 72°C for 2 min; and a final extension at 72°C for 7 min. PCR was performed in a 96-well plate thermal cycler (Eppendorf, Germany). The amplified products were mixed with loading dye (0.4 g/ml sucrose and 2.5 mg/ml bromophenol blue), resolved on 18 mg/ml.

Table 2. The numbers and the averages of most important alleles (fragments) selected by different attribute weighting algorithms.

https://doi.org/10.1371/journal.pone.0044164.t002

The RAPD technique consists of preferential amplification of random sequences by PCR. In this assay, 10 different primers were used (Table 1). Each 25 µL PCR reaction mixture consisted of 50 ng genomic DNA, 0.2 mM dNTPs, 2 mM MgCl2, 10 pmol primer, 2.5 µL 10× Taq buffer, and 1 unit of Taq polymerase. Samples were subjected to the following thermal profile: 4 min of denaturing at 94°C; 45 cycles of three steps (30 s of denaturing at 94°C, 1 min of annealing at 36°C, and 2 min of elongation at 72°C); and a final elongation step of 7 min at 72°C. The amplified fragments were separated on 1.2% (w/v) agarose gels (1× TAE) at 80 V for 2 h. The gels were stained with ethidium bromide to visualize the RAPD and ISSR fragments. Fragments between 200 and 4000 base pairs (bp) were visually scored as present (1) or absent (0).
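
The present (1) or absent (0) scoring described above can be pictured as a presence/absence matrix with one row per sample and one column per fragment. The following is a minimal Python sketch under stated assumptions: the function name `score_bands`, the primer labels, the sample names and the band sizes are all hypothetical, and band sizes are assumed to have been read from the gel already.

```python
from typing import Dict, List, Tuple


def score_bands(observed: Dict[str, Dict[str, List[int]]],
                min_bp: int = 200,
                max_bp: int = 4000) -> Tuple[List[str], Dict[str, List[int]]]:
    """Turn per-sample band sizes into a 0/1 presence/absence matrix."""
    # Every (primer, size) band seen in any sample inside the scoring window
    # becomes one attribute (one column of the matrix).
    bands = sorted({(primer, size)
                    for per_primer in observed.values()
                    for primer, sizes in per_primer.items()
                    for size in sizes
                    if min_bp <= size <= max_bp})
    names = [f"{primer}_{size}bp" for primer, size in bands]
    # 1 if the sample shows that band, 0 otherwise.
    matrix = {sample: [1 if size in per_primer.get(primer, []) else 0
                       for primer, size in bands]
              for sample, per_primer in observed.items()}
    return names, matrix


# Hypothetical band sizes (bp) for two samples and two primers.
observed = {
    "cultivar_A": {"UBC841": [450, 900], "OPA01": [150, 1200]},
    "cultivar_B": {"UBC841": [900], "OPA01": [1200, 5000]},
}
names, matrix = score_bands(observed)
print(names)
print(matrix)
```

Bands outside the 200 to 4000 bp window (150 bp and 5000 bp in this toy input) are dropped before scoring, so they never become attributes.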

A dataset of 10 cultivars with 402 reproducible RAPD and ISSR fragments (attributes) was prepared and imported into RapidMiner software [RapidMiner 5.2, Rapid-I GmbH, Stochumer Str. 475, 44227 Dortmund, Germany]. The steps detailed below were then applied to this dataset.

Data Cleaning

Useless attributes were removed from the dataset. Nominal attributes were regarded as useless when the most frequent values were above or below per cent of all examples. After cleaning, this database was labelled the final cleaned database (FCdb).

Attribute Weighting

To identify the most important features that distinguish olive cultivars, 10 different attribute weighting algorithms (Information Gain, Information Gain Ratio, Rule, Deviation, Chi Squared statistic, Gini Index, Uncertainty, Relief, SVM and PCA) were used (for more information see [14], [15], [17]).

Table 3. The attribute weighting models, the numbers of important features selected by each model, and the most important variables selected by each attribute weighting algorithm.

https://doi.org/10.1371/journal.pone.0044164.t003

Attribute selection

Applying the attribute weighting models to the dataset assigned each allele attribute (feature) a value between 0 and 1, which indicated the importance of that attribute with regard to the target attribute (Iranian or foreign cultivar). All variables with weights higher than 0.50 were selected and 10 new datasets were created. These newly formed datasets were named according to their attribute weighting models (Information Gain, Information Gain Ratio, Rule, Deviation, Chi Squared, Gini Index, Uncertainty, Relief, SVM and PCA) and were subjected to the subsequent supervised and unsupervised models. Each supervised or unsupervised model was run 11 times: first on the main dataset (FCdb) and then on each of the 10 newly formed datasets from attribute weighting and selection.
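
As an illustration of the weighting and selection step, the sketch below computes an information gain weight for each 0/1 marker, rescales the weights to the 0 to 1 range, and keeps the markers whose weight exceeds 0.50. It is a minimal Python sketch, not the RapidMiner implementation: dividing by the largest gain is an assumed normalization, and the marker names and values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(column, labels):
    """Reduction in class entropy obtained by splitting on one marker."""
    n = len(labels)
    remainder = 0.0
    for value in set(column):
        subset = [lab for v, lab in zip(column, labels) if v == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def select_attributes(data, labels, threshold=0.5):
    """Weight every marker, rescale to [0, 1], keep weights above threshold."""
    gains = {name: information_gain(col, labels) for name, col in data.items()}
    top = max(gains.values())
    # Assumed normalization: divide by the largest gain.
    weights = {name: (g / top if top > 0 else 0.0) for name, g in gains.items()}
    selected = [name for name, w in weights.items() if w > threshold]
    return weights, selected


# Hypothetical data: 5 Iranian and 5 foreign samples, three 0/1 markers.
labels = ["Iranian"] * 5 + ["foreign"] * 5
data = {
    "marker_a": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # separates the classes
    "marker_b": [1, 1, 1, 1, 0, 0, 0, 0, 0, 1],  # partly informative
    "marker_c": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],  # nearly uninformative
}
weights, selected = select_attributes(data, labels)
print(weights)
print(selected)
```

Each of the other weighting schemes would replace `information_gain` with its own scoring function while the thresholding step stays the same.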

Unsupervised Clustering Algorithms

The clustering algorithms listed below were applied to the 10 newly created datasets (generated as the outcomes of the 10 different attribute weighting algorithms) as well as to the main dataset (FCdb).

K-Means.

This operator uses kernels to estimate the distance between objects and clusters. Because of the nature of kernels, it is necessary to sum over all elements of a cluster to calculate one distance.

K-Medoids.

This operator represents an implementation of k-Medoids. This operator will create a cluster attribute if it is not yet present.
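
To make the idea concrete, the following is a minimal Python sketch of k-medoids on 0/1 marker profiles. It uses the simple alternating variant (assign each sample to its nearest medoid, then pick the member that minimizes the total within-cluster distance) with Hamming distance; the distance measure, the naive initialization with the first k samples, and the toy profiles are assumptions for illustration and do not describe the internals of the RapidMiner operator.

```python
def hamming(a, b):
    """Number of marker positions at which two 0/1 profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def assign(points, medoids):
    """Label every point with the index of its nearest medoid."""
    return [min(range(len(medoids)),
                key=lambda c: hamming(p, points[medoids[c]]))
            for p in points]


def k_medoids(points, k, max_iter=100):
    """Alternating k-medoids; returns (medoid indices, cluster labels)."""
    medoids = list(range(k))  # assumed naive start: the first k points
    for _ in range(max_iter):
        labels = assign(points, medoids)
        updated = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster empties
                updated.append(medoids[c])
                continue
            # New medoid: the member with the smallest summed distance
            # to all other members of its cluster.
            updated.append(min(
                members,
                key=lambda i: sum(hamming(points[i], points[j])
                                  for j in members)))
        if updated == medoids:  # converged
            break
        medoids = updated
    return medoids, assign(points, medoids)


# Hypothetical profiles: the first three and last three form two groups.
points = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
medoids, clusters = k_medoids(points, k=2)
print(medoids, clusters)
```

Unlike k-means, the cluster centers here are always actual samples, which keeps them interpretable as real marker profiles.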

Support vector clustering (SVC).

This operator represents an implementation of the Support Vector Clustering algorithm. It will create a cluster attribute if one is not yet present.

Expectation maximization (EM).

This operator represents an implementation of the EM-algorithm.

Figure 1. Application of K-Medoids to the SVM dataset categorized each cultivar into the right cluster.

https://doi.org/10.1371/journal.pone.0044164.g001

Table 4. The numbers of olive cultivars correctly predicted by three different unsupervised clustering algorithms run on all databases.

https://doi.org/10.1371/journal.pone.0044164.t004

Supervised Classification

Three classes of supervised classification (Decision Trees, SVM and Bayesian models) were applied as follows. To calculate the accuracy of each model, 5-fold cross validation [18] was used to train and test the models on all patterns. To perform cross validation, all the records were randomly divided into five parts; four parts were used for training and the fifth for testing. The process was repeated five times, and the accuracy for true, the accuracy for false and the total accuracy were calculated. The final accuracy is the average of the accuracies over all five tests.
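
The split into five parts, with each part held out once, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the helper names are hypothetical, the shuffling seed is arbitrary, the data are invented, and a 1-nearest-neighbour learner stands in for the actual models only because it fits in a few lines.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Shuffle the record indices and deal them into k disjoint parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_accuracy(X, y, fit, k=5, seed=0):
    """Average test accuracy over k folds; fit(X, y) returns a predictor."""
    scores = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


def fit_1nn(train_X, train_y):
    """Stand-in learner: label of the closest training profile (Hamming)."""
    def predict(x):
        dists = [sum(a != b for a, b in zip(x, t)) for t in train_X]
        return train_y[dists.index(min(dists))]
    return predict


# Hypothetical 0/1 profiles: the first three markers differ between classes.
X = [[1, 1, 1, 0, 0], [1, 1, 1, 0, 1], [1, 1, 1, 1, 0],
     [1, 1, 1, 1, 1], [1, 1, 1, 0, 0],
     [0, 0, 0, 0, 0], [0, 0, 0, 0, 1], [0, 0, 0, 1, 0],
     [0, 0, 0, 1, 1], [0, 0, 0, 0, 0]]
y = ["Iranian"] * 5 + ["foreign"] * 5
folds = k_fold_indices(len(X))
accuracy = cross_val_accuracy(X, y, fit_1nn)
print(folds, accuracy)
```

With 10 records and five parts, every test fold holds only two records, so each fold's accuracy can take only the values 0, 0.5 or 1.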

Figure 2. Decision tree generated by three models run with the Gini Index criterion.

As may be inferred from the figure, UBC841A4 and UBC868A7 fragments were the most important attribute alleles in distinguishing Iranian from foreign cultivars.

https://doi.org/10.1371/journal.pone.0044164.g002

Decision Trees

Six tree induction models (Decision Tree, Decision Tree Parallel, Decision Stump, Random Tree, ID3 Numerical and Random Forest) were run on the main dataset (FCdb). Each tree induction model was run with four different criteria: Gain Ratio, Information Gain, Gini Index and Accuracy. In addition, a weight-based parallel decision tree model, which learns a pruned decision tree based on an arbitrary feature relevance test (attribute weighting scheme as inner operator), was run with 13 different weighting criteria (SVM, Gini Index, Uncertainty, PCA, Chi Squared, Rule, Relief, Information Gain, Information Gain Ratio, Deviation, Correlation, Value Average, and Tree Importance). The accuracy of each tree was computed as described above.
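
The simplest of these models, a decision stump, splits on a single attribute. The sketch below picks the 0/1 marker with the lowest weighted Gini impurity and labels each branch with its majority class. It is a minimal Python sketch, not the RapidMiner operator; the function names, marker names and values are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_by_value(column, labels, value):
    """Class labels of the samples whose marker equals the given value."""
    return [lab for v, lab in zip(column, labels) if v == value]


def best_stump(data, labels):
    """Pick the marker with the purest 0/1 split; return (name, rule)."""
    n = len(labels)
    best_name, best_impurity = None, float("inf")
    for name, column in data.items():
        impurity = 0.0
        for value in (0, 1):
            subset = split_by_value(column, labels, value)
            if subset:
                impurity += len(subset) / n * gini(subset)
        if impurity < best_impurity:
            best_name, best_impurity = name, impurity
    # Each branch predicts the majority class among its samples.
    rule = {}
    for value in (0, 1):
        subset = split_by_value(data[best_name], labels, value)
        if subset:
            rule[value] = Counter(subset).most_common(1)[0][0]
    return best_name, rule


# Hypothetical data: marker_x separates the classes, marker_y does not.
labels = ["Iranian"] * 5 + ["foreign"] * 5
data = {
    "marker_y": [1, 1, 1, 0, 0, 1, 0, 0, 0, 1],
    "marker_x": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
}
stump_marker, stump_rule = best_stump(data, labels)
print(stump_marker, stump_rule)
```

Swapping `gini` for an entropy-based score would give the Information Gain criterion instead; deeper trees repeat the same selection within each branch.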

Table 5. The accuracies, precisions and recalls of tree induction models on the Final Cleaned database (FCdb), computed with 5-fold cross validation.

https://doi.org/10.1371/journal.pone.0044164.t005

Support Vector Machine Approach

Support Vector Machines (SVMs) are popular and powerful techniques for supervised data classification and prediction, so SVM, LibSVM, SVM Linear and SVME were used here to implement different models to predict olive cultivars based on Iranian or foreign features. Briefly, the main database (FCdb) was transformed to SVM format and scaled (to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges), and grid search was used to find the optimal values for the operator parameters. To prevent overfitting, 5-fold cross validation was applied. The dataset was divided into 5 parts, with 4 parts used as the training set and the last part as the testing set; the procedure was repeated for 10 different testing sets and the average accuracy was computed. The RBF kernel, which nonlinearly maps samples into a higher dimensional space and can handle the case where the relation between class labels and attributes is nonlinear, was used to run the model. Other kernels such as linear, poly, sigmoid and pre-computed were also applied to the dataset to find the best accuracy.
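
For reference, the RBF kernel has the standard form K(x, y) = exp(-gamma * ||x - y||^2). The short Python sketch below evaluates it on two 0/1 marker profiles; the value of gamma and the example profiles are assumptions chosen for illustration, not parameters reported in this study.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical 0/1 marker profiles that differ at two positions.
profile_a = [1, 0, 1, 1]
profile_b = [1, 1, 1, 0]
same = rbf_kernel(profile_a, profile_a)
different = rbf_kernel(profile_a, profile_b)
print(same, different)
```

For 0/1 attributes the squared Euclidean distance equals the number of differing markers, so the kernel value decays exponentially with the number of band differences between two samples.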

Naïve Bayes

Naïve Bayes, which is based on Bayes' conditional probability rule, is used for performing classification tasks. When sample sizes are small (as in our experiments, with just 5 cultivars in each class), a Bayesian approach can be applied to classification problems with far more predictors than samples; such approaches have been widely used before (for more details see [19], [20]). Naïve Bayes assumes the predictors are statistically independent, which makes it an effective classification tool that is easy to interpret. Two models, Naïve Bayes (returns a classification model using estimated normal distributions) and Naïve Bayes kernel (returns a classification model using estimated kernel densities), were used, and the accuracy of each model in predicting the right Iranian or foreign class was computed as stated before.
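
The independence assumption can be illustrated with a small Python sketch. Note that this is a Bernoulli variant suited to 0/1 markers, swapped in for brevity; it is not the normal-distribution or kernel-density estimators described above. The Laplace smoothing constant `alpha`, the function names and the toy profiles are all assumptions.

```python
import math


def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed P(marker = 1 | class)."""
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == c]
        log_prior = math.log(len(rows) / len(X))
        # Laplace smoothing keeps every probability strictly in (0, 1).
        probs = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                 for j in range(len(X[0]))]
        model[c] = (log_prior, probs)
    return model


def predict_nb(model, x):
    """Return the class with the highest log posterior for profile x."""
    def log_posterior(c):
        log_prior, probs = model[c]
        # Independence assumption: per-marker log likelihoods simply add.
        return log_prior + sum(math.log(p if v == 1 else 1.0 - p)
                               for v, p in zip(x, probs))
    return max(model, key=log_posterior)


# Hypothetical 0/1 profiles for 5 Iranian and 5 foreign samples.
X = [[1, 1, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1], [1, 0, 0],
     [0, 1, 1], [0, 0, 1], [0, 1, 1], [0, 0, 0], [0, 1, 1]]
y = ["Iranian"] * 5 + ["foreign"] * 5
nb_model = fit_bernoulli_nb(X, y)
first = predict_nb(nb_model, [1, 1, 0])
second = predict_nb(nb_model, [0, 0, 1])
print(first, second)
```

Because each marker contributes one additive term to the log posterior, it is easy to read off which markers drive a given prediction.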

Figure 3. Kernel distribution model distinguishing between two classes of olive cultivars based on allele attribute type.

https://doi.org/10.1371/journal.pone.0044164.g003

Results

As mentioned in Materials and Methods, the initial dataset contained 10 cultivars with 400 reproducible RAPD and ISSR fragments (attributes). Following removal of duplicates, useless attributes, and correlated features (data cleaning), 312 features remained; these attribute fragments were polymorphic, ranging in size from 100 to 3000 bp.