A novel biclustering algorithm of binary microarray data: BiBinCons and BiBinAlter

Saber, Haifa Ben; Elloumi, Mourad

doi:10.1186/s13040-015-0070-4

Research
Open access
Published: 30 November 2015

Haifa Ben Saber¹ & Mourad Elloumi¹,²

BioData Mining volume 8, Article number: 38 (2015)


Abstract

Biclustering of microarray data has been the subject of extensive research, yet none of the existing biclustering algorithms is perfect. Constructing biologically significant groups of biclusters from large microarray datasets remains an open problem. Biological validation of biclusters is one of the most important open issues: so far, the literature offers no general guidelines on how to validate extracted biclusters biologically. In this paper, we develop two biclustering algorithms for binary microarray data, called BiBinCons and BiBinAlter, both adopting the Iterative Row and Column Clustering Combination (IRCCC) approach. BiBinAlter is an improvement of BiBinCons: it uses the EvalStab and IndHomog evaluation functions in addition to the CroBin one (Bioinformatics 20:1993–2003, 2004). BiBinAlter can extract biclusters of good quality with better p-values.


Introduction

DNA microarray technology is a revolutionary tool that enables the measurement of the expression levels of thousands of genes in a single experiment under diverse experimental conditions. This technology produces large amounts of raw data that can provide a wealth of information on the genes concerned, and it has proved to be a valuable tool for many biological and medical applications. Microarray data analysis is a crucial step for these applications, since it extracts the pertinent biological knowledge embedded in these large masses of data. However, extracting this knowledge is far from trivial, hence the need for data mining techniques. Many of these techniques have been applied to microarray data, among them clustering [1]. Clustering assumes that all the genes of a group have a similar behavior under all the conditions. However, some genes have a similar behavior only under a subset of conditions, and clustering is too simplistic to detect such cases [1]. A more interesting technique, called biclustering [2], identifies groups of genes that have a similar behavior only under a subset of conditions.

In this paper, we develop new biclustering algorithms of microarray data. These data are usually coded by a data matrix M(I,J), where the i^th row, i∈I={1,2,…,n}, represents the i^th gene, the j^th column, j∈J={1,2,…,m}, represents the j^th condition and the cell M[i,j] represents the expression level of the i^th gene under the j^th condition.

The main objective is then to identify groups of genes that are coherent under groups of conditions. These groups are called biclusters. Genes belonging to the same bicluster have close biological functions. Note that, in its general form, the biclustering problem is NP-hard [2].

The rest of this paper is organized as follows. In the second section, we introduce some preliminaries. In the third section, we present the BiBinCons algorithm. In the fourth section, we present the BiBinAlter algorithm. In the fifth section, we present an illustrative example and an experimental study. Finally, we present the conclusion of this paper.

Preliminaries

As we said in the introduction, the biclustering algorithms that we present in this paper are based on the CroBin function [1] for the evaluation of a group of biclusters. So, let us present some preliminaries related to this function. Let I={1,2,…,n} be a set of indices of n genes, J={1,2,…,m} be a set of indices of m conditions and $M_{b}(I,J) = \left(m_{ij}^{b}\right)$, i∈I and j∈J, be a binary data matrix associated with I and J. The biclustering problem for binary microarray data can be formulated as the minimization of the criterion W(z,w,a):

$$ W(z,w,a)=\sum\limits_{k=1}^{g}\sum\limits_{l=1}^{h}\sum\limits_{i\in z_{k}}\sum\limits_{j\in w_{l}}\left|m_{ij}^{b}-a_{kl}\right|. $$

((2.1))

where z={z₁,z₂,…,z_g} is a partition of I into g clusters; it can equivalently be described by giving, for each row i of M_b(I,J), the number of its cluster. Likewise, w={w₁,w₂,…,w_h} is a partition of J into h clusters, which can be described by giving, for each column j of M_b(I,J), the number of its cluster.

a = (a_kl) is the summary matrix of M_b(I,J): a binary g×h matrix, where k (resp. l) indexes the clusters of rows (resp. columns), and a_kl is defined by the m_ij’s satisfying the following condition:

$$ z_{ik}w_{jl}=1 $$

((2.2))

where z_ik=1 if the i^th row of M_b(I,J) belongs to the k^th cluster of I otherwise z_ik=0. w_jl=1 if the j^th column of M_b(I,J) belongs to the l^th cluster of J otherwise w_jl=0.

By using Eq. (2.2), Eq. (2.1) can be reformulated as follows:

$$ W(z,w,a)=\sum\limits_{i,j,k,l}z_{ik}w_{jl}\left|m_{ij}^{b}-a_{kl}\right| $$

((2.3))
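As an illustration of Eq. (2.3), the following minimal sketch evaluates the criterion on a small matrix. Because z_ik w_jl equals 1 only for the cluster of row i and the cluster of column j, the quadruple sum reduces to a count of the cells that differ from the summary value of their block. The function name, the 0-based labels and the 4×5 matrix are assumptions made for the illustration; the matrix is not the one of the illustrative example of this paper.

```python
def crobin_criterion(m, z, w, a):
    """W(z, w, a) of Eq. (2.3): number of cells that differ from the
    summary value of their block.

    m: binary matrix as a list of rows; z[i] and w[j] are 0-based cluster
    labels of row i and column j; a[k][l] is the summary value of block (k, l).
    """
    return sum(
        abs(m[i][j] - a[z[i]][w[j]])
        for i in range(len(m))
        for j in range(len(m[0]))
    )


# Hypothetical 4x5 binary matrix (illustration only).
m = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 1, 1, 1],
]
z = [0, 0, 1, 1]        # two clusters of rows
w = [0, 0, 1, 1, 1]     # two clusters of columns
a = [[1, 0], [0, 1]]    # summary matrix

print(crobin_criterion(m, z, w, a))  # 2
```

Only the cells m[1][3] and m[3][1] disagree with their block, hence the value 2.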

By adopting the IRCCC approach, we can make biclustering by minimizing W(z,w,a) defined by Eq. (2.3) and by fixing either w or z:

If w is fixed, the minimization is given by:
$$ W(z,a|w)=\sum\limits_{i,k,l}z_{ik}\left|u_{il}-(|w_{l}|\times a_{kl})\right| $$
((2.4))

where $u_{il}=\sum_{j\in w_{l}}m_{ij}^{b}=\sum_{j}w_{jl}m_{ij}^{b}$ and u is a matrix of size n×h. Indeed, since a_kl is binary, $\sum\limits_{i,j,k,l}z_{ik}w_{jl}\left|m_{ij}^{b}-a_{kl}\right|=\sum\limits_{i,k}z_{ik}\sum\limits_{j,l}w_{jl}\left|m_{ij}^{b}-a_{kl}\right|=\sum\limits_{i,k}z_{ik}\sum\limits_{l}\left|u_{il}-(|w_{l}|\times a_{kl})\right|$.
If z is fixed, the minimization is given by:
$$ W(w,a|z)=\sum\limits_{j,k,l}w_{jl}\left|v_{kj}-(|z_{k}|\times a_{kl})\right| $$
((2.5))

where $v_{kj}=\sum_{i\in z_{k}}m_{ij}^{b}=\sum_{i}z_{ik}m_{ij}^{b}$ and v is a matrix of size g×m. Indeed, $\sum\limits_{i,j,k,l}z_{ik}w_{jl}\left|m_{ij}^{b}-a_{kl}\right|=\sum\limits_{j,l}w_{jl}\sum\limits_{i,k}z_{ik}\left|m_{ij}^{b}-a_{kl}\right|=\sum\limits_{j,l}w_{jl}\sum\limits_{k}\left|v_{kj}-(|z_{k}|\times a_{kl})\right|$.
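The reduced form with w fixed can be sketched as follows. The code builds the table of costs $\sum_{l}|u_{il}-(|w_{l}|\times a_{kl})|$ for every row i and every row cluster k, then assigns each row to its cheapest cluster. The function names, the tie-breaking rule (lowest label) and the matrix are assumptions made for the illustration, not part of the original method description.

```python
def row_cost_table(m, w, a):
    """cost[i][k] = sum over l of |u_il - |w_l| * a_kl|, i.e. the cost of
    putting row i into row cluster k when w and a are fixed."""
    h = len(a[0])
    sizes = [w.count(l) for l in range(h)]                       # |w_l|
    u = [[sum(row[j] for j in range(len(row)) if w[j] == l)      # u_il
          for l in range(h)] for row in m]
    return [[sum(abs(u_i[l] - sizes[l] * a_k[l]) for l in range(h))
             for a_k in a] for u_i in u]


def reassign_rows(m, w, a):
    """Row step: each row goes to its cheapest cluster (ties: lowest label)."""
    return [min(range(len(a)), key=lambda k: costs[k])
            for costs in row_cost_table(m, w, a)]


# Hypothetical 4x5 binary matrix (illustration only).
m = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 1, 1, 1],
]
w = [0, 0, 1, 1, 1]
a = [[1, 0], [0, 1]]

print(row_cost_table(m, w, a))  # [[0, 5], [1, 4], [5, 0], [4, 1]]
print(reassign_rows(m, w, a))   # [0, 0, 1, 1]
```

Summing the selected costs (0+1+0+1) gives 2, the number of cells of this matrix that disagree with the summary value of their block, as the equality between Eq. (2.3) and Eq. (2.4) requires. The column step of Eq. (2.5) is symmetric, with v in place of u.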

Remark.

A colored block in the binary matrix M_b(I,J) is represented by a colored cell in the summary matrix A, where each colored cell contains the majority binary value of the corresponding colored block, e.g., if the majority of the cells of a block in M_b(I,J) contain 1, then the corresponding cell in A also contains 1.

Example.

This example shows a binary data matrix M_b(I,J) and the corresponding cell in the summary matrix A.

A=(0,0;1,1;1,1), i.e., a₁₁=0,a₁₂=0;a₂₁=1, a₂₂=1;a₃₁=1,a₃₂=1.
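The majority rule of the remark can be sketched as follows. The matrix below is hypothetical: it was chosen so that the computed summary matrix coincides with the A given above, but it is not the matrix of the original example. Returning 0 for ties and for empty blocks is an assumption of this sketch.

```python
def summary_matrix(m, z, w, g, h):
    """a[k][l] = majority binary value of block (k, l).

    Ties and empty blocks give 0 (an assumption of this sketch).
    z[i] and w[j] are 0-based cluster labels.
    """
    ones = [[0] * h for _ in range(g)]
    size = [[0] * h for _ in range(g)]
    for i, row in enumerate(m):
        for j, value in enumerate(row):
            ones[z[i]][w[j]] += value
            size[z[i]][w[j]] += 1
    return [[1 if 2 * ones[k][l] > size[k][l] else 0 for l in range(h)]
            for k in range(g)]


# Hypothetical 4x5 binary matrix with 3 row clusters and 2 column clusters.
m = [
    [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1],
    [1, 1, 1, 1, 0],
]
z = [0, 1, 1, 2]
w = [0, 0, 1, 1, 1]

print(summary_matrix(m, z, w, g=3, h=2))  # [[0, 0], [1, 1], [1, 1]]
```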

In the next two sections, we develop two IRCCC algorithms for the biclustering of binary microarray data, called BiBinCons and BiBinAlter, respectively.

First IRCCC Algorithm: BiBinCons

Our biclustering algorithm, BiBinCons, receives as input a binary matrix M_b(I,J) and gives as output (z_opt,w_opt,A_opt), where z_opt and w_opt are respectively the final clusterings of the rows and columns of M_b(I,J), and A_opt is the summary matrix related to z_opt and w_opt. To describe BiBinCons more formally, we use the following notations:

z₀ : initial clustering of the rows of M_b(I,J)

w₀ : initial clustering of the columns of M_b(I,J)

A₀ : initial summary matrix related to z₀ and w₀

z_c : current clustering of the rows of M_b(I,J)

w_c : current clustering of the columns of M_b(I,J)

$A_{c}^{'}$ : current intermediate summary matrix related to z_c and w_{c−1}

A_c : current summary matrix related to z_c and w_c

z_opt : final clustering of the rows of M_b(I,J)

w_opt : final clustering of the columns of M_b(I,J)

A_opt : final summary matrix related to z_opt and w_opt
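With these notations, one generic alternation in the IRCCC style can be sketched as follows: a row step with w fixed (Eq. 2.4), an intermediate summary matrix, a column step with z fixed (Eq. 2.5), and a final summary matrix, repeated until nothing changes. This is only a sketch under assumptions (majority summary with ties set to 0, lowest label on ties, stop when the triple is unchanged); it is not the exact BiBinCons procedure, and all function names and the test matrix are hypothetical.

```python
def summary(m, z, w, g, h):
    """Majority value of each block; ties and empty blocks give 0."""
    ones = [[0] * h for _ in range(g)]
    size = [[0] * h for _ in range(g)]
    for i, row in enumerate(m):
        for j, v in enumerate(row):
            ones[z[i]][w[j]] += v
            size[z[i]][w[j]] += 1
    return [[int(2 * ones[k][l] > size[k][l]) for l in range(h)] for k in range(g)]


def best_labels(vectors, other_labels, profiles):
    """Assign each vector to the profile with the fewest mismatches.

    profiles[k][c] is the summary value seen by a vector of cluster k at a
    position whose own cluster is c; ties go to the lowest label.
    """
    def cost(vec, profile):
        return sum(abs(v - profile[c]) for v, c in zip(vec, other_labels))
    return [min(range(len(profiles)), key=lambda k: cost(vec, profiles[k]))
            for vec in vectors]


def alternate(m, z, w, g, h, max_iter=100):
    a = summary(m, z, w, g, h)
    columns = [list(col) for col in zip(*m)]
    for _ in range(max_iter):
        z_new = best_labels(m, w, a)                    # row step, w fixed
        a_mid = summary(m, z_new, w, g, h)              # intermediate summary
        a_mid_t = [list(col) for col in zip(*a_mid)]    # one profile per column cluster
        w_new = best_labels(columns, z_new, a_mid_t)    # column step, z fixed
        a_new = summary(m, z_new, w_new, g, h)
        if (z_new, w_new, a_new) == (z, w, a):          # nothing changed: stop
            break
        z, w, a = z_new, w_new, a_new
    return z, w, a


# Hypothetical 4x5 binary matrix and arbitrary initial clusterings.
m = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 1, 1, 1],
]
print(alternate(m, z=[0, 1, 0, 1], w=[0, 1, 0, 1, 1], g=2, h=2))
# ([0, 0, 1, 1], [0, 0, 1, 1, 1], [[1, 0], [0, 1]])
```

On this small input, the alternation reaches a stable triple after three passes. Such a scheme only guarantees a local minimum of W(z,w,a), so the result depends on the initial clusterings.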

Second IRCCC Algorithm: BiBinAlter

Our biclustering algorithm, BiBinAlter, receives as input a binary matrix M_b(I,J) and gives as output (z_opt,w_opt,A_opt), where z_opt and w_opt are respectively the final clusterings of the rows and columns of M_b(I,J), and A_opt is the summary matrix related to z_opt and w_opt. In addition to the CroBin criterion, BiBinAlter uses the two evaluation functions defined below:

EvalStab_c represents the frequency of 0’s in the current group of biclusters at the c^th iteration. It is defined as follows:

$$ EvalStab=\sum\limits_{k,l}\frac{|a_{kl}-(|z_{k}|\times|w_{l}|)|}{|z_{k}||w_{l}|} $$

((4.1))

IndHomog_c represents the tradeoff between the number of mixed biclusters (containing both 0’s and 1’s) and the total number of biclusters at the c^th iteration. It is defined as follows:

$$ IndHomog=\frac{MixedBic}{AllBic} $$

((4.2))
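Eq. (4.2) can be sketched as follows, under the assumption that AllBic is the number g×h of blocks induced by the two clusterings and that an empty block is not counted as mixed. The function returns the pair (MixedBic, AllBic) rather than a reduced fraction. The function name and the matrix are hypothetical.

```python
def ind_homog(m, z, w, g, h):
    """Return (MixedBic, AllBic): the number of blocks (k, l) that contain
    both 0's and 1's, and the total number g*h of blocks."""
    seen = [[set() for _ in range(h)] for _ in range(g)]
    for i, row in enumerate(m):
        for j, value in enumerate(row):
            seen[z[i]][w[j]].add(value)
    mixed = sum(1 for k in range(g) for l in range(h) if len(seen[k][l]) == 2)
    return mixed, g * h


# Hypothetical 4x5 binary matrix (illustration only).
m = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 1, 1, 1],
]
print(ind_homog(m, z=[0, 0, 1, 1], w=[0, 0, 1, 1, 1], g=2, h=2))  # (2, 4)
```

Here the two diagonal blocks are homogeneous and the two off-diagonal blocks each contain a single 1 among 0's, so IndHomog is 2/4.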

To describe BiBinAlter more formally, we use the same notations as for the previous algorithm, together with the following ones:

(EvalStab_c,IndHomog_c): the pair made of the frequency of 0’s in the current group of biclusters at the c^th iteration and the tradeoff between the number of mixed biclusters (containing both 0’s and 1’s) and the total number of biclusters at the c^th iteration.

(EvalStab_c−1,IndHomog_c−1): the pair made of the frequency of 0’s in the group of biclusters at the (c−1)^th iteration and the tradeoff between the number of mixed biclusters (containing both 0’s and 1’s) and the total number of biclusters at the (c−1)^th iteration.

$\left (EvalStab_{(c-1)}^{'},IndHomog_{(c-1)}^{'}\right)$: the pair made of the frequency of 0’s in the group of biclusters at the intermediate (c−1)^′^th iteration and the tradeoff between the number of mixed biclusters (containing both 0’s and 1’s) and the total number of biclusters at the intermediate (c−1)^′^th iteration.

Illustrative example

Let’s apply the BiBinAlter algorithm on the following binary matrix M_b(I,J):

Initialization step

First, we initialize the clusterings of the rows and columns using the initialization step of the BiMax algorithm [3] and we compute (z₀,w₀,A₀). We obtain:

z₀=(1,2,2,3), w₀=(1,1,0,0,0), A₀=(1,0;1,1;0,1)

A colored block in the binary matrix M_b(I,J) is represented by a colored cell in the summary matrix A₀, where each colored cell contains the majority binary value of the corresponding colored block, e.g., if the majority of the cells of a block in M_b(I,J) contain 1, then the corresponding cell in A₀ also contains 1.

Biclustering step:

Iteration 1: c=1

We compute ($z_{1},w_{0},A_{1}^{'}$) starting from (z₀,w₀,A₀) by using Eq. 2.4. We obtain:

$$\left(z_{1},w_{0},A_{1}^{'}\right)=\left((1,3,2,1),(1,1,2,2,2),(1,1;1,0;0,1)\right) $$

We compute $(EvalStab_{1}^{'},IndHomog_{1}^{'})$ by using Eqs. 4.1 and 4.2. We obtain:

$$\left(EvalStab_{1}^{'},IndHomog_{1}^{'}\right)=\left(2,\frac{2}{3}\right) $$

We compute (z₁,w₁,A₁) starting from ($z_{1},w_{0},A_{1}^{'}$) by using Eq. 2.5. We obtain:

$$\left(z_{1},w_{1},A_{1}\right)=\left((1,3,2,1),(2,2,1,2,1),(1,1;1,0;0,1)\right) $$

We compute (EvalStab₁,IndHomog₁) by using Eqs. 4.1 and 4.2. We obtain:

$$\left(EvalStab_{1},IndHomog_{1}\right)=\left(1,\frac{2}{6}\right) $$

Since we have

$$ \left((z_{c},w_{c},A_{c})\neq(z_{c-1},w_{c-1},A_{c-1})\right)\,\,\textbf{and}\,\,\left((EvalStab_{c},IndHomog_{c})\neq(EvalStab_{c-1},IndHomog_{c-1})\right) $$

and

$$ \left((z_{c},w_{c},A_{c})\neq(z_{c},w_{c-1},A_{c}^{'})\right)\,\,\textbf{and}\,\,\left((EvalStab_{c},IndHomog_{c})\neq(EvalStab_{(c-1)}^{'},IndHomog_{(c-1)}^{'})\right) $$

we make another iteration.

Iteration 2: c =2

We compute ($z_{2},w_{1},A_{2}^{'}$) starting from (z₁,w₁,A₁) by using Eq. 2.4. We obtain:

$$\left(z_{2},w_{1},A_{2}^{'}\right)=\left((2,1,2,3),(2,2,1,2,1),(0,1;1,0;0,1)\right) $$

We compute $(EvalStab_{2}^{'},IndHomog_{2}^{'})$ by using Eqs. 4.1 and 4.2. We obtain:

$$\left(EvalStab_{2}^{'},IndHomog_{2}^{'}\right)=\left(0,\frac{0}{6}\right) $$

We compute (z₂,w₂,A₂) starting from ($z_{2},w_{1},A_{2}^{'}$) by using Eq. 2.5. We obtain:

$$(z_{2},w_{2},A_{2})=((2,1,2,3),(2,2,1,2,1),(0,1;1,0;0,1)) $$

We compute (EvalStab₂,IndHomog₂) by using Eqs. 4.1 and 4.2. We obtain:

$$\left(EvalStab_{2},IndHomog_{2}\right)=\left(0,\frac{0}{6}\right) $$

Since we have

$$ \left((z_{c},w_{c},A_{c})\neq(z_{c-1},w_{c-1},A_{c-1})\right)\,\,\textbf{and}\,\,\left((EvalStab_{c},IndHomog_{c})\neq(EvalStab_{c-1},IndHomog_{c-1})\right) $$

and

$$ \left((z_{c},w_{c},A_{c})\neq(z_{c},w_{c-1},A_{c}^{'})\right)\,\,\textbf{and}\,\,\left((EvalStab_{c},IndHomog_{c})=(EvalStab_{(c-1)}^{'},IndHomog_{(c-1)}^{'})\right) $$

we make another iteration.

Iteration 3: c =3

We compute ($z_{3},w_{2},A_{3}^{'}$) starting from (z₂,w₂,A₂) by using Eq. 2.4. We obtain:

$$ \left(z_{3},w_{2},A_{3}^{'}\right)=\left((1,1,1,1),(2,2,1,2,1),(1,1)\right) $$

((4.3))

We compute $(EvalStab_{3}^{'},IndHomog_{3}^{'})$ by using Eqs. 4.1 and 4.2. We obtain:

$$\left(EvalStab_{3}^{'},IndHomog_{3}^{'}\right)=(1,1) $$

We compute (z₃,w₃,A₃) starting from ($z_{3},w_{2},A_{3}^{'}$) by using Eq. 2.5. We obtain:

$$\left(z_{3},w_{3},A_{3}\right)=((2,1,2,3),(2,2,1,2,1),(0,1;1,0;0,1)) $$

We compute (EvalStab₃,IndHomog₃) by using Eqs. 4.1 and 4.2. We obtain:

$$\left(EvalStab_{3},IndHomog_{3}\right)=\left(0,\frac{0}{6}\right) $$

Since we have

$$ \left((z_{c},w_{c},A_{c})\neq(z_{c-1},w_{c-1},A_{c-1})\right)\,\,\textbf{and}\,\,\left((EvalStab_{c},IndHomog_{c})\neq(EvalStab_{c-1},IndHomog_{c-1})\right) $$

and

$$ \left((z_{c},w_{c},A_{c})\neq(z_{c},w_{c-1},A_{c}^{'})\right)\,\,\textbf{and}\,\,\left((EvalStab_{c},IndHomog_{c})\neq(EvalStab_{(c-1)}^{'},IndHomog_{(c-1)}^{'})\right) $$

we make another iteration.

Iteration 4: c =4

We compute ($z_{4},w_{3},A_{4}^{'}$) starting from (z₃,w₃,A₃) by using Eq. 2.4. We obtain:

$$\left(z_{4},w_{3},A_{4}^{'}\right)=((1,2,1,2),(2,2,1,2,1),(1,0;0,1)) $$

We compute $\left(EvalStab_{4}^{'},IndHomog_{4}^{'}\right)$ by using Eqs. 4.1 and 4.2. We obtain:

$$\left(EvalStab_{4}^{'},IndHomog_{4}^{'}\right)=\left(0,\frac{0}{4}\right) $$

We compute (z₄,w₄,A₄) starting from ($z_{4},w_{3},A_{4}^{'}$) by using Eq. 2.5. We obtain:

$$\left(z_{4},w_{4},A_{4}\right)=((1,2,1,2),(1,1,1,1,1),(1;0)) $$

We compute (EvalStab₄,IndHomog₄) by using Eqs. 4.1 and 4.2. We obtain:

$$\left(EvalStab_{4},IndHomog_{4}\right)=(1,1) $$

Since we have

$$ \left((z_{c},w_{c},A_{c})\neq(z_{c-1},w_{c-1},A_{c-1})\right)\,\,\textbf{and}\,\,\left((EvalStab_{c},IndHomog_{c})\neq(EvalStab_{c-1},IndHomog_{c-1})\right) $$

and

$$ \left((z_{c},w_{c},A_{c})\neq(z_{c},w_{c-1},A_{c}^{'})\right)\,\,\textbf{and}\,\,\left((EvalStab_{c},IndHomog_{c})\neq(EvalStab_{(c-1)}^{'},IndHomog_{(c-1)}^{'})\right) $$

we make another iteration.

Iteration 5: c =5

We compute ($z_{5},w_{4},A_{5}^{'}$) starting from (z₄,w₄,A₄) by using Eq. 2.4. We obtain:

$$\left(z_{5},w_{4},A_{5}^{'}\right)=((1,2,1,2),(1,1,1,1,1),(1;0)) $$

We compute $(EvalStab_{5}^{'},IndHomog_{5}^{'})$ by using Eqs. 4.1 and 4.2. We obtain:

$$\left(EvalStab_{5}^{'},IndHomog_{5}^{'}\right)=(1,1) $$

We compute (z₅,w₅,A₅) starting from ($z_{5},w_{4},A_{5}^{'}$) by using Eq. 2.5. We obtain:

$$\left(z_{5},w_{5},A_{5}\right)=((1,2,1,2),(1,1,2,1,2),(1,0;0,1)) $$

We compute (EvalStab₅,IndHomog₅) by using Eqs. 4.1 and 4.2. We obtain:

$$\left(EvalStab_{5},IndHomog_{5}\right)=\left(0,\frac{0}{4}\right) $$

We have $(EvalStab_{4}^{'},IndHomog_{4}^{'}) = (EvalStab_{5},IndHomog_{5})$ and $(z_{5},w_{5},A_{5}) = (z_{4},w_{3},A_{4}^{'})$.

Since we have

$$ \left((z_{c},w_{c},A_{c})=(z_{c},w_{c-1},A_{c}^{'})\right)\,\,\textbf{and}\,\,\left((EvalStab_{c},IndHomog_{c})=(EvalStab_{(c-1)}^{'},IndHomog_{(c-1)}^{'})\right) $$

we stop the loop.

Then, we obtain (z_opt,w_opt,A_opt)=(z₅,w₅,A₅). Biclusters that contain only 0’s will not be considered because they represent genes that are not expressed under the related conditions. Finally, (z_opt,w_opt,A_opt) can be represented in M_b(I,J) as follows:

Results for synthetic datasets

In this section, we present an experimental study to evaluate the performance of our biclustering algorithms. We compare their results to those obtained by a selection of known algorithms from the literature. We conducted experiments on synthetic and real microarray datasets. The idea behind testing on synthetic datasets is to investigate the ability of our algorithms to extract different types of biclusters, whereas on real datasets we seek to assess how well our algorithms satisfy statistical and biological criteria.

Synthetic microarray datasets and comparison criteria

Following the strategy and data described in [1], we have tested our algorithms on synthetic datasets by operating as follows. First, we choose the number of biclusters: 3 clusters on rows (g=3) and 2 clusters on columns (h=2). Second, we use the Latent Bernoulli Mixture (LBM) model [1] to generate binary matrices (mixtures) by considering:

(a) Overlapping biclusters (overlapping rate =5 % (well separated), 15 % (fairly separated) and 25 % (poorly separated)).
(b) Different data sizes (matrix size =50×30 (small), 100×60 (medium) and 200×120 (large)).
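A planted block structure of this kind can be sketched as follows. This is not the generator of [1]: the `noise` parameter (probability of flipping a cell) is a hypothetical stand-in for the overlapping rate, the block values in `centers` are arbitrary, and the function name is an assumption made for the illustration.

```python
import random


def generate_block_matrix(n_rows, n_cols, centers, noise, seed=0):
    """Binary matrix with a planted block structure.

    centers[k][l] is the value of block (k, l); each cell is flipped with
    probability `noise` (hypothetical stand-in for the overlapping rate).
    """
    rng = random.Random(seed)
    g, h = len(centers), len(centers[0])
    z = [rng.randrange(g) for _ in range(n_rows)]   # cluster of each row
    w = [rng.randrange(h) for _ in range(n_cols)]   # cluster of each column
    data = [
        [centers[z[i]][w[j]] ^ int(rng.random() < noise) for j in range(n_cols)]
        for i in range(n_rows)
    ]
    return data, z, w


centers = [[1, 0], [0, 1], [1, 1]]    # g=3 clusters of rows, h=2 clusters of columns
data, z, w = generate_block_matrix(50, 30, centers, noise=0.05)
print(len(data), len(data[0]))        # 50 30
```

Since the planted clusterings z and w are returned together with the matrix, the biclusters extracted by an algorithm can be compared with the implemented ones.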

Similar to [4], we use two indices, Recovery and Relevance, to evaluate our biclustering algorithms. Let B₁ be a group of true biclusters implemented in a binary data matrix M_b and B₂ be the group of biclusters output by a biclustering algorithm. Relevance reflects to what extent B₂ is similar to B₁, while Recovery quantifies how well each bicluster in B₁ is recovered by B₂ [3]:

$$ Recovery=Overlap(B_{1},B_{2}) $$

((6.1))

$$ Relevance=Overlap(B_{2},B_{1}) $$

((6.2))

where:

$$ Overlap(B_{1},B_{2})=\frac{1}{|B_{1}|}\sum\limits_{(I_{1},J_{1})\in B_{1}}\max\limits_{(I_{2},J_{2})\in B_{2}}\frac{|I_{1}\cap I_{2}||J_{1}\cap J_{2}|}{|I_{1}\cup I_{2}||J_{1}\cup J_{2}|} $$

((6.3))
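Eq. (6.3) can be sketched as follows, with each bicluster represented as a pair (set of rows, set of columns). The score is not symmetric in its two arguments, which is why Recovery and Relevance, obtained by exchanging the roles of B₁ and B₂, can differ. The function name and the two groups of biclusters are hypothetical.

```python
def overlap(b1, b2):
    """Overlap(B1, B2) of Eq. (6.3): for each bicluster of B1, take its best
    match in B2, then average over B1. Both groups must be non-empty."""
    def score(x, y):
        (i1, j1), (i2, j2) = x, y
        return (len(i1 & i2) * len(j1 & j2)) / (len(i1 | i2) * len(j1 | j2))
    return sum(max(score(x, y) for y in b2) for x in b1) / len(b1)


# Hypothetical groups: two implemented biclusters, one extracted bicluster.
implemented = [({0, 1}, {0, 1}), ({2, 3}, {2, 3, 4})]
extracted = [({0, 1}, {0, 1})]

print(overlap(implemented, extracted))  # 0.5
print(overlap(extracted, implemented))  # 1.0
```

The extracted bicluster matches one implemented bicluster exactly, so the score averaged over the extracted group is 1.0, but the second implemented bicluster is missed, so the score averaged over the implemented group is only 0.5.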

We also use two other indices cited in [2]:

$$ Shared=\frac{S_{cb}}{Tot_{size}}\times 100 $$

((6.4))

$$ NotShared=\frac{S_{ncb}}{Tot_{size}}\times 100 $$

((6.5))

where S_cb is the volume of correctly extracted biclusters, Tot_size is the total volume of implemented biclusters and S_ncb is the volume of not correctly extracted biclusters.

The Shared index (resp. NotShared) represents the percentage of correctly (resp. not correctly) extracted biclusters with respect to all implemented biclusters in the data matrix. Indeed, when the Shared value is equal to 100 %, the algorithm extracts all the implemented biclusters. When the value of NotShared is 0 %, the algorithm extracts no cell outside the implemented biclusters.
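Eqs. (6.4) and (6.5) can be sketched as follows, under the assumption that a volume is a number of cells: S_cb counts the extracted cells that lie inside the implemented biclusters and S_ncb counts the extracted cells that lie outside them. The function names and the example groups are hypothetical.

```python
def cells(biclusters):
    """Set of (row, column) cells covered by a group of biclusters."""
    return {(i, j) for rows, cols in biclusters for i in rows for j in cols}


def shared_not_shared(implemented, extracted):
    """Return (Shared, NotShared) in percent of the implemented volume."""
    truth, found = cells(implemented), cells(extracted)
    shared = 100 * len(found & truth) / len(truth)
    not_shared = 100 * len(found - truth) / len(truth)
    return shared, not_shared


# Hypothetical groups: 10 implemented cells; 6 extracted cells, 2 of them outside.
implemented = [({0, 1}, {0, 1}), ({2, 3}, {2, 3, 4})]
extracted = [({0, 1}, {0, 1, 2})]

print(shared_not_shared(implemented, extracted))  # (40.0, 20.0)
```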

Experimental protocol

We have compared our algorithms to CC (Cheng and Wee-Chung [2]), OPSM (Ben-Dor and Yakhini [5]), ISA (Ihmels et al. [4]) and BiMax (Kaiser and Leisch [6]). These algorithms were implemented in the BIClustering Analysis Toolbox (Bicat) platform. After several simulations, the parameters of our algorithms were set as listed in Table 1. At each simulation, we fix one parameter and vary the other, and vice versa. Finally, we keep the parameters that give the biclusters closest to those implemented in the starting template.

Table 1 Corresponding parameters values of our algorithms


For the CC, OPSM, ISA and BiMax algorithms, we keep the default parameter values, since these values give biclusters of reasonable quality. We have adopted Shared, NotShared, Recovery and Relevance as comparison criteria. Table 2 shows the best biclusters extracted by each algorithm:

Table 2 Values of Shared and NotShared for non overlapping biclusters


As we can notice in Table 2, for the generated binary matrices, the best values of Shared and NotShared for non overlapping biclusters were obtained by the BiBinAlter algorithm. Indeed, to get a solution B_opt, the combination of two biclusters provides additional volume for the conditions. This reasonable additional volume is generated by successive comparisons between P_max and the other polynomials of L, and we locate the polynomial P_uncomon that has the lowest number ρ of uncommon terms with P_max. In fact, this interesting result is obtained because we keep only the conditions that have not been removed by the pretreatment process. Besides, the bicluster extracted from the current matrix M_b(I,J) is removed, and we set the cells of M_b(I,J) representing the new bicluster to 0. Table 3 shows the best biclusters extracted by each algorithm. As we can notice in Table 3, for the generated binary matrices, the best values of Shared and NotShared for overlapping biclusters were also obtained by the BiBinAlter algorithm. We can explain this as follows: the results of BiBinAlter cover most of the implemented biclusters. Table 4 presents the number of biclusters obtained by our algorithms on real datasets.

Table 3 Values of Shared and NotShared for overlapping biclusters


Table 4 Number of biclusters obtained by our algorithms on real datasets


Results of our algorithms on real datasets

In this section, we evaluate our algorithms on real microarray datasets.

Real microarray datasets

We have used two real microarray datasets. The Yeast cell cycle dataset, which has been described and then pretreated in [1], contains the expression of 2884 genes under 17 conditions. The Human B-cell Lymphoma dataset, which has been described by Alizadeh et al. [1], contains 4026 genes and 96 conditions. These datasets are frequently used in the literature to evaluate biclustering algorithms.

Experimental protocol

The first experiments concern the statistical validation. They compute the coverage for the Yeast cell cycle and Human B-cell Lymphoma datasets and the adjusted p-value for the Human B-cell Lymphoma dataset. The second experiments were applied to the Yeast cell cycle dataset in order to study the biological significance of the extracted biclusters.

Statistical validation

In order to validate our algorithms statistically on these real datasets, we evaluate the performance of BiBinCons and BiBinAlter. We calculate the total number of cells covered by the biclusters. To do this, we have proceeded as in [2], and we have compared the results of our algorithms to those reported in [2]. In the literature, the coverage test was performed on the Yeast cell cycle and Human B-cell Lymphoma datasets. This test is not applied to the RefineBicluster algorithm because it is only a refinement algorithm.
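The cell coverage described above can be sketched as follows: a cell covered by several overlapping biclusters is counted once, and the total is expressed as a percentage of the size of the data matrix. The function name and the example are hypothetical, and the exact protocol of [2] may differ.

```python
def coverage(biclusters, n_rows, n_cols):
    """Percentage of the cells of an n_rows x n_cols matrix that are covered
    by at least one bicluster (rows, cols)."""
    covered = {(i, j) for rows, cols in biclusters for i in rows for j in cols}
    return 100 * len(covered) / (n_rows * n_cols)


# Hypothetical example: two biclusters overlapping on the cell (1, 1) of a 4x5 matrix.
biclusters = [({0, 1}, {0, 1}), ({1, 2}, {1, 2, 3})]
print(coverage(biclusters, n_rows=4, n_cols=5))  # 45.0
```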

Table 5 reports the percentage of Coverage of the different algorithms for the Yeast cell cycle and Human B-cell Lymphoma datasets. We note that most algorithms have more or less close rates. For example, for the Yeast cell cycle dataset, BiBinCons has the lowest performance. This is explained by the fact that BiBinCons extracts thousands of small biclusters. The CC algorithm masks the biclusters it has extracted with random values, which prevents the genes/conditions already discovered from being selected in the next search process. This type of mask leads to a high coverage but prevents the discovery of large biclusters.

Table 5 Values of Coverage for Yeast cell cycle and Human B-cell Lymphoma datasets


Biological validation

To evaluate the extracted biclusters biologically, we use the web tool GOTermFinder and we present the most significant shared biclusters. In this section, we evaluate the BiBinCons and BiBinAlter algorithms on real microarray datasets. We have chosen these algorithms because they gave the best results on synthetic datasets. Table 6 presents the most important GO terms for the two most significant biclusters extracted from the Yeast cell cycle dataset by BiBinCons and BiBinAlter.

Table 6 The most important terms of GO for the two most significant extracted biclusters from Yeast cell cycle dataset by BiBinCons and BiBinAlter


Computing time

Table 7 shows the computing time of the BiBinCons and BiBinAlter algorithms. All the algorithms developed in this paper were implemented in R under RStudio. The machine used is a PC with an Intel Core 2 Duo T6400 with a clock frequency of 2.0 GHz and 3.5 GB of RAM. We note that BiBinAlter is the most time-consuming algorithm, which is due to the use of the proposed evaluation functions.

Table 7 Computing time of our algorithms


Conclusion

In this paper, we have developed two biclustering algorithms for binary microarray data, called BiBinCons and BiBinAlter, both adopting the Iterative Row and Column Clustering Combination (IRCCC) approach. BiBinAlter is an improvement of BiBinCons: it uses the EvalStab and IndHomog evaluation functions in addition to the CroBin one [1]. BiBinAlter can extract biclusters of good quality with better p-values. We have presented an experimental study of our biclustering algorithms, comparing their results to those obtained by a selection of known biclustering algorithms on both synthetic and real microarray datasets. For both synthetic and real datasets, BiBinAlter outperforms the other algorithms, followed by our other biclustering algorithm, BiBinCons.

References

1. Govaert G. La classification croisee. Modulad. 1983.
2. Law NF, Siu WC, Cheng KO, Alan WC. Identification of coherent patterns in gene expression data using an efficient biclustering algorithm and parallel coordinate visualization. BMC Bioinformatics. 2008.
3. Prelić A, Bleuler S, Zimmermann P, Wille A, Bühlmann P, Gruissem W, et al. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics. 2006; 22:1122–29.
4. Ihmels J, Bergmann S, Barkai N. Defining transcription modules using large-scale gene expression data. Bioinformatics. 2004; 20(13):1993–2003.
5. Benny C, Richard K, Amir BD, Yakhini Z. Discovering local structure in gene expression data: The order-preserving submatrix problem. In: Proceedings of the Sixth Annual International Conference on Computational Biology, RECOMB ’02. New York, NY, USA: ACM: 2002. p. 49–57.
6. Santamaria R, Khamiakova T, Sill M, Theron R, Quintales L, Kaiser S, et al. biclust: Bicluster algorithms. R package. 2011.

Author information

Authors and Affiliations

Latice laboratory, ENSIT, Tunis Time université, Tunis, Tunisia
Haifa Ben Saber & Mourad Elloumi
Latice laboratory, Ensit, Tunis Université tunis el manar, Tunis, Tunisia
Mourad Elloumi


Corresponding author

Correspondence to Haifa Ben Saber.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.


About this article

Cite this article

Saber, H.B., Elloumi, M. A novel biclustering algorithm of binary microarray data: BiBinCons and BiBinAlter. BioData Mining 8, 38 (2015). https://0-doi-org.brum.beds.ac.uk/10.1186/s13040-015-0070-4


Received: 04 January 2015
Accepted: 08 November 2015
Published: 30 November 2015
DOI: https://0-doi-org.brum.beds.ac.uk/10.1186/s13040-015-0070-4
