1. Introduction
The most essential component of life known to exist are cells.The cells that make up our bodies can respond to stimuli, reproduce, process information, and carry out various chemical activities. However, to this day, nothing is understood about how cells interact, how their organelles react to chemicals, or how they behave inside of cells. A typical ensemble measurement of millions of cells put together is insufficient to provide precise information about cells for the understanding of cellular physiological interactions with various chemicals and organelles. Again, cellular heterogeneity features and dynamics in a given cell population cannot be investigated at the molecular level by bulk analysis of millions of cells. The physiological states of cells and the variability of cells within a particular population, however, have made it difficult to conduct investigations into the progression of any disease. Single-cell analysis (SCA), as opposed to the typical study of millions of cells in bulk, provides clear information on each cell, such as particular biological factors of the cell, which is helpful to understand stem cell activity or tumor progression [
1,
2]. All medical disciplines such as oncology, microbiology, neurology, reproduction, and immunology greatly benefited from single-cell RNA sequencing (scRNA-seq) technologies [
3,
4]. To detect the genome, transcriptome, other multi-omics of single cells, and single-cell sequencing technologies could be used. It also requires additional multi-omics data analysis to reveal the variations in cell populations and the evolutionary connections between cells. However, conventional RNA sequencing techniques have some significant drawbacks. For instance, they are unable to examine a few cells and can only determine the average of many cells. On the other hand, scRNA-seq technology can identify heterogeneity among individual cells and solve the problem of traditional sequencing technology potentially losing information about cellular heterogeneity. The workflow of scRNA-seq technology entails separating a single cell from a population of cells, analyzing cell heterogeneity, molecular mapping, and monitoring immune infiltration and epigenetic alterations. The current research focusing on scRNA-seq data analysis includes the classification of various types of tumors, examination of heterogeneity in various cell types, identification of existing and novel types of cells, cell representation, and cell fate prediction [
5]. Single-cell clustering plays an important role in conducting such a study. Cell clustering is an essential technique for identifying cluster-specific gene signatures for reconciling cell type configuration. The aim of this work is to identify the gene signature as differentially expressed and to simplify the bulk RNA-seq expression data by removing noise. However, several factors including high dimensionality, measurement noise, detection limit, and imbalanced size between uncommon and abundant populations make the computational analysis of scRNA-seq data difficult. One crucial aspect of scRNA-seq data that contributes to all of these difficulties is a phenomenon known as “drop-out” in which a gene is seen to have a low or moderate amount of expression in one cell but not in another of the same cell type [
6]. These dropout events arise from the stochasticity of mRNA expression, ineffective mRNA capture, and low levels of mRNA in individual cells. The dropouts are a major contributor to the sometimes very sparse scRNA-seq data. The data are zero-inflated due to the overwhelming zero counts, only capturing a portion of each cell’s transcriptome. Dimensionality reduction and clustering, which both have a significant impact on the analysis that comes after scRNA-seq data, are the two most crucial methods for overcoming analysis problems. To overcome the dropouts, almost all currently used computational approaches for scRNA-seq included gene selection and dimension reduction techniques. Pre-processing of the data with principal component analysis (PCA) [
7] or t-Distributed Stochastic Neighbor Embedding (tSNE) [
8] to accomplish dimension reduction and choose highly variable genes is a frequent approach for clustering and trajectory finding methods. There are various imputation techniques intended to specifically omit dropouts. MAGIC [
9], SAVER [
10], scImpute [
11], and RESCUE [
12] are a few recent examples. These imputation techniques often determine gene–gene or cell–cell similarities using highly variable genes and dimension reduction, which serves as the foundation for imputing the dropouts with acceptable values. The key component of all these approaches is to concentrate the analysis on highly variable genes that are less susceptible to dropouts, which has been shown to be efficient since genes with high variability are frequently able to capture the cell-to-cell heterogeneity of significant phenotypes.
Automatically detecting the dropout finder, scImpute [
11] solely imputes these dropout values, avoiding the inclusion of additional biases to the remaining data, and whereas MAGIC and SAVER would change all gene expression levels, even those unaffected by dropouts, this might add new biases into the data and even eliminate significant biological variance [
13].In addition, outlier cells are found by scImpute and are disregarded for imputation. According to analysis using both the real and simulated human and mouse scRNA-seq data, scImpute is a useful method for recovering transcriptome dynamics obscured by dropouts. It has been demonstrated that scImpute can help the research of gene expression dynamics as well as improve the clustering of cell sub-populations, increase the accuracy of differential expression analyses, and identify the likely dropouts. One interesting phenomenon of scImpute is that it can be integrated by the majority of currently used pipelines or downstream analyses of scRNA-seq data, including normalization, differential expression analysis, clustering, classification, etc. The authors in Li et al. [
11] claimed that a new analyzing tool specifically developed for the imputed scRNA-seq data by scImpute improved the performance over the state-of-the-art methods developed for raw scRNA-seq data by incorporating some features to decrease excess zero or near zero expression, dropout rates, and dropout probabilities estimated through the mixture models. Another key characteristic of scImpute is that it just uses two parameters, which are simple to comprehend and choose. The number of potential cell populations is indicated by the first parameter,
K. It can be selected based on the raw data clustering results and the users’ preferred level of resolution. Users can choose a relatively small
K, if they are only interested in the differences between the major clusters, in which scImpute can borrow more information from individual cells; otherwise, they can choose a relatively large
K, in which scImpute would be more cautious during the imputation process. The second parameter is a probability threshold
t for dropouts. In article [
11], authors proved that scImpute can handle a range of parameter values and most scRNA-seq data can be used with a default threshold of 0.5. In addition, for the scImpute technique, cell type information is not required. The computing efficiency can be significantly increased if a filtering step on cells can be carried out using biological information. ScImpute scales up effectively as the number of cells rises.
Discriminant analysis is widely used in bioinformatics for tasks like separating cancerous tissues from healthy tissues or one cancer subtype from another, predicting the fold or super-family of proteins from their sequences, etc. [
14]. Feature selection is a crucial problem in discriminant analysis: rather than using every variable (feature or attribute) in the data, one picks a subset of features to be employed in the discriminant system [
15,
16]. Feature selection has a number of benefits, including: (i) dimension reduction to decrease computation costs; (ii) noise reduction to increase classification accuracy; (iii) more observable features or characteristics to aid in identifying and keeping track of the targeted diseases or function types [
17,
18]. To reduce the whole feature set into one that is more accessible, various feature selection techniques are available [
19,
20,
21]. Filter methods, wrapper methods, and embedded methods are generally all three categories into which these techniques fall. The effectiveness of the computation and generalization to various machine learning models are benefits of the filter approach. In our framework, we have studied a Maximum Relevance and Minimum Redundancy (MRMR) [
14] feature selection method. Out of all of the filtering approaches, the MRMR method is preferred since it can efficiently eliminate the redundant features while retaining the necessary features for the model [
22]. Since many significant features are correlated and redundant, it may happen that the resulting
n best features might not be the actual (optimized) best
n features. MRMR technique addresses this issue by choosing features that take into account both their redundancy within the selected features as well as their relevancy for predicting the outcome variable. Ding and Peng [
14] proposed the MRMR algorithm and established that it selects features more accurately than the maximum relevance method alone. According to the notion behind MRMR, a feature should have a significant amount of mutual information if it is strongly differentially expressed for different classes. They employed mutual information in the MRMR model to assess the significance of the attributes. Due to this, MRMR is a feature selection technique that is frequently employed in the implementation of machine learning.
To categorize cells based on their similarity in gene expression profiles, cells can be grouped using an unsupervised machine learning phase, clustering. Researchers gain insights into the scRNA-Seq data and potential confounding variables from hidden patterns that emerge from clustering results. The most widely used scRNA-Seq clustering techniques are based on the three classes: Hierarchical, k-means, and graph-based. For powerful single-cell transcriptome analysis, the Louvain algorithm [
23], a graph-based technique that looks for densely connected communities in the network, is the preferred approach. Other well-known applications that adopt this strategy include PhenoGraph [
24], Seurat [
25,
26], Scanpy [
27], and FAVA [
28]. All of PhenoGraph, Seurat, and Scanpy methods are basically PCA as well as graph-based methods, which are very scalable, and have high accuracy in large datasets [
29]. On the other hand, FAVA is a deep learning-based classification technique. Shrinkage clustering can be widely used to resolve biomedical clustering issues, especially when working with large datasets, due to its simplicity of implementation, computing efficiency, and flexible structure [
30]. The shrinkage clustering approach, which is based on matrix factorization, partitions the data while concurrently determining the optimal number of clusters. Furthermore, due to the extensible structure of the algorithm, we can use it for clustering applications with minimum cluster size restrictions. The algorithm demonstrated significant resistance to noise, operated with high accuracy on both simulated and real data, and ran faster than some of the most widely used algorithms.
The process of assessing the accuracy of outputs from a clustering algorithm is known as cluster validation. This is significant both when trying to compare two clustering techniques and when trying to avoid seeing patterns in random data. Clustering validation statistics can often be divided into three types: internal cluster validation, which assesses the quality of a clustering structure using only the internal data from the clustering process and does not rely on outside data; external cluster validation involves comparing cluster analysis findings to an externally known outcome, like externally supplied class labels; relative cluster validation compares the results of the same procedure with different parameter values to determine the clustering structure. Internal validation metrics frequently represent the separation, connectivity, and compactness of cluster partitions. The degree of closeness among the objects in a cluster is referred to as compactness. How well a cluster is segregated from other clusters is determined by its separation. Connectivity measures how closely items are clustered with their closest neighbors in the data space. The silhouette coefficient is a popular metric for evaluating the quality of clustering [
31]. The silhouette analysis calculates the average distance between clusters and assesses how well an observation is clustered. The distance between each point in one cluster and points in its neighboring clusters is shown on the silhouette plot. The advantage of this method is it can detect outliers if present in a cluster. The silhouette validity index measure normally computes three matrices: (i) the silhouette width for each data point, (ii) the average silhouette width for each cluster, and (iii) the overall average silhouette width for the total dataset.
Dimensionality reduction and clustering have drawn more attention in recent years for the interpretation of single-cell RNA sequencing data [
32]. In our last published study [
33], we presented an integrated clustering model with dimensionality reduction for the identification of cluster-specific biomarkers in single-cell sequencing data. However, we applied this model to the raw data matrix. Since the “dropout” phenomenon is a crucial challenge of scRNA-seq data analysis, a data imputation or imputed matrix is desirable. In this article, we propose another framework by integrating the imputed matrix, MRMR feature selection, and shrinkage clustering to detect gene signatures from scRNA-seq data. Our developed integrated model consists of several steps: pre-filtering, feature selection, clustering, and differentially expressed marker detection. The motivation behind proposing such a framework with reduced set gene clustering for scRNA-seq data analysis is that scRNA-seq provides a potent tool for analyzing the complexity of biological tissues by identifying cell sub-populations along with clustering techniques. Enhancing the accuracy and interpretability of single-cell clustering requires feature selection [
34].
3. Result and Discussion
In this study, we used a scRNA-seq dataset for rare intestinal cell type in mice (GEO ID: GSE62270) which contains 23,630 features, 1872 cells, and 9 classes. Here, we attempted to resolve the “drop-out” problem of scRNA-seq data by our proposed framework for gene signature detection through integrated strategies of imputed matrix, MRMR feature selection, and shrinkage clustering from scRNA-seq data. To solve the “drop-out” problem, we used a statistical tool scImpute on the dataset. scImpute can automatically detect likely dropouts, and it performs imputation on those values only; however, it does not introduce any new biases to the remaining data. scImpute also eliminates outlier cells and eliminates them from imputation. Thereafter, we applied the feature selection algorithm, MRMR, to the imputed data matrix.Since MRMR produces one of the most promising results for any prediction method, we obtained the top 100 features based on the MRMR optimization scores for further downstream analysis. After feature selection, we obtained a feature sub-matrix that contains 100 features and 1872 cells. Next, we applied shrinkage clustering to detect cell clusters. Using this clustering technique, we first computed the similarity matrix and then applied the function SuperCluster (s,w = 48, iter = 1000, random = 1, Path = F) to reach the optimum. To reach the global optimum, we reran the entire procedure with a different initial cluster assignment up to 10 times by updating the value of the random variable from 1 to 10, i.e., SuperCluster (s,w = 48, iter = 1000, random = 10, Path = F). Our dataset has nine known class labels of cells. The shrinkage clustering also found nine clusters from the dataset.
Figure 2a shows a PCA plot to visualize the clustering result. This resulting cluster provided normalized mutual information (NMI) scores (=0.0584), Rand index score (=0.4670), and F1 score (=0.2314). Thereafter, to evaluate the validity of the cell clusters, we used the cluster validity silhouette index. Using the silhouette index, we found correct clusters and wrong clusters. From those clusters, we chose one as the best cluster, which has the highest silhouette index, (=0.54). The best cluster contains 1368 cells with different labels.
Table 1 shows how many cells of the best cluster rely on each label.
Figure 2b shows silhouette width measure. According to the plot, the cluster denoting number 6 is the best cluster.
After that, in the differential expression analysis phase, we took the best cluster as the experimental group and the rest of the clusters as the normal group. Then we applied voom normalization for pre-processing data. After differential expression analysis, we obtained 100 differentially expressed features by Limma having
, in a list accompanied by computed t-score,
p-value and FDR. The top 25 differentially expressed genes (DEGs) are shown in
Table 2.
After detecting 100 differentially expressed features or gene markers, we utilized them for further Gene Set Enrichment Analysis by David 6.8 software [
50]. In order to obtain enriched KEGG pathways and gene ontology (GO) terms (viz., biological process (BP), cellular component (CC), and molecular function (MF)), along with Bonferroni and FDR scores, we applied the DAVID 6.8 database to our differentially expressed markers. Here, we used our input dataset in the DAVID 6.8 software’s recommended format which is basically a list of gene names in a single column, with
as the identifier and Mus musculus as the species. Unfortunately, due to the quality of the used dataset, we failed to obtain a significant KEGG pathway (enriched
p-value < 0.05). However, we identified several significant GO-terms (viz., GO:BP, GO:CC and GO:MF) (enriched
p-value < 0.05). Significant GO-terms were discussed below, along with
Table 3,
Table 4 and
Table 5 that provide further information. For example,
GO:0019882 antigen processing and presentation was one of the top significant GO-BP terms. Four genes, (
Cd74, Unc93b1, Psmb8, Psmb9) were associated with this GO-BP term having
p-value
.
Table 3 contains the top five significant GO-BP terms. We provided the list of all GO-BP terms in a
Supplementary File, Table S1. Similarly, we found
GO:0031225 anchored component of membrane as one of the top significant GO-CC terms. A total of six genes (named as
Bst1, Akp3, Nrn1, Vnn1, Alpi, Car4) were associated with this GO-CC term having a
p-value of
. The rest of the terms are shown in
Table 4. We provided the list of all GO-CC terms in a
Supplementary File, Table S2. Furthermore,
GO:0016595 glutamate binding was one of the top significant GO-MF terms. Three genes (named as
Slc1a1, Slc1a3, Grin1) were associated with this GO-MF term having
p-value
. For details, see
Table 5. We provided the list of all GO-MF terms in
Supplementary File, Table S3.
In addition, we detected targeter miRNAs associated with the detected 100 gene markers using
miRWalk database version 2022-1 [
51].
miRWalk is basically an online database where miRNA target detection is an advanced search option of this database. In general, using the list of user-defined miRNAs, we can find their target genes. On the other hand, in a similar fashion, using the list of user-defined genes, we can easily identify the corresponding targeter miRNAs. Of note, during the use of miRWalk, it is mandatory to select the species and type of IDs (here it is the “official gene symbol”). Here, in a similar fashion, we identified corresponding targeter miRNAs from the list of earlier detected differentially expressed gene (DEG) symbols (as the input list in miRWalk) by selecting “mouse” as the species and “official Gene Symbol” as the type of ID. Thereafter, we constructed miRNA target gene interaction network analysis through the degree centrality by
Cytoscape online tool (version 3.8.0) [
52].
Table 6 shows more details of top markers and top miRNAs sorted by the degree centrality.
Table 6 presents the top five gene markers ranked by their in-degree centrality, betweenness centrality and closeness centrality scores. Similarly,
Table 6 provides the top five miRNAs accompanied by the out-degree centrality, betweenness centrality, and closeness centrality scores. We also provided the list of all gene markers and miRNAs in
Supplementary File, Table S4.
Figure 3 highlights the top ten nodes ranked by the degree centrality from the whole network of 100 gene markers and corresponding targeter miRNA interactions. Furthermore, we applied network analysis on the top GO terms (viz., GO:BP, GO:CC, and GO:MF). For the most enriched GO:BP, we obtained that four markers fall in this term ranked by the degree centrality score, i.e.,
Cd74 (score: 241),
Unc93b1 (score: 228),
Psmb8 (score: 46), and
Psmb9 (score: 22). Similarly, for the most enriched GO:CC, we found six gene markers lied in this term ranked by the degree centrality score, i.e.,
Nrn1 (score: 461),
Bst1 (score: 160),
Alpi (score: 135),
Car4 (score: 104),
Vnn1 (score: 102) and
Akp3 (score: 49). Similarly, for the topmost GO:MF, we identified three markers associated with this term ranked by the degree centrality score, i.e.,
Grin1 (score: 956),
Slc1a1 (score: 403), and
Slc1a3 (score: 299).
Figure 4 depicts all of the sub-networks regarding the topmost GO-terms (highlighted the top ten nodes ranked by the degree centrality).
As a comparative study, we compared the identified 100 gene markers with our previously detected top 100 cluster-specified gene markers in the article by Seth et al., 2022 [
33], ranked by FDR. We discovered three common gene markers between the papers; namely,
,
and
. The partial visualization is shown in the Venn diagram,
Figure 2c. A detailed Venn diagram is shifted in
Supplementary File, Figure S1. Furthermore, we applied Gene Set Enrichment Analysis (GSEA) on both of these gene marker sets and obtained similar outcomes. We obtained several common GO:CC terms viz., GO:0005737 cytoplasm (enriched
p-value = 0.006), GO:0031526 brush border membrane (enriched
p-value = 0.029), GO:0005829 cytosol (enriched
p-value = 0.035), GO:0005743 mitochondrial inner membrane (enriched
p-value = 0.05), and a common GO:MF term, GO:0016491 oxidoreductase activity (enriched
p-value = 0.022). Moreover, we also performed an additional comparative study of our proposed model with the work Mallik and Zhao, 2019 [
53], that used the same dataset. This work also identified the differentially expressed genes or gene markers, which are not common with the gene markers detected by our current model. However, the novel gene markers of our current model are more significant than the gene markers detected by the model of Mallik and Zhao, 2019 [
53], with respect to an adjusted
p-value. According to the previous model, the top ten genes (markers) were
Rps21 (adjusted
p-value =
),
Slc5a1 (adjusted
p-value =
),
Crip1 (adjusted
p-value =
),
Rpl15 (adjusted
p-value =
),
Rpl3 (adjusted
p-value =
),
Rpl27a (adjusted
p-value =
),
Khk (adjusted
p-value =
),
Rps3a1 (adjusted
p-value =
),
Aldob (adjusted
p-value =
), and
Rps17 (adjusted
p-value =
). Our currently proposed model detected the top ten markers which are
BC064078 (adjusted
p-value =
),
Psmb9 (adjusted
p-value =
),
Cd74 (adjusted
p-value =
),
Psmb8 (adjusted
p-value =
),
Rangrf (adjusted
p-value =
),
1700097N02Rik (adjusted
p-value =
),
Slc14a1 (adjusted
p-value =
),
Fgf1 (adjusted
p-value =
),
Cdo1 (adjusted
p-value =
), and
Gm13157 (adjusted
p-value =
). It is quite clear that the gene markers detected by our model generate lesser adjusted
p-values than another model that defines higher significance of our markers than the previous. Furthermore, our proposed model estimated a higher silhouette width score (=0.540) for the best cluster rather than the same (=0.482) by the other model which signifies better clustering of our method than the other since a higher value of silhouette width indicates good clustering. Thus, we can claim that our proposed model generates efficient clustering outcomes and significant gene signatures compared to the other model.
In addition, to measure the efficiency of our framework, we used another benchmark dataset with accession id GSE36552 (Yan dataset) [
36], which is responsible for tracing the pluripotency of early human embryos and embryonic stem cells by single-cell RNA sequencing as input. This dataset consists of 20,214 features, 90 cells, and 7 classes. We analyzed this dataset through our framework. After obtaining an imputed data matrix by scImpute, we used the MRMR feature selection algorithm to fetch the top 100 features with respect to their feature scores. After that, we applied shrinkage clustering on our resultant 100 × 90 data matrix for cell clustering. In shrinkage clustering, after computing similarity matrix (s) we applied the function SuperCluster (s,w = 10, iter = 1000, random = 10, Path = F) to reach the global optimum. We finally obtained two clusters with accuracy metrics such as normalized mutual information (NMI) (=0.4117), Rand index (=0.5713), and F1 scores (=0.4589).
Figure 5a visualizes the two clusters. Next, applying the silhouette validity index, we detected the best cluster with the highest silhouette width of +0.82, whereas the silhouette width of another cluster was +0.58. The silhouette-computed cluster plot is shown in
Figure 5b. In this study, we obtained 64 cells in the best cluster and 26 cells in another cluster. Then, for differential expression analysis, we took the best cluster as the experimental group and another cluster as the control group. After differential expression analysis, we obtained 100 differentially expressed features by Limma Voom having
cutoff less than
, in a list accompanied by computed t-score, logFC,
p-value, and FDR. We also obtained 20 up-regulated bio-markers with logFC cutoff greater than
. Twenty up-regulated bio-markers are shown in
Table 7. Furthermore, we also applied the DAVID 6.8 database for GSEA. Here, we used our 100 differentially expressed genes as an input dataset in the DAVID 6.8 software’s recommended format which is basically a list of gene names in a single column, with
as the identifier and Homo Sapiens as the species. For this dataset, we identified several significant KEGG pathways and GO-terms (viz., GO:BP, GO:CC, and GO:MF) (enriched
p-value < 0.05). Significant KEGG pathways and GO-terms were listed in
Table 8,
Table 9,
Table 10 and
Table 11. For further information of enriched KEGG pathways and GO-terms,
Supplementary Files Tables S5–S8 are also included with this article.
In this framework, we have used the scImpute imputation method to overcome the dropout problem of single-cell sequencing data. However, there are various methods for data imputation. So, a question may arise as to why we have chosen scImpute in our proposed work. Though in the introduction section we already mentioned the limitation of other methods like MAGIC, SAVER, etc., to prove that drawback through the experiment, we applied the MAGIC method as a replacement for scImpute in our proposed framework. According to Li and Li [
13], the imputed data from scImpute produce a clearer contrast between the up-regulated genes in various cell types, whereas the imputed data from MAGIC and SAVER are unable to recover this pattern. In our framework, we also found that the cluster validity index silhouette width for the best cluster is lower when using the MAGIC imputation algorithm (+0.73) than when using the scImpute algorithm (+0.82) for the Yan dataset. MAGIC also decreased the average silhouette width from +0.75 to +0.72.
Figure 6 shows all clustering results for the MAGIC imputed Yan dataset (GSE36552). After that, we found differentially expressed genes (DEG) using Limma-Voom. Using the scImpute method in our proposed framework, we obtained 100 significant DEGs where FDR < 0.05; however, using the MAGIC method, we obtained 70 significant DEGs with this FDR cutoff. The rest of the 30 features are not significantly differentially expressed based on their adjusted
p-value/FDR. In addition, to obtain up-regulated markers, we considered the LogFC value, and the cutoff of the up-regulated marker is LogFC > 1.75.By using scImpute, we obtained 20 up-regulated markers with the highest logFC = 2.182.
Table 7 already represented a detailed list of 20 up-regulated markers. However, after using the MAGIC imputation method in our framework, we obtained only one up-regulated marker (
) with logFC = 1.845.
Table 12 represents the top five markers for the MAGIC imputed Yan dataset (GSE36552).
Supplementary Table S9 stores more details for all DEGs obtained from the MAGIC imputed dataset. Therefore consistent with Li and Li [
13], we also concluded that the imputed data by scImpute had a stronger contrast between the up-regulated genes in different cell types, whereas the imputed data by MAGIC had limitations in recovering this pattern. After using the raw dataset for the ablation study, it was found that the result using the raw dataset was far better than the result using the MAGIC imputed dataset. One main reason is that MAGIC discarded many biological zeroes during imputation, which likely led to the loss of meaningful biological variation.
Furthermore, one more explanation of our proposed framework is also required to make it clearer to future researchers. In this work, using the MRMR feature selection algorithm, we have selected the top 100 features and used them for downstream analysis. A question may arise here: why have we taken the feature count 100 in this framework? The answer is very simple. The MRMR method is based on the concept that if a feature was differentially expressed among distinct classes, it would have large mutual information. To measure the relevance of features, they used mutual information in the MRMR model. Accordingly, in MRMR, the rank of features never changed. We changed the feature count and validated our result using all performance measure metrics. We have executed our framework using the top 500 rather than 100 features. However, we obtained similar clustering results and almost the same value for the silhouette index, Rand index (RI), NMI, and F1-score for the scRNA-seq dataset of rare intestinal cell type in mice (GEO ID: GSE62270). The silhouette index for the best cluster with the top 500 feature counts was +0.56 where it was +0.54 by using the top 100 features. The NMI, RI, and F1-score were 0.0596, 0.4739, and 0.2310, respectively, when considering the top 500 features. We obtained similar values for these three measures—the NMI scores (=0.0584), Rand index score (=0.4670), and F1 score (=0.2314)—by using the feature to count the top 100 from the same dataset.Therefore, for this dataset (GSE62270), there was no significant variation in results by using different feature counts. Similarly, for another dataset, Yan (GSE36552) [
36], after considering the top 500 features, we obtained some minor variations in the silhouette index, and this dataset obtained the best clustering when it took the feature count of the top 100. Since the ranking of features in MRMR never changes, we can take different feature counts by using trial and error procedures depending on the quality and pattern of the dataset. We have added the silhouette cluster plot by taking the feature count 500 for both datasets GSE62270 and GSE36552 as
Supplementary Figures S2 and S3, respectively.
Ablation Study of Our Proposed Model
To evaluate the effects of the key steps of our proposed method on the performance, we added two case studies here: (i) without imputation and (ii) without imputation and feature selection. In case (i), we used the raw dataset Yan (GSE36552), and obtained the same value for NMI, RI, F1-Score, and silhouette width, which we obtained by using the imputed (through scImpute method) dataset. The reason behind it is that in our dataset the number of redundant zeroes is low, and the maximum zeroes are the biological zero. That is why we have not observed variation in the result. However, when we use the dataset with a large number of redundant zeroes (significant dropout problem), we would have a large variation as a result, using raw data and imputed data. However, in our dataset, we also obtained a variation in the biomarker list. By using the imputed dataset (imputation through scImpute), we obtained 20 up-regulated markers (LogFC > 1.75); however, by using the raw dataset, we found 19 up-regulated markers; 18 markers are common. When we used the raw dataset, we failed to find two biomarkers that were highly significant; namely,
(LogFC = 2.148, FDR =
) and
(LogFC = 2.090, FDR =
). Of note, using the raw dataset, we had one new marker which is
(LogFC = 2.062, FDR =
). It is clear that, according to LogFC and the FDR-corrected
p-value, the markers
and
were more significant than
. In summary, if we removed the data imputation step from our framework, the performance of the framework would decrease.
Figure 7a shows the obtained up-regulated markers for scImputed data and raw (without imputation phase) data. In case (ii), we used raw data without feature selection. In this scenario, we used all features (i.e., 20,214 features) of the Yan (GSE36552) dataset. In this case, the result of clustering fell down significantly. The silhouette width of the best cluster decreased from +0.82 to +0.26 and the average silhouette width decreased from +0.75 to +0.21.
Figure 7b shows the silhouette plot of the dataset Yan (GSE36552) without the steps of data imputation and feature selection. From this performance, it is clear that if we removed the feature selection step from our proposed framework, it would create a negative impact on the performance of this framework. In summary, to improve the performance of our framework, data imputation, and feature selection are two necessary and crucial steps.
4. Conclusions and Future Scope
In this article, we followed several intermediate steps such as matrix imputation, feature selection, cell clustering, and a statistical hypothesis test to detect differentially expressed genes initially for signature detection in scRNA-seq data. To discard “drop-out” problems and outliers from the data, we applied a well-known statistical tool, scImpute. Next, to obtain the top 100 features, we applied the feature selection algorithm, MRMR, for further downstream analysis. Thereafter, we applied shrinkage clustering on the 100 top features to determine cell clusters through global optimization. As a result, we detected nine clusters with accuracy metrics such as normalized mutual information (NMI), Rand index, and F1 scores. To check the cluster validity index, we applied silhouette width measure and found that the best cluster had the highest silhouette width(=0.54) of 1368 cells with different labels. Next, we considered the best cluster as the experimental group and the rest of the clusters as the control group and applied the Limma-Voom R tool, employing the empirical Bayes test to detect differentially expressed genes having an FDR cutoff of less than 0.001. Furthermore, we performed Gene Set Enrichment Analysis (GSEA) on the resultant gene markers using David 6.8 software. Interestingly, we identified targeter miRNAs associated with the detected gene markers using the MiRWalk database, and then performed miRNA target gene interaction network analysis using the Cytoscape online tool to validate the biological significance and relevance of our detected gene signatures. In this paper, we solved the limitation of our last published work, Seth et al., 2022, by omitting dropout values through matrix imputation. Moreover, we provided a comparative study between our 100 presently detected gene markers and our previously detected top 100 (based on FDR ranking) cluster-specified gene markers of the published framework by Seth et al., 2022. We found three common markers; namely, , , and 97 novel markers (the top 5 novel markers are , , , , and ). In addition, we further performed Gene Set Enrichment Analysis (GSEA) on both the gene-marker sets and obtained similar kinds of outcomes. Furthermore, we have presented another comparative study with the published framework of Mallik and Zhao, 2019 and showed that the gene markers detected by our current proposed model are more significant than the other. Furthermore, to show the efficiency of our framework, we used another single-cell RNA sequencing dataset (Yan dataset). After analyzing every step of our framework through this dataset, we obtained clusters with high accuracy metrics and validity index. We also detected 20 up-regulated markers by logFC from 100 bio-markers. We have conducted a comparative study between the scImpute imputation method and the MAGIC imputation method and proved that scImpute is comparatively better. Furthermore, we considered 500 top features for both datasets and compared the result with our framework by considering the top 100 features. We found that the result is almost similar to changing the feature count since MRMR never changes the rank of a feature. In addition, we have added here an ablation study by considering two cases, one without the imputation method and the other without imputation and feature selection methods. The results proved that both steps are essential for this framework. In conclusion, it can be said that our framework is useful for biological interpretation of the single-cell sequencing data analysis and for efficiently identifying the differentially expressed gene signatures. We would like to summarize the advantages of our proposed framework as below.
This framework discards “drop-out”problems and outliers from the data without hampering the biological relevance of data using the scImpute method.
The feature selection step of our framework helps to improve the performance of this framework by considering significant features based on minimum redundancy and maximum relevance (MRMR) and ignoring redundant features. It is proved by ablation study that the inclusion of redundant features decreases the performance of a model significantly.
The clustering phase of this framework helps to identify highly validated cell clusters through global optimization with respect to silhouette width. In single-cell RNA sequencing data analysis, cell clustering plays an important role in the identification of existing and novel types of cells, the depiction of cells, cell fate prediction, the classification of several types of tumors, and the investigation of heterogeneity in different cells. In this framework, we obtained significant superclusters using a matrix factorization-based algorithm, shrinkage clustering.
This framework obtains up-regulated markers using the statistical tool Limma-Voom. up-regulated gene expressions are significantly correlated with the detection and variation in rare or critical diseases like cancer. up-regulated gene expression in tissue served as an independent detector of any risk and a promising biomarker to predict further changes in differently expressed (diseased) cells/samples and to detect rare cells. Using pathway and gene ontology enrichment analysis, we found that our markers are biologically significant.
In our future study, we will apply our framework to more high-dimensional single-cell sequencing datasets to measure its scalability and robustness. Furthermore, to improve the clustering process, we will extend our current work by including multi-objective optimization techniques. It can be used for big data analysis and the discovery of rare cells in single-cell RNA sequencing data. In this model, we focused more on computational prospects rather than the typical biological characteristics of the dataset. However, in the future, to resolve any kinds of biological errors in the dataset, we are keen to extend our framework and validate the findings by some standard biological test with the collaboration of a wet laboratory.