Identifying Genetic Signatures from Single-Cell RNA Sequencing Data by Matrix Imputation and Reduced Set Gene Clustering

Seth, Soumita; Mallik, Saurav; Islam, Atikul; Bhadra, Tapas; Roy, Arup; Singh, Pawan Kumar; Li, Aimin; Zhao, Zhongming

doi:10.3390/math11204315

Open AccessArticle

Identifying Genetic Signatures from Single-Cell RNA Sequencing Data by Matrix Imputation and Reduced Set Gene Clustering

¹

Department of Computer Science and Engineering, Future Institute of Engineering and Management, Narendrapur, Kolkata 700150, West Bengal, India

²

Department of Computer Science and Engineering, Aliah University, Kolkata 700160, West Bengal, India

³

Department of Environmental Health, Harvard T H Chan School of Public Health, Boston, MA 02115, USA

⁴

Center for Precision Health, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA

⁵

Department of Computer Science and Engineering, University of Kalyani, Kalyani 741235, West Bengal, India

⁶

Department of Computer Science and Engineering, Budge Budge Institute of Technology, Kolkata 700137, West Bengal, India

⁷

Department of Information Technology, Jadavpur University, Jadavpur University Second Campus, Plot No. 8, Salt Lake Bypass, LB Block, Sector III, Kolkata 700106, West Bengal, India

⁸

Shaanxi Key Laboratory for Network Computing and Security Technology, School of Computer Science and Engineering, Xi’an University of Technology, Xi’an 710048, China

⁹

Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA

^*

Authors to whom correspondence should be addressed.

Mathematics 2023, 11(20), 4315; https://0-doi-org.brum.beds.ac.uk/10.3390/math11204315

Submission received: 3 August 2023 / Revised: 28 September 2023 / Accepted: 12 October 2023 / Published: 17 October 2023

Download

Browse Figures

Versions Notes

Abstract

:

In this current era, the identification of both known and novel cell types, the representation of cells, predicting cell fates, classifying various tumor types, and studying heterogeneity in various cells are the key areas of interest in the analysis of single-cell RNA sequencing (scRNA-seq) data. Due to the nature of the data, cluster identification in single-cell sequencing data with high dimensions presents several difficulties. In this paper, we introduce a new framework that combines various strategies such as imputed matrix, minimum redundancy maximum relevance (MRMR) feature selection, and shrinkage clustering to discover gene signatures from scRNA-seq data. Firstly, we conducted the pre-filtering of the “drop-out” value in the data focusing solely on imputing the identified “drop-out” values. Next, we applied the MRMR feature selection method to the imputed data and obtained the top 100 features based on the MRMR feature selection optimization scores for further downstream analysis. Thereafter, we employed shrinkage clustering on the selected feature matrix to identify the cell clusters using a global optimization approach. Finally, we applied the Limma-Voom R tool employing voom normalization and an empirical Bayes test to detect differentially expressed features with a false discovery rate (FDR) < 0.001. In addition, we performed the KEGG pathway and gene ontology enrichment analysis of the identified biomarkers using David 6.8 software. Furthermore, we conducted miRNA target detection for the top gene markers and performed miRNA target gene interaction network analysis using the Cytoscape online tool. Subsequently, we compared our detected 100 markers with our previously detected top 100 cluster-specified markers ranked by FDR of the latest published article and discovered three common markers; namely,

C y p 2 b 10

,

M t 1

,

A l p i

, along with 97 novel markers. In addition, the Gene Set Enrichment Analysis (GSEA) of both marker sets also yields similar outcomes. Apart from this, we performed another comparative study with another published method, demonstrating that our model detects more significant markers than that model. To assess the efficiency of our framework, we apply it to another dataset and identify 20 strongly significant up-regulated markers. Additionally, we perform a comparative study of different imputation methods and include an ablation study to prove that every key phase of our framework is essential and strongly recommended. In summary, our proposed integrated framework efficiently discovers differentially expressed stronger gene signatures as well as up-regulated markers in single-cell RNA sequencing data.

Keywords:

single-cell sequencing; gene signature; data imputation; feature selection; maximum relevance and minimum redundancy; shrinkage clustering; differentially expressed markers

MSC:

92B05

1. Introduction

The most essential component of life known to exist are cells.The cells that make up our bodies can respond to stimuli, reproduce, process information, and carry out various chemical activities. However, to this day, nothing is understood about how cells interact, how their organelles react to chemicals, or how they behave inside of cells. A typical ensemble measurement of millions of cells put together is insufficient to provide precise information about cells for the understanding of cellular physiological interactions with various chemicals and organelles. Again, cellular heterogeneity features and dynamics in a given cell population cannot be investigated at the molecular level by bulk analysis of millions of cells. The physiological states of cells and the variability of cells within a particular population, however, have made it difficult to conduct investigations into the progression of any disease. Single-cell analysis (SCA), as opposed to the typical study of millions of cells in bulk, provides clear information on each cell, such as particular biological factors of the cell, which is helpful to understand stem cell activity or tumor progression [1,2]. All medical disciplines such as oncology, microbiology, neurology, reproduction, and immunology greatly benefited from single-cell RNA sequencing (scRNA-seq) technologies [3,4]. To detect the genome, transcriptome, other multi-omics of single cells, and single-cell sequencing technologies could be used. It also requires additional multi-omics data analysis to reveal the variations in cell populations and the evolutionary connections between cells. However, conventional RNA sequencing techniques have some significant drawbacks. For instance, they are unable to examine a few cells and can only determine the average of many cells. On the other hand, scRNA-seq technology can identify heterogeneity among individual cells and solve the problem of traditional sequencing technology potentially losing information about cellular heterogeneity. The workflow of scRNA-seq technology entails separating a single cell from a population of cells, analyzing cell heterogeneity, molecular mapping, and monitoring immune infiltration and epigenetic alterations. The current research focusing on scRNA-seq data analysis includes the classification of various types of tumors, examination of heterogeneity in various cell types, identification of existing and novel types of cells, cell representation, and cell fate prediction [5]. Single-cell clustering plays an important role in conducting such a study. Cell clustering is an essential technique for identifying cluster-specific gene signatures for reconciling cell type configuration. The aim of this work is to identify the gene signature as differentially expressed and to simplify the bulk RNA-seq expression data by removing noise. However, several factors including high dimensionality, measurement noise, detection limit, and imbalanced size between uncommon and abundant populations make the computational analysis of scRNA-seq data difficult. One crucial aspect of scRNA-seq data that contributes to all of these difficulties is a phenomenon known as “drop-out” in which a gene is seen to have a low or moderate amount of expression in one cell but not in another of the same cell type [6]. These dropout events arise from the stochasticity of mRNA expression, ineffective mRNA capture, and low levels of mRNA in individual cells. The dropouts are a major contributor to the sometimes very sparse scRNA-seq data. The data are zero-inflated due to the overwhelming zero counts, only capturing a portion of each cell’s transcriptome. Dimensionality reduction and clustering, which both have a significant impact on the analysis that comes after scRNA-seq data, are the two most crucial methods for overcoming analysis problems. To overcome the dropouts, almost all currently used computational approaches for scRNA-seq included gene selection and dimension reduction techniques. Pre-processing of the data with principal component analysis (PCA) [7] or t-Distributed Stochastic Neighbor Embedding (tSNE) [8] to accomplish dimension reduction and choose highly variable genes is a frequent approach for clustering and trajectory finding methods. There are various imputation techniques intended to specifically omit dropouts. MAGIC [9], SAVER [10], scImpute [11], and RESCUE [12] are a few recent examples. These imputation techniques often determine gene–gene or cell–cell similarities using highly variable genes and dimension reduction, which serves as the foundation for imputing the dropouts with acceptable values. The key component of all these approaches is to concentrate the analysis on highly variable genes that are less susceptible to dropouts, which has been shown to be efficient since genes with high variability are frequently able to capture the cell-to-cell heterogeneity of significant phenotypes.

Automatically detecting the dropout finder, scImpute [11] solely imputes these dropout values, avoiding the inclusion of additional biases to the remaining data, and whereas MAGIC and SAVER would change all gene expression levels, even those unaffected by dropouts, this might add new biases into the data and even eliminate significant biological variance [13].In addition, outlier cells are found by scImpute and are disregarded for imputation. According to analysis using both the real and simulated human and mouse scRNA-seq data, scImpute is a useful method for recovering transcriptome dynamics obscured by dropouts. It has been demonstrated that scImpute can help the research of gene expression dynamics as well as improve the clustering of cell sub-populations, increase the accuracy of differential expression analyses, and identify the likely dropouts. One interesting phenomenon of scImpute is that it can be integrated by the majority of currently used pipelines or downstream analyses of scRNA-seq data, including normalization, differential expression analysis, clustering, classification, etc. The authors in Li et al. [11] claimed that a new analyzing tool specifically developed for the imputed scRNA-seq data by scImpute improved the performance over the state-of-the-art methods developed for raw scRNA-seq data by incorporating some features to decrease excess zero or near zero expression, dropout rates, and dropout probabilities estimated through the mixture models. Another key characteristic of scImpute is that it just uses two parameters, which are simple to comprehend and choose. The number of potential cell populations is indicated by the first parameter, K. It can be selected based on the raw data clustering results and the users’ preferred level of resolution. Users can choose a relatively small K, if they are only interested in the differences between the major clusters, in which scImpute can borrow more information from individual cells; otherwise, they can choose a relatively large K, in which scImpute would be more cautious during the imputation process. The second parameter is a probability threshold t for dropouts. In article [11], authors proved that scImpute can handle a range of parameter values and most scRNA-seq data can be used with a default threshold of 0.5. In addition, for the scImpute technique, cell type information is not required. The computing efficiency can be significantly increased if a filtering step on cells can be carried out using biological information. ScImpute scales up effectively as the number of cells rises.

Discriminant analysis is widely used in bioinformatics for tasks like separating cancerous tissues from healthy tissues or one cancer subtype from another, predicting the fold or super-family of proteins from their sequences, etc. [14]. Feature selection is a crucial problem in discriminant analysis: rather than using every variable (feature or attribute) in the data, one picks a subset of features to be employed in the discriminant system [15,16]. Feature selection has a number of benefits, including: (i) dimension reduction to decrease computation costs; (ii) noise reduction to increase classification accuracy; (iii) more observable features or characteristics to aid in identifying and keeping track of the targeted diseases or function types [17,18]. To reduce the whole feature set into one that is more accessible, various feature selection techniques are available [19,20,21]. Filter methods, wrapper methods, and embedded methods are generally all three categories into which these techniques fall. The effectiveness of the computation and generalization to various machine learning models are benefits of the filter approach. In our framework, we have studied a Maximum Relevance and Minimum Redundancy (MRMR) [14] feature selection method. Out of all of the filtering approaches, the MRMR method is preferred since it can efficiently eliminate the redundant features while retaining the necessary features for the model [22]. Since many significant features are correlated and redundant, it may happen that the resulting n best features might not be the actual (optimized) best n features. MRMR technique addresses this issue by choosing features that take into account both their redundancy within the selected features as well as their relevancy for predicting the outcome variable. Ding and Peng [14] proposed the MRMR algorithm and established that it selects features more accurately than the maximum relevance method alone. According to the notion behind MRMR, a feature should have a significant amount of mutual information if it is strongly differentially expressed for different classes. They employed mutual information in the MRMR model to assess the significance of the attributes. Due to this, MRMR is a feature selection technique that is frequently employed in the implementation of machine learning.

To categorize cells based on their similarity in gene expression profiles, cells can be grouped using an unsupervised machine learning phase, clustering. Researchers gain insights into the scRNA-Seq data and potential confounding variables from hidden patterns that emerge from clustering results. The most widely used scRNA-Seq clustering techniques are based on the three classes: Hierarchical, k-means, and graph-based. For powerful single-cell transcriptome analysis, the Louvain algorithm [23], a graph-based technique that looks for densely connected communities in the network, is the preferred approach. Other well-known applications that adopt this strategy include PhenoGraph [24], Seurat [25,26], Scanpy [27], and FAVA [28]. All of PhenoGraph, Seurat, and Scanpy methods are basically PCA as well as graph-based methods, which are very scalable, and have high accuracy in large datasets [29]. On the other hand, FAVA is a deep learning-based classification technique. Shrinkage clustering can be widely used to resolve biomedical clustering issues, especially when working with large datasets, due to its simplicity of implementation, computing efficiency, and flexible structure [30]. The shrinkage clustering approach, which is based on matrix factorization, partitions the data while concurrently determining the optimal number of clusters. Furthermore, due to the extensible structure of the algorithm, we can use it for clustering applications with minimum cluster size restrictions. The algorithm demonstrated significant resistance to noise, operated with high accuracy on both simulated and real data, and ran faster than some of the most widely used algorithms.

The process of assessing the accuracy of outputs from a clustering algorithm is known as cluster validation. This is significant both when trying to compare two clustering techniques and when trying to avoid seeing patterns in random data. Clustering validation statistics can often be divided into three types: internal cluster validation, which assesses the quality of a clustering structure using only the internal data from the clustering process and does not rely on outside data; external cluster validation involves comparing cluster analysis findings to an externally known outcome, like externally supplied class labels; relative cluster validation compares the results of the same procedure with different parameter values to determine the clustering structure. Internal validation metrics frequently represent the separation, connectivity, and compactness of cluster partitions. The degree of closeness among the objects in a cluster is referred to as compactness. How well a cluster is segregated from other clusters is determined by its separation. Connectivity measures how closely items are clustered with their closest neighbors in the data space. The silhouette coefficient is a popular metric for evaluating the quality of clustering [31]. The silhouette analysis calculates the average distance between clusters and assesses how well an observation is clustered. The distance between each point in one cluster and points in its neighboring clusters is shown on the silhouette plot. The advantage of this method is it can detect outliers if present in a cluster. The silhouette validity index measure normally computes three matrices: (i) the silhouette width for each data point, (ii) the average silhouette width for each cluster, and (iii) the overall average silhouette width for the total dataset.

Dimensionality reduction and clustering have drawn more attention in recent years for the interpretation of single-cell RNA sequencing data [32]. In our last published study [33], we presented an integrated clustering model with dimensionality reduction for the identification of cluster-specific biomarkers in single-cell sequencing data. However, we applied this model to the raw data matrix. Since the “dropout” phenomenon is a crucial challenge of scRNA-seq data analysis, a data imputation or imputed matrix is desirable. In this article, we propose another framework by integrating the imputed matrix, MRMR feature selection, and shrinkage clustering to detect gene signatures from scRNA-seq data. Our developed integrated model consists of several steps: pre-filtering, feature selection, clustering, and differentially expressed marker detection. The motivation behind proposing such a framework with reduced set gene clustering for scRNA-seq data analysis is that scRNA-seq provides a potent tool for analyzing the complexity of biological tissues by identifying cell sub-populations along with clustering techniques. Enhancing the accuracy and interpretability of single-cell clustering requires feature selection [34].

2. Methods

The steps of our proposed framework are demonstrated in Figure 1.

2.1. Data Sources

We used a single-cell RNA sequencing dataset for the rare intestinal cell types in mice (NCBI accession ID: GSE62270) in this study, which can be found in the NCBI Gene Expression Omnibus repository, accessed on 19 August 2015 https://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/geo/query/acc.cgi?acc=GSE62270. The dataset consists of 23,630 features, 1872 cells, and 9 classes [35]. In addition, we also used another dataset, the Yan dataset (NCBI accession ID: GSE36552), which is responsible for tracing the pluripotency of human early embryos and embryonic stem cells by single-cell RNA sequencing. This dataset [36] consists of 20,214 features, 90 cells, and 7 classes. The Yan dataset is also available in the NCBI Gene Expression Omnibus repository, accessed on 10 August 2013. https://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/geo/query/acc.cgi?acc=GSE36552.

2.2. Matrix Imputation

The major challenge researchers face in single-cell sequencing data analysis is a phenomenon called “dropout”. The “dropout” problem means that a feature is expressed at a low or moderate expression level in one cell but in another cell of the same cell type the feature is not detected [6]. To overcome this problem, we have used a popular single-cell imputation algorithm called scImpute on the experimental dataset. scImpute [11] is a statistical method for imputing the dropouts in scRNA-seq data with high accuracy and robustness. scImpute can identify likely dropouts automatically, and it performs imputation on those values only; however, it does not introduce any new biases to the rest of the data. scImpute also detects outlier cells and eliminates them from imputation. scImpute increases the clustering of cell sub-populations, improves the accuracy of differential expression analysis, and supports the study of gene expression dynamics. scImpute works using the following steps: first, it uses learning techniques on each gene’s dropout likelihood in individual cells based on a mixture model. Thereafter, by using data from other similar cells having the same genes and being chosen based on genes that are unlikely to be impacted by dropout occurrences, scImpute imputes the (highly probable) dropout values in the cell [13].

2.3. Max Relevance Min Redundancy (MRMR) Feature Selection

For filtering large-scale data, benchmark prefers to select the top-ranked features, say, the top 100 or 500 features [14]. The feature selection methods can be categorized into three types: filter methods, wrapper methods, and embedded methods. The MRMR method is a widely used filter-based feature selection method for machine learning implementation [22]. Applying filter method in different machine learning models is advantageous due to the computation efficiency and generalizability. However, among various filter methods, we have chosen MRMR because it can effectively reduce the redundant features while keeping the relevant features for the model [22]. Maximum filter methods are based on data distribution and correlation [22]. Frequently, it is pointed out that simply incorporating a “highly effective” feature with another “highly effective” feature sometimes cannot form a better feature set. The main reason is that these two features are highly correlated but they raise the issue of “redundancy” of the feature set. That means the n best features are not the best n features, there are many important features which are correlated and redundant. To address this problem, MRMR is the best option since it selects features on the basis of both, the relevance for predicting the outcome variable and the redundancy within the selected features. Specifically, Ding and Peng [14] proposed the MRMR algorithm and they proved that features selected by this process lead to higher accuracy than features selected via maximum relevance only. The MRMR method is based on the concept, if a feature was differentially expressed among distinct classes, it would have large mutual information. To measure the relevance of features, they used mutual information in MRMR model. MRMR feature selection algorithm runs in such a way, suppose A is a preliminary feature set with ‘N’ features and B is the resultant feature set. The goal is to select top ‘t’ features. Initially set A contains ‘N’ features and set B is a NULL set. In step 1, normalized mutual information is computed for individual features with respect to class labels, then in step 2, we choose the feature that has the highest score of normalized mutual information among all features and move this feature to the resultant set B. So, set B now contains one top feature which is the best feature. Thereafter, to determine the second top feature, we identify the current most relevant and lowest redundant feature from set A and move to set B. By repeating the iterations, top ‘t’ features can be selected and added in set B [37].

In our proposed model, we apply the MRMR filter to the imputed matrix, which is continuous data. We used the R tool mRMRe [38] with the mRMR.classic() function which converts the imputed matrix to the mutual information matrix(mim) using

c o n t i n u o u s_e s t i m a t o r

which is an A correlation between continuous features. Anyone from the set of correlations

{p e a r s o n, s p e a r m a n, k e n d a l l, f r e q u e n c y}

can be set here. By default, Spearman’s correlation is chosen. If the mutual information function is

M I (f_{i}, f_{j})

and F denotes the subset of features we are looking for, then the minimum redundancy condition is [14].

m i n R d_{t}, R d_{t} = \frac{1}{{| F |}^{2}} \sum_{f_{i}, f_{j} \in F} M I (f_{i}, f_{j})

(1)

To measure the level of discriminant powers of features by considering that they are expressed differentially for various target classes, we again compute mutual information

M I (h, f_{i})

between targeted classes

c = {c_{1}, c_{2}, \dots, c_{k}}

(c is also termed as a classification variable), and the expression of feature

f_{i}

.

M I (h, f_{i})

evaluates the relevance of

f_{i}

for the classification task. To obtain the maximum relevance, it is required to maximize the total relevance of all features in F [14]

m a x R l_{t}, R l_{t} = \frac{1}{| F |} \sum_{f_{i} \in F} M I (c, f_{i})

(2)

For obtaining the MRMR feature set, it is necessary to optimize Equations (1) and (2) simultaneously. Simplifying the MRMR optimization of both conditions requires combining them into a single criterion function. Ding and Peng [14] mentioned the two simplest combination criteria:

m a x (R l_{t} - R d_{t})

(3)

m a x (R l_{t} / R d_{t})

(4)

In our proposed framework, we obtained the top 100 features based on MRMR scores and used them only in our further procedures since MRMR produces the best results for any prediction method.

2.4. Cell Clustering

The major challenge of clustering is handling large datasets, especially in biomedical applications. Shrinkage clustering [30] can solve this issue. This clustering technique is based on matrix factorization that simultaneously splits the dataset and computes the optimal number of clusters. This algorithm is mathematically straightforward, computationally efficient, and structurally flexible. In our proposed framework, we have applied shrinkage clustering to the cells of the top 100 feature sets. The working principle of this algorithm follows two steps: one is to obtain similarity relationships among all objects (e.g., Euclidean distance), the other is to cluster objects based on these relationships. The objective function of shrinkage clustering is formulated as

m i n_{X} f (X) = \sum_{i = 1}^{N} \sum_{j \in {j | X_{i} = X_{j}}} (1 - 2 S_{i j})

(5)

where

S_{N * N}

denotes similarity matrix,

S_{i j} \in [0, 1]

, i, j are respective objects.

X_{N * K}

is the clustering solution for objects with similarity relationships

S_{N * N}

and K is the number of clusters obtained.

X_{i k} \in [0, 1]

and

\sum_{k = 1}^{K} X_{i k} = 1

.

X_{i k}

= 1 if object i belongs to cluster k, otherwise it is 0.

Shrinkage clustering is executed in the following way. At first, there is the need to assign objects randomly to a large enough number of initial clusters. Through every iteration, the algorithm first takes out any empty clusters detected in the prior iteration, it shrinks the number of clusters slowly. After that, it permutes the cluster membership of that object which minimizes the objective function at most. The algorithm reaches the termination stage either when there are no possible cluster membership permutations, which can further minimize the objective function, meaning the solution converges or when a predefined maximum number of iterations is reached. It is guaranteed shrinkage clustering always converges to a local optimum.

In our proposed framework, we applied an R package of the shrinkage clustering algorithm, called the “shrinkageClust” package [39]. The package has a function simiMatrix() which converts a feature matrix to a similarity matrix. The main function of this package is SuperCluster (), which takes the similarity matrix as input and obtains an optimal number of clusters. The function is defined as [39] SuperCluster (s,w = NULL, k = NULL, iter = 1000, random = 1, Path = F), where ‘s’ denotes the similarity matrix, ‘w’ denotes the minimum number of members to form a cluster, ‘k’ represents the maximum number of clusters, and ‘iter’ denotes the number of iterations for updating the clustering assignment. It converges to a local optimum. Finally, ‘random’ denotes the number of random initializations, which is essentially how many times the function will re-run the algorithm with different initial cluster assignments. To run the algorithm with different random initialization is sometimes needed to reach the global optimum. The path is a boolean variable, which allows users to see how the number of clusters changes in each iteration.

In our model, we have applied shrinkage clustering to the feature matrix, consisting of the top 100 features and 1872 cells for cell clustering. There are nine known group labels of cells in our dataset. After obtaining the similarity matrix, we applied SuperCluster (s,w = 48, iter = 1000, random = 1, Path = F) to reach a local optimum and notice that cluster structures are changing over time in different multiple time runs. To solve this issue and reach the global optimum, we re-run the entire procedure with different initial cluster assignments up to 10 times by updating the value of random variables from 1 to 10, i.e., SuperCluster (s,w = 48, iter = 1000, random = 10, Path = F). The PCA plot is used to visualize the clusters. To quantitatively assess the accuracy of the clustering result, we have computed evaluation scores of clusters such as normalized mutual information (NMI) scores [40], Rand index scores [41], and F1 scores [42]. Normalized mutual information (NMI) plays an efficient role in overlapping cluster detection [40]. The Rand Index normally measures the similarity between two cluster results [41]. The F1 score is widely used to assess the performance of clustering and classification, which measures the harmonic mean of precision and recall [42]. Apart from this, after clustering, we proceed to detect the best cluster using the cluster validity measure.

2.5. Cluster Validity Evaluation

To decide whether a clustering result is acceptable or not, several cluster validity techniques are needed. We also performed an evaluation of the results to assess the quality of clusters. We have chosen the silhouette method [31] since it computes silhouette coefficients of each point that measure how much a point is similar to its own cluster compared to other clusters. This method is based on the separation distance between the resulting clusters. The advantage of this method is it can detect outliers if present in a cluster. The silhouette validity index measure normally computes three of several matrices: (i) the silhouette width for each data point, (ii) the average silhouette width for each cluster, and (iii) the overall average silhouette width for the total dataset. The silhouette width for a particular data point(i) is formulated as

S_{i} = (y_{i} - x_{i}) / m a x (x_{i}, y_{i})

(6)

where

x_{i}

is the average dissimilarity of the ith data point to all other points in the same cluster, and

y_{i}

is the minimum of average dissimilarity of ith data point to all data points in another cluster. According to Equation (6), the silhouette index (

S_{i}

) value should be in the range of −1 to +1. The

S_{i}

value close to 1 means that the data point is assigned to an almost correct cluster. When the

S_{i}

value is close to zero, it means that that data point could be more fit for another close cluster, as it is the same distance from both clusters.

S_{i}

value close to −1 indicates that either the data are in the wrong cluster or it is an outlier [43].

Here, we detected the cluster with the highest silhouette index as the best cluster and applied a pairwise comparison to the best cluster versus the rest.

2.6. Identifying Differentially Expressed Genes Using Limma-Voom

In this step to identify differentially expressed features for downstream analysis, we applied the empirical Bayes test using Limma [44] after applying Voom normalization [45]. According to benchmark methods, the performance of Limma is very good for any kind of data distribution for any sample size. The definition of the moderated t-statistic of Limma is as follows [44]:

{\bar{t}}_{k} = (\frac{1}{\sqrt{(\frac{1}{m_{1}}) + (\frac{1}{m_{2}})}}) * (\frac{\hat{β_{k}}}{{\bar{p}}_{k}})

(7)

where

m_{1}

denotes the sample size for the diseased group,

m_{2}

signifies the sample size for the control group, and the total sample size

m = m_{1} + m_{2}

.

\hat{β_{k}}

,

p_{k}

notify the corresponding contrast estimator and posterior sample variance for the genes, respectively, [46,47,48].

Here, we have taken the best cluster as one group (experimental group) and all other clusters as another group (control group) to find differentially expressed genes. We detected biomarkers by the p-value and the Benjamini–Hochberg(BH) [49]-adjusted p-value, which is known as the false discovery rate (FDR). We used the t-table or cumulative distribution function (cdf) to find the false discovery rate (FDR) through empirical Bayes t-statistics. In the method, empirical Bayes t-statistics using Limma, p-value actually measures the probability of observing a t-value that is either equal to or greater than the actually observed t-value in which the given null hypothesis is true. We also computed log fold change(LogFC) value to detect highly expressed or up-regulated markers.

2.7. Gene Set Enrichment Analysis

The function of Gene Set Enrichment Analysis (GSEA) is to assess the potential function, biological significance, and disease relevance of a list of signature genes. After detecting cluster differentially expressed genes, we applied KEGG pathways and gene ontology (GO) annotations (three domains: biological process (BP), cellular component (CC), and molecular function (MF)) by DAVID 6.8 software [50]. We obtained all KEGG pathways and gene ontology (GO) terms along with the number of genes in that pathway or GO-term, enriched adjusted p-value, and FDR. We kept KEGG pathways or GO terms whose FDR were less than 0.05.

2.8. miRNA-Gene Interaction Network Analysis

Furthermore, we incorporated miRNA target gene network analysis into our model. To analyze the biological significance and disease relevance of our signature genes, firstly, we detected targeted miRNAs of the identified gene markers using miRWalk online database [51]. After detecting all miRNAs with interacting gene signatures, we applied the Cytoscape [52] online tool to network and sub-network analysis using several centrality measures viz., degree centrality, betweenness centrality, and closeness centrality. The term degree centrality measures the number of edges associated with specific nodes (biomolecules). Betweenness centrality means the shortest path centrality which measures the number of shortest paths passing through a node. Closeness centrality calculates the sum of the length of the shortest paths between the node and all other nodes in the network. Further, for the gene sets of the most enriched GO-terms by the GSEA database, we applied Cytoscape-based bi-molecular network analysis to enhance the validation of the efficiency and effectiveness of the detected gene signatures.

3. Result and Discussion

In this study, we used a scRNA-seq dataset for rare intestinal cell type in mice (GEO ID: GSE62270) which contains 23,630 features, 1872 cells, and 9 classes. Here, we attempted to resolve the “drop-out” problem of scRNA-seq data by our proposed framework for gene signature detection through integrated strategies of imputed matrix, MRMR feature selection, and shrinkage clustering from scRNA-seq data. To solve the “drop-out” problem, we used a statistical tool scImpute on the dataset. scImpute can automatically detect likely dropouts, and it performs imputation on those values only; however, it does not introduce any new biases to the remaining data. scImpute also eliminates outlier cells and eliminates them from imputation. Thereafter, we applied the feature selection algorithm, MRMR, to the imputed data matrix.Since MRMR produces one of the most promising results for any prediction method, we obtained the top 100 features based on the MRMR optimization scores for further downstream analysis. After feature selection, we obtained a feature sub-matrix that contains 100 features and 1872 cells. Next, we applied shrinkage clustering to detect cell clusters. Using this clustering technique, we first computed the similarity matrix and then applied the function SuperCluster (s,w = 48, iter = 1000, random = 1, Path = F) to reach the optimum. To reach the global optimum, we reran the entire procedure with a different initial cluster assignment up to 10 times by updating the value of the random variable from 1 to 10, i.e., SuperCluster (s,w = 48, iter = 1000, random = 10, Path = F). Our dataset has nine known class labels of cells. The shrinkage clustering also found nine clusters from the dataset. Figure 2a shows a PCA plot to visualize the clustering result. This resulting cluster provided normalized mutual information (NMI) scores (=0.0584), Rand index score (=0.4670), and F1 score (=0.2314). Thereafter, to evaluate the validity of the cell clusters, we used the cluster validity silhouette index. Using the silhouette index, we found correct clusters and wrong clusters. From those clusters, we chose one as the best cluster, which has the highest silhouette index, (=0.54). The best cluster contains 1368 cells with different labels. Table 1 shows how many cells of the best cluster rely on each label. Figure 2b shows silhouette width measure. According to the plot, the cluster denoting number 6 is the best cluster.

After that, in the differential expression analysis phase, we took the best cluster as the experimental group and the rest of the clusters as the normal group. Then we applied voom normalization for pre-processing data. After differential expression analysis, we obtained 100 differentially expressed features by Limma having

F D R < 0.001

, in a list accompanied by computed t-score, p-value and FDR. The top 25 differentially expressed genes (DEGs) are shown in Table 2.

After detecting 100 differentially expressed features or gene markers, we utilized them for further Gene Set Enrichment Analysis by David 6.8 software [50]. In order to obtain enriched KEGG pathways and gene ontology (GO) terms (viz., biological process (BP), cellular component (CC), and molecular function (MF)), along with Bonferroni and FDR scores, we applied the DAVID 6.8 database to our differentially expressed markers. Here, we used our input dataset in the DAVID 6.8 software’s recommended format which is basically a list of gene names in a single column, with

O F F I C I A L_G E N E_S Y M B O L

as the identifier and Mus musculus as the species. Unfortunately, due to the quality of the used dataset, we failed to obtain a significant KEGG pathway (enriched p-value < 0.05). However, we identified several significant GO-terms (viz., GO:BP, GO:CC and GO:MF) (enriched p-value < 0.05). Significant GO-terms were discussed below, along with Table 3, Table 4 and Table 5 that provide further information. For example, GO:0019882 antigen processing and presentation was one of the top significant GO-BP terms. Four genes, ( Cd74, Unc93b1, Psmb8, Psmb9) were associated with this GO-BP term having p-value

3.69 \times 10^{- 4}

. Table 3 contains the top five significant GO-BP terms. We provided the list of all GO-BP terms in a Supplementary File, Table S1. Similarly, we found GO:0031225 anchored component of membrane as one of the top significant GO-CC terms. A total of six genes (named as Bst1, Akp3, Nrn1, Vnn1, Alpi, Car4) were associated with this GO-CC term having a p-value of

1.73 \times 10^{- 4}

. The rest of the terms are shown in Table 4. We provided the list of all GO-CC terms in a Supplementary File, Table S2. Furthermore, GO:0016595 glutamate binding was one of the top significant GO-MF terms. Three genes (named as Slc1a1, Slc1a3, Grin1) were associated with this GO-MF term having p-value

1.09 \times 10^{- 3}

. For details, see Table 5. We provided the list of all GO-MF terms in Supplementary File, Table S3.

In addition, we detected targeter miRNAs associated with the detected 100 gene markers using miRWalk database version 2022-1 [51]. miRWalk is basically an online database where miRNA target detection is an advanced search option of this database. In general, using the list of user-defined miRNAs, we can find their target genes. On the other hand, in a similar fashion, using the list of user-defined genes, we can easily identify the corresponding targeter miRNAs. Of note, during the use of miRWalk, it is mandatory to select the species and type of IDs (here it is the “official gene symbol”). Here, in a similar fashion, we identified corresponding targeter miRNAs from the list of earlier detected differentially expressed gene (DEG) symbols (as the input list in miRWalk) by selecting “mouse” as the species and “official Gene Symbol” as the type of ID. Thereafter, we constructed miRNA target gene interaction network analysis through the degree centrality by Cytoscape online tool (version 3.8.0) [52]. Table 6 shows more details of top markers and top miRNAs sorted by the degree centrality. Table 6 presents the top five gene markers ranked by their in-degree centrality, betweenness centrality and closeness centrality scores. Similarly, Table 6 provides the top five miRNAs accompanied by the out-degree centrality, betweenness centrality, and closeness centrality scores. We also provided the list of all gene markers and miRNAs in Supplementary File, Table S4. Figure 3 highlights the top ten nodes ranked by the degree centrality from the whole network of 100 gene markers and corresponding targeter miRNA interactions. Furthermore, we applied network analysis on the top GO terms (viz., GO:BP, GO:CC, and GO:MF). For the most enriched GO:BP, we obtained that four markers fall in this term ranked by the degree centrality score, i.e., Cd74 (score: 241), Unc93b1 (score: 228), Psmb8 (score: 46), and Psmb9 (score: 22). Similarly, for the most enriched GO:CC, we found six gene markers lied in this term ranked by the degree centrality score, i.e., Nrn1 (score: 461), Bst1 (score: 160), Alpi (score: 135), Car4 (score: 104), Vnn1 (score: 102) and Akp3 (score: 49). Similarly, for the topmost GO:MF, we identified three markers associated with this term ranked by the degree centrality score, i.e., Grin1 (score: 956), Slc1a1 (score: 403), and Slc1a3 (score: 299). Figure 4 depicts all of the sub-networks regarding the topmost GO-terms (highlighted the top ten nodes ranked by the degree centrality).

As a comparative study, we compared the identified 100 gene markers with our previously detected top 100 cluster-specified gene markers in the article by Seth et al., 2022 [33], ranked by FDR. We discovered three common gene markers between the papers; namely,

C y p 2 b 10

,

M t 1

and

A l p i

. The partial visualization is shown in the Venn diagram, Figure 2c. A detailed Venn diagram is shifted in Supplementary File, Figure S1. Furthermore, we applied Gene Set Enrichment Analysis (GSEA) on both of these gene marker sets and obtained similar outcomes. We obtained several common GO:CC terms viz., GO:0005737 cytoplasm (enriched p-value = 0.006), GO:0031526 brush border membrane (enriched p-value = 0.029), GO:0005829 cytosol (enriched p-value = 0.035), GO:0005743 mitochondrial inner membrane (enriched p-value = 0.05), and a common GO:MF term, GO:0016491 oxidoreductase activity (enriched p-value = 0.022). Moreover, we also performed an additional comparative study of our proposed model with the work Mallik and Zhao, 2019 [53], that used the same dataset. This work also identified the differentially expressed genes or gene markers, which are not common with the gene markers detected by our current model. However, the novel gene markers of our current model are more significant than the gene markers detected by the model of Mallik and Zhao, 2019 [53], with respect to an adjusted p-value. According to the previous model, the top ten genes (markers) were Rps21 (adjusted p-value =

4.36 \times 10^{- 37}

), Slc5a1 (adjusted p-value =

7.47 \times 10^{- 37}

), Crip1 (adjusted p-value =

1.28 \times 10^{- 36}

), Rpl15 (adjusted p-value =

5.71 \times 10^{- 35}

), Rpl3 (adjusted p-value =

6.49 \times 10^{- 35}

), Rpl27a (adjusted p-value =

1.50 \times 10^{- 34}

), Khk (adjusted p-value =

4.09 \times 10^{- 34}

), Rps3a1 (adjusted p-value =

1.00 \times 10^{- 33}

), Aldob (adjusted p-value =

1.19 \times 10^{- 33}

), and Rps17 (adjusted p-value =

1.84 \times 10^{- 33}

). Our currently proposed model detected the top ten markers which are BC064078 (adjusted p-value =

2.08 \times 10^{- 56}

), Psmb9 (adjusted p-value =

2.23 \times 10^{- 53}

), Cd74 (adjusted p-value =

1.31 \times 10^{- 52}

), Psmb8 (adjusted p-value =

3.12 \times 10^{- 52}

), Rangrf (adjusted p-value =

3.20 \times 10^{- 52}

), 1700097N02Rik (adjusted p-value =

5.92 \times 10^{- 52}

), Slc14a1 (adjusted p-value =

8.56 \times 10^{- 52}

), Fgf1 (adjusted p-value =

5.82 \times 10^{- 51}

), Cdo1 (adjusted p-value =

7.88 \times 10^{- 51}

), and Gm13157 (adjusted p-value =

1.30 \times 10^{- 50}

). It is quite clear that the gene markers detected by our model generate lesser adjusted p-values than another model that defines higher significance of our markers than the previous. Furthermore, our proposed model estimated a higher silhouette width score (=0.540) for the best cluster rather than the same (=0.482) by the other model which signifies better clustering of our method than the other since a higher value of silhouette width indicates good clustering. Thus, we can claim that our proposed model generates efficient clustering outcomes and significant gene signatures compared to the other model.

In addition, to measure the efficiency of our framework, we used another benchmark dataset with accession id GSE36552 (Yan dataset) [36], which is responsible for tracing the pluripotency of early human embryos and embryonic stem cells by single-cell RNA sequencing as input. This dataset consists of 20,214 features, 90 cells, and 7 classes. We analyzed this dataset through our framework. After obtaining an imputed data matrix by scImpute, we used the MRMR feature selection algorithm to fetch the top 100 features with respect to their feature scores. After that, we applied shrinkage clustering on our resultant 100 × 90 data matrix for cell clustering. In shrinkage clustering, after computing similarity matrix (s) we applied the function SuperCluster (s,w = 10, iter = 1000, random = 10, Path = F) to reach the global optimum. We finally obtained two clusters with accuracy metrics such as normalized mutual information (NMI) (=0.4117), Rand index (=0.5713), and F1 scores (=0.4589). Figure 5a visualizes the two clusters. Next, applying the silhouette validity index, we detected the best cluster with the highest silhouette width of +0.82, whereas the silhouette width of another cluster was +0.58. The silhouette-computed cluster plot is shown in Figure 5b. In this study, we obtained 64 cells in the best cluster and 26 cells in another cluster. Then, for differential expression analysis, we took the best cluster as the experimental group and another cluster as the control group. After differential expression analysis, we obtained 100 differentially expressed features by Limma Voom having

F D R

cutoff less than

0.001

, in a list accompanied by computed t-score, logFC, p-value, and FDR. We also obtained 20 up-regulated bio-markers with logFC cutoff greater than

1.75

. Twenty up-regulated bio-markers are shown in Table 7. Furthermore, we also applied the DAVID 6.8 database for GSEA. Here, we used our 100 differentially expressed genes as an input dataset in the DAVID 6.8 software’s recommended format which is basically a list of gene names in a single column, with

O F F I C I A L_G E N E_S Y M B O L

as the identifier and Homo Sapiens as the species. For this dataset, we identified several significant KEGG pathways and GO-terms (viz., GO:BP, GO:CC, and GO:MF) (enriched p-value < 0.05). Significant KEGG pathways and GO-terms were listed in Table 8, Table 9, Table 10 and Table 11. For further information of enriched KEGG pathways and GO-terms, Supplementary Files Tables S5–S8 are also included with this article.

In this framework, we have used the scImpute imputation method to overcome the dropout problem of single-cell sequencing data. However, there are various methods for data imputation. So, a question may arise as to why we have chosen scImpute in our proposed work. Though in the introduction section we already mentioned the limitation of other methods like MAGIC, SAVER, etc., to prove that drawback through the experiment, we applied the MAGIC method as a replacement for scImpute in our proposed framework. According to Li and Li [13], the imputed data from scImpute produce a clearer contrast between the up-regulated genes in various cell types, whereas the imputed data from MAGIC and SAVER are unable to recover this pattern. In our framework, we also found that the cluster validity index silhouette width for the best cluster is lower when using the MAGIC imputation algorithm (+0.73) than when using the scImpute algorithm (+0.82) for the Yan dataset. MAGIC also decreased the average silhouette width from +0.75 to +0.72. Figure 6 shows all clustering results for the MAGIC imputed Yan dataset (GSE36552). After that, we found differentially expressed genes (DEG) using Limma-Voom. Using the scImpute method in our proposed framework, we obtained 100 significant DEGs where FDR < 0.05; however, using the MAGIC method, we obtained 70 significant DEGs with this FDR cutoff. The rest of the 30 features are not significantly differentially expressed based on their adjusted p-value/FDR. In addition, to obtain up-regulated markers, we considered the LogFC value, and the cutoff of the up-regulated marker is LogFC > 1.75.By using scImpute, we obtained 20 up-regulated markers with the highest logFC = 2.182. Table 7 already represented a detailed list of 20 up-regulated markers. However, after using the MAGIC imputation method in our framework, we obtained only one up-regulated marker (

C A L C O C O 2

) with logFC = 1.845. Table 12 represents the top five markers for the MAGIC imputed Yan dataset (GSE36552). Supplementary Table S9 stores more details for all DEGs obtained from the MAGIC imputed dataset. Therefore consistent with Li and Li [13], we also concluded that the imputed data by scImpute had a stronger contrast between the up-regulated genes in different cell types, whereas the imputed data by MAGIC had limitations in recovering this pattern. After using the raw dataset for the ablation study, it was found that the result using the raw dataset was far better than the result using the MAGIC imputed dataset. One main reason is that MAGIC discarded many biological zeroes during imputation, which likely led to the loss of meaningful biological variation.

Furthermore, one more explanation of our proposed framework is also required to make it clearer to future researchers. In this work, using the MRMR feature selection algorithm, we have selected the top 100 features and used them for downstream analysis. A question may arise here: why have we taken the feature count 100 in this framework? The answer is very simple. The MRMR method is based on the concept that if a feature was differentially expressed among distinct classes, it would have large mutual information. To measure the relevance of features, they used mutual information in the MRMR model. Accordingly, in MRMR, the rank of features never changed. We changed the feature count and validated our result using all performance measure metrics. We have executed our framework using the top 500 rather than 100 features. However, we obtained similar clustering results and almost the same value for the silhouette index, Rand index (RI), NMI, and F1-score for the scRNA-seq dataset of rare intestinal cell type in mice (GEO ID: GSE62270). The silhouette index for the best cluster with the top 500 feature counts was +0.56 where it was +0.54 by using the top 100 features. The NMI, RI, and F1-score were 0.0596, 0.4739, and 0.2310, respectively, when considering the top 500 features. We obtained similar values for these three measures—the NMI scores (=0.0584), Rand index score (=0.4670), and F1 score (=0.2314)—by using the feature to count the top 100 from the same dataset.Therefore, for this dataset (GSE62270), there was no significant variation in results by using different feature counts. Similarly, for another dataset, Yan (GSE36552) [36], after considering the top 500 features, we obtained some minor variations in the silhouette index, and this dataset obtained the best clustering when it took the feature count of the top 100. Since the ranking of features in MRMR never changes, we can take different feature counts by using trial and error procedures depending on the quality and pattern of the dataset. We have added the silhouette cluster plot by taking the feature count 500 for both datasets GSE62270 and GSE36552 as Supplementary Figures S2 and S3, respectively.

Ablation Study of Our Proposed Model

To evaluate the effects of the key steps of our proposed method on the performance, we added two case studies here: (i) without imputation and (ii) without imputation and feature selection. In case (i), we used the raw dataset Yan (GSE36552), and obtained the same value for NMI, RI, F1-Score, and silhouette width, which we obtained by using the imputed (through scImpute method) dataset. The reason behind it is that in our dataset the number of redundant zeroes is low, and the maximum zeroes are the biological zero. That is why we have not observed variation in the result. However, when we use the dataset with a large number of redundant zeroes (significant dropout problem), we would have a large variation as a result, using raw data and imputed data. However, in our dataset, we also obtained a variation in the biomarker list. By using the imputed dataset (imputation through scImpute), we obtained 20 up-regulated markers (LogFC > 1.75); however, by using the raw dataset, we found 19 up-regulated markers; 18 markers are common. When we used the raw dataset, we failed to find two biomarkers that were highly significant; namely,

SERPINF 1

(LogFC = 2.148, FDR =

7.39 \times 10^{- 44}

) and

NCRNA 00081

(LogFC = 2.090, FDR =

1.75 \times 10^{- 42}

). Of note, using the raw dataset, we had one new marker which is

LGMN

(LogFC = 2.062, FDR =

1.34 \times 10^{- 39}

). It is clear that, according to LogFC and the FDR-corrected p-value, the markers

SERPINF 1

and

NCRNA 00081

were more significant than

LGMN

. In summary, if we removed the data imputation step from our framework, the performance of the framework would decrease. Figure 7a shows the obtained up-regulated markers for scImputed data and raw (without imputation phase) data. In case (ii), we used raw data without feature selection. In this scenario, we used all features (i.e., 20,214 features) of the Yan (GSE36552) dataset. In this case, the result of clustering fell down significantly. The silhouette width of the best cluster decreased from +0.82 to +0.26 and the average silhouette width decreased from +0.75 to +0.21. Figure 7b shows the silhouette plot of the dataset Yan (GSE36552) without the steps of data imputation and feature selection. From this performance, it is clear that if we removed the feature selection step from our proposed framework, it would create a negative impact on the performance of this framework. In summary, to improve the performance of our framework, data imputation, and feature selection are two necessary and crucial steps.

4. Conclusions and Future Scope

In this article, we followed several intermediate steps such as matrix imputation, feature selection, cell clustering, and a statistical hypothesis test to detect differentially expressed genes initially for signature detection in scRNA-seq data. To discard “drop-out” problems and outliers from the data, we applied a well-known statistical tool, scImpute. Next, to obtain the top 100 features, we applied the feature selection algorithm, MRMR, for further downstream analysis. Thereafter, we applied shrinkage clustering on the 100 top features to determine cell clusters through global optimization. As a result, we detected nine clusters with accuracy metrics such as normalized mutual information (NMI), Rand index, and F1 scores. To check the cluster validity index, we applied silhouette width measure and found that the best cluster had the highest silhouette width(=0.54) of 1368 cells with different labels. Next, we considered the best cluster as the experimental group and the rest of the clusters as the control group and applied the Limma-Voom R tool, employing the empirical Bayes test to detect differentially expressed genes having an FDR cutoff of less than 0.001. Furthermore, we performed Gene Set Enrichment Analysis (GSEA) on the resultant gene markers using David 6.8 software. Interestingly, we identified targeter miRNAs associated with the detected gene markers using the MiRWalk database, and then performed miRNA target gene interaction network analysis using the Cytoscape online tool to validate the biological significance and relevance of our detected gene signatures. In this paper, we solved the limitation of our last published work, Seth et al., 2022, by omitting dropout values through matrix imputation. Moreover, we provided a comparative study between our 100 presently detected gene markers and our previously detected top 100 (based on FDR ranking) cluster-specified gene markers of the published framework by Seth et al., 2022. We found three common markers; namely,

C y p 2 b 10 (F D R : 2.00 \times 10^{- 32})

,

M t 1 (F D R : 1.94 \times 10^{- 30})

,

A l p i (F D R : 2.61 \times 10^{- 26})

and 97 novel markers (the top 5 novel markers are

B C 064078 (F D R : 2.08 \times 10^{- 56})

,

P s m b 9 (F D R : 2.23 \times 10^{- 53})

,

C d 74 (F D R : 1.31 \times 10^{- 52})

,

P s m b 8 (F D R : 3.12 \times 10^{- 52})

, and

R a n g r f (F D R : 3.20 \times 10^{- 52})

). In addition, we further performed Gene Set Enrichment Analysis (GSEA) on both the gene-marker sets and obtained similar kinds of outcomes. Furthermore, we have presented another comparative study with the published framework of Mallik and Zhao, 2019 and showed that the gene markers detected by our current proposed model are more significant than the other. Furthermore, to show the efficiency of our framework, we used another single-cell RNA sequencing dataset (Yan dataset). After analyzing every step of our framework through this dataset, we obtained clusters with high accuracy metrics and validity index. We also detected 20 up-regulated markers by logFC from 100 bio-markers. We have conducted a comparative study between the scImpute imputation method and the MAGIC imputation method and proved that scImpute is comparatively better. Furthermore, we considered 500 top features for both datasets and compared the result with our framework by considering the top 100 features. We found that the result is almost similar to changing the feature count since MRMR never changes the rank of a feature. In addition, we have added here an ablation study by considering two cases, one without the imputation method and the other without imputation and feature selection methods. The results proved that both steps are essential for this framework. In conclusion, it can be said that our framework is useful for biological interpretation of the single-cell sequencing data analysis and for efficiently identifying the differentially expressed gene signatures. We would like to summarize the advantages of our proposed framework as below.

This framework discards “drop-out”problems and outliers from the data without hampering the biological relevance of data using the scImpute method.
The feature selection step of our framework helps to improve the performance of this framework by considering significant features based on minimum redundancy and maximum relevance (MRMR) and ignoring redundant features. It is proved by ablation study that the inclusion of redundant features decreases the performance of a model significantly.
The clustering phase of this framework helps to identify highly validated cell clusters through global optimization with respect to silhouette width. In single-cell RNA sequencing data analysis, cell clustering plays an important role in the identification of existing and novel types of cells, the depiction of cells, cell fate prediction, the classification of several types of tumors, and the investigation of heterogeneity in different cells. In this framework, we obtained significant superclusters using a matrix factorization-based algorithm, shrinkage clustering.
This framework obtains up-regulated markers using the statistical tool Limma-Voom. up-regulated gene expressions are significantly correlated with the detection and variation in rare or critical diseases like cancer. up-regulated gene expression in tissue served as an independent detector of any risk and a promising biomarker to predict further changes in differently expressed (diseased) cells/samples and to detect rare cells. Using pathway and gene ontology enrichment analysis, we found that our markers are biologically significant.

In our future study, we will apply our framework to more high-dimensional single-cell sequencing datasets to measure its scalability and robustness. Furthermore, to improve the clustering process, we will extend our current work by including multi-objective optimization techniques. It can be used for big data analysis and the discovery of rare cells in single-cell RNA sequencing data. In this model, we focused more on computational prospects rather than the typical biological characteristics of the dataset. However, in the future, to resolve any kinds of biological errors in the dataset, we are keen to extend our framework and validate the findings by some standard biological test with the collaboration of a wet laboratory.

Supplementary Materials

The following supporting information can be downloaded at: https://0-www-mdpi-com.brum.beds.ac.uk/article/10.3390/math11204315/s1, Table S1: list of all GO-BP terms obtained by DAVID 6.8 for dataset GSE62270. Table S2: list of all GO-CC terms obtained by DAVID 6.8 for dataset GSE62270. Table S3: list of all GO-MF terms obtained by DAVID 6.8 for dataset GSE62270. Table S4: list of miRNA-gene interaction analysis by Cytoscape 2022-1 for dataset GSE62270. Table S5: list of all KEGG pathways obtained by DAVID 6.8 for dataset GSE36552. Table S6: list of all GO-BP terms obtained by DAVID 6.8 for dataset GSE36552. Table S7: list of all GO-CC terms obtained by DAVID 6.8 for dataset GSE36552. Table S8: list of all GO-MF terms obtained by DAVID 6.8 for dataset GSE36552. Table S9: list of all DEGs obtained from the MAGIC imputed dataset GSE36552. Figure S1: image of detailed Venn diagram of comparative study for dataset GSE62270. Figure S2: image of silhouette cluster plot with the feature count 500 for dataset GSE62270. Figure S3: image of silhouette cluster plot with the feature count 500 for dataset GSE36552.

Author Contributions

Conceptualization and design of the experiments, S.S., S.M. and T.B.; execution of the experiments, S.S., S.M., A.I. and T.B.; data analysis, S.S., S.M. and T.B.; manuscript writing, S.S.; manuscript reviewing and editing, S.M., T.B., A.R., P.K.S., A.L. and Z.Z.; funding acquisition: S.M. and Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

Z.Z. was partially supported by the Cancer Prevention and Research Institute of Texas (CPRIT RP170668 and RP180734) (to Z.Z.). Publication costs were funded by Z.Z.’s Professorship Fund. The funder had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability Statement

Dataset 1: RNA sequencing dataset for rare intestinal cell types in mice (NCBI accession ID: GSE62270), available at https://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/geo/query/acc.cgi?acc=GSE62270, accessed on 19 August 2015. Dataset 2: Yan dataset for tracing pluripotency of early human embryos and embryonic stem cells by single-cell RNA sequencing (NCBI accession ID: GSE36552), available at https://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/geo/query/acc.cgi?acc=GSE36552, accessed on 10 August 2013. The code is available online at Github with readme file https://github.com/soumita-seth/imputed_scRNA-seq, accessed on 1 October 2023.

Acknowledgments

We thank the department of computer science and engineering (CSE) of Aliah University, Kolkata and the Department of Information Technology (IT) of Jadavpur University, Kolkata.

Conflicts of Interest

The authors declare no conflict of interest.

References

Luecken, M.D.; Theis, F.J. Current best practices in single-cell RNA-seq analysis: A tutorial. J. Mol. Syst. Biol. 2019, 15, e8746. [Google Scholar] [CrossRef]
Santra, T.S.; Tseng, F.-G. Single-Cell Analysis. Cells 2020, 9, 1993. [Google Scholar] [CrossRef] [PubMed]
Tang, X.; Huang, Y.; Lei, J.; Luo, H.; Zhu, X. The Single-cell Sequencing: New Developments and Medical Applications. Cell Biosci. 2019, 9, 53. [Google Scholar]
Khandelwal, M.; Sheikh, S.; Rout, R.K.; Umer, S.; Mallik, S.; Zhao, Z. Unsupervised Learning for Feature Representation Using Spatial Distribution of Amino Acids in Aldehyde Dehydrogenase (ALDH2) Protein Sequences. Mathematics 2022, 10, 2228. [Google Scholar] [CrossRef]
Huh, R.; Yang, Y.; Jiang, Y.; Shen, Y.; Li, Y. SAME-clustering: Single-cell Aggregated Clustering via Mixture Model Ensemble. Nucleic Acids Res. 2020, 48, 86–95. [Google Scholar] [PubMed]
Qiu, P. Embracing the dropouts in single-cell RNA-seq analysis. Nat. Commun. 2020, 11, 1169. [Google Scholar] [CrossRef]
Jolliffe, I.T.; Cadima, J. Principal component analysis: A review and recent developments. Phil. Trans. R. Soc. A 2016, 374, 20150202. [Google Scholar] [CrossRef]
Cieslak, M.C.; Castelfranco, A.M.; Roncalli, V.; Lenz, P.H.; Hartline, D.K. t-Distributed Stochastic Neighbor Embedding (t-SNE): A tool for eco-physiological transcriptomic analysis. Mar. Genom. 2020, 51, 100723. [Google Scholar] [CrossRef]
Dijk, D.V.; Nainys, J.; Sharma, R.; Kaithail, P.; Carr, A.J.; Moon, K.R.; Peer, D. MAGIC: A diffusion-based imputation method reveals gene-gene interactions in single-cell RNA-sequencing data. Cell 2018, 174, 716–729.e27. [Google Scholar] [CrossRef]
Huang, M.; Wang, J.; Torre, E.; Dueck, H.; Shaffer, S.; Bonasio, R.; Murray, J.I.; Raj, A.; Li, M.; Zhang, N.R. SAVER: Gene expression recovery for single-cell RNA sequencing. Nat. Methods 2018, 15, 539–542. [Google Scholar] [CrossRef]
Li, W.V.; Li, J.J. scImpute: Accurate and robust imputation for single cell RNA-seq data. bioRxiv 2017, 141598. [Google Scholar] [CrossRef] [PubMed]
Tracy, S.; Yuan, G.C.; Dries, R. RESCUE: Imputing dropout events in single-cell RNA-sequencing data. BMC Bioinform. 2019, 20, 388. [Google Scholar] [CrossRef] [PubMed]
Li, W.V.; Li, J.J. An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nat. Commun. 2018, 9, 997. [Google Scholar] [CrossRef] [PubMed]
Ding, C.; Peng, H. Minimum redundancy feature selection from microarray gene expression data. J. Bioinform. Comput. Biol. 2005, 3, 185–205. [Google Scholar] [CrossRef]
Bandyopadhyay, S.; Bhadra, T.; Mitra, P.; Maulik, U. Integration of Dense Subgraph Finding with Feature Clustering for Unsupervised Feature Selection. Pattern Recognit. Lett. 2014, 40, 104–112. [Google Scholar] [CrossRef]
Bhadra, T.; Bandyopadhyay, S. Unsupervised feature selection using an improved version of Differential Evolution. Expert Syst. Appl. 2015, 42, 4042–4053. [Google Scholar] [CrossRef]
Bandyopadhyay, S.; Bhadra, T.; Maulik, U. Variable Weighted Maximal Relevance Minimal Redundancy Criterion for Feature Selection using Normalized Mutual Information. J. Mult.-Valued Log. Soft Comput. 2015, 25, 189–213. [Google Scholar]
Bhadra, T.; Bandyopadhyay, S. Supervised feature selection using integration of densest subgraph finding with floating forward–backward search. Inf. Sci. 2021, 566, 1–18. [Google Scholar] [CrossRef]
Bolón-Canedo, V.; Sánchez-Maroño, N.; Alonso-Betanzos, A. A review of feature selection methods on synthetic data. Knowl. Inf. Syst. 2013, 34, 483–519. [Google Scholar] [CrossRef]
Chandrashekar, G.; Sahin, F. A survey on feature selection methods. Comput. Electr. Eng. 2014, 40, 16–28. [Google Scholar] [CrossRef]
Tang, J.; AlelYani, S.; Liu, H. Feature selection for classification: A review. In Data Classification: Algorithms and Applications; Chapman and Hall/CRC: London, UK, 2014; p. 37. [Google Scholar]
Zhao, Z.; Anand, R.; Wang, M. Maximum Relevance and Minimum Redundancy Feature Selection Methods for a Marketing Machine Learning Platform. In Proceedings of the 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Washington, DC, USA, 5–8 October 2019; pp. 442–452. [Google Scholar] [CrossRef]
Blondel, V.; Guillaume, J.; Lambiotte, R.; Lefebvre, E. Fast Unfolding of Communities in Large Networks. J. Stat. Mech. Theor. Exp. 2008, 83, 10008. [Google Scholar] [CrossRef]
Liu, X.; Song, W.; Wong, B.Y.; Zhang, T.; Yu, S.; Lin, G.N.; Ding, X. A comparison framework and guideline of clustering methods for mass cytometry data. Genome Biol. 2019, 20, 297. [Google Scholar] [CrossRef] [PubMed]
Butler, A.; Hoffman, P.; Smibert, P.; Papalexi, E.; Satija, R. Integrating Single-cell Transcriptomic Data Across Different Conditions. Technol. Species Nat. Biotechnol. 2018, 36, 411–420. [Google Scholar] [CrossRef] [PubMed]
Stuart, T.; Butler, A.; Hoffman, P.; Hafemeister, C.; Papalexi, E.; Mauck, W.M.; Satija, R. Comprehensive Integration of Single-Cell Data. Cell 2019, 177, 1888–1902. [Google Scholar] [CrossRef]
Wolf, F.; Angerer, P.; Theis, F. SCANPY: Large-scale single-cell gene expression data analysis. Genome Biol. 2018, 19, 15. [Google Scholar] [CrossRef] [PubMed]
Koutrouli, M.; Líndez, P.P.; Nastou, K.; Bouwmeester, R.; Rasmussen, S.; Martens, L.; Jensen, L.J. FAVA: High-quality functional association networks inferred from scRNA-seq and proteomics data. bioRxiv 2022. [Google Scholar] [CrossRef]
Kiselev, V.Y.; Andrews, T.S.; Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 2019, 20, 273–282. [Google Scholar] [CrossRef]
Hu, C.; Li, H.; Qutub, A. Shrinkage Clustering: A fast and size-constrained clustering algorithm for biomedical applications. BMC Bioinform. 2018, 19, 19. [Google Scholar] [CrossRef]
Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65. [Google Scholar] [CrossRef]
Yu, Y.; Liu, J. SCM Enables Improved Single-Cell Clustering by Scoring Consensus Matrices. Mathematics 2023, 11, 3785. [Google Scholar] [CrossRef]
Seth, S.; Mallik, S.; Bhadra, T.; Zhao, Z. Dimensionality Reduction and Louvain Agglomerative Hierarchical Clustering for Cluster-Specified Frequent Biomarker Discovery in Single-Cell Sequencing Data. Front. Genet. 2022, 13, 828479. [Google Scholar] [CrossRef] [PubMed]
Xu, Y.; Li, H.D.; Lin, C.X.; Zheng, R.; Li, Y.; Xu, J.; Wang, J. CellBRF: A feature selection method for single-cell clustering using cell balance and random forest. Bioinformatics 2023, 39 (Suppl. S1), i368–i376. [Google Scholar] [CrossRef] [PubMed]
Grün, D.; Lyubimova, A.; Kester, L.; Wiebrands, K.; Basak, O.; Sasaki, N.; Van Oudenaarden, A. Single-cell Messenger RNA Sequencing Reveals Rare Intestinal Cell Types. Nature 2015, 525, 251–255. [Google Scholar] [CrossRef] [PubMed]
Yan, L.; Yang, M.; Guo, H.; Yang, L.; Wu, J.; Li, R.; Tang, F. Single-cell RNA-seq profiling of human preimplantation embryos and embryonic stem cells. Nat. Struct. Mol. Biol. 2013, 20, 1131. [Google Scholar] [CrossRef] [PubMed]
Mallik, S.; Zhao, Z. Towards integrated oncogenic marker recognition through mutual information-based statistically significant feature extraction: An association rule mining based study on cancer expression and methylation profiles. Quant. Biol. 2017, 5, 302–327. [Google Scholar] [CrossRef]
De Jay, N.; Papillon-Cavanagh, S.; Olsen, C.; El-Hachem, N.; Bontempi, G.; Haibe-Kains, B. mRMRe: An R package for parallelized mRMR ensemble feature selection. Bioinformatics 2013, 29, 2365–2368. [Google Scholar] [CrossRef]
Hu, C.W.; Li, H.Y.; Qutub, A.A. shrinkageClust: An R Package for Shrinkage Clustering. Available online: https://github.com/quentinli8/Shrinkage-Clustering (accessed on 8 May 2018).
McDaid, A.F.; Greene, D.; Hurley, N. Normalized Mutual Information to evaluate overlapping community finding algorithms. arXiv 2011, arXiv:1110.2515. [Google Scholar] [CrossRef]
Yeung, K.Y.; Ruzzo, W.L. Details of the Adjusted Rand Index and Clustering Algorithms Supplement to the Paper “An Empirical Study on Principal Component Analysis for Clustering Gene Expression Data” (to Appear in Bioinformatics). 3 May 2001. Available online: https://faculty.washington.edu/kayee/pca/supp.pdf (accessed on 3 May 2001).
Hand, D.J.; Christen, P.; Kirielle, N. F*: An interpretable transformation of the F-measure. Mach. Learn. 2021, 110, 451–456. [Google Scholar] [CrossRef]
Ansari, Z.; Azeem, M.F.; Waseem, A.; Babu, A.V. Quantitative evaluation of performance and validity indices for clustering the web navigational sessions. World Comput. Sci. Inf. Technol. J. 2015, 1, 217–226. [Google Scholar]
Ritchie, M.E.; Phipson, B.; Wu, D.I.; Hu, Y.; Law, C.W.; Shi, W.; Smyth, G.K. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015, 43, e47. [Google Scholar] [CrossRef]
Law, C.W.; Chen, Y.; Shi, W.; Smyth, G.K. voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014, 15, R29. [Google Scholar] [CrossRef] [PubMed]
Mallik, S.; Mukhopadhyay, A.; Maulik, U. RANWAR: Rank-Based Weighted Association Rule Mining From Gene Expression and Methylation Data. IEEE Trans. Nanobiosci. 2015, 14, 59–66. [Google Scholar] [CrossRef] [PubMed]
Mallik, S.; Seth, S.; Bhadra, T.; Zhao, Z. A Linear Regression and Deep Learning Approach for Detecting Reliable Genetic Alterations in Cancer Using DNA Methylation and Gene Expression Data. Genes 2020, 11, 931. [Google Scholar] [CrossRef] [PubMed]
Mallik, S.; Seth, S.; Bhadra, T.; Tomar, N.; Zhao, Z. A Multi-classifier Model to Identify Mitochondrial Respiratory Gene Signatures in Human Cancer. In Proceedings of the 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), San Diego, CA, USA, 18–21 November 2019; pp. 1928–1935. [Google Scholar]
Benjamini, Y.; Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Ser. B. 1995, 57, 289–300. [Google Scholar] [CrossRef]
Dennis, G.; Sherman, B.T.; Hosack, D.A.; Yang, J.; Gao, W.; Lane, H.C.; Lempicki, R.A. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 2003, 4, R60. [Google Scholar] [CrossRef]
Sticht, C.; De La Torre, C.; Parveen, A.; Gretz, N. miRWalk: An online resource for prediction of microRNA binding sites. PLoS ONE 2018, 13, E0206239. [Google Scholar] [CrossRef] [PubMed]
Shannon, P.; Markiel, A.; Ozier, O.; Baliga, N.S.; Wang, J.T.; Ramage, D.; Ideker, T. Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks. Genome Res. 2003, 13, 2498–2504. [Google Scholar] [CrossRef] [PubMed]
Mallik, S.; Zhao, Z. Multi-objective optimized fuzzy clustering for detecting cell clusters from single cell expression profiles, Special Issue of Technologies and Resources for Genetics. Genes 2019, 10, 61. [Google Scholar] [CrossRef]

Figure 1. Flowchart of the proposed framework.

Figure 2. All resultant figures and comparative study diagram of dataset: GSE62270. (a) PCA plot to visualize clusters; (b) the cluster plot during the computation of silhouette width; (c) visualization of comparative study with Seth et al., 2022 [33] using Venn diagram.

Figure 3. Gene and miRNA interactions among top 10 neighbors (ranked by degree centrality measure).

Figure 4. Gene and miRNA interaction network analysis for top GO terms (BP, CC, MF). (a) Gene and miRNA interactions among top 10 neighbors (ranked by degree centrality measure) of top 1 GO-BP term; (b) gene and miRNA interactions among top 10 neighbors (ranked by degree centrality measure) of top 1 GO-CC term; (c) gene and miRNA interactions among top 10 neighbors (ranked by degree centrality measure) of top 1 GO-MF term.

Figure 5. All resultant clustering figures of scImputed dataset: GSE36552. (a) PCA plot to visualize clusters; (b) the cluster plot during the computation of silhouette width.

Figure 6. All resultant clustering figures of MAGIC imputed dataset: GSE36552. (a) PCA plot to visualize clusters; (b) the cluster plot during the computation of silhouette width.

Figure 7. All resultant figures for ablation study of dataset: GSE36552. (a) Venn diagram of imputed up-regulated markers vs. raw up-regulated markers; (b) the cluster plot during the computation of silhouette width without imputation and feature selection.

Table 1. Number of cells with a specific label in the best cluster (dataset: GSE62270).

Label Names	Number of Cells in Best Cluster
$IOLEN 2 q 0$	73
$Lgr 5$	85
$Lgr 5 SC$	154
$PSANNAq 0$	89
$Reg 4 Positive Cells Replicate 1$	77
$Reg 4 Positive Cells Replicate 2$	172
$Whole Organoid Replicate 1$	141
$Whole Organoid Replicate 2$	225
$YFPpos$	352
$Total$	$1368$

Table 2. List of differentially expressed genes (FDR sorted) (dataset: GSE62270).

Gene Symbol	t-Score	p-Value	FDR
$BC 064078$	16.59	$2.08 \times 10^{- 58}$	$2.08 \times 10^{- 56}$
$Psmb 9$	16.07	$4.46 \times 10^{- 55}$	$2.23 \times 10^{- 53}$
$Cd 74$	15.92	$3.92 \times 10^{- 54}$	$1.31 \times 10^{- 52}$
$Psmb 8$	15.84	$1.25 \times 10^{- 53}$	$3.12 \times 10^{- 52}$
$Rangrf$	15.82	$1.60 \times 10^{- 53}$	$3.20 \times 10^{- 52}$
$1700097 N 02 Rik$	15.76	$3.55 \times 10^{- 53}$	$5.92 \times 10^{- 52}$
$Slc 14 a 1$	15.73	$5.99 \times 10^{- 53}$	$8.56 \times 10^{- 52}$
$Fgf 1$	15.58	$4.66 \times 10^{- 52}$	$5.82 \times 10^{- 51}$
$Cdo 1$	15.55	$7.09 \times 10^{- 52}$	$7.88 \times 10^{- 51}$
$Gm 13157$	15.51	$1.30 \times 10^{- 51}$	$1.30 \times 10^{- 50}$
$Agr 3$	15.42	$4.29 \times 10^{- 51}$	$3.78 \times 10^{- 50}$
$Gm 1821$	15.42	$4.54 \times 10^{- 51}$	$3.78 \times 10^{- 50}$
$Serhl$	15.34	$1.36 \times 10^{- 50}$	$1.05 \times 10^{- 49}$
$Agk$	15.30	$2.43 \times 10^{- 50}$	$1.67 \times 10^{- 49}$
$Slc 1 a 1$	15.30	$2.50 \times 10^{- 50}$	$1.67 \times 10^{- 49}$
$Slc 1 a 3$	15.27	$3.91 \times 10^{- 50}$	$2.44 \times 10^{- 49}$
$Flywch 2$	15.15	$1.74 \times 10^{- 49}$	$1.02 \times 10^{- 48}$
$Aqp 1$	15.12	$2.97 \times 10^{- 49}$	$1.65 \times 10^{- 48}$
$Dnahc 8$	15.11	$3.31 \times 10^{- 49}$	$1.74 \times 10^{- 48}$
$2610301 G 19 Rik$	15.04	$8.68 \times 10^{- 49}$	$4.34 \times 10^{- 48}$
$Erdr 1$	15.02	$1.23 \times 10^{- 48}$	$5.85 \times 10^{- 48}$
$Osr 2$	14.99	$1.56 \times 10^{- 48}$	$7.09 \times 10^{- 48}$
$Unc 93 b 1$	14.96	$2.72 \times 10^{- 48}$	$1.18 \times 10^{- 47}$
$Ephx 2$	14.89	$7.22 \times 10^{- 48}$	$3.01 \times 10^{- 47}$
$Odf 3 b$	14.81	$2.00 \times 10^{- 47}$	$7.99 \times 10^{- 47}$

Table 3. Top significant GO-BP term enriched (p-value sorted) (dataset: GSE62270).

GO-BP Term Name *	#Genes	p-Value
GO:0019882 antigen processing and presentation	4	$3.69 \times 10^{- 4}$
GO:0071577 zinc II ion transmembrane transport	3	$1.37 \times 10^{- 3}$
GO:0019233 sensory perception of pain	4	$2.10 \times 10^{- 3}$
GO:0006793 phosphorus metabolic process	2	$6.49 \times 10^{- 3}$
GO:0051938 L-glutamate import	2	$1.29 \times 10^{- 2}$

* See Supplementary Table S1 for details.

Table 4. Top significant GO-CC term enriched (p-value sorted) (dataset: GSE62270).

GO-CC Term Name *	#Genes	p-Value
GO:0031225 anchored component of membrane	6	$1.73 \times 10^{- 4}$
GO:0016323 basolateral plasma membrane	7	$2.47 \times 10^{- 4}$
GO:0016020 membrane	36	$8.05 \times 10^{- 4}$
GO:0005886 plasma membrane	31	$2.16 \times 10^{- 3}$
GO:0005791 rough endoplasmic reticulum	4	$2.37 \times 10^{- 3}$

* See Supplementary Table S2 for details.

Table 5. Top significant GO-MF term enriched (p-value sorted) (dataset: GSE62270).

GO-MF Term Name *	#Genes	p-Value
GO:0016595 glutamate binding	3	$1.09 \times 10^{- 3}$
GO:0015293 symporter activity	4	$8.06 \times 10^{- 3}$
GO:0005314 high-affinity glutamate transmembrane transporter activity	2	$1.41 \times 10^{- 2}$
GO:0015501 glutamate:sodium symporter activity	2	$1.76 \times 10^{- 2}$
GO:0005372 water transmembrane transporter activity	2	$1.76 \times 10^{- 2}$

* See Supplementary Table S3 for details.

Table 6. Top-ranked gene markers and miRNAs (sorted by degree centrality) (dataset: GSE62270).

Marker Name *	In-Degree Centrality	Betweenness Centrality	Closeness Centrality
Fgf1	1758	830,321.7	1160.8
Slc14a1	1023	341,228.6	950.2
Grin1	957	135,208.1	767.7
Slc30a2	872	276,554.9	914.8
Cap1	840	182,865.8	840.8
(a) Top-ranked markers
miRNA Name *	Out-Degree Centrality	Betweenness Centrality	Closeness Centrality
mmu-miR-7016-5p	41	7527.4	885.2
mmu-miR-7076-5p	39	7733.2	894.2
mmu-miR-7052-5p	39	5130.1	842
mmu-miR-7045-5p	39	6416.2	879.5
mmu-miR-7056-5p	38	4914.1	884.2
(b) Top-ranked miRNAs

* See Supplementary Table S4 for details.

Table 7. List of up-regulated markers from dataset GSE36552 (LogFc sorted in descending order).

Gene Symbol	LogFC	t-Score	p-Value	FDR
$CDC 25 B$	2.18	29.13	$2.17 \times 10^{- 49}$	$2.17 \times 10^{- 47}$
$SERPINF 1$	2.15	26.05	$2.96 \times 10^{- 45}$	$7.39 \times 10^{- 44}$
$NCRNA 00081$	2.09	24.91	$1.23 \times 10^{- 43}$	$1.75 \times 10^{- 42}$
$CALCOCO 2$	2.07	23.95	$3.24 \times 10^{- 42}$	$3.24 \times 10^{- 41}$
$RSPH 10 B 2$	2.07	23.06	$6.96 \times 10^{- 41}$	$4.97 \times 10^{- 40}$
$SOX 15$	2.07	25.69	$9.36 \times 10^{- 45}$	$1.56 \times 10^{- 43}$
$NGDN$	2.06	23.67	$8.46 \times 10^{- 42}$	$7.69 \times 10^{- 41}$
$PRKRIP 1$	2.05	23.13	$5.54 \times 10^{- 41}$	$4.26 \times 10^{- 40}$
$RSPH 10 B$	2.04	23.60	$1.09 \times 10^{- 41}$	$9.08 \times 10^{- 41}$
$C 6 orf 48$	2.03	24.81	$1.71 \times 10^{- 43}$	$2.14 \times 10^{- 42}$
$MLF 1$	2.03	27.35	$4.80 \times 10^{- 47}$	$2.40 \times 10^{- 45}$
$TGS 1$	2.03	22.04	$2.72 \times 10^{- 39}$	$1.82 \times 10^{- 38}$
$RABGEF 1$	2.02	23.99	$2.84 \times 10^{- 42}$	$3.15 \times 10^{- 41}$
$KIAA 1191$	2.00	21.10	$8.93 \times 10^{- 38}$	$5.25 \times 10^{- 37}$
$KHDC 1$	1.98	18.93	$3.87 \times 10^{- 34}$	$1.93 \times 10^{- 33}$
$CCNB 1$	1.97	25.98	$3.73 \times 10^{- 45}$	$7.45 \times 10^{- 44}$
$ZNF 584$	1.97	26.13	$2.25 \times 10^{- 45}$	$7.39 \times 10^{- 44}$
$IER 3$	1.88	20.43	$1.11 \times 10^{- 36}$	$6.17 \times 10^{- 36}$
$GAS 5$	1.88	21.69	$9.67 \times 10^{- 39}$	$6.05 \times 10^{- 38}$
$STAU 1$	1.83	20.02	$5.32 \times 10^{- 36}$	$2.80 \times 10^{- 35}$

Table 8. Top significant KEGG pathway enriched (p-value sorted) (dataset: GSE36552).

KEGG Pathway Name *	#Genes	p-Value
hsa00100:Steroid biosynthesis	4	$2.12 \times 10^{- 4}$
hsa00020:Citrate cycle (TCA cycle)	4	$7.26 \times 10^{- 4}$
hsa01100:Metabolic pathways	20	$8.44 \times 10^{- 4}$
hsa01200:Carbon metabolism	5	$4.77 \times 10^{- 3}$
hsa04510:Focal adhesion	5	$3.14 \times 10^{- 2}$

* See Supplementary Table S5 for details.

Table 9. Top significant GO-BP term enriched (p-value sorted) (dataset: GSE36552).

GO-BP Term Name *	#Genes	p-Value
GO:0033490 cholesterol biosynthetic process via lathosterol	3	$1.06 \times 10^{- 4}$
GO:0033489 cholesterol biosynthetic process via desmosterol	3	$1.06 \times 10^{- 4}$
GO:0006099 tricarboxylic acid cycle	4	$3.66 \times 10^{- 4}$
GO:0006749 glutathione metabolic process	4	$1.04 \times 10^{- 3}$
GO:0050728 negative regulation of inflammatory response	5	$2.41 \times 10^{- 3}$

* See Supplementary Table S6 for details.

Table 10. Top significant GO-CC term enriched (p-value sorted) (dataset: GSE36552).

GO-CC Term Name *	#Genes	p-Value
GO:0070062 extracellular exosome	35	$1.09 \times 10^{- 12}$
GO:1903561 extracellular vesicle	35	$3.68 \times 10^{- 12}$
GO:0043230 extracellular organelle	35	$3.77 \times 10^{- 12}$
GO:0044444 cytoplasmic part	70	$8.27 \times 10^{- 10}$
GO:0005737 cytoplasm	76	$1.29 \times 10^{- 9}$

* See Supplementary Table S7 for details.

Table 11. Top significant GO-MF term enriched (p-value sorted) (dataset: GSE36552).

GO-MF Term Name *	#Genes	p-Value
GO:0005515 protein binding	72	$1.93 \times 10^{- 4}$
GO:0003723 RNA binding	18	$2.18 \times 10^{- 4}$
GO:0019901 protein kinase binding	9	$1.94 \times 10^{- 3}$
GO:0003725 double-stranded RNA binding	4	$4.53 \times 10^{- 3}$
GO:0042803 protein homodimerization activity	10	$4.57 \times 10^{- 3}$

* See Supplementary Table S8 for details.

Table 12. List of top 5 biomarkers from dataset GSE36552 with MAGIC Imputation (LogFc sorted in descending order).

Gene Symbol *	logFC	t-Score	p-Value	FDR
$CALCOCO 2$	1.84	23.99	$1.33 \times 10^{- 42}$	$2.22 \times 10^{- 41}$
$CDC 25 B$	1.68	19.87	$5.36 \times 10^{- 36}$	$4.88 \times 10^{- 35}$
$CD 37$	1.67	31.18	$2.17 \times 10^{- 52}$	$2.17 \times 10^{- 50}$
$TMEM 11$	1.66	18.29	$3.07 \times 10^{- 33}$	$2.05 \times 10^{- 32}$
$TANC 1$	1.62	26.79	$1.18 \times 10^{- 46}$	$5.89 \times 10^{- 45}$

* See Supplementary Table S9 for details.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Seth, S.; Mallik, S.; Islam, A.; Bhadra, T.; Roy, A.; Singh, P.K.; Li, A.; Zhao, Z. Identifying Genetic Signatures from Single-Cell RNA Sequencing Data by Matrix Imputation and Reduced Set Gene Clustering. Mathematics 2023, 11, 4315. https://0-doi-org.brum.beds.ac.uk/10.3390/math11204315

AMA Style

Seth S, Mallik S, Islam A, Bhadra T, Roy A, Singh PK, Li A, Zhao Z. Identifying Genetic Signatures from Single-Cell RNA Sequencing Data by Matrix Imputation and Reduced Set Gene Clustering. Mathematics. 2023; 11(20):4315. https://0-doi-org.brum.beds.ac.uk/10.3390/math11204315

Chicago/Turabian Style

Seth, Soumita, Saurav Mallik, Atikul Islam, Tapas Bhadra, Arup Roy, Pawan Kumar Singh, Aimin Li, and Zhongming Zhao. 2023. "Identifying Genetic Signatures from Single-Cell RNA Sequencing Data by Matrix Imputation and Reduced Set Gene Clustering" Mathematics 11, no. 20: 4315. https://0-doi-org.brum.beds.ac.uk/10.3390/math11204315

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Identifying Genetic Signatures from Single-Cell RNA Sequencing Data by Matrix Imputation and Reduced Set Gene Clustering

Abstract

1. Introduction

2. Methods

2.1. Data Sources

2.2. Matrix Imputation

2.3. Max Relevance Min Redundancy (MRMR) Feature Selection

2.4. Cell Clustering

2.5. Cluster Validity Evaluation

2.6. Identifying Differentially Expressed Genes Using Limma-Voom

2.7. Gene Set Enrichment Analysis

2.8. miRNA-Gene Interaction Network Analysis

3. Result and Discussion

Ablation Study of Our Proposed Model

4. Conclusions and Future Scope

Supplementary Materials

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI