Article

The Resolved Mutual Information Function as a Structural Fingerprint of Biomolecular Sequences for Interpretable Machine Learning Classifiers

1 Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany
2 Bosch Center for Artificial Intelligence, 71272 Renningen, Germany
* Authors to whom correspondence should be addressed.
Submission received: 19 September 2021 / Revised: 11 October 2021 / Accepted: 14 October 2021 / Published: 17 October 2021

Abstract

In the present article, we propose the application of variants of the mutual information function as characteristic fingerprints of biomolecular sequences for classification analysis. In particular, we consider the resolved mutual information functions based on Shannon, Rényi, and Tsallis entropy. In combination with interpretable machine learning classifier models based on generalized learning vector quantization, a powerful methodology for sequence classification is achieved which, due to the model-inherent robustness, allows substantial knowledge extraction in addition to high classification performance. Any potentially (slightly) inferior performance of the used classifier is compensated by the additional knowledge provided by the interpretable model. This knowledge may assist the user in analyzing and understanding the considered data and task. After a theoretical justification of the concepts, we demonstrate the approach on various example data sets covering different areas of biomolecular sequence analysis.

1. Introduction

The accumulation of information based on physical organization processes like structure generation and self-organization belongs to the key aspects of living systems [1,2,3,4]. Thus, information theoretic concepts play an important role in sequence analysis for understanding biomolecular entities like RNA, DNA, and proteins to explain biological systems [5,6,7]. In the case of DNA/RNA, the biomolecular information is coded by the nucleotide sequence, particularly by the frequencies of its sequence elements, their correlations, and other topological features. The extensive influence of information theoretic concepts and applications in the fields of computational, molecular, and systems biology is captured in various reviews [8,9,10,11].
The study of sequences in consideration of their biological properties is still crucial for such diverse applications as drug design, phylogenetic analyses, prediction of molecular interactions, identification of polymorphisms or definition of pathogenic mutations [12,13,14,15]. With the availability of powerful machine learning methods like deep and convolutional networks [16,17], and support vector machines [18] as well as the supporting hardware (graphic processing units—GPU), self-learning procedures have entered (and revolutionized) many of these areas of biomolecular research [19,20,21,22,23]. Although these models provide promising performance by automated training and outperform many statistical approaches, the disadvantage is their general “black-box” behavior, i.e., the model decisions are usually hardly interpretable. Thus, explanations are at least difficult to give and usually require additional tools [24]. However, current research increasingly focuses on developing interpretable machine learning models instead [25,26]. According to [27], interpretable models are designed to be interpretable, in contrast to explainable methods, which can be comprehended post-hoc by experts in the field using additional tools and elaborate considerations. Generally, interpretability increases the trustworthiness of a machine learning method and hence contributes to making it transparent for its users. However, interpretable models require meaningful features describing the objects to be considered, ideally taking domain knowledge into account.
It is precisely this identification or rather the design of problem-adequate features that is the subject of research in the field of alignment-free sequence comparison in computational biology. By overcoming some of the major disadvantages of alignments, such as strong evolutionary assumptions [28], high computational costs [29] as well as non-numerical sequence representation, alignment-free methods evolved as a true alternative for quantifying sequence (dis-)similarity [30]. At present, respective methods are used in the domains of phylogenetics [31,32,33], (meta-)genomics [34,35], database similarity search [36], or next-generation sequencing data analyses [37,38,39].
In particular, information theoretic and statistical quantities provide a natural way to generate unique signatures or fingerprints of molecular sequence data by considering the distribution of nucleotides as elements of sequence as well as their statistical correlations [40,41,42]. Long-range correlations in sequences are well-known and intensively studied in alignment-free sequence comparison [43,44,45,46]. A promising statistical descriptor approach for sequences is the concept of natural vectors. It considers the moments of the distribution of the nucleotides in a sequence as the determining quantities to characterize the molecular sequence [47]. The equality of all statistical moments of two sequences implies the equality of statistical distributions and, hence, can be seen as an equivalence relation. Natural vectors were successfully applied for DNA-analysis, virus, and protein sequence classification [48,49].
The use of so-called mutual information functions (MIFs) as an alternative to correlation profiles as sequence descriptors was first investigated in [50]. This idea was reconsidered in [51] and renewed in [52,53]. In bioinformatics, this concept was established as average mutual information profile (AMI-profile) and proposed to serve as a genomic signature [54]. Molecular descriptors based on the mutual information are considered in [55]. Further applications of MIF in computational biology involve its use for species identification from DNA sequences [56], for finding (species-independent) patterns that differ in coding and non-coding DNA [57] as well as for investigating co-variation of mutations in protein sequences [58]. To make the idea accessible to a larger audience, a user interface program for MIF calculation is provided in [59] and more applications are reviewed in [60].
The mutual information is known as a similarity measure between distributions, which originally is based on the Shannon entropy [61]. It implicitly takes all correlation moments of the distributions for the comparison into account. Popular alternatives to the standard mutual information are the Rényi and the Tsallis mutual information, which are based on the Rényi and the Tsallis entropy, respectively [62,63,64,65]. The numerical estimations of these mutual information variants seem to be more robust than the estimations for the Shannon original [66].
These mutual information concepts can be used to generate information theoretic features for sequence analysis: Rényi entropic profiles were considered for DNA classification problems based on chaos game representation [67,68]. Molecular descriptors based on the Rényi entropy were investigated in [69], whereas long range correlation using Tsallis mutual information was considered in [70]. However, to the best of our knowledge, MIFs for these variants have not been considered so far.
Furthermore, in [71], it is criticized that the (Shannon) AMI profile, i.e., the MIF, suffers from an averaging effect over the 16 kinds of base correlations. Therefore, the authors of this study proposed, based on the earlier work in [40], a partial information correlation.
This criticism, together with the above-mentioned robustness observations for the Rényi and the Tsallis entropy estimators, motivated our investigations: first, we introduce a resolved variant of the Shannon-based MIF as a more adequate information theoretic signature of molecular sequences reducing the averaging effects. Afterwards, we transfer this concept to both the Rényi and the Tsallis variants, obtaining the respective (resolved) mutual information functions. The resulting signature vectors serve as data descriptors for sequence classification problems to be tackled by machine learning methods. In this machine learning part, we focus on dissimilarity based and interpretable classifier models according to the above discussion about interpretability. Particularly, we apply a variant of learning vector quantization which delivers feature correlation information regarding the classification problem as additional information beyond the classifier's prediction performance [72]. Furthermore, this method is known to be robust and to optimize the class-separating hypothesis margin [73].
The paper is structured as follows: first, we introduce variants of the mutual information functions for the Shannon, the Rényi, and the Tsallis entropy and give theoretical justifications. Second, we describe the interpretable machine learning classifier based on learning vector quantization and show how knowledge about the decision process and the underlying data properties can be extracted. Thereafter, we apply this methodology to three biomolecular sequence data sets covering different application areas. For this purpose, we describe the feature generation and the parameter setting in detail. Furthermore, we show for one example data set how knowledge is extracted from the trained classifier model to provide useful additional information. Concluding remarks and an outlook on future work complete the paper.

2. Variants of Mutual Information Functions as Biomolecular Sequence Signatures

In the following section, we introduce the concept of variants of mutual information functions, which later serve as determining fingerprints of nucleotide sequences. These functions reflect structural characteristics and spatial relations within the sequences. For this purpose, we consider several types of mutual information regarding different entropy concepts. Thereby, we concentrate on those approaches, which are frequently used in machine learning. For a general overview of entropies, divergences, and mutual information, we refer to [74].

2.1. The Resolved Mutual Information Function Based on the Shannon Entropy

We consider the Shannon entropy

$$H(X) = \int_{\mathcal{X}} p(x)\cdot\log\frac{1}{p(x)}\,dx$$

of a random quantity $X \in \mathcal{X}$ with the density measure $p(x)$, being the expectation value of the information $\log\frac{1}{p(x)}$. In the machine learning context here, we interpret $X$ as a feature or object quantity. The maximum value of the entropy $H(X)$ is obtained for a uniform density $p(x)$ and, hence, $H(X)$ serves as a measure of uncertainty [75].
The corresponding divergence is the Kullback–Leibler divergence

$$D_{KL}\bigl(p(x)\,\|\,p(y)\bigr) = \int_{\mathcal{X}}\int_{\mathcal{Y}} p(x)\cdot\log\frac{p(x)}{p(y)}\,dy\,dx$$

as a dissimilarity measure between the densities $p(x)$ and $p(y)$ [61,76]. The corresponding mutual information is

$$I(X,Y) = D_{KL}\bigl(p(x,y)\,\|\,p(x)\cdot p(y)\bigr)$$

quantifying the joint information of $p(x)$ and $p(y)$. Here, $p(x,y)$ is the joint density. Alternatively, the mutual information can be written as

$$I(X,Y) = H(X) - H(X|Y)$$

using the conditional entropy $H(X|Y)$, which can be written as

$$H(X|Y) = H(X,Y) - H(Y) = -\int_{\mathcal{X}}\int_{\mathcal{Y}} p(x,y)\cdot\log\frac{p(x,y)}{p(y)}\,dy\,dx$$

known as the chain rule of the entropies [61,76]. Equivalently, the mutual information can be formulated as the difference between the sum of the marginal entropies and the joint entropy, i.e.,

$$I(X,Y) = H(X) + H(Y) - H(X,Y)$$

is valid.
We can rewrite the divergence formulation of the mutual information $I(X,Y)$ from Equation (3) as

$$I(X,Y) = \int_{\mathcal{X}} F(x,Y)\,dx$$

where

$$F(x,Y) = \int_{\mathcal{Y}} p(x,y)\cdot\log\frac{p(x,y)}{p(x)\cdot p(y)}\,dy$$

describes a mutual information relation of a particular object (feature) $x$ with respect to the random quantity $Y$. We denote $F(x,Y)$ as the (feature) resolved mutual information (rMI).
The mutual information for sequences $X(t)$ and $Y(t+\tau)$ at time (position) $t$ with shift $\tau \ge 0$ is defined as

$$I\bigl(X(t),Y(t+\tau)\bigr) = \int_{\mathcal{X}}\int_{\mathcal{Y}} p\bigl(x(t),y(t+\tau)\bigr)\cdot\log\frac{p\bigl(x(t),y(t+\tau)\bigr)}{p\bigl(x(t)\bigr)\cdot p\bigl(y(t+\tau)\bigr)}\,dy\,dx$$

which yields, by setting $Y(t+\tau) = X(t+\tau)$,

$$I\bigl(X(t),X(t+\tau)\bigr) = \int_{\mathcal{X}}\int_{\mathcal{X}} p\bigl(x(t),x(t+\tau)\bigr)\cdot\log\frac{p\bigl(x(t),x(t+\tau)\bigr)}{p\bigl(x(t)\bigr)\cdot p\bigl(x(t+\tau)\bigr)}\,dx(t+\tau)\,dx$$

as the auto mutual information at time/position $t$ with shift (delay) $\tau$ [77,78]. If $p\bigl(x(t)\bigr)$ is independent of $t$, only the joint probability $p\bigl(x(t),x(t+\tau)\bigr)$ remains $t$-dependent or, more precisely, it becomes dependent only on the shift $\tau$, such that we simply write $p(x,x_\tau)$ for it. Thus, the auto mutual information in dependence on the shift $\tau$ is obtained as

$$I(X,\tau) = \int_{\mathcal{X}}\int_{\mathcal{X}} p(x,x_\tau)\cdot\log\frac{p(x,x_\tau)}{p(x)\cdot p(x_\tau)}\,dx_\tau\,dx$$

as an information theoretic analog of the auto-correlation function. In [50,79], this shift-dependent auto mutual information is denoted as the mutual information function (MIF) $F(X,\tau) = I(X,\tau)$. Adapting the rMI from Equation (7) to the auto mutual information $I(X,\tau)$ results in the function

$$F(x,\tau) = \int_{\mathcal{X}} p(x,x_\tau)\cdot\log\frac{p(x,x_\tau)}{p(x)\cdot p(x_\tau)}\,dx_\tau = \int_{\mathcal{X}} p(x,x_\tau)\cdot\log\frac{p(x,x_\tau)}{p(x_\tau)}\,dx_\tau - p(x)\cdot\log p(x)$$

which can be seen as a quantity characterizing the inherent correlations of the sequence values $x(t)$. We denote $F(x,\tau)$ as the (feature) resolved mutual information function (rMIF), which trivially fulfills

$$I(X,\tau) = \int_{\mathcal{X}} F(x,\tau)\,dx$$

according to its definition. Note that, more precisely, the notation $F(X,x,\tau)$ would be appropriate; we drop the dependency on $X$ for better readability. For (finite) discrete distributions, $F$ simply becomes a matrix and $I(X,\tau)$ constitutes a vector. Hence, we can compare those objects in terms of respective norms, e.g., the Euclidean norm for vectors or the corresponding Frobenius norm for matrices [80,81].
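For nucleotide sequences, the integrals reduce to sums over the four-letter alphabet and the densities to relative pair frequencies. The following minimal Python sketch (our own illustration with hypothetical function names, not the implementation used later) estimates the joint distribution at shift $\tau$ and derives the rMIF $F(x,\tau)$ and the MIF $F(X,\tau)$ from it; the base-2 logarithm is our own choice:

```python
import numpy as np

ALPHABET = "ACGT"

def joint_counts(seq, tau):
    """Estimate the joint distribution p(x, x_tau) of nucleotide pairs at distance tau."""
    idx = {c: i for i, c in enumerate(ALPHABET)}
    counts = np.zeros((4, 4))
    for a, b in zip(seq[:-tau], seq[tau:]):
        if a in idx and b in idx:        # ambiguous characters are ignored in this sketch
            counts[idx[a], idx[b]] += 1
    return counts / counts.sum()

def shannon_rmif(seq, tau, eps=1e-12):
    """Resolved MIF F(x, tau), one value per nucleotide x; summing over x yields the MIF."""
    p_joint = joint_counts(seq, tau)
    p_x = p_joint.sum(axis=1)            # marginal of the first position
    p_xt = p_joint.sum(axis=0)           # marginal of the shifted position (non-symmetric version)
    ratio = p_joint / (np.outer(p_x, p_xt) + eps)
    return np.sum(p_joint * np.log2(ratio + eps), axis=1)

def shannon_mif(seq, tau):
    return shannon_rmif(seq, tau).sum()
```

The non-symmetric marginals anticipate the convention discussed later in Section 4.2.2.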

2.2. Rényi α -Entropy and Related Mutual Information Functions

The Rényi entropy

$$H_\alpha^R(X) = \frac{1}{1-\alpha}\log\int_{\mathcal{X}} p(x)^\alpha\,dx$$

is a generalization of the Shannon entropy, where $\alpha > 0$ with $\alpha \neq 1$ is a parameter [62]. Depending on the context, it is also denoted as $\alpha$-entropy. In the limit $\alpha \to 1$, the Shannon entropy is obtained. The corresponding Rényi divergence is

$$D_\alpha^R\bigl(p(x)\,\|\,p(y)\bigr) = \frac{1}{\alpha-1}\log\int_{\mathcal{X}}\int_{\mathcal{Y}} \frac{p(x)^\alpha}{p(y)^{\alpha-1}}\,dy\,dx$$

with the limit $\lim_{\alpha\to 1} D_\alpha^R\bigl(p(x)\,\|\,p(y)\bigr) = D_{KL}\bigl(p(x)\,\|\,p(y)\bigr)$ being valid, such that the $\alpha$-dependent Rényi mutual information (RMI) is defined as

$$I_\alpha^R(X,Y) = D_\alpha^R\bigl(p(x,y)\,\|\,p(x)\cdot p(y)\bigr)$$

analogous to the Shannon case (3). This mutual information is widely applied in data analysis and pattern recognition as well as in information theoretic machine learning [82,83,84,85,86,87,88,89,90]. Unfortunately, a relation comparable to (6) does not hold, i.e.,

$$I_\alpha^R(X,Y) \neq H_\alpha^R(X) + H_\alpha^R(Y) - H_\alpha^R(X,Y)$$

in general. This problem arises from the difficulty of defining a conditional Rényi entropy that is consistent with the setting in the Shannon case [91,92,93]. Several variants are known [94,95]. The Jizba–Arimitsu conditional Rényi entropy $H_\alpha^{JA}(X|Y)$ defined as

$$H_\alpha^{JA}(X|Y) = H_\alpha(X,Y) - H_\alpha(Y)$$

fulfills the chain rule by definition [96]. Obviously, $H_\alpha^{JA}(X|Y)$ can be interpreted as an extension of the conditional Shannon entropy $H(X|Y)$ because the definition (15) precisely coincides with Shannon's chain rule (5). The resulting mutual entropy

$$M_\alpha^R(X,Y) = H_\alpha(X) - H_\alpha^{JA}(X|Y)$$

is consistent with (4) and preserves the symmetry [97]. However, it may violate non-negativity, and $I_\alpha^R(X,Y) \neq M_\alpha^R(X,Y)$ in general. For further variants, we refer to [95].
Analogous to the resolved mutual information $F(x,Y)$ in the Shannon case from Equation (10), we denote

$$F_\alpha^R(x,Y) = \int_{\mathcal{Y}} \frac{p(x,y)^\alpha}{p(x)^{\alpha-1}\cdot p(y)^{\alpha-1}}\,dy$$

as the $\alpha$-scaled (feature) resolved Rényi mutual information (rRMI). Obviously,

$$I_\alpha^R(X,Y) = \frac{1}{\alpha-1}\log\int_{\mathcal{X}} F_\alpha^R(x,Y)\,dx$$

holds. The Rényi variant of the cross mutual information for sequences $X(t)$ and $Y(t+\tau)$ at time $t$ with shift $\tau \ge 0$ is defined as

$$I_\alpha^R\bigl(X(t),Y(t+\tau)\bigr) = \frac{1}{\alpha-1}\log\int_{\mathcal{X}}\int_{\mathcal{Y}} \frac{p\bigl(x(t),y(t+\tau)\bigr)^\alpha}{p\bigl(x(t)\bigr)^{\alpha-1}\cdot p\bigl(y(t+\tau)\bigr)^{\alpha-1}}\,dy(t+\tau)\,dx(t)$$

which gives, by setting $Y(t+\tau) = X(t+\tau)$,

$$I_\alpha^R\bigl(X(t),X(t+\tau)\bigr) = \frac{1}{\alpha-1}\log\int_{\mathcal{X}}\int_{\mathcal{X}} \frac{p\bigl(x(t),x(t+\tau)\bigr)^\alpha}{p\bigl(x(t)\bigr)^{\alpha-1}\cdot p\bigl(x(t+\tau)\bigr)^{\alpha-1}}\,dx(t+\tau)\,dx(t)$$

as the Rényi variant of the auto mutual information at time $t$ with shift (delay) $\tau$. Again, if $p\bigl(x(t)\bigr)$ is independent of $t$, only the joint probability $p\bigl(x(t),x(t+\tau)\bigr)$ remains $t$-dependent such that it becomes dependent only on the shift $\tau$, and we simply write $p(x,x_\tau)$ for it. Hence, the Rényi auto mutual information in dependence on the shift $\tau$ is obtained as

$$I_\alpha^R(X,\tau) = \frac{1}{\alpha-1}\log F_\alpha^R(X,\tau)$$

with

$$F_\alpha^R(X,\tau) = \int_{\mathcal{X}}\int_{\mathcal{X}} \frac{p(x,x_\tau)^\alpha}{p(x)^{\alpha-1}\cdot p(x_\tau)^{\alpha-1}}\,dx_\tau\,dx$$

denoted as the Rényi variant of the MIF, or $\alpha$-scaled Rényi mutual information function (RMIF). Accordingly, the $\alpha$-scaled resolved version of the RMIF is

$$F_\alpha^R(x,\tau) = \int_{\mathcal{X}} \frac{p(x,x_\tau)^\alpha}{p(x)^{\alpha-1}\cdot p(x_\tau)^{\alpha-1}}\,dx_\tau$$

which again describes the inherent correlations of the sequence and, hence, can serve as a characterizing quantity of the sequence. Accordingly, we denote the function $F_\alpha^R(x,\tau)$ as the $\alpha$-scaled resolved Rényi mutual information function (rRMIF). Obviously,

$$I_\alpha^R(X,\tau) = \frac{1}{\alpha-1}\log\int_{\mathcal{X}} F_\alpha^R(x,\tau)\,dx$$

is valid, analogous to Equation (11).
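In the discrete setting, the rRMIF differs from the Shannon sketch above only in the integrand. A hedged continuation of the previous sketch (again our own naming, reusing joint_counts and with $\alpha = 2$ as used in Section 4.2.2) could read:

```python
def renyi_rmif(seq, tau, alpha=2.0, eps=1e-12):
    """Resolved Rényi MIF F_alpha^R(x, tau), one value per nucleotide x."""
    p_joint = joint_counts(seq, tau)
    p_x = p_joint.sum(axis=1)
    p_xt = p_joint.sum(axis=0)
    weights = (np.outer(p_x, p_xt) + eps) ** (alpha - 1.0)
    return np.sum(p_joint ** alpha / weights, axis=1)

def renyi_mif(seq, tau, alpha=2.0):
    """Rényi auto mutual information I_alpha^R(X, tau) = log(F_alpha^R(X, tau)) / (alpha - 1)."""
    return np.log2(renyi_rmif(seq, tau, alpha).sum()) / (alpha - 1.0)
```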

2.3. Tsallis α -Entropy and Related Mutual Information Functions

Recently, the Tsallis mutual information came into focus for studying long range correlations in symbol sequences [70]. It is related to the Tsallis $\alpha$-entropy

$$H_\alpha^T(X) = \frac{1}{\alpha-1}\left(1 - \int_{\mathcal{X}} p(x)^\alpha\,dx\right)$$

which becomes the Shannon entropy $H(X)$ in the limit $\alpha \to 1$. It was first introduced by Havrda and Charvát in 1967 [98] and later rediscovered by Tsallis [64]. It is related to the Rényi $\alpha$-entropy $H_\alpha^R(X)$ by

$$H_\alpha^R(X) = \frac{\log\bigl(1 + (1-\alpha)\,H_\alpha^T(X)\bigr)}{1-\alpha}$$

as stated in [65]. The Tsallis divergence is given by

$$D_\alpha^T\bigl(p(x)\,\|\,p(y)\bigr) = \frac{1}{1-\alpha}\left(1 - \int_{\mathcal{X}}\int_{\mathcal{Y}} \frac{p(x)^\alpha}{p(y)^{\alpha-1}}\,dy\,dx\right)$$

as explained in [64,74]. Using the same procedure as for the Shannon case (3), we obtain

$$I_\alpha^T(X,Y) = D_\alpha^T\bigl(p(x,y)\,\|\,p(x)\cdot p(y)\bigr)$$

for the $\alpha$-dependent Tsallis mutual information (TMI) [99]. As for the Rényi mutual information, the inequality

$$I_\alpha^T(X,Y) \neq H_\alpha^T(X) + H_\alpha^T(Y) - H_\alpha^T(X,Y)$$

is generally valid, except for the case $\alpha = 1$, which is the Shannon case. The TMI is symmetric and always non-negative but not consistent with the conditional Tsallis entropy

$$H_\alpha^T(X|Y) = \frac{H_\alpha^T(X,Y) - H_\alpha^T(Y)}{1 + (1-\alpha)\cdot H_\alpha^T(Y)}$$

as explained in [65]. To avoid these and other difficulties, the Tsallis $\alpha$-entropy based mutual entropy (information) is finally suggested to be

$$M_\alpha^T(X,Y) = \frac{H_\alpha^T(X) + H_\alpha^T(Y) - H_\alpha^T(X,Y) + (1-\alpha)\,H_\alpha^T(X)\,H_\alpha^T(Y)}{1 + (1-\alpha)\,H_\alpha^T(X)}$$

as proposed in [65]. However, the inequality $M_\alpha^T(X,Y) \neq I_\alpha^T(X,Y)$ holds in general.
As for the Shannon and the Rényi variants of the mutual information, we consider a resolved Tsallis mutual information (rTMI)

$$F_\alpha^T(x,Y) = \int_{\mathcal{Y}} \frac{p(x,y)^\alpha}{\bigl(p(x)\cdot p(y)\bigr)^{\alpha-1}}\,dy$$

such that

$$I_\alpha^T(X,Y) = \frac{1}{1-\alpha}\left(1 - \int_{\mathcal{X}} F_\alpha^T(x,Y)\,dx\right)$$

holds. For the auto mutual information with shift $\tau$, we get

$$I_\alpha^T(X,\tau) = \frac{1}{1-\alpha}\Bigl(1 - F_\alpha^T(X,\tau)\Bigr)$$

with

$$F_\alpha^T(X,\tau) = \int_{\mathcal{X}} F_\alpha^T(x,\tau)\,dx$$

as the Tsallis mutual information function (TMIF), by the same arguments as before, with

$$F_\alpha^T(x,\tau) = \int_{\mathcal{X}} \frac{p(x,x_\tau)^\alpha}{\bigl(p(x)\cdot p(x_\tau)\bigr)^{\alpha-1}}\,dx_\tau$$

denoted as the $\alpha$-scaled resolved Tsallis mutual information function (rTMIF) for Tsallis entropies. In the bioinformatics context, it can be seen as an $\alpha$-scaled, object-dependent average Tsallis mutual information profile.
Comparing TMIF and RMIF as well as rTMIF and rRMIF, we can obviously state the equalities $F_\alpha^R(X,\tau) = F_\alpha^T(X,\tau)$ and $F_\alpha^R(x,\tau) = F_\alpha^T(x,\tau)$.
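Because of these equalities, the Tsallis variants come essentially for free once the Rényi values are available. A short sketch, relying on the renyi_rmif function from above and on the TMIF relation as stated here:

```python
def tsallis_mif(seq, tau, alpha=2.0):
    """Tsallis auto mutual information I_alpha^T(X, tau), reusing F_alpha^T = F_alpha^R."""
    F = renyi_rmif(seq, tau, alpha).sum()
    return (1.0 - F) / (1.0 - alpha)
```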
All quantities relevant for the later data analysis are summarized and adapted for biomolecular sequences in Table 2.

3. Interpretable Classification Learning by Learning Vector Quantization

Learning vector quantization (LVQ) as introduced by T. Kohonen is a neural network approach for classification trained by Hebbian competitive learning to achieve an approximation of a Bayes classifier model [100,101]. It is based on the intuitive nearest prototype principle, i.e., prototype vectors are distributed in the data space during the learning phase to detect the data class distribution. In the recall phase, a data point is assigned to the class referenced by its nearest prototype, based on a given data dissimilarity. This is known as a robust variant of the nearest neighbor principle [102]. In this way, LVQ is easy to interpret [27].
Particularly, LVQ supposes data vectors $x \in X = \{x_k\}_{k=1}^{K} \subset \mathbb{R}^n$ together with class labels $c(x) \in \mathcal{C} = \{1,\ldots,C\}$ for training [100]. Furthermore, the LVQ model requires prototype vectors $w_j \in W = \{w_k\}_{k=1}^{N} \subset \mathbb{R}^n$ with class labels $c(w_j)$ such that each class of $\mathcal{C}$ is represented by at least one prototype. As already mentioned, a new data vector is assigned to a class by means of the nearest prototype principle

$$x \mapsto c(w^*) \quad\text{with}\quad w^* = \operatorname*{argmin}_{w_j \in W}\, d(x, w_j)$$

where $w^*$ is denoted as the winning prototype for the input $x$ with respect to $W$. Here, $d$ is a predefined dissimilarity measure in $\mathbb{R}^n$, frequently chosen as the (squared) Euclidean distance. According to [103], prototype learning in GLVQ can be realized as stochastic gradient descent learning (SGDL) for the prototype set $W$. The respective cost function

$$E = \sum_{x \in X} E(x, W)$$

approximates the overall classification error for the training set $X$ by local errors

$$E(x, W) = f\bigl(\mu(x, W)\bigr)$$

taking into account the classifier function

$$\mu(x, W) = \frac{d(x, w^+) - d(x, w^-)}{d(x, w^+) + d(x, w^-)}$$

such that $\mu(x, W) \in [-1, 1]$ is valid and $f$ is a monotonically increasing sigmoid squashing function. Here, $w^+$ is the closest prototype to $x$ with the correct label, whereas $w^-$ is the closest prototype with an incorrect label, i.e.,

$$w^+ = \operatorname*{argmin}_{w_j \in W:\, c(x) = c(w_j)} d(x, w_j) \quad\text{and}\quad w^- = \operatorname*{argmin}_{w_j \in W:\, c(x) \neq c(w_j)} d(x, w_j)$$

such that $\mu(x, W) < 0$ holds in case of a correct classification.
The SGDL step for a given input $x$ is

$$\Delta w^\pm \propto -\frac{\partial E(x, W)}{\partial w^\pm}$$

realizing an attraction scheme (vector shift) of $w^+$ towards $x$ in case of the (squared) Euclidean distance as dissimilarity $d$, whereas $w^-$ is repelled from $x$. This variant of LVQ is known as generalized LVQ (GLVQ, [103]).
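To make the update explicit, the following minimal Python sketch (our own illustration, with the identity chosen as squashing function $f$ and the squared Euclidean distance) performs a single SGDL step; it is not the prototorch implementation used in the experiments:

```python
import numpy as np

def glvq_step(x, label, prototypes, proto_labels, lr=0.01):
    """One stochastic GLVQ update: attract the closest correct prototype, repel the closest incorrect one."""
    d = np.sum((prototypes - x) ** 2, axis=1)        # squared Euclidean distances
    correct = proto_labels == label
    i_plus = np.where(correct)[0][np.argmin(d[correct])]
    i_minus = np.where(~correct)[0][np.argmin(d[~correct])]
    d_plus, d_minus = d[i_plus], d[i_minus]
    denom = (d_plus + d_minus) ** 2
    xi_plus = 2.0 * d_minus / denom                  # derivative of mu w.r.t. d_plus
    xi_minus = 2.0 * d_plus / denom                  # magnitude of derivative of mu w.r.t. d_minus
    prototypes[i_plus] += lr * xi_plus * 2.0 * (x - prototypes[i_plus])     # attraction
    prototypes[i_minus] -= lr * xi_minus * 2.0 * (x - prototypes[i_minus])  # repulsion
    return prototypes
```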
The interpretability and the power of GLVQ can be improved by taking the dissimilarity $d$ as

$$d_\Omega(x, w) = \bigl\lVert \Omega\,(x - w)\bigr\rVert^2$$

where $\Omega \in \mathbb{R}^{m \times n}$ is a mapping matrix with $m \le n$. This mapping matrix is also subject to adaptation during learning, with

$$\Delta \Omega_{ij} \propto -\frac{\partial E(x, W)}{\partial \Omega_{ij}}$$

realizing the SGDL step for a given input $x$. This approach is known as generalized matrix LVQ (GMLVQ) [72]. In the case $m < n$, it is called limited rank GMLVQ (LiRaM-LVQ) [104].
The resulting matrix $\Lambda = \Omega^T \Omega$ is denoted as the classification correlation matrix (CCM) [105]. After training, the matrix entries $\Lambda_{ij}$ reflect those correlations between the $i$-th and $j$-th data features which contribute to class discrimination. More specifically, if $\Lambda_{ij} \not\approx 0$ is valid, the respective correlation of the features is important to separate the classes, whereas $\Lambda_{ij} \approx 0$ indicates that either the correlation between the $i$-th and $j$-th data features does not improve the classification or that this correlation information is already contained in another significant correlation. The vector $\lambda = (\lambda_1, \ldots, \lambda_n)^T$ with

$$\lambda_i = \Lambda_{ii}$$

being the non-negative diagonal elements of the classification correlation matrix is denoted as the classification relevance profile (CRP) of the features [106]. It describes the relevance of the features for class discrimination, with an interpretation analogous to that of $\Lambda_{ij}$. The classification influence profile (CIP), defined as $\kappa = (\kappa_1, \ldots, \kappa_n)^T$ with

$$\kappa_i = \sum_j \Lambda_{ij}$$

provides the importance of the $i$-th data feature in combination with all other features for the separation of the data set. Both profiles, as well as the classification correlation matrix, provide additional information beyond the pure classification performance and, hence, contribute to a high interpretability of the classification model [107].
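After training, these interpretability quantities follow directly from the learned mapping matrix. A short sketch (our own helper, assuming the plain row sum in the CIP definition above):

```python
import numpy as np

def interpretability_profiles(omega):
    """Derive CCM, CRP and CIP from a learned mapping matrix Omega of shape (m, n)."""
    ccm = omega.T @ omega          # classification correlation matrix Lambda
    crp = np.diag(ccm)             # classification relevance profile lambda_i = Lambda_ii
    cip = ccm.sum(axis=1)          # classification influence profile kappa_i
    return ccm, crp, cip
```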
Moreover, all mentioned GLVQ variants are robust classification learning models maximizing the hypothesis margin for most appropriate class separation [73,108].

4. Applications of Mutual Information Functions for Sequence Classification

In the following, we apply the described information theoretic quantities as characterizing features for biomolecular sequences. Particularly, we use the introduced variants of mutual information functions summarized in Table 2 and natural vectors as feature generators. Their performance is evaluated in combination with the LiRaM-LVQ for three biological classification tasks.

4.1. Data Sets

The chosen data sets summarized in Table 1 are representatives of biological applications facing the common challenges of varying sequence lengths and containing ambiguous characters (see Section 4.2.3).

4.1.1. Quadruplex Detection

This data set consists of 368 nucleotide sequences that were experimentally validated to either build or not build a G-quadruplex during folding. Quadruplexes are structural (3D) motifs of one or more nucleic acid strands consisting of at least two stacked tetrads. These are characterized by the planar arrangement of four nucleotides, each of which forms non-canonical bonds (base pairing schemes other than Watson-Crick) with two of the other nucleotides. If all tetrad-forming nucleotides are guanine, it is also denoted as the G-quadruplex, or G4. The utilized data are equivalent to that published by [109] without the random sequences (background sequences assumed to be non-G4). The data source is the G4RNA database [110].

4.1.2. lncRNA vs. mRNA

For the next task, we used a data set containing 10,000 human long non-coding RNA (lncRNA) sequences and 10,000 protein-coding transcripts (mRNA). lncRNA are transcripts that do not encode proteins, i.e., they are not translated, but play a role in gene regulation. Their typical length of more than 200 nucleotides (nt) delineates them from small non-coding RNA such as miRNAs or snoRNAs, and similarities in sequence structure compared to mRNA make their differentiation challenging [111]. The data set was generated analogously to [111]: The data were retrieved from the GENCODE database [112] in the latest version v.38 at the time of access (11 August 2021). Data preprocessing comprised filtering of sequences with lengths of 250–3000 nt and random selection of 10,000 sequences per class. In contrast to [111], we decided to use the same length interval for both classes in order not to bias our classifier toward using the sequence length as a class-discriminating property.

4.1.3. COVID Types

As a third data set, we took 156 coronavirus sequences from human hosts of the types A, B, and C, implicitly coding the evolution in time of the virus. The SARS-CoV-2 sequence data source is the GISAID (Global Initiative on Sharing Avian Influenza Data) coronavirus repository from 4 March 2020 with types derived from a phylogenetic network analysis in [113]. Type A is most similar to the bat virus, type B evolves from A by a non-synonymous and a synonymous mutation (evolutionary substitutions that do or do not modify the resulting amino acid sequence, respectively) and type C is characterized by a further non-synonymous mutation.

4.2. Feature Generation

In the following, we introduce the concept of natural vectors and provide a description of how to generate feature vectors from the information theoretical quantities MIF and rMIF introduced in Section 2 for machine learning applications. Both feature generators are sequence length independent and capable of handling ambiguous characters in biological data as covered in more detail in Section 4.2.3.

4.2.1. Natural Vectors

Natural vectors (NV) in biomolecular context accumulate statistical descriptors concerning the distribution of nucleotide positions within a sequence s = [ s 1 s n ] over the alphabet A = { A , C , G , T } . They were generalized in [47] from [114]. Natural vectors are known to be characteristic fingerprints for biomolecular sequences reflecting statistical and, hence, information theoretic properties. Therefore, we consider them as baseline for comparison with mutual information functions.
To define NV accurately, let $n_k = \sum_{i=1}^{n} w_k(s_i)$ be the absolute frequency of nucleotide $k$ in $s$, given that $w_k(s_i) \in \{0, 1\}$ indicates the absence (0) or presence (1) of $k$ at sequence position $s_i$. Furthermore, let $\mu_k = \sum_{i=1}^{n} \frac{i \cdot w_k(s_i)}{n_k}$ denote the mean sequence position of nucleotide $k$ and

$$D_k^j = \sum_{i=1}^{n} \frac{(i - \mu_k)^j\, w_k(s_i)}{n_k^{\,j-1}\, n^{\,j-1}}$$

be the normalized central moment of order $j$. Then, the natural vector is defined as

$$x = \bigl(n_A, \mu_A, D_A^2, \ldots, D_A^{n_A},\; n_C, \mu_C, D_C^2, \ldots, D_C^{n_C},\; n_G, \mu_G, D_G^2, \ldots, D_G^{n_G},\; n_T, \mu_T, D_T^2, \ldots, D_T^{n_T}\bigr)$$

Obviously, $\mu_k = D_k^1$ is valid and one can take $n_k = D_k^0$ in terms of the statistical moments. Furthermore, it was stated that this setting guarantees a unique coding of the molecular sequences [47]. In practical applications, the maximum order $j_{\max}$ of moments to be calculated is fixed, in dependence on the data set, equally for all nucleotides to achieve equal-length vectors for all considered sequences. Hence, the data dimension becomes $n = 4 \cdot (j_{\max} + 1)$.
In the experiments, we determined an optimal setting of $j_{\max}$ under consideration of the sequence length via grid search. To this end, we evaluated $j_{\max} \in \{2, 3, 4\}$, $\{2, \ldots, 15\}$ and $\{2, \ldots, 15\}$ for the Quadruplex detection, lncRNA vs. mRNA and COVID types data set, respectively. We directly take $x$ from Equation (34) as input (feature vector) for the LVQ model. The maximum order 15 for $j_{\max}$ was taken as an upper bound because higher moments were numerically vanishing for the used data sets.
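A compact Python sketch of this feature generator (our own illustration; the optional weights argument anticipates the handling of ambiguous characters described in Section 4.2.3) could look as follows:

```python
import numpy as np

def natural_vector(seq, j_max, weights=None):
    """Natural vector (n_k, mu_k, D_k^2, ..., D_k^{j_max}) for each nucleotide k in A, C, G, T."""
    n = len(seq)
    positions = np.arange(1, n + 1)
    features = []
    for k in "ACGT":
        w = np.array([weights[s][k] if weights else float(s == k) for s in seq])
        n_k = w.sum()
        if n_k == 0:                                  # nucleotide absent: pad with zeros
            features.extend([0.0] * (j_max + 1))
            continue
        mu_k = (positions * w).sum() / n_k
        features.extend([n_k, mu_k])
        for j in range(2, j_max + 1):
            D_kj = ((positions - mu_k) ** j * w).sum() / (n_k ** (j - 1) * n ** (j - 1))
            features.append(D_kj)
    return np.array(features)                         # dimension 4 * (j_max + 1)
```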

4.2.2. Mutual Information Functions

In the case of mutual information functions, the feature vector $x = (x_1, \ldots, x_{\tau_{\max}})^T$ is generated from a sequence $X$ by setting $x_\tau = F(X, \tau)$ or $x_\tau = F_\alpha^R(X, \tau)$ for the Shannon and Rényi variant, respectively. The maximum distance between pairs of nucleotides considered in the sequence is $\tau_{\max}$.
For the resolved mutual information functions, we take

$$x = \bigl(x_1^A, \ldots, x_{\tau_{\max}}^A,\; x_1^C, \ldots, x_{\tau_{\max}}^C,\; x_1^G, \ldots, x_{\tau_{\max}}^G,\; x_1^T, \ldots, x_{\tau_{\max}}^T\bigr)^T$$

with $x_\tau^k = F(k, \tau)$ or $x_\tau^k = F_\alpha^R(k, \tau)$, where $k \in \mathcal{A} = \{A, C, G, T\}$.
Table 2 summarizes the applied mutual information functions for the Shannon and Rényi case.
In the literature on MIF, there is disagreement on how to calculate the marginal probabilities of the nucleotides: one camp propagates a symmetric version, i.e., $p(x)$ denotes the relative frequency of a nucleotide $x$ in a sequence [52,54,56], while the other distinguishes the frequencies of the nucleotides at the first position $x$ and the shifted position $x(\tau)$, i.e., $p(x) = \sum_{x(\tau)} p\bigl(x, x(\tau)\bigr)$ and $p\bigl(x(\tau)\bigr) = \sum_{x} p\bigl(x, x(\tau)\bigr)$ [57,58,59]. We used the latter (non-symmetric) version, since biological sequences have a chemically reasonable reading direction, such that a nucleotide's neighbor is determined in the 3' direction.
An optimal setting of the hyper-parameter $\tau_{\max}$ was obtained under consideration of the sequence length via grid search. We evaluated $\tau_{\max} \in \{2, \ldots, 8\}$, $\{10, 25, 50, 100\}$ and $\{5, 10, 50, 100\}$ for the Quadruplex detection, lncRNA vs. mRNA and COVID types data set, respectively.
The $\alpha$-value for the Rényi variants was set to $\alpha = 2$, as usual. This choice leads to low computational costs and provides numerical stability [82].
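Building on the sketches from Section 2, the assembly of the rMIF feature vector then amounts to stacking the resolved values over all shifts and flattening them nucleotide-wise (again a hedged illustration with our own naming):

```python
def rmif_features(seq, tau_max, variant="shannon", alpha=2.0):
    """Concatenate resolved MIF values over tau = 1, ..., tau_max for all four nucleotides."""
    rows = []
    for tau in range(1, tau_max + 1):
        if variant == "shannon":
            rows.append(shannon_rmif(seq, tau))
        else:
            rows.append(renyi_rmif(seq, tau, alpha))
    F = np.stack(rows)               # shape (tau_max, 4): one column per nucleotide
    return F.T.reshape(-1)           # order: all shifts for A, then C, then G, then T
```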

4.2.3. Handling of Ambiguous Characters

Ambiguous characters are introduced by the IUPAC (International Union of Pure and Applied Chemistry) degenerate base notation [115]. Thereby, the term ambiguous refers to the concept that a single character from the alphabet extension $E = \{R, Y, M, K, S, W, H, B, V, D, N\}$ represents more than one nucleotide; such characters are present in data to describe incompletely specified bases or uncertainty about them [115]. For instance, R denotes either A or G, the ambiguous character H stands for either A, C, or T, whereas N codes for all four possible nucleotides.
In order to make the feature generators cope with these representations, the weights $0 \le w_k(s_i) \le 1$ now code the probability for, and not just the presence (1) or absence (0) of, a nucleotide at one specific sequence position, i.e.,

$$w_A(s_i) = \begin{cases} 1 & \text{if } s_i = A \\ 0 & \text{if } s_i \in \{C, G, T, Y, K, S, B\} \\ \tfrac{1}{2} & \text{if } s_i \in \{R, M, W\} \\ \tfrac{1}{3} & \text{if } s_i \in \{H, V, D\} \\ \tfrac{1}{4} & \text{if } s_i = N \end{cases}$$

In [116], natural vectors were expanded to handle this extended alphabet. We designed an analogous solution for the MIF variants.
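This weighting scheme translates directly into a lookup table; the sketch below builds it for all four nucleotides and can, for instance, be passed as the weights argument of the natural_vector sketch from Section 4.2.1:

```python
# Per-nucleotide probabilities for the IUPAC degenerate base codes, used as weights w_k(s_i).
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}
IUPAC_WEIGHTS = {
    code: {nt: (1.0 / len(bases) if nt in bases else 0.0) for nt in "ACGT"}
    for code, bases in IUPAC_CODES.items()
}
# Example: IUPAC_WEIGHTS["H"] == {"A": 1/3, "C": 1/3, "G": 0.0, "T": 1/3}
```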

4.3. Classification

Following all the feature extractors mentioned above, we applied a Z-score normalization in order to make the individual features comparable. Classification was then done using the LiRaM-LVQ implementation from the Python toolbox prototorch in 3-fold cross-validation. In all cases, the prototypes were initialized as randomly selected data points and the learning rate was set to 0.01. The mapping dimension $m$ was set to 10, independent of the data set and feature set. The choice of the number of prototypes was data set dependent: for Quadruplex detection and COVID types, we took only one prototype per class. For the lncRNA vs. mRNA data set, the grid search for an optimal setting resulted in 50 prototypes per class as a balance between model complexity and performance.
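The evaluation protocol itself is straightforward; a hedged Python scaffold (using scikit-learn utilities for normalization and splitting, with make_model standing in as a hypothetical factory for the LiRaM-LVQ classifier, since the actual prototorch calls are omitted here) could look like this:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

def evaluate(features, labels, make_model, n_splits=3, seed=0):
    """Z-score normalization plus stratified k-fold cross-validation of a classifier."""
    accuracies = []
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in cv.split(features, labels):
        scaler = StandardScaler().fit(features[train_idx])
        X_train = scaler.transform(features[train_idx])
        X_test = scaler.transform(features[test_idx])
        model = make_model()                       # hypothetical factory, e.g., LiRaM-LVQ with m = 10
        model.fit(X_train, labels[train_idx])
        accuracies.append(np.mean(model.predict(X_test) == labels[test_idx]))
    return np.mean(accuracies), np.std(accuracies)
```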

5. Results and Discussion

5.1. Classification Performance

Table 3 displays the achieved test accuracies by LiRaM-LVQ in combination with the optimal parameter setting of j max and τ max for the NV and MIF variant feature extractors, respectively.
Considering the results in Table 3, we see that rMIF outperforms the MIF variants as well as the models using NV for feature generation on all three data sets. Furthermore, for the Quadruplex detection example, the developed Rényi rMIF variant shows significantly better results than its Shannon counterpart. However, for the second data set, the performance of rMIF depends on the choice of $\tau_{\max}$. In general, for long sequences, $\tau_{\max}$ needs to be chosen adequately if long-range correlations are to be considered as well.
For a deeper investigation of these results and to show the capabilities of the applied LiRaM-LVQ classifier, we will consider the CCM and CIP. Furthermore, visualizations of the mean MIF and rMIF per class and data set are considered for a deeper understanding of the generated features and their potential differences between classes. In order not to overload the reader, we restrict a more in-depth interpretation and discussion to one of the data sets, the quadruplex detection challenge.
It should be noted that a feature generation procedure based on pure statistics might achieve comparable or even better results. For this reason, it is not surprising that the statistical feature extractor Bag of Words [117,118] has been successful in related works on the mentioned data sets: 92.8% AUC was achieved for the quadruplex data in combination with a simple neural network [109], an accuracy of 98.7% was described in [111] for the lncRNA vs. mRNA data by use of a convolutional neural network, and 97.4% accuracy was obtained in [119] for COVID type detection using GMLVQ. However, the focus of this paper is on the investigation and further development of information theoretic methods and their suitability for sequence analyses in computational biology.

5.2. Visualization of MIF Variants

A closer look at the class-wise averaged MIF variant profiles in Figure 1 allows for assessing the methods' behavior on the quadruplex data set. The plotted means suggest a clear class delineation, while the standard deviations indicate the difficulty of the problem. All profiles are plotted prior to Z-score normalization, but with a slight vertical shift between the classes for better visual perception.
Comparison of the MIF and rMIF clearly shows a more accurate resolution of the information for rMIF, not only in terms of inter-sequential distances, but also in terms of individual nucleotides. Obviously, the sum of the four $F(x, \tau)$ with $x \in \{A, C, G, T\}$ yields $F(X, \tau)$. The features with respect to G-nucleotides stand out in particular.

5.3. Interpretation of CCM and CIP of the Trained LiRaM-LVQ Model

The resulting CCM of the trained LiRaM-LVQ model gives domain experts, here biologists, immediate assistance to evaluate whether the classifier works reasonably. Furthermore, it allows statements to be made about whether the classification decision is based on some data biases or artifacts that were not necessarily known during data generation [104]. This interpretation possibility of the LiRaM-LVQ model is a huge advantage in comparison to black box models [120] especially in biological issues: Together with meaningful data features, as given here, biologists can draw conclusions regarding expected biological and biochemical properties.
In the experiments, we verified that the CCM can serve as a basis for interpretation by repeating the classification process multiple times and analyzing whether the matrix is visually stable. If significant deviations had been observed, an interpretation would have been spurious. Each depicted CCM is the result of averaging the individual CCMs obtained from the three validation folds. Furthermore, we limit the visualization to the best hyper-parameter setting according to our grid search.
For the quadruplex data set, the best choice τ max = 7 gives a CCM with dimensionality 7 × 7 and 28 × 28 for the MIF and rMIF case, respectively. These quantities are visualized in Figure 2 giving insights into the classification decision of LiRaM-LVQ:
As can be seen from the CIP and from the CCM's main diagonal (the CRP) in Figure 2a, the MIF values for $\tau$ equal to 1, 4, and 6 mainly influence the classifier's decision to discriminate the classes of G-quadruplex (G4) and non-G4 forming sequences. Moreover, the CCM shows positive and negative correlations between the features. For example, the features $\tau = 4$ and $\tau = 6$ are strongly positively correlated, whereas $\tau = 1$ and $\tau = 4$ are in strong negative correlation with each other. Thus, if $\tau = 1$ has a high value, it is important for the class discrimination, but only if $\tau = 4$ has a small value, and vice versa. It is striking that the feature $\tau = 5$ contributes to the differentiation neither alone nor in combination with any other feature for this learned model.
In Figure 2b, the CIP illustrates that eight features stand out with their influence on the class discrimination. Sorted by importance, these are: the information for ( G , 2 ) , ( G , 3 ) , ( A , 1 ) , ( C , 3 ) , ( G , 7 ) , ( G , 4 ) , ( G , 5 ) , and ( A , 6 ) . Taking the CCM into consideration, a high positive correlation between ( A , 1 ) and ( G , 2 ) as well as between ( C , 3 ) and ( G , 2 ) is obvious. Examples for high negative correlations would be the pairs of ( G , 2 ) and ( G , 3 ) or between ( A , 1 ) and ( G , 3 ) . The clearly recognizable significance of Gs at different distances is biologically sound due to the general characterization of a G-quadruplex by a pattern of recurring guanines in the sequence as described in [121]. Insights like this would not have been possible with the standard MIF but only with our introduced resolved variant rMIF.
At first glance, one might claim an inconsistency between the high influence values in the CIPs for MIF and rMIF features. However, in the MIF calculation, there is an averaging of the information over the alphabet, such that the classifier can make use of more detailed information with rMIF. This means that the summation of the classification influence values over all four $(x, \tau)$ does not necessarily result in the influence value of the MIF for a specific $\tau$, and vice versa. As the individual nucleotides play a key role in the bioinformatics domain, an essential loss of information may occur if an averaging procedure takes place, as it is done for the simpler MIF.
Besides biological interpretations, these insights offer the possibility to adjust or fine-tune the classification model. For example, by taking just the seven most important rMIF features into account, we still obtain a performance of 77.1 ± 0.7%. Hence, we could reduce the model complexity with only a moderate performance decrease.
To sum up, the LiRaM-LVQ classifier is transparent in the decision process as well as in the Hebbian learning process. Now, the expert can start to evaluate the results and either extract knowledge from the classifier or question the quality of the data/model if the results seem peculiar.
For the sake of completeness, Figure 3 shows the CCM and CIP for the lncRNA vs. mRNA as well as for the COVID type data set. The Shannon rMIF features were superior in these tasks. Our grid search resulted in high optimal values for $\tau_{\max}$, which is fine for pure performance evaluation but poses a problem for visually evaluating the CCMs and drawing conclusions. Therefore, we applied the same procedure as described above: we identified the 30 most important features using the CIP, ran the classification procedure again using only these, and finally visualized both characteristics.
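In code, this pruning step is a one-liner on top of the interpretability_profiles sketch from Section 3 (ranking by the magnitude of the CIP values is our own choice here):

```python
def top_k_features(cip, k):
    """Indices of the k features with the largest classification influence (by magnitude)."""
    return np.argsort(np.abs(cip))[::-1][:k]

# e.g., keep the 30 most influential rMIF features before retraining and re-visualizing CCM and CIP
# selected = top_k_features(cip, 30); reduced_features = features[:, selected]
```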
An in-depth analysis of the results including biological interpretation is up to the well-disposed reader.

6. Conclusions, Remarks, and Future Work

In this contribution, we propose information theoretic concepts and quantities to characterize spatial correlations in sequences. In particular, we introduced several variants of mutual information functions for Shannon, Rényi, and Tsallis information theoretic approaches. In particular, the resolved mutual information functions provide subtle information regarding the internal spatial correlation of the sequences.
These functions/quantities can be used as sequence signatures/fingerprints and thus for comparison in machine learning approaches. In particular, interpretable machine learning models can make use of this resolved information to achieve insights about the sequence class differences. As we have shown using our favored LiRaM-LVQ, detailed information can be extracted as an add-on to the pure classification model.
We see applications for sequence analysis in bioinformatics, especially in the context of alignment-free sequence comparison. Additionally, we remark that this concept can be extended to the analysis of more general categorical sequential data such as natural language texts or sheet music.
In the future work, we will extend this approach to further mutual information concepts related to other widely considered entropy measures and information theoretic quantities, e.g., the Cauchy–Schwarz-divergence [85], or more general α -, β - and γ -divergences with related mutual information concepts [74,91,122]. Further considerations could be a generalization to higher than two-body correlations as suggested in [123] or performing the calculation for sequences not 1 by 1 residue (position), but multiple residues [59].
Furthermore, we want to compare these methods with other feature generators taking statistical (spatial) correlation into account such as the return time distribution [124] known from stochastic modeling, DMk method [125] incorporating the occurrence, location, and order relation of k-mers, compression based methods with the underlying concept of minimum description length [126], methods based on domain transform, i.e., Fourier/Wavelet [127,128], DNA walks [45,129] and iterated function systems, e.g., chaos game representation or universal sequence maps [42,130].
However, interpretability should be kept always as a key feature when considering alternative models [25,131,132]. Interpretability increases the trustworthiness and hence the acceptance of models for the potential users [27]. Further extensions improving transparency of the decision and already known for GLVQ approaches are the incorporation of reject options for ambiguous decisions or outliers as well as the use of interpretable probabilistic classifiers [133,134,135].

Author Contributions

Conceptualization, K.S.B., M.K. and T.V.; methodology, K.S.B., M.K., S.S. and T.V.; software and visualization, K.S.B. and M.K.; validation, resources and data curation, K.S.B. and J.A.; formal analysis, K.S.B., S.S. and T.V.; investigation, K.S.B., J.A. and M.K.; writing—original draft preparation, T.V. and K.S.B.; writing—review and editing, M.K., J.A. and S.S.; supervision and project administration, T.V.; funding acquisition, T.V. and M.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the European Social Fund Grant No. 100381749.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data set for quadruplex detection is publicly available at https://0-academic-oup-com.brum.beds.ac.uk/bioinformatics/article/33/22/3532/4061281#supplementary-data, that for lncRNA vs. mRNA at https://www.gencodegenes.org/human/ (version v.38), and the accession numbers for the COVID type detection at https://www.springerprofessional.de/learning-vector-quantization-as-an-interpretable-classifier-for-/19111526?fulltextView=true (all accessed on 11 August 2021). The toolbox prototorch is publicly available at https://github.com/si-cim/prototorch and was used in version 0.2.0. The code for the NV and MIF variant calculation can be obtained from the authors upon request.

Acknowledgments

The authors would like to thank Mirko Weber, Daniel Staps, Alexander Engelsberger and Jensun Ravichandran, all from the University of Applied Sciences Mittweida, for useful discussions and technical support.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AMI: Average Mutual Information
CCM: Classification Correlation Matrix
CIP: Classification Influence Profile
CRP: Classification Relevance Profile
GISAID: Global Initiative on Sharing Avian Influenza Data
GLVQ: Generalized Learning Vector Quantization
GMLVQ: Generalized Matrix Learning Vector Quantization
IUPAC: International Union of Pure and Applied Chemistry
lncRNA: Long Non-Coding RNA
LiRaM-LVQ: Limited Rank Matrix Learning Vector Quantization
LVQ: Learning Vector Quantization
MIF: Mutual Information Function
mRNA: Messenger RNA
NV: Natural Vectors
rMIF: Resolved Mutual Information Function
RMIF: Rényi Mutual Information Function
rRMIF: Resolved Rényi Mutual Information Function
rTMIF: Resolved Tsallis Mutual Information Function
SGDL: Stochastic Gradient Descent Learning
TMIF: Tsallis Mutual Information Function

References

  1. Schrödinger, E. What Is Life? Cambridge University Press: Cambridge, UK, 1944. [Google Scholar]
  2. Eigen, M.; Schuster, P. Stages of emerging life —Five principles of early organization. J. Mol. Evol. 1982, 19, 47–61. [Google Scholar] [CrossRef]
  3. Haken, H. Synergetics—An Introduction Nonequilibrium Phase Transitions and Self-Organization in Physics, Chemistry and Biology; Springer: Berlin/Heidelberg, Germany, 1983. [Google Scholar]
  4. Haken, H. Information and Self-Organization; Springer: Berlin/Heidelberg, Germany, 1988. [Google Scholar]
  5. Baldi, P.; Brunak, S. Bioinformatics, 2nd ed.; MIT Press: Cambridge, MA, USA, 2001. [Google Scholar]
  6. Gatlin, L. The information content of DNA. J. Theor. Biol. 1966, 10, 281–300. [Google Scholar] [CrossRef]
  7. Gatlin, L. The information content of DNA. II. J. Theor. Biol. 1968, 18, 181–194. [Google Scholar] [CrossRef]
  8. Chanda, P.; Costa, E.; Hu, J.; Sukumar, S.; Hemert, J.V.; Walia, R. Information Theory in Computational Biology: Where We Stand Today. Entropy 2020, 22, 627. [Google Scholar] [CrossRef] [PubMed]
  9. Adami, C. Information Theory in Molecular Biology. Phys. Life Rev. 2004, 1, 3–22. [Google Scholar] [CrossRef] [Green Version]
  10. Vinga, S. Information Theory Applications for Biological Sequence Analysis. Briefings Bioinform. 2014, 15, 376–389. [Google Scholar] [CrossRef] [Green Version]
  11. Uda, S. Application of Information Theory in Systems Biology. Biophys. Rev. 2020, 12, 377–384. [Google Scholar] [CrossRef] [Green Version]
  12. Smith, M. DNA Sequence Analysis in Clinical Medicine, Proceeding Cautiously. Front. Mol. Biosci. 2017, 4, 24. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  13. Mardis, E.R. DNA sequencing technologies: 2006–2016. Nat. Protoc. 2017, 12, 213–218. [Google Scholar] [CrossRef]
  14. Hall, B.G. Building Phylogenetic Trees from Molecular Data with MEGA. Mol. Biol. Evol. 2013, 30, 1229–1235. [Google Scholar] [CrossRef] [Green Version]
  15. Xia, X. Bioinformatics and Drug Discovery. Curr. Top. Med. Chem. 2017, 17, 1709–1726. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  16. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
  17. Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS); Curran Associates, Inc.: San Diego, CA, USA, 2012; pp. 1097–1105. [Google Scholar]
  18. Schölkopf, B.; Smola, A. Learning with Kernels; MIT Press: Cambridge, MA, USA, 2002. [Google Scholar]
  19. Angermueller, C.; Pärnamaa, T.; Parts, L.; Stegle, O. Deep Learning for Computational Biology. Mol. Sys. Biol. 2016, 12, 878. [Google Scholar] [CrossRef] [PubMed]
  20. Min, S.; Lee, B.; Yoon, S. Deep learning in bioinformatics. Briefings Bioinform. 2016, 1–16. [Google Scholar] [CrossRef] [Green Version]
  21. Nguyen, N.; Tran, V.; Ngo, D.; Phan, D.; Lumbanraja, F.; Faisal, M.; Abapihi, B.; Kubo, M.; Satou, K. DNA Sequence Classification by Convolutional Neural Network. J. Biomed. Sci. Eng. 2016, 9, 280–286. [Google Scholar] [CrossRef] [Green Version]
  22. Jaakkola, T.; Diekhans, M.; Haussler, D. A discrimitive framework for detecting remote protein homologies. J. Comput. Biol. 2000, 7, 95–114. [Google Scholar] [CrossRef] [Green Version]
  23. Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A.; et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583–589. [Google Scholar] [CrossRef] [PubMed]
  24. Samek, W.; Monatvon, G.; Vedaldi, A.; Hansen, L. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning; Number 11700 in LNAI; Müller, K.R., Ed.; Springer: Berlin/Heidelberg, Germany, 2019. [Google Scholar]
  25. Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 2019, 1, 206–215. [Google Scholar] [CrossRef] [Green Version]
  26. Zeng, J.; Ustun, B.; Rudin, C. Interpretable classification models for recidivism prediction. J. R. Stat. Soc. Ser. A 2017, 180, 1–34. [Google Scholar] [CrossRef]
  27. Lisboa, P.; Saralajew, S.; Vellido, A.; Villmann, T. The coming of age of interpretable and explainable machine learning models. In Proceedings of the 29th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN’2021), Bruges, Belgium, 6–8 October 2021; Verleysen, M., Ed.; i6doc.com: Louvain-La-Neuve, Belgium, 2021; pp. 547–556. [Google Scholar]
  28. Zielezinski, A.; Vinga, S.; Almeida, J.; Karlowski, W.M. Alignment-Free Sequence Comparison: Benefits, Applications, and Tools. Genome Biol. 2017, 18, 186. [Google Scholar] [CrossRef] [Green Version]
  29. Just, W. Computational Complexity of Multiple Sequence Alignment with SP-Score. J. Comput. Biol. 2001, 8, 615–623. [Google Scholar] [CrossRef] [Green Version]
  30. Kucherov, G. Evolution of Biosequence Search Algorithms: A Brief Survey. Bioinformatics 2019, 35, 3547–3552. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  31. Haubold, B. Alignment-Free Phylogenetics and Population Genetics. Briefings Bioinform. 2014, 15, 407–418. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  32. Chan, C.X.; Bernard, G.; Poirion, O.; Hogan, J.M.; Ragan, M.A. Inferring Phylogenies of Evolving Sequences without Multiple Sequence Alignment. Sci. Rep. 2014, 4, 6504. [Google Scholar] [CrossRef] [Green Version]
  33. Hatje, K.; Kollmar, M. A Phylogenetic Analysis of the Brassicales Clade Based on an Alignment-Free Sequence Comparison Method. Front. Plant Sci. 2012, 3, 192. [Google Scholar] [CrossRef] [Green Version]
  34. Wu, Y.W.; Ye, Y. A Novel Abundance-Based Algorithm for Binning Metagenomic Sequences Using l-Tuples. J. Comput. Biol. J. Comput. Mol. Cell Biol. 2011, 18, 523–534. [Google Scholar] [CrossRef] [Green Version]
  35. Leung, G.; Eisen, M.B. Identifying Cis-Regulatory Sequences by Word Profile Similarity. PLoS ONE 2009, 4, e6901. [Google Scholar] [CrossRef] [Green Version]
  36. de Lima Nichio, B.T.; de Oliveira, A.M.R.; de Pierri, C.R.; Santos, L.G.C.; Lejambre, A.Q.; Vialle, R.A.; da Rocha Coimbra, N.A.; Guizelini, D.; Marchaukoski, J.N.; de Oliveira Pedrosa, F.; et al. RAFTS3G: An Efficient and Versatile Clustering Software to Analyses in Large Protein Datasets. BMC Bioinform. 2019, 20, 392. [Google Scholar] [CrossRef] [Green Version]
  37. Bray, N.L.; Pimentel, H.; Melsted, P.; Pachter, L. Near-Optimal Probabilistic RNA-Seq Quantification. Nat. Biotechnol. 2016, 34, 525–527. [Google Scholar] [CrossRef]
  38. Zerbino, D.R.; Birney, E. Velvet: Algorithms for de Novo Short Read Assembly Using de Bruijn Graphs. Genome Res. 2008, 18, 821–829. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  39. Pajuste, F.D.; Kaplinski, L.; Möls, M.; Puurand, T.; Lepamets, M.; Remm, M. FastGT: An Alignment-Free Method for Calling Common SNVs Directly from Raw Sequencing Reads. Sci. Rep. 2017, 7, 2537. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  40. Luo, L.; Lee, W.; Jia, L.; Ji, F.; Tsai, L. Statistical correlation of nucleotides in a DNA sequence. Phys. Rev. E 1998, 58, 861–871. [Google Scholar] [CrossRef]
  41. Luo, L.; Li, H. The statistical correlation of nucleotides in protein-coding DNA sequences. Bull. Math. Biol. 1991, 53, 345–353. [Google Scholar] [CrossRef]
  42. Jeffrey, H. Chaos Game Representation of Gene Structure. Nucleic Acids Res. 1990, 18, 2163–2170. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  43. Lin, J.; Adjeroh, D.; Jiang, B.H.; Jiang, Y. K2 and K2*: Efficient alignment-free sequence similarity measurement based on Kendall statistics. Bioinformatics 2018, 34, 1682–1689. [Google Scholar] [CrossRef]
  44. Li, W. The study of correlation structures of DNA sequences: A critical review. Comput. Chem. 1997, 21, 257–271. [Google Scholar] [CrossRef] [Green Version]
  45. Peng, C.K.; Buldyrev, S.V.; Goldberger, A.L.; Havlin, S.; Sciortino, F.; Simons, M.; Stanley, H.E. Long-Range Correlations in Nucleotide Sequences. Nature 1992, 356, 168–170. [Google Scholar] [CrossRef]
  46. Voss, R. Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Phys. Rev. A 1992, 68, 3805–3808. [Google Scholar] [CrossRef]
  47. Deng, M.; Yu, C.; Liang, Q.; He, R.L.; Yau, S.S.T. A Novel Method of Characterizing Genetic Sequences: Genome Space with Biological Distance and Applications. PLoS ONE 2011, 6, e17293. [Google Scholar] [CrossRef]
  48. Li, Y.; Tian, K.; Yin, C.; He, R.; Yau, S.T. Virus classification in 60-dimensional protein space. Mol. Phylogenet. Evol. 2016, 99, 53–62. [Google Scholar] [CrossRef] [PubMed]
  49. Wang, Y.; Tian, K.; Yau, S. Proteine Sequence Classification using natural vectors and the convex hull method. J. Comput. Biol. 2019, 26, 315–321. [Google Scholar] [CrossRef]
  50. Li, W. Mutual information functions versus correlation functions. J. Stat. Phys. 1990, 60, 823–837. [Google Scholar] [CrossRef]
  51. Herzel, H.; Grosse, I. Measuring correlations in symbol sequences. Phys. A 1995, 216, 518–542. [Google Scholar] [CrossRef]
  52. Berryman, M.; Allison, A.; Abbott, D. Mutual information for examining correlations in DNA. Fluct. Noise Lett. 2004, 4, 237–246. [Google Scholar] [CrossRef] [Green Version]
  53. Swati, D. Use of Mutual Information Function and Power Spectra for Analyzing the Structure of Some Prokaryotic Genomes. Am. J. Math. Manag. Sci. 2007, 27, 179–198. [Google Scholar] [CrossRef]
  54. Bauer, M.; Schuster, S.; Sayood, K. The average mutual information profile as a genomic signature. BMC Bioinform. 2008, 9, 1–11. [Google Scholar] [CrossRef] [Green Version]
  55. Gregori-Puigjané, E.; Mestres, J. SHED: Shannon Entropy Descriptors from Topological Feature Distributions. J. Chem. Inf. Model. 2006, 46, 1615–1622. [Google Scholar] [CrossRef] [PubMed]
  56. Dehnert, M.; Helm, W.E.; Hütt, M.T. Information Theory Reveals Large-Scale Synchronisation of Statistical Correlations in Eukaryote Genomes. Gene 2005, 345, 81–90. [Google Scholar] [CrossRef]
  57. Grosse, I.; Herzel, H.; Buldyrev, S.V.; Stanley, H.E. Species Independence of Mutual Information in Coding and Noncoding DNA. Phys. Rev. E 2000, 61, 5624–5629. [Google Scholar] [CrossRef] [Green Version]
  58. Korber, B.T.; Farber, R.M.; Wolpert, D.H.; Lapedes, A.S. Covariation of Mutations in the V3 Loop of Human Immunodeficiency Virus Type 1 Envelope Protein: An Information Theoretic Analysis. Proc. Natl. Acad. Sci. USA 1993, 90, 7176–7180. [Google Scholar] [CrossRef] [Green Version]
  59. Lichtenstein, F.; Antoneli, F.; Briones, M.R.S. MIA: Mutual Information Analyzer, a Graphic User Interface Program That Calculates Entropy, Vertical and Horizontal Mutual Information of Molecular Sequence Sets. BMC Bioinform. 2015, 16, 409. [Google Scholar] [CrossRef] [Green Version]
60. Nalbantoglu, Ö.U.; Russell, D.J.; Sayood, K. Data Compression Concepts and Algorithms and Their Applications to Bioinformatics. Entropy 2010, 12, 34–52. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  61. Shannon, C. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–432. [Google Scholar] [CrossRef] [Green Version]
62. Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1960; Neyman, J., Ed.; University of California Press: Berkeley, CA, USA, 1961. [Google Scholar]
  63. Rényi, A. Probability Theory; North-Holland Publishing Company: Amsterdam, The Netherlands, 1970. [Google Scholar]
64. Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487. [Google Scholar]
  65. Sparavigna, A. Mutual Information and Nonadditive Entropies: The Case of Tsallis Entropy. Int. J. Sci. 2015, 4. [Google Scholar] [CrossRef]
66. Villmann, T.; Geweniger, T. Multi-class and Cluster Evaluation Measures Based on Rényi and Tsallis Entropies and Mutual Information. In Proceedings of the 17th International Conference on Artificial Intelligence and Soft Computing-ICAISC, Zakopane, Poland, 3–7 June 2018; Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J., Eds.; Springer International Publishing: Cham, Switzerland, 2018; LNCS 10841; pp. 724–735. [Google Scholar] [CrossRef]
  67. Vinga, S.; Almeida, J. Local Rényi entropic profiles of DNA sequences. BMC Bioinform. 2007, 8, 1–19. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  68. Vinga, S.; Almeida, J. Rényi continuous entropy of DNA sequences. J. Theor. Biol. 2004, 231, 377–388. [Google Scholar] [CrossRef]
69. Delgado-Soler, L.; Toral, R.; Tomás, M.; Rubio-Martinez, J. RED: A Set of Molecular Descriptors Based on Rényi Entropy. J. Chem. Inf. Model. 2009, 49, 2457–2468. [Google Scholar] [CrossRef]
  70. Papapetrou, M.; Kugiumtzis, D. Tsallis conditional mutual information in investigating long range correlation in symbol sequences. Phys. A 2020, 540, 1–13. [Google Scholar] [CrossRef]
  71. Gao, Y.; Luo, L. Genome-based phylogeny of dsDNA viruses by a novel alignment-free method. Gene 2012, 492, 309–314. [Google Scholar] [CrossRef]
  72. Schneider, P.; Biehl, M.; Hammer, B. Adaptive Relevance Matrices in Learning Vector Quantization. Neural Comput. 2009, 21, 3532–3561. [Google Scholar] [CrossRef] [Green Version]
  73. Saralajew, S.; Holdijk, L.; Villmann, T. Fast Adversarial Robustness Certification of Nearest Prototype Classifiers for Arbitrary Seminorms. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Virtual-only Conference, 6–12 December 2020; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 13635–13650. [Google Scholar]
  74. Cichocki, A.; Amari, S. Families of Alpha- Beta- and Gamma-Divergences: Flexible and Robust Measures of Similarities. Entropy 2010, 12, 1532–1568. [Google Scholar] [CrossRef] [Green Version]
75. MacKay, D. Information Theory, Inference, and Learning Algorithms; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
  76. Kullback, S.; Leibler, R. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  77. Kantz, H.; Schreiber, T. Nonlinear Time Series Analysis; Cambridge Nonlinear Science Series; Cambridge University Press: Cambridge, UK, 1997; Volume 7. [Google Scholar]
  78. Fraser, A.; Swinney, H. Independent coordinates for strange attractors from mutual information. Phys. Rev. A 1986, 33, 1134–1140. [Google Scholar] [CrossRef]
  79. Li, W. Mutual Information Functions of Natural Language Texts; Technical Report SFI-89-10-008; Santa Fe Institute: Santa Fe, NM, USA, 1989. [Google Scholar]
80. Golub, G.; Loan, C.V. Matrix Computations, 4th ed.; Johns Hopkins Studies in the Mathematical Sciences; Johns Hopkins University Press: Baltimore, MD, USA, 2013. [Google Scholar]
  81. Horn, R.; Johnson, C. Matrix Analysis, 2nd ed.; Cambridge University Press: Cambridge, UK, 2013. [Google Scholar]
82. Erdogmus, D.; Principe, J.; Hild, K.E., II. Beyond second-order statistics for learning: A pairwise interaction model for entropy estimation. Nat. Comput. 2002, 1, 85–108. [Google Scholar] [CrossRef]
  83. Hild, K.; Erdogmus, D.; Principe, J. Blind Source Separation Using Rényi’s Mutual Information. IEEE Signal Process. Lett. 2001, 8, 174–176. [Google Scholar] [CrossRef]
  84. Jenssen, R.; Principe, J.; Erdogmus, D.; Eltoft, T. The Cauchy-Schwarz divergence and Parzen windowing: Connections to graph theory and Mercer kernels. J. Frankl. Inst. 2006, 343, 614–629. [Google Scholar] [CrossRef]
  85. Lehn-Schiøler, T.; Hegde, A.; Erdogmus, D.; Principe, J. Vector quantization using information theoretic concepts. Nat. Comput. 2005, 4, 39–51. [Google Scholar] [CrossRef] [Green Version]
  86. Principe, J. Information Theoretic Learning; Springer: Heidelberg, Germany, 2010. [Google Scholar]
  87. Singh, A.; Principe, J. Information theoretic learning with adaptive kernels. Signal Process. 2011, 91, 203–213. [Google Scholar] [CrossRef]
  88. Villmann, T.; Haase, S. Divergence based vector quantization. Neural Comput. 2011, 23, 1343–1392. [Google Scholar] [CrossRef] [PubMed]
  89. Mwebaze, E.; Schneider, P.; Schleif, F.M.; Aduwo, J.; Quinn, J.; Haase, S.; Villmann, T.; Biehl, M. Divergence based classification in Learning Vector Quantization. Neurocomputing 2011, 74, 1429–1435. [Google Scholar] [CrossRef] [Green Version]
  90. Bunte, K.; Haase, S.; Biehl, M.; Villmann, T. Stochastic Neighbor Embedding (SNE) for Dimension Reduction and Visualization Using Arbitrary Divergences. Neurocomputing 2012, 90, 23–45. [Google Scholar] [CrossRef] [Green Version]
  91. Csiszár, I. Axiomatic Characterization of Information Measures. Entropy 2008, 10, 261–273. [Google Scholar] [CrossRef] [Green Version]
  92. Fehr, S.; Berens, S. On the Conditional Rényi Entropy. IEEE Trans. Inf. Theory 2014, 60, 6801–6810. [Google Scholar] [CrossRef]
  93. Teixeira, A.; Matos, A.; Antunes, L. Conditional Rényi Entropies. IEEE Trans. Inf. Theory 2012, 58, 4273–4277. [Google Scholar] [CrossRef]
94. Iwamoto, M.; Shikata, J. Revisiting Conditional Rényi Entropies and Generalizing Shannon's Bounds in Information Theoretically Secure Encryption; Technical Report; Cryptology ePrint Archive 440/2013; International Association for Cryptologic Research (IACR): Lyon, France, 2013. [Google Scholar]
  95. Ilić, V.; Djordjević, I.; Stanković, M. On a General Definition of Conditional Rényi Entropies. In Proceedings of the 4th International Electronic Conference on Entropy and Its Application (ECEA 2017), Online, 21 November–1 December 2017; Volume 2, pp. 1–6. [Google Scholar] [CrossRef] [Green Version]
96. Jizba, P.; Arimitsu, T. The world according to Rényi: Thermodynamics of multifractal systems. In AIP Conference Proceedings; American Institute of Physics (AIP): College Park, MD, USA, 2001; Volume 597, pp. 341–348. [Google Scholar] [CrossRef] [Green Version]
97. Cai, C.; Verdú, S. Conditional Rényi divergence saddlepoint and the maximization of α-mutual information. Entropy 2019, 21, 969. [Google Scholar] [CrossRef] [Green Version]
98. Havrda, J.; Charvát, F. Quantification method of classification processes: Concept of structural α-entropy. Kybernetika 1967, 3, 30–35. [Google Scholar]
99. Vila, M.; Bardera, A.; Feixas, M.; Sbert, M. Tsallis Mutual Information for Document Classification. Entropy 2011, 13, 1694–1707. [Google Scholar] [CrossRef]
100. Kohonen, T. Learning Vector Quantization. Neural Netw. 1988, 1, 303. [Google Scholar]
  101. Kohonen, T. Self-Organizing Maps; Springer Series in Information Sciences; Springer: Berlin/Heidelberg, Germany, 1995; Volume 30. [Google Scholar]
  102. Biehl, M.; Hammer, B.; Villmann, T. Prototype-based Models for the Supervised Learning of Classification Schemes. Proc. Int. Astron. Union 2017, 12, 129–138. [Google Scholar] [CrossRef] [Green Version]
  103. Sato, A.; Yamada, K. Generalized learning vector quantization. In Advances in Neural Information Processing Systems 8, Proceedings of the 1995 Conference; Touretzky, D.S., Mozer, M.C., Hasselmo, M.E., Eds.; MIT Press: Cambridge, MA, USA, 1996; pp. 423–429. [Google Scholar]
  104. Bunte, K.; Schneider, P.; Hammer, B.; Schleif, F.M.; Villmann, T.; Biehl, M. Limited Rank Matrix Learning, discriminative dimension reduction and visualization. Neural Netw. 2012, 26, 159–173. [Google Scholar] [CrossRef] [Green Version]
  105. Villmann, T.; Bohnsack, A.; Kaden, M. Can Learning Vector Quantization be an Alternative to SVM and Deep Learning? J. Artif. Intell. Soft Comput. Res. 2017, 7, 65–81. [Google Scholar] [CrossRef] [Green Version]
  106. Hammer, B.; Villmann, T. Generalized Relevance Learning Vector Quantization. Neural Netw. 2002, 15, 1059–1068. [Google Scholar] [CrossRef]
  107. Biehl, M.; Hammer, B.; Villmann, T. Prototype-based models in machine learning. Wiley Interdiscip. Rev. Cogn. Sci. 2016, 7, 92–111. [Google Scholar] [CrossRef]
108. Crammer, K.; Gilad-Bachrach, R.; Navot, A.; Tishby, N. Margin analysis of the LVQ algorithm. In Advances in Neural Information Processing (Proc. NIPS 2002); Becker, S., Thrun, S., Obermayer, K., Eds.; MIT Press: Cambridge, MA, USA, 2003; Volume 15, pp. 462–469. [Google Scholar]
  109. Garant, J.M.; Perreault, J.P.; Scott, M.S. Motif Independent Identification of Potential RNA G-Quadruplexes by G4RNA Screener. Bioinformatics 2017, 33, 3532–3537. [Google Scholar] [CrossRef] [Green Version]
  110. Garant, J.M.; Luce, M.J.; Scott, M.S.; Perreault, J.P. G4RNA: An RNA G-Quadruplex Database. Database 2015, 2015. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  111. Wen, J.; Liu, Y.; Shi, Y.; Huang, H.; Deng, B.; Xiao, X. A Classification Model for lncRNA and mRNA Based on K-Mers and a Convolutional Neural Network. BMC Bioinform. 2019, 20, 469. [Google Scholar] [CrossRef] [PubMed]
  112. Frankish, A.; Diekhans, M.; Ferreira, A.M.; Johnson, R.; Jungreis, I.; Loveland, J.; Mudge, J.M.; Sisu, C.; Wright, J.; Armstrong, J.; et al. GENCODE Reference Annotation for the Human and Mouse Genomes. Nucleic Acids Res. 2019, 47, D766–D773. [Google Scholar] [CrossRef] [Green Version]
  113. Forster, P.; Forster, L.; Renfrew, C.; Forster, M. Phylogenetic Network Analysis of SARS-CoV-2 Genomes. Proc. Natl. Acad. Sci. USA 2020, 117, 9241–9243. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  114. Liu, L.; Ho, Y.k.; Yau, S. Clustering DNA Sequences by Feature Vectors. Mol. Phylogenet. Evol. 2006, 41, 64–69. [Google Scholar] [CrossRef] [PubMed]
  115. Cornish-Bowden, A. Nomenclature for Incompletely Specified Bases in Nucleic Acid Sequences: Recommendations 1984. Nucleic Acids Res. 1985, 13, 3021–3030. [Google Scholar] [CrossRef]
  116. Yu, C.; Hernandez, T.; Zheng, H.; Yau, S.C.; Huang, H.H.; He, R.L.; Yang, J.; Yau, S.S.T. Real Time Classification of Viruses in 12 Dimensions. PLoS ONE 2013, 8, e64328. [Google Scholar] [CrossRef]
  117. Blaisdell, B.E. Average Values of a Dissimilarity Measure Not Requiring Sequence Alignment Are Twice the Averages of Conventional Mismatch Counts Requiring Sequence Alignment for a Variety of Computer-Generated Model Systems. J. Mol. Evol. 1989, 29, 538–547. [Google Scholar] [CrossRef]
  118. Goldberg, Y. Neural Network Methods for Natural Language Processing. Synth. Lect. Hum. Lang. Technol. 2017, 10, 1–309. [Google Scholar] [CrossRef]
  119. Kaden, M.; Bohnsack, K.S.; Weber, M.; Kudła, M.; Gutowska, K.; Blazewicz, J.; Villmann, T. Learning Vector Quantization as an Interpretable Classifier for the Detection of SARS-CoV-2 Types Based on Their RNA Sequences. Neural Comput. Appl. 2021, 1–12. [Google Scholar] [CrossRef]
  120. Riley, P. Three pitfalls to avoid in machine learning. Nature 2019, 572, 27–29. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  121. Todd, A.K.; Johnston, M.; Neidle, S. Highly prevalent putative quadruplex sequence motifs in human DNA. Nucleic Acids Res. 2005, 33, 2901–2907. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  122. Csiszár, I. Information-type measures of differences of probability distributions and indirect observations. Studia Sci. Math. Hungaria 1967, 2, 299–318. [Google Scholar]
  123. Hnizdo, V.; Tan, J.; Killian, B.J.; Gilson, M.K. Efficient Calculation of Configurational Entropy from Molecular Simulations by Combining the Mutual-Information Expansion and Nearest-Neighbor Methods. J. Comput. Chem. 2008, 29, 1605–1614. [Google Scholar] [CrossRef] [Green Version]
  124. Kolekar, P.; Kale, M.; Kulkarni-Kale, U. Alignment-Free Distance Measure Based on Return Time Distribution for Sequence Analysis: Applications to Clustering, Molecular Phylogeny and Subtyping. Mol. Phylogenet. Evol. 2012, 65, 510–522. [Google Scholar] [CrossRef]
  125. Wei, D.; Jiang, Q.; Wei, Y.; Wang, S. A Novel Hierarchical Clustering Algorithm for Gene Sequences. BMC Bioinform. 2012, 13, 174. [Google Scholar] [CrossRef] [Green Version]
  126. Li, M.; Chen, X.; Li, X.; Ma, B.; Vitanyi, P. The Similarity Metric. IEEE Trans. Inf. Theory 2004, 50, 3250–3264. [Google Scholar] [CrossRef]
  127. Yin, C.; Chen, Y.; Yau, S.S.T. A Measure of DNA Sequence Similarity by Fourier Transform with Applications on Hierarchical Clustering. J. Theor. Biol. 2014, 359, 18–28. [Google Scholar] [CrossRef]
  128. Bao, J.; Yuan, R. A Wavelet-Based Feature Vector Model for DNA Clustering. Genet. Mol. Res. 2015, 14, 19163–19172. [Google Scholar] [CrossRef] [PubMed]
129. Berger, J.A.; Mitra, S.K.; Carli, M.; Neri, A. New Approaches to Genome Sequence Analysis Based on Digital Signal Processing. In Proceedings of the IEEE Workshop on Genomic Signal Processing and Statistics (GENSIPS), Raleigh, NC, USA, 12–13 October 2002; p. 4. [Google Scholar]
  130. Almeida, J.S.; Vinga, S. Universal Sequence Map (USM) of Arbitrary Discrete Sequences. BMC Bioinform. 2002, 3, 1–11. [Google Scholar] [CrossRef] [PubMed]
131. Vellido, A. The importance of interpretability and visualization in machine learning for applications in medicine and health care. Neural Comput. Appl. 2020, 32, 18069–18083. [Google Scholar] [CrossRef] [Green Version]
  132. Bittrich, S.; Kaden, M.; Leberecht, C.; Kaiser, F.; Villmann, T.; Labudde, D. Application of an Interpretable Classification Model on Early Folding Residues during Protein Folding. Biodata Min. 2019, 12, 1. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  133. Fischer, L.; Hammer, B.; Wersing, H. Efficient rejection strategies for prototype-based classification. Neurocomputing 2015, 169, 334–342. [Google Scholar] [CrossRef] [Green Version]
134. Villmann, A.; Kaden, M.; Saralajew, S.; Villmann, T. Probabilistic Learning Vector Quantization with Cross-Entropy for Probabilistic Class Assignments in Classification Learning. In Proceedings of the 17th International Conference on Artificial Intelligence and Soft Computing-ICAISC, Zakopane, Poland, 3–7 June 2018; Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J., Eds.; Springer International Publishing: Cham, Switzerland, 2018; LNCS 10841; pp. 736–749. [Google Scholar] [CrossRef]
  135. Saralajew, S.; Holdijk, L.; Rees, M.; Asan, E.; Villmann, T. Classification-by-Components: Probabilistic Modeling of Reasoning over a Set of Components. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; MIT Press: Cambridge, MA, USA, 2019; pp. 2788–2799. [Google Scholar]
Figure 1. Data insights, i.e., class-wise mean and standard deviation of the MIF variants for the quadruplex data set (G4 and non-G4 forming sequences). (a) Rényi MIF; (b) Rényi rMIF.
Figure 2. Classification insights, i.e., CCM and CIP of LiRaM-LVQ for the quadruplex data set. The color bars code the correlation values of the CCM. (a) Rényi MIF; (b) Rényi rMIF.
Figure 3. Classification insights, i.e., CCM and CIP of LiRaM-LVQ using the Shannon rMIF. The color bars code the correlation values of the CCM. (a) COVID data; (b) lncRNA vs. mRNA data.
Table 1. Overview of the used data sets.

| Data Set             | Classes | Sequences | Per Class * | Mean Length | Std. Length |
|----------------------|---------|-----------|-------------|-------------|-------------|
| Quadruplex detection | 2       | 368       | 175/193     | 62.1        | 43.7        |
| lncRNA vs. mRNA      | 2       | 20,000    | 10,000 each | 1197.3      | 710.8       |
| COVID types          | 3       | 156       | 44/90/22    | 29,862.9    | 34.1        |

* Sample size per class.
Table 2. Overview of the computed mutual information functions for biomolecular sequences X from Section 2.1 and Section 2.2.

|         | MIF | rMIF |
|---------|-----|------|
| Shannon | $F(X,\tau)=\sum_{x\in X}F(x,\tau)$ | $F(x,\tau)=\sum_{x(\tau)\in X}p_{x,x(\tau)}\cdot\log\frac{p_{x,x(\tau)}}{p(x)\cdot p(x(\tau))}$ |
| Rényi   | $F_{\alpha}^{R}(X,\tau)=\sum_{x\in X}F_{\alpha}^{R}(x,\tau)$ | $F_{\alpha}^{R}(x,\tau)=\sum_{x(\tau)\in X}\frac{p_{x,x(\tau)}^{\alpha}}{\left(p(x)\cdot p(x(\tau))\right)^{\alpha-1}}$ |
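To make the quantities in Table 2 concrete, the following Python sketch estimates the Shannon rMIF and MIF of a single nucleotide sequence for one correlation distance τ. It is a minimal illustration only: the symbol-pair frequency estimation (simple truncation at the sequence end, no pseudocounts), the function names, and the toy sequence are assumptions of this sketch rather than specifications taken from the article.

```python
from collections import Counter
from math import log

def resolved_mif(seq, tau, alphabet="ACGT"):
    """Sketch of the Shannon resolved MIF F(x, tau) for a single lag tau.

    Joint probabilities p_{x, x(tau)} are estimated as relative frequencies of
    symbol pairs separated by tau positions; marginals are obtained by summing
    the joint table (estimation details are assumptions, not from the paper).
    """
    pairs = [(seq[i], seq[i + tau]) for i in range(len(seq) - tau)]
    n = len(pairs)
    joint = {pair: c / n for pair, c in Counter(pairs).items()}
    p_first = {a: sum(p for (x, _), p in joint.items() if x == a) for a in alphabet}
    p_second = {a: sum(p for (_, y), p in joint.items() if y == a) for a in alphabet}

    # F(x, tau): contribution of each first symbol x, resolved over x(tau)
    f_resolved = {}
    for x in alphabet:
        f_resolved[x] = sum(
            joint[(x, y)] * log(joint[(x, y)] / (p_first[x] * p_second[y]))
            for y in alphabet
            if (x, y) in joint
        )
    return f_resolved

def mif(seq, tau):
    """Shannon MIF F(X, tau) as the sum of the resolved contributions."""
    return sum(resolved_mif(seq, tau).values())

# Toy usage: a fingerprint over lags 1..5 for a short illustrative sequence
sequence = "ACGTACGTGGCCAATTACGT"
fingerprint = [mif(sequence, t) for t in range(1, 6)]
```

Concatenating the values over a range of lags (and, for the resolved variant, over the alphabet symbols) yields the fixed-length feature vectors that serve as sequence fingerprints for the classifier; the Rényi variants follow analogously from the second row of Table 2.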
Table 3. Achieved test accuracies ± standard deviation by LiRaM-LVQ in percent and the respective parameter setting.

| Data Set             | NV                     | MIF (Shannon)           | MIF (Rényi)             | rMIF (Shannon)          | rMIF (Rényi)            |
|----------------------|------------------------|-------------------------|-------------------------|-------------------------|-------------------------|
| Quadruplex detection | 78.8 ± 1.0 (j_max = 4) | 68.9 ± 1.2 (τ_max = 7)  | 68.2 ± 1.7 (τ_max = 7)  | 77.4 ± 1.3 (τ_max = 8)  | 82.0 ± 1.0 (τ_max = 7)  |
| lncRNA vs. mRNA      | 71.9 ± 0.1 (j_max = 7) | 75.4 ± 0.2 (τ_max = 100)| 75.5 ± 0.3 (τ_max = 100)| 81.4 ± 0.1 (τ_max = 100)| 76.3 ± 0.6 (τ_max = 100)|
| COVID types          | 86.0 ± 1.2 (j_max = 5) | 98.1 ± 0.6 (τ_max = 50) | 97.4 ± 1.0 (τ_max = 50) | 99.7 ± 0.3 (τ_max = 50) | 99.3 ± 0.5 (τ_max = 50) |