The Fisher–Rao Distance between Multivariate Normal Distributions: Special Cases, Bounds and Applications

Pinele, Julianna; Strapasson, João E.; Costa, Sueli I. R.

doi:10.3390/e22040404

Open AccessArticle

The Fisher–Rao Distance between Multivariate Normal Distributions: Special Cases, Bounds and Applications

by

Julianna Pinele

^1,*

,

João E. Strapasson

²

and

Sueli I. R. Costa

³

¹

Center of Exact and Technological Sciences, University of Reconcavo of Bahia, Cruz das Almas 44380-000, Brazil

²

School of Applied Sciences, University of Campinas, Limeira 13484-350, Brazil

³

Institute of Mathematics, University of Campinas, Campinas 13083-859, Brazil

^*

Author to whom correspondence should be addressed.

Entropy 2020, 22(4), 404; https://0-doi-org.brum.beds.ac.uk/10.3390/e22040404

Submission received: 26 January 2020 / Revised: 6 March 2020 / Accepted: 11 March 2020 / Published: 1 April 2020

Download

Browse Figures

Versions Notes

Abstract

:

The Fisher–Rao distance is a measure of dissimilarity between probability distributions, which, under certain regularity conditions of the statistical model, is up to a scaling factor the unique Riemannian metric invariant under Markov morphisms. It is related to the Shannon entropy and has been used to enlarge the perspective of analysis in a wide variety of domains such as image processing, radar systems, and morphological classification. Here, we approach this metric considered in the statistical model of normal multivariate probability distributions, for which there is not an explicit expression in general, by gathering known results (closed forms for submanifolds and bounds) and derive expressions for the distance between distributions with the same covariance matrix and between distributions with mirrored covariance matrices. An application of the Fisher–Rao distance to the simplification of Gaussian mixtures using the hierarchical clustering algorithm is also presented.

Keywords:

information geometry; Fisher–Rao distance; multivariate normal distributions; Gaussian mixture simplification

1. Introduction

A proper measure to determine the dissimilarity between probability distributions has been approached in many problems and applications. The Fisher–Rao distance is a very special metric for statistical models of probability distributions. This distance is invariant by reparametrization of the sample space and covariant by reparameterization of the parameter space [1]. Moreover, the Fisher–Rao metric is preserved under Markov morphisms and under centain conditions it is, up to a scaling factor, the unique Riemannian metric satisfying this condition [2,3]. Markov morphisms are associated with the notion of statistical sufficiency which express the criterion of passing from one statistical model to another with no loss of information [4,5,6]. Therefore it is natural to require the invariance of the geometric structures of statistical models under Markov morphisms. Between finite sample size simplex model

S_{k - 1} = {p \in R^{k}; p_{i} \geq 0 and \sum_{i = 1}^{k} p_{i} = 1}

, a Markov morphism is a linear map

T_{Q} (x) = x Q

, where

Q \in R^{n \times l}

, with

n \leq l

, is a matrix with non-negative entries such that every row sums to 1 and every column has precisely one non-zero element. The mapping

T_{Q}

corresponds to probabilistic refining of the event space

{1, \dots, n} \to {1, \dots, l}

where the refinement

i \to j

occurs with probability

Q_{i j}

[7]. Chentsov [8,9] has proved the Fisher–Rao uniqueness invariance property under Markov morphisms for the finite sample spaces. The extension of this result to more general statistical models requires careful formulations of statistical sufficiency and Markov morphisms and has been evolved since then [3,10]. More recently in [5,6] it is shown this uniqueness of the Fisher–Rao metric under an assumption of strong continuity of the information metric.

After previous papers [11,12,13] connecting geometry and statistics, C. R.Rao in an independent landmark paper [14] considered statistical models with the metric induced by the information matrix defined by R. Fisher in 1921 [15]. This work encouraged several authors to calculate the Fisher–Rao metric distance between other probability distributions [16,17,18] as well as stimulated approaches to other dissimilarity measures such as Kullback-Leibler divergence [19], total variation and Wasserstein distances [20]. Amari [3,4,21] unified the information geometry theory by organizing and introducing other concepts regarding statistical models [2].

An explicit form for the Fisher–Rao distance in the univariate normal distribution space is known via an association with the classical model of the hyperbolic plane [14,16,18,22]. It was applied to quantization of hyperspectral images [23] and to the space of projected lines in the paracatadioptric images [24]. This Fisher–Rao model was used to simplify Gaussian mixtures through the k-means method [25] and a hierarchical clustering technique [26].

An expression for the geodesic curve (initial value problem) in the multivariate normal distributions space was derived in [27] and in [28]. However, the calculus of the Fisher–Rao distance requires solving non-trivial differential equations under boundary conditions to find the geodesic connecting two distributions and then to calculate the integral along the geodesic. A closed form for this distance in the general case is still an open problem. Expressions for the distance are known only in special cases [16,17,18].

The Fisher–Rao distance between multivariate normal distributions in specific cases, such as distributions with a common mean, was considered in diffusion tensor image analysis [29,30,31], in color texture discrimination in several classification experiments [32], in the problem of distributed estimation fusion with unknown correlations [33], and in the machine learning technique [34]. In [35,36], the authors described shapes representing landmarks by a Gaussian model with diagonal covariance matrices and used the Fisher–Rao distance to quantify the difference between two shapes. In [17], this model was applied to statistical inference. Bounds for the Fisher–Rao distance were used to track quality monitoring [37].

This paper is organized as follows. In Section 2, we gather known results (closed forms for special cases and bounds) for the Fisher–Rao distance between multivariate normal distributions. In Section 3, we describe a closed form for the Fisher–Rao distance between distributions with the same covariance matrix and a non-linear system to find the distance between distributions with mirrored covariance matrices. An application of the Fisher–Rao distance to the simplification of Gaussian mixtures using the hierarchical clustering algorithm is presented in Section 4. Some conclusions and perspectives are drawn in Section 5.

2. The Fisher–Rao Distance in the Multivariate Normal Distribution Space: Special Submanifolds and Bounds

In this section, as in [38], we summarize previous results regarding the Fisher–Rao distance in the space of multivariate normal distributions including closed forms for this distance restricted to submanifolds and general bounds.

Given a statistical model

S = {p_{θ} = p (x; θ); θ = (θ_{1}, θ_{2}, \dots, θ_{k}) \in Θ \subset R^{k}}

, a natural Riemannian structure [21] can be provided by the Fisher information matrix

G (θ) = [g_{i j} (θ)]

:

\begin{matrix} g_{i j} (θ) = & E_{θ} (\frac{\partial}{\partial θ_{i}} log p (x; θ) \frac{\partial}{\partial θ_{j}} log p (x; θ)) \end{matrix}

(1)

\begin{matrix} = & \int \frac{\partial}{\partial θ_{i}} log p (x; θ) \frac{\partial}{\partial θ_{j}} log p (x; θ) p (x; θ) d x, \end{matrix}

(2)

where

E_{θ}

is the expected value with respect to the distribution

p_{θ}

. This matrix can also be viewed as the Hessian matrix of the Shannon entropy (concave function) [39],

H (p) = - \int p (x; θ) log p (x; θ) d x,

(3)

and is used to establish connections between inequalities in information theory and geometrical inequalities.

The Fisher–Rao distance,

d_{F} (\cdot, \cdot)

, between two distributions

p_{θ_{1}}

and

p_{θ_{2}}

in

S

, identified with their parameters

θ_{1}

and

θ_{2}

, is given by the shortest length of a curve

γ (t)

in the parameter space

Θ

connecting these distributions,

d_{F} (p_{θ_{1}}, p_{θ_{2}}) \equiv d_{F} (θ_{1}, θ_{2}) = min_{γ} \int {| γ^{'} (t) |}_{G} d t

, where

| γ^{'} {(t) |}_{G} = \sqrt{γ^{'} {(t)}^{t} G (θ) γ (t)}

. Note that this is in fact a metric, since for any

θ_{1}

,

θ_{2}

, and

θ_{3}

in

Θ

, we have: (i)

d_{F} (θ_{1}, θ_{2}) \geq 0

and

d_{F} (θ_{1}, θ_{2}) = 0

if only if

θ_{1} = θ_{2}

; (ii)

d_{F} (θ_{1}, θ_{2}) = d_{F} (θ_{2}, θ_{1})

; (iii)

d_{F} (θ_{1}, θ_{2}) \leq d_{F} (θ_{1}, θ_{3}) + d_{F} (θ_{3}, θ_{2})

. A curve that provides the shortest length is called a geodesic and is given by the solutions of the differential equations

\frac{d^{2} θ_{m}}{d t^{2}} + \sum_{i, j} Γ_{i j}^{m} \frac{d θ_{i}}{d t} \frac{d θ_{j}}{d t} = 0, m = 1, \dots, k,

(4)

where

Γ_{i j}^{m}

are the Christoffel symbols,

Γ_{i j}^{m} = \frac{1}{2} \sum_{l} (\frac{\partial g_{j l}}{\partial θ_{i}} + \frac{\partial g_{l i}}{\partial θ_{j}} - \frac{\partial g_{i j}}{\partial θ_{l}}) g^{l m}

(5)

and

[g^{i j}]

is the inverse matrix of the Fisher information matrix.

We consider here the space of the multivariate normal distributions given by:

p (x; μ, Σ) = \frac{{(2 π)}^{- (\frac{n}{2})}}{\sqrt{D e t (Σ)}} exp (- \frac{{(x - μ)}^{t} Σ^{- 1} (x - μ)}{2}),

(6)

where

x^{t} = (x_{1}, \dots, x_{n}) \in R^{n}

is the variable vector,

μ^{t} = (μ_{1}, \dots, μ_{n}) \in R^{n}

is the mean vector, and

Σ

is the covariance matrix in

P_{n} (R)

, the space of order n positive definite symmetric matrices.

In this case, the model

S = M = {p_{θ}; θ = (μ, Σ) \in R^{n} \times P_{n} (R)}

is a statistical

(n + \frac{n (n + 1)}{2})

-dimensional manifold.

In this case, the model

S = M = {p_{θ}; θ = (μ, Σ) \in R^{n} \times P_{n} (R)}

is a statistical manifold of dimension

k = (n + \frac{n (n + 1)}{2})

. Considering a parametrization

(μ, Σ) = ϕ (θ_{1}, \dots, θ_{k})

of the model

M

, the Fisher information matrix is given by [40]

g_{i j} (θ) = \frac{\partial μ^{t}}{\partial θ_{i}} Σ^{- 1} \frac{\partial μ}{\partial θ_{j}} + \frac{1}{2} tr (Σ^{- 1} \frac{\partial Σ}{\partial θ_{i}} Σ^{- 1} \frac{\partial Σ}{\partial θ_{i}}) .

(7)

The metric provided by this matrix is invariant with respect to affine transformations. In other words, for any

(c, Q) \in R^{n} \times G L_{n} (R)

, where

G L_{n} (R)

is the group of non-singular n-square matrices, the mapping:

\begin{matrix} ψ_{(c, Q)} : & M & \to & M \\ (μ, Σ) & \mapsto & (Q μ + c, Q Σ Q^{t}), \end{matrix}

(8)

is an isometry in

M

[16]. Consequently, the Fisher–Rao distance between

θ_{1} = (μ_{1}, Σ_{1})

and

θ_{2} = (μ_{2}, Σ_{2})

in

M

satisfies:

d_{F} (θ_{1}, θ_{2}) = d_{F} ((Q μ_{1} + c, Q Σ_{1} Q^{t}), (Q μ_{2} + c, Q Σ_{2} Q^{t}))

(9)

for any

(c, Q) \in R^{n} \times G L_{n} (R)

. In particular, for

Q = Σ_{1}^{- (1 / 2)}

and

c = - Σ_{1}^{(- 1 / 2)} μ_{1}

,

θ_{3} = (μ_{3}, Σ_{3}) = (Σ_{1}^{- (1 / 2)} (μ_{2} - μ_{1}), Σ_{1}^{- (1 / 2)} Σ_{2} Σ_{1}^{- (1 / 2)})

, the Fisher–Rao distance admits the form:

d_{F} (θ_{1}, θ_{2}) = d_{F} (θ_{0}, θ_{3}),

(10)

where

θ_{0} = (0, I_{n})

,

I_{n}

is the n-order identity matrix, and

0 \in R^{n}

is the null vector.

The geodesic equations in

M

can be expressed as [17]:

\{\begin{matrix} \frac{d^{2} μ}{d t^{2}} - (\frac{d Σ}{d t}) Σ^{- 1} (\frac{d μ}{d t}) = 0 \\ \frac{d^{2} Σ}{d t^{2}} + (\frac{d μ}{d t}) {(\frac{d μ}{d t})}^{t} - (\frac{d Σ}{d t}) Σ^{- 1} (\frac{d Σ}{d t}) = 0 . \end{matrix}

(11)

and could be partially integrated [27]:

\{\begin{matrix} \frac{d μ}{d t} = Σ x \\ \frac{d Σ}{d t} = Σ (B - x^{t} μ), \end{matrix}

(12)

\{\begin{matrix} \frac{d Δ}{d t} = - B Δ + x δ^{t} \\ \frac{d δ}{d t} = - B δ + (1 + δ Δ^{- 1} δ) x \end{matrix} .

(13)

where

(δ (t), Δ (t)) = (Σ^{- 1} (t) μ (t), Σ^{- 1} (t))

,

x \in R^{n}

, and B is a symmetric matrix. The initial conditions for this problem can be taken as:

\{\begin{matrix} (δ (0), Δ (0)) = (0, I_{n}) \\ (\frac{d δ}{d t} (0), \frac{d Δ}{d t} (0)) = (x, - B) . \end{matrix}

(14)

Eriksen [27] and Calvo and Oller [28], in independent works, solved this initial value problem. An explicit solution to the geodesic curve in

M

[28] is:

\{\begin{matrix} δ (t) = & - B (cosh (t G) - I_{n}) {(G^{-})}^{2} x + sinh (t G) G^{-} x \\ Δ (t) = & I_{n} + \frac{1}{2} (cosh (t G) - I_{n}) + \frac{1}{2} B (cosh (t G) - I_{n}) {(G^{-})}^{2} B \\ - \frac{1}{2} sinh (t G) G^{-} B - \frac{1}{2} B sinh (t G) G^{-} \end{matrix},

(15)

where

I_{n}

is an n-order identity matrix,

G^{2} = B^{2} + 2 x x^{t}

, and

G^{-}

is the generalized inverse square matrix of G, that is

G G^{-} G = G

.

Due the fact that the geodesic curve has constant velocity at any point, given

(x, B)

in the tangent space of

M

, the Fisher–Rao distance between

(0, I_{n})

and

(δ (1), Δ (1))

is:

\int_{0}^{1} \sqrt{\frac{d μ}{d t} (0) Σ^{- 1} (0) \frac{d μ}{d t} (0) + \frac{1}{2} tr [{(Σ^{- 1} (0) \frac{Σ}{d t} (0))}^{2}]} d t = \sqrt{\frac{1}{2} tr (B^{2}) + {| x |}^{2}},

(16)

where

| \cdot |

is the standard Euclidean norm. Note that the above expression provides the Fisher–Rao distance between two distributions only if we can determine the initial value problem from the boundary conditions, which usually is very difficult.

Han and Park in [31] presented a numerical shooting method for computing the minimum geodesic distance between two normal distributions, through parallel transport of a vector field defined along the geodesic curve given in Equation (15).

A closed form for the Fisher–Rao distance between two normal distributions in

M

is still an important open question. Next, we present closed forms for this distance in some submanifolds of

M

.

2.1. Closed Forms for the Fisher–Rao Distance in Submanifolds of $M$

In this subsection, we consider submanifolds

M_{*} \subset M

with the distance induced by the Fisher–Rao metric in

M

. It is important to remark that, in general, given two distributions

θ_{1}

and

θ_{2}

in

M_{*}

, the distance between

θ_{1}

and

θ_{2}

when restricted to a submanifold

M_{*}

is bigger than the distance between

θ_{1}

and

θ_{2}

in

M

, that is

d_{M_{*}} (θ_{1}, θ_{2}) \geq d_{M} (θ_{1}, θ_{2})

. This is due to the fact that to get

d_{M_{*}}

, we consider the minimum length of restricted curves, which are the ones contained in the submanifold

M_{*}

. We say that

M_{*}

is totally geodesic if only if

d_{M_{*}} (θ_{1}, θ_{2}) = d_{M} (θ_{1}, θ_{2})

, for any

θ_{1}, θ_{2} \in M_{*}

, which means that the geodesic in

M

connecting

θ_{1}

and

θ_{2}

is contained in

M_{*}

.

2.1.1. The Submanifold $M_{Σ}$ Where $Σ$ Is Constant

In the n-dimensional manifold composed by multivariate normal distributions with common covariance matrix

Σ

,

M_{Σ} = {p_{θ}; θ = (μ, Σ), Σ = Σ_{0} \in P_{n} (R) constant}

, the Fisher–Rao distance between two distributions

θ_{1} = (μ_{1}, Σ_{0})

and

θ_{2} = (μ_{2}, Σ_{0})

is [18]:

d_{Σ} (θ_{1}, θ_{2}) = \sqrt{{(μ_{1} - μ_{2})}^{t} Σ_{0}^{- 1} (μ_{1} - μ_{2})} .

(17)

This distance is equal to the Mahalanobis distance [11], which is equal to the Euclidean distance between the image of

μ_{1}

and

μ_{2}

under the transformation

μ \mapsto P^{- 1} μ

, where

Σ_{0} = P P^{t}

is the Cholesky decomposition [18]. This distance was one of the first dissimilarity measures between datasets with some correlation. Note that this submanifold is not totally geodesic, as it can be seen even in the space of univariate normal distributions [22] and in Example 1 in the next section.

A geodesic curve

γ_{Σ} (t)

in

M_{Σ}

connecting

θ_{1}

and

θ_{2}

can be provided by:

γ_{Σ} (t) = ((1 - t) μ_{1} - t μ_{2}, Σ_{0}) .

(18)

2.1.2. The Submanifold $M_{μ}$ Where $μ$ Is Constant

A totally geodesic submanifold of

M

is given by

M_{μ} = {p_{θ}; θ = (μ, Σ), μ = μ_{0} \in R^{n} constant}

of dimension

\frac{n (n + 1)}{2}

composed by distributions that have the same mean vector

μ_{0}

. The Fisher–Rao distance in

M_{μ}

was studied by several authors in different contexts [16,18,30,41] and for

θ_{1} = (μ_{0}, Σ_{1})

and

θ_{2} = (μ_{0}, Σ_{2})

is given by:

d_{F} (θ_{1}, θ_{2}) = \sqrt{\frac{1}{2} \sum_{i = 1}^{n} {[log (λ_{i})]}^{2}},

(19)

where

0 < λ_{1} \leq λ_{2} \leq \dots \leq λ_{n}

are the eigenvalues of

Σ_{1}^{- 1 / 2} Σ_{2} Σ_{1}^{- 1 / 2}

.

An expression for the geodesic curve connecting these two distributions is [30]:

γ_{μ} (t) = (μ_{0}, Σ_{1}^{1 / 2} exp (t log (Σ_{1}^{- 1 / 2} Σ_{2} Σ_{1}^{- 1 / 2})) Σ_{1}^{1 / 2}) .

(20)

2.1.3. The Submanifold $M_{D}$ Where $Σ$ Is Diagonal

Let

M_{D} = {p_{θ}; θ = (μ, Σ), Σ = diag (σ_{1}^{2}, σ_{2}^{2}, \dots, σ_{n}^{2}), σ_{i} > 0, i = 1, \dots, n}

, the submanifold of

M

composed by distributions with a diagonal covariance matrix. If we consider the parameter

θ = (μ_{1}, σ_{1}, μ_{2}, σ_{2}, \dots, μ_{n}, σ_{n})

, it can be shown [22] that the metric in the parametric space of

M_{D}

is equal to the product metric:

d_{D} (θ_{1}, θ_{2}) = \sqrt{\sum_{i = 1}^{n} d_{F *}^{2} ((μ_{1 i}, σ_{1 i}), (μ_{2 i}, σ_{2 i}))},

(21)

where

d_{F *}

is the Fisher–Rao distance in the univariate case given by [22]:

d_{F *} ((μ_{1}, σ_{1}), (μ_{2}, σ_{2})) = \sqrt{2} log \frac{|(\frac{μ_{1}}{\sqrt{2}}, σ_{1}) - (\frac{μ_{2}}{\sqrt{2}}, - σ_{2})| + |(\frac{μ_{1}}{\sqrt{2}}, σ_{1}) - (\frac{μ_{2}}{\sqrt{2}}, σ_{2})|}{|(\frac{μ_{1}}{\sqrt{2}}, σ_{1}) - (\frac{μ_{2}}{\sqrt{2}}, - σ_{2})| - |(\frac{μ_{1}}{\sqrt{2}}, σ_{1}) - (\frac{μ_{2}}{\sqrt{2}}, σ_{2})|} .

(22)

In this space, a curve

γ_{D} (t) = (γ_{1} (t), \dots, γ_{n} (t))

is a geodesic if, and only if,

γ_{i} (t)

is a geodesic curve in the univariate case, for all

i = 1, \dots, n

. The geodesic curves in the univariate normal distributions space (upper half plane

R \times R^{+}

) are half-vertical lines and half-ellipses centered at

σ = 0

, with eccentricity

\frac{1}{\sqrt{2}}

[22].

It is important to note that

M_{D} \subset M

is not totally geodesic. The submanifold of

M_{D}

composed only by normal distributions with covariance matrices which are multiples of the identity (round normals) is totally geodesic [22]. In fact, this submanifold of round normals is also contained in the totally geodesic submanifold described next.

2.1.4. The Submanifold $M_{D μ}$ Where $Σ$ Is Diagonal and $μ$ Is an Eigenvector of $Σ$

Let

M_{D μ}

be the

n + 1

-dimensional submanifold composed by distributions with the mean vector

μ = μ_{1} e_{i}

for some

e_{i} \in {e_{1}, \dots, e_{n}}

(the canonical basis of

R^{n}

) and diagonal covariance matrix

Σ

, and without loss of generality, we shall assume that

e_{i} = e_{1}

. An analytic expression for the distance in

M_{D μ}

is:

d_{D μ}^{2} (θ_{1}, θ_{2}) = d_{F *}^{2} ((μ_{11}, σ_{11}), (μ_{21}, σ_{21})) + \sum_{i = 2}^{n} d_{F *}^{2} ((0, σ_{1 i}), (0, σ_{2 i})) .

(23)

We proved in [42] that this submanifold is totally geodesic.

2.2. Bounds for the Fisher–Rao in $M$

As mentioned, a closed form for the Fisher–Rao distance between two general normal distributions is not known. In this subsection, we present some bounds for this distance.

2.2.1. A Lower Bound

Calvo and Oller [43] derived a lower bound for the Fisher–Rao distance through an isometric embedding of the parametric space

M

into the manifold of the positive definite matrices.

Proposition 1.

[43] Given

θ_{1} = (μ_{1}, Σ_{1})

and

θ_{2} = (μ_{2}, Σ_{2})

, let:

S_{i} = (\begin{matrix} Σ_{i} + μ_{i} μ_{i}^{t} & μ_{i}^{t} \\ μ_{i} & 1 \end{matrix}),

(24)

i = 1, 2

. A lower bound for the distance between

θ_{1}

and

θ_{2}

is:

L B (θ_{1}, θ_{2}) = \sqrt{\frac{1}{2} \sum_{i = 1}^{n + 1} {[log (λ_{i})]}^{2}},

(25)

where

λ_{i}

,

1 \leq i \leq n + 1

, are the eigenvalues of

S_{1}^{- 1 / 2} S_{2} S_{1}^{- 1 / 2}

.

We note that this bound satisfies the distance proprieties in

M

. In [44], through a similar approach, a lower bound for the Fisher–Rao distance was obtained in the more general space of elliptical distributions, restricted to normal distributions, is the above bound.

2.2.2. The Upper Bound $U B_{1}$

In [45], we proposed an upper bound based on an isometry (8) in the manifold

M

and on the distance in the non-totally geodesic submanifold

M_{D}

(21), as follows:

Proposition 2.

[45] The Fisher–Rao distance between two multivariate normal distributions

θ_{1} = (μ_{1}, Σ_{1})

and

θ_{2} = (μ_{2}, Σ_{2})

is upper bounded by,

U B_{1} (θ_{1}, θ_{2}) = \sqrt{\sum_{i = 1}^{n} d_{F *}^{2} ((0, 1), (μ_{i}, λ_{i}))},

(26)

where

λ_{i}

are the diagonal terms of the matrix Λ given by the eigenvalues of

A = Σ_{1}^{- (1 / 2)} Σ_{2} Σ_{1}^{- (1 / 2)} = Q Λ Q^{t}

,

μ_{i}

are the coordinates of

μ = Q^{t} Σ_{1}^{- (1 / 2)} (μ_{2} - μ_{1})

, Q is the orthogonal matrix whose columns are the eigenvectors of A and

d_{F *}

is the Fisher–Rao distance between univariate normal distributions given in Equation (22).

2.2.3. The Upper Bounds $U B_{2}$ and $U B_{3}$

Considering the Fisher–Rao distance in the totally geodesic submanifold

M_{D μ}

and the triangular inequality, we propose another upper bound [42].

Given

θ_{1} = (μ_{1}, Σ_{1})

and

θ_{2} = (μ_{2}, Σ_{2})

, we consider the Fisher–Rao distance between

θ_{0} = (0, I_{n})

and

θ_{3} = (μ_{3}, Σ_{3})

as in Equation (10). Let

\bar{θ} = (\bar{μ}, \bar{Σ})

; by the triangular inequality, it follows that:

d_{F} (θ_{0}, θ_{3}) \leq d_{F} (θ_{0}, \bar{θ}) + d_{F} (\bar{θ}, θ_{3}) .

(27)

To calculate this bound, we choose

\bar{θ}

appropriately. For

\bar{μ} = μ_{3}

, note that

d_{F} (\bar{θ}, θ_{3}) = d_{μ} (\bar{θ}, θ_{3})

. Let P be an orthogonal matrix such that

P μ = (| μ_{3} |, 0, \dots, 0)

and

D = diag (d_{1}^{2}, d_{2}^{2}, \dots, d_{n}^{2})

a diagonal matrix. We will consider

\bar{Σ} = P^{- 1} D P^{- t}

and

θ_{P} = (P μ, D)

. By the isometry

ψ_{(c, Q)}

, given in Equation (9), for

Q = P^{- 1}

and

c = 0

, it follows:

d_{F} (θ_{0}, \bar{θ}) = d_{D μ} (θ_{0}, θ_{P}) .

(28)

Then, combining Inequality (27) and Equation (28), the left side of the equation below is an upper bound for the Fisher–Rao distance between

θ_{1}

and

θ_{2}

,

d_{F} (θ_{0}, θ_{3}) \leq d_{D μ} (θ_{0}, θ_{P}) + d_{μ} (\bar{θ}, θ_{3}) .

(29)

In [42], we derived the upper bound:

U B_{2} = d_{D μ} (θ_{0}, θ_{P}) + d_{μ} (\bar{θ}, θ_{3}) .

(30)

through a numerical minimization process by considering the diagonal elements of D as a vector that minimizes

d_{D μ} (θ_{0}, θ_{P}) + d_{μ} (\bar{θ}, θ_{3})

,

({\bar{d}}_{1}, {\bar{d}}_{2}, \dots, {\bar{d}}_{n}) = min_{(d_{1}, d_{2}, \dots, d_{n})} {d_{D μ} (θ_{0}, θ_{P}) + d_{μ} (\bar{θ}, θ_{3})} .

(31)

We also derive an analytic upper bound

U B_{3}

by minimizing of the distance

d_{D μ} (θ_{0}, θ_{P})

. By expressing this distance in terms of the parameters

(θ_{0}, θ_{P}) = ((0, I_{n}), (P μ, D))

, we can show that it reaches the minimum at:

D = diag (\sqrt{\frac{| μ_{3} | + 2}{2}}, 1, \dots, 1) .

(32)

The lower bound of Section 2.2.1 and the upper bounds of Section 2.2.2 and Section 2.2.3 are summarized in Table 1.

Upper and lower bounds have been used to estimate the Fisher–Rao distance in applications such as [37].

2.2.4. Comparisons of the Bounds

In this section, as in [42], we illustrate comparisons between the bounds presented previously.

We consider the bivariate normal distributions model (

n = 2

) and distributions

θ_{0}

and

\hat{θ} = (\hat{μ}, \hat{Σ})

, where:

\hat{θ} = (\hat{μ}, \hat{Σ}) = ((\begin{matrix} μ \\ 0 \end{matrix}), (\begin{matrix} cos (α) & sin (α) \\ - sin (α) & cos (α) \end{matrix}) (\begin{matrix} λ_{1} & 0 \\ 0 & λ_{2} \end{matrix}) (\begin{matrix} cos (α) & - sin (α) \\ sin (α) & cos (α) \end{matrix})) .

(33)

From (10), we can see that there always exists an isometry that converts any two pairs of bivariate distributions into a pair of distributions as above.

We present next a comparison between the lower bound “LB” (25), the upper bounds

U B_{1}

(26),

U B_{2}

(31), and

U B_{3}

(32), and the numerical solution given by the geodesic shooting algorithm (GS) [31] in specific situations.

In Figure 1, we consider the eigenvalues

λ_{1} = 2

,

λ_{2} = 0.5

, and

μ = 1

to be fixed and

α

varying from zero to

\frac{π}{2}

. We note that the upper bound

U B_{1}

is very near the lower bound

L B

and to the numerical solution

G S

. The other upper bounds are bigger than the bound

U B_{1}

. In Figure 2, it is considered

μ = 10

and the previous eigenvalues. Now, the best performance is of bounds

U B_{2}

and

U B_{3}

, which are similar. In Figure 3a, we again keep the eigenvalues; the rotation angle is fixed

α = \frac{π}{4}

; and

μ

varies from zero to 10. We can see similar performances of

U B_{2}

and

U B_{3}

, which are better than

U B_{1}

for larger values of

μ

.

We may also consider the upper bound:

U B_{123} (θ_{1}, θ_{2}) = min {U B_{1} (θ_{1}, θ_{2}), U B_{2} (θ_{1}, θ_{2}), U B_{3} (θ_{1}, θ_{2})} .

(34)

Figure 3b displays the comparison between

L B

,

U B_{123}

, and

G S

for the same data of the comparison in Figure 3a.

3. Fisher–Rao Distance Between Special Distributions

In this section, we describe the Fisher–Rao distance in the full space

M

between special kinds of distributions.

3.1. The Fisher–Rao Distance Between Distributions with Common Covariance Matrices

The Fisher–Rao distance between distributions with common covariance matrices given in Section 2.1.1 was restricted to non-totally geodesic submanifold

M_{Σ}

. We show next that using the isometry given in (8) and the distance in the submanifold

M_{D μ}

, it is possible to find a closed form for the distance between two distributions with the same covariance matrix, in the full manifold

M

.

Proposition 3.

Given two distributions

θ_{1} = (μ_{1}, Σ)

and

θ_{2} = (μ_{2}, Σ)

in

M

, let P be an orthogonal matrix such that

P (μ_{2} - μ_{1}) = | μ_{2} - μ_{1} | e_{1}

, and consider the decomposition

U D U^{t}

of the matrix

P Σ P^{t}

,

P Σ P^{t} = U D U^{t},

(35)

where U is an upper triangular matrix with all diagonal entries equal to one and D is a diagonal matrix. The Fisher–Rao distance between

θ_{1}

and

θ_{2}

is given by:

d_{F} (θ_{1}, θ_{2}) = d_{D μ} ((0, D), (| μ_{2} - μ_{1} | e_{1}, D)) .

(36)

Proof.

By considering the isometries

ψ = ψ_{(- P μ_{1}, P)}

and

\hat{ψ} = ψ_{(0, U^{- 1})}

and the decomposition given by Equation (35), it follows from Equation (9) that:

\begin{matrix} d_{F} (θ_{1}, θ_{2}) & = d_{F} (ψ_{(- P μ_{1}, P)} (θ_{1}), ψ_{(- P μ_{1}, P)} (θ_{2})) \\ = d_{F} ((P μ_{1} - P μ_{1}, P Σ P^{t}), (P μ_{2} - P μ_{1}, P Σ P^{t})) \\ = d_{F} ((0, P Σ P^{t}), (| μ_{2} - μ_{1} | e_{1}, P Σ P^{t})) . \\ = d_{F} (\hat{ψ} (0, P Σ P^{t}), \hat{ψ} (| μ_{2} - μ_{1} | e_{1}, P Σ P^{t})) \\ = d_{F} ((U^{- 1} 0, U^{- 1} P Σ P^{t} U^{- t}), (| μ_{2} - μ_{1} | U^{- 1} e_{1}, U^{- 1} P Σ P^{t} U^{- t})) \\ = d_{F} ((0, D), (| μ_{2} - μ_{1} | e_{1}, D)) . \end{matrix}

(37)

Since the distributions

(0, D)

and

(| μ_{2} - μ_{1} | e_{1}, D)

belong to the submanifold

M_{D μ}

, we conclude that:

d_{F} (θ_{1}, θ_{2}) = d_{D μ} ((0, D), (| μ_{2} - μ_{1} | e_{1}, D)) .

(38)

□

Example 1.

Consider two bivariate normal distributions

θ_{1} = ({(- 1, 0)}^{t}, Σ)

and

θ_{2} = ({(6, 3)}^{t}, Σ)

with the same covariance matrix:

Σ = (\begin{matrix} 1.1 & 0.9 \\ 0.9 & 1.1 \end{matrix}) .

Figure 4a illustrates the normal distributions in the geodesic curve connecting connecting

θ_{1}

and

θ_{2}

in

M

, and Figure 4b illustrates the geodesic in the submanifold

M_{Σ}

. We observe that in

M

, the shape of the ellipses (contour curves) changes along the path. Furthermore, the Fisher–Rao distance between

θ_{1}

and

θ_{2}

is

d_{F} (θ_{1}, θ_{2}) = 5.00648

, which is less than the Mahalanobis distance given in Equation (17),

d_{Σ} (θ_{1}, θ_{2}) = 8.06226

, as expected, since the submanifold

M_{Σ}

is not totally geodesic.

3.2. The Fisher–Rao Distance Between Mirrored Distributions

We consider here two mirrored normal distributions; that is, without loss of generality, if we consider up rotation, the line connecting

μ_{1}

and

μ_{2}

as parallel to the

e_{1}

-axis, and the covariance matrices

Σ_{1}

and

Σ_{2}

satisfying:

Σ_{2} = M_{1} Σ_{1} M_{1}, where M_{1} = (\begin{matrix} - 1 & 0 \\ 0 & I_{n - 1} \end{matrix}) .

(39)

This condition implies also the same eigenvalues for both matrices.

For bivariate normal distributions, we then should have:

θ_{1} = ((\begin{matrix} μ_{1} \\ μ_{0} \end{matrix}), (\begin{matrix} σ_{11} & σ_{12} \\ σ_{12} & σ_{22} \end{matrix})) e θ_{2} = ((\begin{matrix} μ_{2} \\ μ_{0} \end{matrix}) (\begin{matrix} σ_{11} & - σ_{12} \\ - σ_{12} & σ_{22} \end{matrix}));

(40)

see Figure 5.

After several experiments using the algorithm geodesic shooting for the

θ_{1}

and

θ_{2}

, we have observed that for

t = 0

the geodesic curve connecting these distributions (

γ (t) = (μ (t), Σ (t))

, with

γ (- 1) = θ_{1}

and

γ (1) = θ_{2}

), satisfies

γ (0) \approx θ_{1 / 2} = (μ_{1 / 2}, Σ_{1 / 2}) = ((\begin{matrix} \frac{μ_{1} + μ_{2}}{2} \\ η \end{matrix}), (\begin{matrix} d_{11}^{2} & 0 \\ 0 & d_{22}^{2} \end{matrix})),

(41)

γ^{'} (0) \approx {\hat{θ}}_{1 / 2} = ({\hat{μ}}_{1 / 2}, {\hat{Σ}}_{1 / 2}) = ((\begin{matrix} {\hat{μ}}_{1} \\ 0 \end{matrix}), (\begin{matrix} 0 & {\hat{σ}}_{12} \\ {\hat{σ}}_{12} & 0 \end{matrix})) .

(42)

where

η

,

d_{11}

, and

d_{22}

are real values; see Figure 6.

The focus here is the “shape” of these distributions. Note that at

t = 0

, the distribution

γ (0)

appears as

θ_{1 / 2}

, which has a diagonal covariance matrix, and the tangent vector

γ^{'} (0)

appears as

{\hat{θ}}_{1 / 2}

, which is composed by a mean vector with the second entry equal to zero and by a symmetric covariance matrix with a null diagonal.

This observation inspired us to get an explicit expression for the geodesic connecting two mirrored distributions. Starting with the bi-dimensional case again, we will prove that in fact we have equality in Expressions (41) and (42).

Let

γ (t) = (μ (t), Σ (t))

,

- 1 \leq t \leq 1

, and the geodesic curve in

M

connecting

θ_{1}

and

θ_{2}

, and consider that

γ (0) = θ_{1 / 2}

and

γ^{'} (0) = {\hat{θ}}_{1 / 2}

. Given the isometry

ψ = ψ_{(- Σ_{1 / 2}^{- 1 / 2} μ_{1 / 2}, Σ_{1 / 2}^{- 1 / 2})}

, we define:

\bar{γ} (t) = (\bar{μ} (t), \bar{Σ} (t)) : = ψ (γ (t)) = (Σ_{1 / 2}^{- 1 / 2} (μ (t) - μ_{1 / 2}), Σ_{1 / 2}^{- 1 / 2} Σ (t) Σ_{1 / 2}^{- 1 / 2}) .

(43)

Then:

{\bar{γ}}^{'} (t) = (\frac{d \bar{μ} (t)}{d t}, \frac{\bar{Σ} (t)}{d t}) = (Σ_{1 / 2}^{- 1 / 2} (\frac{d μ (t)}{d t}), Σ_{1 / 2}^{- 1 / 2} (\frac{Σ (t)}{d t}) Σ_{1 / 2}^{- 1 / 2}),

(44)

\begin{matrix} \bar{γ} (0) = & (Σ_{1 / 2}^{- 1 / 2} (μ_{1 / 2} - μ_{1 / 2}), Σ_{1 / 2}^{- 1 / 2} Σ_{1 / 2} Σ_{1 / 2}^{- 1 / 2}) \\ = & (0, I_{2}) = : θ_{0} \end{matrix}

(45)

and:

\begin{matrix} {\bar{γ}}^{'} (0) & = (Σ_{1 / 2}^{- 1 / 2} μ_{1 / 2}^{'}, Σ_{1 / 2}^{- 1 / 2} Σ_{1 / 2}^{'} Σ_{1 / 2}^{- 1 / 2}) \\ = ((\begin{matrix} \frac{μ^{'} (0)}{d_{11}} \\ 0 \end{matrix}), (\begin{matrix} 0 & \frac{σ_{12}^{'} (0)}{d_{11} d_{22}} \\ \frac{σ_{12}^{'} (0)}{d_{11} d_{22}} & 0 \end{matrix})) . \end{matrix}

(46)

Applying the natural changing of parameters:

(δ (t), Δ (t)) = φ (\bar{μ} (t), \bar{Σ} (t)) = (\bar{Σ} {(t)}^{- 1} \bar{μ} (t), \bar{Σ} {(t)}^{- 1}),

(47)

it follows that:

\{\begin{matrix} \frac{d Δ}{d t} (t) = & - Δ (t) (\frac{d \bar{Σ}}{d t} (t)) Δ (t) \\ \frac{d δ}{d t} (t) = & (\frac{d Δ}{d t} (t)) \bar{μ} (t) + Δ (t) (\frac{d \bar{μ}}{d t} (t)) \end{matrix} .

(48)

Then, given that

(δ (0), Δ (0)) = (\bar{μ} (0), \bar{Σ} (0)) = (0, I_{2})

,

\{\begin{matrix} \frac{d Δ}{d t} (0) = & - Δ (0) (\frac{d \bar{Σ}}{d t} (0)) Δ (0) = - \frac{d \bar{Σ}}{d t} (0) \\ \frac{d δ}{d t} (0) = & (\frac{d Δ}{d t} (0)) \bar{μ} (0) + Δ (0) (\frac{d \bar{μ}}{d t} (0)) = \frac{d \bar{μ}}{d t} (0) \end{matrix} .

(49)

That is, at

t = 0

, the tangent vector

(\frac{d δ}{d t} (0), \frac{d Δ}{d t} (0))

is equal to the tangent vector in (46). Furthermore, the distributions

ϑ_{1} = φ ({\bar{θ}}_{1}) = ({\bar{Σ}}_{1}^{- 1} {\bar{μ}}_{1}, {\bar{Σ}}_{1}^{- 1})

and

ϑ_{2} = φ ({\bar{θ}}_{2}) = ({\bar{Σ}}_{2}^{- 1} {\bar{μ}}_{2}, {\bar{Σ}}_{2}^{- 1})

are also mirrored

{\bar{Σ}}_{2}^{- 1} = M_{1} {\bar{Σ}}_{1}^{- 1} M_{1}

. In fact,

\begin{matrix} ϑ_{1} = & ({\bar{Σ}}_{1}^{- 1} {\bar{μ}}_{1}, {\bar{Σ}}_{1}^{- 1}) \\ = & ({(Σ_{1 / 2}^{- 1 / 2} Σ_{1} Σ_{1 / 2}^{- 1 / 2})}^{- 1} Σ_{1 / 2}^{- 1 / 2} (μ_{1} - μ_{1 / 2}), {(Σ_{1 / 2}^{- 1 / 2} Σ_{1} Σ_{1 / 2}^{- 1 / 2})}^{- 1}) \\ = & (\frac{1}{det (Σ_{1})} (\begin{matrix} σ_{22} d_{11} \frac{μ_{1} - μ_{2}}{2} - σ_{12} d_{11} (μ_{0} - η) \\ σ_{11} d_{22} (μ_{0} - η) - σ_{12} d_{22} \frac{μ_{1} - μ_{2}}{2} \end{matrix}), \frac{1}{det (Σ_{1})} (\begin{matrix} σ_{22} d_{11}^{2} & - σ_{12} d_{11} d_{22} \\ - σ_{12} d_{11} d_{22} & σ_{11} d_{22}^{2} \end{matrix})), \end{matrix}

(50)

and by similar arguments, we obtain:

\begin{matrix} ϑ_{2} = (\frac{1}{det (Σ_{1})} (\begin{matrix} σ_{22} d_{11} \frac{μ_{2} - μ_{1}}{2} + σ_{12} d_{11} (μ_{0} - η) \\ σ_{11} d_{22} (μ_{0} - η) - σ_{12} d_{22} \frac{μ_{1} - μ_{2}}{2} \end{matrix}), \frac{1}{det (Σ_{1})} (\begin{matrix} σ_{22} d_{11}^{2} & σ_{12} d_{11} d_{22} \\ σ_{12} d_{11} d_{22} & σ_{11} d_{22}^{2} \end{matrix})) . \end{matrix}

(51)

Figure 7 illustrates the distributions

θ_{0}

,

ϑ_{1}

, and

ϑ_{2}

.

Conversely, by considering:

(x, B) = ((\begin{matrix} x \\ 0 \end{matrix}), (\begin{matrix} 0 & b \\ b & 0 \end{matrix}))

(52)

in the initial value problem given in Equations (13) and (14), it follows that the matrix

G^{2} = B^{2} + x x^{t}

is diagonal. Therefore, the geodesic curve

(δ (t), Δ (t))

with initial value

(δ (0), Δ (0)) = θ_{0}

and tangent vector

(x, B)

given in Equation (15) can be simplified as follows:

\{\begin{matrix} δ (t) = & (\begin{matrix} \frac{x sinh (t \sqrt{b^{2} + 2 x^{2}})}{\sqrt{b^{2} + 2 x^{2}}} \\ - \frac{b x (cosh (t \sqrt{b^{2} + 2 x^{2}}) - 1)}{b^{2} + 2 x^{2}} \end{matrix}) \\ Δ (t) = & (\begin{matrix} \frac{1}{2} (cosh (b t) + cosh (t \sqrt{b^{2} + 2 x^{2}})) & - \frac{1}{2} (sinh (b t) + \frac{b sinh (t \sqrt{b^{2} + 2 x^{2}})}{\sqrt{b^{2} + 2 x^{2}}}) \\ - \frac{1}{2} (sinh (b t) + \frac{b sinh (t \sqrt{b^{2} + 2 x^{2}})}{\sqrt{b^{2} + 2 x^{2}}}) & \frac{1}{2} (cosh (b t) + \frac{2 x^{2} + b^{2} cosh (t \sqrt{b^{2} + 2 x^{2}})}{b^{2} + 2 x^{2}}) \end{matrix}) . \end{matrix}

(53)

From the parity of the functions

sinh (t)

and

cosh (t)

, it is possible to show that, given

t_{0} \in R

, the distributions:

(δ (- t_{0}), Δ (- t_{0})) and (δ (t_{0}), Δ (t_{0}))

are also mirrored.

By the above discussion, we conclude that it is possible to calculate the geodesic curve connecting

θ_{1}

and

θ_{2}

making

ψ^{- 1} (φ^{- 1} (δ (- 1), Δ (- 1))) = θ_{1}

and

ψ - 1 (φ^{- 1} (δ (1), Δ (1))) = θ_{2}

. That is, we need to find the values of

η

,

d_{11}

, and

d_{22}

of the isometry

ψ

and the values of

(x, B)

such that:

\{\begin{matrix} φ (ψ (θ_{1})) = & (δ (- 1), Δ (- 1)) \\ φ (ψ (θ_{2})) = & (δ (1), Δ (1)) \end{matrix} .

(54)

Since the two equations above are equivalent, it is enough to solve the equation:

(δ (1), Δ (1)) = φ (ψ (μ_{2}, Σ_{2})) .

(55)

This is equivalent to solving the system:

\{\begin{matrix} (\begin{matrix} \frac{1}{d_{11}} & 0 \\ 0 & \frac{1}{d_{22}} \end{matrix}) Δ (1) (\begin{matrix} \frac{1}{d_{11}} & 0 \\ 0 & \frac{1}{d_{22}} \end{matrix}) = Δ_{2} \\ (\begin{matrix} \frac{1}{d_{11}} & 0 \\ 0 & \frac{1}{d_{22}} \end{matrix}) δ (1) + Δ_{2} (\begin{matrix} \frac{μ_{1} + μ_{2}}{2} \\ η \end{matrix}) = δ_{2} \end{matrix},

(56)

where

(δ_{2}, Δ_{2}) = φ (μ_{2}, Σ_{2})

.

The above non-linear system has five equations and five variables

(d_{11}, d_{22}, η, x, b)

and can be solved by an iterative method. With the solution of this system, we can determine the geodesic curve connecting the distributions

θ_{1}

and

θ_{2}

. Moreover, by Equation (16), the Fisher–Rao distance is:

d_{F} (θ_{1}, θ_{2}) = 2 d_{F} (θ_{0}, θ_{2}) = 2 d_{F} ((0, I_{n}), (δ (1), Δ (1))) = 2 \sqrt{\frac{1}{2} tr (B^{2}) + x^{t} x} = 2 \sqrt{b^{2} + x^{2}} .

(57)

We also remark that the curve of the means

δ (t)

(and therefore,

μ (t)

) satisfies the equation of a hyperbola; in fact:

\frac{{(- \frac{b x (cosh (t \sqrt{b^{2} + 2 x^{2}}) - 1)}{b^{2} + 2 x^{2}} - \frac{b x}{b^{2} + 2 x^{2}})}^{2}}{{(\frac{b x}{b^{2} + 2 x^{2}})}^{2}} - \frac{{(\frac{x sinh (t \sqrt{b^{2} + 2 x^{2}})}{\sqrt{b^{2} + 2 x^{2}}})}^{2}}{{(\frac{x}{\sqrt{b^{2} + 2 x^{2}}})}^{2}} = 1 .

(58)

Summarizing the above discussion, we have:

Proposition 4.

(i): Expression (57) provides a closed form for the Fisher–Rao distance between two mirrored bivariate normal distributions, based on the solutions of the non-linear system (56).
(ii): The plane curve given by the coordinates of the mean vector in the geodesic connecting two of these distributions is a hyperbola.

Table 2 shows a time comparison between the numerical method proposed here and the geodesic shooting to obtain the Fisher–Rao distance. The distributions used in this experiment were:

θ_{1} = ((\begin{matrix} - μ \\ 0 \end{matrix}), (\begin{matrix} 0.55 & - 0.45 \\ - 0.45 & 0.55 \end{matrix})) and θ_{2} = ((\begin{matrix} μ \\ 0 \end{matrix}) (\begin{matrix} 0.55 & 0.45 \\ 0.45 & 0.55 \end{matrix})) .

(59)

for different values of

μ

.

The method proposed here uses a non-linear system for the calculus of the Fisher–Rao distance, so it is faster the geodesic shooting algorithm. Furthermore, we remark that for

μ \geq 7

, the geodesic shooting requires additional adaptation to convergence.

Next, we generalize the results of Proposition 4 to pairs of general multivariate normal mirrored distributions. Without loss of generality, we may assume:

θ_{1} = (μ_{1} e_{1}, Σ_{1}) and θ_{2} = (μ_{2} e_{1}, Σ_{2}),

(60)

with

Σ_{2} = M_{1} Σ_{1} M_{1}

as in (39), that is:

Σ_{2} = \{\begin{matrix} {\hat{σ}}_{1 j} = & {\hat{σ}}_{j 1} = - σ_{1 j}, j = 2, \dots, n \\ {\hat{σ}}_{i j} = & σ_{i j}, & otherwise . \end{matrix} .

Proposition 5.

The Fisher–Rao distance between a pair of multivariate mirrored normal distributions

θ_{1}

and

θ_{2}

(65) is:

d_{F} (θ_{1}, θ_{2}) = 2 \sqrt{\sum_{l = 1}^{n - 1} b_{l}^{2} + x^{2}},

(61)

where:

(x, B) = ((\begin{matrix} x \\ 0 \\ ⋮ \\ 0 \end{matrix}), (\begin{matrix} 0 & b_{1} & \dots & b_{n - 1} \\ b_{1} & 0 & \dots & 0 \\ ⋮ & ⋮ & ⋱ & ⋮ \\ b_{n - 1} & 0 & \dots & 0 \end{matrix})) .

(62)

The values x and

b_{l}

, the non-zero entries of

(x, B)

, are obtained by the solution of the

(n + \frac{n (n + 1)}{2})

order non-linear system:

\{\begin{matrix} L^{- 1} Δ (1) L^{- 1} = Δ_{2} \\ L^{- 1} δ (1) + Δ_{2} μ_{1 / 2} = δ_{2} \end{matrix},

(63)

where

μ_{1 / 2} = {(\frac{μ_{1} + μ_{2}}{2}, η_{1}, \dots, η_{n - 1})}^{t}

, L is the Cholesky factor of the matrix

Σ_{1 / 2} = (\begin{matrix} d_{11} & 0^{t} \\ 0 & D \end{matrix})

, with D a symmetric

(n - 1)

order matrix,

(δ_{2}, Δ_{2}) = φ (μ_{2}, Σ_{2})

, and

(δ (t), Δ (t))

is the geodesic curve with initial value

(δ (0), Δ (0)) = θ_{0}

and tangent vector

(x, B)

given in Equation (15).

Let

γ (t) = (μ (t), Σ (t))

,

- 1 \leq t \leq 1

, be the geodesic curve in

M

connecting

θ_{1}

and

θ_{2}

. The proof is similar to the bivariate case, by considering

γ (0) = (μ_{1 / 2}, Σ_{1 / 2})

,

γ^{'} (0) = {\hat{θ}}_{1 / 2} = ({\hat{μ}}_{1 / 2}, {\hat{Σ}}_{1 / 2})) = ((\begin{matrix} \hat{μ} \\ 0 \\ ⋮ \\ 0 \end{matrix}), (\begin{matrix} 0 & {\hat{σ}}_{12} & \dots & {\hat{σ}}_{1 n} \\ {\hat{σ}}_{12} & 0 & \dots & 0 \\ ⋮ & ⋮ & ⋱ & ⋮ \\ {\hat{σ}}_{1 n} & 0 & \dots & 0 \end{matrix}))

(64)

and

\bar{γ} (t) = ψ (γ (t))

where

ψ = ψ_{(- L^{- 1} μ_{1 / 2} L^{- 1})}

.

θ_{1} = (μ_{1} e_{1}, Σ_{1}) and θ_{2} = (μ_{2} e_{1}, Σ_{2}),

(65)

with

Σ_{2} = M_{1} Σ_{1} M_{1}

.

Table 3 collects the results in Section 2.1 and the new results of this section.

4. Hierarchical Clustering for Diagonal Gaussian Mixture Simplification

A parameterized Gaussian mixture model f is a weighted sum of m multivariate normal distributions, that is,

f (x) = \sum_{i = 1}^{m} w_{i} p_{i} (x; μ_{i}, Σ_{i}),

where

x \in R^{n}

,

p_{i} (x; μ_{i}, Σ_{i})

,

i = 1, \dots, m

, are normal distributions and

w_{i}

,

i = 1, \dots, m

, are mixture,

\sum_{i = 1}^{m} w_{i} = 1

. In this paper, we call the diagonal Gaussian mixture model (DGMM) the mixture composed only by distributions with diagonal covariance matrices.

Gaussian mixture models (GMM) are used in modeling datasets: image processing, signal processing, and density estimation problems [46,47,48]. In many applications involving mixture models, the computational requirements are of a very high level due to the large number of mixture components. This can be handled if we reduce the number of components of the mixture: given a mixture f of m components, we want to find a mixture g of l components,

1 \leq l < m

, such that g is a good approximation of f with respect to a similarity measure [49]. Gaussian mixture simplification was considered in statistical inference in [50] and to decode low-density lattice codes [51].

In [49] was proposed a hierarchical clustering algorithm to simplify an exponential family mixture model based on Bregman divergences. This section describes an agglomerative hierarchical clustering method based on the Fisher–Rao distance in the submanifold

M_{D}

(21) to simplify DGMM, and we present an application to image segmentation, complementing what was developed in [52]. We start by introducing the concept of the centroid for a set of distributions in

M_{D}

.

4.1. Centroids in the Submanifold $M_{D}$

In [53], Galperin described centroids in the two-dimensional Minkowski model, which can be translated also to the Klein disk and Poincare half-plane models. Given a set of points

q_{i} = (x_{q_{i}}, y_{q_{i}}, z_{q_{i}})

in the Minkowski model, with associated weights

u_{i}

, the centroid is computed and normalized as:

c^{'} = \sum_{i} u_{i} q_{i} and c = \frac{c^{'}}{- x_{c^{'}}^{2} - y_{c^{'}}^{2} + z_{c^{'}}^{2}} .

(66)

To calculate the centroid

c

of a subset of points

C = {(w_{i}, θ_{i})}

,

θ_{i} = (μ_{i}, σ_{i})

, the isometries presented in [25] and the relation between the media × standard deviation plane of parameters of univariate normal distributions and the Poincare half-plane given in [22] are used.

Given a dataset

C = {(w_{i}, θ_{i})}

, where

θ_{i} = (μ_{1 i}, σ_{1 i}, \dots, μ_{n i}, σ_{n i})

are distributions in

M_{D}

, the centroid of

C

is:

c : = (c_{1}, \dots, c_{n}),

(67)

where

c_{j}

,

j = 1, \dots, n

, is the centroid of

C_{j} = {(w_{j}, (μ_{j i}, σ_{j i}))}

given in Equation (66).

4.2. Hierarchical Clustering Algorithm

Let a DGMM f with parameters

C = {(w_{1}, θ_{1}), \dots, (w_{m}, θ_{m})}

.

In order to apply the hierarchical clustering algorithm, we need to consider the distance between two subsets A and B. The three most common distances are called linkage criteria and are given by [54]:

Single linkage:

$D (A, B) = min {d_{D} (a, b); a \in A, b \in B};$

(68)
Complete linkage:

$D (A, B) = max {d_{D} (a, b); a \in A, b \in B};$

(69)
Group average linkage:

$D (A, B) = \frac{1}{| A | | B |} \sum_{a \in A} \sum_{b \in B} d_{D} (a, b),$

(70)

where $d_{D}$ is the distance in the submanifold $M_{D}$ and $| X |$ is the number of elements of a set X.

A summary of the hierarchical clustering algorithm (Algorithm 1) [49] using one of these distances is given next.

Algorithm 1: Hierarchical Clustering Algorithm

1:: Form m clusters $C_{j} = {(w_{j}, θ_{j})}$ with one element.
2:: Find the two closest clusters, $C_{i}$ and $C_{j}$ , with respect to a distance D, and merge them into asingle cluster $C_{i} \cup C_{j}$ .
3:: Compute distances between the new cluster and each of the old clusters.
4:: Repeat Steps 2 and 3 until all items are clustered into a single cluster of size n.

The simplified DGMM:

g = \sum_{j = 1}^{l} β_{j} g_{j}

of l components is built from the l subsets

C_{1}

, …,

C_{l}

remaining after the iteration

n - l

of the hierarchical clustering algorithm. In this work, we choose the parameters of

g_{j}

in two ways: as the centroid in the submanifold

M_{D}

(Fisher–Rao hierarchical clustering) and as the Bregman left-sided centroid [49] (Bregman–Fisher–Rao hierarchical clustering) of the subset

C_{j}

with weights

β_{j} = \sum_{(w_{i}, θ_{i}) \in C_{j}} w_{i}

.

As remarked in [49], the hierarchical clustering algorithm allows introducing a method to learn the optimal number of components in the simplified mixture g. Thus, g must be as compact as possible and reach a minimum prescribed quality

d_{K L} (f | | g) \leq τ

, where

d_{K L} (f | | g)

is the Kullback–Leibler divergence.

4.3. Experiments in Image Segmentation

We can apply the Fisher–Rao and the Bregman–Fisher–Rao hierarchical clusterings to simplify a mixture of exponential families in the context of clustering-based image segmentation as was done in [49] for the Bregman hierarchical clustering. Given an input color image I, we adapt the Bregman soft clustering algorithm to generate a DGMM f of 32 components, which models the image pixels. We point out that the restriction considered in this paper (only DGMM) is also used in many applications due its much lower computational cost. We consider here a pixel

ρ = (ρ_{R}, ρ_{G}, ρ_{B})

as a point in

R^{3}

, where

ρ_{R}

,

ρ_{G}

, and

ρ_{B}

are the RGB color information. For image segmentation, we can say that the image pixel

ρ

belongs to the class

C_{j}

when:

p_{j} (ρ; μ_{j}, Σ_{j}) > p_{i} (ρ; μ_{i}, Σ_{i}), \forall i \in {1, \dots, m} ∖ {j} .

Thus, the segmented image is illustrated by replacing the color value of the pixel

ρ

by the mean

μ_{j}

of the Gaussian

p_{j}

.

Using the the Fisher–Rao and the Bregman–Fisher–Rao hierarchical clusterings, we simplify the mixture f into mixtures g of l components with

l = {2, 4, 8, 16}

. Each mixture gives one image segmentation. The linkage criterion used here was the complete linkage (68), which has presented better results in our simulations. Figure 8 shows the segmentation of the Baboon, Lena, and Clown input images given by the Bregman–Fisher–Rao hierarchical clustering. The number of colors in each image is equal to the number of components in the simplified mixture g.

The quality of the segmentation was analyzed as a function of l through the Kullback–Leibler divergence estimated by the Monte Carlo method, since there was no closed form for this measure (five thousand points were randomly drawn to estimate

d_{K L} (f | | g)

). Figure 9, Figure 10 and Figure 11 show the evolution of the simplification quality as a function of the number of components l for the Baboon, Lena, and Clown images, using the Bregman, the Fisher–Rao, and the Bregman–Fisher–Rao hierarchical clustering algorithms. We observed that the image quality increased (

d_{K L} (f | | g)

decreased) with l, as expected, and the behavior was similar in all clustering algorithms. In general, the Bregman–Fisher–Rao hierarchical clustering algorithm presented better results. Considering the constraint

τ = 0.2

, the learning process provided, for the Bregman–Fisher–Rao hierarchical clustering, mixtures of 19, 21, and 21 as optimal simplifications for the images of the Baboon, Lena, and Clown, respectively.

5. Concluding Remarks

The Fisher–Rao distance was approached here in the space of multivariate normal distributions. Initially, as in [38], we summarized some known closed forms for this distance in submanifolds of this model and some bounds for the general case. A closed form for the Fisher–Rao distance between distributions with the same covariance matrix was obtained in Proposition 3, and we also have derived a non-linear system characterizing the distance between two distributions with mirrored covariance matrices in Proposition 5. Some perspectives for future research related to this topic include deriving new bounds for the Fisher–Rao distance in the general case, by using these special distributions, to characterize as non-linear systems the distances between other types of distributions and to extend the closed forms and bounds presented here to the space of elliptical distributions. Finally, we have extended the analysis of the Bregman–Fisher–Rao hierarchical clustering algorithm to simplify Gaussian mixtures in the context of clustering-based image segmentation given in [52] with comparative results that encourage the use of the Fisher–Rao distance in other clustering or classification algorithms.

Author Contributions

All authors contributed equally to the research and the writing of the manuscript. All authors read and approved the final manuscript.

Acknowledgments

The authors are thankful to the referees, as their comments and suggestions have contributed to improve the presentation of the text. The authors were partially supported by grants FAPESP (13/25977-7) and CNPq (313326/2017-7) foundations.

Conflicts of Interest

The authors declare no conflict of interest.

References

Calin, O.; Udriste, C. Geometric Modeling in Probability and Statistics. In Mathematics and Statistics; Springer International: Cham, Switzerland, 2014. [Google Scholar]
Nielsen, F. An elementary introduction to information geometry. arXiv 2018, arXiv:1808.08271. [Google Scholar]
Amari, S.; Nagaoka, H. Methods of Information Geometry. In Translations of Mathematical Monographs; Oxford University Press: Oxford, UK, 2000; Volume 191. [Google Scholar]
Amari, S. Information Geometry and Its Applications; Springer: Tokyo, Japan, 2016. [Google Scholar]
Ay, N.; Jost, J.; Vân Lê, H.; Schwachhöfer, L. Information geometry and sufficient statistics. Probab. Theory Relat. Fields 2015, 162, 327–364. [Google Scholar] [CrossRef] [Green Version]
Vân Lê, H. The uniqueness of the Fisher metric as information metric. Ann. Inst. Stat. Math. 2017, 69, 879–896. [Google Scholar]
Gibilisco, P.; Riccomagno, E.; Rogantin, M.P.; Wynn, H.P. Algebraic and Geometric Methods in Statistics; Cambridge University Press: New York, NY, USA, 2010. [Google Scholar]
Chentsov, N.N. Statistical Decision Rules and Optimal Inference; AMS Bookstore: Providence, RI, USA, 1982; Volume 53. [Google Scholar]
Campbell, L.L. An extended Cencov characterization of the information metric. Proc. Am. Math. Soc. 1986, 98, 135–141. [Google Scholar]
Vân Lê, H. Statistical manifolds are statistical models. J. Geom. 2006, 84, 83–93. [Google Scholar]
Mahalanobis, P.C. On the generalized distance in statistics. Proc. Natl. Inst. Sci. 1936, 2, 49–55. [Google Scholar]
Bhattacharyya, A. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 1943, 35, 99–110. [Google Scholar]
Hotelling, H. Spaces of statistical parameters. Bull. Am. Math. Soc. (AMS) 1930, 36, 191. [Google Scholar]
Rao, C.R. Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37, 81–91. [Google Scholar]
Fisher, R.A. On the mathematical foundations of theoretical statistics. Philos. Trans. R. Soc. Lond. 1921, 222, 309–368. [Google Scholar]
Burbea, J. Informative geometry of probability spaces. Expo. Math. 1986, 4, 347–378. [Google Scholar]
Skovgaard, L.T. A Riemannian geometry of the multivariate normal model. Scand. J. Stat. 1984, 11, 211–223. [Google Scholar]
Atkinson, C.; Mitchell, A.F.S. Rao’s Distance Measure. Sankhyã Indian J. Stat. 1981, 43, 345–365. [Google Scholar]
Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
Villani, C. Optimal Transport, Old and New. In Grundlehren der Mathematischen Wissenschaften; Springer: Berlin/Heidelberg, Germany, 2009. [Google Scholar]
Amari, S. Differential Geometrical Methods in Statistics; Springer: Berlin, Germany, 1985. [Google Scholar]
Costa, S.I.R.; Santos, S.A.; Strapasson, J.E. Fisher information distance: A geometrical reading. Discret. Appl. Math. 2015, 197, 59–69. [Google Scholar] [CrossRef]
Angulo, J.; Velasco-Forero, S. Morphological processing of univariate Gaussian distribution-valued images based on Poincaré upper-half plane representation. In Geometric Theory of Information; Springer International Publishing: Cham, Switzerland, 2014; pp. 331–366. [Google Scholar]
Maybank, S.J.; Ieng, S.; Benosman, R. A Fisher–Rao metric for paracatadioptric images of lines. Int. J. Comput. Vis. 2012, 99, 147–165. [Google Scholar] [CrossRef] [Green Version]
Schwander, O.; Nielsen, F. Model centroids for the simplification of kernel density estimators. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012. [Google Scholar]
Taylor, S. Clustering Financial Return Distributions Using the Fisher Information Metric. Entropy 2019, 21, 110. [Google Scholar] [CrossRef] [Green Version]
Eriksen, P.S. Geodesics Connected with the Fischer Metric on the Multivariate Normal Manifold; Institute of Electronic Systems, Aalborg University Centre: Aalborg, Denmark, 1986. [Google Scholar]
Calvo, M.; Oller, J.M. An explicit solution of information geodesic equations for the multivariate normal model. Stat. Decis. 1991, 9, 119–138. [Google Scholar] [CrossRef]
Lenglet, C.; Rousson, M.; Deriche, R.; Faugeras, O. Statistics on the manifold of multivariate normal distributions. Theory and application to diffusion tensor MRI processing. J. Math. Imaging Vis. 2006, 25, 423–444. [Google Scholar] [CrossRef]
Moakher, M.; Mourad, Z. The Riemannian geometry of the space of positive-definite matrices and its application to the regularization of positive-definite matrix-valued data. J. Math. Imaging Vis. 2011, 40, 171–187. [Google Scholar] [CrossRef]
Han, M.; Park, F.C. DTI Segmentation and Fiber Tracking Using Metrics on Multivariate Normal Distributions. J. Math. Imaging Vis. 2014, 49, 317–334. [Google Scholar] [CrossRef]
Verdoolaege, G.; Scheunders, P. Geodesics on the manifold of multivariate generalized Gaussian distributions with an application to multicomponent texture discrimination. Int. J. Comput. Vis. 2011, 95, 265. [Google Scholar] [CrossRef] [Green Version]
Tang, M.; Rong, Y.; Zhou, J.; Li, X.R. Information geometric approach to multisensor estimation fusion. IEEE Trans. Signal Process. 2018, 67, 279–292. [Google Scholar] [CrossRef]
Poon, C.; Keriven, N.; Peyré, G. Support Localization and the Fisher Metric for off-the-grid Sparse Regularization. arXiv 2018, arXiv:1810.03340. [Google Scholar]
Gattone, S.A.; De Sanctis, A.; Puechmorel, S.; Nicol, F. On the geodesic distance in shapes K-means clustering. Entropy 2018, 20, 647. [Google Scholar] [CrossRef] [Green Version]
Gattone, S.A.; De Sanctis, A.; Russo, T.; Pulcini, D. A shape distance based on the Fisher–Rao metric and its application for shapes clustering. Phys. A Stat. Mech. Appl. 2017, 487, 93–102. [Google Scholar] [CrossRef]
Pilté, M.; Barbaresco, F. Tracking quality monitoring based on information geometry and geodesic shooting. In Proceedings of the 2016 17th International Radar Symposium (IRS), Krakow, Poland, 10–12 May 2016. [Google Scholar]
Pinele, J.; Costa, S.I.; Strapasson, J.E. On the Fisher–Rao Information Metric in the Space of Normal Distributions. In International Conference on Geometric Science of Information; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2019; pp. 676–684. [Google Scholar]
Burbea, J.; Rao, C.R. Entropy differential metric, distance and divergence measures in probability spaces: A unified approach. J. Multivar. Anal. 1982, 12, 575–596. [Google Scholar] [CrossRef] [Green Version]
Porat, B.; Benjamin, F. Computation of the exact information matrix of Gaussian time series with stationary random components. IEEE Trans. Acoust. Speech Signal Process. 1986, 34, 118–130. [Google Scholar] [CrossRef]
Siegel, C.L. Symplectic geometry. Am. J. Math. 1943, 65, 1–86. [Google Scholar] [CrossRef]
Strapasson, J.E.; Pinele, J.; Costa, S.I.R. A totally geodesic submanifold of the multivariate normal distributions and bounds for the Fisher–Rao distance. In Proceedings of the IEEE Information Theory Workshop (ITW), Cambridge, UK, 11–14 September 2016; pp. 61–65. [Google Scholar]
Calvo, M.; Oller, J.M. A distance between multivariate normal distributions based in an embedding into the Siegel group. J. Multivar. Anal. 1990, 35, 223–242. [Google Scholar] [CrossRef] [Green Version]
Calvo, M.; Oller, J.M. A distance between elliptical distributions based in an embedding into the Siegel group. J. Comput. Appl. Math. 2002, 145, 319–334. [Google Scholar] [CrossRef] [Green Version]
Strapasson, J.E.; Porto, J.; Costa, S.I.R. On bounds for the Fisher–Rao distance between multivariate normal distributions. Aip Conf. Proc. 2015, 1641, 313–320. [Google Scholar]
Zhang, K.; Kwok, J.T. Simplifying mixture models through function approximation. IEEE Trans. Neural Netw. 2010, 21, 644–658. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Davis, J.V.; Dhillon, I.S. Differential entropic clustering of multivariate gaussians. In Proceedings of the 2006 Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 4–7 December 2006. [Google Scholar]
Goldberger, J.; Greenspan, H.K.; Dreyfuss, J. Simplifying mixture models using the unscented transform. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 1496–1502. [Google Scholar] [CrossRef]
Garcia, V.; Nielsen, F. Simplification and hierarchical representations of mixtures of exponential families. Signal Process. 2010, 90, 3197–3212. [Google Scholar] [CrossRef]
Bar-Shalom, Y.; Li, X. Estimation and Tracking: Principles, Techniques and Software; Artech House: Norwood, MA, USA, 1993. [Google Scholar]
Kurkoski, B.; Dauwels, J. Message-passing decoding of lattices using Gaussian mixtures. In Proceedings of the 2008 IEEE International Symposium on Information Theory, Toronto, ON, Canada, 6–11 July 2008. [Google Scholar]
Strapasson, J.E.; Pinele, J.; Costa, S.I.R. Clustering using the Fisher–Rao distance. In Proceedings of the IEEE Sensor Array and Multichannel Signal Processing Workshop, Rio de Janerio, Brazil, 10–13 July 2016. [Google Scholar]
Galperin, G.A. A concept of the mass center of a system of material points in the constant curvature spaces. Commun. Math. Phys. 1993, 154.1, 63–84. [Google Scholar] [CrossRef]
Nielsen, F. Introduction to HPC with MPI for Data Science. In Undergraduate Topics in Computer Science; Springer: Cham, Switzerland, 2016. [Google Scholar]

Figure 1. A comparison between the bounds

L B

,

U B_{1}

,

U B_{2}

,

U B_{3}

, and

G S

. (

λ_{1} = 2

,

λ_{2} = 0.5

, and

μ = 1

are fixed, and

α

varies from zero to

\frac{π}{2}

).

Figure 1. A comparison between the bounds

L B

,

U B_{1}

,

U B_{2}

,

U B_{3}

, and

G S

. (

λ_{1} = 2

,

λ_{2} = 0.5

, and

μ = 1

are fixed, and

α

varies from zero to

\frac{π}{2}

).

Figure 2. A comparison between the bounds

L B

,

U B_{1}

,

U B_{2}

,

U B_{3}

, and

G S

. (

λ_{1} = 2

,

λ_{2} = 0.5

, and

μ = 10

are fixed, and

α

varies from zero to

\frac{π}{2}

).

Figure 2. A comparison between the bounds

L B

,

U B_{1}

,

U B_{2}

,

U B_{3}

, and

G S

. (

λ_{1} = 2

,

λ_{2} = 0.5

, and

μ = 10

are fixed, and

α

varies from zero to

\frac{π}{2}

).

Figure 3. (a) A comparison between the bounds

L B

,

U B_{1}

,

U B_{2}

,

U B_{3}

, and

G S

. (

λ_{1} = 2

,

λ_{2} = 0.5

, and the rotation angle

α = π / 4

are fixed, and

μ

varies from zero to 10). (b) A comparison between the bounds

L B

,

U B_{123}

, and

G S

. (

λ_{1} = 2

,

λ_{2} = 0.5

, and the rotation angle

α = π / 4

are fixed, and

μ

varies from zero to 10).

Figure 3. (a) A comparison between the bounds

L B

,

U B_{1}

,

U B_{2}

,

U B_{3}

, and

G S

. (

λ_{1} = 2

,

λ_{2} = 0.5

, and the rotation angle

α = π / 4

are fixed, and

μ

varies from zero to 10). (b) A comparison between the bounds

L B

,

U B_{123}

, and

G S

. (

λ_{1} = 2

,

λ_{2} = 0.5

, and the rotation angle

α = π / 4

are fixed, and

μ

varies from zero to 10).

Figure 4. (a) Level curves of the distributions in the geodesic curve connecting the bivariate normal distributions

θ_{1} = ({(- 1, 0)}^{t}, Σ)

and

θ_{2} = ({(6, 3)}^{t}, Σ)

in

M

. (b) Level curves of the distributions in the geodesic curve connecting the bivariate normal distributions

θ_{1} = ({(- 1, 0)}^{t}, Σ)

and

θ_{2} = ({(6, 3)}^{t}, Σ)

in

M_{Σ}

.

Figure 4. (a) Level curves of the distributions in the geodesic curve connecting the bivariate normal distributions

θ_{1} = ({(- 1, 0)}^{t}, Σ)

and

θ_{2} = ({(6, 3)}^{t}, Σ)

in

M

. (b) Level curves of the distributions in the geodesic curve connecting the bivariate normal distributions

θ_{1} = ({(- 1, 0)}^{t}, Σ)

and

θ_{2} = ({(6, 3)}^{t}, Σ)

in

M_{Σ}

.

Figure 5. Example of level curves of mirrored distributions where

θ_{1}

and

θ_{2}

are given by Equation (40).

Figure 5. Example of level curves of mirrored distributions where

θ_{1}

and

θ_{2}

are given by Equation (40).

Figure 6. Approximation of the geodesic curve connecting

θ_{1}

and

θ_{2}

via the geodesic shooting algorithm. The level curve of

θ_{1 / 2}

is the dashed one.

Figure 6. Approximation of the geodesic curve connecting

θ_{1}

and

θ_{2}

via the geodesic shooting algorithm. The level curve of

θ_{1 / 2}

is the dashed one.

Figure 7. Contour curves of distributions

ϑ_{1} = φ ({\bar{θ}}_{1})

and

ϑ_{2} = φ ({\bar{θ}}_{2})

.

Figure 7. Contour curves of distributions

ϑ_{1} = φ ({\bar{θ}}_{1})

and

ϑ_{2} = φ ({\bar{θ}}_{2})

.

Figure 8. Illustration of the mixture simplification using the Fisher–Rao clustering, where l is the number of components of the mixture (the last column is the original figure).

Figure 9. Illustration of the simplification quality of the mixture modeling Baboon image.

Figure 10. Illustration of the simplification quality of the mixture modeling Lena image.

Figure 11. Illustration of the simplification quality of the mixture modeling Clown image.

Table 1. The lower bound

L B (θ_{1}, θ_{2})

and the upper bounds

U B_{1} (θ_{1}, θ_{2})

,

U B_{2} (θ_{1}, θ_{2})

and

U B_{3} (θ_{1}, θ_{2})

for the Fisher–Rao distance,

d_{F} (θ_{1}, θ_{2})

, between distributions

θ_{1} = (μ_{1}, Σ_{1})

and

θ_{2} = (μ_{2}, Σ_{2})

in

M

.

d_{F *}

is the distance between univariate normal distributions given in Equation (22).

Table 1. The lower bound

L B (θ_{1}, θ_{2})

and the upper bounds

U B_{1} (θ_{1}, θ_{2})

,

U B_{2} (θ_{1}, θ_{2})

and

U B_{3} (θ_{1}, θ_{2})

for the Fisher–Rao distance,

d_{F} (θ_{1}, θ_{2})

, between distributions

θ_{1} = (μ_{1}, Σ_{1})

and

θ_{2} = (μ_{2}, Σ_{2})

in

M

.

d_{F *}

is the distance between univariate normal distributions given in Equation (22).

$L B (θ_{1}, θ_{2}) = \sqrt{\frac{1}{2} \sum_{i = 1}^{n + 1} {[log (λ_{i})]}^{2}}$	$\begin{array}{l} - & S_{i} = (\begin{matrix} Σ_{i} + μ_{i}^{t} μ_{i} & μ_{i}^{t} \\ μ_{i} & 1 \end{matrix}); \\ - & λ_{i} are the eigenvalues of S_{1}^{- 1 / 2} S_{2} S_{1}^{- 1 / 2} . \end{array}$
$U B_{1} (θ_{1}, θ_{2}) = \sqrt{\sum_{i = 1}^{n} d_{F *}^{2} ((0, 1), (μ_{i}, λ_{i}))}$	$\begin{array}{l} - & Σ_{1}^{- (1 / 2)} Σ_{2} Σ_{1}^{- (1 / 2)} = Q Λ Q^{t}; \\ - & λ_{i} are the diagonal terms of the matrix Λ; \\ - & μ_{i} are the coordinates of μ = Q^{t} Σ_{1}^{- (1 / 2)} (μ_{2} - μ_{1}) . \end{array}$
$U B_{2} (θ_{1}, θ_{2}) = \sqrt{\frac{1}{2} \sum_{i = 1}^{n} {[log (λ_{i})]}^{2}} +$ $\sqrt{d_{F }^{2} ((0, 1), (\| μ_{3} \|, {\bar{d}}_{1})) + \sum_{i = 2}^{m} d_{F }^{2} ((0, 1), (0, {\bar{d}}_{i}))}$	$\begin{array}{l} - & μ_{3} = Σ_{1}^{- (1 / 2)} (μ_{2} - μ_{1}); \\ - & {\bar{d}}_{i} is given by (31); \\ - & Σ_{3} = Σ_{1}^{- (1 / 2)} Σ_{2} Σ_{1}^{- (1 / 2)}; \\ - & P is an orthogonal matrix such that \\ P μ = (\| μ_{3} \|, 0, \dots, 0); \\ - & \bar{Σ} = P^{- 1} Σ_{3} P^{- t}; \\ - & λ_{i} are the eigenvalues of {\bar{Σ}}^{- (1 / 2)} Σ_{3} {\bar{Σ}}^{- (1 / 2)} . \end{array}$
$U B_{3} (θ_{1}, θ_{2}) = \sqrt{\frac{1}{2} \sum_{i = 1}^{n} {[log (λ_{i})]}^{2}} +$ $d_{F *} ((0, 1), (\| μ_{3} \|, \sqrt{\frac{\| μ_{3} \| + 2}{2}}))$	$\begin{array}{l} - & μ_{3} = Σ_{1}^{- (1 / 2)} (μ_{2} - μ_{1}); \\ - & Σ_{3} = Σ_{1}^{- (1 / 2)} Σ_{2} Σ_{1}^{- (1 / 2)}; \\ - & P is an orthogonal matrix such that \\ P μ = (\| μ_{3} \|, 0, \dots, 0); \\ - & \bar{Σ} = P^{- 1} Σ_{3} P^{- t}; \\ - & λ_{i} are the eigenvalues of {\bar{Σ}}^{- (1 / 2)} Σ_{3} {\bar{Σ}}^{- (1 / 2)} . \end{array}$

Table 2. A time comparison between the numerical method proposed here and the geodesic shooting to calculate the distance between two mirrored distributions.

$μ$	$d_{F} (θ_{1}, θ_{2})$	Time Systems (s)	Time G.Shooting (s)
1	2.77395	0.046875	4.70313
2	3.67027	0.046875	5.60938
3	4.52933	0.0625	7.10938
4	5.26093	0.078125	9.17188
5	5.87480	0.046875	12.5313
6	6.39439	0.0625	18.4219
7	6.84043	0.078125	492.563
8	7.22903	0.0625	574.422
9	7.57221	0.046875	917.859
10	7.87896	0.046875	1007.13

Table 3. Closed forms for the Fisher–Rao distance in submanifolds of

M

and the distance in

M

between pairs of special distributions.

Table 3. Closed forms for the Fisher–Rao distance in submanifolds of

M

and the distance in

M

between pairs of special distributions.

Distance in Non-totally Geodesic Submanifolds
Submanifold	Distance
$\begin{matrix} M_{Σ} & = & \{\begin{matrix} p_{θ}; θ = (μ, Σ), \\ Σ = Σ_{0} \in P_{n} (R) constant \end{matrix}\}, \\ θ_{i} & = & (μ_{i}, Σ_{0}) \end{matrix}$	$d_{Σ} (θ_{1}, θ_{2}) = \sqrt{{(μ_{1} - μ_{2})}^{t} Σ_{0}^{- 1} (μ_{1} - μ_{2})}$
$\begin{matrix} M_{D} & = & \{\begin{matrix} p_{θ}; θ = (μ, Σ), \\ Σ = diag (σ_{1}^{2}, σ_{2}^{2}, \dots, σ_{n}^{2}) \end{matrix}\}, \\ θ_{i} & = & (μ_{1 i}, σ_{1 i}, μ_{2 i}, σ_{2 i}, \dots, μ_{n i}, σ_{n i}) \end{matrix}$	$d_{D} (θ_{1}, θ_{2}) = \sqrt{\sum_{i = 1}^{n} d_{F}^{2} ((μ_{1 i}, σ_{1 i}), (μ_{2 i}, σ_{2 i}))}$
Distance in Totally Geodesic Submanifolds
$\begin{matrix} M_{μ} & = & \{\begin{matrix} p_{θ}; θ = (μ, Σ), \\ μ = μ_{0} \in R^{n} contant \end{matrix}\}, \\ θ_{i} & = & (μ_{0}, Σ_{i}) \end{matrix}$	$d_{F} (θ_{1}, θ_{2}) = \sqrt{\frac{1}{2} \sum_{i = 1}^{n} {[log (λ_{i})]}^{2}}$ , where $λ_{i}$ are the eigenvalues of $Σ_{1}^{- 1 / 2} Σ_{2} Σ_{1}^{- 1 / 2}$
$\begin{matrix} M_{D μ} & = & \{\begin{matrix} {p_{θ}; θ = (μ, Σ), \\ μ is an eigenvector of \\ Σ = diag (σ_{1}^{2}, σ_{2}^{2}, \dots, σ_{n}^{2})} \end{matrix}\}, \\ θ_{i} & = & (μ_{1 i}, σ_{1 i}, σ_{2 i}, \dots, σ_{n i}) \end{matrix}$	$d_{D μ} (θ_{1}, θ_{2}) = (d_{F}^{2} ((μ_{11}, σ_{11}), (μ_{21}, σ_{21})) +$ $\sum_{i = 2}^{n} d_{F}^{2} ((0, σ_{1 i}), (0, σ_{2 i})))^{1 / 2}$
Distance Between Special Distributions in $M$
Distributions with Common Covariance Matrices, $θ_{i} = (μ_{i}, Σ_{0})$	$d_{F} (θ_{1}, θ_{2}) = d_{D μ} ((0, D), (\| μ_{2} - μ_{1} \| e_{1}, D))$ , where P is an orthogonal matrix such that $P (μ_{2} - μ_{1}) = \| μ_{2} - μ_{1} \| e_{1}$ and $P Σ P^{t} = U D U^{t}$
Mirrored Distributions, $θ_{1} = (μ_{1} e_{1}, Σ_{1})$ and $θ_{2} = (μ_{2} e_{1}, Σ_{2})$ , with $Σ_{2} = M_{1} Σ_{1} M_{1}$	$d_{F} (θ_{1}, θ_{2}) = 2 \sqrt{\sum_{l = 1}^{n - 1} b_{l}^{2} + x^{2}}$ , where x and $b_{i}$ are obtained by the solution of Equation (63)

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Pinele, J.; Strapasson, J.E.; Costa, S.I.R. The Fisher–Rao Distance between Multivariate Normal Distributions: Special Cases, Bounds and Applications. Entropy 2020, 22, 404. https://0-doi-org.brum.beds.ac.uk/10.3390/e22040404

AMA Style

Pinele J, Strapasson JE, Costa SIR. The Fisher–Rao Distance between Multivariate Normal Distributions: Special Cases, Bounds and Applications. Entropy. 2020; 22(4):404. https://0-doi-org.brum.beds.ac.uk/10.3390/e22040404

Chicago/Turabian Style

Pinele, Julianna, João E. Strapasson, and Sueli I. R. Costa. 2020. "The Fisher–Rao Distance between Multivariate Normal Distributions: Special Cases, Bounds and Applications" Entropy 22, no. 4: 404. https://0-doi-org.brum.beds.ac.uk/10.3390/e22040404

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

The Fisher–Rao Distance between Multivariate Normal Distributions: Special Cases, Bounds and Applications

Abstract

1. Introduction

2. The Fisher–Rao Distance in the Multivariate Normal Distribution Space: Special Submanifolds and Bounds

2.1. Closed Forms for the Fisher–Rao Distance in Submanifolds of $M$

2.1.1. The Submanifold $M_{Σ}$ Where $Σ$ Is Constant

2.1.2. The Submanifold $M_{μ}$ Where $μ$ Is Constant

2.1.3. The Submanifold $M_{D}$ Where $Σ$ Is Diagonal

2.1.4. The Submanifold $M_{D μ}$ Where $Σ$ Is Diagonal and $μ$ Is an Eigenvector of $Σ$

2.2. Bounds for the Fisher–Rao in $M$

2.2.1. A Lower Bound

2.2.2. The Upper Bound $U B_{1}$

2.2.3. The Upper Bounds $U B_{2}$ and $U B_{3}$

2.2.4. Comparisons of the Bounds

3. Fisher–Rao Distance Between Special Distributions

3.1. The Fisher–Rao Distance Between Distributions with Common Covariance Matrices

3.2. The Fisher–Rao Distance Between Mirrored Distributions

4. Hierarchical Clustering for Diagonal Gaussian Mixture Simplification

4.1. Centroids in the Submanifold $M_{D}$

4.2. Hierarchical Clustering Algorithm

4.3. Experiments in Image Segmentation

5. Concluding Remarks

Author Contributions

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Article Menu

The Fisher–Rao Distance between Multivariate Normal Distributions: Special Cases, Bounds and Applications

Abstract

1. Introduction

2. The Fisher–Rao Distance in the Multivariate Normal Distribution Space: Special Submanifolds and Bounds

2.1. Closed Forms for the Fisher–Rao Distance in Submanifolds of M

2.1.1. The Submanifold M Σ Where Σ Is Constant

2.1.2. The Submanifold M μ Where μ Is Constant

2.1.3. The Submanifold M D Where Σ Is Diagonal

2.1.4. The Submanifold M D μ Where Σ Is Diagonal and μ Is an Eigenvector of Σ

2.2. Bounds for the Fisher–Rao in M

2.2.1. A Lower Bound

2.2.2. The Upper Bound U B 1

2.2.3. The Upper Bounds U B 2 and U B 3

2.2.4. Comparisons of the Bounds

3. Fisher–Rao Distance Between Special Distributions

3.1. The Fisher–Rao Distance Between Distributions with Common Covariance Matrices

3.2. The Fisher–Rao Distance Between Mirrored Distributions

4. Hierarchical Clustering for Diagonal Gaussian Mixture Simplification

4.1. Centroids in the Submanifold M D

4.2. Hierarchical Clustering Algorithm

4.3. Experiments in Image Segmentation

5. Concluding Remarks

Author Contributions

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

2.1. Closed Forms for the Fisher–Rao Distance in Submanifolds of $M$

2.1.1. The Submanifold $M_{Σ}$ Where $Σ$ Is Constant

2.1.2. The Submanifold $M_{μ}$ Where $μ$ Is Constant

2.1.3. The Submanifold $M_{D}$ Where $Σ$ Is Diagonal

2.1.4. The Submanifold $M_{D μ}$ Where $Σ$ Is Diagonal and $μ$ Is an Eigenvector of $Σ$

2.2. Bounds for the Fisher–Rao in $M$

2.2.2. The Upper Bound $U B_{1}$

2.2.3. The Upper Bounds $U B_{2}$ and $U B_{3}$

4.1. Centroids in the Submanifold $M_{D}$