Article

The Fisher–Rao Distance between Multivariate Normal Distributions: Special Cases, Bounds and Applications

by Julianna Pinele 1,*, João E. Strapasson 2 and Sueli I. R. Costa 3
1 Center of Exact and Technological Sciences, University of Reconcavo of Bahia, Cruz das Almas 44380-000, Brazil
2 School of Applied Sciences, University of Campinas, Limeira 13484-350, Brazil
3 Institute of Mathematics, University of Campinas, Campinas 13083-859, Brazil
* Author to whom correspondence should be addressed.
Submission received: 26 January 2020 / Revised: 6 March 2020 / Accepted: 11 March 2020 / Published: 1 April 2020

Abstract: The Fisher–Rao distance is a measure of dissimilarity between probability distributions which, under certain regularity conditions of the statistical model, is, up to a scaling factor, the unique Riemannian metric invariant under Markov morphisms. It is related to the Shannon entropy and has been used to enlarge the perspective of analysis in a wide variety of domains such as image processing, radar systems, and morphological classification. Here, we approach this metric in the statistical model of multivariate normal probability distributions, for which there is no explicit expression in general, by gathering known results (closed forms for submanifolds and bounds) and deriving expressions for the distance between distributions with the same covariance matrix and between distributions with mirrored covariance matrices. An application of the Fisher–Rao distance to the simplification of Gaussian mixtures using the hierarchical clustering algorithm is also presented.

1. Introduction

A proper measure of dissimilarity between probability distributions is required in many problems and applications. The Fisher–Rao distance is a very special metric for statistical models of probability distributions. This distance is invariant under reparametrization of the sample space and covariant under reparametrization of the parameter space [1]. Moreover, the Fisher–Rao metric is preserved under Markov morphisms and, under certain conditions, it is, up to a scaling factor, the unique Riemannian metric with this property [2,3]. Markov morphisms are associated with the notion of statistical sufficiency, which expresses the criterion of passing from one statistical model to another with no loss of information [4,5,6]. It is therefore natural to require the invariance of the geometric structures of statistical models under Markov morphisms. Between the finite sample space simplex models $S^{n-1}$ and $S^{l-1}$, where $S^{k-1} = \{p \in \mathbb{R}^k;\ p_i \geq 0 \text{ and } \sum_{i=1}^k p_i = 1\}$, a Markov morphism is a linear map $T_Q(x) = xQ$, where $Q \in \mathbb{R}^{n\times l}$, with $n \leq l$, is a matrix with non-negative entries such that every row sums to 1 and every column has precisely one non-zero element. The mapping $T_Q$ corresponds to a probabilistic refinement of the event space $\{1,\ldots,n\} \to \{1,\ldots,l\}$, where the refinement $i \to j$ occurs with probability $Q_{ij}$ [7]. Chentsov [8,9] proved the uniqueness and invariance property of the Fisher–Rao metric under Markov morphisms for finite sample spaces. The extension of this result to more general statistical models requires careful formulations of statistical sufficiency and Markov morphisms and has evolved since then [3,10]. More recently, in [5,6], this uniqueness of the Fisher–Rao metric was shown under an assumption of strong continuity of the information metric.
After previous papers [11,12,13] connecting geometry and statistics, C. R. Rao, in an independent landmark paper [14], considered statistical models with the metric induced by the information matrix defined by R. Fisher in 1921 [15]. This work encouraged several authors to calculate the Fisher–Rao distance for other probability distributions [16,17,18], and it also stimulated approaches to other dissimilarity measures, such as the Kullback–Leibler divergence [19] and the total variation and Wasserstein distances [20]. Amari [3,4,21] unified information geometry theory by organizing and introducing other concepts regarding statistical models [2].
An explicit form for the Fisher–Rao distance in the univariate normal distribution space is known via an association with the classical model of the hyperbolic plane [14,16,18,22]. It was applied to the quantization of hyperspectral images [23] and to the space of projected lines in paracatadioptric images [24]. This Fisher–Rao model was also used to simplify Gaussian mixtures through the k-means method [25] and a hierarchical clustering technique [26].
An expression for the geodesic curve (initial value problem) in the multivariate normal distribution space was derived in [27] and in [28]. However, calculating the Fisher–Rao distance requires solving non-trivial differential equations under boundary conditions to find the geodesic connecting two distributions and then computing the integral along this geodesic. A closed form for this distance in the general case is still an open problem. Expressions for the distance are known only in special cases [16,17,18].
The Fisher–Rao distance between multivariate normal distributions in specific cases, such as distributions with a common mean, was considered in diffusion tensor image analysis [29,30,31], in color texture discrimination in several classification experiments [32], in the problem of distributed estimation fusion with unknown correlations [33], and in machine learning techniques [34]. In [35,36], the authors described shapes representing landmarks by a Gaussian model with diagonal covariance matrices and used the Fisher–Rao distance to quantify the difference between two shapes. In [17], this model was applied to statistical inference. Bounds for the Fisher–Rao distance were used in tracking quality monitoring [37].
This paper is organized as follows. In Section 2, we gather known results (closed forms for special cases and bounds) for the Fisher–Rao distance between multivariate normal distributions. In Section 3, we describe a closed form for the Fisher–Rao distance between distributions with the same covariance matrix and a non-linear system to find the distance between distributions with mirrored covariance matrices. An application of the Fisher–Rao distance to the simplification of Gaussian mixtures using the hierarchical clustering algorithm is presented in Section 4. Some conclusions and perspectives are drawn in Section 5.

2. The Fisher–Rao Distance in the Multivariate Normal Distribution Space: Special Submanifolds and Bounds

In this section, as in [38], we summarize previous results regarding the Fisher–Rao distance in the space of multivariate normal distributions including closed forms for this distance restricted to submanifolds and general bounds.
Given a statistical model $S = \{p_\theta = p(x;\theta);\ \theta = (\theta_1,\theta_2,\ldots,\theta_k) \in \Theta \subseteq \mathbb{R}^k\}$, a natural Riemannian structure [21] can be provided by the Fisher information matrix $G(\theta) = [g_{ij}(\theta)]$:
$$g_{ij}(\theta) = E_\theta\!\left[\frac{\partial}{\partial\theta_i}\log p(x;\theta)\,\frac{\partial}{\partial\theta_j}\log p(x;\theta)\right] = \int \frac{\partial}{\partial\theta_i}\log p(x;\theta)\,\frac{\partial}{\partial\theta_j}\log p(x;\theta)\,p(x;\theta)\,dx,$$
where $E_\theta$ is the expected value with respect to the distribution $p_\theta$. This matrix can also be viewed as the Hessian matrix of the Shannon entropy (a concave function) [39],
$$H(p) = -\int p(x;\theta)\log p(x;\theta)\,dx,$$
and is used to establish connections between inequalities in information theory and geometrical inequalities.
The Fisher–Rao distance, $d_F(\cdot,\cdot)$, between two distributions $p_{\theta_1}$ and $p_{\theta_2}$ in $S$, identified with their parameters $\theta_1$ and $\theta_2$, is given by the shortest length of a curve $\gamma(t)$ in the parameter space $\Theta$ connecting these distributions: $d_F(p_{\theta_1},p_{\theta_2}) \equiv d_F(\theta_1,\theta_2) = \min_\gamma \int \|\gamma'(t)\|_{G}\,dt$, where $\|\gamma'(t)\|_{G} = \sqrt{\gamma'(t)^t\,G(\theta)\,\gamma'(t)}$. Note that this is in fact a metric, since for any $\theta_1$, $\theta_2$, and $\theta_3$ in $\Theta$, we have: (i) $d_F(\theta_1,\theta_2) \geq 0$, and $d_F(\theta_1,\theta_2) = 0$ if and only if $\theta_1 = \theta_2$; (ii) $d_F(\theta_1,\theta_2) = d_F(\theta_2,\theta_1)$; (iii) $d_F(\theta_1,\theta_2) \leq d_F(\theta_1,\theta_3) + d_F(\theta_3,\theta_2)$. A curve that provides the shortest length is called a geodesic and is given by the solutions of the differential equations:
$$\frac{d^2\theta_m}{dt^2} + \sum_{i,j}\Gamma^m_{ij}\frac{d\theta_i}{dt}\frac{d\theta_j}{dt} = 0, \quad m = 1,\ldots,k,$$
where the $\Gamma^m_{ij}$ are the Christoffel symbols,
$$\Gamma^m_{ij} = \frac{1}{2}\sum_l\left(\frac{\partial g_{jl}}{\partial\theta_i} + \frac{\partial g_{li}}{\partial\theta_j} - \frac{\partial g_{ij}}{\partial\theta_l}\right)g^{lm},$$
and $[g^{ij}]$ is the inverse of the Fisher information matrix.
We consider here the space of multivariate normal distributions given by:
$$p(x;\mu,\Sigma) = \frac{1}{\sqrt{(2\pi)^n\det(\Sigma)}}\exp\!\left(-\frac{(x-\mu)^t\Sigma^{-1}(x-\mu)}{2}\right),$$
where $x^t = (x_1,\ldots,x_n) \in \mathbb{R}^n$ is the variable vector, $\mu^t = (\mu_1,\ldots,\mu_n) \in \mathbb{R}^n$ is the mean vector, and $\Sigma$ is the covariance matrix in $P_n(\mathbb{R})$, the space of order $n$ positive definite symmetric matrices.
In this case, the model $S = M = \{p_\theta;\ \theta = (\mu,\Sigma) \in \mathbb{R}^n \times P_n(\mathbb{R})\}$ is a statistical manifold of dimension $k = n + \frac{n(n+1)}{2}$. Considering a parametrization $(\mu,\Sigma) = \phi(\theta_1,\ldots,\theta_k)$ of the model $M$, the Fisher information matrix is given by [40]:
$$g_{ij}(\theta) = \frac{\partial\mu^t}{\partial\theta_i}\,\Sigma^{-1}\,\frac{\partial\mu}{\partial\theta_j} + \frac{1}{2}\operatorname{tr}\!\left(\Sigma^{-1}\frac{\partial\Sigma}{\partial\theta_i}\,\Sigma^{-1}\frac{\partial\Sigma}{\partial\theta_j}\right).$$
The metric provided by this matrix is invariant under affine transformations. In other words, for any $(c,Q) \in \mathbb{R}^n \times GL_n(\mathbb{R})$, where $GL_n(\mathbb{R})$ is the group of non-singular $n$-square matrices, the mapping:
$$\psi_{(c,Q)}: M \to M, \qquad (\mu,\Sigma) \mapsto (Q\mu + c,\ Q\Sigma Q^t)$$
is an isometry in $M$ [16]. Consequently, the Fisher–Rao distance between $\theta_1 = (\mu_1,\Sigma_1)$ and $\theta_2 = (\mu_2,\Sigma_2)$ in $M$ satisfies:
$$d_F(\theta_1,\theta_2) = d_F\big((Q\mu_1 + c,\ Q\Sigma_1Q^t),\ (Q\mu_2 + c,\ Q\Sigma_2Q^t)\big)$$
for any $(c,Q) \in \mathbb{R}^n \times GL_n(\mathbb{R})$. In particular, for $Q = \Sigma_1^{-1/2}$ and $c = -\Sigma_1^{-1/2}\mu_1$, with $\theta_3 = (\mu_3,\Sigma_3) = \big(\Sigma_1^{-1/2}(\mu_2-\mu_1),\ \Sigma_1^{-1/2}\Sigma_2\Sigma_1^{-1/2}\big)$, the Fisher–Rao distance admits the form:
$$d_F(\theta_1,\theta_2) = d_F(\theta_0,\theta_3),$$
where $\theta_0 = (0,I_n)$, $I_n$ is the $n$-order identity matrix, and $0 \in \mathbb{R}^n$ is the null vector.
The geodesic equations in $M$ can be expressed as [17]:
$$\frac{d^2\mu}{dt^2} - \frac{d\Sigma}{dt}\,\Sigma^{-1}\frac{d\mu}{dt} = 0, \qquad \frac{d^2\Sigma}{dt^2} + \frac{d\mu}{dt}\left(\frac{d\mu}{dt}\right)^{\!t} - \frac{d\Sigma}{dt}\,\Sigma^{-1}\frac{d\Sigma}{dt} = 0,$$
and can be partially integrated [27]:
$$\frac{d\mu}{dt} = \Sigma x, \qquad \frac{d\Sigma}{dt} = \Sigma(B - x\mu^t),$$
or, in the parameters $(\delta(t),\Delta(t)) = (\Sigma^{-1}(t)\mu(t),\ \Sigma^{-1}(t))$,
$$\frac{d\Delta}{dt} = -B\Delta + x\delta^t, \qquad \frac{d\delta}{dt} = -B\delta + (1 + \delta^t\Delta^{-1}\delta)\,x,$$
where $x \in \mathbb{R}^n$ and $B$ is a symmetric matrix. The initial conditions for this problem can be taken as:
$$(\delta(0),\Delta(0)) = (0,I_n), \qquad \left(\frac{d\delta}{dt}(0),\ \frac{d\Delta}{dt}(0)\right) = (x,-B).$$
Eriksen [27] and Calvo and Oller [28], in independent works, solved this initial value problem. An explicit solution for the geodesic curve in $M$ [28] is:
$$\begin{aligned}
\delta(t) &= B\big(\cosh(tG)-I_n\big)(G^-)^2\,x + \sinh(tG)\,G^-\,x,\\
\Delta(t) &= I_n + \tfrac{1}{2}\big(\cosh(tG)-I_n\big) + \tfrac{1}{2}B\big(\cosh(tG)-I_n\big)(G^-)^2B - \tfrac{1}{2}\sinh(tG)\,G^-\,B - \tfrac{1}{2}B\,\sinh(tG)\,G^-,
\end{aligned}$$
where $I_n$ is the $n$-order identity matrix, $G^2 = B^2 + 2xx^t$, and $G^-$ is a generalized inverse of $G$, that is, $GG^-G = G$.
Since the geodesic curve has constant speed, given $(x,B)$ in the tangent space of $M$, the Fisher–Rao distance between $(0,I_n)$ and $(\delta(1),\Delta(1))$ is:
$$\int_0^1 \sqrt{\frac{d\mu}{dt}(0)^t\,\Sigma^{-1}(0)\,\frac{d\mu}{dt}(0) + \frac{1}{2}\operatorname{tr}\!\left(\left(\Sigma^{-1}(0)\frac{d\Sigma}{dt}(0)\right)^{\!2}\right)}\;dt = \sqrt{\frac{1}{2}\operatorname{tr}(B^2) + |x|^2},$$
where $|\cdot|$ is the standard Euclidean norm. Note that the above expression provides the Fisher–Rao distance between two distributions only if we can determine the initial value problem from the boundary conditions, which is usually very difficult.
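For concreteness, the following is a minimal Python sketch of Equations (15) and (16); the function names, the eigendecomposition-based construction of $G$, and the choice of the Moore–Penrose inverse as $G^-$ are ours, not taken from [27,28]:

```python
import numpy as np

def calvo_oller_geodesic(x, B, t):
    """Evaluate (delta(t), Delta(t)) of Equation (15) for the geodesic
    leaving (0, I_n) with parameters (x, B)."""
    n = x.size
    I = np.eye(n)
    # G is the PSD square root of B^2 + 2 x x^t; G^- a generalized inverse.
    w, V = np.linalg.eigh(B @ B + 2.0 * np.outer(x, x))
    g = np.sqrt(np.clip(w, 0.0, None))
    g_inv = np.array([1.0 / s if s > 1e-12 else 0.0 for s in g])
    Gm = V @ np.diag(g_inv) @ V.T                  # generalized inverse G^-
    C = V @ np.diag(np.cosh(t * g)) @ V.T - I      # cosh(tG) - I_n
    S = V @ np.diag(np.sinh(t * g)) @ V.T          # sinh(tG)
    delta = B @ C @ Gm @ Gm @ x + S @ Gm @ x
    Delta = (I + 0.5 * C + 0.5 * B @ C @ Gm @ Gm @ B
             - 0.5 * S @ Gm @ B - 0.5 * B @ S @ Gm)
    return delta, Delta

def fr_length(x, B):
    """Fisher-Rao distance between (0, I_n) and (delta(1), Delta(1)), Equation (16)."""
    return np.sqrt(0.5 * np.trace(B @ B) + x @ x)
```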
Han and Park in [31] presented a numerical shooting method for computing the minimum geodesic distance between two normal distributions, through parallel transport of a vector field defined along the geodesic curve given in Equation (15).
A closed form for the Fisher–Rao distance between two normal distributions in M is still an important open question. Next, we present closed forms for this distance in some submanifolds of M .

2.1. Closed Forms for the Fisher–Rao Distance in Submanifolds of M

In this subsection, we consider submanifolds $M' \subset M$ with the distance induced by the Fisher–Rao metric in $M$. It is important to remark that, in general, given two distributions $\theta_1$ and $\theta_2$ in $M'$, the distance between $\theta_1$ and $\theta_2$ restricted to the submanifold $M'$ is greater than or equal to the distance between $\theta_1$ and $\theta_2$ in $M$, that is, $d_{M'}(\theta_1,\theta_2) \geq d_M(\theta_1,\theta_2)$. This is due to the fact that, to get $d_{M'}$, we consider the minimum length over a restricted set of curves, namely the ones contained in the submanifold $M'$. We say that $M'$ is totally geodesic if and only if $d_{M'}(\theta_1,\theta_2) = d_M(\theta_1,\theta_2)$ for any $\theta_1,\theta_2 \in M'$, which means that the geodesic in $M$ connecting $\theta_1$ and $\theta_2$ is contained in $M'$.

2.1.1. The Submanifold $M_\Sigma$ Where $\Sigma$ Is Constant

In the $n$-dimensional manifold composed of multivariate normal distributions with a common covariance matrix, $M_\Sigma = \{p_\theta;\ \theta = (\mu,\Sigma),\ \Sigma = \Sigma_0 \in P_n(\mathbb{R}) \text{ constant}\}$, the Fisher–Rao distance between two distributions $\theta_1 = (\mu_1,\Sigma_0)$ and $\theta_2 = (\mu_2,\Sigma_0)$ is [18]:
$$d_\Sigma(\theta_1,\theta_2) = \sqrt{(\mu_1-\mu_2)^t\,\Sigma_0^{-1}(\mu_1-\mu_2)}.$$
This is the Mahalanobis distance [11], which equals the Euclidean distance between the images of $\mu_1$ and $\mu_2$ under the transformation $\mu \mapsto P^{-1}\mu$, where $\Sigma_0 = PP^t$ is the Cholesky decomposition [18]. This distance was one of the first dissimilarity measures between datasets with some correlation. Note that this submanifold is not totally geodesic, as can be seen even in the space of univariate normal distributions [22] and in Example 1 in the next section.
A geodesic curve $\gamma_\Sigma(t)$ in $M_\Sigma$ connecting $\theta_1$ and $\theta_2$ is given by:
$$\gamma_\Sigma(t) = \big((1-t)\mu_1 + t\mu_2,\ \Sigma_0\big).$$
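A direct transcription of Equations (17) and (18) in Python (a sketch; the function names are ours):

```python
import numpy as np

def mahalanobis(mu1, mu2, Sigma0):
    """d_Sigma of Equation (17), computed as the Euclidean norm of
    P^{-1}(mu1 - mu2), where Sigma0 = P P^t is the Cholesky decomposition."""
    P = np.linalg.cholesky(Sigma0)
    return np.linalg.norm(np.linalg.solve(P, mu1 - mu2))

def geodesic_common_cov(mu1, mu2, Sigma0, t):
    """The geodesic of Equation (18) inside the submanifold M_Sigma."""
    return (1.0 - t) * mu1 + t * mu2, Sigma0
```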

2.1.2. The Submanifold $M_\mu$ Where $\mu$ Is Constant

A totally geodesic submanifold of $M$ is given by $M_\mu = \{p_\theta;\ \theta = (\mu,\Sigma),\ \mu = \mu_0 \in \mathbb{R}^n \text{ constant}\}$, of dimension $\frac{n(n+1)}{2}$, composed of distributions that have the same mean vector $\mu_0$. The Fisher–Rao distance in $M_\mu$ was studied by several authors in different contexts [16,18,30,41]; for $\theta_1 = (\mu_0,\Sigma_1)$ and $\theta_2 = (\mu_0,\Sigma_2)$, it is given by:
$$d_F(\theta_1,\theta_2) = \sqrt{\frac{1}{2}\sum_{i=1}^n[\log(\lambda_i)]^2},$$
where $0 < \lambda_1 \leq \lambda_2 \leq \ldots \leq \lambda_n$ are the eigenvalues of $\Sigma_1^{-1/2}\Sigma_2\Sigma_1^{-1/2}$.
An expression for the geodesic curve connecting these two distributions is [30]:
$$\gamma_\mu(t) = \left(\mu_0,\ \Sigma_1^{1/2}\exp\!\big(t\log(\Sigma_1^{-1/2}\Sigma_2\Sigma_1^{-1/2})\big)\,\Sigma_1^{1/2}\right).$$
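Numerically, the $\lambda_i$ in Equation (19) are the generalized eigenvalues of the pencil $(\Sigma_2,\Sigma_1)$, which avoids forming $\Sigma_1^{-1/2}$ explicitly. A sketch (the function name is ours):

```python
import numpy as np
from scipy.linalg import eigvalsh

def fr_common_mean(Sigma1, Sigma2):
    """d_F of Equation (19): the eigenvalues of Sigma1^{-1/2} Sigma2 Sigma1^{-1/2}
    coincide with the generalized eigenvalues of Sigma2 v = lambda Sigma1 v."""
    lam = eigvalsh(Sigma2, Sigma1)
    return np.sqrt(0.5 * np.sum(np.log(lam) ** 2))
```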

2.1.3. The Submanifold $M_D$ Where $\Sigma$ Is Diagonal

Let $M_D = \{p_\theta;\ \theta = (\mu,\Sigma),\ \Sigma = \operatorname{diag}(\sigma_1^2,\sigma_2^2,\ldots,\sigma_n^2),\ \sigma_i > 0,\ i = 1,\ldots,n\}$ be the submanifold of $M$ composed of distributions with a diagonal covariance matrix. If we consider the parameter $\theta = (\mu_1,\sigma_1,\mu_2,\sigma_2,\ldots,\mu_n,\sigma_n)$, it can be shown [22] that the metric in the parametric space of $M_D$ is the product metric, so that:
$$d_D(\theta_1,\theta_2) = \sqrt{\sum_{i=1}^n d_F^2\big((\mu_{1i},\sigma_{1i}),(\mu_{2i},\sigma_{2i})\big)},$$
where $d_F$ is the Fisher–Rao distance in the univariate case, given by [22]:
$$d_F\big((\mu_1,\sigma_1),(\mu_2,\sigma_2)\big) = \sqrt{2}\,\log\frac{\left|\left(\frac{\mu_1}{\sqrt{2}},\sigma_1\right) - \left(\frac{\mu_2}{\sqrt{2}},-\sigma_2\right)\right| + \left|\left(\frac{\mu_1}{\sqrt{2}},\sigma_1\right) - \left(\frac{\mu_2}{\sqrt{2}},\sigma_2\right)\right|}{\left|\left(\frac{\mu_1}{\sqrt{2}},\sigma_1\right) - \left(\frac{\mu_2}{\sqrt{2}},-\sigma_2\right)\right| - \left|\left(\frac{\mu_1}{\sqrt{2}},\sigma_1\right) - \left(\frac{\mu_2}{\sqrt{2}},\sigma_2\right)\right|}.$$
In this space, a curve $\gamma_D(t) = (\gamma_1(t),\ldots,\gamma_n(t))$ is a geodesic if, and only if, each $\gamma_i(t)$ is a geodesic curve in the univariate case, $i = 1,\ldots,n$. The geodesic curves in the univariate normal distribution space (upper half-plane $\mathbb{R}\times\mathbb{R}^+$) are half-vertical lines and half-ellipses centered at $\sigma = 0$ with eccentricity $\frac{1}{\sqrt{2}}$ [22].
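Equation (22) and the product form (21) translate directly into code; the sketch below (names are ours) is reused by the later examples:

```python
import numpy as np

def fr_univariate(mu1, sigma1, mu2, sigma2):
    """Univariate Fisher-Rao distance, Equation (22)."""
    p = np.hypot((mu1 - mu2) / np.sqrt(2.0), sigma1 + sigma2)
    q = np.hypot((mu1 - mu2) / np.sqrt(2.0), sigma1 - sigma2)
    return np.sqrt(2.0) * np.log((p + q) / (p - q)) if q > 0 else 0.0

def fr_diagonal(mu1, sig1, mu2, sig2):
    """d_D of Equation (21); mu*, sig* are arrays of means and standard deviations."""
    return np.sqrt(sum(fr_univariate(a, b, c, d) ** 2
                       for a, b, c, d in zip(mu1, sig1, mu2, sig2)))
```

Note that $p > q$ always holds when the two parameter pairs differ, since $p^2 - q^2 = 4\sigma_1\sigma_2 > 0$, so the logarithm is well defined.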
It is important to note that $M_D \subset M$ is not totally geodesic. The submanifold of $M_D$ composed only of normal distributions with covariance matrices that are multiples of the identity (round normals) is totally geodesic [22]. In fact, this submanifold of round normals is also contained in the totally geodesic submanifold described next.

2.1.4. The Submanifold $M_{D\mu}$ Where $\Sigma$ Is Diagonal and $\mu$ Is an Eigenvector of $\Sigma$

Let $M_{D\mu}$ be the $(n+1)$-dimensional submanifold composed of distributions with mean vector $\mu = \mu_1e_i$ for some $e_i \in \{e_1,\ldots,e_n\}$ (the canonical basis of $\mathbb{R}^n$) and diagonal covariance matrix $\Sigma$; without loss of generality, we shall assume that $e_i = e_1$. An analytic expression for the distance in $M_{D\mu}$ is:
$$d_{D\mu}^2(\theta_1,\theta_2) = d_F^2\big((\mu_{11},\sigma_{11}),(\mu_{21},\sigma_{21})\big) + \sum_{i=2}^n d_F^2\big((0,\sigma_{1i}),(0,\sigma_{2i})\big).$$
We proved in [42] that this submanifold is totally geodesic.
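Equation (24) combines the univariate pieces in the same way; a sketch reusing fr_univariate from the block after Equation (22):

```python
import numpy as np  # fr_univariate as defined in the sketch after Equation (22)

def fr_diag_mean_eigvec(mu11, sig1, mu21, sig2):
    """d_{D mu} of Equation (24): means mu11*e_1 and mu21*e_1,
    diagonal covariances with standard deviations sig1, sig2."""
    total = fr_univariate(mu11, sig1[0], mu21, sig2[0]) ** 2
    total += sum(fr_univariate(0.0, a, 0.0, b) ** 2
                 for a, b in zip(sig1[1:], sig2[1:]))
    return np.sqrt(total)
```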

2.2. Bounds for the Fisher–Rao Distance in $M$

As mentioned, a closed form for the Fisher–Rao distance between two general normal distributions is not known. In this subsection, we present some bounds for this distance.

2.2.1. A Lower Bound

Calvo and Oller [43] derived a lower bound for the Fisher–Rao distance through an isometric embedding of $M$ into the manifold of positive definite matrices.
Proposition 1 ([43]). Given $\theta_1 = (\mu_1,\Sigma_1)$ and $\theta_2 = (\mu_2,\Sigma_2)$, let:
$$S_i = \begin{pmatrix}\Sigma_i + \mu_i\mu_i^t & \mu_i\\ \mu_i^t & 1\end{pmatrix}, \quad i = 1,2.$$
A lower bound for the distance between $\theta_1$ and $\theta_2$ is:
$$LB(\theta_1,\theta_2) = \sqrt{\frac{1}{2}\sum_{i=1}^{n+1}[\log(\lambda_i)]^2},$$
where the $\lambda_i$, $1 \leq i \leq n+1$, are the eigenvalues of $S_1^{-1/2}S_2S_1^{-1/2}$.
We note that this bound satisfies the distance properties in $M$. In [44], through a similar approach, a lower bound for the Fisher–Rao distance was obtained in the more general space of elliptical distributions; restricted to normal distributions, it coincides with the above bound.
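A sketch of Proposition 1 (names are ours); the embedding matrices $S_i$ are built verbatim from Equation (25) and the $\lambda_i$ computed as generalized eigenvalues:

```python
import numpy as np
from scipy.linalg import eigvalsh

def lower_bound(mu1, Sigma1, mu2, Sigma2):
    """Calvo-Oller lower bound LB of Proposition 1."""
    def embed(mu, Sigma):
        n = mu.size
        S = np.empty((n + 1, n + 1))
        S[:n, :n] = Sigma + np.outer(mu, mu)
        S[:n, n] = S[n, :n] = mu
        S[n, n] = 1.0
        return S
    lam = eigvalsh(embed(mu2, Sigma2), embed(mu1, Sigma1))
    return np.sqrt(0.5 * np.sum(np.log(lam) ** 2))
```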

2.2.2. The Upper Bound $UB_1$

In [45], we proposed an upper bound based on the isometry (8) in the manifold $M$ and on the distance in the non-totally geodesic submanifold $M_D$ (21), as follows:
Proposition 2 ([45]). The Fisher–Rao distance between two multivariate normal distributions $\theta_1 = (\mu_1,\Sigma_1)$ and $\theta_2 = (\mu_2,\Sigma_2)$ is upper bounded by:
$$UB_1(\theta_1,\theta_2) = \sqrt{\sum_{i=1}^n d_F^2\big((0,1),(\mu_i,\sqrt{\lambda_i})\big)},$$
where the $\lambda_i$ are the diagonal terms of the matrix $\Lambda$ given by the eigenvalues of $A = \Sigma_1^{-1/2}\Sigma_2\Sigma_1^{-1/2} = Q\Lambda Q^t$, the $\mu_i$ are the coordinates of $\mu = Q^t\Sigma_1^{-1/2}(\mu_2-\mu_1)$, $Q$ is the orthogonal matrix whose columns are the eigenvectors of $A$, and $d_F$ is the Fisher–Rao distance between univariate normal distributions given in Equation (22).
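A sketch of $UB_1$ reusing fr_univariate from the block after Equation (22); since the $\lambda_i$ are variances of the whitened distribution, we pass their square roots as standard deviations (our reading of the notation):

```python
import numpy as np
from scipy.linalg import eigh  # fr_univariate as in the sketch after Equation (22)

def upper_bound_1(mu1, Sigma1, mu2, Sigma2):
    """UB1 of Equation (26)."""
    w, V = np.linalg.eigh(Sigma1)
    W = V @ np.diag(w ** -0.5) @ V.T                 # Sigma1^{-1/2}
    lam, Q = eigh(W @ Sigma2 @ W)                    # A = Q Lambda Q^t
    mu = Q.T @ (W @ (mu2 - mu1))
    return np.sqrt(sum(fr_univariate(0.0, 1.0, m, np.sqrt(l)) ** 2
                       for m, l in zip(mu, lam)))
```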

2.2.3. The Upper Bounds $UB_2$ and $UB_3$

Considering the Fisher–Rao distance in the totally geodesic submanifold $M_{D\mu}$ and the triangle inequality, we proposed another upper bound [42].
Given $\theta_1 = (\mu_1,\Sigma_1)$ and $\theta_2 = (\mu_2,\Sigma_2)$, we consider the Fisher–Rao distance between $\theta_0 = (0,I_n)$ and $\theta_3 = (\mu_3,\Sigma_3)$ as in Equation (10). Let $\bar\theta = (\bar\mu,\bar\Sigma)$; by the triangle inequality, it follows that:
$$d_F(\theta_0,\theta_3) \leq d_F(\theta_0,\bar\theta) + d_F(\bar\theta,\theta_3).$$
To calculate this bound, we choose $\bar\theta$ appropriately. For $\bar\mu = \mu_3$, note that $d_F(\bar\theta,\theta_3) = d_\mu(\bar\theta,\theta_3)$. Let $P$ be an orthogonal matrix such that $P\mu_3 = (|\mu_3|,0,\ldots,0)^t$ and let $D = \operatorname{diag}(d_1^2,d_2^2,\ldots,d_n^2)$ be a diagonal matrix. We will consider $\bar\Sigma = P^{-1}DP^{-t}$ and $\theta_P = (P\mu_3,D)$. By the isometry $\psi_{(c,Q)}$ given in Equation (9), with $Q = P$ and $c = 0$, it follows that:
$$d_F(\theta_0,\bar\theta) = d_{D\mu}(\theta_0,\theta_P).$$
Then, combining Inequality (27) and Equation (28), the right-hand side of the inequality below is an upper bound for the Fisher–Rao distance between $\theta_1$ and $\theta_2$:
$$d_F(\theta_0,\theta_3) \leq d_{D\mu}(\theta_0,\theta_P) + d_\mu(\bar\theta,\theta_3).$$
In [42], we derived the upper bound:
$$UB_2 = d_{D\mu}(\theta_0,\theta_P) + d_\mu(\bar\theta,\theta_3)$$
through a numerical minimization process, by considering the diagonal elements of $D$ as a vector that minimizes $d_{D\mu}(\theta_0,\theta_P) + d_\mu(\bar\theta,\theta_3)$:
$$(\bar d_1,\bar d_2,\ldots,\bar d_n) = \underset{(d_1,d_2,\ldots,d_n)}{\operatorname{argmin}}\ \big\{d_{D\mu}(\theta_0,\theta_P) + d_\mu(\bar\theta,\theta_3)\big\}.$$
We also derived an analytic upper bound $UB_3$ by minimizing only the distance $d_{D\mu}(\theta_0,\theta_P)$. By expressing this distance in terms of the parameters $(\theta_0,\theta_P) = \big((0,I_n),(P\mu_3,D)\big)$, we can show that it reaches its minimum at:
$$D = \operatorname{diag}\!\left(\frac{|\mu_3|^2+2}{2},\ 1,\ \ldots,\ 1\right).$$
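The analytic bound $UB_3$ assembles directly from these pieces. The sketch below is our own assembly, using a Householder reflection for $P$ and the fr_univariate / fr_common_mean functions from the earlier sketches:

```python
import numpy as np  # fr_univariate, fr_common_mean as in the earlier sketches

def upper_bound_3(mu1, Sigma1, mu2, Sigma2):
    """UB3 via D = diag((|mu3|^2 + 2)/2, 1, ..., 1), Equation (32)."""
    n = mu1.size
    w, V = np.linalg.eigh(Sigma1)
    W = V @ np.diag(w ** -0.5) @ V.T                  # Sigma1^{-1/2}
    mu3, Sigma3 = W @ (mu2 - mu1), W @ Sigma2 @ W
    r = np.linalg.norm(mu3)
    v = mu3 - r * np.eye(n)[0]                        # Householder: P mu3 = |mu3| e1
    P = np.eye(n) if np.allclose(v, 0) else np.eye(n) - 2 * np.outer(v, v) / (v @ v)
    D = np.diag([(r ** 2 + 2) / 2] + [1.0] * (n - 1))
    Sigma_bar = P.T @ D @ P                           # so that P Sigma_bar P^t = D
    return (fr_common_mean(Sigma_bar, Sigma3)
            + fr_univariate(0.0, 1.0, r, np.sqrt((r ** 2 + 2) / 2)))
```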
The lower bound of Section 2.2.1 and the upper bounds of Section 2.2.2 and Section 2.2.3 are summarized in Table 1.
Upper and lower bounds have been used to estimate the Fisher–Rao distance in applications such as [37].

2.2.4. Comparisons of the Bounds

In this section, as in [42], we illustrate comparisons between the bounds presented previously.
We consider the bivariate normal distribution model ($n = 2$) and the distributions $\theta_0$ and $\hat\theta = (\hat\mu,\hat\Sigma)$, where:
$$\hat\theta = (\hat\mu,\hat\Sigma) = \left(\begin{pmatrix}\mu\\0\end{pmatrix},\ \begin{pmatrix}\cos(\alpha) & -\sin(\alpha)\\ \sin(\alpha) & \cos(\alpha)\end{pmatrix}\begin{pmatrix}\lambda_1 & 0\\ 0 & \lambda_2\end{pmatrix}\begin{pmatrix}\cos(\alpha) & \sin(\alpha)\\ -\sin(\alpha) & \cos(\alpha)\end{pmatrix}\right).$$
From (10), we can see that there always exists an isometry that converts any pair of bivariate normal distributions into a pair of distributions of the form above.
We present next a comparison between the lower bound “LB” (25), the upper bounds U B 1 (26), U B 2 (31), and U B 3 (32), and the numerical solution given by the geodesic shooting algorithm (GS) [31] in specific situations.
In Figure 1, we fix the eigenvalues $\lambda_1 = 2$, $\lambda_2 = 0.5$, and $\mu = 1$, with $\alpha$ varying from zero to $\frac{\pi}{2}$. We note that the upper bound $UB_1$ is very close to the lower bound $LB$ and to the numerical solution $GS$. The other upper bounds are larger than $UB_1$. In Figure 2, we consider $\mu = 10$ and the previous eigenvalues; now, the best performance is that of the bounds $UB_2$ and $UB_3$, which are similar. In Figure 3a, we again keep the eigenvalues, the rotation angle is fixed at $\alpha = \frac{\pi}{4}$, and $\mu$ varies from zero to 10. We can see similar performances of $UB_2$ and $UB_3$, which are better than $UB_1$ for larger values of $\mu$.
We may also consider the upper bound:
$$UB_{123}(\theta_1,\theta_2) = \min\{UB_1(\theta_1,\theta_2),\ UB_2(\theta_1,\theta_2),\ UB_3(\theta_1,\theta_2)\}.$$
Figure 3b displays the comparison between $LB$, $UB_{123}$, and $GS$ for the same data as in Figure 3a.

3. Fisher–Rao Distance Between Special Distributions

In this section, we describe the Fisher–Rao distance in the full space M between special kinds of distributions.

3.1. The Fisher–Rao Distance Between Distributions with Common Covariance Matrices

The Fisher–Rao distance between distributions with common covariance matrices given in Section 2.1.1 was restricted to the non-totally geodesic submanifold $M_\Sigma$. We show next that, using the isometry given in (8) and the distance in the submanifold $M_{D\mu}$, it is possible to find a closed form for the distance between two distributions with the same covariance matrix in the full manifold $M$.
Proposition 3. Given two distributions $\theta_1 = (\mu_1,\Sigma)$ and $\theta_2 = (\mu_2,\Sigma)$ in $M$, let $P$ be an orthogonal matrix such that $P(\mu_2-\mu_1) = |\mu_2-\mu_1|e_1$, and consider the decomposition:
$$P\Sigma P^t = UDU^t,$$
where $U$ is an upper triangular matrix with all diagonal entries equal to one and $D$ is a diagonal matrix. The Fisher–Rao distance between $\theta_1$ and $\theta_2$ is given by:
$$d_F(\theta_1,\theta_2) = d_{D\mu}\big((0,D),\ (|\mu_2-\mu_1|e_1,D)\big).$$
Proof. By considering the isometries $\psi = \psi_{(-P\mu_1,P)}$ and $\hat\psi = \psi_{(0,U^{-1})}$ and the decomposition given by Equation (35), it follows from Equation (9) that:
$$\begin{aligned}
d_F(\theta_1,\theta_2) &= d_F\big(\psi(\theta_1),\psi(\theta_2)\big)\\
&= d_F\big((P\mu_1 - P\mu_1,\ P\Sigma P^t),\ (P\mu_2 - P\mu_1,\ P\Sigma P^t)\big)\\
&= d_F\big((0,\ P\Sigma P^t),\ (|\mu_2-\mu_1|e_1,\ P\Sigma P^t)\big)\\
&= d_F\big(\hat\psi(0,P\Sigma P^t),\ \hat\psi(|\mu_2-\mu_1|e_1,P\Sigma P^t)\big)\\
&= d_F\big((0,\ U^{-1}P\Sigma P^tU^{-t}),\ (|\mu_2-\mu_1|U^{-1}e_1,\ U^{-1}P\Sigma P^tU^{-t})\big)\\
&= d_F\big((0,D),\ (|\mu_2-\mu_1|e_1,D)\big),
\end{aligned}$$
where the last step uses that $U^{-1}e_1 = e_1$, since $U^{-1}$ is also upper triangular with unit diagonal. Since the distributions $(0,D)$ and $(|\mu_2-\mu_1|e_1,D)$ belong to the submanifold $M_{D\mu}$, we conclude that:
$$d_F(\theta_1,\theta_2) = d_{D\mu}\big((0,D),\ (|\mu_2-\mu_1|e_1,D)\big). \qquad \square$$
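A numerical sketch of Proposition 3 (ours): $P$ is realized as a Householder reflection and the $UDU^t$ factorization is computed by elimination from the last row/column upwards; fr_diag_mean_eigvec is the function from the sketch after Equation (24):

```python
import numpy as np  # fr_diag_mean_eigvec as in the sketch after Equation (24)

def fr_common_covariance(mu1, mu2, Sigma):
    """Closed form of Proposition 3 for theta_1 = (mu1, Sigma), theta_2 = (mu2, Sigma)."""
    n = mu1.size
    r = np.linalg.norm(mu2 - mu1)
    v = (mu2 - mu1) - r * np.eye(n)[0]       # Householder: P (mu2 - mu1) = r e1
    P = np.eye(n) if np.allclose(v, 0) else np.eye(n) - 2 * np.outer(v, v) / (v @ v)
    A = P @ Sigma @ P.T
    # UDU^t with U unit upper triangular and D diagonal (backward elimination).
    U, d = np.eye(n), np.zeros(n)
    for k in range(n - 1, -1, -1):
        d[k] = A[k, k]
        U[:k, k] = A[:k, k] / d[k]
        A[:k, :k] = A[:k, :k] - np.outer(U[:k, k], U[:k, k]) * d[k]
    sig = np.sqrt(d)                          # standard deviations from D
    return fr_diag_mean_eigvec(0.0, sig, r, sig)
```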
Example 1. Consider two bivariate normal distributions $\theta_1 = ((1,0)^t,\Sigma)$ and $\theta_2 = ((6,3)^t,\Sigma)$ with the same covariance matrix:
$$\Sigma = \begin{pmatrix}1.1 & 0.9\\ 0.9 & 1.1\end{pmatrix}.$$
Figure 4a illustrates the normal distributions in the geodesic curve connecting $\theta_1$ and $\theta_2$ in $M$, and Figure 4b illustrates the geodesic in the submanifold $M_\Sigma$. We observe that, in $M$, the shape of the ellipses (contour curves) changes along the path. Furthermore, the Fisher–Rao distance between $\theta_1$ and $\theta_2$ is $d_F(\theta_1,\theta_2) = 5.00648$, which is less than the Mahalanobis distance given in Equation (17), $d_\Sigma(\theta_1,\theta_2) = 8.06226$, as expected, since the submanifold $M_\Sigma$ is not totally geodesic.

3.2. The Fisher–Rao Distance Between Mirrored Distributions

We consider here two mirrored normal distributions; that is, without loss of generality (up to rotation), the line connecting $\mu_1$ and $\mu_2$ is parallel to the $e_1$-axis, and the covariance matrices $\Sigma_1$ and $\Sigma_2$ satisfy:
$$\Sigma_2 = M_1\Sigma_1M_1, \quad \text{where} \quad M_1 = \begin{pmatrix}-1 & 0\\ 0 & I_{n-1}\end{pmatrix}.$$
This condition also implies that both matrices have the same eigenvalues.
For bivariate normal distributions, we then have:
$$\theta_1 = \left(\begin{pmatrix}\mu_1\\ \mu_0\end{pmatrix},\ \begin{pmatrix}\sigma_{11} & \sigma_{12}\\ \sigma_{12} & \sigma_{22}\end{pmatrix}\right) \quad \text{and} \quad \theta_2 = \left(\begin{pmatrix}\mu_2\\ \mu_0\end{pmatrix},\ \begin{pmatrix}\sigma_{11} & -\sigma_{12}\\ -\sigma_{12} & \sigma_{22}\end{pmatrix}\right);$$
see Figure 5.
After several experiments using the geodesic shooting algorithm for $\theta_1$ and $\theta_2$, we observed that, at $t = 0$, the geodesic curve connecting these distributions ($\gamma(t) = (\mu(t),\Sigma(t))$, with $\gamma(-1) = \theta_1$ and $\gamma(1) = \theta_2$) satisfies:
$$\gamma(0) \approx \theta_{1/2} = (\mu_{1/2},\Sigma_{1/2}) = \left(\begin{pmatrix}\frac{\mu_1+\mu_2}{2}\\ \eta\end{pmatrix},\ \begin{pmatrix}d_{11}^2 & 0\\ 0 & d_{22}^2\end{pmatrix}\right),$$
$$\gamma'(0) \approx \hat\theta_{1/2} = (\hat\mu_{1/2},\hat\Sigma_{1/2}) = \left(\begin{pmatrix}\hat\mu_1\\ 0\end{pmatrix},\ \begin{pmatrix}0 & \hat\sigma_{12}\\ \hat\sigma_{12} & 0\end{pmatrix}\right),$$
where $\eta$, $d_{11}$, and $d_{22}$ are real values; see Figure 6.
The focus here is the "shape" of these distributions. Note that, at $t = 0$, the distribution $\gamma(0)$ appears as $\theta_{1/2}$, which has a diagonal covariance matrix, and the tangent vector $\gamma'(0)$ appears as $\hat\theta_{1/2}$, which is composed of a mean vector with the second entry equal to zero and a symmetric covariance matrix with a null diagonal.
This observation inspired us to derive an explicit expression for the geodesic connecting two mirrored distributions. Starting with the bi-dimensional case, we will prove that in fact we have equality in Expressions (41) and (42).
Let $\gamma(t) = (\mu(t),\Sigma(t))$, $-1 \leq t \leq 1$, be the geodesic curve in $M$ connecting $\theta_1$ and $\theta_2$, and suppose that $\gamma(0) = \theta_{1/2}$ and $\gamma'(0) = \hat\theta_{1/2}$. Given the isometry $\psi = \psi_{(-\Sigma_{1/2}^{-1/2}\mu_{1/2},\ \Sigma_{1/2}^{-1/2})}$, we define:
$$\bar\gamma(t) = (\bar\mu(t),\bar\Sigma(t)) := \psi(\gamma(t)) = \left(\Sigma_{1/2}^{-1/2}(\mu(t)-\mu_{1/2}),\ \Sigma_{1/2}^{-1/2}\Sigma(t)\Sigma_{1/2}^{-1/2}\right).$$
Then:
$$\bar\gamma'(t) = \left(\frac{d\bar\mu(t)}{dt},\ \frac{d\bar\Sigma(t)}{dt}\right) = \left(\Sigma_{1/2}^{-1/2}\frac{d\mu(t)}{dt},\ \Sigma_{1/2}^{-1/2}\frac{d\Sigma(t)}{dt}\Sigma_{1/2}^{-1/2}\right),$$
$$\bar\gamma(0) = \left(\Sigma_{1/2}^{-1/2}(\mu_{1/2}-\mu_{1/2}),\ \Sigma_{1/2}^{-1/2}\Sigma_{1/2}\Sigma_{1/2}^{-1/2}\right) = (0,I_2) =: \theta_0$$
and:
$$\bar\gamma'(0) = \left(\Sigma_{1/2}^{-1/2}\hat\mu_{1/2},\ \Sigma_{1/2}^{-1/2}\hat\Sigma_{1/2}\Sigma_{1/2}^{-1/2}\right) = \left(\begin{pmatrix}\frac{\hat\mu_1}{d_{11}}\\ 0\end{pmatrix},\ \begin{pmatrix}0 & \frac{\hat\sigma_{12}}{d_{11}d_{22}}\\ \frac{\hat\sigma_{12}}{d_{11}d_{22}} & 0\end{pmatrix}\right).$$
Applying the natural change of parameters:
$$(\delta(t),\Delta(t)) = \varphi(\bar\mu(t),\bar\Sigma(t)) = (\bar\Sigma(t)^{-1}\bar\mu(t),\ \bar\Sigma(t)^{-1}),$$
it follows that:
$$\frac{d\Delta}{dt}(t) = -\Delta(t)\frac{d\bar\Sigma}{dt}(t)\Delta(t), \qquad \frac{d\delta}{dt}(t) = \frac{d\Delta}{dt}(t)\bar\mu(t) + \Delta(t)\frac{d\bar\mu}{dt}(t).$$
Then, given that $(\delta(0),\Delta(0)) = (\bar\mu(0),\bar\Sigma(0)) = (0,I_2)$,
$$\frac{d\Delta}{dt}(0) = -\Delta(0)\frac{d\bar\Sigma}{dt}(0)\Delta(0) = -\frac{d\bar\Sigma}{dt}(0), \qquad \frac{d\delta}{dt}(0) = \frac{d\Delta}{dt}(0)\bar\mu(0) + \Delta(0)\frac{d\bar\mu}{dt}(0) = \frac{d\bar\mu}{dt}(0).$$
That is, at $t = 0$, the tangent vector $\left(\frac{d\delta}{dt}(0),\frac{d\Delta}{dt}(0)\right)$ has the same form as the tangent vector in (46). Furthermore, the distributions $\vartheta_1 = \varphi(\bar\theta_1) = (\bar\Sigma_1^{-1}\bar\mu_1,\ \bar\Sigma_1^{-1})$ and $\vartheta_2 = \varphi(\bar\theta_2) = (\bar\Sigma_2^{-1}\bar\mu_2,\ \bar\Sigma_2^{-1})$ are also mirrored ($\bar\Sigma_2^{-1} = M_1\bar\Sigma_1^{-1}M_1$). In fact,
$$\vartheta_1 = \left(\frac{1}{\det(\Sigma_1)}\begin{pmatrix}\sigma_{22}d_{11}\frac{\mu_1-\mu_2}{2} - \sigma_{12}d_{11}(\mu_0-\eta)\\ \sigma_{11}d_{22}(\mu_0-\eta) - \sigma_{12}d_{22}\frac{\mu_1-\mu_2}{2}\end{pmatrix},\ \frac{1}{\det(\Sigma_1)}\begin{pmatrix}\sigma_{22}d_{11}^2 & -\sigma_{12}d_{11}d_{22}\\ -\sigma_{12}d_{11}d_{22} & \sigma_{11}d_{22}^2\end{pmatrix}\right),$$
and, by similar arguments, we obtain:
$$\vartheta_2 = \left(\frac{1}{\det(\Sigma_1)}\begin{pmatrix}\sigma_{22}d_{11}\frac{\mu_2-\mu_1}{2} + \sigma_{12}d_{11}(\mu_0-\eta)\\ \sigma_{11}d_{22}(\mu_0-\eta) - \sigma_{12}d_{22}\frac{\mu_1-\mu_2}{2}\end{pmatrix},\ \frac{1}{\det(\Sigma_1)}\begin{pmatrix}\sigma_{22}d_{11}^2 & \sigma_{12}d_{11}d_{22}\\ \sigma_{12}d_{11}d_{22} & \sigma_{11}d_{22}^2\end{pmatrix}\right).$$
Figure 7 illustrates the distributions θ 0 , ϑ 1 , and ϑ 2 .
Conversely, by considering:
$$(x,B) = \left(\begin{pmatrix}x\\0\end{pmatrix},\ \begin{pmatrix}0 & b\\ b & 0\end{pmatrix}\right)$$
in the initial value problem given in Equations (13) and (14), it follows that the matrix $G^2 = B^2 + 2xx^t$ is diagonal. Therefore, the geodesic curve $(\delta(t),\Delta(t))$ with initial value $(\delta(0),\Delta(0)) = \theta_0$ and tangent vector $(x,B)$ given in Equation (15) can be simplified as follows:
$$\delta(t) = \begin{pmatrix}\dfrac{x\sinh(t\sqrt{b^2+2x^2})}{\sqrt{b^2+2x^2}}\\[2mm] \dfrac{bx\big(\cosh(t\sqrt{b^2+2x^2})-1\big)}{b^2+2x^2}\end{pmatrix},$$
$$\Delta(t) = \begin{pmatrix}\dfrac{1}{2}\big(\cosh(bt) + \cosh(t\sqrt{b^2+2x^2})\big) & -\dfrac{1}{2}\left(\sinh(bt) + \dfrac{b\sinh(t\sqrt{b^2+2x^2})}{\sqrt{b^2+2x^2}}\right)\\[3mm] -\dfrac{1}{2}\left(\sinh(bt) + \dfrac{b\sinh(t\sqrt{b^2+2x^2})}{\sqrt{b^2+2x^2}}\right) & \dfrac{1}{2}\left(\cosh(bt) + \dfrac{2x^2 + b^2\cosh(t\sqrt{b^2+2x^2})}{b^2+2x^2}\right)\end{pmatrix}.$$
From the parity of the functions $\sinh(t)$ and $\cosh(t)$, it is possible to show that, given $t_0 \in \mathbb{R}$, the distributions:
$$(\delta(-t_0),\Delta(-t_0)) \quad \text{and} \quad (\delta(t_0),\Delta(t_0))$$
are also mirrored.
By the above discussion, we conclude that it is possible to calculate the geodesic curve connecting $\theta_1$ and $\theta_2$ by making $\psi^{-1}(\varphi^{-1}(\delta(-1),\Delta(-1))) = \theta_1$ and $\psi^{-1}(\varphi^{-1}(\delta(1),\Delta(1))) = \theta_2$. That is, we need to find the values of $\eta$, $d_{11}$, and $d_{22}$ of the isometry $\psi$ and the values of $(x,B)$ such that:
$$\varphi(\psi(\theta_1)) = (\delta(-1),\Delta(-1)), \qquad \varphi(\psi(\theta_2)) = (\delta(1),\Delta(1)).$$
Since the two equations above are equivalent, it is enough to solve the equation:
$$(\delta(1),\Delta(1)) = \varphi(\psi(\mu_2,\Sigma_2)).$$
This is equivalent to solving the system:
$$\begin{pmatrix}\frac{1}{d_{11}} & 0\\ 0 & \frac{1}{d_{22}}\end{pmatrix}\Delta(1)\begin{pmatrix}\frac{1}{d_{11}} & 0\\ 0 & \frac{1}{d_{22}}\end{pmatrix} = \Delta_2, \qquad \begin{pmatrix}\frac{1}{d_{11}} & 0\\ 0 & \frac{1}{d_{22}}\end{pmatrix}\delta(1) + \Delta_2\begin{pmatrix}\frac{\mu_1+\mu_2}{2}\\ \eta\end{pmatrix} = \delta_2,$$
where $(\delta_2,\Delta_2) = \varphi(\mu_2,\Sigma_2)$.
The above non-linear system has five equations and five variables ($d_{11}$, $d_{22}$, $\eta$, $x$, $b$) and can be solved by an iterative method. With the solution of this system, we can determine the geodesic curve connecting the distributions $\theta_1$ and $\theta_2$. Moreover, by Equation (16), the Fisher–Rao distance is:
$$d_F(\theta_1,\theta_2) = 2\,d_F(\theta_0,\vartheta_2) = 2\,d_F\big((0,I_n),(\delta(1),\Delta(1))\big) = 2\sqrt{\frac{1}{2}\operatorname{tr}(B^2) + x^tx} = 2\sqrt{b^2 + x^2}.$$
We also remark that the curve of the means $\delta(t)$ (and, therefore, $\mu(t)$) satisfies the equation of a hyperbola; in fact:
$$\frac{\left(\delta_2(t) + \dfrac{bx}{b^2+2x^2}\right)^2}{\left(\dfrac{bx}{b^2+2x^2}\right)^2} - \frac{\delta_1(t)^2}{\left(\dfrac{x}{\sqrt{b^2+2x^2}}\right)^2} = 1,$$
where $\delta_1(t)$ and $\delta_2(t)$ are the components of $\delta(t)$ in Equation (53); the identity follows from $\cosh^2 - \sinh^2 = 1$.
Summarizing the above discussion, we have:
Proposition 4.
(i) 
Expression (57) provides a closed form for the Fisher–Rao distance between two mirrored bivariate normal distributions, based on the solutions of the non-linear system (56).
(ii) 
The plane curve given by the coordinates of the mean vector in the geodesic connecting two of these distributions is a hyperbola.
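A minimal numerical sketch of Proposition 4 (the solver choice, starting point, and function names are ours): it builds the closed form (53), solves the system (56) with scipy.optimize.fsolve, and returns the distance (57):

```python
import numpy as np
from scipy.optimize import fsolve

def geodesic_mirrored(x, b, t):
    """(delta(t), Delta(t)) of the simplified closed form, Equation (53)."""
    g = np.sqrt(b * b + 2 * x * x)
    ch, sh = np.cosh(t * g), np.sinh(t * g)
    chb, shb = np.cosh(b * t), np.sinh(b * t)
    delta = np.array([x * sh / g, b * x * (ch - 1) / g ** 2])
    off = -0.5 * (shb + b * sh / g)
    Delta = np.array([[0.5 * (chb + ch), off],
                      [off, 0.5 * (chb + (2 * x * x + b * b * ch) / g ** 2)]])
    return delta, Delta

def fr_mirrored(mu1, mu2, mu0, Sigma1):
    """Distance between theta1 = ((mu1, mu0), Sigma1) and its mirror
    theta2 = ((mu2, mu0), M1 Sigma1 M1), via the system (56) and Equation (57)."""
    M1 = np.diag([-1.0, 1.0])
    Delta2 = np.linalg.inv(M1 @ Sigma1 @ M1)
    delta2 = Delta2 @ np.array([mu2, mu0])

    def residual(p):
        d11, d22, eta, x, b = p
        Linv = np.diag([1.0 / d11, 1.0 / d22])        # Sigma_{1/2}^{-1/2}
        delta, Delta = geodesic_mirrored(x, b, 1.0)
        cov = Linv @ Delta @ Linv - Delta2
        mean = Linv @ delta + Delta2 @ np.array([(mu1 + mu2) / 2, eta]) - delta2
        return [cov[0, 0], cov[0, 1], cov[1, 1], mean[0], mean[1]]

    d11, d22, eta, x, b = fsolve(residual, [1.0, 1.0, mu0, 1.0, 0.5])
    return 2.0 * np.sqrt(b * b + x * x)
```

As with any shooting-free root-finding formulation, convergence depends on the initial guess; the one above worked in our quick trials but is not prescribed by the text.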
Table 2 shows a time comparison between the numerical method proposed here and the geodesic shooting algorithm for obtaining the Fisher–Rao distance. The distributions used in this experiment were:
$$\theta_1 = \left(\begin{pmatrix}\mu\\0\end{pmatrix},\ \begin{pmatrix}0.55 & 0.45\\ 0.45 & 0.55\end{pmatrix}\right) \quad \text{and} \quad \theta_2 = \left(\begin{pmatrix}-\mu\\0\end{pmatrix},\ \begin{pmatrix}0.55 & -0.45\\ -0.45 & 0.55\end{pmatrix}\right),$$
for different values of $\mu$.
The method proposed here uses a non-linear system for the calculation of the Fisher–Rao distance, so it is faster than the geodesic shooting algorithm. Furthermore, we remark that, for $\mu \geq 7$, the geodesic shooting requires additional adaptation to converge.
Next, we generalize the results of Proposition 4 to pairs of general multivariate mirrored normal distributions. Without loss of generality, we may assume:
$$\theta_1 = (\mu_1e_1,\Sigma_1) \quad \text{and} \quad \theta_2 = (\mu_2e_1,\Sigma_2),$$
with $\Sigma_2 = M_1\Sigma_1M_1$ as in (39), that is, $\Sigma_2 = (\hat\sigma_{ij})$ with:
$$\hat\sigma_{1j} = \hat\sigma_{j1} = -\sigma_{1j},\ j = 2,\ldots,n, \qquad \hat\sigma_{ij} = \sigma_{ij} \text{ otherwise}.$$
Proposition 5. The Fisher–Rao distance between a pair of multivariate mirrored normal distributions $\theta_1$ and $\theta_2$ (65) is:
$$d_F(\theta_1,\theta_2) = 2\sqrt{\sum_{l=1}^{n-1}b_l^2 + x^2},$$
where:
$$(x,B) = \left(\begin{pmatrix}x\\0\\\vdots\\0\end{pmatrix},\ \begin{pmatrix}0 & b_1 & \cdots & b_{n-1}\\ b_1 & 0 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ b_{n-1} & 0 & \cdots & 0\end{pmatrix}\right).$$
The values $x$ and $b_l$, the non-zero entries of $(x,B)$, are obtained from the solution of the non-linear system of order $n + \frac{n(n+1)}{2}$:
$$L^{-t}\Delta(1)L^{-1} = \Delta_2, \qquad L^{-t}\delta(1) + \Delta_2\mu_{1/2} = \delta_2,$$
where $\mu_{1/2} = \left(\frac{\mu_1+\mu_2}{2},\eta_1,\ldots,\eta_{n-1}\right)^t$, $L$ is the Cholesky factor of the matrix $\Sigma_{1/2} = \begin{pmatrix}d_{11} & 0^t\\ 0 & D\end{pmatrix}$, with $D$ a symmetric matrix of order $n-1$, $(\delta_2,\Delta_2) = \varphi(\mu_2,\Sigma_2)$, and $(\delta(t),\Delta(t))$ is the geodesic curve with initial value $(\delta(0),\Delta(0)) = \theta_0$ and tangent vector $(x,B)$ given in Equation (15).
Let $\gamma(t) = (\mu(t),\Sigma(t))$, $-1 \leq t \leq 1$, be the geodesic curve in $M$ connecting $\theta_1$ and $\theta_2$. The proof is similar to the bivariate case, by considering $\gamma(0) = (\mu_{1/2},\Sigma_{1/2})$,
$$\gamma'(0) = \hat\theta_{1/2} = (\hat\mu_{1/2},\hat\Sigma_{1/2}) = \left(\begin{pmatrix}\hat\mu\\0\\\vdots\\0\end{pmatrix},\ \begin{pmatrix}0 & \hat\sigma_{12} & \cdots & \hat\sigma_{1n}\\ \hat\sigma_{12} & 0 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ \hat\sigma_{1n} & 0 & \cdots & 0\end{pmatrix}\right),$$
and $\bar\gamma(t) = \psi(\gamma(t))$, where $\psi = \psi_{(-L^{-1}\mu_{1/2},\ L^{-1})}$.
Table 3 collects the results in Section 2.1 and the new results of this section.

4. Hierarchical Clustering for Diagonal Gaussian Mixture Simplification

A parameterized Gaussian mixture model $f$ is a weighted sum of $m$ multivariate normal distributions, that is,
$$f(x) = \sum_{i=1}^m w_i\,p_i(x;\mu_i,\Sigma_i),$$
where $x \in \mathbb{R}^n$, the $p_i(x;\mu_i,\Sigma_i)$, $i = 1,\ldots,m$, are normal distributions, and the $w_i$, $i = 1,\ldots,m$, are the mixture weights, with $\sum_{i=1}^m w_i = 1$. In this paper, we call a mixture composed only of distributions with diagonal covariance matrices a diagonal Gaussian mixture model (DGMM).
Gaussian mixture models (GMM) are used to model datasets in image processing, signal processing, and density estimation problems [46,47,48]. In many applications involving mixture models, the computational requirements are very high due to the large number of mixture components. This can be handled by reducing the number of components of the mixture: given a mixture $f$ of $m$ components, we want to find a mixture $g$ of $l$ components, $1 \leq l < m$, such that $g$ is a good approximation of $f$ with respect to a similarity measure [49]. Gaussian mixture simplification was considered in statistical inference in [50] and in decoding low-density lattice codes [51].
In [49], a hierarchical clustering algorithm was proposed to simplify an exponential family mixture model based on Bregman divergences. This section describes an agglomerative hierarchical clustering method based on the Fisher–Rao distance in the submanifold $M_D$ (21) to simplify DGMMs, and we present an application to image segmentation, complementing what was developed in [52]. We start by introducing the concept of the centroid for a set of distributions in $M_D$.

4.1. Centroids in the Submanifold $M_D$

In [53], Galperin described centroids in the two-dimensional Minkowski model, which can also be translated to the Klein disk and Poincaré half-plane models. Given a set of points $q_i = (x_{q_i},y_{q_i},z_{q_i})$ in the Minkowski model, with associated weights $u_i$, the centroid is computed and normalized as:
$$c' = \sum_i u_iq_i \quad \text{and} \quad c = \frac{c'}{\sqrt{-x_{c'}^2 - y_{c'}^2 + z_{c'}^2}}.$$
To calculate the centroid $c$ of a subset of points $C = \{(w_i,\theta_i)\}$, $\theta_i = (\mu_i,\sigma_i)$, we use the isometries presented in [25] and the relation, given in [22], between the mean × standard deviation half-plane of parameters of univariate normal distributions and the Poincaré half-plane.
Given a dataset $C = \{(w_i,\theta_i)\}$, where the $\theta_i = (\mu_{1i},\sigma_{1i},\ldots,\mu_{ni},\sigma_{ni})$ are distributions in $M_D$, the centroid of $C$ is:
$$c := (c_1,\ldots,c_n),$$
where $c_j$, $j = 1,\ldots,n$, is the centroid of $C_j = \{(w_i,(\mu_{ji},\sigma_{ji}))\}$ given in Equation (66).

4.2. Hierarchical Clustering Algorithm

Let $f$ be a DGMM with parameters $C = \{(w_1,\theta_1),\ldots,(w_m,\theta_m)\}$.
In order to apply the hierarchical clustering algorithm, we need to consider the distance between two subsets $A$ and $B$. The three most common distances, called linkage criteria, are given by [54]:
  • Single linkage: $D(A,B) = \min\{d_D(a,b);\ a \in A,\ b \in B\}$;
  • Complete linkage: $D(A,B) = \max\{d_D(a,b);\ a \in A,\ b \in B\}$;
  • Group average linkage: $D(A,B) = \frac{1}{|A||B|}\sum_{a\in A}\sum_{b\in B}d_D(a,b)$,
where $d_D$ is the distance in the submanifold $M_D$ and $|X|$ is the number of elements of a set $X$.
A summary of the hierarchical clustering algorithm (Algorithm 1) [49] using one of these distances is given next.   
Algorithm 1: Hierarchical Clustering Algorithm
1: Form $m$ clusters $C_j = \{(w_j,\theta_j)\}$, each with a single element.
2: Find the two closest clusters, $C_i$ and $C_j$, with respect to a distance $D$, and merge them into a single cluster $C_i \cup C_j$.
3: Compute the distances between the new cluster and each of the old clusters.
4: Repeat Steps 2 and 3 until all items are clustered into a single cluster of size $m$.
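A generic sketch of Algorithm 1 (ours, not the authors' implementation): items are (weight, θ) pairs, dist is a distance on the θ (for example, $d_D$), and linkage reduces the pairwise distances between two clusters (max gives the complete linkage (68)):

```python
def hierarchical_clustering(items, dist, linkage=max, l=1):
    """Agglomerative clustering following Algorithm 1; stops at l clusters."""
    clusters = [[item] for item in items]            # Step 1: singleton clusters
    while len(clusters) > l:
        # Step 2: find the two closest clusters under the linkage criterion.
        best = min(((linkage(dist(a[1], b[1])
                             for a in clusters[i] for b in clusters[j]), i, j)
                    for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))), key=lambda t: t[0])
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge; distances are recomputed lazily
    return clusters
```

With $l > 1$, the $l$ surviving clusters correspond to the subsets $C_1,\ldots,C_l$ used below to build the simplified mixture.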
The simplified DGMM:
$$g = \sum_{j=1}^l \beta_jg_j$$
of $l$ components is built from the $l$ subsets $C_1,\ldots,C_l$ remaining after iteration $m-l$ of the hierarchical clustering algorithm. In this work, we choose the parameters of $g_j$ in two ways: as the centroid in the submanifold $M_D$ (Fisher–Rao hierarchical clustering) and as the Bregman left-sided centroid [49] (Bregman–Fisher–Rao hierarchical clustering) of the subset $C_j$, with weights $\beta_j = \sum_{(w_i,\theta_i)\in C_j}w_i$.
As remarked in [49], the hierarchical clustering algorithm allows introducing a method to learn the optimal number of components in the simplified mixture $g$: $g$ must be as compact as possible while reaching a minimum prescribed quality $d_{KL}(f\|g) \leq \tau$, where $d_{KL}(f\|g)$ is the Kullback–Leibler divergence.

4.3. Experiments in Image Segmentation

We can apply the Fisher–Rao and the Bregman–Fisher–Rao hierarchical clusterings to simplify a mixture of exponential families in the context of clustering-based image segmentation, as was done in [49] for the Bregman hierarchical clustering. Given an input color image $I$, we adapt the Bregman soft clustering algorithm to generate a DGMM $f$ of 32 components, which models the image pixels. We point out that the restriction considered in this paper (only DGMMs) is also adopted in many applications due to its much lower computational cost. We consider here a pixel $\rho = (\rho_R,\rho_G,\rho_B)$ as a point in $\mathbb{R}^3$, where $\rho_R$, $\rho_G$, and $\rho_B$ are the RGB color information. For image segmentation, we say that the image pixel $\rho$ belongs to the class $C_j$ when:
$$p_j(\rho;\mu_j,\Sigma_j) > p_i(\rho;\mu_i,\Sigma_i), \quad \forall\, i \in \{1,\ldots,m\}\setminus\{j\}.$$
Thus, the segmented image is obtained by replacing the color value of the pixel $\rho$ by the mean $\mu_j$ of the Gaussian $p_j$.
Using the Fisher–Rao and the Bregman–Fisher–Rao hierarchical clusterings, we simplify the mixture $f$ into mixtures $g$ of $l$ components with $l \in \{2,4,8,16\}$. Each mixture gives one image segmentation. The linkage criterion used here was the complete linkage (68), which presented the best results in our simulations. Figure 8 shows the segmentation of the Baboon, Lena, and Clown input images given by the Bregman–Fisher–Rao hierarchical clustering. The number of colors in each image is equal to the number of components in the simplified mixture $g$.
The quality of the segmentation was analyzed as a function of $l$ through the Kullback–Leibler divergence, estimated by the Monte Carlo method since there is no closed form for this measure (five thousand points were randomly drawn to estimate $d_{KL}(f\|g)$). Figure 9, Figure 10 and Figure 11 show the evolution of the simplification quality as a function of the number of components $l$ for the Baboon, Lena, and Clown images, using the Bregman, the Fisher–Rao, and the Bregman–Fisher–Rao hierarchical clustering algorithms. We observed that the image quality increased ($d_{KL}(f\|g)$ decreased) with $l$, as expected, and the behavior was similar for all clustering algorithms. In general, the Bregman–Fisher–Rao hierarchical clustering algorithm presented better results. Considering the constraint $\tau = 0.2$, the learning process provided, for the Bregman–Fisher–Rao hierarchical clustering, mixtures of 19, 21, and 21 components as optimal simplifications for the Baboon, Lena, and Clown images, respectively.
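The quality measure can be estimated in a few lines; a sketch (ours) of the Monte Carlo estimator of $d_{KL}(f\|g)$ used above, where sample_f, pdf_f, and pdf_g are caller-supplied functions (hypothetical names):

```python
import numpy as np

def kl_monte_carlo(sample_f, pdf_f, pdf_g, n=5000, seed=None):
    """Monte Carlo estimate of d_KL(f || g) = E_f[log f(X) - log g(X)].
    sample_f(n, rng) draws n points from f; pdf_f and pdf_g evaluate densities."""
    rng = np.random.default_rng(seed)
    xs = sample_f(n, rng)
    return float(np.mean(np.log(pdf_f(xs)) - np.log(pdf_g(xs))))
```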

5. Concluding Remarks

The Fisher–Rao distance was approached here in the space of multivariate normal distributions. Initially, as in [38], we summarized some known closed forms for this distance in submanifolds of this model and some bounds for the general case. A closed form for the Fisher–Rao distance between distributions with the same covariance matrix was obtained in Proposition 3, and we also derived a non-linear system characterizing the distance between two distributions with mirrored covariance matrices in Proposition 5. Some perspectives for future research related to this topic include deriving new bounds for the Fisher–Rao distance in the general case by using these special distributions, characterizing the distances between other types of distributions through non-linear systems, and extending the closed forms and bounds presented here to the space of elliptical distributions. Finally, we extended the analysis of the Bregman–Fisher–Rao hierarchical clustering algorithm to simplify Gaussian mixtures in the context of clustering-based image segmentation given in [52], with comparative results that encourage the use of the Fisher–Rao distance in other clustering or classification algorithms.

Author Contributions

All authors contributed equally to the research and the writing of the manuscript. All authors read and approved the final manuscript.

Acknowledgments

The authors are thankful to the referees, whose comments and suggestions have contributed to improving the presentation of the text. The authors were partially supported by grants from the FAPESP (13/25977-7) and CNPq (313326/2017-7) foundations.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Calin, O.; Udriste, C. Geometric Modeling in Probability and Statistics. In Mathematics and Statistics; Springer International: Cham, Switzerland, 2014.
2. Nielsen, F. An elementary introduction to information geometry. arXiv 2018, arXiv:1808.08271.
3. Amari, S.; Nagaoka, H. Methods of Information Geometry. In Translations of Mathematical Monographs; Oxford University Press: Oxford, UK, 2000; Volume 191.
4. Amari, S. Information Geometry and Its Applications; Springer: Tokyo, Japan, 2016.
5. Ay, N.; Jost, J.; Vân Lê, H.; Schwachhöfer, L. Information geometry and sufficient statistics. Probab. Theory Relat. Fields 2015, 162, 327–364.
6. Vân Lê, H. The uniqueness of the Fisher metric as information metric. Ann. Inst. Stat. Math. 2017, 69, 879–896.
7. Gibilisco, P.; Riccomagno, E.; Rogantin, M.P.; Wynn, H.P. Algebraic and Geometric Methods in Statistics; Cambridge University Press: New York, NY, USA, 2010.
8. Chentsov, N.N. Statistical Decision Rules and Optimal Inference; AMS Bookstore: Providence, RI, USA, 1982; Volume 53.
9. Campbell, L.L. An extended Cencov characterization of the information metric. Proc. Am. Math. Soc. 1986, 98, 135–141.
10. Vân Lê, H. Statistical manifolds are statistical models. J. Geom. 2006, 84, 83–93.
11. Mahalanobis, P.C. On the generalized distance in statistics. Proc. Natl. Inst. Sci. 1936, 2, 49–55.
12. Bhattacharyya, A. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 1943, 35, 99–110.
13. Hotelling, H. Spaces of statistical parameters. Bull. Am. Math. Soc. (AMS) 1930, 36, 191.
14. Rao, C.R. Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37, 81–91.
15. Fisher, R.A. On the mathematical foundations of theoretical statistics. Philos. Trans. R. Soc. Lond. 1921, 222, 309–368.
16. Burbea, J. Informative geometry of probability spaces. Expo. Math. 1986, 4, 347–378.
17. Skovgaard, L.T. A Riemannian geometry of the multivariate normal model. Scand. J. Stat. 1984, 11, 211–223.
18. Atkinson, C.; Mitchell, A.F.S. Rao's Distance Measure. Sankhyã Indian J. Stat. 1981, 43, 345–365.
19. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
20. Villani, C. Optimal Transport, Old and New. In Grundlehren der Mathematischen Wissenschaften; Springer: Berlin/Heidelberg, Germany, 2009.
21. Amari, S. Differential Geometrical Methods in Statistics; Springer: Berlin, Germany, 1985.
22. Costa, S.I.R.; Santos, S.A.; Strapasson, J.E. Fisher information distance: A geometrical reading. Discret. Appl. Math. 2015, 197, 59–69.
23. Angulo, J.; Velasco-Forero, S. Morphological processing of univariate Gaussian distribution-valued images based on Poincaré upper-half plane representation. In Geometric Theory of Information; Springer International Publishing: Cham, Switzerland, 2014; pp. 331–366.
24. Maybank, S.J.; Ieng, S.; Benosman, R. A Fisher–Rao metric for paracatadioptric images of lines. Int. J. Comput. Vis. 2012, 99, 147–165.
25. Schwander, O.; Nielsen, F. Model centroids for the simplification of kernel density estimators. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012.
26. Taylor, S. Clustering Financial Return Distributions Using the Fisher Information Metric. Entropy 2019, 21, 110.
27. Eriksen, P.S. Geodesics Connected with the Fischer Metric on the Multivariate Normal Manifold; Institute of Electronic Systems, Aalborg University Centre: Aalborg, Denmark, 1986.
28. Calvo, M.; Oller, J.M. An explicit solution of information geodesic equations for the multivariate normal model. Stat. Decis. 1991, 9, 119–138.
29. Lenglet, C.; Rousson, M.; Deriche, R.; Faugeras, O. Statistics on the manifold of multivariate normal distributions. Theory and application to diffusion tensor MRI processing. J. Math. Imaging Vis. 2006, 25, 423–444.
30. Moakher, M.; Mourad, Z. The Riemannian geometry of the space of positive-definite matrices and its application to the regularization of positive-definite matrix-valued data. J. Math. Imaging Vis. 2011, 40, 171–187.
31. Han, M.; Park, F.C. DTI Segmentation and Fiber Tracking Using Metrics on Multivariate Normal Distributions. J. Math. Imaging Vis. 2014, 49, 317–334.
32. Verdoolaege, G.; Scheunders, P. Geodesics on the manifold of multivariate generalized Gaussian distributions with an application to multicomponent texture discrimination. Int. J. Comput. Vis. 2011, 95, 265.
33. Tang, M.; Rong, Y.; Zhou, J.; Li, X.R. Information geometric approach to multisensor estimation fusion. IEEE Trans. Signal Process. 2018, 67, 279–292.
34. Poon, C.; Keriven, N.; Peyré, G. Support Localization and the Fisher Metric for off-the-grid Sparse Regularization. arXiv 2018, arXiv:1810.03340.
35. Gattone, S.A.; De Sanctis, A.; Puechmorel, S.; Nicol, F. On the geodesic distance in shapes K-means clustering. Entropy 2018, 20, 647.
36. Gattone, S.A.; De Sanctis, A.; Russo, T.; Pulcini, D. A shape distance based on the Fisher–Rao metric and its application for shapes clustering. Phys. A Stat. Mech. Appl. 2017, 487, 93–102.
37. Pilté, M.; Barbaresco, F. Tracking quality monitoring based on information geometry and geodesic shooting. In Proceedings of the 2016 17th International Radar Symposium (IRS), Krakow, Poland, 10–12 May 2016.
38. Pinele, J.; Costa, S.I.; Strapasson, J.E. On the Fisher–Rao Information Metric in the Space of Normal Distributions. In International Conference on Geometric Science of Information; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2019; pp. 676–684.
39. Burbea, J.; Rao, C.R. Entropy differential metric, distance and divergence measures in probability spaces: A unified approach. J. Multivar. Anal. 1982, 12, 575–596.
40. Porat, B.; Benjamin, F. Computation of the exact information matrix of Gaussian time series with stationary random components. IEEE Trans. Acoust. Speech Signal Process. 1986, 34, 118–130.
41. Siegel, C.L. Symplectic geometry. Am. J. Math. 1943, 65, 1–86.
42. Strapasson, J.E.; Pinele, J.; Costa, S.I.R. A totally geodesic submanifold of the multivariate normal distributions and bounds for the Fisher–Rao distance. In Proceedings of the IEEE Information Theory Workshop (ITW), Cambridge, UK, 11–14 September 2016; pp. 61–65.
43. Calvo, M.; Oller, J.M. A distance between multivariate normal distributions based in an embedding into the Siegel group. J. Multivar. Anal. 1990, 35, 223–242.
44. Calvo, M.; Oller, J.M. A distance between elliptical distributions based in an embedding into the Siegel group. J. Comput. Appl. Math. 2002, 145, 319–334.
45. Strapasson, J.E.; Porto, J.; Costa, S.I.R. On bounds for the Fisher–Rao distance between multivariate normal distributions. AIP Conf. Proc. 2015, 1641, 313–320.
46. Zhang, K.; Kwok, J.T. Simplifying mixture models through function approximation. IEEE Trans. Neural Netw. 2010, 21, 644–658.
47. Davis, J.V.; Dhillon, I.S. Differential entropic clustering of multivariate gaussians. In Proceedings of the 2006 Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 4–7 December 2006.
48. Goldberger, J.; Greenspan, H.K.; Dreyfuss, J. Simplifying mixture models using the unscented transform. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 1496–1502.
49. Garcia, V.; Nielsen, F. Simplification and hierarchical representations of mixtures of exponential families. Signal Process. 2010, 90, 3197–3212.
50. Bar-Shalom, Y.; Li, X. Estimation and Tracking: Principles, Techniques and Software; Artech House: Norwood, MA, USA, 1993.
51. Kurkoski, B.; Dauwels, J. Message-passing decoding of lattices using Gaussian mixtures. In Proceedings of the 2008 IEEE International Symposium on Information Theory, Toronto, ON, Canada, 6–11 July 2008.
52. Strapasson, J.E.; Pinele, J.; Costa, S.I.R. Clustering using the Fisher–Rao distance. In Proceedings of the IEEE Sensor Array and Multichannel Signal Processing Workshop, Rio de Janeiro, Brazil, 10–13 July 2016.
53. Galperin, G.A. A concept of the mass center of a system of material points in the constant curvature spaces. Commun. Math. Phys. 1993, 154, 63–84.
54. Nielsen, F. Introduction to HPC with MPI for Data Science. In Undergraduate Topics in Computer Science; Springer: Cham, Switzerland, 2016.
Figure 1. A comparison between the bounds $LB$, $UB_1$, $UB_2$, $UB_3$, and $GS$ ($\lambda_1 = 2$, $\lambda_2 = 0.5$, and $\mu = 1$ are fixed, and $\alpha$ varies from zero to $\frac{\pi}{2}$).
Figure 2. A comparison between the bounds $LB$, $UB_1$, $UB_2$, $UB_3$, and $GS$ ($\lambda_1 = 2$, $\lambda_2 = 0.5$, and $\mu = 10$ are fixed, and $\alpha$ varies from zero to $\frac{\pi}{2}$).
Figure 3. (a) A comparison between the bounds $LB$, $UB_1$, $UB_2$, $UB_3$, and $GS$ ($\lambda_1 = 2$, $\lambda_2 = 0.5$, and the rotation angle $\alpha = \pi/4$ are fixed, and $\mu$ varies from zero to 10). (b) A comparison between the bounds $LB$, $UB_{123}$, and $GS$ for the same data.
Figure 4. (a) Level curves of the distributions in the geodesic curve connecting the bivariate normal distributions $\theta_1 = ((1,0)^t,\Sigma)$ and $\theta_2 = ((6,3)^t,\Sigma)$ in $M$. (b) Level curves of the distributions in the geodesic curve connecting the same distributions in $M_\Sigma$.
Figure 5. Example of level curves of mirrored distributions, where $\theta_1$ and $\theta_2$ are given by Equation (40).
Figure 6. Approximation of the geodesic curve connecting $\theta_1$ and $\theta_2$ via the geodesic shooting algorithm. The level curve of $\theta_{1/2}$ is the dashed one.
Figure 7. Contour curves of the distributions $\vartheta_1 = \varphi(\bar\theta_1)$ and $\vartheta_2 = \varphi(\bar\theta_2)$.
Figure 8. Illustration of the mixture simplification using the Fisher–Rao clustering, where $l$ is the number of components of the mixture (the last column is the original figure).
Figure 9. Illustration of the simplification quality of the mixture modeling the Baboon image.
Figure 10. Illustration of the simplification quality of the mixture modeling the Lena image.
Figure 11. Illustration of the simplification quality of the mixture modeling the Clown image.
Table 1. The lower bound L B ( θ 1 , θ 2 ) and the upper bounds U B 1 ( θ 1 , θ 2 ) , U B 2 ( θ 1 , θ 2 ) and U B 3 ( θ 1 , θ 2 ) for the Fisher–Rao distance, d F ( θ 1 , θ 2 ) , between distributions θ 1 = ( μ 1 , Σ 1 ) and θ 2 = ( μ 2 , Σ 2 ) in M . d F is the distance between univariate normal distributions given in Equation (22).
$LB(\theta_1,\theta_2) = \sqrt{\frac{1}{2}\sum_{i=1}^{n+1}[\log(\lambda_i)]^2}$, where $S_i = \begin{pmatrix}\Sigma_i+\mu_i\mu_i^t & \mu_i\\ \mu_i^t & 1\end{pmatrix}$, $i = 1,2$, and the $\lambda_i$ are the eigenvalues of $S_1^{-1/2}S_2S_1^{-1/2}$.

$UB_1(\theta_1,\theta_2) = \sqrt{\sum_{i=1}^n d_F^2\big((0,1),(\mu_i,\sqrt{\lambda_i})\big)}$, where $\Sigma_1^{-1/2}\Sigma_2\Sigma_1^{-1/2} = Q\Lambda Q^t$; the $\lambda_i$ are the diagonal terms of $\Lambda$; and the $\mu_i$ are the coordinates of $\mu = Q^t\Sigma_1^{-1/2}(\mu_2-\mu_1)$.

$UB_2(\theta_1,\theta_2) = \sqrt{\frac{1}{2}\sum_{i=1}^n[\log(\lambda_i)]^2} + \sqrt{d_F^2\big((0,1),(|\mu_3|,\bar d_1)\big) + \sum_{i=2}^n d_F^2\big((0,1),(0,\bar d_i)\big)}$, where $\mu_3 = \Sigma_1^{-1/2}(\mu_2-\mu_1)$; $\Sigma_3 = \Sigma_1^{-1/2}\Sigma_2\Sigma_1^{-1/2}$; the $\bar d_i$ are given by (31); $P$ is an orthogonal matrix such that $P\mu_3 = (|\mu_3|,0,\ldots,0)^t$; $\bar\Sigma = P^{-1}\bar DP^{-t}$, with $\bar D = \operatorname{diag}(\bar d_1^2,\ldots,\bar d_n^2)$; and the $\lambda_i$ are the eigenvalues of $\bar\Sigma^{-1/2}\Sigma_3\bar\Sigma^{-1/2}$.

$UB_3(\theta_1,\theta_2) = \sqrt{\frac{1}{2}\sum_{i=1}^n[\log(\lambda_i)]^2} + d_F\!\left((0,1),\left(|\mu_3|,\sqrt{\tfrac{|\mu_3|^2+2}{2}}\right)\right)$, where $\mu_3$, $\Sigma_3$, and $P$ are as above; $\bar\Sigma = P^{-1}DP^{-t}$, with $D = \operatorname{diag}\!\left(\tfrac{|\mu_3|^2+2}{2},1,\ldots,1\right)$; and the $\lambda_i$ are the eigenvalues of $\bar\Sigma^{-1/2}\Sigma_3\bar\Sigma^{-1/2}$.
Table 2. A time comparison between the numerical method proposed here and the geodesic shooting to calculate the distance between two mirrored distributions.
 μ    d_F(θ1, θ2)    Time, system (s)    Time, geodesic shooting (s)
 1    2.77395        0.046875            4.70313
 2    3.67027        0.046875            5.60938
 3    4.52933        0.0625              7.10938
 4    5.26093        0.078125            9.17188
 5    5.87480        0.046875            12.5313
 6    6.39439        0.0625              18.4219
 7    6.84043        0.078125            492.563
 8    7.22903        0.0625              574.422
 9    7.57221        0.046875            917.859
 10   7.87896        0.046875            1007.13
Table 3. Closed forms for the Fisher–Rao distance in submanifolds of M and the distance in M between pairs of special distributions.
Distance in Non-Totally Geodesic Submanifolds

$M_\Sigma = \{p_\theta;\ \theta = (\mu,\Sigma),\ \Sigma = \Sigma_0 \in P_n(\mathbb{R}) \text{ constant}\}$, $\theta_i = (\mu_i,\Sigma_0)$:
$d_\Sigma(\theta_1,\theta_2) = \sqrt{(\mu_1-\mu_2)^t\Sigma_0^{-1}(\mu_1-\mu_2)}$.

$M_D = \{p_\theta;\ \theta = (\mu,\Sigma),\ \Sigma = \operatorname{diag}(\sigma_1^2,\sigma_2^2,\ldots,\sigma_n^2)\}$, $\theta_i = (\mu_{1i},\sigma_{1i},\mu_{2i},\sigma_{2i},\ldots,\mu_{ni},\sigma_{ni})$:
$d_D(\theta_1,\theta_2) = \sqrt{\sum_{i=1}^n d_F^2\big((\mu_{1i},\sigma_{1i}),(\mu_{2i},\sigma_{2i})\big)}$.

Distance in Totally Geodesic Submanifolds

$M_\mu = \{p_\theta;\ \theta = (\mu,\Sigma),\ \mu = \mu_0 \in \mathbb{R}^n \text{ constant}\}$, $\theta_i = (\mu_0,\Sigma_i)$:
$d_F(\theta_1,\theta_2) = \sqrt{\frac{1}{2}\sum_{i=1}^n[\log(\lambda_i)]^2}$, where the $\lambda_i$ are the eigenvalues of $\Sigma_1^{-1/2}\Sigma_2\Sigma_1^{-1/2}$.

$M_{D\mu} = \{p_\theta;\ \theta = (\mu,\Sigma),\ \mu \text{ an eigenvector of } \Sigma = \operatorname{diag}(\sigma_1^2,\sigma_2^2,\ldots,\sigma_n^2)\}$, $\theta_i = (\mu_{1i},\sigma_{1i},\sigma_{2i},\ldots,\sigma_{ni})$:
$d_{D\mu}(\theta_1,\theta_2) = \left(d_F^2\big((\mu_{11},\sigma_{11}),(\mu_{21},\sigma_{21})\big) + \sum_{i=2}^n d_F^2\big((0,\sigma_{1i}),(0,\sigma_{2i})\big)\right)^{1/2}$.

Distance Between Special Distributions in M

Distributions with common covariance matrices, $\theta_i = (\mu_i,\Sigma_0)$:
$d_F(\theta_1,\theta_2) = d_{D\mu}\big((0,D),(|\mu_2-\mu_1|e_1,D)\big)$, where $P$ is an orthogonal matrix such that $P(\mu_2-\mu_1) = |\mu_2-\mu_1|e_1$ and $P\Sigma_0P^t = UDU^t$.

Mirrored distributions, $\theta_1 = (\mu_1e_1,\Sigma_1)$ and $\theta_2 = (\mu_2e_1,\Sigma_2)$, with $\Sigma_2 = M_1\Sigma_1M_1$:
$d_F(\theta_1,\theta_2) = 2\sqrt{\sum_{l=1}^{n-1}b_l^2 + x^2}$, where $x$ and the $b_l$ are obtained from the solution of Equation (63).
