Article

On a Generalization of the Jensen–Shannon Divergence and the Jensen–Shannon Centroid

Frank Nielsen
Sony Computer Science Laboratories, Tokyo 141-0022, Japan
Submission received: 5 December 2019 / Revised: 14 February 2020 / Accepted: 14 February 2020 / Published: 16 February 2020

Abstract

The Jensen–Shannon divergence is a renowned bounded symmetrization of the Kullback–Leibler divergence which does not require the probability densities to have matching supports. In this paper, we introduce a vector-skew generalization of the scalar α-Jensen–Bregman divergences and derive from it the vector-skew α-Jensen–Shannon divergences. We prove that the vector-skew α-Jensen–Shannon divergences are f-divergences and study the properties of these novel divergences. Finally, we report an iterative algorithm to numerically compute the Jensen–Shannon-type centroids for a set of probability densities belonging to a mixture family: This includes the case of the Jensen–Shannon centroid of a set of categorical distributions or normalized histograms.


1. Introduction

Let (X, F, μ) be a measure space [1] where X denotes the sample space, F the σ-algebra of measurable events, and μ a positive measure; for example, the measure space defined by the Lebesgue measure μ_L with the Borel σ-algebra B(R^d) for X = R^d, or the measure space defined by the counting measure μ_c with the power set σ-algebra 2^X on a finite alphabet X. Denote by L¹(X, F, μ) the Lebesgue space of measurable functions, by P₁ the subspace of positive integrable functions f such that ∫_X f(x) dμ(x) = 1 and f(x) > 0 for all x ∈ X, and by P̄₁ the subspace of non-negative integrable functions f such that ∫_X f(x) dμ(x) = 1 and f(x) ≥ 0 for all x ∈ X.
We refer to the book of Deza and Deza [2] and the survey of Basseville [3] for an introduction to the many types of statistical divergences met in information sciences and their justifications. The Kullback–Leibler Divergence (KLD) KL : P₁ × P₁ → [0, ∞] is an oriented statistical distance (commonly called the relative entropy in information theory [4]) defined between two densities p and q (i.e., the Radon–Nikodym densities of μ-absolutely continuous probability measures P and Q) by
KL(p : q) := ∫ p log (p/q) dμ.
Although KL(p : q) ≥ 0 with equality iff. p = q μ-a.e. (Gibbs' inequality [4]), the KLD may diverge to infinity depending on the underlying densities. Since the KLD is asymmetric, several symmetrizations [5] have been proposed in the literature.
A well-grounded symmetrization of the KLD is the Jensen–Shannon Divergence [6] (JSD), also called capacitory discrimination in the literature (e.g., see [7]):
JS(p, q) := (1/2) [ KL(p : (p+q)/2) + KL(q : (p+q)/2) ]
= (1/2) ∫ [ p log (2p/(p+q)) + q log (2q/(p+q)) ] dμ = JS(q, p).
The Jensen–Shannon divergence can be interpreted as the total KL divergence to the average distribution (p+q)/2. The Jensen–Shannon divergence was historically implicitly introduced in [8] (Equation (19)) to calculate distances between random graphs. A nice feature of the Jensen–Shannon divergence is that this divergence can be applied to densities with arbitrary support (i.e., p, q ∈ P̄₁, with the conventions that 0 log 0 = 0 and 0 log (0/0) = 0); moreover, the JSD is always upper bounded by log 2. Let X_p = supp(p) and X_q = supp(q) denote the supports of the densities p and q, respectively, where supp(p) := {x ∈ X : p(x) > 0}. The JSD saturates to log 2 whenever the supports X_p and X_q are disjoint. We can rewrite the JSD as
JS(p, q) = h((p+q)/2) − (h(p) + h(q))/2,
where h(p) = −∫ p log p dμ denotes Shannon's entropy. Thus, the JSD can also be interpreted as the entropy of the average distribution minus the average of the entropies.
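To make the two equivalent expressions above concrete, here is a short numerical sketch (not part of the original article; it assumes NumPy, natural logarithms, and the 0 log 0 = 0 convention) that evaluates the JSD on example histograms and illustrates the saturation at log 2 for disjoint supports.

```python
# Illustration only: the JSD via its KL-to-average form and via its entropy form.
import numpy as np

def entropy(p):
    """Shannon entropy h(p) = -sum p log p, with the 0 log 0 = 0 convention."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

def kl(p, q):
    """KL(p:q) = sum p log(p/q); returns +inf if supp(p) is not contained in supp(q)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if np.any((p > 0) & (q == 0)):
        return np.inf
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / q[nz])))

def js(p, q):
    """JS(p,q) = (1/2) KL(p:m) + (1/2) KL(q:m) with m = (p+q)/2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.3, 0.2, 0.0])
q = np.array([0.0, 0.1, 0.4, 0.5])
print(js(p, q), entropy(0.5 * (p + q)) - 0.5 * (entropy(p) + entropy(q)))  # same value
print(js([1.0, 0.0], [0.0, 1.0]), np.log(2))  # disjoint supports: JSD saturates at log 2
```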
The square root of the JSD is a metric [9] satisfying the triangle inequality, but the square root of the Jeffreys divergence (JD, defined below) is not a metric (nor is any positive power of the Jeffreys divergence, see [10]). In fact, the square root of the JSD is a Hilbertian metric, meaning that there exists some isometric embedding of (X, √JS) into a Hilbert space [11,12]. Other principled symmetrizations of the KLD have been proposed in the literature: For example, Naghshvar et al. [13] proposed the extrinsic Jensen–Shannon divergence and demonstrated its use for variable-length coding over a discrete memoryless channel (DMC).
Another symmetrization of the KLD sometimes met in the literature [14,15,16] is the Jeffreys divergence [17,18] (JD) defined by
J(p, q) := KL(p : q) + KL(q : p) = ∫ (p − q) log (p/q) dμ = J(q, p).
However, we point out that this Jeffreys divergence lacks sound information-theoretical justifications.
For two positive but not necessarily normalized densities p ˜ and q ˜ , we define the extended Kullback–Leibler divergence as follows:
KL⁺(p̃ : q̃) := KL(p̃ : q̃) + ∫ q̃ dμ − ∫ p̃ dμ
= ∫ [ p̃ log (p̃/q̃) + q̃ − p̃ ] dμ.
The Jensen–Shannon divergence and the Jeffreys divergence can both be extended to positive (unnormalized) densities without changing their formula expressions:
JS⁺(p̃, q̃) := (1/2) [ KL⁺(p̃ : (p̃+q̃)/2) + KL⁺(q̃ : (p̃+q̃)/2) ]
= (1/2) [ KL(p̃ : (p̃+q̃)/2) + KL(q̃ : (p̃+q̃)/2) ] = JS(p̃, q̃),
J⁺(p̃, q̃) := KL⁺(p̃ : q̃) + KL⁺(q̃ : p̃) = ∫ (p̃ − q̃) log (p̃/q̃) dμ = J(p̃, q̃).
However, the extended JS⁺ divergence is upper-bounded by ((1/2) log 2) ∫ (p̃ + q̃) dμ = (1/2)(μ(p̃) + μ(q̃)) log 2, where μ(p̃) := ∫ p̃ dμ; this recovers the bound log 2 for normalized densities (i.e., when μ(p̃) + μ(q̃) = 2).
Let (pq)_α(x) := (1−α) p(x) + α q(x) denote the statistical weighted mixture with component densities p and q for α ∈ [0,1]. The asymmetric α-skew Jensen–Shannon divergence can be defined for a scalar parameter α ∈ (0,1) by considering the weighted mixture (pq)_α as follows:
JS_a^α(p : q) := (1−α) KL(p : (pq)_α) + α KL(q : (pq)_α)
= (1−α) ∫ p log (p/(pq)_α) dμ + α ∫ q log (q/(pq)_α) dμ.
Let us introduce the α-skew K-divergence [6,19] K α ( p : q ) by:
K_α(p : q) := KL(p : (1−α) p + α q) = KL(p : (pq)_α).
Then, both the Jensen–Shannon divergence and the Jeffreys divergence can be rewritten [20] using K α as follows:
JS(p, q) = (1/2) [ K_{1/2}(p : q) + K_{1/2}(q : p) ],
J(p, q) = K_1(p : q) + K_1(q : p),
since (pq)_1 = q, KL(p : q) = K_1(p : q), and (pq)_{1/2} = (qp)_{1/2}.
We can thus define the symmetric α-skew Jensen–Shannon divergence [20] for α ( 0 , 1 ) as follows:
JS_α(p, q) := (1/2) K_α(p : q) + (1/2) K_α(q : p) = JS_α(q, p).
The ordinary Jensen–Shannon divergence is recovered for α = 1/2.
In general, skewing divergences (e.g., using the divergence K α instead of the KLD) have been experimentally shown to perform better in applications like in some natural language processing (NLP) tasks [21].
The α-Jensen–Shannon divergences are Csiszár f-divergences [22,23,24]. An f-divergence is defined for a convex function f, strictly convex at 1 and satisfying f(1) = 0, as:
I_f(p : q) = ∫ q(x) f(p(x)/q(x)) dx ≥ f(1) = 0.
We can always symmetrize f-divergences by taking the conjugate convex function f*(x) = x f(1/x) (related to the perspective function): I_{f+f*}(p, q) is a symmetric divergence. The f-divergences are convex statistical distances which are provably the only separable invariant divergences in information geometry [25], except for binary alphabets X (see [26]).
The Jeffreys divergence is an f-divergence for the generator f(x) = (x − 1) log x, and the α-Jensen–Shannon divergences are f-divergences for the generator family f_α(x) = −log((1−α) + αx) − x log((1−α) + α/x). The f-divergences are upper-bounded by f(0) + f*(0). Thus, the f-divergences are finite when f(0) + f*(0) < ∞.
The main contributions of this paper are summarized as follows:
  • First, we generalize the Jensen–Bregman divergence by skewing a weighted separable Jensen–Bregman divergence with a k-dimensional vector α [ 0 , 1 ] k in Section 2. This yields a generalization of the symmetric skew α -Jensen–Shannon divergences to a vector-skew parameter. This extension retains the key properties for being upper-bounded and for application to densities with potentially different supports. The proposed generalization also allows one to grasp a better understanding of the “mechanism” of the Jensen–Shannon divergence itself. We also show how to directly obtain the weighted vector-skew Jensen–Shannon divergence from the decomposition of the KLD as the difference of the cross-entropy minus the entropy (i.e., KLD as the relative entropy).
  • Second, we prove that weighted vector-skew Jensen–Shannon divergences are f-divergences (Theorem 1), and show how to build families of symmetric Jensen–Shannon-type divergences which can be controlled by a vector of parameters in Section 2.3, generalizing the work of [20] from scalar skewing to vector skewing. This may prove useful in applications by providing additional tuning parameters (which can be set, for example, by using cross-validation techniques).
  • Third, we consider the calculation of the Jensen–Shannon centroids in Section 3 for densities belonging to mixture families. Mixture families include the family of categorical distributions and the family of statistical mixtures sharing the same prescribed components. Mixture families are well-studied manifolds in information geometry [25]. We show how to compute the Jensen–Shannon centroid using a concave–convex numerical iterative optimization procedure [27]. The experimental results graphically compare the Jeffreys centroid with the Jensen–Shannon centroid for grey-valued image histograms.

2. Extending the Jensen–Shannon Divergence

2.1. Vector-Skew Jensen–Bregman Divergences and Jensen Diversities

Recall our notational shortcut: (ab)_α := (1−α) a + α b. For a k-dimensional vector α ∈ [0,1]^k, a weight vector w belonging to the (k−1)-dimensional open simplex Δ_k, and a scalar γ ∈ (0,1), let us define the following vector-skew α-Jensen–Bregman divergence (α-JBD) following [28]:
JB_F^{α,γ,w}(θ₁ : θ₂) := Σ_{i=1}^k w_i B_F((θ₁θ₂)_{α_i} : (θ₁θ₂)_γ) ≥ 0,
where B F is the Bregman divergence [29] induced by a strictly convex and smooth generator F:
B_F(θ₁ : θ₂) := F(θ₁) − F(θ₂) − ⟨θ₁ − θ₂, ∇F(θ₂)⟩,
with ⟨·,·⟩ denoting the Euclidean inner product ⟨x, y⟩ = x^⊤ y (dot product). Expanding the Bregman divergence formulas in the expression of the α-JBD and using the fact that
(θ₁θ₂)_{α_i} − (θ₁θ₂)_γ = (γ − α_i)(θ₁ − θ₂),
we get the following expression:
JB_F^{α,γ,w}(θ₁ : θ₂) = Σ_{i=1}^k w_i F((θ₁θ₂)_{α_i}) − F((θ₁θ₂)_γ) − ⟨ Σ_{i=1}^k w_i (γ − α_i)(θ₁ − θ₂), ∇F((θ₁θ₂)_γ) ⟩.
The inner product term of Equation (21) vanishes when
γ = Σ_{i=1}^k w_i α_i =: ᾱ.
Thus, when γ = ᾱ (assuming at least two distinct components in α so that γ ∈ (0,1)), we get the simplified formula for the vector-skew α-JBD:
JB_F^{α,w}(θ₁ : θ₂) = Σ_{i=1}^k w_i F((θ₁θ₂)_{α_i}) − F((θ₁θ₂)_{ᾱ}).
This vector-skew Jensen–Bregman divergence is always finite and amounts to a Jensen diversity [30] J F induced by Jensen’s inequality gap:
JB_F^{α,w}(θ₁ : θ₂) = J_F((θ₁θ₂)_{α_1}, …, (θ₁θ₂)_{α_k}; w_1, …, w_k) := Σ_{i=1}^k w_i F((θ₁θ₂)_{α_i}) − F((θ₁θ₂)_{ᾱ}) ≥ 0.
The Jensen diversity is a quantity which arises as a generalization of the cluster variance when clustering with Bregman divergences instead of the ordinary squared Euclidean distance; see [29,30] for details. In the context of Bregman clustering, the Jensen diversity has been called the Bregman information [29] and motivated by rate distortion theory: Bregman information measures the minimum expected loss when encoding a set of points using a single point when the loss is measured using a Bregman divergence. In general, a k-point measure is called a diversity measure (for k > 2 ), while a distance/divergence is the special case of a 2-point measure.
Conversely, in 1D, we may start from Jensen’s inequality for a strictly convex function F:
Σ_{i=1}^k w_i F(θ_i) ≥ F(Σ_{i=1}^k w_i θ_i).
Let us notationally write [k] := {1, …, k}, and define θ_m := min_{i ∈ [k]} θ_i and θ_M := max_{i ∈ [k]} θ_i > θ_m (i.e., assuming at least two distinct values). We have the barycenter θ̄ = Σ_i w_i θ_i =: (θ_m θ_M)_γ, which can be interpreted as the linear interpolation of the extremal values for some γ ∈ (0,1). Let us write θ_i = (θ_m θ_M)_{α_i} for i ∈ [k] and proper values of the α_i's. Then, it follows that
θ̄ = Σ_i w_i θ_i
= Σ_i w_i (θ_m θ_M)_{α_i}
= Σ_i w_i ((1 − α_i) θ_m + α_i θ_M)
= (1 − Σ_i w_i α_i) θ_m + (Σ_i w_i α_i) θ_M
= (θ_m θ_M)_{Σ_i w_i α_i} = (θ_m θ_M)_γ,
so that γ = Σ_i w_i α_i = ᾱ.
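As a sanity check of the derivation above, the following sketch (illustration only; any strictly convex generator would do, here a negentropy-type F on positive vectors) verifies numerically that the inner-product term vanishes for γ = ᾱ, so that the vector-skew Jensen–Bregman divergence coincides with the Jensen diversity.

```python
# Numerical check: JB_F^{alpha,abar,w} (Bregman form) equals the Jensen diversity.
import numpy as np

def F(theta):                 # negentropy-type convex generator on positive vectors
    return float(np.sum(theta * np.log(theta)))

def gradF(theta):
    return np.log(theta) + 1.0

def bregman(t1, t2):
    return F(t1) - F(t2) - float(np.dot(t1 - t2, gradF(t2)))

def mix(t1, t2, a):           # (theta1 theta2)_a = (1-a) theta1 + a theta2
    return (1.0 - a) * t1 + a * t2

rng = np.random.default_rng(0)
theta1, theta2 = rng.uniform(0.1, 1.0, 5), rng.uniform(0.1, 1.0, 5)
alpha = np.array([0.1, 0.6, 0.9])
w = np.array([0.5, 0.3, 0.2])
abar = float(np.dot(w, alpha))

jb = sum(wi * bregman(mix(theta1, theta2, ai), mix(theta1, theta2, abar))
         for wi, ai in zip(w, alpha))
jensen_div = (sum(wi * F(mix(theta1, theta2, ai)) for wi, ai in zip(w, alpha))
              - F(mix(theta1, theta2, abar)))
print(jb, jensen_div)         # equal up to floating-point error
```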

2.2. Vector-Skew Jensen–Shannon Divergences

Let f(x) = x log x − x be a smooth and strictly convex function on (0, ∞). Then, the Bregman divergence induced by this univariate generator is
B_f(p : q) = p log (p/q) + q − p = kl⁺(p : q),
the extended scalar Kullback–Leibler divergence.
We extend the scalar-skew Jensen–Shannon divergence as follows: JS_{α,w}(p : q) := JB_{−h}^{α,ᾱ,w}(p : q), taking as generator −h, where h is the Shannon entropy [4] (a strictly concave function [4]).
Definition 1
(Weighted vector-skew (α, w)-Jensen–Shannon divergence). For a vector α ∈ [0,1]^k and a unit positive weight vector w ∈ Δ_k, the (α, w)-Jensen–Shannon divergence between two densities p, q ∈ P̄₁ is defined by:
JS_{α,w}(p : q) := Σ_{i=1}^k w_i KL((pq)_{α_i} : (pq)_{ᾱ}) = h((pq)_{ᾱ}) − Σ_{i=1}^k w_i h((pq)_{α_i}),
with ᾱ = Σ_{i=1}^k w_i α_i, where h(p) = −∫ p(x) log p(x) dμ(x) denotes the Shannon entropy [4] (so that −h is strictly convex).
This definition generalizes the ordinary JSD; we recover the ordinary Jensen–Shannon divergence when k = 2, α_1 = 0, α_2 = 1, and w_1 = w_2 = 1/2 with ᾱ = 1/2: JS(p, q) = JS_{(0,1),(1/2,1/2)}(p : q).
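A minimal sketch of Definition 1 for discrete distributions (illustration only; it assumes NumPy, strictly positive example histograms, and natural logarithms): the ordinary JSD is recovered for α = (0, 1) and w = (1/2, 1/2), and the entropy identity of the right-hand side is checked on a genuinely vectorial example.

```python
# Illustration of the (alpha, w)-Jensen-Shannon divergence on discrete distributions.
import numpy as np

def entropy(p):
    return float(-np.sum(p * np.log(p)))              # assumes p > 0 elementwise

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))           # assumes p, q > 0 elementwise

def js_skew(p, q, alpha, w):
    """JS_{alpha,w}(p:q) = sum_i w_i KL((pq)_{alpha_i} : (pq)_{abar})."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    abar = float(np.dot(w, alpha))
    mid = (1 - abar) * p + abar * q
    return sum(wi * kl((1 - ai) * p + ai * q, mid) for ai, wi in zip(alpha, w))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.1, 0.8])
m = 0.5 * (p + q)
print(js_skew(p, q, [0.0, 1.0], [0.5, 0.5]),           # vector-skew JSD with alpha=(0,1)
      0.5 * kl(p, m) + 0.5 * kl(q, m))                 # ordinary JSD: same value

alpha, w = np.array([0.0, 1.0, 1.0 / 3]), np.array([1 / 3, 1 / 3, 1 / 3])
abar = float(np.dot(w, alpha))
lhs = js_skew(p, q, alpha, w)
rhs = entropy((1 - abar) * p + abar * q) - sum(
    wi * entropy((1 - ai) * p + ai * q) for ai, wi in zip(alpha, w))
print(lhs, rhs)                                        # same value
```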
Let KL_{α,β}(p : q) := KL((pq)_α : (pq)_β). Then, we have KL_{α,β}(q : p) = KL_{1−α,1−β}(p : q). Using this (α, β)-KLD, we have the following identity:
JS_{α,w}(p : q) = Σ_{i=1}^k w_i KL_{α_i,ᾱ}(p : q)
= Σ_{i=1}^k w_i KL_{1−α_i,1−ᾱ}(q : p) = JS_{1_k − α, w}(q : p),
since Σ_{i=1}^k w_i (1 − α_i) = 1 − ᾱ (the mean skewing value of 1_k − α), where 1_k = (1, …, 1) is a k-dimensional vector of ones.
A very interesting property is that the vector-skew Jensen–Shannon divergences are f-divergences [22].
Theorem 1.
The vector-skew Jensen–Shannon divergences JS_{α,w}(p : q) are f-divergences for the generator f_{α,w}(u) = Σ_{i=1}^k w_i (α_i u + (1 − α_i)) log [((1 − α_i) + α_i u) / ((1 − ᾱ) + ᾱ u)], with ᾱ = Σ_{i=1}^k w_i α_i.
Proof. 
First, let us observe that the positively weighted sum of f-divergences is an f-divergence: Σ_{i=1}^k w_i I_{f_i}(p : q) = I_f(p : q) for the generator f(u) = Σ_{i=1}^k w_i f_i(u).
Now, let us express the divergence KL α , β ( p : q ) as an f-divergence:
KL_{α,β}(p : q) = I_{f_{α,β}}(p : q),
with generator
f_{α,β}(u) = (α u + 1 − α) log [((1 − α) + α u) / ((1 − β) + β u)].
Thus, it follows that
JS_{α,w}(p : q) = Σ_{i=1}^k w_i KL((pq)_{α_i} : (pq)_{ᾱ})
= Σ_{i=1}^k w_i I_{f_{α_i,ᾱ}}(p : q)
= I_{Σ_{i=1}^k w_i f_{α_i,ᾱ}}(p : q).
Therefore, the vector-skew Jensen–Shannon divergence is an f-divergence for the following generator:
f_{α,w}(u) = Σ_{i=1}^k w_i (α_i u + (1 − α_i)) log [((1 − α_i) + α_i u) / ((1 − ᾱ) + ᾱ u)],
where ᾱ = Σ_{i=1}^k w_i α_i.
When α = ( 0 , 1 ) and w = ( 1 2 , 1 2 ) , we recover the f-divergence generator for the JSD:
f_JS(u) = (1/2) log [1 / (1/2 + u/2)] + (u/2) log [u / (1/2 + u/2)]
= (1/2) log [2 / (1 + u)] + (u/2) log [2u / (1 + u)].
Observe that f*_{α,w}(u) = u f_{α,w}(1/u) = f_{1_k − α, w}(u), where 1_k − α := (1 − α_1, …, 1 − α_k).
We also refer the reader to Theorem 4.1 of [31], which defines skew f-divergences from any f-divergence.  □
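As a quick numerical illustration of the theorem (not taken from the paper; strictly positive example histograms and natural logarithms are assumed), the JSD computed directly matches the f-divergence I_f(p : q) = Σ_x q(x) f(p(x)/q(x)) evaluated with the generator f_JS above.

```python
# Illustration: the Jensen-Shannon divergence as a Csiszar f-divergence with f_JS.
import numpy as np

def f_js(u):
    return 0.5 * np.log(2.0 / (1.0 + u)) + 0.5 * u * np.log(2.0 * u / (1.0 + u))

def f_divergence(p, q, f):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))                 # assumes p, q > 0 elementwise

def js(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return float(0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m)))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.6, 0.1, 0.3])
print(f_divergence(p, q, f_js), js(p, q))              # same value
```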
Remark 1.
Since the vector-skew Jensen divergence is an f-divergence, we easily obtain Fano and Pinsker inequalities following [32], or reverse Pinsker inequalities following [33,34] (i.e., upper bounds for the vector-skew Jensen divergences using the total variation metric distance), data processing inequalities using [35], etc.
Next, we show that KL_{α,β} (and JS_{α,w}) are separable convex divergences. Since the f-divergences are separable convex, the KL_{α,β} divergences and the JS_{α,w} divergences are separable convex. For the sake of completeness, we report a simple explicit proof below.
Theorem 2
(Separable convexity). The divergence KL_{α,β}(p : q) is strictly separable convex for α ≠ β and x ∈ X_p ∩ X_q.
Proof. 
Let us calculate the second partial derivative of KL α , β ( x : y ) with respect to x, and show that it is strictly positive:
∂²/∂x² KL_{α,β}(x : y) = (β − α)² y² / [ (xy)_α ((xy)_β)² ] > 0,
for x, y > 0. Thus, KL_{α,β} is strictly convex in the left argument. Similarly, since KL_{α,β}(y : x) = KL_{1−α,1−β}(x : y), we deduce that KL_{α,β} is strictly convex in the right argument. Therefore, the divergence KL_{α,β} is separable convex.  □
It follows that the divergence JS_{α,w}(p : q) is strictly separable convex, since it is a convex combination of weighted KL_{α_i,ᾱ} divergences.
Another way to derive the vector-skew JSD is to decompose the KLD as the difference of the cross-entropy h × minus the entropy h (i.e., KLD is also called the relative entropy):
KL(p : q) = h^×(p : q) − h(p),
where h^×(p : q) := −∫ p log q dμ and h(p) := h^×(p : p) (self cross-entropy). Since α_1 h^×(p_1 : q) + α_2 h^×(p_2 : q) = h^×(α_1 p_1 + α_2 p_2 : q) (for α_2 = 1 − α_1), it follows that
JS_{α,w}(p : q) := Σ_{i=1}^k w_i KL((pq)_{α_i} : (pq)_γ)
= Σ_{i=1}^k w_i [ h^×((pq)_{α_i} : (pq)_γ) − h((pq)_{α_i}) ]
= h^×( Σ_{i=1}^k w_i (pq)_{α_i} : (pq)_γ ) − Σ_{i=1}^k w_i h((pq)_{α_i}).
Here, the "trick" is to choose γ = ᾱ in order to "convert" the cross-entropy into an entropy: h^×( Σ_{i=1}^k w_i (pq)_{α_i} : (pq)_γ ) = h((pq)_{ᾱ}) when γ = ᾱ. Then, we end up with
JS_{α,w}(p : q) = h((pq)_{ᾱ}) − Σ_{i=1}^k w_i h((pq)_{α_i}).
When α = (α_1, α_2) with α_1 = 0 and α_2 = 1, and w = (w_1, w_2) = (1/2, 1/2), we have ᾱ = 1/2, and we recover the Jensen–Shannon divergence:
JS(p : q) = h((p + q)/2) − (h(p) + h(q))/2.
Notice that Equation (13) is the usual definition of the Jensen–Shannon divergence, while Equation (48) is the reduced formula of the JSD, which can be interpreted as a Jensen gap for Shannon entropy, hence its name: The Jensen–Shannon divergence.
Moreover, if we consider the cross-entropy/entropy extended to positive densities p ˜ and q ˜ :
h⁺^×(p̃ : q̃) = ∫ (−p̃ log q̃ + q̃) dμ,   h⁺(p̃) = h⁺^×(p̃ : p̃) = ∫ (−p̃ log p̃ + p̃) dμ,
we get:
JS⁺_{α,w}(p̃ : q̃) = Σ_{i=1}^k w_i KL⁺((p̃q̃)_{α_i} : (p̃q̃)_{ᾱ}) = h⁺((p̃q̃)_{ᾱ}) − Σ_{i=1}^k w_i h⁺((p̃q̃)_{α_i}).
Next, we shall prove that our generalization of the skew Jensen–Shannon divergence to vector-skewing is always bounded. We first start by a lemma bounding the KLD between two mixtures sharing the same components:
Lemma 1
(KLD between two w-mixtures). For α ∈ [0,1] and β ∈ (0,1), we have:
KL_{α,β}(p : q) = KL((pq)_α : (pq)_β) ≤ log max { (1−α)/(1−β), α/β }.
Proof. 
For p(x), q(x) > 0, we have
[(1−α) p(x) + α q(x)] / [(1−β) p(x) + β q(x)] ≤ max { (1−α)/(1−β), α/β }.
Indeed, by considering the two cases α ≤ β (or equivalently, 1−α ≥ 1−β) and α ≥ β (or equivalently, 1−α ≤ 1−β), we check that (1−α) p(x) ≤ max{(1−α)/(1−β), α/β} (1−β) p(x) and α q(x) ≤ max{(1−α)/(1−β), α/β} β q(x). Thus, we have [(1−α) p(x) + α q(x)] / [(1−β) p(x) + β q(x)] ≤ max{(1−α)/(1−β), α/β}. Therefore, it follows that:
KL((pq)_α : (pq)_β) ≤ ∫ (pq)_α log max{(1−α)/(1−β), α/β} dμ = log max{(1−α)/(1−β), α/β}.
Notice that we can interpret log max{(1−α)/(1−β), α/β} = max{ log [(1−α)/(1−β)], log (α/β) } as the ∞-Rényi divergence [36,37] between the two two-point distributions (α, 1−α) and (β, 1−β). See Theorem 6 of [36].
A weaker upper bound is KL((pq)_α : (pq)_β) ≤ log [1/(β(1−β))]. Indeed, let us form a partition of the sample space X into two dominance regions:
  • R_p := {x ∈ X : q(x) ≤ p(x)} and
  • R_q := {x ∈ X : q(x) > p(x)}.
We have (pq)_α(x) = (1−α) p(x) + α q(x) ≤ p(x) for x ∈ R_p and (pq)_α(x) ≤ q(x) for x ∈ R_q. It follows that
KL((pq)_α : (pq)_β) ≤ ∫_{R_p} (pq)_α(x) log [p(x)/((1−β) p(x))] dμ(x) + ∫_{R_q} (pq)_α(x) log [q(x)/(β q(x))] dμ(x).
That is, KL((pq)_α : (pq)_β) ≤ −log(1−β) − log β = log [1/(β(1−β))]. Notice that we allow α ∈ {0, 1} but do not allow β to take the extreme values (i.e., β ∈ (0,1)).  □
In fact, it is known that for both α, β ∈ (0,1), computing KL((pq)_α : (pq)_β) amounts to computing a Bregman divergence for the Shannon negentropy generator, since {(pq)_γ : γ ∈ (0,1)} defines a mixture family [38] of order 1 in information geometry. Hence, it is always finite, as Bregman divergences are always finite (but not necessarily bounded).
By using the fact that
JS_{α,w}(p : q) = Σ_{i=1}^k w_i KL((pq)_{α_i} : (pq)_{ᾱ}),
we conclude that the vector-skew Jensen–Shannon divergence is upper-bounded:
Lemma 2
(Bounded (w, α)-Jensen–Shannon divergence). JS_{α,w} is bounded by log [1/(ᾱ(1−ᾱ))], where ᾱ = Σ_{i=1}^k w_i α_i ∈ (0,1).
Proof. 
We have JS_{α,w}(p : q) = Σ_i w_i KL((pq)_{α_i} : (pq)_{ᾱ}). Since 0 ≤ KL((pq)_{α_i} : (pq)_{ᾱ}) ≤ log [1/(ᾱ(1−ᾱ))], it follows that we have
0 ≤ JS_{α,w}(p : q) ≤ log [1/(ᾱ(1−ᾱ))].
Notice that we also have
JS_{α,w}(p : q) ≤ Σ_i w_i log max { (1−α_i)/(1−ᾱ), α_i/ᾱ }.
 □
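A quick numerical check of Lemma 2 (illustration only, with NumPy): for the worst case of disjoint supports and the ordinary JSD parameters α = (0, 1), w = (1/2, 1/2), the divergence equals log 2, which matches the per-component bound above and lies below the looser bound log [1/(ᾱ(1−ᾱ))] = log 4.

```python
# Illustration: the two upper bounds of Lemma 2 on distributions with disjoint supports.
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / q[nz])))

def js_skew(p, q, alpha, w):
    p, q = np.asarray(p, float), np.asarray(q, float)
    abar = float(np.dot(w, alpha))
    mid = (1 - abar) * p + abar * q
    return sum(wi * kl((1 - ai) * p + ai * q, mid) for ai, wi in zip(alpha, w))

p, q = np.array([1.0, 0.0]), np.array([0.0, 1.0])      # disjoint supports
alpha, w = np.array([0.0, 1.0]), np.array([0.5, 0.5])
abar = float(np.dot(w, alpha))
value = js_skew(p, q, alpha, w)
per_component = sum(wi * np.log(max((1 - ai) / (1 - abar), ai / abar))
                    for ai, wi in zip(alpha, w))
loose = np.log(1.0 / (abar * (1 - abar)))
print(value, per_component, loose)                     # log 2 <= log 2 <= log 4
```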
The vector-skew Jensen–Shannon divergence is symmetric if and only if for each index i ∈ [k] there exists a matching index σ(i) such that α_{σ(i)} = 1 − α_i and w_{σ(i)} = w_i.
For example, we may define the symmetric scalar α-skew Jensen–Shannon divergence as
JS_s^α(p, q) = (1/2) KL((pq)_α : (pq)_{1/2}) + (1/2) KL((pq)_{1−α} : (pq)_{1/2})
= (1/2) ∫ (pq)_α log [(pq)_α / (pq)_{1/2}] dμ + (1/2) ∫ (pq)_{1−α} log [(pq)_{1−α} / (pq)_{1/2}] dμ
= (1/2) ∫ (qp)_{1−α} log [(qp)_{1−α} / (qp)_{1/2}] dμ + (1/2) ∫ (qp)_α log [(qp)_α / (qp)_{1/2}] dμ
= h((pq)_{1/2}) − [ h((pq)_α) + h((pq)_{1−α}) ] / 2
=: JS_s^α(q, p),
since it holds that (ab)_c = (ba)_{1−c} for any a, b, c ∈ R. Note that JS_s^α(p, q) ≠ JS_α(p, q) in general.
Remark 2.
We can always symmetrize a vector-skew Jensen–Shannon divergence by doubling the dimension of the skewing vector. Let α = (α_1, …, α_k) and w be the vector parameters of an asymmetric vector-skew JSD, and consider α′ = (1 − α_1, …, 1 − α_k) and w to be the parameters of JS_{α′,w}. Then, JS_{(α,α′),(w/2,w/2)} is a symmetric skew-vector JSD:
JS_{(α,α′),(w/2,w/2)}(p : q) := (1/2) JS_{α,w}(p : q) + (1/2) JS_{α′,w}(p : q)
= (1/2) JS_{α,w}(p : q) + (1/2) JS_{α,w}(q : p) = JS_{(α,α′),(w/2,w/2)}(q : p).
Since the vector-skew Jensen–Shannon divergence is an f-divergence for the generator f_{α,w} (Theorem 1), we can take the generator f^s_{α,w}(u) = [ f_{α,w}(u) + f*_{α,w}(u) ] / 2 to define the symmetrized f-divergence, where f*_{α,w}(u) = u f_{α,w}(1/u) denotes the convex conjugate function. When f_{α,w} yields a symmetric f-divergence I_{f_{α,w}}, we can apply the generic upper bound of f-divergences (i.e., I_f ≤ f(0) + f*(0)) to get the upper bound on the symmetric vector-skew Jensen–Shannon divergences:
I_{f_{α,w}}(p : q) ≤ f_{α,w}(0) + f*_{α,w}(0)
= Σ_{i=1}^k w_i [ (1 − α_i) log ((1 − α_i)/(1 − ᾱ)) + α_i log (α_i/ᾱ) ],
since
f*_{α,w}(u) = u f_{α,w}(1/u)
= Σ_{i=1}^k w_i ((1 − α_i) u + α_i) log [((1 − α_i) u + α_i) / ((1 − ᾱ) u + ᾱ)].
For example, consider the ordinary Jensen–Shannon divergence with w = (1/2, 1/2) and α = (0, 1). Then, we find JS(p, q) = I_{f_{(0,1),(1/2,1/2)}}(p : q) ≤ (1/2) log 2 + (1/2) log 2 = log 2, the usual upper bound of the JSD.
As a side note, let us notice that our notation ( p q ) α allows one to compactly write the following property:
Property 1.
We have q = (qq)_λ for any λ ∈ [0,1], and ((p_1 p_2)_λ (q_1 q_2)_λ)_α = ((p_1 q_1)_α (p_2 q_2)_α)_λ for any α, λ ∈ [0,1].
Proof. 
Clearly, q = (1−λ) q + λ q =: (qq)_λ for any λ ∈ [0,1]. Now, we have
((p_1 p_2)_λ (q_1 q_2)_λ)_α = (1−α)(p_1 p_2)_λ + α (q_1 q_2)_λ
= (1−α)((1−λ) p_1 + λ p_2) + α ((1−λ) q_1 + λ q_2)
= (1−λ)((1−α) p_1 + α q_1) + λ ((1−α) p_2 + α q_2)
= (1−λ)(p_1 q_1)_α + λ (p_2 q_2)_α
= ((p_1 q_1)_α (p_2 q_2)_α)_λ.
 □

2.3. Building Symmetric Families of Vector-Skewed Jensen–Shannon Divergences

We can build infinitely many vector-skew Jensen–Shannon divergences. For example, consider α = (0, 1, 1/3) and w = (1/3, 1/3, 1/3). Then, ᾱ = 1/3 + 1/9 = 4/9, and
JS_{α,w}(p : q) = h((pq)_{4/9}) − [ h(p) + h(q) + h((pq)_{1/3}) ] / 3 ≠ JS_{α,w}(q : p).
Interestingly, we can also build infinitely many families of symmetric vector-skew Jensen–Shannon divergences. For example, consider these two examples that illustrate the construction process:
  • Consider k = 2. Let (w, 1−w) denote the weight vector, and α = (α_1, α_2) the skewing vector. We have ᾱ = w α_1 + (1−w) α_2 = α_2 + w (α_1 − α_2). The vector-skew JSD is symmetric iff. w = 1 − w = 1/2 (with ᾱ = (α_1 + α_2)/2) and α_2 = 1 − α_1. In that case, we have ᾱ = 1/2, and we obtain the following family of symmetric Jensen–Shannon divergences:
    JS_{(α,1−α),(1/2,1/2)}(p, q) = h((pq)_{1/2}) − [ h((pq)_α) + h((pq)_{1−α}) ] / 2
    = h((pq)_{1/2}) − [ h((pq)_α) + h((qp)_α) ] / 2 = JS_{(α,1−α),(1/2,1/2)}(q, p).
  • Consider k = 4, weight vector w = (1/3, 1/3, 1/6, 1/6), and skewing vector α = (α_1, 1−α_1, α_2, 1−α_2) for α_1, α_2 ∈ (0,1). Then, ᾱ = 1/2, and we get the following family of symmetric vector-skew JSDs:
    JS_{(α_1,α_2)}(p, q) = h((pq)_{1/2}) − [ 2 h((pq)_{α_1}) + 2 h((pq)_{1−α_1}) + h((pq)_{α_2}) + h((pq)_{1−α_2}) ] / 6
    = h((pq)_{1/2}) − [ 2 h((pq)_{α_1}) + 2 h((qp)_{α_1}) + h((pq)_{α_2}) + h((qp)_{α_2}) ] / 6
    = JS_{(α_1,α_2)}(q, p).
  • We can similarly carry on the construction of such symmetric JSDs by increasing the dimensionality of the skewing vector.
In fact, we can define
JS_s^{α,w}(p, q) := h((pq)_{1/2}) − Σ_{i=1}^k w_i [ h((pq)_{α_i}) + h((pq)_{1−α_i}) ] / 2 = Σ_{i=1}^k w_i JS_s^{α_i}(p, q),
with
JS_s^α(p, q) := h((pq)_{1/2}) − [ h((pq)_α) + h((pq)_{1−α}) ] / 2.
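The following sketch (illustration only; strictly positive histograms and natural logarithms assumed) implements the symmetric family JS_s^{α,w} above through the entropies of skewed mixtures, checks its symmetry numerically, and shows that the singleton α = (0) with w = (1) recovers the ordinary JSD.

```python
# Illustration: symmetric vector-skew Jensen-Shannon divergences JS_s^{alpha,w}.
import numpy as np

def entropy(p):
    return float(-np.sum(p * np.log(p)))               # assumes p > 0 elementwise

def js_sym(p, q, alpha, w):
    """JS_s^{alpha,w}(p,q) = h((p+q)/2) - sum_i w_i [h((pq)_{a_i}) + h((pq)_{1-a_i})]/2."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    h_mix = lambda a: entropy((1 - a) * p + a * q)
    return entropy(0.5 * (p + q)) - sum(
        wi * 0.5 * (h_mix(ai) + h_mix(1 - ai)) for ai, wi in zip(alpha, w))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.1, 0.8])
alpha, w = [0.2, 0.7], [0.4, 0.6]
print(js_sym(p, q, alpha, w), js_sym(q, p, alpha, w))  # symmetric in (p, q)
print(js_sym(p, q, [0.0], [1.0]))                      # reduces to the ordinary JSD
```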

3. Jensen–Shannon Centroids on Mixture Families

3.1. Mixture Families and Jensen–Shannon Divergences

Consider a mixture family in information geometry [25]. That is, let us give a prescribed set of D + 1 linearly independent probability densities p 0 ( x ) , , p D ( x ) defined on the sample space X . A mixture family M of order D consists of all strictly convex combinations of these component densities:
M := { m(x; θ) := Σ_{i=1}^D θ_i p_i(x) + (1 − Σ_{i=1}^D θ_i) p_0(x) : θ_i > 0, Σ_{i=1}^D θ_i < 1 }.
For example, the family of categorical distributions (sometimes called “multinouilli” distributions) is a mixture family [25]:
M = { m_θ(x) = Σ_{i=1}^D θ_i δ(x − x_i) + (1 − Σ_{i=1}^D θ_i) δ(x − x_0) },
where δ(x) is the Dirac distribution (i.e., δ(x) = 1 for x = 0 and δ(x) = 0 for x ≠ 0). Note that the mixture family of categorical distributions can also be interpreted as an exponential family.
Notice that the linear independence assumption on the probability densities ensures an identifiable model: θ ↔ m(x; θ).
The KL divergence between two densities of a mixture family M amounts to a Bregman divergence for the Shannon negentropy generator F(θ) = −h(m_θ) (see [38]):
KL(m_{θ_1} : m_{θ_2}) = B_F(θ_1 : θ_2) = B_{−h(m_θ)}(θ_1 : θ_2).
On a mixture manifold M, the mixture density (1−α) m_{θ_1} + α m_{θ_2} of two mixtures m_{θ_1} and m_{θ_2} of M also belongs to M:
(1−α) m_{θ_1} + α m_{θ_2} = m_{(θ_1 θ_2)_α} ∈ M,
where we extend the notation (θ_1 θ_2)_α := (1−α) θ_1 + α θ_2 to vectors θ_1 and θ_2 componentwise: ((θ_1 θ_2)_α)^i = (θ_1^i θ_2^i)_α.
Thus, the vector-skew JSD amounts to a vector-skew Jensen diversity for the Shannon negentropy convex function F(θ) = −h(m_θ):
JS_{α,w}(m_{θ_1} : m_{θ_2}) = Σ_{i=1}^k w_i KL((m_{θ_1} m_{θ_2})_{α_i} : (m_{θ_1} m_{θ_2})_{ᾱ})
= Σ_{i=1}^k w_i KL(m_{(θ_1 θ_2)_{α_i}} : m_{(θ_1 θ_2)_{ᾱ}})
= Σ_{i=1}^k w_i B_F((θ_1 θ_2)_{α_i} : (θ_1 θ_2)_{ᾱ})
= JB_F^{α,ᾱ,w}(θ_1 : θ_2)
= Σ_{i=1}^k w_i F((θ_1 θ_2)_{α_i}) − F((θ_1 θ_2)_{ᾱ})
= h(m_{(θ_1 θ_2)_{ᾱ}}) − Σ_{i=1}^k w_i h(m_{(θ_1 θ_2)_{α_i}}).
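These identities can be checked numerically on the family of categorical distributions. The sketch below (illustration only; θ stacks the first D mixture weights, with the last weight 1 − Σθ_i) verifies that the KLD between two categorical distributions equals the Bregman divergence for the negentropy generator F, and that JS_{α,w} equals the corresponding Jensen diversity on the parameters.

```python
# Numerical check on categorical distributions (d = D+1 atoms).
import numpy as np

def full(theta):                       # append theta_0 = 1 - sum(theta)
    return np.append(theta, 1.0 - np.sum(theta))

def F(theta):                          # Shannon negentropy of m_theta
    p = full(theta)
    return float(np.sum(p * np.log(p)))

def gradF(theta):
    p = full(theta)
    return np.log(p[:-1] / p[-1])

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))   # assumes strictly positive histograms

theta1, theta2 = np.array([0.2, 0.3]), np.array([0.5, 0.1])   # D = 2, d = 3
p1, p2 = full(theta1), full(theta2)

bregman = F(theta1) - F(theta2) - float(np.dot(theta1 - theta2, gradF(theta2)))
print(kl(p1, p2), bregman)             # equal

alpha, w = np.array([0.0, 1.0, 0.25]), np.array([0.5, 0.25, 0.25])
abar = float(np.dot(w, alpha))
js_vec = sum(wi * kl(full((1 - ai) * theta1 + ai * theta2),
                     full((1 - abar) * theta1 + abar * theta2))
             for ai, wi in zip(alpha, w))
diversity = (sum(wi * F((1 - ai) * theta1 + ai * theta2) for ai, wi in zip(alpha, w))
             - F((1 - abar) * theta1 + abar * theta2))
print(js_vec, diversity)               # equal
```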

3.2. Jensen–Shannon Centroids

Given a set of n mixture densities m θ 1 , , m θ n of M , we seek to calculate the skew-vector Jensen–Shannon centroid (or barycenter for non-uniform weights) defined as m θ * , where θ * is the minimizer of the following objective function (or loss function):
L(θ) := Σ_{j=1}^n ω_j JS_{α,w}(m_{θ_j} : m_θ),
where ω ∈ Δ_n is the weight vector of the densities (uniform weights for the centroid, non-uniform weights for a barycenter). This definition of the skew-vector Jensen–Shannon centroid is a generalization of the Fréchet mean (the Fréchet mean may not be unique, as is the case on the sphere for two antipodal points, for which the Fréchet means with respect to the geodesic metric distance form a great circle) [39] to non-metric spaces. Since the divergence JS_{α,w} is strictly separable convex, it follows that the Jensen–Shannon-type centroids are unique when they exist.
Plugging Equation (82) into Equation (88), we get that the calculation of the Jensen–Shannon centroid amounts to the following minimization problem:
L(θ) = Σ_{j=1}^n ω_j Σ_{i=1}^k w_i [ F((θ_j θ)_{α_i}) − F((θ_j θ)_{ᾱ}) ].
This optimization is a Difference of Convex (DC) programming optimization, for which we can use the ConCave–Convex procedure [27,40] (CCCP). Indeed, let us define the following two convex functions:
A(θ) = Σ_{j=1}^n Σ_{i=1}^k ω_j w_i F((θ_j θ)_{α_i}),
B(θ) = Σ_{j=1}^n ω_j F((θ_j θ)_{ᾱ}).
Both functions A ( θ ) and B ( θ ) are convex since F is convex. Then, the minimization problem of Equation (89) to solve can be rewritten as:
min_θ A(θ) − B(θ).
This is a DC programming optimization problem which can be solved iteratively by initializing θ to an arbitrary value θ ( 0 ) (say, the centroid of the θ i s), and then by updating the parameter at step t using the CCCP [27] as follows:
θ^{(t+1)} = (∇B)^{-1}(∇A(θ^{(t)})).
Compared to a gradient descent local optimization, there is no required step size (also called “learning” rate) in CCCP.
We have ∇A(θ) = Σ_{j=1}^n Σ_{i=1}^k ω_j w_i α_i ∇F((θ_j θ)_{α_i}) and ∇B(θ) = Σ_{j=1}^n ω_j ᾱ ∇F((θ_j θ)_{ᾱ}).
The CCCP converges to a local optimum θ* where the support hyperplanes of the function graphs of A and B at θ* are parallel to each other, as depicted in Figure 1. The set of stationary points is {θ : ∇A(θ) = ∇B(θ)}. In practice, the delicate step is to invert ∇B. Next, we show how to implement this algorithm for the Jensen–Shannon centroid of a set of categorical distributions (i.e., normalized histograms with all non-empty bins).

3.2.1. Jensen–Shannon Centroids of Categorical Distributions

To illustrate the method, let us consider the mixture family of categorical distributions [25]:
M = { m_θ(x) = Σ_{i=1}^D θ_i δ(x − x_i) + (1 − Σ_{i=1}^D θ_i) δ(x − x_0) }.
The Shannon negentropy is
F(θ) = −h(m_θ) = Σ_{i=1}^D θ_i log θ_i + (1 − Σ_{i=1}^D θ_i) log (1 − Σ_{i=1}^D θ_i).
We have the partial derivatives
∇F(θ) = [ ∂F(θ)/∂θ_i ]_i,   ∂F(θ)/∂θ_i = log [ θ_i / (1 − Σ_{j=1}^D θ_j) ].
Inverting the gradient ∇F requires us to solve the equation ∇F(θ) = η so that we get θ = (∇F)^{-1}(η). We find that
∇F*(η) = (∇F)^{-1}(η) = [ exp(η_i) / (1 + Σ_{j=1}^D exp(η_j)) ]_i,   i.e., θ_i = ((∇F)^{-1}(η))_i = exp(η_i) / (1 + Σ_{j=1}^D exp(η_j)),   i ∈ [D].
Table 1 summarizes the dual view of the family of categorical distributions, either interpreted as an exponential family or as a mixture family.
We have JS ( p 1 , p 2 ) = J F ( θ 1 , θ 2 ) for p 1 = m θ 1 and p 2 = m θ 2 , where
J_F(θ_1 : θ_2) = [ F(θ_1) + F(θ_2) ] / 2 − F((θ_1 + θ_2)/2),
is the Jensen divergence [40]. Thus, to compute the Jensen–Shannon centroid of a set of n densities p 1 , , p n of a mixture family (with p i = m θ i ), we need to solve the following optimization problem for a density p = m θ :
min_p Σ_i JS(p_i, p),
min_θ Σ_i J_F(θ_i, θ),
min_θ Σ_i [ F(θ_i) + F(θ) ] / 2 − F((θ_i + θ)/2),
min_θ (1/2) F(θ) − (1/n) Σ_i F((θ_i + θ)/2) =: E(θ).
The CCCP algorithm for the Jensen–Shannon centroid proceeds by initializing θ ( 0 ) = 1 n i θ i (center of mass of the natural parameters), and iteratively updates as follows:
θ^{(t+1)} = (∇F)^{-1}( (1/n) Σ_i ∇F((θ_i + θ^{(t)})/2) ).
We iterate until the absolute difference |E(θ^{(t)}) − E(θ^{(t+1)})| between two successive iterates θ^{(t)} and θ^{(t+1)} falls below a prescribed threshold value. The convergence of the CCCP algorithm is linear [41] to a local minimum that is a fixed point of the equation
θ = M_H( (θ_1 + θ)/2, …, (θ_n + θ)/2 ),
where M_H(v_1, …, v_n) := H^{-1}( (1/n) Σ_{i=1}^n H(v_i) ) is a vector generalization of the formula of the quasi-arithmetic means [30,40] obtained for the generator H = ∇F. Algorithm 1 summarizes the method for approximating the Jensen–Shannon centroid of a given set of categorical distributions (given a prescribed number of iterations). In the pseudo-code, we used the notation ^{(t+1)}θ instead of θ^{(t+1)} in order to highlight the conversion procedures of the natural parameters to/from the mixture weight parameters by using superscript notations for coordinates.
Algorithm 1: The CCCP algorithm for computing the Jensen–Shannon centroid of a set of categorical distributions.
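Since the published pseudo-code listing is not reproduced here, the following compact Python sketch illustrates the CCCP update derived above for the Jensen–Shannon centroid of categorical distributions (assumptions: uniform weights, natural logarithms, strictly positive bins; function and variable names are illustrative, not the paper's).

```python
# Sketch of the CCCP iteration theta <- (grad F)^{-1}((1/n) sum_j grad F((theta_j+theta)/2)).
import numpy as np

def grad_F(theta):                       # eta_i = log(theta_i / theta_0)
    last = 1.0 - np.sum(theta)
    return np.log(theta / last)

def grad_F_inv(eta):                     # theta_i = exp(eta_i) / (1 + sum_j exp(eta_j))
    e = np.exp(eta)
    return e / (1.0 + np.sum(e))

def js_centroid(thetas, iters=100):
    """Jensen-Shannon centroid of categorical distributions given by their first D weights."""
    thetas = np.asarray(thetas, dtype=float)
    theta = thetas.mean(axis=0)          # initialize with the center of mass
    for _ in range(iters):
        theta = grad_F_inv(np.mean([grad_F(0.5 * (tj + theta)) for tj in thetas], axis=0))
    return theta

# Example: three categorical distributions on 4 atoms (theta keeps the first 3 coordinates).
hists = np.array([[0.1, 0.2, 0.3, 0.4],
                  [0.4, 0.3, 0.2, 0.1],
                  [0.25, 0.25, 0.25, 0.25]])
theta_star = js_centroid(hists[:, :-1])
centroid = np.append(theta_star, 1.0 - np.sum(theta_star))
print(centroid, centroid.sum())          # approximate JS centroid, sums to 1
```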
Figure 2 displays the results of the calculations of the Jeffreys centroid [18] and the Jensen–Shannon centroid for two normalized histograms obtained from grey-valued images of Lena and Barbara. Figure 3 shows the Jeffreys centroid and the Jensen–Shannon centroid for the Barbara image and its negative image. Figure 4 demonstrates that the Jensen–Shannon centroid is well defined even if the input histograms do not have coinciding supports. Notice that on the parts of the support where only one distribution is defined, the JS centroid is a scaled copy of that defined distribution.

3.2.2. Special Cases

Let us now consider two special cases:
  • For the special case of D = 1, the categorical family is the Bernoulli family, and we have F(θ) = θ log θ + (1−θ) log(1−θ) (binary negentropy), F′(θ) = log [θ/(1−θ)] (with F″(θ) = 1/(θ(1−θ)) > 0), and (F′)^{-1}(η) = e^η/(1 + e^η). The CCCP update rule to compute the binary Jensen–Shannon centroid becomes
    θ^{(t+1)} = (F′)^{-1}( Σ_i w_i F′((θ^{(t)} + θ_i)/2) ).
  • Since the skew-vector Jensen–Shannon divergence formula holds for positive densities:
    JS⁺_{α,w}(p̃ : q̃) = Σ_{i=1}^k w_i KL⁺((p̃q̃)_{α_i} : (p̃q̃)_{ᾱ})
    = Σ_{i=1}^k w_i KL((p̃q̃)_{α_i} : (p̃q̃)_{ᾱ}) + ∫ (p̃q̃)_{ᾱ} dμ − Σ_{i=1}^k w_i ∫ (p̃q̃)_{α_i} dμ, where the last two terms cancel since Σ_{i=1}^k w_i ∫ (p̃q̃)_{α_i} dμ = ∫ (p̃q̃)_{ᾱ} dμ,
    = JS_{α,w}(p̃ : q̃),
    we can relax the computation of the Jensen–Shannon centroid by considering 1D separable minimization problems. We then normalize the positive JS centroids to get an approximation of the probability JS centroids. This approach was also considered when dealing with the Jeffreys centroid [18]. In 1D, we have F(θ) = θ log θ − θ, F′(θ) = log θ, and (F′)^{-1}(η) = e^η.
In general, calculating the negentropy for a mixture family with continuous densities sharing the same support is not tractable because of the log-sum term of the differential entropy. However, the following remark emphasizes an extension of the mixture family of categorical distributions:

3.2.3. Some Remarks and Properties

Remark 3.
Consider a mixture family m_θ(x) = Σ_{i=1}^D θ_i p_i(x) + (1 − Σ_{i=1}^D θ_i) p_0(x) (for a parameter θ belonging to the D-dimensional standard simplex) of probability densities p_0(x), …, p_D(x) defined respectively on the supports X_0, X_1, …, X_D. Let θ_0 := 1 − Σ_{i=1}^D θ_i. Assume that the supports X_i of the p_i's are mutually non-intersecting (X_i ∩ X_j = ∅ for all i ≠ j, implying that the D+1 densities are linearly independent) so that m_θ(x) = θ_i p_i(x) for all x ∈ X_i, and let X = ∪_i X_i. Consider the Shannon negentropy F(θ) = −h(m_θ) as a strictly convex function. Then, we have
F(θ) = −h(m_θ) = ∫_X m_θ(x) log m_θ(x) dμ(x)
= Σ_{i=0}^D θ_i ∫_{X_i} p_i(x) log (θ_i p_i(x)) dμ(x)
= Σ_{i=0}^D θ_i log θ_i − Σ_{i=0}^D θ_i h(p_i).
Note that the term Σ_i θ_i h(p_i) is affine in θ, and Bregman divergences are defined up to affine terms, so that the Bregman generator F is equivalent to the Bregman generator of the family of categorical distributions. This example generalizes the ordinary mixture family of categorical distributions where the p_i's are distinct Dirac distributions. Note that when the supports of the component distributions are not pairwise disjoint, the (neg)entropy may not be analytic [42] (e.g., a mixture of the convex weighting of two prescribed distinct Gaussian distributions). This contrasts with the fact that the cumulant function of an exponential family is always real-analytic [43]. Observe that the term Σ_i θ_i h(p_i) can be interpreted as a conditional entropy: Σ_i θ_i h(p_i) = h(X | Θ), where Pr(Θ = i) = θ_i and Pr(X ∈ S | Θ = i) = ∫_S p_i(x) dμ(x).
Notice that we can truncate an exponential family [25] to get a (potentially non-regular [44]) exponential family for defining the p_i's on mutually non-intersecting domains X_i. The entropy of a natural exponential family {e(x; θ) = exp(x^⊤θ − ψ(θ)) : θ ∈ Θ} with cumulant function ψ(θ) and natural parameter space Θ is −ψ*(η), where η = ∇ψ(θ) and ψ* is the Legendre convex conjugate [45]: h(e(x; θ)) = −ψ*(∇ψ(θ)).
In general, the entropy and cross-entropy between densities of a mixture family (whether the distributions have disjoint supports or not) can be calculated in closed-form.
Property 2.
The entropy of a density belonging to a mixture family M is h(m_θ) = −F(θ), and the cross-entropy between two mixture densities m_{θ_1} and m_{θ_2} is h^×(m_{θ_1} : m_{θ_2}) = −F(θ_2) − (θ_1 − θ_2)^⊤ η_2 = F*(η_2) − θ_1^⊤ η_2.
Proof. 
Let us write the KLD as the difference between the cross-entropy minus the entropy [4]:
KL(m_{θ_1} : m_{θ_2}) = h^×(m_{θ_1} : m_{θ_2}) − h(m_{θ_1})
= B_F(θ_1 : θ_2)
= F(θ_1) − F(θ_2) − (θ_1 − θ_2)^⊤ ∇F(θ_2).
Following [45], we deduce that h(m_θ) = −F(θ) + c and h^×(m_{θ_1} : m_{θ_2}) = −F(θ_2) − (θ_1 − θ_2)^⊤ η_2 + c for a constant c. Since F(θ) = −h(m_θ) by definition, it follows that c = 0 and that h^×(m_{θ_1} : m_{θ_2}) = −F(θ_2) − (θ_1 − θ_2)^⊤ η_2 = F*(η_2) − θ_1^⊤ η_2, where η = ∇F(θ).  □
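Property 2 can be verified numerically on the categorical mixture family (illustration only; θ holds the first D mixture weights, the last weight being 1 − Σθ_i):

```python
# Numerical check of Property 2 on categorical distributions.
import numpy as np

def full(theta):
    return np.append(theta, 1.0 - np.sum(theta))

def F(theta):                            # negentropy F(theta) = -h(m_theta)
    p = full(theta)
    return float(np.sum(p * np.log(p)))

def gradF(theta):
    p = full(theta)
    return np.log(p[:-1] / p[-1])

theta1, theta2 = np.array([0.2, 0.3, 0.1]), np.array([0.4, 0.1, 0.2])
p1, p2 = full(theta1), full(theta2)

entropy = float(-np.sum(p1 * np.log(p1)))
cross = float(-np.sum(p1 * np.log(p2)))
print(entropy, -F(theta1))                                              # equal
print(cross, -F(theta2) - float(np.dot(theta1 - theta2, gradF(theta2))))  # equal
```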
Thus, we can numerically compute the Jensen–Shannon centroids (or barycenters) of a set of densities belonging to a mixture family. This includes the case of categorical distributions and the case of Gaussian Mixture Models (GMMs) with prescribed Gaussian components [38] (although in this case, the negentropy needs to be stochastically approximated using Monte Carlo techniques [46]). When the densities do not belong to a mixture family (say, the Gaussian family, which is an exponential family [25]), we face the problem that the mixture of two densities does not belong to the family anymore. One way to tackle this problem is to project the mixture onto the Gaussian family. This corresponds to an m-projection (mixture projection) which can be interpreted as a Maximum Entropy projection of the mixture [25,47]).
Notice that we can perform fast k-means clustering without centroid calculations using a generalization of the k-means++ probabilistic initialization [48,49]. See [50] for details of the generalized k-means++ probabilistic initialization defined according to an arbitrary divergence.
Finally, let us notice some decompositions of the Jensen–Shannon divergence and the skew Jensen divergences.
Remark 4.
We have the following decomposition for the Jensen–Shannon divergence:
JS(p_1, p_2) = h((p_1 + p_2)/2) − (h(p_1) + h(p_2))/2
= h_JS^×(p_1 : p_2) − h_JS(p_2) ≥ 0,
where
h_JS^×(p_1 : p_2) = h((p_1 + p_2)/2) − (1/2) h(p_1),
and h_JS(p_2) = h_JS^×(p_2 : p_2) = h(p_2) − (1/2) h(p_2) = (1/2) h(p_2). This decomposition bears some similarity with the KLD decomposition viewed as the cross-entropy minus the entropy (with the cross-entropy always upper-bounding the entropy).
Similarly, the α-skew Jensen divergence
J_F^α(θ_1 : θ_2) := (F(θ_1) F(θ_2))_α − F((θ_1 θ_2)_α),   α ∈ (0,1),
can be decomposed as the information I_F^α(θ_1) = (1−α) F(θ_1) minus the cross-information C_F^α(θ_1 : θ_2) := F((θ_1 θ_2)_α) − α F(θ_2):
J_F^α(θ_1 : θ_2) = I_F^α(θ_1) − C_F^α(θ_1 : θ_2) ≥ 0.
Notice that the information I_F^α(θ_1) is the self cross-information: I_F^α(θ_1) = C_F^α(θ_1 : θ_1) = (1−α) F(θ_1). Recall that the convex information is the negentropy, the entropy being concave. For the Jensen–Shannon divergence on the mixture family of categorical distributions, the convex generator F(θ) = −h(m_θ) = Σ_{i=0}^D θ_i log θ_i (with θ_0 = 1 − Σ_{i=1}^D θ_i) is the Shannon negentropy.
Finally, let us briefly mention the Jensen–Shannon diversity [30] which extends the Jensen–Shannon divergence to a weighted set of densities as follows:
JS(p_1, …, p_k; w_1, …, w_k) := Σ_{i=1}^k w_i KL(p_i : p̄),
where p̄ = Σ_{i=1}^k w_i p_i. The Jensen–Shannon diversity plays the role of the variance of a cluster with respect to the KLD. Indeed, let us state the compensation identity [51]: For any q, we have
Σ_{i=1}^k w_i KL(p_i : q) = Σ_{i=1}^k w_i KL(p_i : p̄) + KL(p̄ : q).
Thus, the cluster center defined as the minimizer of Σ_{i=1}^k w_i KL(p_i : q) is the centroid p̄, and
Σ_{i=1}^k w_i KL(p_i : p̄) = JS(p_1, …, p_k; w_1, …, w_k).
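A small numerical check of the compensation identity (illustration only; random categorical densities drawn with NumPy):

```python
# Compensation identity: sum_i w_i KL(p_i:q) = sum_i w_i KL(p_i:pbar) + KL(pbar:q).
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))      # assumes strictly positive densities

rng = np.random.default_rng(1)
ps = rng.dirichlet(np.ones(5), size=3)           # three categorical densities
w = np.array([0.5, 0.3, 0.2])
q = rng.dirichlet(np.ones(5))
pbar = w @ ps

lhs = sum(wi * kl(pi, q) for wi, pi in zip(w, ps))
js_diversity = sum(wi * kl(pi, pbar) for wi, pi in zip(w, ps))
print(lhs, js_diversity + kl(pbar, q))           # equal: the minimizer over q is pbar
```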

4. Conclusions and Discussion

The Jensen–Shannon divergence [6] is a renowned symmetrization of the Kullback–Leibler oriented divergence that enjoys the following three essential properties:
  • It is always bounded,
  • it applies to densities with potentially different supports, and
  • it extends to unnormalized densities while enjoying the same formula expression.
This JSD plays an important role in machine learning and in deep learning for studying Generative Adversarial Networks (GANs) [52]. Traditionally, the JSD has been skewed with a scalar parameter [19,53] α ( 0 , 1 ) . In practice, it has been experimentally demonstrated that skewing divergences may significantly improve the performance of some tasks (e.g., [21,54]).
In general, we can symmetrize the KLD KL(p : q) by taking an abstract mean M between the two orientations KL(p : q) and KL(q : p) (we require a symmetric mean M(x, y) = M(y, x) with the in-betweenness property: min{x, y} ≤ M(x, y) ≤ max{x, y}):
KL_M(p, q) := M(KL(p : q), KL(q : p)).
We recover the Jeffreys divergence by taking twice the arithmetic mean (i.e., J(p, q) = 2 A(KL(p : q), KL(q : p)), where A(x, y) = (x + y)/2), and the resistor average divergence [55] by taking the harmonic mean (i.e., R_KL(p, q) = H(KL(p : q), KL(q : p)) = 2 KL(p : q) KL(q : p) / (KL(p : q) + KL(q : p)), where H(x, y) = 2/(1/x + 1/y)). When we take the limits of Hölder power means, we get the following extremal symmetrizations of the KLD:
KL_min(p : q) = min{KL(p : q), KL(q : p)} = KL_min(q : p),
KL_max(p : q) = max{KL(p : q), KL(q : p)} = KL_max(q : p).
In this work, we showed how to vector-skew the JSD while preserving the above three properties. These new families of weighted vector-skew Jensen–Shannon divergences may allow one to fine-tune the dissimilarity in applications by replacing the skewing scalar parameter of the JSD by a vector parameter (informally, adding some “knobs” for tuning a divergence). We then considered computing the Jensen–Shannon centroids of a set of densities belonging to a mixture family [25] by using the convex–concave procedure [27].
In general, we can vector-skew any arbitrary divergence D by using two k-dimensional vectors α ∈ [0,1]^k and β ∈ [0,1]^k (with α ≠ β) by building a weighted separable divergence as follows:
D_{α,β,w}(p : q) := Σ_{i=1}^k w_i D((pq)_{α_i} : (pq)_{β_i}) = D_{1_k−α, 1_k−β, w}(q : p),   α ≠ β.
This bi-vector-skew divergence unifies the Jeffreys divergence with the Jensen–Shannon α -skew divergence by setting the following parameters:
KL_{(0,1),(1,0),(1,1)}(p : q) = KL(p : q) + KL(q : p) = J(p, q),
KL_{(0,α),(1,1−α),(1/2,1/2)}(p : q) = (1/2) KL(p : (pq)_α) + (1/2) KL(q : (pq)_α).
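As an illustration (not from the paper) of the generic bi-vector-skew construction applied to the KLD, the sketch below checks the Jeffreys special case with α = (0, 1), β = (1, 0), and w = (1, 1); the skewing vectors are read componentwise as in the definition above.

```python
# Sketch of the bi-vector-skew divergence D_{alpha,beta,w} applied to the KLD.
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))      # assumes strictly positive densities

def skew_divergence(p, q, alpha, beta, w, D=kl):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return sum(wi * D((1 - ai) * p + ai * q, (1 - bi) * p + bi * q)
               for ai, bi, wi in zip(alpha, beta, w))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.1, 0.8])
jeffreys = kl(p, q) + kl(q, p)
print(skew_divergence(p, q, [0, 1], [1, 0], [1, 1]), jeffreys)   # equal
```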
We have shown in this paper that interesting properties may occur when the skewing vector β is purposely correlated to the skewing vector α: Namely, for the bi-vector-skew Bregman divergences with β = (ᾱ, …, ᾱ) and ᾱ = Σ_i w_i α_i, we obtain an equivalent Jensen diversity for the Jensen–Bregman divergence, and, as a byproduct, a vector-skew generalization of the Jensen–Shannon divergence.

Funding

This research received no external funding.

Acknowledgments

The author is very grateful to the two Reviewers and the Academic Editor for their careful reading, helpful comments, and suggestions which led to this improved manuscript. In particular, Reviewer 2 kindly suggested the stronger bound of Lemma 1 and hinted at Theorem 1.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Billingsley, P. Probability and Measure; John Wiley & Sons: Hoboken, NJ, USA, 2008.
  2. Deza, M.M.; Deza, E. Encyclopedia of Distances; Springer: Berlin/Heidelberg, Germany, 2009.
  3. Basseville, M. Divergence measures for statistical data processing—An annotated bibliography. Signal Process. 2013, 93, 621–633.
  4. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2012.
  5. Nielsen, F. On the Jensen–Shannon Symmetrization of Distances Relying on Abstract Means. Entropy 2019, 21, 485.
  6. Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151.
  7. Sason, I. Tight bounds for symmetric divergence measures and a new inequality relating f-divergences. In Proceedings of the 2015 IEEE Information Theory Workshop (ITW), Jerusalem, Israel, 26 April–1 May 2015; pp. 1–5.
  8. Wong, A.K.; You, M. Entropy and distance of random graphs with application to structural pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell. 1985, 7, 599–609.
  9. Endres, D.M.; Schindelin, J.E. A new metric for probability distributions. IEEE Trans. Inf. Theory 2003, 49, 1858–1860.
  10. Kafka, P.; Österreicher, F.; Vincze, I. On powers of f-divergences defining a distance. Stud. Sci. Math. Hung. 1991, 26, 415–422.
  11. Fuglede, B. Spirals in Hilbert space: With an application in information theory. Expo. Math. 2005, 23, 23–45.
  12. Acharyya, S.; Banerjee, A.; Boley, D. Bregman divergences and triangle inequality. In Proceedings of the 2013 SIAM International Conference on Data Mining, Austin, TX, USA, 2–4 May 2013; pp. 476–484.
  13. Naghshvar, M.; Javidi, T.; Wigger, M. Extrinsic Jensen–Shannon divergence: Applications to variable-length coding. IEEE Trans. Inf. Theory 2015, 61, 2148–2164.
  14. Bigi, B. Using Kullback-Leibler distance for text categorization. In European Conference on Information Retrieval; Springer: Berlin/Heidelberg, Germany, 2003; pp. 305–319.
  15. Chatzisavvas, K.C.; Moustakidis, C.C.; Panos, C. Information entropy, information distances, and complexity in atoms. J. Chem. Phys. 2005, 123, 174111.
  16. Yurdakul, B. Statistical Properties of Population Stability Index. Ph.D. Thesis, Western Michigan University, Kalamazoo, MI, USA, 2018.
  17. Jeffreys, H. An invariant form for the prior probability in estimation problems. Proc. R. Soc. Lond. A 1946, 186, 453–461.
  18. Nielsen, F. Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation for frequency histograms. IEEE Signal Process. Lett. 2013, 20, 657–660.
  19. Lee, L. Measures of Distributional Similarity. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, ACL '99; Association for Computational Linguistics: Stroudsburg, PA, USA, 1999; pp. 25–32.
  20. Nielsen, F. A family of statistical symmetric divergences based on Jensen's inequality. arXiv 2010, arXiv:1009.4004.
  21. Lee, L. On the effectiveness of the skew divergence for statistical language analysis. In Proceedings of the 8th International Workshop on Artificial Intelligence and Statistics (AISTATS 2001), Key West, FL, USA, 4–7 January 2001.
  22. Csiszár, I. Information-type measures of difference of probability distributions and indirect observation. Stud. Sci. Math. Hung. 1967, 2, 229–318.
  23. Ali, S.M.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B (Methodol.) 1966, 28, 131–142.
  24. Sason, I. On f-divergences: Integral representations, local behavior, and inequalities. Entropy 2018, 20, 383.
  25. Amari, S.I. Information Geometry and Its Applications; Springer: Berlin/Heidelberg, Germany, 2016.
  26. Jiao, J.; Courtade, T.A.; No, A.; Venkat, K.; Weissman, T. Information measures: The curious case of the binary alphabet. IEEE Trans. Inf. Theory 2014, 60, 7616–7626.
  27. Yuille, A.L.; Rangarajan, A. The concave-convex procedure (CCCP). In Proceedings of the Neural Information Processing Systems 2002, Vancouver, BC, Canada, 9–14 December 2002; pp. 1033–1040.
  28. Nielsen, F.; Nock, R. Skew Jensen-Bregman Voronoi diagrams. In Transactions on Computational Science XIV; Springer: Berlin/Heidelberg, Germany, 2011; pp. 102–128.
  29. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749.
  30. Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 2009, 55, 2882–2904.
  31. Melbourne, J.; Talukdar, S.; Bhaban, S.; Madiman, M.; Salapaka, M.V. On the Entropy of Mixture distributions. Available online: http://box5779.temp.domains/~jamesmel/publications/ (accessed on 16 February 2020).
  32. Guntuboyina, A. Lower bounds for the minimax risk using f-divergences, and applications. IEEE Trans. Inf. Theory 2011, 57, 2386–2399.
  33. Sason, I.; Verdu, S. f-divergence Inequalities. IEEE Trans. Inf. Theory 2016, 62, 5973–6006.
  34. Melbourne, J.; Madiman, M.; Salapaka, M.V. Relationships between certain f-divergences. In Proceedings of the 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 24–27 September 2019; pp. 1068–1073.
  35. Sason, I. On Data-Processing and Majorization Inequalities for f-Divergences with Applications. Entropy 2019, 21, 1022.
  36. Van Erven, T.; Harremos, P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820.
  37. Xu, P.; Melbourne, J.; Madiman, M. Infinity-Rényi entropy power inequalities. In Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017; pp. 2985–2989.
  38. Nielsen, F.; Nock, R. On the geometry of mixtures of prescribed distributions. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 2861–2865.
  39. Fréchet, M. Les éléments aléatoires de nature quelconque dans un espace distancié. Ann. Inst. Henri Poincaré 1948, 10, 215–310.
  40. Nielsen, F.; Boltz, S. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 2011, 57, 5455–5466.
  41. Lanckriet, G.R.; Sriperumbudur, B.K. On the convergence of the concave-convex procedure. In Proceedings of the Advances in Neural Information Processing Systems 22 (NIPS 2009), Vancouver, BC, Canada, 7–10 December 2009; pp. 1759–1767.
  42. Nielsen, F.; Sun, K. Guaranteed bounds on information-theoretic measures of univariate mixtures using piecewise log-sum-exp inequalities. Entropy 2016, 18, 442.
  43. Springer Verlag GmbH, European Mathematical Society. Encyclopedia of Mathematics. Available online: https://www.encyclopediaofmath.org/ (accessed on 19 December 2019).
  44. Del Castillo, J. The singly truncated normal distribution: A non-steep exponential family. Ann. Inst. Stat. Math. 1994, 46, 57–66.
  45. Nielsen, F.; Nock, R. Entropies and cross-entropies of exponential families. In Proceedings of the 2010 IEEE International Conference on Image Processing, Hong Kong, China, 26–29 September 2010; pp. 3621–3624.
  46. Nielsen, F.; Hadjeres, G. Monte Carlo information geometry: The dually flat case. arXiv 2018, arXiv:1803.07225.
  47. Schwander, O.; Nielsen, F. Learning mixtures by simplifying kernel density estimators. In Matrix Information Geometry; Springer: Berlin/Heidelberg, Germany, 2013; pp. 403–426.
  48. Arthur, D.; Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'07), New Orleans, LA, USA, 7–9 January 2007; pp. 1027–1035.
  49. Nielsen, F.; Nock, R.; Amari, S.I. On clustering histograms with k-means by using mixed α-divergences. Entropy 2014, 16, 3273–3301.
  50. Nielsen, F.; Nock, R. Total Jensen divergences: Definition, properties and clustering. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, QLD, Australia, 19–24 April 2015; pp. 2016–2020.
  51. Topsøe, F. Basic concepts, identities and inequalities—the toolkit of information theory. Entropy 2001, 3, 162–190.
  52. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680.
  53. Yamano, T. Some bounds for skewed α-Jensen-Shannon divergence. Results Appl. Math. 2019, 3, 100064.
  54. Kotlerman, L.; Dagan, I.; Szpektor, I.; Zhitomirsky-Geffet, M. Directional distributional similarity for lexical inference. Nat. Lang. Eng. 2010, 16, 359–389.
  55. Johnson, D.; Sinanovic, S. Symmetrizing the Kullback-Leibler distance. IEEE Trans. Inf. Theory 2001, 1–8.
Figure 1. The Convex–ConCave Procedure (CCCP) iteratively updates the parameter θ by aligning the support hyperplanes at θ . In the limit case of convergence to θ * , the support hyperplanes at θ * are parallel to each other. CCCP finds a local minimum.
Figure 2. The Jeffreys centroid (grey histogram) and the Jensen–Shannon centroid (black histogram) for two grey normalized histograms of the Lena image (red histogram) and the Barbara image (blue histogram). Although these Jeffreys and Jensen–Shannon centroids look quite similar, observe that there is a major difference between them in the range [ 0 , 20 ] where the blue histogram is zero.
Figure 3. The Jeffreys centroid (grey histogram) and the Jensen–Shannon centroid (black histogram) for the grey normalized histogram of the Barbara image (red histogram) and its negative image (blue histogram which corresponds to the reflection around the vertical axis x = 128 of the red histogram).
Figure 4. Jensen–Shannon centroid (black histogram) for the clamped grey normalized histogram of the Lena image (red histograms) and the clamped gray normalized histogram of Barbara image (blue histograms). Notice that on the part of the sample space where only one distribution is non-zero, the JS centroid scales that histogram portion.
Table 1. Two views of the family of categorical distributions with d choices: An exponential family or a mixture family of order D = d − 1. Note that the Bregman divergence associated to the exponential family view corresponds to the reverse Kullback–Leibler (KL) divergence, while the Bregman divergence associated to the mixture family view corresponds to the KL divergence.

| | Exponential Family | Mixture Family |
|---|---|---|
| pdf | p_θ(x) = Π_{i=1}^d p_i^{t_i(x)}, p_i = Pr(x = e_i), t_i(x) ∈ {0,1}, Σ_{i=1}^d t_i(x) = 1 | m_θ(x) = Σ_{i=1}^d p_i δ_{e_i}(x) |
| primal θ | θ^i = log(p_i/p_d) | θ^i = p_i |
| F(θ) | log(1 + Σ_{i=1}^D exp(θ^i)) | Σ_{i=1}^D θ^i log θ^i + (1 − Σ_{i=1}^D θ^i) log(1 − Σ_{i=1}^D θ^i) |
| dual η = ∇F(θ) | η_i = exp(θ^i)/(1 + Σ_{j=1}^D exp(θ^j)) | η_i = log[θ^i/(1 − Σ_{j=1}^D θ^j)] |
| primal θ = ∇F*(η) | θ^i = log[η_i/(1 − Σ_{j=1}^D η_j)] | θ^i = exp(η_i)/(1 + Σ_{j=1}^D exp(η_j)) |
| F*(η) | Σ_{i=1}^D η_i log η_i + (1 − Σ_{j=1}^D η_j) log(1 − Σ_{j=1}^D η_j) | log(1 + Σ_{i=1}^D exp(η_i)) |
| Bregman divergence | B_F(θ : θ′) = KL*(p_θ : p_{θ′}) = KL(p_{θ′} : p_θ) | B_F(θ : θ′) = KL(m_θ : m_{θ′}) |
