Article

Generalizations of Talagrand Inequality for Sinkhorn Distance Using Entropy Power Inequality †

1 Communication Systems Department, EURECOM, 06904 Sophia Antipolis, France
2 Division of Information Science and Engineering, KTH Royal Institute of Technology, 114 28 Stockholm, Sweden
* Author to whom correspondence should be addressed.
This paper is an extended version of our paper published in 2021 IEEE Information Theory Workshop (ITW), Kanazawa, Japan.
Submission received: 4 January 2022 / Revised: 17 February 2022 / Accepted: 17 February 2022 / Published: 21 February 2022
(This article belongs to the Special Issue Distance in Information and Statistical Physics III)

Abstract:
Distances that compare two probability distributions play a fundamental role in statistics and machine learning. Optimal transport (OT) theory provides a theoretical framework to study such distances. Recent advances in OT theory include a generalization of classical OT with an extra entropic constraint or regularization, called entropic OT. Despite its computational convenience, entropic OT still lacks sufficient theoretical support. In this paper, we show that the quadratic cost in entropic OT can be upper-bounded using entropy power inequality (EPI)-type bounds. First, we prove an HWI-type inequality by making use of the infinitesimal displacement convexity of the OT map. Second, we derive two Talagrand-type inequalities using the saturation of EPI, which corresponds to a numerical term in our expressions. These two new inequalities are shown to generalize two previous results obtained by Bolley et al. and Bai et al. Using the new Talagrand-type inequalities, we also show that the geometry observed by the Sinkhorn distance is smoothed in the sense of measure concentration. Finally, we corroborate our results with various simulation studies.

1. Introduction

OT theory studies how to transport one measure to another along a path of minimal cost. The Wasserstein distance is the cost of the optimal path and is closely connected with information measures; see, e.g., [1,2,3,4,5].
During the last decade, OT has been studied and applied extensively, especially in the machine learning community; see, e.g., [6,7,8,9]. Entropic OT, a technique to approximate the solution of the original OT problem, was introduced in [10] for computational efficiency. A key concept in entropic OT is the Sinkhorn distance, a generalization of the Wasserstein distance with an extra entropic constraint. Due to this extra constraint on the domain of the optimization problem, randomness is added to the original deterministic system, and the total cost increases from the original Wasserstein distance to a larger value. Therefore, a natural question is how to quantify the extra cost caused by the entropic constraint.
In this paper, we derive upper bounds for the quadratic cost of entropic OT, which are shown to include a term of entropy power responsible for quantifying the amount of uncertainty caused by the entropic constraint. This work is an extended version of [11].

1.1. Literature Review

The dynamical formulation of OT, also known as the Benamou–Brenier formula [12], generalizes the original Monge–Kantorovich formulation into a time-dependent problem. It turns the original distance problem (i.e., find the distance between two prescribed measures) into a geodesic problem (i.e., find the optimal path between two prescribed measures). Using the displacement convexity of relative entropy along the geodesic, functional inequalities such as the HWI inequality and the Talagrand inequality can be obtained (see, e.g., ([13] Chapter 20)).
The Talagrand inequality, first given in [1], upper-bounds the Wasserstein distance by relative entropy. Recent results in [2,4] obtain several refined Talagrand inequalities with dimensional improvements on multidimensional Euclidean space. These inequalities bound the Wasserstein distance with entropy power, which is sharper than the original bound with relative entropy.
An analogue of the dynamical OT problem is the Schrödinger problem (SP) [14]. The SP aims to find the most likely evolution of a system of particles with respect to a reference process. The most likely evolution is called a Schrödinger bridge. SP and OT intersect on many occasions; see, e.g., [15,16,17]. The problem we study in this paper lies in this intersection and is most closely related to [15]. In particular, Léonard in [15] showed that entropic OT with quadratic cost is equivalent to the SP with a Brownian motion as the reference process. He further derived that the Schrödinger bridge also admits a Benamou–Brenier formula with an additional diffusion term. Conforti in [18,19] showed that the process can also be formulated as a continuity equation and proved that the acceleration of particles is the gradient of the Fisher information. The result therein leads to a generalized Talagrand inequality for relative entropy. Later, Bai et al. in [20] upper-bounded the extra cost from the Brownian motion by separating one Gaussian marginal into two independent random vectors. Using this approach, they showed that the dimensional improvement can be generalized to entropic OT and gave a Gaussian Talagrand inequality for the Sinkhorn distance. Additional results in [20] include a strong data processing inequality derived from their new Talagrand inequality and a bound on the capacity of the relay channel.
Entropic OT has other interesting properties. For example, Rigollet and Weed studied the case where one side is an empirical measure in [21]. Their result shows that entropic OT performs maximum-likelihood estimation for Gaussian deconvolution of the empirical measure. This result can be further applied in uncoupled isotonic regression (see [9]). Dimensionality is also observed in applications of entropic OT. For example, the sample complexity bounds in [22,23] are dimension-dependent. In the GAN model, Reshetova et al. in [24] showed that the entropic regularization of OT promotes sparsity in the generated distribution.
Another element in our paper is EPI (for details on EPI, see, e.g., [25,26,27]). This inequality provides a clean bound on the differential entropy of the convolution of two distributions. We refer the interested reader to [28,29,30,31,32] for the connections between EPI and functional inequalities, and to [33] for the connections between EPI and SP.

1.2. Contributions

In this paper, we upper-bound the quadratic cost of entropic OT via deconvolution of one marginal measure and EPIs. Using this approach, we avoid any discussion of the dynamics of the SP and instead capture the uncertainty caused by the Brownian motion quantitatively. Our contributions can be articulated as follows:
(1) We derive an HWI-type inequality for the Sinkhorn distance using a modification of Bolley's proof in [4] (see Theorem 2).
(2) We prove two new Talagrand-type inequalities (see Theorems 3 and 4). These inequalities are obtained via a numerical term C related to the saturation, or tightness, of EPI. This term can be computed from an arbitrary deconvolution of one marginal, while the optimal deconvolution remains unknown beyond the Gaussian case. Nevertheless, we simulate this term suboptimally for a variety of distributions in Figure 1.
(3) We show that the geometry observed by the Sinkhorn distance is smoothed in the sense of measure concentration. In other words, the Sinkhorn distance implies a dimensional measure concentration inequality following Marton's method (see Corollary 2). This inequality has a simple form of normal concentration that is related to the term C and is weaker than the one implied by the Wasserstein distance.
(4) Our theoretical results are validated via numerical simulations (see Section 4). These simulations reveal several reasons why our bounds can be either tight or loose.

Connections to Prior Art

The novelty of our work is that it naturally combines ideas from Bolley et al. in [4] and from Bai et al. in [20] to develop new entropic OT inequalities. The dimensional improvement of Bolley et al. in [4] separates an independent term of entropy power from the original Talagrand inequality. This allows us to study the entropic OT problem, i.e., OT with randomness, based on the convolutional property of entropy power. On the other hand, we generalize the constructive proof of Bai et al. in [20], where one Gaussian random vector is separated into two independent Gaussian random vectors. We further show that, for any distribution, we can always find similar independent pairs satisfying several assumptions, in order to upper-bound the Sinkhorn distance. As a consequence, our results generalize the Talagrand inequalities of Bolley et al. in ([4] Theorem 2.1) from classical OT to entropic OT and the results of Bai et al. in ([20] Theorem 2.2) from the Gaussian case to the strongly log-concave case. In particular, we show that Theorem 3 recovers ([4] Theorem 2.1) (see Corollary 1 and the discussion in Remark 6) and that Theorem 4 recovers ([20] Theorem 2.2) (see Remark 9). It should be noted that in our analysis, we focus on the primal problem defined in [10], as opposed to the studies of its Lagrangian dual in [18,19].

1.3. Notation

$\mathbb{N}$ is the set of positive integers $\{1, 2, 3, \dots\}$. $\mathbb{R}$ is the set of real numbers. $\mathbb{R}^n$ is the $n$-dimensional Euclidean space. $\mathbb{R}_+$ denotes the set $\{x \in \mathbb{R} : x \ge 0\}$.
Let $\mathcal{X}, \mathcal{Y}$ be two Polish spaces, i.e., separable complete metric spaces. We write an element $x \in \mathcal{X}$ in lower-case letters and a random vector $X$ on $\mathcal{X}$ in capital letters. We denote by $\mathcal{P}(\mathcal{X})$ the set of all probability measures on $\mathcal{X}$. Let $\mu$ be a Borel measure on $\mathcal{X}$. For a measurable map $T: \mathcal{X} \to \mathcal{Y}$, $T_{\#}\mu$ denotes the push-forward of $\mu$ to $\mathcal{Y}$, i.e., for all $A \subseteq \mathcal{Y}$, $T_{\#}\mu[A] = \mu[T^{-1}(A)]$. For $p \ge 1$, $L^p(\mathcal{X})$ or $L^p(d\mu)$ denotes the Lebesgue space of $p$-th order for the reference measure $\mu$.
$\nabla$ is the gradient operator, $\nabla\cdot$ is the divergence operator, $\Delta$ is the Laplacian operator, $D^2$ is the Hessian operator, $I_n$ is the $n$-dimensional identity matrix, $\mathrm{Id}$ is the identity map, $\|\cdot\|$ is the Euclidean norm, $C^k$ is the set of $k$-times continuously differentiable functions, and $\mathrm{Ric}$ is the Ricci curvature.
$h(\cdot)$, $I(\cdot;\cdot)$, $D(\cdot\|\cdot)$, $J(\cdot)$, and $I(\cdot|\cdot)$ denote differential entropy, mutual information, relative entropy, Fisher information, and relative Fisher information, respectively. All logarithms are natural logarithms. $\exists!$ denotes unique existence. $*$ is the convolution operator.

1.4. Organization of the Paper

The rest of the paper is organized as follows: in Section 2, we give the technical preliminaries of the theories and tools that we use; in Section 3, we state our main theoretical results; in Section 4, we give numerical simulations for our theorems, and in Section 5, we give the conclusions and future directions. Long proofs and background material are included in the Appendix.

2. Preliminaries

In this section, we give an overview of the theories and tools that we use.

2.1. Synopsis of Optimal Transport

We first give a brief introduction to OT theory. The OT problem was initiated by Gaspard Monge. The original formulation can be described as follows.
Definition 1
(Monge Problem [34]). Let $P_X$ and $P_Y$ be two probability measures supported on two Polish spaces $\mathcal{X}, \mathcal{Y}$. Given a lower semi-continuous (see Definition A1) cost function $c(x,y): \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+ \cup \{+\infty\}$, the Monge problem seeks a transport map $T: \mathcal{X} \to \mathcal{Y}$ minimizing the total cost:
$$\inf_{T :\, T_{\#} P_X = P_Y} \int_{\mathcal{X}} c(x, T(x)) \, dP_X(x).$$
Then, Kantorovich gave a probabilistic interpretation to the OT. This is stated next.
Definition 2
(Kantorovich Problem [35]). Let $X$ and $Y$ be two random vectors on two Polish spaces $\mathcal{X}, \mathcal{Y}$, with probability measures $P_X \in \mathcal{P}(\mathcal{X})$, $P_Y \in \mathcal{P}(\mathcal{Y})$. We denote by $\Pi(P_X, P_Y)$ the set of all joint probability measures on $\mathcal{X} \times \mathcal{Y}$ with marginal measures $P_X, P_Y$. Given a lower semi-continuous cost function $c(x,y): \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+ \cup \{+\infty\}$, the Kantorovich problem can be written as:
$$\inf_{P \in \Pi(P_X, P_Y)} \int_{\mathcal{X} \times \mathcal{Y}} c(x,y) \, dP.$$
It can be further proven that (2) gives the same optimizer as (1) (see, e.g., [36]). One can define the Wasserstein distance ([13] Definition 6.1) from (2). Let $\mathcal{X} = \mathcal{Y}$ and let $d$ be a metric on $\mathcal{X}$. Then, the Wasserstein distance of order $p$, $p \ge 1$, is defined as follows:
$$W_p(P_X, P_Y) := \left( \inf_{P \in \Pi(P_X, P_Y)} \int_{\mathcal{X} \times \mathcal{Y}} d^p(x,y) \, dP \right)^{\frac{1}{p}}.$$
We note that the Wasserstein distance is a metric between two measures.
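In one dimension, (3) reduces to an integral of quantile differences, $W_p^p(P_X, P_Y) = \int_0^1 |F_X^{-1}(u) - F_Y^{-1}(u)|^p \, du$. As a quick numerical sketch (toy parameters, not from the paper), this can be checked against the Gaussian closed form $W_2^2(\mathcal{N}(m_1, s_1^2), \mathcal{N}(m_2, s_2^2)) = (m_1 - m_2)^2 + (s_1 - s_2)^2$:

```python
import numpy as np
from scipy.stats import norm

def w2_squared_quantile(ppf_x, ppf_y, n_grid=200_000):
    """Approximate W_2^2 between two 1-D laws via their quantile functions."""
    u = (np.arange(n_grid) + 0.5) / n_grid   # midpoint grid avoids u = 0 and u = 1
    return float(np.mean((ppf_x(u) - ppf_y(u)) ** 2))

m1, s1, m2, s2 = 0.0, 1.0, 2.0, 3.0          # arbitrary toy parameters
approx = w2_squared_quantile(lambda u: norm.ppf(u, m1, s1),
                             lambda u: norm.ppf(u, m2, s2))
exact = (m1 - m2) ** 2 + (s1 - s2) ** 2      # Gaussian closed form
```

The midpoint quadrature matches the closed form to roughly four decimal places here; the quantile formula holds for any pair of one-dimensional laws.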
Cuturi in [10] introduced the concept of entropic OT. In this definition, an information-theoretic constraint is added to (2), i.e.,
$$\inf_{P \in \Pi(P_X, P_Y; R)} \int_{\mathcal{X} \times \mathcal{Y}} c(x,y) \, dP,$$
where
$$\Pi(P_X, P_Y; R) := \{ P \in \Pi(P_X, P_Y) : I(X;Y) \le R \},$$
with $I(X;Y) := D(P \| P_X \times P_Y)$ denoting the mutual information [32] between $X$ and $Y$, and $R \in \mathbb{R}_+$. It is well known that the constraint set is convex and compact with respect to the topology of weak convergence (for details, see, e.g., ([13] Lemma 4.4), ([37] Section 1.4)). Using the lower semi-continuity of $c(x,y)$ and ([13] Lemma 4.3), the objective function $f: P \mapsto \int c \, dP$ is also lower semi-continuous. By the compactness of the constraint set and the lower semi-continuity of $f$, Weierstrass' extreme value theorem guarantees that the minimum in (4) is attained. Moreover, since the objective function of (4) is linear, the solution is always located on the boundary of the constraint set, i.e., $I(X;Y) = R$.
Entropic OT is an efficient way to approximate solutions of the Kantorovich problem. The Lagrangian dual of (4), introduced by Cuturi in [10], can be solved iteratively and can be reformulated as follows:
$$\max_{\epsilon \ge 0} \inf_{P \in \Pi(P_X, P_Y)} \int_{\mathcal{X} \times \mathcal{Y}} c(x,y) \, dP + \epsilon \big( I(X;Y) - R \big),$$
where ϵ is a Lagrange multiplier. Using the Lagrange duality theorem ([38] Theorem 1, pp. 224–225), it can be shown that (4) and (5) give the same optimizer P .
The uncertainty of entropic OT can be understood as follows. We can write $I(X;Y) = h(Y) - h(Y|X)$, where $h(Y)$ is fixed. The conditional entropy encapsulates the randomness of the conditional distribution, and this randomness decreases as $I(X;Y)$ increases. Thus, unlike (1) and (2), there is no deterministic map for (4) and (5), because a one-to-one mapping leads to infinite mutual information. Note that $\epsilon$ in (5) also has an explicit physical meaning. In particular, entropic OT with quadratic cost coincides with the SP with a reference measure of Brownian motion (see [15]). Then, $\epsilon$ is a diffusion coefficient of the Fokker–Planck equation associated with the Schrödinger bridge.
In our main results, we study (4) instead of (5) for two reasons. First, the mutual information in (4) gives a global description of the amount of uncertainty, while the coefficient $\epsilon$ in (5) and its associated Fokker–Planck equation are more related to local properties, by the definitions of the Lagrangian dual and the Fokker–Planck equation. Moreover, there is no explicit expression for the correspondence between $R$ and $\epsilon$ in the duality. Second, the expectation of the cost function in (2) is comparable to the Wasserstein distance. As we demonstrate in the following, it gives a smoothed version of the Wasserstein distance.
Similar to the Wasserstein distance, the Sinkhorn distance of order p is defined as follows:
$$W_p(P_X, P_Y; R) := \left( \inf_{P \in \Pi(P_X, P_Y; R)} \int_{\mathcal{X} \times \mathcal{Y}} d^p(x,y) \, dP \right)^{\frac{1}{p}}.$$
Clearly, $\Pi(P_X, P_Y; R)$ is a subset of $\Pi(P_X, P_Y)$. Since the infimum in (6) is taken over a smaller set, $W_p(P_X, P_Y; R) \ge W_p(P_X, P_Y)$. For this reason, we say that entropic OT is a smoothed version of classical OT. We note that the Sinkhorn distance is not a metric because it does not fulfill the axiom of identity of indiscernibles.
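For fixed $\epsilon$, the inner problem in (5) is the entropy-regularized OT problem solved by Sinkhorn's fixed-point iterations; with the marginals fixed, the penalty $\epsilon\, I(X;Y)$ coincides, up to an additive constant, with $\epsilon\, \mathrm{KL}(P \| P_X \otimes P_Y)$. A minimal discrete sketch with made-up data (not the paper's simulations):

```python
import numpy as np

def sinkhorn(a, b, C, eps, n_iter=2000):
    """Fixed-point (Sinkhorn) iterations for entropy-regularized discrete OT."""
    K = np.exp(-C / eps)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)                # alternate scaling to match marginal b
        u = a / (K @ v)                  # ... and marginal a
    P = u[:, None] * K * v[None, :]      # coupling with (approximate) marginals a, b
    return P, float(np.sum(P * C))       # coupling and its transport cost

# Toy 1-D example: three points each, uniform weights, squared cost.
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.5, 1.5, 2.5])
a = b = np.ones(3) / 3.0
C = (x[:, None] - y[None, :]) ** 2       # unregularized optimum is 0.25 (monotone matching)

costs = [sinkhorn(a, b, C, eps)[1] for eps in (1.0, 0.3, 0.1)]
```

The transport cost $\langle C, P_\epsilon \rangle$ is nondecreasing in $\epsilon$ and never falls below the unregularized optimum, mirroring the fact that the Sinkhorn distance upper-bounds the Wasserstein distance.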
Since entropic OT is concerned with mutual information, it may be of interest to introduce a conditional Sinkhorn distance. This is defined as follows:
$$W_p(P_{X|Z}, P_{Y|Z} | P_Z; R) := \inf_{P \in \Pi(P_{X|Z}, P_{Y|Z} | P_Z),\; I(X;Y|Z) \le R} \big\{ E_P[d^p(X,Y)] \big\}^{\frac{1}{p}},$$
where the conditional mutual information $I(X;Y|Z) := \int I(P_{X|Z=z}; P_{Y|Z=z}) \, dP_Z(z)$ and $\Pi(P_{X|Z}, P_{Y|Z} | P_Z) := \{ P_{X,Y|Z} \cdot P_Z : P_{X,Y|Z=z} \in \Pi(P_{X|Z=z}, P_{Y|Z=z}) \text{ for } P_Z\text{-a.e. } z \}$. The conditional Sinkhorn distance is utilized in [20] and leads to a data processing inequality. Since the constraint on conditional mutual information is a linear form of $I(P_{X|Z=z}; P_{Y|Z=z})$, the constraint set is still convex. The objective function is also linear in $P$. Therefore, the functional and topological properties of the conditional Sinkhorn distance are similar to those of the unconditional one.
Next, we state some known results of Talagrand inequality [1].
Definition 3
(Talagrand Inequality). Let $P_X$ be a reference probability measure with density $e^{-V(x)}$, where $V: \mathcal{X} \to \mathbb{R}$. We say that $P_X$ satisfies $T(\lambda)$, i.e., the Talagrand inequality with parameter $\lambda > 0$, if, for any $P_Y \in \mathcal{P}(\mathcal{Y})$,
$$W_2(P_X, P_Y) \le \sqrt{\frac{2}{\lambda} D(P_Y \| P_X)}.$$
Remark 1. (8) was originally introduced by Talagrand in [1] for Gaussian $P_X$. Blower in [39] gave a refinement and proved that
$$D^2 V \succeq \lambda I_n \;\Longrightarrow\; T(\lambda).$$
When going beyond Euclidean space to a manifold, Otto and Villani in [40] showed that the Bakry–Émery condition $D^2 V + \mathrm{Ric} \succeq \lambda I_n$ also implies $T(\lambda)$.
Recently, refined inequalities with dimensional improvements were obtained in multidimensional Euclidean space. These dimensional improvements were first observed in the Gaussian case of logarithmic Sobolev inequality, Brascamp–Lieb (or Poincaré) inequality [41] and Talagrand inequality [2]. For a standard Gaussian measure P X , the dimensional Talagrand inequality has the form:
$$W_2^2(P_X, P_Y) \le E[\|Y\|^2] + n - 2n \, e^{\frac{1}{2n}\left( E[\|Y\|^2] - n - 2 D(P_Y \| P_X) \right)}.$$
Bolley et al. in [4] generalized the results in [2,41] from the Gaussian case to the strongly log-concave or log-concave case. Next, we state their result. Let $dP_X = e^{-V} dx$, where $V: \mathbb{R}^n \to \mathbb{R}$ is $C^2$ continuous and $D^2 V \succeq \lambda I_n$. Bolley's dimensional Talagrand inequality is given as follows:
$$\frac{\lambda}{2} W_2^2(P_X, P_Y) \le E[V(Y)] - E[V(X)] + n - n \, e^{\frac{1}{n}\left( E[V(Y)] - E[V(X)] - D(P_Y \| P_X) \right)}.$$
The dimensional Talagrand inequalities (9) and (10) are tighter than (8). To see this result, one may refer to our Remark 6 below.
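Both sides of (8) and (9) have closed forms when $P_X$ is standard Gaussian and $Y \sim \mathcal{N}(m, s^2)$ in one dimension, so the claimed ordering can be checked numerically; a small sketch with arbitrary parameters (not the paper's simulations):

```python
import numpy as np

def gaussian_checks(m, s):
    """Closed-form comparison of (8) and (9) for Y ~ N(m, s^2) against P_X = N(0, 1)."""
    n = 1
    w2_sq = m**2 + (s - 1.0)**2                       # W_2^2 between the two Gaussians
    d_kl = 0.5 * (m**2 + s**2 - 1.0 - np.log(s**2))   # D(N(m, s^2) || N(0, 1))
    rhs_classical = 2.0 * d_kl                        # RHS of (8) with lambda = 1
    e_y2 = m**2 + s**2                                # E[Y^2]
    rhs_dim = e_y2 + n - 2*n*np.exp((e_y2 - n - 2.0*d_kl) / (2.0*n))  # RHS of (9)
    return w2_sq, rhs_dim, rhs_classical
```

For Gaussian $Y$, (9) in fact holds with equality, while the classical bound (8) is strict unless $s = 1$.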
Bai et al. in [20] gave a generalization of (9) to the Sinkhorn distance. When $P_X$ is standard Gaussian,
$$W_2^2(P_X, P_Y; R) \le E[\|Y\|^2] + n - 2n \sqrt{\frac{1}{2\pi e}\left(1 - e^{-\frac{2R}{n}}\right)} \, e^{\frac{1}{n} h(Y)}.$$
When $R \to +\infty$, this inequality coincides with (9).

2.2. Measure Concentration

The measure concentration phenomenon describes how the probability that a random variable $X$ deviates from a given value, such as its mean or median, decays with the size of the deviation. Marton introduced an approach to concentration directly at the level of probability measures using OT (see, e.g., ([13] Chapter 22)).
To introduce the notion of concentration of measure, we first define the probability metric space. Let $\mathcal{X}$ be a Polish space, let $d$ be a metric on $\mathcal{X}$, and let $\mu$ be a probability measure defined on the Borel sets of $\mathcal{X}$. Then, we say that the triple $(\mathcal{X}, d, \mu)$ is a probability metric space.
For an arbitrary set $A \subseteq \mathcal{X}$ and any $r \ge 0$, we define $A_r$ as
$$A_r := \{ x \in \mathcal{X} : d(x, A) > r \},$$
where $d(x, A) := \inf_{a \in A} d(x, a)$. Then, we say that a probability measure $\mu$ has normal (or Gaussian) concentration on $(\mathcal{X}, d)$ if there exist positive constants $K$ and $\kappa$ such that
$$\mu(A) \ge \frac{1}{2} \;\Longrightarrow\; \mu(A_r) \le K e^{-\kappa r^2}, \quad \forall r > 0.$$
There is another, weaker statement of normal concentration:
$$\mu(A) \ge \frac{1}{2} \;\Longrightarrow\; \mu(A_r) \le K e^{-\kappa (r - r_0)^2}, \quad \forall r > r_0.$$
It is not difficult to see that (12) can be recovered from (13), possibly with degraded constants, i.e., a larger $K$ and/or a smaller $\kappa$.
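One standard route (not spelled out in the text) from the weaker statement (13) back to (12) uses the elementary bound $(r - r_0)^2 \ge \frac{r^2}{2} - r_0^2$:

```latex
(r-r_0)^2-\Big(\tfrac{r^2}{2}-r_0^2\Big)=\tfrac{1}{2}\,(r-2r_0)^2\ \ge\ 0,
\qquad\text{so, for } r>r_0:\quad
K\,e^{-\kappa (r-r_0)^2}\ \le\ K e^{\kappa r_0^2}\, e^{-\frac{\kappa}{2} r^2}.
```

For $0 < r \le r_0$, one can use $\mu(A_r) \le 1 \le e^{\frac{\kappa}{2} r_0^2} e^{-\frac{\kappa}{2} r^2}$, so (12) holds with $K' = \max\{K e^{\kappa r_0^2}, e^{\frac{\kappa}{2} r_0^2}\}$ and $\kappa' = \kappa/2$.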
The next theorem gives the connection between normal concentration and Talagrand inequality.
Theorem 1
(Theorem 3.4.7 [5]). Let ( X , d , μ ) be a probability metric space. Then, the following two statements are equivalent:
  • $\mu$ satisfies $T(\lambda)$.
  • $\mu$ has a dimension-free normal concentration with $\kappa = \frac{\lambda}{2}$.
The intuition behind Marton's method is that OT theory gives a metric between two probability measures via the metric structure of the supporting Polish space. This metric can be further connected with probability divergences using the Talagrand inequality.

2.3. Entropy Power Inequality and Deconvolution

EPI [25] states that, for all independent continuous random vectors X and Y,
$$N(X + Y) \ge N(X) + N(Y),$$
where $N(X) := \frac{1}{2\pi e} e^{\frac{2}{n} h(X)}$ denotes the entropy power of $X$. Equality is achieved when $X$ and $Y$ are Gaussian random vectors with proportional covariance matrices.
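For Gaussian vectors, $N(X) = \det(\Sigma)^{1/n}$, so EPI reduces to the Minkowski determinant inequality $\det(\Sigma_1 + \Sigma_2)^{1/n} \ge \det(\Sigma_1)^{1/n} + \det(\Sigma_2)^{1/n}$, with equality for proportional covariances. A quick numerical sketch with made-up covariances (not from the paper):

```python
import numpy as np

def entropy_power_gaussian(cov):
    """N(X) = det(Sigma)^(1/n) for X ~ N(0, Sigma); the (2*pi*e) factors cancel."""
    n = cov.shape[0]
    return np.linalg.det(cov) ** (1.0 / n)

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3)); S1 = A @ A.T + np.eye(3)   # two arbitrary PSD covariances
B = rng.normal(size=(3, 3)); S2 = B @ B.T + np.eye(3)

lhs = entropy_power_gaussian(S1 + S2)                            # N(X + Y)
rhs = entropy_power_gaussian(S1) + entropy_power_gaussian(S2)    # N(X) + N(Y)
```

With proportional covariances (e.g., $\Sigma_2 = 2\Sigma_1$) the two sides coincide, consistent with the stated equality condition.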
Deconvolution is the problem of estimating a distribution $f(x)$ from observations $Y_1, \dots, Y_k$ corrupted by additive noise $Z_1, \dots, Z_k$, written as
$$Y_i = X_i + Z_i,$$
where $k, i \in \mathbb{N}$ and $1 \le i \le k$. The $X_i$'s are i.i.d. with density $f(x)$, and the $Z_i$'s are i.i.d. with density $h(z)$; the $X_i$'s and $Z_i$'s are mutually independent. Let $g(y)$ be the probability density function of $Y$, given by the convolution $g = f * h$. Then, their entropies can be bounded by EPI directly.
In our problem, we slightly abuse the concept by simply separating a random vector $Y$ into two independent random vectors $X$ and $Z$. We use this approach to introduce the uncertainty into entropic OT and consequently bound the Sinkhorn distance by EPI. Deconvolution is generally a more challenging problem than convolution. For instance, the log-concave family is stable under convolution, i.e., the convolution of two log-concave distributions is still log-concave, but we cannot guarantee that the deconvolution of two log-concave distributions is still log-concave. A trivial example is that the deconvolution of a log-concave distribution by itself is a Dirac delta. Moreover, $f$ may not in general be positive or integrable for arbitrary given $g$ and $h$, as shown in [42]. However, it should be noted that there are many numerical methods to compute deconvolutions; see, e.g., [42,43,44].
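In the Gaussian case, deconvolution is explicit: variances subtract. A small numerical sketch (with assumed scales $\sigma_g = 2$ and $\sigma_h = 1$, not taken from the paper) verifying that the deconvolved density convolves back to $g$:

```python
import numpy as np

def gaussian_pdf(x, s):
    return np.exp(-x**2 / (2.0 * s**2)) / (s * np.sqrt(2.0 * np.pi))

s_g, s_h = 2.0, 1.0                 # observed scale and noise scale (assumed)
s_f = np.sqrt(s_g**2 - s_h**2)      # deconvolved scale: variances subtract

x = np.linspace(-12.0, 12.0, 2401)
dx = x[1] - x[0]
# discrete convolution f * h on the grid should reproduce g
g_rec = np.convolve(gaussian_pdf(x, s_f), gaussian_pdf(x, s_h), mode="same") * dx
err = float(np.max(np.abs(g_rec - gaussian_pdf(x, s_g))))
```

The reconstruction error is at the quadrature level; note the sketch silently requires $\sigma_h < \sigma_g$, illustrating that deconvolution need not exist for arbitrary $g$ and $h$.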

3. Main Theoretical Results

In this section, we derive our main theoretical results. First, we give a new HWI-type inequality.
Theorem 2
(HWI-Type Inequality). Let $\mathcal{X} = \mathcal{Y} = \mathbb{R}^n$. Let $\mu$ be a probability measure with density $e^{-V(x)}$, where $V: \mathbb{R}^n \to \mathbb{R}$ is $C^2$ continuous and $D^2 V \succeq \lambda I_n$ with $\lambda > 0$. Let $P_X, P_Y$ be two probability measures on $\mathbb{R}^n$ with $P_X, P_Y \ll \mu$. For any independent $Y_1, Y_2$ satisfying $Y_1 + Y_2 = Y$, $E[Y_2] = 0$ and $h(Y) - h(Y_2) \le R$, the following bound holds:
$$\frac{\lambda}{2} W_2^2(P_X, P_Y; R) \le E[V(Y)] - E[V(X)] + n - n \, e^{\frac{1}{n}(h(Y_1) - h(X))} + W_2(P_X, P_{Y_1}) \sqrt{I(P_X | \mu)},$$
where the relative Fisher information $I(P_X | \mu) := \int \frac{\|\nabla f\|^2}{f} \, d\mu$, $f = \frac{dP_X}{d\mu}$.
Proof. 
See Appendix A. □
Remark 2.
In Theorem 2, we construct $Y_1$ and $Y_2$, where $Y_1$ and $X$ have a deterministic relationship. We note that the uncertainty in our construction, i.e., the independent $Y_2$, is located at one marginal, whereas the uncertainty of the true dynamics of entropic OT is spread along the whole path. The simplicity of our construction allows for the specific bound. We further note that there always exist $Y_1, Y_2$ satisfying the assumptions given in Theorem 2: trivially, $Y_1 = E[Y]$ and $Y_2 = Y - E[Y]$ fulfill them.
The next result gives a new Talagrand-type inequality.
Theorem 3
(Talagrand-Type Inequality). Let $\mathcal{X} = \mathcal{Y} = \mathbb{R}^n$. Let $dP_X = e^{-V(x)} dx$, where $V: \mathbb{R}^n \to \mathbb{R}$ is $C^2$ continuous and $D^2 V \succeq \lambda I_n$ with $\lambda > 0$, and let $P_Y \ll P_X$. Then, the following bound holds:
$$\frac{\lambda}{2} W_2^2(P_X, P_Y; R) \le E[V(Y)] - E[V(X)] + n - n \, C(P_Y, R) \, e^{\frac{1}{n}(h(Y) - h(X))},$$
where $C(P_Y, R) \in [0, 1]$ is a numerical term for the given $P_Y$ and $R \ge 0$.
Proof. 
Let $\mu = P_X$, i.e., $d\mu = e^{-V} dx$, in (15). In such a case, we have $I(P_X | \mu) = 0$ from the definition of relative Fisher information. Take $C(P_Y, R) = e^{\frac{1}{n}(h(Y_1) - h(Y))}$; then, (16) follows from (15). □
Next, we state some technical remarks on Theorem 3.
Remark 3
(On Theorem 3). In Theorem 3, we show that the Sinkhorn distance of two random vectors can be upper-bounded by a difference of a functional on the two marginals, i.e., $E[V(Y)] - E[V(X)]$, and a term related to entropy power, i.e., $n \, C(P_Y, R) \, e^{\frac{1}{n}(h(Y) - h(X))}$. Interestingly, only the latter term depends on the constraint $R$. This means that the effect of the information constraint is only associated with the randomness of the two random vectors, not with their positions. Recalling the physical meaning of entropic OT with quadratic cost, this expression is very natural, because the information constraint $R$ is directly related to the randomness of the Schrödinger bridge.
Remark 4
(On the numerical term $C(\cdot,\cdot)$). The numerical term $C = e^{\frac{1}{n}(h(Y_1) - h(Y))}$ can be computed from any $Y_1$ satisfying the assumptions given in Theorem 2, i.e., $Y_1, Y_2$ are independent, $Y_1 + Y_2 = Y$, $E[Y_2] = 0$ and $h(Y) - h(Y_2) \le R$. We observe that $e^{\frac{1}{n} h(Y_1)}$ has the form of a square root of entropy power. Using EPI and the fact that $N(\cdot) \ge 0$, we have
$$N(Y) \ge N(Y_1) + N(Y_2) \ge N(Y_1).$$
Therefore, $C = e^{\frac{1}{n}(h(Y_1) - h(Y))} = \sqrt{N(Y_1)/N(Y)} \in [0, 1]$. When $R = 0$, then $Y_2 = Y - E[Y]$ and the density of $Y_1$ is $\delta(x - E[Y])$. This means that $e^{\frac{1}{n} h(Y_1)} = 0$, hence $C = 0$. When $R = +\infty$, then $Y = Y_1$, $e^{\frac{1}{n} h(Y_1)} = e^{\frac{1}{n} h(Y)}$, and consequently $C = 1$. Therefore, $C(\cdot, 0) = 0$ and $C(\cdot, +\infty) = 1$ for all $P_Y$.
Moreover, we can show that $C$ can always be chosen non-decreasing with respect to $R$. We know that $C = e^{\frac{1}{n}(h(Y_1) - h(Y))}$ subject to $Y_1 + Y_2 = Y$, $E[Y_2] = 0$ and $h(Y) - h(Y_2) \le R$. For a larger value of $R$, the constraint is weaker, so $h(Y_2)$ can be chosen non-increasing, which leads to a non-decreasing $h(Y_1)$. Therefore, there exists a choice with $C(\cdot, R + \Delta R) \ge C(\cdot, R)$ for all $\Delta R > 0$, i.e., $C(\cdot, R)$ is monotonically non-decreasing with respect to $R$.
We note that, for particular distributions, we may have an explicit expression of $C(\cdot,\cdot)$. For instance, when $P_Y$ is Gaussian, we can always take the linear combination $Y = Y_1 + Y_2$, where $Y_1$ and $Y_2$ are independent Gaussian with proportional covariance matrices. In such a case, EPI is saturated as follows:
$$e^{\frac{2}{n} h(Y_1)} = e^{\frac{2}{n} h(Y)} - e^{\frac{2}{n} h(Y_2)} = \left( 1 - e^{-\frac{2R}{n}} \right) e^{\frac{2}{n} h(Y)}.$$
As a result, we have $C(P_Y, R) = e^{\frac{1}{n}(h(Y_1) - h(Y))} = \sqrt{1 - e^{-\frac{2R}{n}}}$. For the Cauchy distribution $\mathrm{Cauchy}(x_0, \gamma)$, the differential entropy is $\log(4\pi\gamma)$. The sum of independent Cauchy random variables satisfies $\sum_i^n \mathrm{Cauchy}(x_i, \gamma_i) \sim \mathrm{Cauchy}(\sum_i^n x_i, \sum_i^n \gamma_i)$. When $Y$ is i.i.d. Cauchy, i.e., $(Y)_i \sim \mathrm{Cauchy}(x_0, \gamma)$, we take $(Y_1)_i \sim \mathrm{Cauchy}(x_0, \frac{1}{4\pi} e^{\frac{1}{n} h(Y)} (1 - e^{-\frac{R}{n}}))$ and $(Y_2)_i \sim \mathrm{Cauchy}(0, \frac{1}{4\pi} e^{\frac{1}{n} h(Y)} e^{-\frac{R}{n}})$. We can see that this linear combination satisfies our assumption $h(Y) - h(Y_2) \le R$ and $C(P_Y, R) = e^{\frac{1}{n}(h(Y_1) - h(Y))} = 1 - e^{-\frac{R}{n}}$.
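The two closed forms above can be compared directly. A small sketch (for $n = 1$) of the Gaussian expression $\sqrt{1 - e^{-2R/n}}$ and the i.i.d. Cauchy expression $1 - e^{-R/n}$:

```python
import numpy as np

n = 1
R = np.linspace(0.0, 6.0, 601)
C_gauss = np.sqrt(1.0 - np.exp(-2.0 * R / n))   # Gaussian split saturates EPI
C_cauchy = 1.0 - np.exp(-R / n)                  # the i.i.d. Cauchy split above
```

Both curves start at $C = 0$, increase monotonically, and approach 1, and the Cauchy curve lies below the Gaussian one, consistent with the behavior illustrated in Figure 1.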
Note that the linear combination $Y = Y_1 + Y_2$ is not unique under the assumptions of Theorem 2. Consequently, this implies the non-uniqueness of $C(\cdot,\cdot)$. In order to obtain the tightest bound in (16), we need to solve the following optimization problem:
$$C(P_Y, R) = \sup e^{\frac{1}{n}(h(Y_1) - h(Y))},$$
subject to $Y_1 + Y_2 = Y$ and $h(Y) - h(Y_2) \le R$. To look into this optimization problem, we recall Courtade's reverse EPI ([31] Corollary 1) as follows. If we have independent $X$ and $Y$ with finite second moments and choose $\theta$ to satisfy $\theta/(1-\theta) = N(Y)/N(X)$, then
$$N(X + Y) \le (N(X) + N(Y)) \big( \theta \, p(X) + (1 - \theta) \, p(Y) \big),$$
where $p(X) := \frac{1}{n} N(X) J(X) \ge 1$ is the Stam defect and $J(X) := I(P_X | \mu)$ with $d\mu = dx$ is the Fisher information. We note that $p(X)$ is affine invariant, i.e., $p(X) = p(tX)$ for $t > 0$, because $t^2 N(X) = N(tX)$ and $t^2 J(tX) = J(X)$. The equality $p(X) = 1$ holds only if $X$ is Gaussian. In our case, $\theta = N(Y_1)/(N(Y_1) + N(Y_2))$. When $\theta \to 1$, (18) becomes
$$N(Y) \le (N(Y_1) + N(Y_2)) \cdot p(Y_2).$$
This means that the saturation of EPI is controlled by $p(Y_2)$ when the noise $Y_2$ is small, i.e., when $R$ is large. In such a case, $C(P_Y, R) \to \sqrt{1 - e^{-\frac{2R}{n}}}$ if we let $Y_2$ be close to Gaussian, i.e., $p(Y_2) = 1$. On the other hand, when $\theta \to 0$, EPI can also be saturated if we let $Y_1$ be close to Gaussian.
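As a sanity check on the Stam defect $p(X) = \frac{1}{n} N(X) J(X)$, one can plug in closed-form entropy powers and Fisher informations; the Laplace values below are standard closed forms ($h = 1 + \log(2b)$, $J = 1/b^2$), not taken from the paper:

```python
import numpy as np

def stam_defect(N_X, J_X, n=1):
    """p(X) = (1/n) N(X) J(X); >= 1 by Stam's inequality, with equality iff Gaussian."""
    return N_X * J_X / n

s, b = 1.7, 0.4
p_gauss = stam_defect(s**2, 1.0 / s**2)                           # Gaussian: N = s^2, J = 1/s^2
p_laplace = stam_defect((2.0 * np.e / np.pi) * b**2, 1.0 / b**2)  # Laplace(0, b): N = (2e/pi) b^2
```

The Gaussian gives exactly $p = 1$; the Laplace distribution gives $p = 2e/\pi \approx 1.73$, independent of the scale $b$, illustrating the affine invariance noted above.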
In Figure 1, we illustrate numerical simulations of $C(\cdot,\cdot)$ for the one-dimensional case. For general distributions beyond Gaussian and i.i.d. Cauchy, one can approximate $C(\cdot,\cdot)$ using kernel methods of deconvolution; see, e.g., [43,44]. Our strategy of deconvolution in Figure 1 is to let $Y_2 = tY'$, where $Y'$ is an independent copy of $Y$ and $t \in [0, 1]$. The Gaussian mixture is an exception to this strategy because its spectrum would not be integrable; instead, we let $Y_2$ be Gaussian for the Gaussian mixture. We note that this strategy is mostly not optimal, and the optimal way to maximize the entropy power in (17) remains an open question.
Figure 1. Plot of the numerical term C subject to the information constraint R evaluated with respect to different distributions for the one-dimensional case.
Remark 5
(On the equality condition of Theorem 3). To determine when (16) holds with equality, we need the inequalities in (A1) and in Lemma A1 to be equalities. The equality in (A1) holds when $P_X$ is isotropic Gaussian, i.e., $P_X \sim \mathcal{N}(\mu, \sigma^2 I_n)$ for some $\mu \in \mathbb{R}^n$ and $\sigma > 0$. The equality in Lemma A1 holds when $\varphi$ is affine and $D^2\varphi$ has identical eigenvalues, i.e., $\varphi = k \cdot \mathrm{Id}$, $k \in \mathbb{R}$; see ([4] Lemma 2.6). From ([45] Theorem 1), we know that the linear combination $Y = Y_1 + Y_2$ in Theorem 2 is the optimizer for entropic OT when $X$ and $Y$ are isotropic Gaussian. In such a case, equality in (16) holds and $C(\cdot, R) = \sqrt{1 - e^{-\frac{2R}{n}}}$.
The following corollary is immediate from Theorem 3.
Corollary 1.
The Wasserstein distance is bounded by
$$\frac{\lambda}{2} W_2^2(P_X, P_Y) \le E[V(Y)] - E[V(X)] + n - n \, e^{\frac{1}{n}(h(Y) - h(X))}.$$
Proof. 
This is immediate from Theorem 3 when $R \to \infty$, in which case $C(\cdot, +\infty) = 1$. □
Remark 6
(On Corollary 1). We note that (19) is equivalent to Bolley's dimensional Talagrand inequality (10), and it is tighter than the classical Talagrand inequality (8). To make this point clear, note that, under our assumptions, $h(X) = E[V(X)]$ and $D(P_Y \| P_X) = E[V(Y)] - h(Y)$, because $dP_X = e^{-V(x)} dx$. Substituting these expressions into the last term of (19), we obtain (10). Since $e^u \ge 1 + u$, (10) is, in general, tighter than the classical Talagrand inequality (8), i.e., the RHS of (10) is at most the RHS of (8). Equality holds if and only if $h(Y) = h(X)$.
We notice from Remark 6 that $C$ is the only difference between (10) and (16). Therefore, we can immediately obtain a result related to measure concentration following ([4] Corollary 2.4). Next, we state the result on measure concentration obtained from (16).
Corollary 2.
Let $d\mu = e^{-V} dx$, where $V: \mathbb{R}^n \to \mathbb{R}$ is $C^2$ continuous and $D^2 V \succeq \lambda I_n$ with $\lambda > 0$. Let $A \subseteq \mathbb{R}^n$, $A_r := \{ x \in \mathbb{R}^n : \|x - y\| > r, \ \forall y \in A \}$ for $r \ge 0$, and $c_A := \sqrt{\frac{2}{\lambda} \log(1/\mu(A))}$. Then, for $r \ge c_A$, we obtain
$$\mu(A_r) \le C^{-n} \cdot e^{-\frac{\lambda}{2}(r - c_A)^2}.$$
Proof. 
See Appendix B. □
Next, we state some technical comments on Corollary 2.
Remark 7
(On Corollary 2). We note that in the derivation of Corollary 2 we follow the method of Marton [46], which utilizes the geometric properties of the Wasserstein distance. From our discussion above, the information constraint introduces uncertainty into entropic OT. Corollary 2 further shows that this uncertainty smooths the geometric properties of the Wasserstein distance, i.e., the Sinkhorn distance implies a looser measure concentration inequality. Consider first the two extreme cases. When $C = 0$, the two random vectors are independent and entropic OT has the most uncertainty; it is natural that the quadratic difference of two independent random vectors does not imply any concentration. When $C = 1$, the inequality coincides with the one in Theorem 1. Between these two extremes, i.e., when $0 < C < 1$, the Sinkhorn distance leads to a weaker normal concentration compared with Theorem 1. Furthermore, we include in Appendix C a proof that the Sinkhorn distance gives a weaker measure concentration inequality in high dimensions.
The next theorem is another Talagrand-type inequality. Compared with Theorem 3, the following result is obtained using a term related to the saturation of $P_X$, instead of the saturation of $P_Y$ used in Theorem 3.
Theorem 4.
Let $\mathcal{X} = \mathcal{Y} = \mathbb{R}^n$. Without loss of generality, let $X$ be a zero-mean random vector with density $e^{-V(x)}$, where $V: \mathbb{R}^n \to \mathbb{R}$ is $C^2$ continuous, $D^2 V \succeq \lambda I_n$ with $\lambda > 0$, and $P_Y \ll P_X$. Then, the following bound holds:
$\frac{\lambda}{2} W_2^2(P_X, P_Y; R) \le E[V(Y)] - E[V(X)] + n - n\, C_x(P_X, R)\, e^{\frac{1}{n}(h(Y) - h(X))} + \epsilon,$ (21)
where $\epsilon$ is a term related to the linearity of $\nabla V$.
Proof. 
See Appendix D. □
We offer the following technical comments on Theorem 4.
Remark 8
(On Theorem 4). Similar to $C(P_Y, R)$, the term $C_x(P_X, R) \in [0, 1]$ can be computed from the decomposition $X = C_x X' + X_2$ with independent $X'$ and $X_2$, under the assumptions that $X'$ is a copy of $X$, $E[X_2] = 0$, and $h(X) - h(X_2) \le R$, as shown in the proof. However, (21) is less natural than (16) because of the extra term $\epsilon$. When $\nabla V$ is nearly linear, $\epsilon$ should be small; when $\nabla V$ is far from linear, $\epsilon$ is unknown.
Theorem 4 can also give a measure concentration inequality, namely
$\mu(A_r) \le C_x^{-n} \cdot e^{-\frac{\lambda}{2}(r - c_A)^2 + \epsilon},$ (22)
where $A_r$ and $c_A$ are the same as those defined in Corollary 2. We omit the proof of (22) because it follows from steps similar to the ones used to prove Corollary 2.
Remark 9.
When $\nabla V$ is linear, $\epsilon$ is zero and $C_x(\cdot, R) = \sqrt{1 - e^{-\frac{2}{n}R}}$, obtained by simply taking $t = \sqrt{1 - e^{-\frac{2}{n}R}}$ in the proof. In such a case, (21) recovers (11) by taking $X$ to be a standard Gaussian, i.e., $V(x) = \|x\|^2/2 + k$, where $k$ is a normalization factor. Substituting this $V$ and multiplying both sides of (21) by 2, we have
$W_2^2(X, Y; R) \le E[\|Y\|^2] - E[\|X\|^2] + 2n - 2n\sqrt{1 - e^{-\frac{2}{n}R}}\; e^{\frac{1}{n}(h(Y) - h(X))} = E[\|Y\|^2] + n - 2n\sqrt{\tfrac{1}{2\pi e}\big(1 - e^{-\frac{2}{n}R}\big)}\; e^{\frac{1}{n}h(Y)},$ (23)
which is exactly (11).
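The value $t = \sqrt{1 - e^{-\frac{2}{n}R}}$ can be checked directly in the scalar Gaussian case: choosing $X_2 \sim \mathcal{N}(0, (1 - t^2)\sigma^2)$ makes the constraint $h(X) - h(X_2) \le R$ hold with equality. A minimal numerical sketch (parameters chosen arbitrarily):

```python
import math

n, R, sigma2 = 1, 0.7, 4.0                  # dimension, information constraint, Var(X)
t = math.sqrt(1 - math.exp(-2 * R / n))     # claimed value of C_x(., R)

# Decompose X = t*X' + X2 with X' a copy of X and X2 ~ N(0, (1 - t^2) * sigma2),
# so that Var(t*X' + X2) = Var(X) and E[X2] = 0.
var_X2 = (1 - t ** 2) * sigma2
h = lambda var: 0.5 * math.log(2 * math.pi * math.e * var)  # entropy of N(0, var)

# The entropy constraint h(X) - h(X2) <= R is saturated at this t.
assert abs((h(sigma2) - h(var_X2)) - R) < 1e-12
```

Indeed, $h(X) - h(X_2) = -\frac{1}{2}\log(1 - t^2) = R$ at this choice of $t$, so no larger $t$ is feasible within this Gaussian decomposition.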
The next theorem gives a Talagrand-type bound for the conditional Sinkhorn distance.
Theorem 5
(Talagrand-type bound for conditional Sinkhorn distance). Let $\mathcal{X} = \mathcal{Y} = \mathbb{R}^n$. Given a probability measure $P_Z$ and two conditional probability measures $P_{X|Z}$ and $P_{Y|Z}$ such that $dP_{X|Z=z_0} = dP_X = e^{-V(x)}\,dx$ for all $z_0 \in \mathcal{Z}$, let $V: \mathbb{R}^n \to \mathbb{R}$ be $C^2$ continuous, $D^2 V \succeq \lambda I_n$ with $\lambda > 0$, and $P_{Y|Z=z_0} \ll P_X$. Then, the following bound holds:
$\frac{\lambda}{2} W_2^2(P_X, P_{Y|Z} | P_Z; R) \le E[V(Y|Z)] - E[V(X)] + n - n\, C(P_{Y|Z}, R)\, e^{\frac{1}{n}(h(Y|Z) - h(X))},$ (24)
where $C(P_{Y|Z}, R)$ is a numerical term. A direct observation is that $C(P_{Y|Z}, R)$ can be taken as $\inf_{z_0 \in \mathcal{Z}} C(P_{Y|Z=z_0}, R)$.
Proof. 
See Appendix E. □

4. Numerical Simulations

In this section, we describe several numerical simulations that illustrate the validity of our theoretical findings. To check the tightness of our bounds, we use as a reference the numerical solution obtained via the Sinkhorn algorithm, as implemented in the POT library [47]. As an iterative method, the Sinkhorn algorithm has a computational error, since the iteration stops once a convergence tolerance is reached. For example, in Figure 2a, we plot the result of Theorem 3 with two Gaussian marginals, which is the scenario in which (16) holds with equality. From the figure, we can see that the simulated value is slightly greater than the bound; nevertheless, the gap is reasonably small.
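For reference, the fixed-point iteration behind the Sinkhorn algorithm [10] takes only a few lines. The sketch below is a minimal NumPy version on an arbitrary discretized example; the simulations themselves use the more robust implementation in POT [47]:

```python
import numpy as np

def sinkhorn_cost(a, b, M, reg, iters=2000):
    """Entropy-regularized OT: returns the transport cost <gamma, M> of the
    Sinkhorn plan between histograms a, b with cost matrix M."""
    K = np.exp(-M / reg)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(iters):               # alternating marginal projections
        v = b / (K.T @ u)
        u = a / (K @ v)
    gamma = u[:, None] * K * v[None, :]  # transport plan
    return np.sum(gamma * M), gamma

# Two discretized 1D marginals on a common grid (illustrative choice).
x = np.linspace(-5, 5, 200)
a = np.exp(-x**2 / 2);        a /= a.sum()     # ~ N(0, 1)
b = np.exp(-(x - 1)**2 / 8);  b /= b.sum()     # ~ N(1, 4)
M = (x[:, None] - x[None, :]) ** 2             # quadratic cost

cost, gamma = sinkhorn_cost(a, b, M, reg=0.1)
```

The POT implementation adds log-domain stabilization and explicit stopping criteria; the plain iteration above is only meant to make the computational reference transparent.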
The simulations for Theorems 3 and 4 are given in Figure 2 and Figure 3, respectively. Since the optimal value of $C$ in Theorem 3 and the error term $\epsilon$ in Theorem 4 beyond the linear case are unknown, we mainly simulate the case with one Gaussian marginal, i.e., with $C = \sqrt{1 - e^{-\frac{2}{n}R}}$ and $\epsilon = 0$. In this way, we avoid the unknown factors and can draw several observations on the tightness of the bounds derived in these two theorems.
The first observation concerns absolute continuity. We observe that the original Talagrand inequality (8) is not tight when $P_Y$ is not absolutely continuous with respect to $P_X$, because $D(P_Y \| P_X) = +\infty$ in this case. In Figure 4, we illustrate one case close to this failure with two strongly log-concave distributions, i.e., the Radon–Nikodym derivative $dP_Y/dP_X = \exp[(x/5)^4 + k]$, where $k \in \mathbb{R}$ is a normalizing factor, goes to $\infty$ as $|x| \to \infty$. Consequently, the bound (16) from Theorem 3 is loose, as illustrated in Figure 5. The bound becomes much looser if we strengthen this effect, i.e., if we let $dP_Y/dP_X = \exp[(x/5)^8 + k]$, as shown in Figure 6. By simply exchanging the roles of the distributions $P_X$ and $P_Y$, we preserve absolute continuity and the bound becomes tight, as we can see in Figure 5b and Figure 6b.
The second observation is related to the numerical term $C$. By comparing Figure 2b and Figure 3a, we observe that $C = \sqrt{1 - e^{-\frac{2}{n}R}}$ gives a better description than $C = 1 - e^{-\frac{1}{n}R}$, i.e., the former gives a tighter bound. This is reasonable according to our previous discussion, i.e., the independent linear combination of Cauchy random variables is not the optimal deconvolution. In fact, even when $P_Y$ is not Gaussian in (16), $C = \sqrt{1 - e^{-\frac{2}{n}R}}$ appears to remain valid for all the simulated distributions.
Furthermore, we observe that the tightness of the bounds in Theorems 3 and 4 is related to the linearity of the transport map, which can be seen as a similarity between the two marginals. For example, the Cauchy and Laplace distributions are similar to the Gaussian distribution, and the corresponding bounds in Figure 3a,d are tight. On the other hand, the Gaussian mixture and the exponential distribution are relatively far from the Gaussian distribution; hence, Figure 3c,f show looser bounds.
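In one dimension, the Brenier map is the monotone quantile map $T = F_Y^{-1} \circ F_X$, so this similarity argument can be made concrete: the map between two Gaussians is affine, while the map from a Gaussian to an exponential is not. A small sketch using only the Python standard library (grid and parameters are illustrative):

```python
import math
from statistics import NormalDist

x = [i / 10 for i in range(-20, 21)]   # interior grid on [-2, 2], away from the tails
src = NormalDist(0, 5)                 # source marginal N(0, 25)

# 1D Brenier map: compose the source CDF with the target quantile function.
T_gauss = [NormalDist(0, 2).inv_cdf(src.cdf(t)) for t in x]  # Gaussian target: affine map
T_exp = [-5 * math.log(1 - src.cdf(t)) for t in x]           # Exp(scale=5) target: nonlinear map

# Maximum second finite difference: ~0 iff the map is (numerically) affine.
second_diff = lambda T: max(abs(T[i + 1] - 2 * T[i] + T[i - 1]) for i in range(1, len(T) - 1))
```

Here `second_diff(T_gauss)` vanishes up to floating-point noise, while `second_diff(T_exp)` does not, matching the observed gap in tightness between Figure 3a,d and Figure 3c,f.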
In Figure 7, we plot the dimensional behavior of the Sinkhorn distance between isotropic Gaussians. Different curves correspond to pairs of Gaussian distributions in different dimensions, where each pair has the same Wasserstein distance. It can be seen that the information constraint causes more smoothing in higher dimensions, which is consistent with Corollary 2.
Figure 2. Numerical simulations and bounds via (16) for different $R$. (a) $P_X \sim \mathcal{N}(0, \frac{1}{25})$ and $P_Y \sim \mathcal{N}(0, \frac{1}{100})$. (b) $P_X \sim \mathcal{N}(0, \frac{1}{25})$ and $P_Y \sim \mathrm{Cauchy}(0, 10)$. (c) $dP_X = e^{-V}$ with $V = (x/5)^2/2 + |x/10| + e^{|x/10|} + k$, $k \in \mathbb{R}$, and $P_Y \sim \mathcal{N}(0, \frac{1}{25})$.
Figure 3. Numerical simulations and bounds via (21) for different $R$. (a) $P_X \sim \mathcal{N}(0, \frac{1}{25})$ and $P_Y \sim \mathrm{Cauchy}(0, 10)$. (b) $P_X \sim \mathcal{N}(0, \frac{1}{25})$ and $P_Y \sim \chi^2(6)$. (c) $P_X \sim \mathcal{N}(0, \frac{1}{25})$ and $P_Y \sim \mathrm{Exp}(0.2)$. (d) $P_X \sim \mathcal{N}(0, \frac{1}{25})$ and $P_Y \sim \mathrm{Laplace}(0, 5)$. (e) $P_X \sim \mathcal{N}(0, \frac{1}{25})$ and $P_Y$ is a Gamma distribution with $\alpha = 2$ and $\beta = 0.2$. (f) $P_X \sim \mathcal{N}(0, \frac{1}{25})$ and $P_Y \sim \frac{1}{2}\mathcal{N}(-20, \frac{1}{25}) + \frac{1}{2}\mathcal{N}(20, \frac{1}{25})$.
Figure 4. Probability densities of $dP_X = e^{-V}$ with $V = (x/5)^2/2 + (x/5)^4 + k$, $k \in \mathbb{R}$, and $P_Y \sim \mathcal{N}(0, \frac{1}{25})$.
Figure 5. Numerical simulations and bounds for different $R$, with $d\mu = e^{-V}$, $V = (x/5)^2/2 + (x/5)^4 + k$, $k \in \mathbb{R}$, and $d\nu \sim \mathcal{N}(0, \frac{1}{25})$. (a) Bound via (16) with $dP_X = d\mu$, $dP_Y = d\nu$. (b) Bound via (21) with $dP_X = d\nu$, $dP_Y = d\mu$.
Figure 6. Numerical simulations and bounds for different $R$, with $d\mu = e^{-V}$, $V = (x/5)^2/2 + (x/5)^8 + k$, $k \in \mathbb{R}$, and $d\nu \sim \mathcal{N}(0, \frac{1}{25})$. (a) Bound via (16) with $dP_X = d\mu$, $dP_Y = d\nu$. (b) Bound via (21) with $dP_X = d\nu$, $dP_Y = d\mu$.
Figure 7. Sinkhorn distances between isotropic Gaussians in different dimensions.

5. Conclusions and Future Directions

In this paper, we considered a generalization of OT with an entropic constraint. We showed that the constraint introduces uncertainty, and that this uncertainty can be captured by the EPI. We first derived an HWI-type inequality for the Sinkhorn distance. Then, we derived two Talagrand-type inequalities. Because of the strong geometric implications of the Talagrand inequality, each of these Talagrand-type inequalities also yields a (weaker) measure concentration inequality. From this result, we concluded that the geometry implied by the Sinkhorn distance is smoothed out by the entropic constraint. We also showed that our results generalize to a conditional version of the entropic OT inequality.
However, two factors in the derived inequalities remain unknown, namely the optimal value of the term $C$ in Theorem 3 and the error term $\epsilon$ in Theorem 4 beyond the linear case. Although we showed that a suboptimal $C$ can be computed from an arbitrary linear combination of two random vectors, the optimal value of $C$ is an intriguing open question. We believe that improving the term $C$ may be related to the Fisher information: without the assumption of strong log-concavity, an extra relative Fisher information term is required to upper-bound the Wasserstein distance in Theorem 2, and the reverse EPI in [31] also involves the Fisher information. If one considers the evolution of the Fisher information along the Schrödinger bridge, a better estimate of the term $C$ may be feasible.

Author Contributions

Conceptualization, P.A.S. and S.W.; methodology, S.W. and P.A.S.; software, S.W.; validation, P.A.S. and S.W.; formal analysis, S.W.; investigation, S.W.; resources, M.S.; writing—original draft preparation, S.W. and P.A.S.; writing—review and editing, P.A.S.; supervision, P.A.S.; project administration, P.A.S.; funding acquisition, M.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Swedish Research Council (VR), grant number 2019-03606.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
OT: optimal transport
EPI: entropy power inequality
SP: Schrödinger problem
GAN: generative adversarial network
RHS: right-hand side
a.e.: almost everywhere
POT: Python Optimal Transport

Appendix A. Proof of Theorem 2

For a $C^2$ continuous function $V: \mathbb{R}^n \to \mathbb{R}$ with $D^2 V \succeq \lambda I_n$, by the Taylor formula ([4] Lemma 2.5), there exists a $t \in [0, 1]$ satisfying
$V(y) - V(x) = \nabla V(x)\cdot(y - x) + (y - x)\cdot D^2 V(tx + (1 - t)y)(y - x)/2 \ge \nabla V(x)\cdot(y - x) + \frac{\lambda}{2}\|y - x\|^2.$ (A1)
Hence, we can bound the second-order cost by
$\frac{\lambda}{2}\int_{\mathcal{X}\times\mathcal{Y}} \|y - x\|^2\, dP \le \int_{\mathcal{X}\times\mathcal{Y}} \big( V(y) - V(x) - \nabla V(x)\cdot(y - x) \big)\, dP.$ (A2)
Because entropic OT is a minimization problem, we can take any coupling in $\Pi(P_X, P_Y; R)$ to upper-bound $W_2(P_X, P_Y; R)$. We take a linear combination $Y = Y_1 + Y_2$, where $Y_1$ and $Y_2$ are independent, $h(Y) - h(Y_2) \le R$ and $E[Y_2] = 0$. Assume that there is a Brenier map between $Y_1$ and $X$, i.e., $Y_1 = \nabla\varphi(X)$, which always exists according to Theorem A1 (see Appendix F). Then, we can see that this special case is in $\Pi(P_X, P_Y; R)$, namely,
$I(X; Y) = h(Y) - h(Y|X) = h(Y) - h(Y_1 + Y_2 | X) = h(Y) - h(Y_2) \le R.$
Let $d\mu = e^{-V}\,dx$, where $V: \mathbb{R}^n \to \mathbb{R}$ is $C^2$ continuous and $D^2 V \succeq \lambda I_n$. In order to bound the Sinkhorn distance, we simply need to bound $\int_{\mathcal{X}\times\mathcal{Y}} -\nabla V(x)\cdot(y - x)\, dP$, according to (A2). This term can be bounded as follows:
$\int_{\mathcal{X}\times\mathcal{Y}} -\nabla V(x)\cdot(y - x)\, dP = -\iint \nabla V(x)\cdot(\nabla\varphi(x) + y_2 - x)\, dP_{Y_2}\, dP_X = -\int \nabla V(x)\cdot(\nabla\varphi(x) - x)\, dP_X = -\int \nabla V(x)\cdot(\nabla\varphi(x) - x)\, \frac{dP_X}{d\mu}\, d\mu \le n - \int \Delta\varphi(x)\, f\, d\mu - \int (\nabla\varphi(x) - x)\cdot\nabla f\, d\mu \le n - n\, e^{\frac{1}{n}(h(Y_1) - h(X))} + W_2(P_X, P_{Y_1})\cdot\sqrt{I(P_X|\mu)},$ (A3)
where we take the Radon–Nikodym derivative $f = \frac{dP_X}{d\mu}$ in (A3) and apply Lemma A1 in Appendix F. This completes the derivation.

Appendix B. Proof of Corollary 2

Let $\mu_A = \frac{1_A}{\mu(A)}\mu$ and $\mu_{A_r} = \frac{1_{A_r}}{\mu(A_r)}\mu$ be the conditional probability measures of $\mu$ restricted to $A$ and $A_r$, respectively. Using the triangle inequality of $W_2$, we have
$r \le W_2(\mu_A, \mu_{A_r}; R) \le W_2(\mu_A, \mu) + W_2(\mu, \mu_{A_r}; R).$ (A4)
From (8), we know that
$W_2(\mu_A, \mu) \le \sqrt{2\lambda^{-1} D(\mu_A \| \mu)} = \sqrt{2\lambda^{-1}\log(1/\mu(A))} =: c_A.$
Let $\mu$ and $\mu_{A_r}$ be the probability measures of two random vectors $X$ and $Y$, respectively. By substituting $c_A$ into (A4) and using (16), we have
$(r - c_A)^2 \le W_2^2(\mu, \mu_{A_r}; R) \le \frac{2}{\lambda}\Big( E[V(Y)] - E[V(X)] + n - n\, C\, e^{\frac{1}{n}(h(Y) - h(X))} \Big) = \frac{2}{\lambda}\Big( E[V(Y)] - E[V(X)] + n - n\, C\, e^{\frac{1}{n}(E[V(Y)] - E[V(X)] - D(\mu_{A_r}\|\mu))} \Big).$
Substituting $\mu_{A_r} = \frac{1_{A_r}}{\mu(A_r)}\mu$ into the definition of relative entropy, we have
$D(\mu_{A_r} \| \mu) = \int_{\mathbb{R}^n} \log\Big(\frac{1_{A_r}}{\mu(A_r)}\Big)\, \frac{1_{A_r}}{\mu(A_r)}\, d\mu = -\frac{\log \mu(A_r)}{\mu(A_r)} \int_{A_r} d\mu = -\log \mu(A_r).$
Then, we obtain, for $r \ge c_A$,
$\mu(A_r) \le C^{-n}\, e^{E[V(X)] - E[V(Y)]} \Big( 1 + \frac{1}{n}\Big( E[V(Y)] - E[V(X)] - \frac{\lambda}{2}(r - c_A)^2 \Big) \Big)^n.$ (A5)
(A5) already indicates a concentration of measure. Using the inequality $(1 + u/n)^n \le e^u$, it can be further shown that (A5) also implies normal concentration, as follows:
$\mu(A_r) \le C^{-n} \cdot e^{-\frac{\lambda}{2}(r - c_A)^2}.$
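The last step can be sanity-checked numerically; the elementary inequality $(1 + u/n)^n \le e^u$ holds whenever $1 + u/n \ge 0$:

```python
import math

# The passage from (A5) to the normal-concentration form only uses
# (1 + u/n)^n <= e^u, valid whenever 1 + u/n >= 0.  Spot-check it on a grid.
for n in (1, 2, 10, 100):
    for u in (-0.9 * n, -1.0, 0.0, 0.5, 3.0, 50.0):
        if 1 + u / n >= 0:
            assert (1 + u / n) ** n <= math.exp(u) + 1e-12
```

This is the only ingredient needed beyond (A5) itself, with $u = E[V(Y)] - E[V(X)] - \frac{\lambda}{2}(r - c_A)^2$.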

Appendix C. Proof of the Dimensionality of (20)

To prove that (20) becomes weaker as the dimension increases, we only need to show that $C^n$ is decreasing with respect to $n$. When we increase the dimension, for given random vectors $Y_1 + Y_2 = Y$, we assume that this convolution relation still holds and that their entropies increase proportionally to $n$, i.e., the shapes of their distributions do not change.
Since $n$ is now a variable of $C$, we let $n \in \mathbb{R}$ and regard $C$ as a function of $\frac{R}{n}$, in order to compute the partial derivative. We first show that $C$ is non-decreasing with respect to $\frac{R}{n}$. We know that $C = e^{\frac{h(Y_1)}{n} - \frac{h(Y)}{n}}$ subject to $Y_1 + Y_2 = Y$ and $\frac{h(Y)}{n} - \frac{h(Y_2)}{n} \le \frac{R}{n}$. Thus, for a larger value of $\frac{R}{n}$, there exists an $\frac{h(Y_2)}{n}$ that is non-increasing, which in turn allows a non-decreasing $\frac{h(Y_1)}{n}$. Therefore, there exists a $C(\cdot, R + \Delta R)$ that is not smaller than $C(\cdot, R)$ for any $\Delta R > 0$, i.e., $C$ is non-decreasing with respect to $\frac{R}{n}$.
Then, we take the logarithm of $C^n$ and compute the partial derivative with respect to $n$:
$\frac{\partial}{\partial n}\Big( n \cdot \log C\big(\tfrac{R}{n}\big) \Big) = \log C\big(\tfrac{R}{n}\big) + n\, \frac{\partial_n C(\frac{R}{n})}{C(\frac{R}{n})} = \log C\big(\tfrac{R}{n}\big) + n\, \frac{C'}{C} \cdot \Big( -\frac{R}{n^2} \Big).$ (A6)
We know that $C \in [0, 1]$ and that $C$ is non-decreasing, i.e., $C' \ge 0$. Thus, (A6) is negative when $C \in (0, 1)$. This completes the proof that $C^n$ is decreasing with respect to $n$.
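For a concrete instance, take the Gaussian value $C = \sqrt{1 - e^{-\frac{2}{n}R}}$ (an assumed example; the argument above is general). Then $C^n = (1 - e^{-2R/n})^{n/2}$, and its decrease in $n$, for fixed $R$, can be checked directly:

```python
import math

# Gaussian case: C(R/n) = sqrt(1 - exp(-2R/n)), so the prefactor in (20) is
# C^{-n} = (1 - exp(-2R/n))^{-n/2}.  Check that C^n decreases with n
# (equivalently, the prefactor C^{-n} grows), for a fixed constraint R.
R = 2.0
Cn = [(1 - math.exp(-2 * R / n)) ** (n / 2) for n in range(1, 30)]
assert all(a > b for a, b in zip(Cn, Cn[1:]))   # C^n strictly decreasing in n
```

This matches the dimensional smoothing observed in Figure 7: the concentration prefactor $C^{-n}$ in (20) grows with the dimension.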

Appendix D. Proof of Theorem 4

Let $X'$ be a copy of $X$. Then $X$ can be written as a linear combination $X = tX' + X_2$, where $X_2$ is zero-mean and independent of $X'$, $h(X) - h(X_2) \le R$, $t \in [0, 1]$. Assume that there exists a Brenier map $Y = \nabla\varphi(X')$. Similar to the proof of Theorem 3, this coupling is also in $\Pi(P_X, P_Y; R)$. Then, we have
$\int_{\mathcal{X}\times\mathcal{Y}} -\nabla V(x)\cdot(y - x)\, dP = n - \iint \nabla V(tx' + x_2)\cdot\nabla\varphi(x')\, dP_{X'}\, dP_{X_2} = n - \iint \big( \nabla V(tx' + x_2) - t\cdot\nabla V(x') \big)\cdot\nabla\varphi(x')\, dP_{X'}\, dP_{X_2} - t\int \nabla V(x')\cdot\nabla\varphi(x')\, dP_{X'}$ (A7)
$= n - \iint \big( \nabla V(tx' + x_2) - t\cdot\nabla V(x') \big)\cdot\nabla\varphi(x')\, dP_{X'}\, dP_{X_2} - t\int \Delta\varphi\, dP_{X'}$ (A8)
$\le n - t\cdot n\, e^{\frac{1}{n}(h(Y) - h(X))} + \epsilon,$ (A9)
where we use Lemma A2 in (A7) and (A8). In (A9), we let $\epsilon = -\iint \big( \nabla V(tx' + x_2) - t\cdot\nabla V(x') \big)\cdot\nabla\varphi(x')\, dP_{X'}\, dP_{X_2}$ and apply (A15). After changing the order of integration, we can see that $\int \nabla V(tx' + x_2)\, dP_{X_2}$ is a smoothed version of $\nabla V(tx')$. When $\nabla V$ is a linear function perturbed by zero-mean noise, i.e., $\nabla V(tx' + X_2) = t\cdot\nabla V(x') + W$ with $E[W] = 0$, the integral over $x_2$ cancels out and $\epsilon = 0$. By taking $C_x(P_X, R) = t$, we complete the proof.

Appendix E. Proof of Theorem 5

According to Theorem 3, for every $z_0 \in \mathcal{Z}$, there exists a coupling $P_{X,Y|Z=z_0} \in \Pi(P_{X|Z=z_0}, P_{Y|Z=z_0}; R)$ satisfying $I(X; Y|Z=z_0) \le R$ and
$\frac{\lambda}{2} E[\|X - Y\|^2 \,|\, Z = z_0] \le E[V(Y|Z=z_0)] - E[V(X)] + n - n\, C(P_{Y|Z=z_0}, R)\, e^{\frac{1}{n}(h(Y|Z=z_0) - h(X))} \le E[V(Y|Z=z_0)] - E[V(X)] + n - n\, C(P_{Y|Z}, R)\, e^{\frac{1}{n}(h(Y|Z=z_0) - h(X))}.$ (A10)
The conditional mutual information of the resulting distribution $P_{X,Y|Z}$ can be bounded in this case as follows:
$I(X; Y|Z) = E_{P_Z}[I(X; Y|Z=z_0)] \le E_{P_Z}[R] = R.$
Therefore, this $P_{X,Y|Z}$ yields the following estimate:
$\frac{\lambda}{2} W_2^2(P_{X|Z}, P_{Y|Z} | P_Z; R) \le \frac{\lambda}{2} E_{P_Z}\big\{ E[\|X - Y\|^2 \,|\, Z = z_0] \big\}$ (A11)
$\le E[V(Y|Z)] - E[V(X)] + n - n\, C(P_{Y|Z}, R)\, E_{P_Z}\big[ e^{\frac{1}{n}(h(Y|Z=z_0) - h(X))} \big]$ (A12)
$\le E[V(Y|Z)] - E[V(X)] + n - n\, C(P_{Y|Z}, R)\, e^{\frac{1}{n}(E_{P_Z}[h(Y|Z=z_0)] - h(X))} = E[V(Y|Z)] - E[V(X)] + n - n\, C(P_{Y|Z}, R)\, e^{\frac{1}{n}(h(Y|Z) - h(X))},$ (A13)
where (A11) follows from definition (7), (A12) follows from (A10), and (A13) follows from Jensen's inequality. This completes the proof.

Appendix F. Background Material

Theorem A1
(Brenier’s, Theorem 2.12 [48]). Let P X P ( X ) , P Y P ( Y ) with X R n , Y R n and assume that d P X , d P Y both have finite second moments. If P X does not give mass to small sets, then, for the Kantorovich problem with cost c ( x , y ) = 1 2 x y 2 , ! φ : X R gives the optimal coupling
P = ( I d × φ ) # P X ,
where φ is convex. φ is called the Brenier map.
Lemma A1
(Theorem 9.17 [48]). Let $d\mu = e^{-V}\,dx$. Let $f = \frac{dP_X}{d\mu}$ be the Radon–Nikodym derivative between the measures $P_X$ and $\mu$. Let $\nabla\varphi$ be a Brenier map. We have
$\int \nabla V(x)\cdot(\nabla\varphi(x) - x)\, f(x)\, d\mu(x) \ge \int \big[ (\Delta\varphi - n) f + (\nabla\varphi - x)\cdot\nabla f \big]\, d\mu = \int \Delta\varphi\, f\, d\mu - n + \int (\nabla\varphi - x)\cdot\nabla f\, d\mu,$ (A14)
where $\nabla\varphi(x) - x$ is called the displacement. For the first term of (A14), because $\varphi$ is convex, from ([4] Lemma 2.6), we have
$\int \frac{\Delta\varphi}{n}\, dP_X \ge e^{\frac{1}{n}(h(\nabla\varphi(X)) - h(X))}.$ (A15)
Moreover, the last term of (A14) can be bounded using the Cauchy–Schwarz inequality as follows:
$\Big| \int (\nabla\varphi - x)\cdot\nabla f\, d\mu \Big| \le \Big( \int \|\nabla\varphi - x\|^2 f\, d\mu \Big)^{1/2} \Big( \int \frac{\|\nabla f\|^2}{f}\, d\mu \Big)^{1/2} = W_2(P_X, \nabla\varphi_{\#} P_X) \cdot \sqrt{I(P_X|\mu)}.$ (A16)
Lemma A2
(Fact 7 [3]). For any $\varphi \in L^1(\mathcal{X}) \cap L^2(\mathcal{X})$ on a Polish space $(\mathcal{X}, \mu)$ with $d\mu = e^{-V}\,dx$, we have
$\int \Delta\varphi\, d\mu = \int \nabla\varphi \cdot \nabla V\, d\mu.$
Definition A1
(Lower Semi-Continuity, Appendix B, p. 602 [49]). Given a metric space $\mathcal{X}$, $f: \mathcal{X} \to \mathbb{R} \cup \{+\infty\}$ is lower semi-continuous if, for every convergent sequence $\{x_n\}$ with $x_n \to x$, we have $f(x) \le \liminf_{n\to\infty} f(x_n)$, where $\liminf_{n\to\infty} x_n := \lim_{n\to\infty}\big( \inf_{m \ge n} x_m \big)$.
Definition A2
(Log-Concavity, Definition 2.1 [50]). A density function $f$ with respect to the Lebesgue measure on $(\mathbb{R}^n, \mathcal{B}^n)$ is log-concave if $f = e^{-V}$, where $V$ is a convex function.
Definition A3
(Strong Log-Concavity, Definition 2.8 [50]). A density function $f$ is called strongly log-concave if it has the form
$f(x) = g(x)\,\varphi(x),$
where $g$ is log-concave and $\varphi$ is the density of $\mathcal{N}(\mu, \Sigma)$ for some $\mu$ and $\Sigma$.

References

1. Talagrand, M. Transportation cost for Gaussian and other product measures. Geom. Funct. Anal. 1996, 6, 587–600.
2. Bakry, D.; Bolley, F.; Gentil, I. Dimension dependent hypercontractivity for Gaussian kernels. Probab. Theory Relat. Fields 2012, 154, 845–874.
3. Cordero-Erausquin, D. Transport inequalities for log-concave measures, quantitative forms, and applications. Can. J. Math. 2017, 69, 481–501.
4. Bolley, F.; Gentil, I.; Guillin, A. Dimensional improvements of the logarithmic Sobolev, Talagrand and Brascamp–Lieb inequalities. Ann. Probab. 2018, 46, 261–301.
5. Raginsky, M.; Sason, I. Concentration of Measure Inequalities in Information Theory, Communications and Coding. In Foundations and Trends in Communications and Information Theory; NOW Publishers: Boston, MA, USA, 2018.
6. Zhang, R.; Chen, C.; Li, C.; Carin, L. Policy Optimization as Wasserstein Gradient Flows. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 5737–5746.
7. Montavon, G.; Müller, K.R.; Cuturi, M. Wasserstein Training of Restricted Boltzmann Machines. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 3718–3726.
8. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein Generative Adversarial Networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 214–223.
9. Rigollet, P.; Weed, J. Uncoupled isotonic regression via minimum Wasserstein deconvolution. Inf. Inference 2019, 8, 691–717.
10. Cuturi, M. Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances. Adv. Neural Inf. Process. Syst. 2013, 26, 2292–2300.
11. Wang, S.; Stavrou, P.A.; Skoglund, M. Generalized Talagrand Inequality for Sinkhorn Distance using Entropy Power Inequality. In Proceedings of the 2021 IEEE Information Theory Workshop (ITW), Kanazawa, Japan, 17–21 October 2021; pp. 1–6.
12. Benamou, J.D.; Brenier, Y. A computational fluid mechanics solution to the Monge–Kantorovich mass transfer problem. Numer. Math. 2000, 84, 375–393.
13. Villani, C. Optimal Transport: Old and New; Springer: Norwell, MA, USA, 2008; Volume 338.
14. Schrödinger, E. Über die Umkehrung der Naturgesetze; Verlag der Akademie der Wissenschaften in Kommission bei Walter De Gruyter: Berlin, Germany, 1931.
15. Léonard, C. A survey of the Schrödinger problem and some of its connections with optimal transport. Discret. Contin. Dyn. Syst. 2014, 34, 1533–1574.
16. Chen, Y.; Georgiou, T.T.; Pavon, M. On the relation between optimal transport and Schrödinger bridges: A stochastic control viewpoint. J. Optim. Theory Appl. 2016, 169, 671–691.
17. Chen, Y.; Georgiou, T.T.; Pavon, M. Optimal transport over a linear dynamical system. IEEE Trans. Autom. Control 2016, 62, 2137–2152.
18. Conforti, G. A second order equation for Schrödinger bridges with applications to the hot gas experiment and entropic transportation cost. Probab. Theory Relat. Fields 2019, 174, 1–47.
19. Conforti, G.; Ripani, L. Around the entropic Talagrand inequality. Bernoulli 2020, 26, 1431–1452.
20. Bai, Y.; Wu, X.; Özgür, A. Information Constrained Optimal Transport: From Talagrand, to Marton, to Cover. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA, 21–26 June 2020; pp. 2210–2215.
21. Rigollet, P.; Weed, J. Entropic optimal transport is maximum-likelihood deconvolution. C. R. Math. 2018, 356, 1228–1235.
22. Mena, G.; Niles-Weed, J. Statistical bounds for entropic optimal transport: Sample complexity and the central limit theorem. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2019; Volume 32.
23. Genevay, A.; Chizat, L.; Bach, F.; Cuturi, M.; Peyré, G. Sample Complexity of Sinkhorn Divergences. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, Naha, Japan, 16–18 April 2019; pp. 1574–1583.
24. Reshetova, D.; Bai, Y.; Wu, X.; Özgür, A. Understanding Entropic Regularization in GANs. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Victoria, Australia, 11–16 July 2021; pp. 825–830.
25. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
26. Stam, A.J. Some inequalities satisfied by the quantities of information of Fisher and Shannon. Inf. Control 1959, 2, 101–112.
27. Rioul, O. Information theoretic proofs of entropy power inequalities. IEEE Trans. Inf. Theory 2010, 57, 33–55.
28. Courtade, T.A.; Fathi, M.; Pananjady, A. Quantitative stability of the entropy power inequality. IEEE Trans. Inf. Theory 2018, 64, 5691–5703.
29. Bobkov, S.; Madiman, M. Reverse Brunn–Minkowski and reverse entropy power inequalities for convex measures. J. Funct. Anal. 2012, 262, 3309–3339.
30. Bobkov, S.G.; Madiman, M.M. On the problem of reversibility of the entropy power inequality. In Limit Theorems in Probability, Statistics and Number Theory; Springer: Norwell, MA, USA, 2013; pp. 61–74.
31. Courtade, T.A. A strong entropy power inequality. IEEE Trans. Inf. Theory 2017, 64, 2173–2192.
32. Cover, T.M. Elements of Information Theory; John Wiley & Sons: New York, NY, USA, 1999.
33. Tamanini, L. A generalization of Costa's Entropy Power Inequality. arXiv 2020, arXiv:2012.12230.
34. Monge, G. Mémoire sur la théorie des déblais et des remblais. In Histoire de l'Académie Royale des Sciences de Paris; De l'Imprimerie Royale: Paris, France, 1781.
35. Kantorovich, L.V. On the translocation of masses. J. Math. Sci. 2006, 133, 1381–1382.
36. Kantorovich, L.V. On a Problem of Monge. J. Math. Sci. 2006, 133, 1383.
37. Dupuis, P.; Ellis, R.S. A Weak Convergence Approach to the Theory of Large Deviations; Wiley Series in Probability and Statistics; Wiley-Interscience: Hoboken, NJ, USA, 2011.
38. Luenberger, D.G. Optimization by Vector Space Methods; John Wiley & Sons: New York, NY, USA, 1997.
39. Blower, G. The Gaussian isoperimetric inequality and transportation. Positivity 2003, 7, 203–224.
40. Otto, F.; Villani, C. Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. J. Funct. Anal. 2000, 173, 361–400.
41. Bakry, D.; Ledoux, M. A logarithmic Sobolev form of the Li–Yau parabolic inequality. Rev. Matemática Iberoam. 2006, 22, 683–702.
42. Masry, E. Multivariate probability density deconvolution for stationary random processes. IEEE Trans. Inf. Theory 1991, 37, 1105–1115.
43. Stefanski, L.A.; Carroll, R.J. Deconvolving kernel density estimators. Statistics 1990, 21, 169–184.
44. Fan, J. On the optimal rates of convergence for nonparametric deconvolution problems. Ann. Stat. 1991, 19, 1257–1272.
45. Janati, H.; Muzellec, B.; Peyré, G.; Cuturi, M. Entropic optimal transport between unbalanced Gaussian measures has a closed form. Adv. Neural Inf. Process. Syst. 2020, 33, 10468–10479.
46. Marton, K. A measure concentration inequality for contracting Markov chains. Geom. Funct. Anal. 1996, 6, 556–571.
47. Flamary, R.; Courty, N.; Gramfort, A.; Alaya, M.Z.; Boisbunon, A.; Chambon, S.; Chapel, L.; Corenflos, A.; Fatras, K.; Fournier, N.; et al. POT: Python Optimal Transport. J. Mach. Learn. Res. 2021, 22, 1–8.
48. Villani, C. Topics in Optimal Transportation; Number 58; American Mathematical Society: Providence, RI, USA, 2003.
49. Puterman, M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming; John Wiley & Sons: New York, NY, USA, 2014.
50. Saumard, A.; Wellner, J.A. Log-concavity and strong log-concavity: A review. Stat. Surv. 2014, 8, 45–114.

Share and Cite

MDPI and ACS Style

Wang, S.; Stavrou, P.A.; Skoglund, M. Generalizations of Talagrand Inequality for Sinkhorn Distance Using Entropy Power Inequality. Entropy 2022, 24, 306. https://0-doi-org.brum.beds.ac.uk/10.3390/e24020306
