0% found this document useful (0 votes)

39 views35 pages

Quantitative Stability of Regularized Optimal Transport

Uploaded by

mymnaka82125

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

39 views35 pages

Quantitative Stability of Regularized Optimal Transport

Uploaded by

mymnaka82125

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 35

Quantitative Stability of Regularized Optimal

Transport and Convergence of Sinkhorn’s

Algorithm
Stephan Eckstein∗ Marcel Nutz†

June 8, 2022

Abstract
We study the stability of entropically regularized optimal trans-
port with respect to the marginals. Lipschitz continuity of the value
and Hölder continuity of the optimal coupling in p-Wasserstein dis-
tance are obtained under general conditions including quadratic costs
and unbounded marginals. The results for the value extend to reg-
ularization by an arbitrary divergence. As an application, we show
convergence of Sinkhorn’s algorithm in Wasserstein sense, including
for quadratic cost. Two techniques are presented: The first compares
an optimal coupling with its so-called shadow, a coupling induced on
other marginals by an explicit construction. The second transforms
one set of marginals by a change of coordinates and thus reduces the
comparison of differing marginals to the comparison of differing cost
functions under the same marginals.

Keywords Entropic Optimal Transport; Stability; Sinkhorn’s Algorithm; IPFP

AMS 2010 Subject Classification 90C25; 49N05

1 Introduction
Following advances allowing for computation in high dimensions, applica-
tions of optimal transport are thriving in areas such as machine learning,
statistics, image and language processing (e.g., [4, 15, 50, 3]). Regularization
plays a key role in enabling efficient algorithms with provable convergence;
∗
Department of Mathematics, ETH Zurich, seckstein@ethz.ch. Research supported by
Landesforschungsförderung Hamburg under project LD-SODA. SE thanks Daniel Bartl,
Mathias Beiglböck and Gudmund Pammer for fruitful discussions and helpful comments.
†
Departments of Statistics and Mathematics, Columbia University,
mnutz@columbia.edu. Research supported by an Alfred P. Sloan Fellowship and
NSF Grants DMS-1812661, DMS-2106056. MN is grateful to Guillaume Carlier, Giovanni
Conforti, Flavien Léger and Luca Tamanini for their kind hospitality and advice.

1
see [48] for a recent monograph with numerous references. Popularized in
this context by [20], entropic regularization is the most popular choice as
it allows for Sinkhorn’s algorithm (iterative proportional fitting procedure)
that can be implemented at large scale using parallel computing and is an-
alytically tractable. The entropically regularized transport problem can be
formulated as
Z
ε
Sent (µ1 , µ2 , c) = inf c(x, y) π(dx, dy) + εDKL (π, µ1 ⊗ µ2 ). (1.1)
π∈Π(µ1 ,µ2 )

Here Π(µ1 , µ2 ) is the set of couplings of the given marginals µ1 , µ2 and

DKL (·, µ1 ⊗ µ2 ) is the Kullback–Leibler divergence relative to the product
measure µ1 ⊗µ2 . Moreover, ε > 0 is a regularization parameter and c is a cost
function; the most important example is quadratic cost kx − yk2 on Rd × Rd .
The basic idea is to solve (1.1) for small ε > 0 to obtain an approximation
of the (unregularized) optimal transport problem that corresponds to ε = 0.
Starting with [16, 42, 43] and followed by [14, 37], the convergence as ε → 0
has been studied in detail and remains a very active area of investigation;
see for instance [2, 5, 6, 7, 17, 34, 45, 47, 53].
The entropic optimal transport problem (1.1) is also of its own interest.
On the one hand, it is equivalent to a static formulation of the Schrödinger
bridge problem that has a long history in physics (see [27, 38] for surveys);
the dynamic Schrödinger bridge can be constructed by solving the static
problem and combining it with a Brownian bridge. On the other hand, ap-
plied researchers have started to exploit numerous benefits resulting from
entropic regularization, such as smoothness, existence of a gradient for gra-
dient descent, improved sampling complexity (e.g., [18, 21, 30, 31]), among
many others. Thus, the regularization is increasingly seen as an advantage
rather than an approximation error; notions such as Sinkhorn divergence
[32, 49] have become tools of their own right. We note that as long as ε > 0
is fixed, we can assume without loss of generality that ε = 1, simply divid-
ing (1.1) by ε and using the cost function c/ε. Hence, we shall drop ε from
the formulation in our results.
The main objective of the present study is to establish and quantify
the stability of the value Sent and its optimal coupling π ∗ with respect to
the input marginals µ1 and µ2 , or more generally µ1 , . . . , µN in the multi-
marginal setting. Distances will be quantified by Wasserstein distance Wp ,
thus allowing for comparison of measures with different supports, discrete
and continuous measures, etc. We aim for results including unbounded
marginals, replacing compactness by suitable integrability conditions such as
the subgaussian tails in [41]. Schrödinger bridges are one application where

2
unbounded supports are very natural, as the Brownian dynamics produce
unbounded intermediate marginals even if the boundary data are bounded.
In this context, costs are usually quadratic, so that unbounded and non-
Lipschitz cost functions are necessary. Even in applications with bounded
costs, one may be interested in estimates with constants that do not depend
on kck∞ , especially not exponentially.
To the best of our knowledge, the first stability result for entropic op-
timal transport is due to [12]. Here, costs are uniformly bounded, and all
marginals are equivalent to a common reference measure (e.g., Lebesgue),
with densities uniformly bounded above and below. Within these families,
distances of measures can be quantified by the Lp norm of the difference of
their densities. The authors show that the Schrödinger potentials (i.e., the
dual entropic optimizers) are Lipschitz continuous relative to the marginals
in Lp , for p = 2 and p = ∞. This result is obtained by a differential ap-
proach establishing invertibility of the Schrödinger system. More recently,
[33] obtain the first result on stability in a general setting. Using a geometric
approach called cyclical invariance, continuity of optimizers is established in
the sense of weak convergence. The geometric method avoids integrability
conditions almost entirely and indeed remains valid even if the value of (1.1)
is infinite. On the other hand, the method relies on differentiation of mea-
sures which essentially forces the marginal spaces to be finite-dimensional.
More importantly, the continuity result is purely qualitative, and that is the
main difference with the present results. Most recently, and partly concur-
rently with the present study, a beautiful result of [22] establishes the uniform
stability of Sinkhorn’s algorithm with respect to the marginals, in a bounded
setting. As a consequence, the authors deduce Lipschitzianity in W1 of the
optimal couplings with respect to the marginals; the assumptions include
bounded Lipschitz costs and bounded spaces. The argument is based on
the Hilbert–Birkhoff projective metric which has also been used successfully
to show linear convergence of Sinkhorn’s algorithm [13, 29]. A crucial addi-
tional step accomplished in [22] is to pass from this metric to a more standard
norm on the potentials. The techniques involving the projective metric are
less probabilistic in nature, which may be one reason why it is wide open
how to relax the boundedness conditions. We remark that the initial result
of [12] also covered the multimarginal problem which has recently become
popular due to its role in the Wasserstein barycenter problem [1, 11]. At
least in the context of [10], it was observed that Hilbert–Birkhoff arguments
may not be equally successful beyond two marginals. Finally, we mention
the follow-up [46] on the continuity of the potentials in unbounded settings.
We apply our stability result to Sinkhorn’s algorithm for N = 2 marginals.

3
It is well known that each iterate π n of the algorithm solves an entropic
optimal transport problem between its own marginals, and moreover these
marginals converge to the given marginals µi . Thus, the convergence can be
seen as a particular instance of stability with respect to marginals and our
results apply. Sinkhorn’s algorithm has been studied over almost a century
(see [48] for numerous references); the most general convergence results in
this literature are due to [51]. While they treat costs that are merely mea-
surable and show π n → π ∗ in total variation, they do not cover unbounded
functions like the quadratic cost in most examples, especially when both
marginals have unbounded support. Applying stability results under regu-
larity of c turns out to be fruitful in this regard: we not only obtain the
convergence to the optimal value and π n → π ∗ in Wasserstein distance, but
even a rate of convergence. The conditions are sufficiently general to cover
quadratic cost with subgaussian marginals.

1.1 Synopsis
Our first result, detailed in Theorem 3.7, is the continuity of the value Sent
with respect to the marginals in p-Wasserstein distance under generic con-
ditions. If the cost c is a product of suitably integrable Lipschitz functions,
then Sent is also Lipschitz. This includes quadratic costs on Rd with pos-
sibly unbounded marginal supports. The proof is based on comparing the
optimizer π ∗ with the “shadow” coupling it induces on other marginals. The
shadow is a particular projection that we construct explicitly by gluing, con-
trolling both the distance to π ∗ and its divergence. The construction is simple
and flexible, thus potentially useful for other purposes. For instance, Theo-
rem 3.7 holds for a general class of optimal transport problems regularized by
a divergence Df as previously considered in [24], Kullback–Leibler divergence
is a particular case. Other divergences, especially quadratic, are being used
in some applications where entropic regularization performs poorly, usually
because non-equivalent optimizers are desired or weak penalization (small ε)
causes numerical instabilities, see [8, 25, 39]. Theoretical results are scarce
so far as these regularizations are less tractable.
By way of strong convexity, the continuity of the value Sent in The-
orem 3.7 leads to the continuity of the optimizer π ∗ with respect to the
marginals. Theorem 3.11 states a nonasymptotic inequality bounding the
distance of two entropic optimizers for different marginals in terms of the Wp
distance of the marginals. It shows in particular that the map (µ1 , . . . , µN ) 7→
π ∗ is 1/(2p)-Hölder in Wp . Exploiting a Pythagorean-type property of rel-
ative entropy to implement the strong convexity, we achieve an unbounded

4
setting requiring only a transport inequality; i.e., a control of Wasserstein
distance through entropy. This condition holds as soon as the marginals have
a finite exponential moment; in particular, the result covers quadratic costs
when marginals are σ 2 -subgaussian for some (arbitrarily small) σ. We re-
mark that Theorem 3.7 is the first quantitative stability result for unbounded
costs, and in settings without differentiation of measures as assumed in [33],
even the qualitative result alone would be novel.
One noteworthy feature of Theorem 3.11 is that the constants grow only
linearly in c, which is particularly important for the regularized transport
problem (1.1): here the effective cost function is c̃ := c/ε and ε is usually
small. Many results on entropic optimal transport feature constants depend-
ing exponentially on the cost, typically exp(kc̃k∞ ) or exp(kc̃k∞ + Lip c̃),
including all previous results on stability that we are aware of. Even for
well-behaved c on a fairly small domain, a choice like ε = .01 then leads
to constants far exceeding e100 , potentially a concern in practical considera-
tions.
Our second continuity result, Theorem 3.13, aims at improving the Hölder
exponent in Theorem 3.11 under the more restrictive condition that the cost c
is bounded (spaces may still be unbounded). For instance, we show 1/(p+1)-
Hölder continuity in Wp . More generally, Theorem 3.13 yields the Hölder
exponent p/(p + 1)q from Wp to Wq ; to wit, we can improve the exponent
by measuring the distance of the marginals in a stronger norm. In particu-
lar, p = ∞ leads to a Lipschitz result into W1 . This choice also eliminates
exponential dependence of the constant on the cost. In fact, we prove that
the Lipschitz constant is sharp in a nontrivial discrete example. This may
be surprising given that the idea of proof is somewhat circuitous and that
many estimates in this area are thought to be overly conservative.
Indeed, Theorem 3.13 is based on a novel approach that may be of inde-
pendent interest; the basic idea is to reduce the problem of differing marginals
to one of differing cost functions (under the same marginals). In the latter
problem, optimizers are measure-theoretically equivalent and comparable in
the sense of Kullback–Leibler divergence. Our starting point is the obser-
vation that the regularization in our problem depends only on the relative
density, but not on the geometry of the distributions. In the simplest case,
a Wp -optimal coupling of the differing marginals induces an invertible trans-
port map T that can be used as change of coordinates to achieve identical
marginals. The cost is transformed at the same time and we end up com-
paring c with c ◦ T . For this comparison, we can apply a separate result
(Proposition 3.12) based on an entropy calculation.
The application to Sinkhorn’s algorithm is summarized in Theorem 3.15

5
which states convergence of the entropic cost and of the Sinkhorn iterates π n
themselves. The qualitative and quantitative results follow from Theorem 3.7
and Theorem 3.11. In essence, the stability results turn a convergence rate
for the Sinkhorn marginals into a convergence rate for π n → π ∗ . We use the
sublinear rate for the marginals as obtained in [36]. As noted there, these
rates are likely suboptimal—for bounded cost functions, linear convergence
of Sinkhorn’s algorithm is well known [10, 13, 29]—our focus at this stage is
on having some quantitative control.
The organization of this paper is simple: Section 2 details the setting,
Section 3 presents the main results, and Section 4 contains the proofs.

2 Setting and Notation

Let (Y, dY ) be a Polish space and P(Y ) its set of Borel probability measures.
Given p ∈ [1, ∞), we R denote by Pp (Y ) the subset of measures µ with finite
p
p-th moment; i.e., dY (x, x̂) µ(dx) < ∞ for some (and then all) x̂ ∈ Y .
For p = ∞, we define P∞ (Y ) as the measures with bounded support. The
p-Wasserstein distance Wp (µ, ν) between µ, ν ∈ Pp (Y ) is defined via
Z
p
Wp (µ, ν) = inf dY (x, y)p π(dx, dy), p ∈ [1, ∞),
π∈Π(µ,ν)

W∞ (µ, ν) = inf ess sup dY (x, y),

π∈Π(µ,ν) (x,y)∼π

while kµ − νkT V = supA⊆Y Borel |µ(A) − ν(A)| is the total variation distance
of µ, ν ∈ P(Y ).
Fix N ∈ N and let (Xi , dXi ), i = 1, . . . , N be QPolish probability spaces
with measures µi ∈ P(Xi ). We denote by X = N i=1 Xi the product space
and write x ∈ X as x = (x1 , . . . , xN ). When p ∈ [1, ∞] is given, it will be
convenient to use on X the particular product metric
( P
N p 1/p ,

i=1 dXi (xi , yi ) p ∈ [1, ∞),
dX,p (x, y) :=
maxi=1,...,N dXi (xi , yi ), p = ∞.

Unless otherwise noted, p-Wasserstein distances on X are understood with

respect to dX,p . Similarly, the distance between two tuples of marginals will
often be quantified by
( P
N p 1/p ,

i=1 Wp (µi , µ̃i ) p ∈ [1, ∞),
Wp (µ1 , . . . , µN ; µ̃1 , . . . , µ̃N ) :=
maxi=1,...,N W∞ (µi , µ̃i ), p = ∞.

6
Given a Lipschitz function c : X → R, we denote by Lipp (c) its Lipschitz
constant with respect to dX,p .
For a strictly convex, lower bounded function f : R+ → R with f (1) = 0
and limx→∞ f (x)/x = ∞, the f -divergence Df (µ, ν) between probabilities
µ, ν on the same space is
Z
dµ
Df (µ, ν) := f dν for µ ν
dν
and Df (µ, ν) := ∞ for µ 6 ν. The main example of interest to us is the
Kullback–Leibler divergence (relative entropy) DKL (µ, ν) which corresponds
to the choice f (x) := x log x. We always assume that (µ, ν) 7→ Df (µ, ν) is
lower semicontinuous for weak convergence. This holds for DKL , and more
generally whenever Df has a suitable variational representation.
Given µi ∈ P(Xi ) and a continuous, nonnegative1 cost function c ∈
L1 (µ1 ⊗ · · · ⊗ µN ), we can now introduce the regularized transport problem
Z
S(µ1 , . . . , µN , c) = inf c dπ + Df (π, µ1 ⊗ · · · ⊗ µN ), (2.1)
π∈Π(µ1 ,...,µN )

where Π(µ1 , . . . , µN ) ⊂ P(X) denotes the set of couplings of the marginals µi .

Note that S(µ1 , . . . , µN , c) < ∞ by way of π := µ1 ⊗ · · · ⊗ µN . A standard
argument of compactness and strict convexity then shows that (2.1) admits
a unique optimizer π ∗ ∈ Π(µ1 , . . . , µN ). When p ∈ [1, ∞) is given, we always
assume that c has growth of order p,

|c(x)| ≤ C(1 + dX,p (x, x̂)p ) (2.2)

for some C > 0 and x̂ ∈ X, whereas for p = ∞ the meaning is that c

is bounded. For marginals µi ∈ Pp (Xi ), this ensures in particular that
c ∈ L1 (π) for any coupling π.
While some of our results below hold for general divergences, we use the
notation Sent in results specific to the entropic version, so that (2.1) becomes
Z
Sent (µ1 , . . . , µN , c) = inf c dπ + DKL (π, µ1 ⊗ · · · ⊗ µN ). (2.3)
π∈Π(µ1 ,...,µN )

Remark 2.1. A variation of (2.3) uses entropy relative to a reference mea-

sure P̂ different from the product of the marginals,
Z
inf c dπ + DKL (π, P̂ ), (2.4)
π∈Π(µ1 ,...,µN )

1
The lower bound is easily relaxed in view of the behavior of (2.1) under shifts of c.

7
for instance (normalized) Lebesgue measure for problems with absolutely
continuous marginals on Rd . Of course, a compatibility condition between P̂
and the marginals is necessary to guarantee that (2.4) is finite. As long as
P̂ = P̂1 ⊗ · · · ⊗ P̂N is a product measure, a standard computation shows that
the optimizer π ∗ of this problem is the same as the one of (2.3). Therefore,
our stability results for (2.3) carry over to (2.4).

3 Results
3.1 Shadows and Preliminaries
Given π ∈ Π(µ1 , . . . , µN ), we introduce a coupling π̃ ∈ Π(µ̃1 , . . . , µ̃N ) of
different marginals through a gluing construction. Intuitively, for N = 2,
the transport π̃ is obtained by concatenating three transports: move µ̃1 to
µ1 using a Wp -optimal transport, then follow the transport π moving µ1 into
µ2 , and finally move µ2 to µ̃2 using a Wp -optimal transport. We think of π̃
as a coupling of µ̃1 , µ̃2 that “shadows” π ∈ Π(µ1 , µ2 ) as closely as possible
given the differing marginals. The formal definition reads as follows.

Definition 3.1 (Shadow). Let p ∈ [1, ∞] and µi , µ̃i ∈ Pp (Xi ), i = 1, . . . , N .

Let κi ∈ Π(µi , µ̃i ) be a coupling attaining Wp (µi , µ̃i ) and κi = µi ⊗ Ki a
disintegration. Given π ∈ Π(µ1 , . . . , µN ), its shadow π̃ ∈ Π(µ̃1 , . . . , µ̃N ) is
defined as the second marginal of π ⊗ K ∈ P(X × X), where the kernel
K : X → P(X) is defined as K(x) = K1 (x1 ) ⊗ · · · ⊗ KN (xN ).

In general, the Wp -optimal kernel Ki need not be unique, so that there

can in fact be more than one choice for the shadow. Any choice will do
in what follows, and we shall speak of “the” shadow despite the abuse of
language. As detailed in Remark 4.2, the shadow can also be understood as
a particular choice of a Wp -projection of π onto Π(µ̃1 , . . . , µ̃N ). The crucial
additional property of the shadow is that its divergence is controlled by the
one of π.

Lemma 3.2. Let p ∈ [1, ∞] and µi , µ̃i ∈ Pp (Xi ), i = 1, . . . , N . Given

π ∈ Π(µ1 , . . . , µN ), its shadow π̃ ∈ Π(µ̃1 , . . . , µ̃N ) satisfies

Wp (π, π̃) = Wp (µ1 , . . . , µN ; µ̃1 , . . . , µ̃N ),

Df (π̃, µ̃1 ⊗ · · · ⊗ µ̃N ) ≤ Df (π, µ1 ⊗ · · · ⊗ µN ).

To study the continuity properties of regularized optimal transport, we

need to compare the cost of two couplings π, π̃ in the unregularized transport

8
problem. If c is L-Lipschitz, the following inequality holds for all probability
measures π, π̃. We formulate an abstract condition to cover more general
cases, especially Example 3.4 below.
Definition 3.3. Let p ∈ [1, ∞] and µi , µ̃i ∈ Pp (Xi ), i = 1, . . . , N . For a
constant L ≥ 0, we say that c satisfies (AL ) if
Z
c d(π − π̃) ≤ LWp (π, π̃) (AL )

for all π ∈ Π(µ1 , . . . , µN ) and π̃ ∈ Π(µ̃1 , . . . , µ̃N ).2

The most important application is quadratic cost.
Example 3.4. For p = 2 and cost c(x1 , x2 ) = kx1 −x2 k2 on Euclidean space
Rd × Rd , we have that (AL ) holds with
√
L := 2 [M (µ1 ) + M (µ̃1 ) + M (µ2 ) + M (µ̃2 )]
where M (µ) := ( kxk2 µ(dx))1/2 for µ ∈ P(Rd ).
R

The example is a special case of the following observation.

Lemma 3.5. Let p ∈ [1, ∞). Let c(x) = f (x)g(x) where f, g are Lipschitz
and have growth of order at most p − 1. Then (AL ) holds with a constant L
depending only on the Lipschitz and growth constants of f, g and the p-th
moments of µi , µ̃i , i = 1, . . . , N . For p = ∞, the analogue holds with depen-
dence on the bounds of f, g instead of moments.
This criterion generalizes to a product c(x) = c1 (x) · · · cm (x) of m Lip-
schitz functions satisfying a suitable growth condition; cf. Remark 4.3.
The next example shows that (AL ) also holds for the p-th power as cost.
Example 3.6. For cost c(x1 , x2 ) = kx1 − x2 kp with p ∈ (1, ∞) on Euclidean
space Rd × Rd , we have that (AL ) holds with
p−1
L := Cp Mp (µ1 ) + Mp (µ̃1 ) + Mp (µ2 ) + Mp (µ̃2 ) ,

where Mp (µ) := ( kxkp µ(dx))1/p for µ ∈ P(Rd ) and Cp is a constant de-

pending only on p.
The proof, detailed in Section 4, is similar to [52, Proposition 7.29] and
proceeds by estimating the derivative of a curve connecting the integrals
in question. The example generalizes to costs c(x1 , x2 ) = c̄(x1 , x2 )p with c̄
being Lipschitz.
2
In fact, (AL ) will only ever be used when one coupling is the shadow of the other, but
that restriction does not seem to substantially enhance the applicability.

9
3.2 Stability through Shadows
We can now state our first result, establishing the continuity of (2.1) with
respect to the marginals. The qualitative part (i) holds for general costs,
the quantitative part (ii) applies, in particular, to quadratic costs under
2-Wasserstein distance.

Theorem 3.7 (Continuity of Value). Let p ∈ [1, ∞].

(i) Let µi , µni ∈ Pp (Xi ) satisfy limn Wp (µi , µni ) = 0 for i = 1, . . . , N .

Then S(µn1 , . . . , µnN , c) → S(µ1 , . . . , µN , c) and the associated optimal
couplings converge in Wp .

(ii) Let µi , µ̃i ∈ Pp (Xi ) for i = 1, . . . , N and let c satisfy (AL ). Then

|S(µ1 , . . . , µN , c) − S(µ̃1 , . . . , µ̃N , c)| ≤ LWp (µ1 , . . . , µN ; µ̃1 , . . . , µ̃N ).

This result will be proved by comparing the cost of a coupling with the
cost of its shadow. Using the same idea, we can show the convergence of the
cost functionals as follows.

Remark 3.8 (Γ-Convergence). Define F : Pp (X) → R ∪ {∞} by

(R
c dπ + Df (π, µ1 ⊗ · · · ⊗ µN ) if π ∈ Π(µ1 , . . . , µN ),
F(π) =
∞, otherwise

and similarly Fn for the marginals µni . If limn Wp (µi , µni ) = 0, then Fn
Γ-converges to F ; that is, given π ∈ Pp (X),

(a) F(π) ≤ lim inf Fn (πn ) for any (πn )n≥1 ⊂ Pp (X) with Wp (π, πn ) → 0,

(b) there exists a sequence (πn )n≥1 ⊂ Pp (X) with Wp (π, πn ) → 0 and
F(π) ≥ lim sup Fn (πn ).

For the recovery sequence in (b), we can choose πn ∈ Π(µn1 , . . . , µnN ) to be

the shadow of π ∈ Π(µ1 , . . . , µN ).

Remark 3.9. Theorem 3.7 (i) and Remark 3.8 generalize to a sequence of
cost functions cn converging to c as long as the convergence is strong enough
to imply cn dπn → c dπ whenever πn ∈ Π(µn1 , . . . , µnN ) converge in Wp to
R R

some π ∈ Π(µ1 , . . . , µN ).

10
Our second aim is to bound the distance between the optimizers for dif-
ferent marginals. The line of argument requires controlling Wasserstein dis-
tance through entropy, hence it is natural to postulate a transport inequality.
Given q ∈ [1, ∞), we say that µi ∈ Pq (Xi ), i = 1, . . . , N satisfy (Iq ) with
constant Cq if
1
Wq (π, θ) ≤ Cq DKL (θ, π) 2q for all π, θ ∈ Π(µ1 , . . . , µN ). (Iq )
0 0
Similarly, they satisfy (Iq ) with constant Cq if
" 1#
0 1 DKL (θ, π) 2q 0
Wq (π, θ) ≤ Cq DKL (θ, π) + q (Iq )
2

for all π, θ ∈ Π(µ1 , . . . , µN ). The two inequalities serve a similar purpose,

0
but (Iq ) is implied by a weaker integrability condition. Indeed, when X is
bounded, (Iq ) holds as a simple consequence of Pinsker’s inequality. Using
0
the weighted inequalities of [9], (Iq ) and (Iq ) also hold under much weaker
exponential moment conditions on µi as detailed in (ii) and (iii) below. In (i),
we obtain a different relaxation where all but one space Xi are bounded.
Thus for the standard case N = 2, if one marginal is bounded, no condition
at all is needed on the other marginal.

Lemma 3.10. (i) Let X 0 := X2 × · · · × XN and suppose that

diamq (X 0 ) := sup dX 0 ,q (x, y) < ∞.

x,y∈X 0
1
−
Then (Iq ) holds with Cq = 2 2q diamq (X 0 ) for all µi ∈ Pq (Xi ).

(ii) If µi ∈ P(Xi ) satisfy exp(α dXi (x̂i , xi )2q ) µi (dxi ) < ∞ for some α ∈
R

(0, ∞) and x̂i ∈ Xi , then (Iq ) holds with constant

N Z ! 2q1
N X
Cq = 2 inf 1 + log exp(αdXi (x̂i , xi )2q ) µi (dxi ) .
x̂∈X,α>0 2α
i=1

(iii) If µi ∈ P(Xi ) satisfy exp(α dXi (x̂i , xi )q ) µi (dxi ) < ∞ for some α ∈
R
0
(0, ∞) and x̂i ∈ Xi , then (Iq ) holds with constant

N Z ! 1q
0 1X 3 q
Cq = 2 inf + log exp(αdXi (x̂i , xi ) )µi (dxi ) .
x̂∈X,α>0 α 2
i=1

11
0
Noting the logarithm in the formulas for Cq and Cq , we observe that these
constants are typically much smaller than the exponential moment itself. We
also note that the condition in (iii) covers subgaussian marginals for q = 2.
We can now state a quantitative result for the stability of the optimizer
of (2.3) relative to the marginals. In view of the above, the assumptions cover
quadratic cost under 2-Wasserstein distance and subgaussian marginals.
Theorem 3.11 (Stability of Optimizers). Let p ∈ [1, ∞] and q ∈ [1, ∞)
with q ≤ p, let µi , µ̃i ∈ Pp (Xi ), let µ1 , . . . , µN satisfy (Iq ) with constant Cq ,
and let c satisfy (AL ). Then the optimizers π ∗ , π̃ ∗ of Sent (µ1 , . . . , µN , c) and
Sent (µ̃1 , . . . , µ̃N , c) satisfy
( 1q − p1 ) 1
Wq (π ∗ , π̃ ∗ ) ≤ N ∆ + Cq (2L ∆) 2q , ∆ := Wp (µ1 , . . . , µN ; µ̃1 , . . . , µ̃N ).
0 0
If µ1 , . . . , µN satisfy (Iq ) with constant Cq instead of (Iq ), then
(1−1) 1 1
0
h i
Wq (π ∗ , π̃ ∗ ) ≤ N q p ∆ + Cq (2L∆) q + (L ∆) 2q .

In particular, (µ1 , . . . , µN ) 7→ π ∗ is 2p
1
-Hölder continuous in Wp when re-
stricted to a bounded set of marginals satisfying (AL ) and (Ip ) or (T0p ) with
given constants.
This result will be derived by comparing the optimizer with its shadow
and applying a strong convexity argument, more specifically, a Pythagorean
relation for relative entropy. In Theorem 3.11, only one set of marginals
0
needs to satisfy (Iq ) or (Iq ). If the assumption holds for both (µi ) and (µ̃i ),
the proof shows that L can be replaced by L/2 in the assertion.

3.3 Stability through Transformation

Next, we improve the Hölder exponent of Theorem 3.11 for the case of
bounded cost. The general line of argument is to reduce a difference in
marginals to a difference in cost functions. Thus, we first state a stability
result for the cost function under fixed marginals; it may be of independent
interest.
Proposition 3.12 (Stability wrt. Cost). Let p ∈ [1, ∞], let µi ∈ Pp (Xi ),
i = 1, . . . , N and P = µ1 ⊗ · · · ⊗ µN . Let c, c̃ : X → R+ be bounded measur-
able, then the optimizers π ∗ , π̃ ∗ of Sent (µ1 , . . . , µN , c) and Sent (µ1 , . . . , µN , c̃)
satisfy
p
1 1
kπ ∗ − π̃ ∗ kT V ≤ a p+1 kc − c̃kLp+1 p (P ) ,
2
2p
2
DKL (π ∗ , π̃ ∗ ) + DKL (π̃ ∗ , π ∗ ) ≤ a p+1 kc − c̃kLp+1
p (P ) ,

12
where a := exp(N kck∞ ) + exp(N kc̃k∞ ). Let q ∈ [1, ∞). If µ1 , . . . , µN
satisfy (Iq ) with constant Cq , then also
p
−1
1
Wq (π ∗ , π̃ ∗ ) ≤ 2 2q Cq a p kc − c̃kLp (P )
(p+1)q
,
0 0
whereas if µ1 , . . . , µN satisfy (Iq ) with constant Cq , then

0

1
2p 1
1 p
∗ ∗ (p+1)q − 2q (p+1)q
Wq (π , π̃ ) ≤ Cq a kc − c̃kLp (P )
p +2 a kc − c̃kLp (P )
p .

p
(For p = ∞, the exponent (p+1)q should be read as 1q .) Proposition 3.12
will be derived by comparing the optimizers in the sense of relative entropy
DKL (π ∗ , π̃ ∗ ). Of course, this is not possible in the other results where the
marginals differ in a possibly singular way. We observe that the constant a
1
deteriorates exponentially in kck∞ , however due to the a p in the formula this
can be counteracted by using a stronger Lp norm. In particular, for p = ∞,
the direct dependence on kck∞ , kc̃k∞ disappears completely, and moreover
we obtain a Lipschitz estimate from L∞ to W1 .
Those features are inherited by our final result on the stability with
respect to marginals; it improves the Hölder exponent of Theorem 3.11 in
the case of bounded costs. As above, the dependence of the constant on kck∞
is avoided for p = ∞; we now obtain a Lipschitz result from W∞ into W1 .
Theorem 3.13 (Stability of Optimizers for Bounded Cost). Let p ∈ [1, ∞]
and q ∈ [1, ∞) with q ≤ p, let µi , µ̃i ∈ Pp (Xi ) satisfy (Iq ) with con-
stant Cq and let c be bounded Lipschitz. Then the optimizers π ∗ , π̃ ∗ of
Sent (µ1 , . . . , µN , c) and Sent (µ̃1 , . . . , µ̃N , c) satisfy
p
(1−1) −1
1
Wq (π ∗ , π̃ ∗ ) ≤ N q p ∆ + 2 2q Cq a p Lipp (c) ∆
(p+1)q

where a := 2 exp(N kck∞ ) and ∆ := Wp (µ1 , . . . , µN ; µ̃1 , . . . , µ̃N ). If µi , µ̃i

0 0
satisfy (Iq ) with constant Cq instead of (Iq ), then
( 1q − p1 )
Wq (π ∗ , π̃ ∗ ) ≤ N ∆
2p p
−1 0 1
−1
1
(p+1)q (p+1)q
+ 2 q Cq a p Lipp (c) ∆ + 2 2q a p Lipp (c) ∆ .

In particular, (µ1 , . . . , µN ) 7→ π ∗ is p+1

1
-Hölder continuous in Wp when re-
stricted to a bounded set of marginals satisfying (Ip ) or (T0p ) with a given
constant. For q = 1 and p = ∞, we have the Lipschitz estimate
W1 (π ∗ , π̃ ∗ ) ≤ ` W∞ (µ1 , . . . , µN ; µ̃1 , . . . , µ̃N )

13
√
with constant ` := N + (C1 / 2) Lip∞ (c) independent of kck∞ . The con-
stant ` is sharp.

As discussed in the Introduction, this result is based on a transformation:

instead of dealing with two sets of marginals, we use a change of coordinates
to transform µ̃i to µi , at the expense of also transforming the cost function.
For the resulting problem, we can apply Proposition 3.12. The sharpness of
the constant ` is discussed in Example 4.10.

Remark 3.14. For simplicity, we have stated our results in the traditional
setting where Wp is defined through a metric compatible with the underlying
Polish space. However, much of the above generalizes to any measurable
metric. For instance, the discrete metric can be used to see that for p = 1,
our results include the total variation distance (see also [46] for further results
on continuity in total variation). The majority of our arguments extend
without change to the more general setting. In Definition 3.1, it is no longer
clear that there is a coupling attaining Wp (µi , µ̃i ). However, we can use an
-optimal coupling to define an “approximate shadow” for which the first part
of Lemma 3.2 is replaced by Wp (π, π̃) ≤ Wp (µ1 , . . . , µN ; µ̃1 , . . . , µ̃N ) + , and
then we can argue the main results as before. The extension to measurable
metrics also applies to Proposition 3.12. Theorem 3.13 extends with the
caveat that one needs to provide a substitute for the technical Lemma 4.9 (ii)
in the specific metric under consideration, as its proof uses separability of
the metric.

3.4 Application to Sinkhorn’s Algorithm

In this section we focus on N = 2 marginals µ1 , µ2 . Sinkhorn’s algorithm
exp(−c(x))
is initialized at π 0 := Pc , where d(µdP c
1 ⊗µ2 )
(x) = R exp(−c) d(µ1 ⊗µ2 )
is the Gibbs
n
kernel associated with the cost c. The Sinkhorn iterates π ∈ P(X), n ≥ 1
can then be defined recursively via
dπ n dµ1
(x) := (x1 ) for n odd,
dπ n−1 dπ1n−1
dπ n dµ2
(x) := (x2 ) for n even,
dπ n−1 dπ2n−1

where πin−1 is the i-th marginal of π n−1 . It follows that π1n = µ1 for n odd
and π2n = µ2 for n even: for each iterate, one of the two marginals is the
correct marginal. The other marginal does not match µi , but converges to
it as n → ∞. Importantly, each iterate π n is the solution of an entropic

14
optimal transport problem between its own marginals. As these marginals
converge to (µ1 , µ2 ), the convergence of Sinkhorn’s algorithm can be framed
as a particular instance of stability with respect to the marginals. As above,
we denote by π ∗ the optimizer of Sent (µ1 , µ2 , c). Moreover, we write
Z
F(π) := c dπ + DKL (π, µ1 ⊗ µ2 )

for the entropic cost of π ∈ P(X), similarly as in Remark 3.8 but without
the penalty.

Theorem 3.15 (Sinkhorn Convergence). Let p ∈ [1, ∞). For i = 1, 2, let

µi ∈ P(Xi ) satisfy exp(α dXi (x̂i , xi )p ) µi (dxi ) < ∞ for some α ∈ (0, ∞)
R

and x̂i ∈ Xi .

(i) Let c be continuous with growth of order p. As n → ∞, we have

F(π n ) → F(π ∗ ), πn → π∗ in Wp .

(ii) Let 1 ≤ q ≤ p and c(x) = f (x)g(x) where f, g are Lipschitz with growth
of order p − 1. For all n ≥ 2, with a constant c0 detailed in the proof,
1 1
− 2p − 4pq
|F(π ∗ ) − F(π n )| ≤ c0 n , Wq (π ∗ , π n ) ≤ c0 n .

Theorem 3.15 with p = q = 2 implies W2 -convergence for quadratic cost

with subgaussian marginals. The form c(x) = f (x)g(x) can be extended
as in Remark 4.3, or more generally to any condition guaranteeing (AL )
uniformly over the marginals produced by the algorithm. In particular, using
Example 3.6, the assertion of the theorem also holds for c(x) = kx2 − x1 kp .
The more detailed estimate given in the proof of the theorem shows that
the constant c0 is at the same scale as c; in particular, it does not grow
exponentially with c.

4 Proofs
4.1 Shadows and Preliminaries
For the convenience of the reader, we first recall the data processing in-
equality for our setting. Let Y1 and Y2 be Polish spaces. If µ ∈ P(Y1 ) and
K : Y1 → P(Y2 ) is a stochastic kernel, we

denote by µK ∈ P(Y2 ) the second marginal of µ ⊗ K ∈ P(Y1 × Y2 ). (4.1)

15
Lemma 4.1. Let µ, ν ∈ P(Y1 ) and K : Y1 → P(Y2 ) a kernel. Then

Df (µK, νK) ≤ Df (µ, ν).

Proof. We may assume that µ ν. For any kernels K1 K2 : Y1 → P(Y2 ),

d(µ ⊗ K1 ) dµ dK1 (x)

(x, y) = (x) (y) ν ⊗ K2 -a.s. (4.2)
d(ν ⊗ K2 ) dν dK2 (x)
d(µ⊗K) dµ
In particular, d(ν⊗K) (x, y) = dν (x) and thus

Df (µ, ν) = Df (µ ⊗ K, ν ⊗ K). (4.3)

Whereas in general, (4.2) and Jensen’s inequality for f yield

ZZ
dµ dK1 (x)
Df (µ ⊗ K1 , ν ⊗ K2 ) = f (x) (y) K2 (x, dy)ν(dx)
dν dK2 (x)
Z
dµ
≥ f (x) ν(dx) = Df (µ, ν). (4.4)
dν

Denote by µ ⊗ K = (µK) ⊗ K̃1 and ν ⊗ K = (νK) ⊗ K̃2 the “reverse”

disintegrations from the second marginal to the first. Applying (4.4) to
(µK) ⊗ K̃1 and (νK) ⊗ K̃2 ,

Df (µ ⊗ K, ν ⊗ K) = Df ((µK) ⊗ K̃1 , (νK) ⊗ K̃2 ) ≥ Df (µK, νK).

In view of (4.3), this yields the claim.

We can now show the two fundamental properties of the shadow.

Proof of Lemma 3.2. Let µi ⊗ Ki ∈ Π(µi , µ̃i ) be a Wp -optimal coupling and

define κ = π ⊗ K ∈ P(X × X) where K(x) = K1 (x1 ) ⊗ · · · ⊗ KN (xN ), so
that π̃ := πK is the shadow of π. In view of κ ∈ Π(π, π̃), for p < ∞,
Z
Wp (π, π̃)p ≤ dX,p (x, y)p κ(dx, dy)
N
Z X N
X
p
= dXi (xi , yi ) κ(dx, dy) = Wp (µi , µ̃i )p .
i=1 i=1

On the other hand, given an arbitrary coupling π̃ ∈ Π(µ̃1 , . . . , µ̃N ), any

coupling γ ∈ Π(π, π̃) induces couplings γi ∈ Π(πi , π̃i ) = Π(µi , µ̃i ) of the

16
individual marginals, hence
N
Z X
p
Wp (π, π̃) = inf dXi (xi , yi )p γ(dx, dy)
γ∈Π(π,π̃)
i=1
N
X Z N
X
≥ inf dXi (xi , yi )p γi (dxi , dyi ) = Wp (µi , µ̃i )p .
γi ∈Π(µi ,µ̃i )
i=1 i=1

The argument for p = ∞ is similar, completing the proof of the first claim. To
show the bound on the divergence, note that µ̃1 ⊗· · ·⊗ µ̃N = (µ1 ⊗· · ·⊗µN )K.
Therefore, the data processing inequality (Lemma 4.1) yields

Df (π̃, µ̃1 ⊗· · ·⊗ µ̃N ) = Df (πK, (µ1 ⊗· · ·⊗µN )K) ≤ Df (π, µ1 ⊗· · ·⊗µN ).

Remark 4.2. The preceding proof shows that the shadow is a Wp -projection
onto Π(µ̃1 , . . . , µ̃N ); that is, π̃ ∈ arg minΠ(µ̃1 ,...,µ̃N ) Wp (π, ·). In general, the
argmin may have more than one element. A simple example on R × R is
µ1 = µ2 = δ0 and µ̃1 = µ̃2 = (δ−1 + δ1 )/2; here any element of Π(µ̃1 , µ̃2 ) has
the same distance to the singleton Π(µ1 , µ2 ) = {δ(0,0) }. In this example, the
shadow of π := δ(0,0) is unique. Clearly, not any projection is a shadow, and
most projections fail to satisfy the divergence bound in Lemma 3.2.
Next, we show the criteria for (AL ).

Proof of Lemma 3.5 and Example 3.4. To show the lemma, let κ ∈ Π(π, π̃)
be a coupling attaining Wp (π, π̃). Then
Z Z
c d(π − π̃) = c(x) − c(y) κ(dx, dy)
Z Z
= f (x)(g(x) − g(y)) κ(dx, dy) + g(y)(f (x) − f (y)) κ(dx, dy). (4.5)

We estimate the first integral; the second is treated analogously. Hölder’s

inequality with q such that 1/p + 1/q = 1 yields
Z
|f (x)(g(x) − g(y))| κ(dx, dy) ≤ kf kLq (π) kg(x) − g(y)kLp (κ) .

As |f (x)| ≤ Cf [1 + dX1 (x1 , x̄1 )l + · · · + dXN (xN , x̄N )l ] with l ≤ p − 1 =

p(1 − 1/p) = p/q and hence lq ≤ p, and as π has marginals µi ∈ Pp (Xi ), we
see that kf kLq (π) is finite with a bound depending only on the p-th moments
of µi , i = 1, . . . , N . On the other hand,

kg(x) − g(y)kLp (κ) ≤ Lipp (g)Wp (π, π̃)

17
due to the fact that κ attains Wp (π, π̃). The lemma follows. Example 3.4
follows from the above√ estimate with f (x) = g(x) = kx1 − x2 k in which case
Lip2 (f ) = Lip2 (g) = 2.

Remark 4.3. Lemma 3.5 can be generalized to a product of any finite num-
ber of Lipschitz functions. Let c(x) = c1 (x) · · · cm (x) where cj are Lipschitz
and decompose c(x) − c(y) as in (4.5) with f (x) := c1 (x) · · · cm−1 (x) and
g(x) := cm (x). Proceeding inductively, we obtain that
m
X
c(x) − c(y) = Aj (x, y)(cj (x) − cj (y))
j=1

where Aj (x, y) is a product of m−1 factors of the form ck (x) or cl (y). If cj (x),
j = 1, . . . , m satisfy a growth condition suitably coordinated with a moment
condition on µi , µ̃i , then kAj (x, y)kLq (π) and kAj (x, y)kLq (π̃) can be bounded
in terms of those moments and we deduce an analogue of Lemma 3.5.
Proof of Example 3.6. Let κ be a Wp -optimal coupling of π and π̃. Set
ψ(x) := kxkp and define ϕ : [0, 1] → R by
Z
ϕ(t) := ψ((1 − t)(x2 − x1 ) + t(y2 − y1 )) κ(dx, dy);

then c(x) = ψ(x2 − x1 ) and the quantity to be estimated is

Z Z
c dπ − c dπ̃ = |ϕ(0) − ϕ(1)|. (4.6)

Using differentiation under the integral (justified by [26, Theorem 2.27]), we

see that ϕ is differentiable and
Z
∂ϕ
(t) = ∇ψ((1 − t)(x2 − x1 ) + t(y2 − y1 )), (y2 − y1 − x2 + x1 ) κ(dx, dy).
∂t
Noting k∇ψ(v)k = pkvkp−1 and writing vt = (1 − t)(x2 − x1 ) + t(y2 − y1 ), the
inequalities of Cauchy–Schwarz, Hölder and (a + b)p ≤ 2p−1 (ap + bp ) yield
Z
∂ϕ
(t) ≤ k∇ψ(vt )kk(y2 − x2 ) + (x1 − y1 )k κ(dx, dy)
∂t
Z p−1 Z 1
p p p
p
≤ k∇ψ(vt )k p−1 κ(dx, dy) k(y2 − x2 ) + (x1 − y1 )k κ(dx, dy)
Z p−1
p
≤ Cp0 p
kvt k κ(dx, dy) Wp (π, π̃)
p−1
≤ Cp Mp (µ1 ) + Mp (µ̃1 ) + Mp (µ2 ) + Mp (µ̃2 ) Wp (π, π̃)

18
where Cp , Cp0 are constants depending only on p. In view of (4.6), the claim
follows.

4.2 Stability through Shadows

We can now show the continuity of the value.

Proof of Theorem 3.7. (i) Let π ∗ , πn∗ be the optimizers for S(µ1 , . . . , µN , c)
and S(µn1 , . . . , µnN , c), respectively. For brevity, set P = µ1 ⊗ · · · ⊗ µN and
Pn = µn1 ⊗ · · · ⊗ µnN . After passing to a subsequence, πn converges in Wp to
some π ∈ Π(µ1 , . . . , µN ), by weak compactness. We have
Z Z Z
lim inf c dπn + Df (πn , Pn ) ≥ c dπ + Df (π, P ) ≥ c dπ ∗ + Df (π ∗ , P )
∗ ∗
n→∞

by lower semicontinuity of c d(·) + Df (·, ·) and optimality of π ∗ . On the

other hand, let π̃n ∈ Π(µn1 , . . . , µnN ) be the shadow of π ∗ . Then Lemma 3.2
shows limn Wp (π̃n , π ∗ ) = 0 and Df (π̃n , Pn ) ≤ Df (π ∗ , P ), hence
Z Z
∗ ∗
lim sup c dπn + Df (πn , Pn ) ≤ lim sup c dπ̃n + Df (π̃n , Pn )
n→∞
Zn→∞
≤ c dπ ∗ + Df (π ∗ , P ).

Together, limn c dπn∗ + Df (πn∗ , Pn ) = c dπ ∗ + Df (π ∗ , P ) and π must be

R R

the (unique) optimizer π ∗ of S(µ1 , . . . , µN , c). In particular, the original

sequence (πn∗ ) converges to π ∗ , as claimed.
(ii) Let π ∗ be the optimizer of S(µ1 , . . . , µN , c) and π̃ ∈ Π(µ̃1 , . . . , µ̃N )
its shadow. Using (AL ) and Lemma 3.2,
Z
S(µ1 , . . . , µN , c) = c dπ ∗ + Df (π ∗ , µ1 ⊗ · · · ⊗ µN )
Z
≥ c dπ̃ − LWp (π ∗ , π̃) + Df (π̃, µ̃1 ⊗ · · · ⊗ µ̃N )

≥ S(µ̃1 , . . . , µ̃N , c) − LWp (µ1 , . . . , µN ; µ̃1 , . . . , µ̃N ).

The claim follows by symmetry.

The proof of Γ-convergence follows the same line of argument.

Proof of Remark 3.8. Similarly

R to the preceding proof, (a) follows from the
lower semicontinuity of c d(·)R+ Df (·, ·).R For (b), let πn be the shadow of π
and use Lemma 3.2 to obtain c dπn → c dπ and Df (πn , µn1 ⊗ · · · ⊗ µnN ) ≤
Df (π, µ1 ⊗ · · · ⊗ µN ), again as in the preceding proof.

19
The criteria for the transport inequality (Iq ) are derived as follows.

Proof of Lemma 3.10. (i) For the convenience of the reader, we first recall
the standard argument for bounded X: combine dX,q (x, y)q ≤ diamq (X)q 1x6=y
with the transport representation of total variation distance [40, Lemma 2.20]
and Pinsker’s inequality [40, Theorem 2.16] to obtain
Z
q
Wq (π, θ) = inf dX,q (x, y)q κ(dx, dy)
κ∈Π(π,θ)
Z
≤ diamq (X)q inf 1x6=y κ(dx, dy)
κ∈Π(π,θ)
1 1/2
= diamq (X)q kπ − θkT V ≤ diamq (X)q DKL (θ, π) .
2
The above holds for arbitrary probabilities π, θ. To prove the stronger
estimate claimed in the lemma, we improve the above by exploiting that
π, θ ∈ Π(µ1 , . . . , µN ). Indeed, let Π1 (π, θ) ⊂ Π(π, θ) denote the set of cou-
plings κ ∈ Π(π, θ) not moving mass in the X1 -direction; i.e.,

κ{(x1 , . . . , xN , y1 , . . . , yN ) : x1 = y1 } = 1.

Note that Π1 (π, θ) 6= ∅ due to the fact that π and θ have the same marginal µ1
on X1 . Clearly
Z
q
Wq (π, θ) = inf dX,q (x, y)q κ(dx, dy)
κ∈Π(π,θ)
Z
≤ inf dX,q (x, y)q κ(dx, dy)
κ∈Π1
Z
q
≤M inf 1x6=y κ(dx, dy), M := diamq (X2 × · · · × XN ).
κ∈Π1 (π,θ)

On the other hand, we claim that π, θ having the same marginal implies
Z
inf 1x6=y κ(dx, dy) ≤ kπ − θkT V ; (4.7)
κ∈Π1 (π,θ)

in words, where mass needs to be moved, one might as well move only in the
directions X2 , . . . , XN . Granted (4.7), we can proceed as in the beginning
and conclude the assertion of the lemma,
1 1/2
Wq (π, θ)q ≤ M q kπ − θkT V ≤ M q DKL (θ, π) .
2

20
To show (4.7), consider the mutually singular measures π̃ = π − (π ∧ θ) and
θ̃ = θ − (π ∧ θ), where π ∧ θ is defined as usual via d(π ∧ θ)/d(π + θ) =
min{dπ/d(π + θ), dθ/d(π + θ)}. These measures again share a common first
marginal, so that Π1 (π̃, θ̃) 6= ∅. Let κ̃ ∈ Π1 (π̃, θ̃) be arbitrary and let
κ ∈ Π(π, θ) be the coupling given by κ = κ̃ + i where i is the identical
coupling of π ∧ θ with itself. Then
Z Z
kπ − θkT V ≤ 1x6=y κ(dx, dy) = 1x6=y κ̃(dx, dy) = kπ̃ − θ̃kT V

where the last equality follows from mutual singularity. As kπ̃ − θ̃kT V =
kπ − θkT V , all expressions are equal and (4.7) follows.
(ii) It is shown in [9, Corollary 2.4] that the inequality (Iq ) holds for a
given measure π ∈ P(X) and all θ ∈ P(X) whenever
Z
exp(α̃ dX,q (x, x̂)2q ) π(dx) < ∞ (4.8)

for some α̃ > 0 and x̂ ∈ X, with constant

1
Z 2q1
2q
Cπ,q = 2 inf 1 + log exp(α̃dX,q (x̂, x) )π(dx) . (4.9)
x̂∈X,α̃>0 2α̃

To obtain the claim for a coupling π (and general θ ∈ P(X)), note that
N N
X 1 X 2
dX,q (x̂, x)2q ≤ N dX,i (x̂i , xi )2q = N dX,i (x̂i , xi )2q
N
i=1 i=1
R
and that the functional f 7→ log exp(α̃f (x)) π(dx), is convex (as can be seen
from a variational representation, e.g. [28, Example 4.34, p. 201]). Hence
Z N Z
2q 1 X
log exp(α̃dX,q (x̂, x) ) π(dx) ≤ log exp(α̃N 2 dXi (x̂i , xi )2q ) µi (dxi ).
N
i=1

To obtain the claim for Cq , we plug this inequality into (4.9) and set α̃ =
α/N 2 . Similarly, the integrability condition in the lemma implies (4.8).
(iii) The proof is similar to (ii) but refers to a different result of [9].
Indeed, by [9, Corollary 2.3], it suffices to bound

0

1 3
Z 1q
q
Cπ,q = 2 inf + log exp(α̃dX,q (x̂, x) )π(dx) .
x̂∈X,α̃>0 α̃ 2

Here the term inside the exponential

R already factorizes and we can directly
apply the convexity of f 7→ log exp(α̃f (x))π(dx), which yields the claim
after the substitution α̃ = α/N .

21
As a preparation for the proof of Theorem 3.11, we recall a Pythagorean
relation for the entropic optimal transport problem. We denote
Z
F(π) = c dπ + DKL (π, π1 ⊗ · · · ⊗ πN )

where π1 , . . . , πN are the marginals of π.

Lemma 4.4. If π ∗ ∈ Π(µ1 , . . . , µN ) is the optimizer of S(µ1 , . . . , µN , c),
DKL (π, π ∗ ) ≤ F(π) − F(π ∗ ) for all π ∈ Π(µ1 , . . . , µN ).
Proof. Set P = µ1 ⊗ · · · ⊗ µN and define Pc ∈ P(X) by dPc = α−1 e−c dP ,
where α is the normalizing constant. Then
F(π) = DKL (π, Pc ) − log α, (4.10)
so that the entropic optimal transport problem (2.3) is equivalent to mini-
mizing DKL (·, Pc ). In particular, π ∗ = arg minΠ(µ1 ,...,µN ) DKL (·, Pc ) and the
Pythagorean theorem for relative entropy [19, Theorem 2.2] yields
DKL (π, Pc ) ≥ DKL (π ∗ , Pc ) + DKL (π, π ∗ ) for all π ∈ Π(µ1 , . . . , µN ).
In view of (4.10), the claim follows. (In the case under consideration, the
assertion holds even with equality. We do not need that fact here.)

We can now show the stability of optimizers with respect to the marginals.
0
Proof of Theorem 3.11. We detail the proof for (Iq ); the argument for (Iq ) is
identical. For notational convenience, we treat the case where µ̃i (rather
than µi ) satisfy (Iq ). Consider the optimizers π ∗ ∈ Π(µ1 , . . . , µN ) and
π̃ ∗ ∈ Π(µ̃1 , . . . , µ̃N ). Let π̃ ∈ Π(µ̃1 , . . . , µ̃N ) be the shadow of π ∗ for the
p-Wasserstein distance. Using Lemma 3.2 and (AL ) as in the proof of The-
orem 3.7 (ii),
Z
F(π̃) − F(π ) ≤ c d(π̃ − π ∗ ) ≤ L Wp (π̃, π ∗ ) ≤ L∆.
∗

We also have F(π ∗ ) − F(π̃ ∗ ) ≤ L∆ by Theorem 3.7 (ii), and adding the
inequalities yields
F(π̃) − F(π̃ ∗ ) ≤ 2L∆.
(If both marginals satisfy (Iq ) with constant L, we can assume by symmetry
that F(π ∗ ) − F(π̃ ∗ ) ≤ 0, and obtain the estimate with L instead of 2L.) By
Lemma 4.4, it follows that DKL (π̃, π̃ ∗ ) ≤ 2L∆, and now (Iq ) implies
1
Wq (π̃, π̃ ∗ ) ≤ Cq (2L∆) 2q .

22
Recalling that Wr on X was defined relative to the distance dX,r , Jensen’s
( 1q − p1 )
inequality implies Wq (·, ·) ≤ N Wp (·, ·). In view of Lemma 3.2, we
( 1q − p1 ) ( 1q − p1 )
deduce Wq (π ∗ , π̃)
≤N Wp (π ∗ , π̃) ≤N ∆. We conclude the proof
via the triangle inequality,
( 1q − p1 ) 1
Wq (π ∗ , π̃ ∗ ) ≤ Wq (π ∗ , π̃) + Wq (π̃, π̃ ∗ ) ≤ N ∆ + Cq (2L∆) 2q .

4.3 Stability with respect to Cost

Throughout this section, we fix p ∈ [1, ∞], µi ∈ Pp (Xi ) for i = 1, . . . , N
and c, c̃ : X → [0, ∞) satisfying the growth condition (2.2). The following
observation is the starting point for the stability with respect to the cost
function. We recall that P := µ1 ⊗ · · · ⊗ µN .

Lemma 4.5. Let π ∗ , π̃ ∗ be the respective optimizers of Sent (µ1 , . . . , µN , c)

and Sent (µ1 , . . . , µN , c̃). Then
Z
DKL (π , π̃ ) + DKL (π̃ , π ) ≤ (c − c̃) d(π̃ ∗ − π ∗ ).
∗ ∗ ∗ ∗

Proof. Lemma 4.4 yields

Z Z
∗ ∗ ∗ ∗ ∗ ∗
DKL (π , π̃ ) + DKL (π̃ , π ) ≤ c dπ̃ + DKL (π̃ , P ) + c̃ dπ ∗ + DKL (π ∗ , P )
Z Z
∗ ∗
− c dπ − DKL (π , P ) − c̃ dπ̃ ∗ − DKL (π̃ ∗ , P )
Z
= (c − c̃) d(π̃ ∗ − π ∗ ).

Lemma 4.5 clearly implies a Lipschitz estimate with respect to kc − c̃k∞

by using Pinsker’s inequality on the left-hand side. The following proof is a
variation on that observation.

Proof of Proposition 3.12. Combining

dπ ∗ dπ̃ ∗
Z Z
∗ ∗
(c̃ − c) d(π − π̃ ) ≤ |c̃ − c| − dP
dP dP
p
with Hölder’s inequality as well as (in case p 6= 1), for q := p−1 ,

q q−1
dπ ∗ dπ̃ ∗ dπ ∗ dπ̃ ∗ dπ ∗ dπ̃ ∗
− ≤ − − ,
dP dP dP dP L∞ (P ) dP dP

23
yields
1
Z ∗ dπ̃ ∗
1− 1 dπ
p
(c̃ − c) d(π ∗ − π̃ ∗ ) ≤ kc̃ − ckLp (P ) (2kπ ∗ − π̃ ∗ kT V ) p − .
dP dP ∞
(4.11)
Next, we show
dπ ∗ dπ̃ ∗
− ≤ a := exp(N kck∞ ) + exp(N kc̃k∞ ). (4.12)
dP dP ∞

Recall that by duality (e.g., [23, 44]), for certain “potentials” ϕi : Xi → R,

dπ ∗
(x) = exp (−c + ⊕i ϕi ) (4.13)
dP
PN
where (⊕i ϕi )(x) := i=1 ϕ(xi ), and moreover
Z
⊕i ϕi dP = Sent (µ1 , . . . , µN , c) ≥ 0 (4.14)

where the inequality is due to c ≥ 0. To estimate the right-hand side

of (4.13), recall that (4.13) and the fact that π ∗ is a coupling imply a conju-
gacy relation between the potentials (e.g., [23, 44, 45]), namely
Z
ϕi (xi ) = − log exp(−c(x) + ⊕j6=i ϕj (xj )) P−i (dx−i )
Z
≤ kck∞ − ⊕j6=i ϕj dP−i ,

where x−i := (x1 , . . . , xi−1 , xi+1 , . . . , xN ) and P−i := ⊗j6=i µj . Thus by (4.14),
Z
⊕i ϕi (x) ≤ N kck∞ − (N − 1) ⊕N j=1 ϕj dP ≤ N kck∞ .

Using this in (4.13), we conclude that

dπ ∗
≤ exp(N kck∞ ).
dP ∞
∗ ∗ ∗ dπ̃ ∗
The analogue holds for π̃ ∗ , hence dπ dπ̃
dP − dP ∞ ≤
dπ
dP ∞ + dP ∞ ≤a
as claimed in (4.12).
Pinsker’s inequality, Lemma 4.5, (4.11) and (4.12) imply
4kπ ∗ − π̃ ∗ k2T V ≤ DKL (π ∗ , π̃ ∗ ) + DKL (π̃ ∗ , π ∗ )
Z
1
1− 1
≤ (c − c̃) d(π̃ ∗ − π ∗ ) ≤ a p (2kπ ∗ − π̃ ∗ kT V ) p kc̃ − ckLp (P ) .

24
1− 1
Dividing by 4kπ ∗ − π̃ ∗ kT V p yields

1+ 1
1 1+ 1 1
kπ ∗ − π̃ ∗ kT V p ≤
p
a p kc̃ − ckLp (P ) (4.15)
2
which is the first claim of the proposition. On the other hand, using Lemma 4.5
and (4.11) together with (4.15) yields

1
1
p−1
∗ ∗ ∗ ∗ p+1
DKL (π , π̃ ) + DKL (π̃ , π ) ≤ a kc̃ − ckLp (P ) a kc̃ − ckLp (P )
p p . (4.16)

As (Iq ) implies 2Cq−2q Wq (π ∗ , π̃ ∗ )2q ≤ DKL (π ∗ , π̃ ∗ ) + DKL (π̃ ∗ , π ∗ ), this proves

the second claim of the proposition. For the last claim, we drop the nonneg-
0
ative term DKL (π̃ ∗ , π ∗ ) on the left-hand side of (4.16) and use (Iq ) with the
remaining inequality.

4.4 Stability through Transformation

Let p ∈ [1, ∞], µi , µ̃i ∈ Pp (Xi ) for i = 1, . . . , N and let c : X → [0, ∞) satisfy
the growth condition (2.2). We begin with preliminary results connecting
stability with respect to the marginals and stability with respect to the cost
function. As in Definition 3.1, K denotes the kernel K(x) = K1 (x1 ) ⊗
· · · ⊗ KN (xN ), where µi ⊗ Ki ∈ Π(µi , µ̃i ) isRan optimal coupling attaining
Wp (µi , µ̃i ). We use the notation Kc(x) := c(y) K(x, dy) for the integral
of c with respect to the kernel.

Lemma 4.6. Let p ∈ [1, ∞] and let c be Lipp (c)-Lipschitz. Then

kc − KckLp (π) ≤ Lipp (c)Wp (µ1 , . . . , µN ; µ̃1 , . . . , µ̃N ), π ∈ Π(µ1 , . . . , µN ).

Proof. We only detail the calculation for p < ∞,

Z Z
p
kc − KckpLp (π) = c(x) − c(y)K(x, dy) π(dx)
ZZ
≤ |c(x) − c(y)|p K(x, dy)π(dx)
N
ZZ X
p
≤ Lipp (c) dX,i (xi , yi )p K(x, dy)π(dx)
i=1
N
X
= Lipp (c)p Wp (µi , µ̃i )p .
i=1

25
Next, consider the kernel K̃ defined like K but with the marginals re-
versed; that is, K̃(x) = K̃1 (x1 )⊗· · ·⊗K̃N (xN ), where µ̃i ⊗K̃i ∈ Π(µ̃i , µi ) is an
optimal coupling attaining Wp (µ̃i , µi ). The double integral K̃Kc := K̃(Kc)
thus corresponds to a round-trip between the marginals. In general, this
round-trip leads to a positive gap R in value, as shown in the next result.
The result will not be used in the subsequent proofs but it may be useful to
understand the steps below, where we look for situations where the gap is
zero.
Lemma 4.7. Let p ∈ [1, ∞]. We have

S(µ̃1 , . . . , µ̃N , c) ≤ S(µ1 , . . . , µN , Kc) ≤ S(µ̃1 , . . . , µ̃N , c) + R,

where R := (K̃Kc − c) dπ̃ ∗ and π̃ ∗ is the optimizer of S(µ̃1 , . . . , µ̃N , c).

Moreover, R ≤ 2 Lipp (c)Wp (µ1 , . . . , µN ; µ̃1 , . . . , µ̃N ).

Proof. Set P̃ = µ̃1 ⊗ · · · ⊗ µ̃N and recall (4.1). Using Lemma 4.1 twice,
Z
S(µ̃1 , . . . , µ̃N , c) = inf c dπ̃ + Df (π̃, P̃ )
π̃∈Π(µ̃1 ,...,µ̃N )
Z
≤ inf c d(πK) + Df (πK, P K)
π∈Π(µ1 ,...,µN )
Z
≤ inf Kc dπ + Df (π, P )
π∈Π(µ1 ,...,µN )

= S(µ1 , . . . , µN , Kc)
Z
≤ Kc d(π̃ ∗ K̃) + Df (π̃ ∗ K̃, P̃ K̃)
Z
≤ K̃Kc dπ̃ ∗ + Df (π̃ ∗ , P̃ ) = S(µ̃1 , . . . , µ̃N , c) + R.

The bound for R is similar to the proof of Lemma 4.6.

In Lemma 4.7, there is a gap between the values of S(µ̃1 , . . . , µ̃N , c) and
S(µ1 , . . . , µN , Kc). If however the kernels K, K̃ are given by maps inverse
to one another (as will be the case in the proof of Lemma 4.9 below), the
gap is zero and the problems S(µ̃1 , . . . , µ̃N , c) and S(µ1 , . . . , µN , Kc) become
equivalent in the following sense. We write T] for the pushforward under T .
Lemma 4.8. For i = 1, . . . , N , let Ti : Xi → Xi satisfy µ̃i = (Ti )] µi and
admit a (measurable) a.s. inverse Ti−1 : Xi → Xi ; that is, Ti−1 ◦ Ti = id
µi -a.s. and Ti ◦ Ti−1 = id µ̃i -a.s. Define

T (x) = (T1 (x1 ), . . . , TN (xN )), T −1 (x) = (T1−1 (x1 ), . . . , TN−1 (xN )).

26
Then S(µ̃1 , . . . , µ̃N , c) = S(µ1 , . . . , µN , c ◦ T ) and the optimizers π̃ ∗ , π ∗ of
the two problems are related by π̃ ∗ = T] π ∗ and π ∗ = T]−1 π̃ ∗ .

Proof. Set P = µ1 ⊗ · · · ⊗ µN and P̃ = µ̃1 ⊗ · · · ⊗ µ̃N . We have

Z Z
c ◦ T dπ + Df (π, P ) = c ◦ T d(T]−1 (T] π)) + Df (T]−1 (T] π), T]−1 P̃ )
Z
= c d(T] π) + Df (T] π, P̃ )

for any π ∈ Π(µ1 , . . . , µN ), hence taking infimum over π ∈ Π(µ1 , . . . , µN )

yields S(µ1 , . . . , µN , c ◦ T ) ≥ S(µ̃1 , . . . , µ̃N , c). Symmetric results hold start-
ing from π̃ ∈ Π(µ̃1 , . . . , µ̃N ). Thus S(µ̃1 , . . . , µ̃N , c) = S(µ1 , . . . , µN , c ◦ T ),
and now the formulas for the optimizers follow as well.

In the simplest case, the optimal couplings for Wp (µi , µ̃i ) are given by
invertible maps, and then we can apply Lemma 4.8 directly to prove Theo-
rem 3.13. In general, we approximate the marginals with measures having
that property as detailed next, passing to an augmented space to guarantee
that the setting is sufficiently rich. We write δx for the Dirac measure at x.
Lemma 4.9. Let p ∈ [1, ∞]. Let X̄i = Xi × (−1, 1) and embed the Q marginals
as νi := µi ⊗ δ0 and ν̃i := µ̃i ⊗ δ0 for i = 1, . . . , N . Set X̄ = Ni=1 X̄i and
define c̄ : X̄ → R by c̄(x, u) := c(x) for x ∈ X and u ∈ (−1, 1) . N

(i) We have S(µ1 , . . . , µN , c) = S(ν1 , . . . , νN , c̄) and the corresponding op-

timizers π, θ are related by θ = π ⊗ δ0N .
If π̃, θ̃ are the optimizers for S(µ̃1 , . . . , µ̃N , c) and S(ν̃i , . . . , ν̃N , c̄), then

Wp (π, π̃) = Wp (θ, θ̃).

(ii) Given 0 < < 1 and i = 1, . . . , N , there exist νi , ν̃i ∈ P(X̄i ) with

Wp (νi , νi ) ≤ , Wp (ν̃i , ν̃i ) ≤ (4.17)

and an a.s. invertible map Ti : X̄i → X̄i such that ν̃i = (Ti )] νi and
the corresponding coupling attains Wp (νi , ν̃i ).
Proof. (i) follows immediately from the definitions; we prove (ii). The case
p < ∞ is standard: for n large enough, there exist ρi , ρ̃i ∈ P(X̄i ) of the form
n n
1X 1X
ρi = δ(xk ,0) , ρ̃i = δ(x̃k ,0)
n n
k=1 k=1

27
such that Wp (νi , ρi ) ≤ 2 and Wp (ν̃i , ρ̃i ) ≤ 2 ; for instance, one can use
suitable realizations of i.i.d. samples (e.g., [35, Corollary 1.1]). Next, choose
distinct u1 , . . . , un ∈ (0, 1) small enough such that the measures
n n
1X 1X
νi = δ(xk ,uk ) , ν̃i = δ(x̃k ,uk )
n n
k=1 k=1

satisfy Wp (ρi , νi )
≤ and 2 ≤Wp (ρ̃i , ν̃i )
Then (4.17) holds and νi , ν̃i are
2.
empirical measures on n distinct points due to the choice of u1 , . . . , un . As a
result, there is an optimal transport map that is one-to-one on the supports.
Let p = ∞. Here a different argument is necessary. (The following also
gives an alternate proof for p < ∞.) As X is Polish, we can find a dense
sequence (qk ) ⊂ X and a countable measurable partition (Qk ) of X with
qk ∈ Qk and diam Qk ≤ 4 . Consider the approximations
∞
X ∞
X
ρi := νi (Qk ) δqk ⊗ δ0 , ρ̃i := ν̃i (Qk ) δqk ⊗ δ0
k=1 k=1

which clearly satisfy W∞ (ρi , νi ) < 2 and W∞ (ρ̃i , ν̃i ) < 2 , but may have
atoms of unequal mass. Let ρi ⊗ Ui ∈ Π(ρi , ρ̃i ) be a W∞ -optimal coupling,
then Ui : X̄i → P(X̄i ) is a stochastic kernel such that for each k,
∞
X
Ui ((qk , 0)) = wj,k δqj ⊗ δ0 ,
j=1
P∞
for some weights wj,k ≥ 0 with j=1 wj,k = 1. Let 0 > 0 and pick disjoint
numbers uj,k ∈ (0, 0 ), define
∞
X ∞
X
νi := νi (Qk )wj,k δqk ⊗ δuj,k , ν̃i := νi (Qk )wj,k δqj ⊗ δuj,k
j,k=1 j,k=1

and observe that W∞ (νi , ρi ) < 2 and W∞ (ν̃i , ρ̃i ) < 2 for 0 sufficiently small
(note that uj,k := 0 would lead to νi = ρi and ν̃i = νi Ui = ρ̃i ). Now (4.17)
holds by the triangle inequality. Define

Ti : {qk : k ∈ N} × {uj,k : j, k ∈ N} → {qk : k ∈ N} × {uj,k : j, k ∈ N},

Ti (qk , uj,k ) := (qj , uj,k )

which is one-to-one as the uj,k are distinct. Moreover, ρi ⊗ Ui ∈ Π(ρi , ρ̃i )
implies ν̃i = (Ti )] νi , and since ρi ⊗ Ui attains W∞ (ρi , ρ̃i ) = W∞ (νi , ν̃i ), the
coupling induced by Ti attains W∞ (νi , ν̃i ).

28
After these preparations, we are ready to prove Theorem 3.13.
0
Proof of Theorem 3.13. We detail the proof for (Iq ); the argument for (Iq ) is
identical. We shall apply Proposition 3.12 though the equivalence outlined
in Lemma 4.8. To this end, we extend the spaces Xi by the interval (−1, 1)
and introduce νi , ν̃i , c̄ as in Lemma 4.9. In view of Lemma 4.9 (i), it suffices
to prove the claim for these data instead of µi , µ̃i , c.
Let > 0, choose νi , ν̃i , T as in Lemma 4.9 (ii) and denote by θ , θ̃ , θ̂
the respective optimizers of Sent (ν1 , . . . , νN , c̄) and S
ent (ν̃1 , . . . , ν̃N , c̄) and

Sent (ν1 , . . . , νN , c̄ ◦ T ), respectively. Noting that Lipp (c̄) = Lipp (c) and
setting ∆() := Wp (ν1 , . . . , νN ; ν̃ , . . . , ν ), Lemma 4.6 yields
1 N

kc̄ − c̄ ◦ T kLp (P ) ≤ Lipp (c) ∆()

and thus Proposition 3.12 shows that

1 p
1 2q 1 (p+1)q
Wq (θ̂ , θ ) ≤ Cq a p Lipp (c) ∆() . (4.18)
2

As θ̃ = T ] θ̂ by Lemma 4.8 and Ti attains Wp (νi , ν̃i ), it follows by the
same calculation as in the proof of Theorem 3.11 that
( 1q − p1 ) ( 1q − p1 )
Wq (θ̃ , θ̂ ) ≤ N Wp (θ̃ , θ̂ ) ≤ N ∆().

Combining the two estimates, we find that

Wq (θ , θ̃ ) ≤ Wq (θ , θ̂ ) + Wq (θ̂ , θ̃ )

1 p
( 1q − p1 ) 1 2q 1 (p+1)q
≤N ∆() + Cq a Lipp (c) ∆()
p .
2

Letting → 0, the left-hand side converges to Wq (π ∗ , π̃ ∗ ) by Theorem 3.11

and Lemma 4.9 (ii), while ∆() → ∆ by construction. The claim on sharp-
ness is discussed in Example 4.10 below.

Finally, we exhibit a family of examples for which the constant ` of

Theorem 3.13 is optimal.

Example 4.10 (Sharpness of ` in Theorem 3.13.). On X = [−1, 1]2 , let

1 1
µ1 = µ2 = (δ−1 + δ1 ) , µ̃1 = µ̃2 = (δ−1+ε + δ1−ε ) ,
2 2

29
where ε ∈ (0, 1/2) is a parameter. We define the cost function c = c(ε) by
c(−1, −1) = c(1, 1) = c(−1 + ε, 1 − ε) = c(1 − ε, −1 + ε) = 0,
c(1, −1) = c(−1, 1) = c(−1 + ε, −1 + ε) = c(1 − ε, 1 − ε) = ε,
exp(ε)
then c is Lipschitz with constant Lip∞ (c) = 1. Setting α(ε) := 1+exp(ε) , we
∗ ∗
calculate the optimizers π , π̃ of Sent (µ1 , µ2 , c) and Sent (µ̃1 , µ̃2 , c) to be
α() 1 − α()
π∗ =

δ(−1,−1) + δ(1,1) + δ(−1,1) + δ(1,−1) ,
2 2
∗ 1 − α() α()
π̃ = δ(−1+ε,−1+ε) + δ(1−ε,1−ε) + δ(1−ε,−1+ε) + δ(−1+ε,1−ε) .
2 2
Next, we find
W1 (π ∗ , π̃ ∗ ) = 2(1 − α(ε))2ε + (2α(ε) − 1)2
by observing that an optimal coupling κ ∈ Π(π ∗ , π̃ ∗ ) is to move a total mass
of 2(1 − α(ε)) over a dX,1 -distance of 2ε and mass 2α(ε) − 1 over distance
(2 − ε) + ε = 2. In view of α(ε) = 21 + 4ε + O(ε3 ) as ε → 0, we deduce
W1 (π ∗ , π̃ ∗ ) = 3ε + O(ε2 ).
On the other hand, clearly
W∞ (µ1 , µ2 ; µ̃1 , µ̃2 ) = ε.
In summary, any constant ` such that W1 (π ∗ , π̃ ∗ ) ≤ `W∞ (µ1 , µ2 ; µ̃1 , µ̃2 )
holds in the above example for all ε, has to satisfy ` ≥ 3.
It remains to see that we attain ` = 3 in the last assertion of Theo-
rem 3.13. For q = 1, Lemma 3.10 (i) with √ diam1 (X2 ) = diam([−1, 1]) = 2
shows that (Iq ) is satisfied with C1 = 2. Hence, the formula in Theo-
rem 3.13 reads
√
` = N + (C1 / 2) Lip∞ (c) = 2 + 1 = 3
as desired.
We remark that this example can be extended to more general parame-
ters. Replacing c by Lc for some L > 0 leads to a different Lipschitz constant
in the definition of l. Replacing α(ε) by α(Lε) in the formula for W1 (π ∗ , π̃ ∗ ),
one finds that the constant l is again sharp. Similarly, replacing [−1, 1]
by [−K, K] for some K > 0 and replacing 1 by K in the definition of the
marginals, we find that only the constant C1 changes in the definition of l,
while for W1 (π ∗ , π̃ ∗ ) one replaces the final 2 by 2K. Again, the constant l
remains sharp.

30
4.5 Application to Sinkhorn’s Algorithm
Proof of Theorem 3.15. We first observe that π n is the optimizer of the prob-
lem Sent (π1n , π2n , c):

π n = arg min DKL (π, π 0 )

π∈Π(π1n ,π2n )
Z
= arg min c dπ + DKL (π, µ1 ⊗ µ2 )
π∈Π(π1n ,π2n )
Z
= arg min c dπ + DKL (π, π1n ⊗ π2n ),
π∈Π(π1n ,π2n )

where the last step used Remark 2.1. (The first identity is well known; e.g.,
it follows from the fact that by construction, dπ n /dπ 0 admits a factorization
a(x1 )b(x2 ).) To apply our stability results, we require the convergence of
the marginals in Wp . Indeed, DKL (πin , µi ) → 0 holds by a standard entropy
calculation, see for instance [51]. More precisely, we have

DKL (π ∗ , Pc )
DKL (πin , µi ) ≤ 2 (4.19)
n
according to [36, Corollary 1]. By the exponential moment condition on µi
and [9, Corollary 2.3], (4.19) yields
− p1 1
− 2p
Wp (πin , µi ) ≤ C0 Cµi (n +n ) where

1 1
n o
C0 := max (2DKL (π ∗ , Pc )) p , (2DKL (π ∗ , Pc )) 2p ,

1 3
Z p1
Cµi := 2 inf + log exp(αdXi (x0 , xi )) µi (dxi ) .
x0 ∈Xi ,α>0 α 2

As a result,
− p1 1
− 2p
∆ := max Wp (πin , µi ) ≤ C0 max{Cµ1 , Cµ2 }(k +k ). (4.20)
i=1,2

We remark that ∆ = Wp (π1n , π2n ; µ1 , µ2 ) as Wp (π1n , µ1 ) = 0 or Wp (π2n , µ2 ) = 0

for each n, consistent with our previous notation. By (4.20), πin has a finite
p-th moment. Noting also that
2
X
n n
DKL (π , µ1 ⊗ µ2 ) = DKL (π , π1n ⊗ π2n ) + DKL (πin , µi ), (4.21)
i=1

31
assertion (i) thus follows directly from Theorem 3.7 (i).
Regarding (ii), note that the p-th moments of πin are bounded uniformly
in n due to (4.20). In view of Lemma 3.5, the cost function c thus satis-
fies (AL ) with a uniform constant L for the marginals (π1n , π2n )n as well as
(µ1 , µ2 ). Using also (4.19) and (4.21), Theorem 3.7 (ii) yields
|F(π ∗ ) − F(π n )| ≤ L ∆ + 2DKL (π ∗ , Pc )n−1 .
0
In view of (4.20), the claimed rate for |F(π ∗ ) − F(π n )| follows. Finally, (Iq )
0
holds with constant Cq by Lemma 3.10 (iii) and thus Theorem 3.11 yields
( 1q − p1 ) 0 1 0 1 1
Wq (π ∗ , π n ) ≤ 2 ∆ + Cq (2L)1/q ∆ q + Cq L 2q ∆ 2q ,
so that the claimed rate for Wq (π ∗ , π n ) follows via (4.20).

References
[1] M. Agueh and G. Carlier. Barycenters in the Wasserstein space. SIAM J.
Math. Anal., 43(2):904–924, 2011.
[2] J. M. Altschuler, J. Niles-Weed, and A. J. Stromme. Asymptotics for semidis-
crete entropic optimal transport. SIAM J. Math. Anal., 54(2):1718–1741,
2022.
[3] D. Alvarez-Melis and T. Jaakkola. Gromov-Wasserstein alignment of word em-
bedding spaces. In Proceedings of the 2018 Conference on Empirical Methods
in Natural Language Processing, pages 1881–1890, 2018.
[4] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial
networks. volume 70 of Proceedings of Machine Learning Research, pages
214–223, 2017.
[5] R. J. Berman. The Sinkhorn algorithm, parabolic optimal transport and geo-
metric Monge-Ampère equations. Numer. Math., 145(4):771–836, 2020.
[6] E. Bernton, P. Ghosal, and M. Nutz. Entropic optimal transport: Geometry
and large deviations. Duke Math. J., to appear, 2021. arXiv:2102.04397.
[7] J. Blanchet, A. Jambulapati, C. Kent, and A. Sidford. Towards optimal run-
ning times for optimal transport. Preprint arXiv:1810.07717v1, 2018.
[8] M. Blondel, V. Seguy, and A. Rolet. Smooth and sparse optimal transport. In
Proceedings of the Twenty-First International Conference on Artificial Intelli-
gence and Statistics, volume 84 of Proceedings of Machine Learning Research,
pages 880–889, 2018.
[9] F. Bolley and C. Villani. Weighted Csiszár-Kullback-Pinsker inequalities and
applications to transportation inequalities. Ann. Fac. Sci. Toulouse Math. (6),
14(3):331–352, 2005.
[10] G. Carlier. On the linear convergence of the multi-marginal Sinkhorn algo-
rithm. SIAM J. Optim., 32(2):786–794, 2022.
[11] G. Carlier, K. Eichinger, and A. Kroshnin. Entropic-Wasserstein barycenters:
PDE characterization, regularity, and CLT. SIAM J. Math. Anal., 53(5):5880–
5914, 2021.

32
[12] G. Carlier and M. Laborde. A differential approach to the multi-marginal
Schrödinger system. SIAM J. Math. Anal., 52(1):709–717, 2020.
[13] Y. Chen, T. Georgiou, and M. Pavon. Entropic and displacement interpo-
lation: a computational approach using the Hilbert metric. SIAM J. Appl.
Math., 76(6):2375–2396, 2016.
[14] Y. Chen, T. T. Georgiou, and M. Pavon. On the relation between optimal
transport and Schrödinger bridges: a stochastic control viewpoint. J. Optim.
Theory Appl., 169(2):671–691, 2016.
[15] V. Chernozhukov, A. Galichon, M. Hallin, and M. Henry. Monge-Kantorovich
depth, quantiles, ranks and signs. Ann. Statist., 45(1):223–256, 2017.
[16] R. Cominetti and J. San Martín. Asymptotic analysis of the exponential
penalty trajectory in linear programming. Math. Programming, 67(2, Ser.
A):169–187, 1994.
[17] G. Conforti and L. Tamanini. A formula for the time derivative of the entropic
cost and applications. J. Funct. Anal., 280(11):108964, 2021.
[18] A. Corenflos, J. Thornton, G. Deligiannidis, and A. Doucet. Differentiable
particle filtering via entropy-regularized optimal transport. In International
Conference on Machine Learning, pages 2100–2111. PMLR, 2021.
[19] I. Csiszár. I-divergence geometry of probability distributions and minimization
problems. Ann. Probability, 3:146–158, 1975.
[20] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport.
In Advances in Neural Information Processing Systems 26, pages 2292–2300.
2013.
[21] M. Cuturi, O. Teboul, and J.-P. Vert. Differentiable ranking and sorting using
optimal transport. In Advances in Neural Information Processing Systems,
volume 32, 2019.
[22] G. Deligiannidis, V. De Bortoli, and A. Doucet. Quantitative uniform stability
of the iterative proportional fitting procedure. Preprint arXiv:2108.08129v1,
2021.
[23] S. Di Marino and A. Gerolin. An optimal transport approach for the
Schrödinger bridge problem and convergence of Sinkhorn algorithm. J. Sci.
Comput., 85(2):Paper No. 27, 28, 2020.
[24] S. Di Marino and A. Gerolin. Optimal transport losses and Sinkhorn algorithm
with general convex regularization. Preprint arXiv:2007.00976v1, 2020.
[25] M. Essid and J. Solomon. Quadratically regularized optimal transport on
graphs. SIAM J. Sci. Comput., 40(4):A1961–A1986, 2018.
[26] G. B. Folland. Real analysis. Pure and Applied Mathematics. John Wiley &
Sons, New York, second edition, 1999.
[27] H. Föllmer. Random fields and diffusion processes. In École d’Été de Prob-
abilités de Saint-Flour XV–XVII, 1985–87, volume 1362 of Lecture Notes in
Math., pages 101–203. Springer, Berlin, 1988.
[28] H. Föllmer and A. Schied. Stochastic Finance: An Introduction in Discrete
Time. W. de Gruyter, Berlin, 3rd edition, 2011.
[29] J. Franklin and J. Lorenz. On the scaling of multidimensional matrices. Linear
Algebra Appl., 114/115:717–735, 1989.
[30] A. Genevay, L. Chizat, F. Bach, M. Cuturi, and G. Peyré. Sample complexity

33
of Sinkhorn divergences. In The 22nd International Conference on Artificial
Intelligence and Statistics, pages 1574–1583. PMLR, 2019.
[31] A. Genevay, M. Cuturi, G. Peyré, and F. Bach. Stochastic optimization for
large-scale optimal transport. In Advances in Neural Information Processing
Systems 29, pages 3440–3448. 2016.
[32] A. Genevay, G. Peyré, and M. Cuturi. Learning generative models with
Sinkhorn divergences. In Proceedings of the 21st International Conference
on Artificial Intelligence and Statistics, PMLR, pages 1608–1617, 2018.
[33] P. Ghosal, M. Nutz, and E. Bernton. Stability of entropic optimal transport
and Schrödinger bridges. Preprint arXiv:2106.03670v1, 2021.
[34] N. Gigli and L. Tamanini. Second order differentiation formula on
RCD∗ (K, N ) spaces. J. Eur. Math. Soc. (JEMS), 23(5):1727–1795, 2021.
[35] D. Lacker. A non-exponential extension of Sanov’s theorem via convex duality.
Adv. in Appl. Probab., 52(1):61–101, 2020.
[36] F. Léger. A gradient descent perspective on Sinkhorn. Appl. Math. Optim.,
84(2):1843–1855, 2021.
[37] C. Léonard. From the Schrödinger problem to the Monge-Kantorovich prob-
lem. J. Funct. Anal., 262(4):1879–1920, 2012.
[38] C. Léonard. A survey of the Schrödinger problem and some of its connections
with optimal transport. Discrete Contin. Dyn. Syst., 34(4):1533–1574, 2014.
[39] D. A. Lorenz, P. Manns, and C. Meyer. Quadratically regularized optimal
transport. Appl. Math. Optim., 83(3):1919–1949, 2021.
[40] P. Massart. Concentration inequalities and model selection, volume 1896 of
Lecture Notes in Mathematics. Springer, Berlin, 2007. Lectures from the 33rd
Summer School on Probability Theory held in Saint-Flour, July 6–23, 2003.
[41] G. Mena and J. Niles-Weed. Statistical bounds for entropic optimal transport:
sample complexity and the central limit theorem. In Advances in Neural
Information Processing Systems 32, pages 4541–4551. 2019.
[42] T. Mikami. Optimal control for absolutely continuous stochastic processes and
the mass transportation problem. Electron. Comm. Probab., 7:199–213, 2002.
[43] T. Mikami. Monge’s problem with a quadratic cost by the zero-noise limit of
h-path processes. Probab. Theory Related Fields, 129(2):245–260, 2004.
[44] M. Nutz. Introduction to Entropic Optimal Transport. Lecture notes,
Columbia University, 2021. https://www.math.columbia.edu/~mnutz/
docs/EOT_lecture_notes.pdf.
[45] M. Nutz and J. Wiesel. Entropic optimal transport: Convergence of potentials.
Probab. Theory Related Fields, to appear. arXiv:2104.11720v2.
[46] M. Nutz and J. Wiesel. Stability of Schrödinger potentials and convergence
of Sinkhorn’s algorithm. Preprint arXiv:2201.10059v1, 2022.
[47] S. Pal. On the difference between entropic cost and the optimal transport
cost. Preprint arXiv:1905.12206v1, 2019.
[48] G. Peyré and M. Cuturi. Computational optimal transport: With applications
to data science. Foundations and Trends in Machine Learning, 11(5-6):355–
607, 2019.
[49] A. Ramdas, N. García Trillos, and M. Cuturi. On Wasserstein two-sample

34
testing and related families of nonparametric tests. Entropy, 19(2):Paper No.
47, 15, 2017.
[50] Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover’s distance as a
metric for image retrieval. Int. J. Comput. Vis., 40:99–121, 2000.
[51] L. Rüschendorf. Convergence of the iterative proportional fitting procedure.
Ann. Statist., 23(4):1160–1174, 1995.
[52] C. Villani. Optimal transport, old and new, volume 338 of Grundlehren der
Mathematischen Wissenschaften. Springer-Verlag, Berlin, 2009.
[53] J. Weed. An explicit analysis of the entropic penalty in linear programming.
volume 75 of Proceedings of Machine Learning Research, pages 1841–1855,
2018.

VBook O&N
No ratings yet
VBook O&N
547 pages
Knot Theory
100% (3)
Knot Theory
348 pages
Optimal Transportation and Action Minimizing Measures PDF
No ratings yet
Optimal Transportation and Action Minimizing Measures PDF
251 pages
Computational Optimal Transport
No ratings yet
Computational Optimal Transport
209 pages
Ambrosio L., Gigli N. - A User's Guide To Optimal Transport-Web Draft (2009)
No ratings yet
Ambrosio L., Gigli N. - A User's Guide To Optimal Transport-Web Draft (2009)
128 pages
Grundlehren Der Mathematischen Wissenschaften 306: A Series of Comprehensive Studies in Mathematics
No ratings yet
Grundlehren Der Mathematischen Wissenschaften 306: A Series of Comprehensive Studies in Mathematics
361 pages
Optimal Transport Map Estimation in General Function Spaces
No ratings yet
Optimal Transport Map Estimation in General Function Spaces
49 pages
Iterative Bregman Projections For Regularized Transportation Problems
No ratings yet
Iterative Bregman Projections For Regularized Transportation Problems
29 pages
Tensor Optimal Transport Distance Between Sets of
No ratings yet
Tensor Optimal Transport Distance Between Sets of
33 pages
Asymptotics For Semi-Discrete Entropic Optimal
No ratings yet
Asymptotics For Semi-Discrete Entropic Optimal
26 pages
Quantitative Convergence of Quadratically Regularized Linear Programs
No ratings yet
Quantitative Convergence of Quadratically Regularized Linear Programs
15 pages
A Relaxed Inertial Forward-Backward-Forward Algorithm For Solving Monotone Inclusions With Application To Gans
No ratings yet
A Relaxed Inertial Forward-Backward-Forward Algorithm For Solving Monotone Inclusions With Application To Gans
37 pages
The O.D.E. Method For Convergence of Stochastic Approximation and Reinforcement Learning
No ratings yet
The O.D.E. Method For Convergence of Stochastic Approximation and Reinforcement Learning
23 pages
Rectified Flow
No ratings yet
Rectified Flow
24 pages
Li 2017
No ratings yet
Li 2017
25 pages
Exact Controllability For The Non Stationary Transport Equation
No ratings yet
Exact Controllability For The Non Stationary Transport Equation
29 pages
Debiased Sinkhorn Barycenters
No ratings yet
Debiased Sinkhorn Barycenters
27 pages
IBVPs For Inhomogeneous Systems of Balance Laws
No ratings yet
IBVPs For Inhomogeneous Systems of Balance Laws
32 pages
The Derivatives of Sinkhorn-Knopp Converge
No ratings yet
The Derivatives of Sinkhorn-Knopp Converge
24 pages
First Order Optimization Algorithms Via Discretization of Finite Time Convergent Flows
No ratings yet
First Order Optimization Algorithms Via Discretization of Finite Time Convergent Flows
24 pages
Sparse Regularized Optimal Transport With Deformed Q-Entropy
No ratings yet
Sparse Regularized Optimal Transport With Deformed Q-Entropy
27 pages
Quantitative Stability in Optimal Transport For General Power Costs
No ratings yet
Quantitative Stability in Optimal Transport For General Power Costs
25 pages
Sparsity-Constrained Optimal Transport
No ratings yet
Sparsity-Constrained Optimal Transport
26 pages
Accelerating Sinkhorn Algorithm With Sparse Newton Iterations
No ratings yet
Accelerating Sinkhorn Algorithm With Sparse Newton Iterations
20 pages
Optimization: A Journal of Mathematical Programming and Operations Research
No ratings yet
Optimization: A Journal of Mathematical Programming and Operations Research
14 pages
Dynamical Inertial Extragradient Techniques For Solving Equilibrium and Fixed-Point Problems in Real Hilbert Spaces
No ratings yet
Dynamical Inertial Extragradient Techniques For Solving Equilibrium and Fixed-Point Problems in Real Hilbert Spaces
36 pages
Lipschitz Continuity and Semiconc
No ratings yet
Lipschitz Continuity and Semiconc
14 pages
Probabilistic Inverse Optimal Transport: Wei-Ting Chiu and Pei Wang Contribute Equally
No ratings yet
Probabilistic Inverse Optimal Transport: Wei-Ting Chiu and Pei Wang Contribute Equally
18 pages
Solodov - Siam Control Optim
No ratings yet
Solodov - Siam Control Optim
12 pages
Sinkhorn Distances: Lightspeed Computation of Optimal Transport
No ratings yet
Sinkhorn Distances: Lightspeed Computation of Optimal Transport
9 pages
J. Vovelle and S. Martin - Large-Time Behavior of Entropy Solutions To Scalar Conservation Laws On Bounded Domain
No ratings yet
J. Vovelle and S. Martin - Large-Time Behavior of Entropy Solutions To Scalar Conservation Laws On Bounded Domain
21 pages
A-A Regularized Robust Design Criterion For Uncertain Data
No ratings yet
A-A Regularized Robust Design Criterion For Uncertain Data
23 pages
Stability and Finite Element Error Analysis For The Helmholtz Equation With Variable Coefficients
No ratings yet
Stability and Finite Element Error Analysis For The Helmholtz Equation With Variable Coefficients
25 pages
Nonlinear Analysis: Alexandra Smirnova, Necibe Tuncer
No ratings yet
Nonlinear Analysis: Alexandra Smirnova, Necibe Tuncer
12 pages
Quadratically Regularized Optimal Transport
No ratings yet
Quadratically Regularized Optimal Transport
28 pages
23 Ecp525
No ratings yet
23 Ecp525
14 pages
OTNote
No ratings yet
OTNote
46 pages
De Vecchi - Rigoni - A Description Based On Optimal Transport For A Class of Stochastic McKean-Vlasov Control Problems
No ratings yet
De Vecchi - Rigoni - A Description Based On Optimal Transport For A Class of Stochastic McKean-Vlasov Control Problems
54 pages
Notes On Conservation-Laws
No ratings yet
Notes On Conservation-Laws
50 pages
Anisotropic Elliptic Equations With Gradient-Depen
No ratings yet
Anisotropic Elliptic Equations With Gradient-Depen
33 pages
Integration of Subdifferentials OF Nonconvex Functions: in Nonsmooth
No ratings yet
Integration of Subdifferentials OF Nonconvex Functions: in Nonsmooth
14 pages
Stabilized Sparse Scaling Algorithms For Entropy Regularized Transport Problems
No ratings yet
Stabilized Sparse Scaling Algorithms For Entropy Regularized Transport Problems
30 pages
Fuzzy and Neural Control
No ratings yet
Fuzzy and Neural Control
181 pages
Differential Properties of Sinkhorn Approximation For Learning With Wasserstein Distance
No ratings yet
Differential Properties of Sinkhorn Approximation For Learning With Wasserstein Distance
1 page
Gradient Methods For Nonsmooth Problems
No ratings yet
Gradient Methods For Nonsmooth Problems
26 pages
Course Optimal Transport
No ratings yet
Course Optimal Transport
46 pages
Continuous Maximal Flows and Wulff Shapes: Application To Mrfs
No ratings yet
Continuous Maximal Flows and Wulff Shapes: Application To Mrfs
8 pages
Doi:10 3934/dcds 2022096
No ratings yet
Doi:10 3934/dcds 2022096
26 pages
Formulario de Transformadas de Laplace
No ratings yet
Formulario de Transformadas de Laplace
1 page
Abd Elkareem Soliman PAPER 17
No ratings yet
Abd Elkareem Soliman PAPER 17
6 pages
Introduction To Numerical Analysis For Engineers: - Fundamentals of Digital Computing
No ratings yet
Introduction To Numerical Analysis For Engineers: - Fundamentals of Digital Computing
12 pages
Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances
No ratings yet
Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances
13 pages
MBeck Dissertation
No ratings yet
MBeck Dissertation
153 pages
Saldana
No ratings yet
Saldana
36 pages
Dynamical Properties of Hamilton-Jacobi Equations Via The Nonlinear Adjoint Method Large Time Behavior and Discounted Approximation
No ratings yet
Dynamical Properties of Hamilton-Jacobi Equations Via The Nonlinear Adjoint Method Large Time Behavior and Discounted Approximation
100 pages
A Neural Network Approach For Solving Optimal Control Problems With Inequality Constraints and Some Applications
No ratings yet
A Neural Network Approach For Solving Optimal Control Problems With Inequality Constraints and Some Applications
29 pages
Velammal Engineering College Department of Cse Design and Analysis of Algorithm Unit V Notes Decision Trees (2 Marks)
No ratings yet
Velammal Engineering College Department of Cse Design and Analysis of Algorithm Unit V Notes Decision Trees (2 Marks)
29 pages
JuraevMammadzada 3
No ratings yet
JuraevMammadzada 3
6 pages
Grade 9 - TIMO 202021 Heat
100% (1)
Grade 9 - TIMO 202021 Heat
9 pages
Matbematicae: Ordinary Differential Equations, Transport Theory and Sobolev Spaces
No ratings yet
Matbematicae: Ordinary Differential Equations, Transport Theory and Sobolev Spaces
37 pages
Home MR Ahmed Bayoumy
No ratings yet
Home MR Ahmed Bayoumy
1 page
QA Shortcuts
No ratings yet
QA Shortcuts
81 pages
Jean Paul Penot1989
No ratings yet
Jean Paul Penot1989
15 pages
Jhele Rene Advanced Microeconomics
No ratings yet
Jhele Rene Advanced Microeconomics
44 pages
Chapter 6 - Stress and Strain - A4
0% (1)
Chapter 6 - Stress and Strain - A4
15 pages
Cauchy Best of Best Est-P-9
No ratings yet
Cauchy Best of Best Est-P-9
22 pages
DS Assignment 1
100% (1)
DS Assignment 1
9 pages
Slides - Graph Signal Processing: Fundamentals and Applications To Diffusion Processes
No ratings yet
Slides - Graph Signal Processing: Fundamentals and Applications To Diffusion Processes
118 pages
Jcam 2024
No ratings yet
Jcam 2024
8 pages
Optimal Transport Old and New
No ratings yet
Optimal Transport Old and New
998 pages
Stability of Linear Systems: 11.1 Some Definitions
No ratings yet
Stability of Linear Systems: 11.1 Some Definitions
8 pages
Stella Octangula Number
100% (1)
Stella Octangula Number
2 pages
Section3 PDF
No ratings yet
Section3 PDF
10 pages
Wiki - DiGamma Function
No ratings yet
Wiki - DiGamma Function
6 pages
Singular Arcs On Average Optimal Control-Affine Problems: M.S. Aronna, G. de Lima Monteiro and O. Sierra
No ratings yet
Singular Arcs On Average Optimal Control-Affine Problems: M.S. Aronna, G. de Lima Monteiro and O. Sierra
6 pages
Ch06 - Probality and Random Process
No ratings yet
Ch06 - Probality and Random Process
42 pages
Home Work 5
No ratings yet
Home Work 5
6 pages
Reflection
No ratings yet
Reflection
8 pages
Circles - Past Edexcel Exam Questions: C Studywell Publications Ltd. 2015
No ratings yet
Circles - Past Edexcel Exam Questions: C Studywell Publications Ltd. 2015
12 pages
5.8 Sketching Graphs of Derivatives: Practice
No ratings yet
5.8 Sketching Graphs of Derivatives: Practice
5 pages
Robust Shape Matching With OT
No ratings yet
Robust Shape Matching With OT
175 pages
Lec-14 Traversing A Binary Tree
No ratings yet
Lec-14 Traversing A Binary Tree
12 pages
6 Infinite Random Sequences: 6a Introductory Remarks Almost Certainty
No ratings yet
6 Infinite Random Sequences: 6a Introductory Remarks Almost Certainty
16 pages
Assignment 2
No ratings yet
Assignment 2
4 pages
2021 G6
No ratings yet
2021 G6
14 pages
A Geometric View of Optimal Transportation and Generative Model
No ratings yet
A Geometric View of Optimal Transportation and Generative Model
21 pages
Inde. Tntrgral-2
No ratings yet
Inde. Tntrgral-2
69 pages
Hypergraph Co-Optimal Transport: Metric and Categorical Properties
No ratings yet
Hypergraph Co-Optimal Transport: Metric and Categorical Properties
21 pages
Spectral Distances On Graphs
No ratings yet
Spectral Distances On Graphs
11 pages
Slides - Graph Signal Processing and Applications in Neuroscience
No ratings yet
Slides - Graph Signal Processing and Applications in Neuroscience
103 pages
The Emerging Field of Signal Processing On Graphs
No ratings yet
The Emerging Field of Signal Processing On Graphs
14 pages
Transport Inequalities. A Survey
No ratings yet
Transport Inequalities. A Survey
82 pages
Distance Distributions and Inverse Problems For Metric Measure
No ratings yet
Distance Distributions and Inverse Problems For Metric Measure
71 pages
A Linear Transportation LP Distance For Pattern Recognition
No ratings yet
A Linear Transportation LP Distance For Pattern Recognition
41 pages
Uniqueness and Monge Solutions in The Multimarginal OT Problem
No ratings yet
Uniqueness and Monge Solutions in The Multimarginal OT Problem
20 pages
Basic Calculus Q4 Optimization
No ratings yet
Basic Calculus Q4 Optimization
7 pages
DRJK ED Proj of Points PPT Notes - Compressed
No ratings yet
DRJK ED Proj of Points PPT Notes - Compressed
12 pages
Understanding The Basis of Graph Signal Processing Via An Intuitive Example-Driven Approach
No ratings yet
Understanding The Basis of Graph Signal Processing Via An Intuitive Example-Driven Approach
10 pages
Multi-Marginal Optimal Transport Defines A Generalized Metric
No ratings yet
Multi-Marginal Optimal Transport Defines A Generalized Metric
17 pages
SWE Math4543 Lecture1
No ratings yet
SWE Math4543 Lecture1
25 pages
A Multiscale Approach To Optimal Transport
No ratings yet
A Multiscale Approach To Optimal Transport
19 pages
NCERT Solutions For Class 10 Maths Chapter 1 Real Numbers Ex 1.1 - Free PDF Download
No ratings yet
NCERT Solutions For Class 10 Maths Chapter 1 Real Numbers Ex 1.1 - Free PDF Download
4 pages
10 Maths 20210621133107
No ratings yet
10 Maths 20210621133107
2 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

Quantitative Stability of Regularized Optimal Transport

Uploaded by

Quantitative Stability of Regularized Optimal Transport

Uploaded by

Quantitative Stability of Regularized Optimal

Transport and Convergence of Sinkhorn’s

Keywords Entropic Optimal Transport; Stability; Sinkhorn’s Algorithm; IPFP

Here Π(µ1 , µ2 ) is the set of couplings of the given marginals µ1 , µ2 and

2 Setting and Notation

W∞ (µ, ν) = inf ess sup dY (x, y),

Unless otherwise noted, p-Wasserstein distances on X are understood with

where Π(µ1 , . . . , µN ) ⊂ P(X) denotes the set of couplings of the marginals µi .

|c(x)| ≤ C(1 + dX,p (x, x̂)p ) (2.2)

for some C > 0 and x̂ ∈ X, whereas for p = ∞ the meaning is that c

Remark 2.1. A variation of (2.3) uses entropy relative to a reference mea-

Definition 3.1 (Shadow). Let p ∈ [1, ∞] and µi , µ̃i ∈ Pp (Xi ), i = 1, . . . , N .

In general, the Wp -optimal kernel Ki need not be unique, so that there

Lemma 3.2. Let p ∈ [1, ∞] and µi , µ̃i ∈ Pp (Xi ), i = 1, . . . , N . Given

Wp (π, π̃) = Wp (µ1 , . . . , µN ; µ̃1 , . . . , µ̃N ),

To study the continuity properties of regularized optimal transport, we

for all π ∈ Π(µ1 , . . . , µN ) and π̃ ∈ Π(µ̃1 , . . . , µ̃N ).2

The example is a special case of the following observation.

where Mp (µ) := ( kxkp µ(dx))1/p for µ ∈ P(Rd ) and Cp is a constant de-

Theorem 3.7 (Continuity of Value). Let p ∈ [1, ∞].

(i) Let µi , µni ∈ Pp (Xi ) satisfy limn Wp (µi , µni ) = 0 for i = 1, . . . , N .

|S(µ1 , . . . , µN , c) − S(µ̃1 , . . . , µ̃N , c)| ≤ LWp (µ1 , . . . , µN ; µ̃1 , . . . , µ̃N ).

Remark 3.8 (Γ-Convergence). Define F : Pp (X) → R ∪ {∞} by

For the recovery sequence in (b), we can choose πn ∈ Π(µn1 , . . . , µnN ) to be

for all π, θ ∈ Π(µ1 , . . . , µN ). The two inequalities serve a similar purpose,

Lemma 3.10. (i) Let X 0 := X2 × · · · × XN and suppose that

diamq (X 0 ) := sup dX 0 ,q (x, y) < ∞.

(0, ∞) and x̂i ∈ Xi , then (Iq ) holds with constant

3.3 Stability through Transformation

where a := 2 exp(N kck∞ ) and ∆ := Wp (µ1 , . . . , µN ; µ̃1 , . . . , µ̃N ). If µi , µ̃i

In particular, (µ1 , . . . , µN ) 7→ π ∗ is p+1

As discussed in the Introduction, this result is based on a transformation:

3.4 Application to Sinkhorn’s Algorithm

Theorem 3.15 (Sinkhorn Convergence). Let p ∈ [1, ∞). For i = 1, 2, let

(i) Let c be continuous with growth of order p. As n → ∞, we have

Theorem 3.15 with p = q = 2 implies W2 -convergence for quadratic cost

denote by µK ∈ P(Y2 ) the second marginal of µ ⊗ K ∈ P(Y1 × Y2 ). (4.1)

Df (µK, νK) ≤ Df (µ, ν).

Proof. We may assume that µ  ν. For any kernels K1  K2 : Y1 → P(Y2 ),

d(µ ⊗ K1 ) dµ dK1 (x)

Df (µ, ν) = Df (µ ⊗ K, ν ⊗ K). (4.3)

Whereas in general, (4.2) and Jensen’s inequality for f yield

Denote by µ ⊗ K = (µK) ⊗ K̃1 and ν ⊗ K = (νK) ⊗ K̃2 the “reverse”

Df (µ ⊗ K, ν ⊗ K) = Df ((µK) ⊗ K̃1 , (νK) ⊗ K̃2 ) ≥ Df (µK, νK).

In view of (4.3), this yields the claim.

We can now show the two fundamental properties of the shadow.

Proof of Lemma 3.2. Let µi ⊗ Ki ∈ Π(µi , µ̃i ) be a Wp -optimal coupling and

On the other hand, given an arbitrary coupling π̃ ∈ Π(µ̃1 , . . . , µ̃N ), any

Df (π̃, µ̃1 ⊗· · ·⊗ µ̃N ) = Df (πK, (µ1 ⊗· · ·⊗µN )K) ≤ Df (π, µ1 ⊗· · ·⊗µN ).

We estimate the first integral; the second is treated analogously. Hölder’s

As |f (x)| ≤ Cf [1 + dX1 (x1 , x̄1 )l + · · · + dXN (xN , x̄N )l ] with l ≤ p − 1 =

kg(x) − g(y)kLp (κ) ≤ Lipp (g)Wp (π, π̃)

then c(x) = ψ(x2 − x1 ) and the quantity to be estimated is

Using differentiation under the integral (justified by [26, Theorem 2.27]), we

4.2 Stability through Shadows

by lower semicontinuity of c d(·) + Df (·, ·) and optimality of π ∗ . On the

Together, limn c dπn∗ + Df (πn∗ , Pn ) = c dπ ∗ + Df (π ∗ , P ) and π must be

the (unique) optimizer π ∗ of S(µ1 , . . . , µN , c). In particular, the original

≥ S(µ̃1 , . . . , µ̃N , c) − LWp (µ1 , . . . , µN ; µ̃1 , . . . , µ̃N ).

The proof of Γ-convergence follows the same line of argument.

Proof of Remark 3.8. Similarly

for some α̃ > 0 and x̂ ∈ X, with constant

Here the term inside the exponential

where π1 , . . . , πN are the marginals of π.

4.3 Stability with respect to Cost

Lemma 4.5. Let π ∗ , π̃ ∗ be the respective optimizers of Sent (µ1 , . . . , µN , c)

Proof. Lemma 4.4 yields

Lemma 4.5 clearly implies a Lipschitz estimate with respect to kc − c̃k∞

Proof of Proposition 3.12. Combining

Recall that by duality (e.g., [23, 44]), for certain “potentials” ϕi : Xi → R,

where the inequality is due to c ≥ 0. To estimate the right-hand side

Using this in (4.13), we conclude that

As (Iq ) implies 2Cq−2q Wq (π ∗ , π̃ ∗ )2q ≤ DKL (π ∗ , π̃ ∗ ) + DKL (π̃ ∗ , π ∗ ), this proves

4.4 Stability through Transformation

Lemma 4.6. Let p ∈ [1, ∞] and let c be Lipp (c)-Lipschitz. Then

kc − KckLp (π) ≤ Lipp (c)Wp (µ1 , . . . , µN ; µ̃1 , . . . , µ̃N ), π ∈ Π(µ1 , . . . , µN ).

Proof. We may assume that µ ν. For any kernels K1 K2 : Y1 → P(Y2 ),

Wp (νi , νi ) ≤ , Wp (ν̃i , ν̃i ) ≤ (4.17)

Ti : {qk : k ∈ N} × {uj,k : j, k ∈ N} → {qk : k ∈ N} × {uj,k : j, k ∈ N},

Ti (qk , uj,k ) := (qj , uj,k )

kc̄ − c̄ ◦ T kLp (P ) ≤ Lipp (c) ∆()

Wq (θ , θ̃ ) ≤ Wq (θ , θ̂ ) + Wq (θ̂ , θ̃ )

Letting → 0, the left-hand side converges to Wq (π ∗ , π̃ ∗ ) by Theorem 3.11