Distributionally Robust Optimization
Daniel Kuhn
Risk Analytics and Optimization Chair,
École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
E-mail: daniel.kuhn@epfl.ch
Soroosh Shafiee
School of Operations Research and Information Engineering,
Cornell University, Ithaca, NY, USA
E-mail: shafiee@cornell.edu
Wolfram Wiesemann
Imperial College Business School,
Imperial College London, London, United Kingdom
E-mail: ww@imperial.ac.uk
Distributionally robust optimization (DRO) studies decision problems under uncer-
tainty where the probability distribution governing the uncertain problem parameters
is itself uncertain. A key component of any DRO model is its ambiguity set, that
is, a family of probability distributions consistent with any available structural or
statistical information. DRO seeks decisions that perform best under the worst dis-
tribution in the ambiguity set. This worst case criterion is supported by findings
in psychology and neuroscience, which indicate that many decision-makers have a
low tolerance for distributional ambiguity. DRO is rooted in statistics, operations re-
search and control theory, and recent research has uncovered its deep connections to
regularization techniques and adversarial training in machine learning. This survey
presents the key findings of the field in a unified and self-contained manner.
CONTENTS
1 Introduction
2 Ambiguity Sets
3 Topological Properties of Ambiguity Sets
4 Duality Theory for Worst-Case Expectation Problems
5 Duality Theory for Worst-Case Risk Problems
6 Analytical Solutions of Nature’s Subproblem
7 Finite Convex Reformulations of Nature’s Subproblem
8 Regularization by Robustification
9 Numerical Solution Methods for DRO Problems
10 Statistical Guarantees
References
1. Introduction
Traditionally, mathematical optimization studies problems of the form
inf_{𝑥 ∈ X} ℓ(𝑥),
where a decision 𝑥 is sought from the set X ⊆ R𝑛 of feasible solutions that minim-
izes a loss function ℓ : R𝑛 → R. With its early roots in the development of calculus
by Isaac Newton, Gottfried Wilhelm Leibniz, Pierre de Fermat and others in the
late 17th century, mathematical optimization has a rich history that involves con-
tributions from numerous mathematicians, economists, engineers, and scientists.
The birth of modern mathematical optimization is commonly credited to George
Dantzig, whose simplex algorithm developed in 1947 solves linear optimization
problems where ℓ is affine and X is a polyhedron (Dantzig 1956). Subsequent mile-
stones include the development of the rich theory of convex analysis (Rockafellar
1970) as well as the discovery of polynomial-time solution methods for linear
(Khachiyan 1979, Karmarkar 1984) and broad classes of nonlinear convex optim-
ization problems (Nesterov and Nemirovskii 1994).
Classical optimization problems are deterministic, that is, all problem data are as-
sumed to be known with certainty. However, most decision problems encountered
in practice depend on parameters that are corrupted by measurement errors or that
are revealed only after a decision must be determined and committed. A naïve
approach to model uncertainty-affected decision problems as deterministic optim-
ization problems would be to replace all uncertain parameters with their expected
values or with appropriate point predictions. However, it has long been known
and well-documented that decision-makers who replace an uncertain parameter of
an optimization problem with its mean value fall victim to the ‘flaw of averages’
(Savage, Scholtes and Zweidler 2006, Savage 2012). In order to account for un-
certainty realizations that deviate from the mean value, Beale (1955) and Dantzig
(1955) independently introduced stochastic programs of the form
inf_{𝑥 ∈ X} E P [ℓ(𝑥, 𝑍)],   (1.1)
which model the uncertain problem parameters 𝑍 as a random vector governed by a probability distribution P and seek a decision that performs best in expectation under P. Distributionally robust optimization (DRO), by contrast, assumes only that P belongs to an ambiguity set P and seeks a decision that performs best in view of its expected value under the worst distribution P ∈ P. DRO thus blends the distributional perspective of stochastic programming with the worst-case focus of robust optimization.
Historically, the term ‘distributional robustness’ has its roots in robust statistics.
The term was coined by Huber (1981) to describe methods aimed at making
robust decisions in the presence of outlier data points. This idea expanded upon
earlier works by Box (1953, 1979), who explores robustness in situations where the
underlying distribution deviates from normality, a common assumption underlying
many statistical models. To address the challenges posed by outliers, statisticians
have developed several contamination models, each offering a unique approach
to mitigating data irregularities. The Huber contamination model, introduced by
Huber (1964, 1968) and further developed by Hampel (1968, 1971), assumes that
the observed data is drawn from a mixture of the true distribution and an arbitrary
contaminating distribution. Neighborhood contamination models define deviations
from the true distribution in terms of statistical distances such as the total variation
(Donoho and Liu 1988) or Wasserstein distances (Zhu, Jiao and Steinhardt 2022a,
Liu and Loh 2023). More recently, data-dependent adaptive contamination models
allow for a fraction of the observed data points to be replaced with points drawn
from an arbitrary distribution (Diakonikolas, Kamath, Kane, Li, Moitra and Stewart
2019, Zhu et al. 2022a). Interestingly, the optimistic counterpart of a DRO model,
which optimizes in view of the best (as opposed to the worst) distribution in the
ambiguity set, recovers many estimators from robust statistics (Blanchet, Li, Lin and
Zhang 2024b, Jiang and Xie 2024). For a survey of recent advances in algorithmic
robust statistics we refer to Diakonikolas and Kane (2023).
Robust and distributionally robust optimization have found manifold applications
in machine learning. For example, popular regularizers from the machine learning
literature are known to admit a robustness interpretation, which offers theoretical
insights into the strong empirical performance of regularization in practice (Xu,
Caramanis and Mannor 2009, Shafieezadeh-Abadeh, Kuhn and Mohajerin Esfa-
hani 2019, Li, Lin, Blanchet and Nguyen 2022, Gao, Chen and Kleywegt 2024).
Likewise, optimistic counterparts of DRO models that optimize in view of the
best (as opposed to the worst) distribution in the ambiguity set give rise to upper
confidence bound algorithms that are ubiquitous in the bandit and reinforcement
learning literature (Blanchet et al. 2024b, Jiang and Xie 2024). DRO is also related
to adversarial training, which aims to improve the generalization performance of a
machine learning model by training it in view of adversarial examples (Goodfellow,
Shlens and Szegedy 2015). Adversarial examples are perturbations of existing data
points that are designed to mislead a model into making incorrect predictions.
There are also deep connections between DRO and extensions of stochastic (dy-
namic) programming that replace the expected value with coherent risk measures.
Similar to the expected value, a risk measure maps random variables to exten-
ded real numbers. In contrast to the expected value, which is risk-neutral since
it weighs positive and negative outcomes equally, risk measures most commonly
assign greater weights to negative outcomes and thus account for the risk aversion
frequently observed among decision-makers. Artzner, Delbaen, Eber and Heath (1999) and Delbaen (2002) show that risk measures satisfying the axioms of coherence admit a dual representation as worst-case expectations over a family of probability distributions.
ous versus risky decisions. Genetic factors may influence an individual’s tendency
toward ambiguity aversion. He, Xue, Chen, Lu, Dong, Lei, Ding, Li, Li, Chen, Li,
Moyzis and Bechara (2010) link certain genetic polymorphisms to the perform-
ance of individuals in decision-making under risk and ambiguity. In a separate
study, Buckert, Schwieren, Kudielka and Fiebach (2014) examine how hormonal
changes, such as higher cortisol levels which are linked to stress and anxiety, affect
decision-making under risk and ambiguity. These findings collectively suggest that
perceptions of risk and ambiguity are not just a cognitive phenomenon but also in-
fluenced by brain structures and genetic and hormonal factors that shape individual
differences in decision-making under ambiguity. Finally, we mention Hartley and
Somerville (2015) and Blankenstein, Crone, van den Bos and van Duijvenvoorde
(2016), who examine how ambiguity aversion differs between children, adoles-
cents and adults, and Hayden, Heilbronner and Platt (2010), who observed that
rhesus macaque monkeys also exhibit ambiguity aversion when offered the choice between risky and ambiguous games with large and small juice outcomes.
The remainder of this survey is structured as follows. A significant part of our
analysis is dedicated to studying the worst-case expectation supP∈P E P [ℓ(𝑥, 𝑍)],
which constitutes the objective function of the DRO problem (1.2). Evaluating this
expression typically requires the solution of a semi-infinite optimization problem
over infinitely many variables that characterize the probability distribution P, sub-
ject to finitely many constraints imposed by the ambiguity set P. This problem,
which we refer to as nature’s subproblem, is the key feature that distinguishes
the DRO problem (1.2) from deterministic, stochastic, and robust optimization
problems. Sections 2 and 3 review commonly studied ambiguity sets P and their
topological properties, focusing especially on conditions under which nature’s subproblem attains its optimal value. Sections 4 and 5 develop a duality theory for nature’s subproblem that allows us to upper bound or equivalently reformulate the worst-case expectation as a semi-infinite optimization problem over finitely many dual decision variables that are subject to infinitely many constraints. This duality
framework lays the foundations for the analytical solution of nature’s subproblem
in Section 6, which relies on constructing primal and dual feasible solutions that
yield the same objective value and thus enjoy strong duality. Sections 7 and 8 lever-
age the same duality theory to develop equivalent reformulations and conservative
approximations of nature’s subproblem as well as the overall DRO problem (1.2).
Section 9 demonstrates how the duality theory gives rise to numerical solution
techniques for nature’s subproblem and the full DRO problem. Finally, Section 10
reviews the statistical guarantees enjoyed by different ambiguity sets.
Length restrictions dictated difficult trade-offs in the choice of topics covered
by this survey. We decided to focus on the most commonly used ambiguity sets
and to only briefly review other possible choices, such as marginal ambiguity
sets, ambiguity sets with structural constraints (including, e.g., symmetry and
unimodality), Sinkhorn ambiguity sets or conditional relative entropy ambiguity
sets. Likewise, we do not cover the important but somewhat more advanced topics
1.1. Notation
All vector spaces considered in this paper are defined over the real numbers. For
brevity, we simply refer to them as ‘vector spaces’ instead of ‘real vector spaces.’
We use R = R ∪ {−∞, ∞} to denote the extended reals. The effective domain of
a function 𝑓 : R𝑑 → R is defined as dom( 𝑓 ) = {𝑧 ∈ R𝑑 : 𝑓 (𝑧) < ∞}, and the
epigraph of 𝑓 is defined as epi( 𝑓 ) = {(𝑧, 𝛼) ∈ R𝑑 × R : 𝑓 (𝑧) ≤ 𝛼}. We say that 𝑓
is proper if dom( 𝑓 ) ≠ ∅ and 𝑓 (𝑧) > −∞ for all 𝑧 ∈ R𝑑 . The convex conjugate of
𝑓 is the function 𝑓 ∗ : R𝑑 → R defined through 𝑓 ∗ (𝑦) = sup 𝑧 ∈R𝑑 𝑦 ⊤ 𝑧 − 𝑓 (𝑧). A
convex function 𝑓 is called closed if it is proper and lower semicontinuous or if it
is identically equal to +∞ or to −∞. One can show that 𝑓 is closed if and only if it
coincides with its bi-conjugate 𝑓 ∗∗ , that is, with the conjugate of 𝑓 ∗ . If 𝑓 is proper,
convex and lower semicontinuous, then its recession function 𝑓 ∞ : R𝑑 → R is
defined through 𝑓 ∞ (𝑧) = lim 𝛼→∞ 𝛼 −1 ( 𝑓 (𝑧0 + 𝛼𝑧) − 𝑓 (𝑧0 )), where 𝑧0 is any point
in dom( 𝑓 ) (Rockafellar 1970, Theorem 8.5). The perspective of 𝑓 is the function
𝑓 𝜋 : R𝑑 × R → R defined through 𝑓 𝜋 (𝑧, 𝑡) = 𝑡 𝑓 (𝑧/𝑡) if 𝑡 > 0, 𝑓 𝜋 (𝑧, 𝑡) = 𝑓 ∞ (𝑧)
if 𝑡 = 0 and 𝑓 𝜋 (𝑧, 𝑡) = ∞ if 𝑡 < 0. One can show that 𝑓 𝜋 is proper, convex
and lower semicontinuous (Rockafellar 1970, page 67). When there is no risk
of confusion, we occasionally use 𝑡 𝑓 (𝑧/𝑡) to denote 𝑓 𝜋 (𝑧, 𝑡) even if 𝑡 = 0. The
indicator function 𝛿Z : R𝑑 → R of a set Z ⊆ R𝑑 is defined through 𝛿Z (𝑧) = 0
if 𝑧 ∈ Z and 𝛿Z (𝑧) = ∞ if 𝑧 ∉ Z. The conjugate 𝛿Z∗ of 𝛿Z is called the support
function of Z. Thus, it satisfies 𝛿Z∗ (𝑦) = sup_{𝑧 ∈ Z} 𝑦⊤𝑧. Random objects are denoted
by capital letters (e.g., 𝑍) and their realizations are denoted by the corresponding
lowercase letters (e.g., 𝑧). For any closed set Z ⊆ R𝑑 , we use M(Z) to denote the
space of all finite signed Borel measures on Z, while M+ (Z) stands for the convex
cone of all (non-negative) Borel measures in M(Z), and P(Z) stands for the convex
set of all probability distributions in M+ (Z). The expectation operator with respect
to P ∈ P(Z) is defined through E P [ 𝑓 (𝑍)] = ∫Z 𝑓 (𝑧) dP(𝑧) for any Borel function
𝑓 : Z → R. If the integrals of the positive and the negative parts of 𝑓 both evaluate
to ∞, then we define E P [ 𝑓 (𝑍)] ‘adversarially.’ That is, we set E P [ 𝑓 (𝑍)] = ∞ (−∞)
if the integral appears in the objective function of a minimization (maximization)
problem. The Dirac probability distribution that assigns unit probability to 𝑧 ∈ Z is
denoted as 𝛿 𝑧 . The Dirac distribution 𝛿 𝑧 should not be confused with the indicator
function 𝛿{𝑧} of the singleton {𝑧}. For any P ∈ P(Z) and any Borel measurable
transformation 𝑓 : Z → Z′ between Borel sets Z ⊆ R𝑑 and Z′ ⊆ R𝑑′, we denote
by P ◦ 𝑓 −1 the pushforward distribution of P under 𝑓 . Thus, if 𝑍 is a random vector
on Z governed by P, then 𝑓 (𝑍) is a random vector on Z′ governed by P ◦ 𝑓 −1 . The
closure, the interior and the relative interior of a set Z ⊆ R𝑑 are denoted by cl(Z),
int(Z) and rint(Z), respectively. We use R+𝑑 and R++𝑑 to denote the non-negative
orthant in R𝑑 and its interior. In addition, we use S𝑑 to denote the space of all
symmetric matrices in R𝑑×𝑑 , and S+𝑑 to denote the cone of all positive semidefinite matrices in S𝑑 .
2. Ambiguity Sets
An ambiguity set P is a family of probability distributions on a common measurable
space. Throughout this paper we assume that P ⊆ P(Z), where P(Z) denotes
the entirety of all Borel probability distributions on a closed set Z ⊆ R𝑑 . This
section reviews popular classes of ambiguity sets. For each class, we first give a
formal definition and provide historical background information. Subsequently, we
exemplify important instances of ambiguity sets and highlight how they are used.
Thus, the Markov ambiguity set (2.2) contains all distributions supported on Z that
share the same mean vector 𝜇. However, these distributions may have dramatically
different shapes and higher-order moments. Worst-case expectations over Markov
ambiguity sets are sometimes used as efficiently computable upper bounds on
the expected cost-to-go functions in stochastic programming. If the cost-to-go
functions are concave in the uncertain problem parameters, then these worst-case
expectations are closely related to Jensen’s inequality (Jensen 1906); see also
Section 6.1. If the cost-to-go functions are convex and Z is a polyhedron, on
the other hand, then these worst-case expectations are related to the Edmundson-
Madansky inequality (Edmundson 1956, Madansky 1959); see also Section 6.2.
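On a one-dimensional support set Z = [𝑎, 𝑏] with prescribed mean 𝜇, both worst-case expectations admit simple closed forms: Jensen’s inequality implies that the Dirac distribution at 𝜇 is worst for concave losses, while the Edmundson-Madansky bound shows that a two-point distribution on the endpoints is worst for convex losses. A minimal one-dimensional sketch (the function names are our own):

```python
def worst_case_expectation_concave(f, mu):
    # Jensen: for concave f, the worst distribution with mean mu is the
    # Dirac distribution at mu, so sup_P E_P[f(Z)] = f(mu).
    return f(mu)

def worst_case_expectation_convex(f, a, b, mu):
    # Edmundson-Madansky: for convex f on [a, b] with mean mu, the worst
    # distribution places mass (b - mu)/(b - a) on a and (mu - a)/(b - a) on b.
    p_b = (mu - a) / (b - a)
    return (1.0 - p_b) * f(a) + p_b * f(b)

# Example: convex loss f(z) = z^2 on [0, 2] with mean 1
assert worst_case_expectation_convex(lambda z: z * z, 0.0, 2.0, 1.0) == 2.0
assert worst_case_expectation_concave(lambda z: -z * z, 1.0) == -1.0
```

Both bounds are tight over the Markov ambiguity set because the extremal distributions themselves have mean 𝜇 and are supported on [𝑎, 𝑏].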
2019, Rontsis, Osborne and Goulart 2020), stochastic programming (Birge and
Wets 1986, Dulá and Murthy 1992, Dokov and Morton 2005, Bertsimas, Doan,
Natarajan and Teo 2010, Natarajan, Teo and Zheng 2011), control (Van Parys, Kuhn,
Goulart and Morari 2015, Yang 2018, Xin and Goldberg 2021, 2022), the operation
of power systems (Xie and Ahmed 2017, Zhao and Jiang 2017), complex network
analysis (Van Leeuwaarden and Stegehuis 2021, Brugman, Van Leeuwaarden and
Stegehuis 2022), queuing systems (van Eekelen, Hanasusanto, Hasenbein and van
Leeuwaarden 2023), healthcare (Mak, Rong and Zhang 2015, Shehadeh, Cohn and
Jiang 2020), and extreme event analysis (Lam and Mottet 2017), among others.
Note that the Chebyshev ambiguity set with uncertain moments encapsulates the
support-only ambiguity set, the Markov ambiguity set, and the Chebyshev ambigu-
ity set as special cases. They are recovered by setting F = R𝑑 × S+𝑑 , F = {𝜇} × S+𝑑 ,
and F = {𝜇} × {𝑀 }, respectively.
El Ghaoui et al. (2003) capture the uncertainty in the moments using the box

F = { (𝜇, 𝑀) ∈ R𝑑 × S+𝑑 : 𝜇̲ ≤ 𝜇 ≤ 𝜇̄, 𝑀̲ ⪯ 𝑀 ⪯ 𝑀̄ },

where 𝜇̲ ≤ 𝜇̄ are given componentwise bounds on the mean and 𝑀̲ ⪯ 𝑀̄ are given Loewner-order bounds on the second-order moment matrix.
Delage and Ye (2010) show that if 𝜇ˆ and Σ̂ are set to the sample mean and the
sample covariance matrix constructed from a finite number of independent samples
from P, respectively, then one can tune the size parameters 𝛾1 ≥ 0 and 𝛾2 ≥ 1 to
ensure that P belongs to P with any desired confidence.
Chebyshev as well as Markov ambiguity sets with uncertain moments have
found various applications ranging from control (Nakao, Jiang and Shen 2021)
to integer stochastic programming (Bertsimas, Natarajan and Teo 2004, Cheng,
Delage and Lisser 2014), portfolio optimization (Natarajan et al. 2010), extreme
event analysis (Bai, Lam and Zhang 2023) and mechanism design and pricing
(Bergemann and Schlag 2008, Bandi and Bertsimas 2014, Koçyiğit, Iyengar, Kuhn
and Wiesemann 2020, Koçyiğit, Rujeerapaiboon and Kuhn 2022, Chen, Hu and
Wang 2024a, Bayrak, Koçyiğit, Kuhn and Pınar 2022, Anunrojwong, Balseiro and
Besbes 2024), among many others.
The uncertainty set F for the first- and second-order moments of P often corres-
ponds to a neighborhood of a nominal mean-covariance pair ( 𝜇, ˆ Σ̂) with respect to
some measure of discrepancy. For example, matrix norms such as the Frobenius
norm, the spectral norm or the nuclear norm (Bernstein 2009, § 9) provide natural measures to quantify the dissimilarity of covariance matrices. The discrepancy between two mean-covariance pairs (𝜇, Σ) and (𝜇̂, Σ̂) can also be defined as the discrepancy between the normal distributions N(𝜇, Σ) and N(𝜇̂, Σ̂) with
respect to a probability metric or an information-theoretic divergence such as the
Kullback-Leibler divergence (Kullback 1959), the Fisher-Rao distance (Atkinson
and Mitchell 1981) or other spectral divergences (Zorzi 2014).
As we will discuss in more detail in Section 2.3, the 2-Wasserstein distance between two normal distributions N(𝜇, Σ) and N(𝜇̂, Σ̂) coincides with the Gelbrich distance between the underlying mean-covariance pairs (𝜇, Σ) and (𝜇̂, Σ̂). In the following, we first provide a formal definition of the Gelbrich distance and then exemplify how it can be used to define a moment uncertainty set F.
Definition 2.1 (Gelbrich Distance). The Gelbrich distance between two mean-covariance pairs (𝜇, Σ) and (𝜇̂, Σ̂) in R𝑑 × S+𝑑 is given by

G((𝜇, Σ), (𝜇̂, Σ̂)) = ( ‖𝜇 − 𝜇̂‖₂² + Tr( Σ + Σ̂ − 2 (Σ̂^{1/2} Σ Σ̂^{1/2})^{1/2} ) )^{1/2} .
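The Gelbrich distance can be evaluated with a symmetric eigendecomposition. A small numpy sketch (the helper names are our own):

```python
import numpy as np

def psd_sqrt(A):
    # Square root of a symmetric positive semidefinite matrix via eigendecomposition
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def gelbrich(mu, Sigma, mu_hat, Sigma_hat):
    # G^2 = ||mu - mu_hat||_2^2
    #       + Tr(Sigma + Sigma_hat - 2 (Sigma_hat^{1/2} Sigma Sigma_hat^{1/2})^{1/2})
    Sh = psd_sqrt(Sigma_hat)
    cross = psd_sqrt(Sh @ Sigma @ Sh)
    g2 = np.sum((mu - mu_hat) ** 2) + np.trace(Sigma + Sigma_hat - 2.0 * cross)
    return np.sqrt(max(g2, 0.0))  # clip guards against tiny negative round-off
```

For identical covariance matrices the distance reduces to the Euclidean distance between the means, and for commuting covariance matrices the trace term reduces to the squared differences of the square roots of the eigenvalues.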
The Gelbrich distance is closely related to the Bures distance, which quantifies the similarity between density matrices in quantum information theory. The Bures distance
is known to induce a Riemannian metric on the space of positive semidefinite
matrices (Bhatia, Jain and Lim 2018, 2019). When Σ and Σ̂ are simultaneously
diagonalizable, then their Bures distance coincides with the Hellinger distance
between their spectra. The Hellinger distance is closely related to the Fisher-Rao
metric ubiquitous in information theory (Liese and Vajda 1987). Even though the
Gelbrich distance is nonconvex, the squared Gelbrich distance is jointly convex in
both of its arguments. This is an immediate consequence of the following propos-
ition, which can be found in (Olkin and Pukelsheim 1982, Dowson and Landau
1982, Givens and Shortt 1984, Panaretos and Zemel 2020).
Proposition 2.2 (SDP Representation of the Gelbrich Distance). For any mean-covariance pairs (𝜇, Σ) and (𝜇̂, Σ̂) in R𝑑 × S+𝑑 , we have

G²((𝜇, Σ), (𝜇̂, Σ̂)) = min_{𝐶 ∈ R𝑑×𝑑} { ‖𝜇 − 𝜇̂‖₂² + Tr(Σ + Σ̂ − 2𝐶) : [Σ, 𝐶; 𝐶⊤, Σ̂] ⪰ 0 },   (2.5)

where [Σ, 𝐶; 𝐶⊤, Σ̂] denotes the symmetric 2 × 2 block matrix with blocks Σ, 𝐶, 𝐶⊤ and Σ̂.
Proof. Throughout the proof we keep 𝜇, 𝜇ˆ and Σ fixed and treat Σ̂ as a parameter.
We also use 𝑓 (Σ̂) as a shorthand for the left hand side of (2.5) and 𝑔(Σ̂) as a
shorthand for the right hand side of (2.5). Elementary manipulations show that
𝑔(Σ̂) = ‖𝜇 − 𝜇̂‖₂² + Tr(Σ + Σ̂) − max_{𝐶 ∈ R𝑑×𝑑} { Tr(2𝐶) : [Σ, 𝐶; 𝐶⊤, Σ̂] ⪰ 0 }.   (2.6)
The maximization problem in (2.6) is dual to the following minimization problem.
inf_{𝐴11, 𝐴22 ∈ S𝑑} { Tr(𝐴11 Σ) + Tr(𝐴22 Σ̂) : [𝐴11, 𝐼𝑑 ; 𝐼𝑑 , 𝐴22] ⪰ 0 }
Strong duality holds because 𝐴11 = 𝐴22 = 2𝐼𝑑 constitutes a Slater point for the dual
problem (Ben-Tal and Nemirovski 2001, Theorem 2.4.1). The existence of a Slater
point further implies that the primal maximization problem in (2.6) as well as the
minimization problem in (2.5) are solvable. By (Bernstein 2009, Corollary 8.2.2),
both 𝐴11 and 𝐴22 must be positive definite in order to be dual feasible. Thus, they
are invertible. We can therefore employ a Schur complement argument (Ben-Tal
and Nemirovski 2001, Lemma 4.2.1) to simplify the dual problem to
inf_{𝐴11 ⪰ 𝐴22⁻¹, 𝐴22 ≻ 0} Tr(𝐴11 Σ) + Tr(𝐴22 Σ̂) = inf_{𝐴22 ≻ 0} Tr(𝐴22⁻¹ Σ) + Tr(𝐴22 Σ̂),   (2.7)
where the equality holds because Σ ⪰ 0. The optimal value of the resulting minimization problem is concave and upper semicontinuous in Σ̂ because it constitutes a
pointwise infimum of affine functions of Σ̂. Thus, 𝑔(Σ̂) is convex and lower semi-
continuous. We now show that if Σ̂ ≻ 0, then the convex minimization problem
over 𝐴22 in (2.7) can be solved in closed form. To this end, we construct a positive
definite matrix 𝐴★22 that satisfies the problem’s first-order optimality condition

Σ̂ − 𝐴22⁻¹ Σ 𝐴22⁻¹ = 0  ⇐⇒  𝐴22 Σ̂ 𝐴22 − Σ = 0.

Indeed, multiplying the quadratic equation on the right from both sides with Σ̂^{1/2} yields the equivalent equation (Σ̂^{1/2} 𝐴22 Σ̂^{1/2})² = Σ̂^{1/2} Σ Σ̂^{1/2}. As Σ̂ ≻ 0, this equation is uniquely solved by 𝐴★22 = Σ̂^{−1/2} (Σ̂^{1/2} Σ Σ̂^{1/2})^{1/2} Σ̂^{−1/2}. Substituting 𝐴★22 into (2.7) reveals that the optimal value of the dual minimization problem is given by 2 Tr((Σ̂^{1/2} Σ Σ̂^{1/2})^{1/2}). Substituting this value into (2.6) then shows that 𝑔(Σ̂) = 𝑓 (Σ̂) whenever Σ̂ ≻ 0.
It remains to be shown that 𝑔(Σ̂) = 𝑓 (Σ̂) if Σ̂ is singular. To this end, we re-
call from (Nguyen, Shafieezadeh-Abadeh, Kuhn and Mohajerin Esfahani 2023,
Lemma A.2) that the matrix square root is continuous on S+𝑑 , which implies
that 𝑓 (Σ̂) is continuous on S+𝑑 . For any singular Σ̂ ⪰ 0, we thus have

𝑓 (Σ̂) = lim inf_{Σ̂′ → Σ̂, Σ̂′ ≻ 0} 𝑓 (Σ̂′) = lim inf_{Σ̂′ → Σ̂, Σ̂′ ≻ 0} 𝑔(Σ̂′) = 𝑔(Σ̂).
Here, the first equality exploits the continuity of 𝑓 , and the second equality holds
because 𝑓 (Σ̂′ ) = 𝑔(Σ̂′ ) for every Σ̂′ ≻ 0. The third equality follows from the
convexity and lower semicontinuity of 𝑔, which imply that the limit inferior can
neither be smaller nor larger than 𝑔(Σ̂), respectively. This completes the proof.
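The closed-form dual optimizer constructed in the proof can be checked numerically. The following numpy sketch (a toy instance of our own with random positive definite matrices; psd_sqrt is our own helper) verifies both the first-order optimality condition and the resulting dual objective value:

```python
import numpy as np

def psd_sqrt(A):
    # Square root of a symmetric positive semidefinite matrix via eigendecomposition
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

# Random positive definite Sigma and Sigma_hat (toy instance)
rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
Sigma = B @ B.T + np.eye(3)
D = rng.standard_normal((3, 3))
Sigma_hat = D @ D.T + np.eye(3)

Sh = psd_sqrt(Sigma_hat)              # Sigma_hat^{1/2}
M = psd_sqrt(Sh @ Sigma @ Sh)         # (Sigma_hat^{1/2} Sigma Sigma_hat^{1/2})^{1/2}
Sh_inv = np.linalg.inv(Sh)
A22 = Sh_inv @ M @ Sh_inv             # dual optimizer A22* from the proof

# First-order optimality condition: A22 Sigma_hat A22 = Sigma
assert np.allclose(A22 @ Sigma_hat @ A22, Sigma)

# Dual objective Tr(A22^{-1} Sigma) + Tr(A22 Sigma_hat) equals 2 Tr(M)
dual_value = np.trace(np.linalg.inv(A22) @ Sigma) + np.trace(A22 @ Sigma_hat)
assert np.isclose(dual_value, 2.0 * np.trace(M))
```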
Proposition 2.2 shows that the squared Gelbrich distance coincides with the
optimal value of a tractable semidefinite program. This makes the Gelbrich distance
attractive for computation. As a byproduct, the proof of Proposition 2.2 reveals
that the squared Gelbrich distance is convex as well as continuous on its domain.
Following Nguyen, Shafiee, Filipović and Kuhn (2021), we can now introduce
the Gelbrich ambiguity set as an instance of the Chebyshev ambiguity set (2.4)
with uncertain moments. The corresponding moment uncertainty set is given by

F = { (𝜇, 𝑀) ∈ R𝑑 × S+𝑑 : ∃Σ ∈ S+𝑑 with 𝑀 = Σ + 𝜇𝜇⊤ and G((𝜇, Σ), (𝜇̂, Σ̂)) ≤ 𝑟 },   (2.8)

where (𝜇̂, Σ̂) is a nominal mean-covariance pair, and the radius 𝑟 ≥ 0 serves as a
tunable size parameter. Below we refer to F as the Gelbrich uncertainty set. The
next proposition establishes basic topological and computational properties of F.
Proposition 2.3 (Gelbrich Uncertainty Set). The uncertainty set F defined in (2.8) is convex and compact. In addition, it admits the semidefinite representation

F = { (𝜇, 𝑀) ∈ R𝑑 × S+𝑑 : ∃𝐶 ∈ R𝑑×𝑑 , 𝑈 ∈ S+𝑑 with ‖𝜇̂‖₂² − 2𝜇⊤𝜇̂ + Tr(𝑀 + Σ̂ − 2𝐶) ≤ 𝑟² , [𝑀 − 𝑈, 𝐶; 𝐶⊤, Σ̂] ⪰ 0, [𝑈, 𝜇; 𝜇⊤, 1] ⪰ 0 }.
Proof. The proof exploits the semidefinite representation of the squared Gelbrich
where the equality has been established in the proof of Proposition 2.2. The two
inequalities follow from a relaxation of the linear matrix inequality, which exploits
the observation that all second principal minors of a positive semidefinite matrix
are non-negative, and from the Cauchy-Schwarz inequality. Thus, Σ satisfies
( Tr(Σ)^{1/2} − Tr(Σ̂)^{1/2} )² ≤ Tr( Σ + Σ̂ − 2 (Σ̂^{1/2} Σ Σ̂^{1/2})^{1/2} ) ≤ 𝑟² ,

where the second inequality holds because (𝜇, Σ) ∈ V. We may therefore conclude that Tr(Σ) ≤ (𝑟 + Tr(Σ̂)^{1/2})² , which in turn implies that 0 ⪯ Σ ⪯ (𝑟 + Tr(Σ̂)^{1/2})² 𝐼𝑑 . In
summary, we have shown that both 𝜇 and Σ belong to bounded sets. As (𝜇, Σ) ∈ V
was chosen arbitrarily, this proves that V is indeed bounded and thus compact.
Proposition 2.3 shows that the uncertainty set F is convex. This is surprising because F = 𝑓 (V), where the Gelbrich ball V in the space of mean-covariance pairs is convex thanks to Proposition 2.2 and where 𝑓 is a quadratic bijection. Indeed, convexity is usually only preserved under affine transformations.
Gelbrich ambiguity sets were introduced by Nguyen et al. (2021) in the context
of robust portfolio optimization. They have also found use in machine learning
(Bui, Nguyen and Nguyen 2022, Vu, Tran, Yue and Nguyen 2021, Nguyen, Bui and
Nguyen 2022a), estimation (Nguyen et al. 2023), filtering (Shafieezadeh-Abadeh,
Nguyen, Kuhn and Mohajerin Esfahani 2018, Kargin, Hajar, Malik and Hassibi
2024b) and control (McAllister and Mohajerin Esfahani 2023, Al Taha, Yan and
Bitar 2023, Hajar, Kargin and Hassibi 2023, Hakobyan and Yang 2024, Taşkesen,
Iancu, Koçyiğit and Kuhn 2024, Kargin, Hajar, Malik and Hassibi 2024a,c,d).
Hence, one can replace the original ambiguity set P with the lifted ambiguity set Q.
This is useful because Q constitutes a simple Markov ambiguity set that specifies
only the support set C and the mean (𝜇, 𝑔) of the joint random vector (𝑍, 𝑈). In addition, one can show that C is convex because Z is convex and 𝐺 is K-convex.
In summary, DRO problems with mean-dispersion ambiguity sets of the form (2.9)
can systematically be reduced to DRO problems with Markov ambiguity sets.
A more general class of mean-dispersion ambiguity sets can be used to shape the
moment generating function of 𝑍 under P. Specifically, Chen, Sim and Xu (2019)
introduce the entropic dominance ambiguity set
P = { P ∈ P(Z) : E P [𝑍] = 𝜇, log E P [exp(𝜃⊤(𝑍 − 𝜇))] ≤ 𝑔(𝜃) ∀𝜃 ∈ R𝑑 },
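As a concrete sanity check, a Gaussian random variable satisfies the entropic dominance constraint with equality: for a one-dimensional 𝑍 ~ N(𝜇, 𝜎²) we have log E[exp(𝜃(𝑍 − 𝜇))] = 𝜎²𝜃²/2, so the ambiguity set above contains N(𝜇, 𝜎²) whenever 𝑔(𝜃) ≥ 𝜎²𝜃²/2. A Monte Carlo sketch with a toy instance of our own:

```python
import math
import random

# For Z ~ N(mu, sigma^2), log E[exp(theta * (Z - mu))] = sigma^2 * theta^2 / 2,
# so the entropic dominance constraint holds with g(theta) = sigma^2 * theta^2 / 2.
random.seed(0)
mu, sigma, theta = 2.0, 1.5, 0.7
n = 200_000
mc = sum(math.exp(theta * (random.gauss(mu, sigma) - mu)) for _ in range(n)) / n
exact = sigma ** 2 * theta ** 2 / 2.0
assert abs(math.log(mc) - exact) < 0.05  # Monte Carlo estimate matches the closed form
```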
a generalized moment problem. This moment problem as well as its dual constitute
semi-infinite linear programs, which can be recast as finite-dimensional conic
optimization problems over certain moment cones and the corresponding dual cones
of non-negative polynomials (Karlin and Studden 1966, Zuluaga and Pena 2005).
Even though NP-hard in general, these conic problems can be approximated by
increasingly tight sequences of tractable semidefinite programs by using tools from
polynomial optimization (Parrilo 2000, 2003, Lasserre 2001, 2009). This general
technique gives rise to worst-case expectation bounds and generalized Chebyshev
inequalities with respect to the ambiguity set P (Bertsimas and Sethuraman 2000,
Lasserre 2002, Popescu 2005, Lasserre 2008). In addition, it leads to tight bounds
on worst-case risk measures (Natarajan, Pachamanova and Sim 2009a).
A dominating measure 𝜌 always exists, but it must depend on P and P̂. For
example, one may set 𝜌 = P + P̂. The absolute continuity conditions P ≪ 𝜌 and P̂ ≪ 𝜌 ensure that the Radon-Nikodym derivatives dP/d𝜌 and dP̂/d𝜌 are well-defined,
respectively. The following proposition derives a dual representation of a generic
𝜙-divergence, which reveals that D 𝜙 (P, P̂) is in fact independent of the choice of 𝜌.
Proposition 2.6 reveals that D 𝜙 (P, P̂) is jointly convex in P and P̂. If 𝜙(𝑠) grows superlinearly with 𝑠, that is, if the asymptotic growth rate 𝜙∞ (1) is infinite, then D 𝜙 (P, P̂) is finite if and only if (dP/d𝜌)(𝑧) = 0 for 𝜌-almost all 𝑧 ∈ Z with (dP̂/d𝜌)(𝑧) = 0. Put differently, D 𝜙 (P, P̂) is finite if and only if P ≪ P̂. In this special case, the chain rule for Radon-Nikodym derivatives implies that (dP/d𝜌)/(dP̂/d𝜌) = dP/dP̂. If 𝜙∞ (1) = ∞, the
𝜙-divergence thus admits the more common (but less general) representation

D 𝜙 (P, P̂) = ∫Z 𝜙( (dP/dP̂)(𝑧) ) dP̂(𝑧) if P ≪ P̂, and D 𝜙 (P, P̂) = +∞ otherwise.
We are now ready to define the 𝜙-divergence ambiguity set as

P = { P ∈ P(Z) : D 𝜙 (P, P̂) ≤ 𝑟 }.   (2.10)
This set contains all probability distributions P supported on Z whose 𝜙-divergence
with respect to some prescribed reference distribution P̂ is at most 𝑟 ≥ 0.
Remark 2.7 (Csiszár Duals). The family of generalized 𝜙-divergences (which may adopt finite values even if P is not absolutely continuous with respect to P̂) is invariant under permutations of P and P̂. Formally, we have D 𝜙 (P, P̂) = D 𝜓 (P̂, P), where 𝜓 denotes the Csiszár dual of 𝜙 defined through 𝜓(𝑠) = 𝜙 𝜋 (1, 𝑠) = 𝑠𝜙(1/𝑠) (Ben-Tal, Ben-Israel and Teboulle 1991,
Lemma 2.3). One readily verifies that if 𝜙 is a valid entropy function in the sense
of Definition 2.4, then 𝜓 is also a valid entropy function. This relationship shows
that, even though 𝜙-divergences are generically asymmetric, we do not sacrifice
generality by focusing on divergence ambiguity sets of the form (2.10), with the
nominal distribution P̂ being the second argument of the divergence. From the
discussion after Proposition 2.6 it is clear that if 𝜙∞ (1) = ∞, then all distributions P
in the 𝜙-divergence ambiguity set (2.10) satisfy P ≪ P̂. Similarly, if the Csiszár dual
of 𝜙 satisfies 𝜓 ∞ (1) = ∞, then all distributions P in the 𝜙-divergence ambiguity set
satisfy P̂ ≪ P. Table 2.1 lists common entropy functions and their Csiszár duals.
We emphasize that the family of Cressie-Read divergences includes the (scaled)
Pearson 𝜒2 -divergence for 𝛽 = 2, the Kullback-Leibler divergence for 𝛽 → 1 and
the likelihood divergence for 𝛽 → 0 as special cases.
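The Csiszár duality D 𝜙 (P, P̂) = D 𝜓 (P̂, P) is easy to verify on a finite support. The sketch below uses the Kullback-Leibler entropy function 𝜙(𝑠) = 𝑠 log 𝑠, whose Csiszár dual 𝜓(𝑠) = 𝑠𝜙(1/𝑠) = −log 𝑠 generates the likelihood divergence (the two distributions are toy instances of our own):

```python
import math

p = [0.2, 0.5, 0.3]   # distribution P
q = [0.4, 0.4, 0.2]   # reference distribution P-hat

phi = lambda s: s * math.log(s)    # Kullback-Leibler entropy function
psi = lambda s: s * phi(1.0 / s)   # Csiszar dual, equal to -log(s)

# D_phi(P, P-hat) = sum_z q(z) * phi(p(z) / q(z))
d_phi = sum(qi * phi(pi / qi) for pi, qi in zip(p, q))
# D_psi(P-hat, P) = sum_z p(z) * psi(q(z) / p(z))
d_psi = sum(pi * psi(qi / pi) for pi, qi in zip(p, q))
assert abs(d_phi - d_psi) < 1e-12  # the two divergences coincide
```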
The DRO literature often focuses on the restricted 𝜙-divergence ambiguity set

P = { P ∈ P(Z) : P ≪ P̂, D 𝜙 (P, P̂) ≤ 𝑟 },   (2.11)
where F denotes the family of all bounded Borel functions 𝑓 : Z → R. Note that F
is invariant under constant shifts. That is, if 𝑓 (𝑧) is a bounded Borel function, then
so is 𝑓 (𝑧) + 𝑐 for any constant 𝑐 ∈ R. Without loss of generality, we may thus
optimize over both 𝑓 ∈ F and 𝑐 ∈ R in the above maximization problem to obtain
KL(P, P̂) = sup_{𝑓 ∈ F} sup_{𝑐 ∈ R} ∫Z ( 𝑓 (𝑧) + 𝑐 ) dP(𝑧) − ∫Z ( e^{𝑓 (𝑧)+𝑐} − 1 ) dP̂(𝑧).
For any fixed 𝑓 ∈ F, the inner maximization problem over 𝑐 is uniquely solved by

𝑐★ = − log ( ∫Z 𝑒^{𝑓 (𝑧)} dP̂(𝑧) ).
Substituting this expression back into the objective function yields (2.12).
Proposition 2.9 establishes a link between the Kullback-Leibler divergence and
the entropic risk measure. This connection will become useful in Section 4.3.
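On a finite support, the resulting variational formula KL(P, P̂) = sup_𝑓 E P [𝑓 (𝑍)] − log E P̂ [e^{𝑓 (𝑍)}] can be checked directly: the supremum is attained by 𝑓 (𝑧) = log(dP/dP̂(𝑧)), and every other bounded 𝑓 yields a smaller value. A small self-contained sketch (the distributions are toy instances of our own):

```python
import math

p = [0.2, 0.5, 0.3]   # distribution P
q = [0.4, 0.4, 0.2]   # reference distribution P-hat

kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def dv_objective(f):
    # Variational objective: E_P[f] - log E_Phat[e^f]
    return (sum(pi * fi for pi, fi in zip(p, f))
            - math.log(sum(qi * math.exp(fi) for qi, fi in zip(q, f))))

f_star = [math.log(pi / qi) for pi, qi in zip(p, q)]  # optimal test function
assert abs(dv_objective(f_star) - kl) < 1e-12
# Any other bounded f can only do worse:
assert dv_objective([1.0, -0.5, 0.2]) <= kl + 1e-12
```

The objective is invariant under constant shifts of 𝑓, which is exactly the degree of freedom eliminated by the optimal choice of 𝑐 in the proof.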
The Kullback-Leibler ambiguity set of radius 𝑟 ≥ 0 around P̂ ∈ P(Z) is given by

P = { P ∈ P(Z) : KL(P, P̂) ≤ 𝑟 }.   (2.13)
As 𝜙∞ (1) = +∞, all distributions P ∈ P are absolutely continuous with respect to P̂.
Thus, P coincides with the restricted Kullback-Leibler ambiguity set. El Ghaoui
et al. (2003) derive a closed-form expression for the worst-case value-at-risk of a
linear loss function when P̂ is a Gaussian distribution. Hu and Hong (2013) use
similar techniques to show that any distributionally robust individual chance con-
straint with respect to a Kullback-Leibler ambiguity set is equivalent to a classical
chance constraint with a rescaled confidence level. Calafiore (2007) studies worst-
case mean-risk portfolio selection problems when P̂ is a discrete distribution. The
Kullback-Leibler ambiguity set has also found applications in least-squares estim-
ation (Levy and Nikoukhah 2004), hypothesis testing (Levy 2008, Gül and Zoubir
2017), filtering (Levy and Nikoukhah 2012, Zorzi 2016, 2017a,b), the theory of
risk measures (Ahmadi-Javid 2012, Postek et al. 2016) and extreme value analysis
(Blanchet, He and Murthy 2020), among many others.
Proposition 2.11. The total variation distance coincides with the 𝜙-divergence
induced by the entropy function 𝜙(𝑠) = ½|𝑠 − 1| for all 𝑠 ≥ 0.
Proof. The conjugate entropy function evaluates to 𝜙∗(𝑡) = max{𝑡, −½} if 𝑡 ≤ ½
and to 𝜙∗(𝑡) = +∞ if 𝑡 > ½. By Proposition 2.6, the 𝜙-divergence corresponding to
the given entropy function thus admits the dual representation
D 𝜙 (P, P̂) = sup_{𝑓∈F} ∫_Z 𝑓(𝑧) dP(𝑧) − ∫_Z max{𝑓(𝑧), −½} dP̂(𝑧), (2.15)
where F ′ denotes the family of all Borel functions 𝑓 : Z → [0, 1]. Moreover, as
the objective function of the maximization problem in (2.16) is linear in 𝑓 , we can
further restrict F ′ to contain only binary Borel functions 𝑓 : Z → {0, 1} without
sacrificing optimality. As there is a one-to-one correspondence between Borel sets
and their characteristic functions, we finally obtain the desired identity
D 𝜙 (P, P̂) = sup {P(B) − P̂(B) : B ⊆ Z is a Borel set}.
Hence, the claim follows.
The total variation ambiguity set of radius 𝑟 ≥ 0 around P̂ ∈ P(Z) is given by
P = {P ∈ P(Z) : TV(P, P̂) ≤ 𝑟}.
Most of the existing literature focuses on the restricted total variation ambiguity
set, which contains all distributions P ∈ P that satisfy P ≪ P̂. Jiang and Guan
(2018, Theorem 1) and Shapiro (2017, Example 3.7) show that the worst-case ex-
pected loss with respect to a restricted total variation ambiguity set coincides with
a combination of a conditional value-at-risk and the essential supremum of the loss
with respect to P̂, see also Section 6.10. Rahimian, Bayraksan and Homem-de-
Mello (2019a,b, 2022) study the worst-case distributions of DRO problems over
unrestricted total variation ambiguity sets when Z is finite. The total variation am-
biguity set is related to Huber’s contamination model from robust statistics (Huber
1981), which assumes that a fraction 𝑟 ∈ (0, 1) of all samples in a statistical dataset
are drawn from an arbitrary contaminating distribution. Hence, the total vari-
ation distance between the target distribution to be estimated and the contaminated
data-generating distribution is at most 𝑟. It is thus natural to use a total variation
ambiguity set of radius 𝑟 around some estimated distribution as the search space for
the target distribution (Nishimura and Ozaki 2004, 2006, Bose and Daripa 2009,
Duchi, Hashimoto and Namkoong 2023, Tsang and Shehadeh 2024).
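The contamination argument can be made concrete for discrete distributions, where the total variation distance equals half the ℓ₁ distance between the probability vectors. The sketch below uses hypothetical distributions and verifies both the contamination bound and the sup-over-Borel-sets characterization from Proposition 2.11.

```python
import numpy as np
from itertools import chain, combinations

p = np.array([0.5, 0.3, 0.2])        # target distribution
c = np.array([0.0, 0.0, 1.0])        # arbitrary contaminating distribution
r = 0.1                              # contamination fraction
p_cont = (1 - r) * p + r * c         # contaminated data-generating distribution

def tv(p, q):
    # For discrete distributions, TV equals half the L1 distance.
    return 0.5 * np.abs(p - q).sum()

# Contamination by a fraction r moves the distribution by at most r in TV.
assert tv(p, p_cont) <= r + 1e-12

# The sup-over-Borel-sets characterization agrees with the half-L1 form.
subsets = chain.from_iterable(combinations(range(3), k) for k in range(1, 4))
best = max(p[list(B)].sum() - p_cont[list(B)].sum() for B in subsets)
assert np.isclose(best, tv(p, p_cont))
```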
Borel function, then so is 𝑐 𝑓 (𝑧) for any constant 𝑐 ∈ R. We may thus optimize
separately over 𝑓 ∈ F and 𝑐 ∈ R in the above maximization problem to obtain
𝜒²(P, P̂) = sup_{𝑓∈F} sup_{𝑐∈R} ( E_P[𝑓(𝑍)] − E_P̂[𝑓(𝑍)] ) 𝑐 − V_P̂[𝑓(𝑍)] 𝑐²/4
          = sup_{𝑓∈F} ( E_P[𝑓(𝑍)] − E_P̂[𝑓(𝑍)] )² / V_P̂[𝑓(𝑍)].
Note that the inner maximization problem over 𝑐 simply evaluates the conjugate
of the convex quadratic function VP̂ [ 𝑓 (𝑍)]𝑐2 /4 at E P [ 𝑓 (𝑍)] − E P̂ [ 𝑓 (𝑍)], which is
available in closed form. Thus, the claim follows.
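This variational representation of the 𝜒²-divergence can be checked numerically for discrete distributions; the sketch below uses hypothetical probability vectors, and the supremum is attained at the likelihood ratio 𝑓 = dP/dP̂.

```python
import numpy as np

# Hypothetical discrete distributions on a common finite support.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

chi2 = np.sum((p - q) ** 2 / q)      # Pearson chi-squared divergence

def ratio(f):
    # (E_P[f(Z)] - E_Phat[f(Z)])^2 / Var_Phat[f(Z)]
    var = q @ f**2 - (q @ f) ** 2
    return (p @ f - q @ f) ** 2 / var

# The supremum is attained by the likelihood ratio f* = dP / dPhat.
assert np.isclose(ratio(p / q), chi2)

# Any other (non-constant) test function yields a smaller value.
rng = np.random.default_rng(1)
for _ in range(1000):
    assert ratio(rng.normal(size=3)) <= chi2 + 1e-9
```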
As the 𝜒2 -divergence fails to be symmetric, it gives rise to two complementary
ambiguity sets, which differ according to whether the reference distribution P̂ ∈
P(Z) is used as the first or the second argument of the 𝜒2 -divergence. Lam (2018)
defines the Pearson 𝜒2 -ambiguity set of radius 𝑟 ≥ 0 around P̂ as
P = {P ∈ P(Z) : 𝜒²(P, P̂) ≤ 𝑟} (2.17)
in order to analyze operations and service systems with dependent data. Philpott,
de Matos and Kapelevich (2018) develop a stochastic dual dynamic programming
algorithm for solving distributionally robust multistage stochastic programs with a
Pearson ambiguity set. In the context of static DRO, Duchi and Namkoong (2019)
show that robustification with respect to a Pearson ambiguity set is closely related
to variance regularization. Note that as 𝜙∞ (1) = +∞, the Pearson ambiguity set
coincides with its restricted version, which contains only distributions P ≪ P̂.
Klabjan, Simchi-Levi and Song (2013) define the Neyman 𝜒2 -ambiguity set as
P = {P ∈ P(Z) : 𝜒²(P̂, P) ≤ 𝑟}
in order to formulate robust lot-sizing problems. Hanasusanto and Kuhn (2013) use
a Neyman ambiguity set with finite Z in the context of robust data-driven dynamic
programming. Finally, Hanasusanto et al. (2015a) use the same ambiguity set to
model the uncertainty in the mixture weights of multimodal demand distributions.
where Γ(P, P̂) represents the set of all couplings 𝛾 of P and P̂, that is, all joint
probability distributions of 𝑍 and 𝑍ˆ with marginals P and P̂, respectively.
By definition, we have 𝛾 ∈ Γ(P, P̂) if and only if 𝛾((𝑍, 𝑍̂) ∈ B × Z) = P(𝑍 ∈ B)
and 𝛾((𝑍, 𝑍̂) ∈ Z × B̂) = P̂(𝑍̂ ∈ B̂) for all Borel sets B, B̂ ⊆ Z. If the probability
distributions P and P̂ are visualized as two piles of sand, then any coupling 𝛾 ∈ Γ(P, P̂)
can be interpreted as a transportation plan, that is, an instruction for morphing P̂ into
the shape of P by moving sand between various origin-destination pairs in Z. In-
deed, for any fixed origin 𝑧ˆ ∈ Z, the conditional probability 𝛾(𝑧 ≤ 𝑍 ≤ 𝑧+d𝑧| 𝑍ˆ = 𝑧ˆ)
determines the proportion of the sand located at 𝑧ˆ that should be moved to (an in-
finitesimally small rectangle at) the destination 𝑧. If the cost of moving one unit
of probability mass from 𝑧ˆ to 𝑧 amounts to 𝑐(𝑧, 𝑧ˆ), then OT𝑐 (P, P̂) is the minimal
amount of money that is needed to morph P̂ into P. We now provide a dual
representation for generic optimal transport discrepancies.
Proposition 2.16 (Kantorovich Duality I). We have
OT𝑐 (P, P̂) = sup_{𝑓∈L¹(P), 𝑔∈L¹(P̂)} ∫_Z 𝑓(𝑧) dP(𝑧) − ∫_Z 𝑔(𝑧̂) dP̂(𝑧̂)
             s.t. 𝑓(𝑧) − 𝑔(𝑧̂) ≤ 𝑐(𝑧, 𝑧̂) ∀𝑧, 𝑧̂ ∈ Z, (2.19)
where L1 (P) and L1 (P̂) denote the sets of all Borel functions from Z to R that are
integrable with respect to P and P̂, respectively.
The dual problem (2.19) represents the profit maximization problem of a third
party that redistributes the sand from P̂ to P on behalf of the problem owner by
buying sand at the origin 𝑧ˆ at unit price 𝑔(ˆ𝑧) and selling sand at the destination 𝑧
at unit price 𝑓 (𝑧). The constraints ensure that it is cheaper for the problem owner
to use the services of the third party instead of moving the sand without external
help at the transportation cost 𝑐(𝑧, 𝑧ˆ) for every origin-destination pair (ˆ𝑧 , 𝑧). The
optimal price functions 𝑓 ★ and 𝑔★, if they exist, are termed Kantorovich potentials.
Proof of Proposition 2.16. For a general proof we refer to (Villani 2008, The-
orem 5.10 (i)). We prove the claim under the simplifying assumption that Z is
compact. In this case, the family C(Z × Z) of all continuous (and thus bounded)
functions 𝑓 : Z × Z → R equipped with the supremum norm constitutes a Banach
space. Its topological dual is the space M(Z ×Z) of all finite signed Borel measures
on Z × Z equipped with the total variation norm (Folland 1999, Corollary 7.18).
This means that for every continuous linear
∫ functional 𝜑 : C(Z × Z) → R there
exists 𝛾 ∈ M(Z × Z) such that 𝜑( 𝑓 ) = Z ×Z 𝑓 (𝑧, 𝑧ˆ) d𝛾(𝑧, 𝑧ˆ) for all 𝑓 ∈ C(Z × Z).
where the convex functions 𝜙, 𝜓 : C(Z × Z) → (−∞, +∞] are defined through
𝜙(ℎ) = 0 if −ℎ(𝑧, 𝑧̂) ≤ 𝑐(𝑧, 𝑧̂) for all 𝑧, 𝑧̂ ∈ Z and 𝜙(ℎ) = +∞ otherwise,
and
𝜓(ℎ) = ∫_Z ∫_Z ℎ(𝑧, 𝑧̂) dP(𝑧) dP̂(𝑧̂) if there exist 𝑓, 𝑔 ∈ C(Z) with ℎ(𝑧, 𝑧̂) = 𝑔(𝑧̂) − 𝑓(𝑧) for all 𝑧, 𝑧̂ ∈ Z, and 𝜓(ℎ) = +∞ otherwise.
Note that (2.21) can be viewed as the conjugate of 𝜙 + 𝜓 with respect to the
pairing of C(Z × Z) and M(Z × Z) evaluated at the zero measure. Note also
that 𝜙 is continuous at the constant function ℎ0 ≡ 1 because the transportation
cost function 𝑐 is non-negative. In addition, ℎ0 belongs to the domain of 𝜓. The
Fenchel–Rockafellar duality theorem (Brezis 2011, Theorem 1.12) thus ensures
that the conjugate of the sum of the proper convex functions 𝜙 and 𝜓 coincides
with the infimal convolution of their conjugates 𝜙∗ and 𝜓 ∗ . Hence, (2.21) equals
(𝜙 + 𝜓)∗(0) = inf_{𝛾∈M(Z×Z)} 𝜙∗(−𝛾) + 𝜓∗(𝛾). (2.22)
the measure of any Borel set can be approximated with the integral of a continuous
function. Similarly, for any 𝛾 ∈ M(Z × Z) one readily verifies that 𝜓∗(𝛾) = 0
if 𝛾 ∈ Γ(P, P̂) and 𝜓∗(𝛾) = +∞ otherwise.
Substituting the above formulas for 𝜙∗ and 𝜓 ∗ into (2.22) yields (2.20).
Relaxing the requirement 𝑓 , 𝑔 ∈ C(Z) to 𝑓 ∈ L1 (P) and 𝑔 ∈ L1 (P̂) on the right
hand side of (2.20) immediately leads to the upper bound
OT𝑐 (P, P̂) ≤ sup_{𝑓∈L¹(P), 𝑔∈L¹(P̂)} ∫_Z 𝑓(𝑧) dP(𝑧) − ∫_Z 𝑔(𝑧̂) dP̂(𝑧̂)
             s.t. 𝑓(𝑧) − 𝑔(𝑧̂) ≤ 𝑐(𝑧, 𝑧̂) ∀𝑧, 𝑧̂ ∈ Z. (2.23)
On the other hand, it is clear that
OT𝑐 (P, P̂) = inf_{𝛾∈M₊(Z×Z)} sup_{𝑓∈L¹(P), 𝑔∈L¹(P̂)} ∫_{Z×Z} ( 𝑐(𝑧, 𝑧̂) − 𝑓(𝑧) + 𝑔(𝑧̂) ) d𝛾(𝑧, 𝑧̂)
+ ∫_Z 𝑓(𝑧) dP(𝑧) − ∫_Z 𝑔(𝑧̂) dP̂(𝑧̂).
Interchanging the order of minimization and maximization in the above expression
and then evaluating the inner infimum in closed form yields
OT𝑐 (P, P̂) ≥ sup_{𝑓∈L¹(P), 𝑔∈L¹(P̂)} ∫_Z 𝑓(𝑧) dP(𝑧) − ∫_Z 𝑔(𝑧̂) dP̂(𝑧̂)
             s.t. 𝑓(𝑧) − 𝑔(𝑧̂) ≤ 𝑐(𝑧, 𝑧̂) ∀𝑧, 𝑧̂ ∈ Z. (2.24)
Combining (2.23) with (2.24) proves (2.19), and thus the claim follows.
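For discrete distributions the primal optimal transport problem is a finite linear program, which makes the duality above easy to probe numerically. The sketch below solves the primal for two hypothetical two-point marginals with the absolute-difference cost and cross-checks the result against the 1-Wasserstein distance; all supports and weights are illustrative.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import wasserstein_distance

# Hypothetical discrete marginals on one-dimensional supports.
p, z = np.array([0.4, 0.6]), np.array([0.0, 1.0])        # distribution P
q, z_hat = np.array([0.5, 0.5]), np.array([0.0, 2.0])    # distribution Phat
cost = np.abs(z[:, None] - z_hat[None, :])               # c(z, z_hat) = |z - z_hat|

# Primal problem: minimize <cost, gamma> over all couplings gamma of P and Phat.
m, n = cost.shape
A_eq = np.zeros((m + n, m * n))
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1.0     # row marginals of gamma equal p
for j in range(n):
    A_eq[m + j, j::n] = 1.0              # column marginals of gamma equal q
res = linprog(cost.ravel(), A_eq=A_eq, b_eq=np.concatenate([p, q]),
              bounds=[(0, None)] * (m * n))
ot = res.fun

# In one dimension with cost |z - z_hat|, this is the 1-Wasserstein distance.
assert np.isclose(ot, wasserstein_distance(z, z_hat, p, q))
```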
The dual optimal transport problem (2.19) constitutes a linear program over the
price functions 𝑓 ∈ L1 (P) and 𝑔 ∈ L1 (P̂), and its objective function is linear in P
and P̂. As pointwise suprema of linear functions are convex, OT𝑐 (P, P̂) is thus
jointly convex in P and P̂. Problem (2.19) can be further simplified by invoking the
𝑐-transform 𝑓 𝑐 : Z → (−∞, +∞] of the price function 𝑓 , which is defined through
𝑓^𝑐(𝑧̂) = sup_{𝑧∈Z} 𝑓(𝑧) − 𝑐(𝑧, 𝑧̂). (2.25)
where the 𝑐-transforms 𝑓 𝑐 and 𝑔 𝑐 are defined in (2.25) and (2.26), respectively.
In addition, the first (second) supremum does not change if we require that 𝑓 = 𝑔 𝑐
(𝑔 = 𝑓 𝑐 ) for some function 𝑔 : Z → (−∞, +∞] ( 𝑓 : Z → [−∞, +∞)).
In addition, it ensures that the supremum does not change if we restrict the search
space to functions that are representable as 𝑓 = 𝑔 𝑐 for some 𝑔 : Z → (−∞, +∞].
By (2.26), we thus have 𝑓(𝑧) = inf_{𝑧̂∈Z} 𝑔(𝑧̂) + 𝑑(𝑧, 𝑧̂). For any fixed 𝑧̂ ∈ Z, the
auxiliary function 𝑓_𝑧̂(𝑧) = 𝑔(𝑧̂) + 𝑑(𝑧, 𝑧̂) is evidently 1-Lipschitz with respect to
the metric 𝑑. As infima of 1-Lipschitz functions remain 1-Lipschitz, we thus find
lip( 𝑓 ) ≤ 1. In summary, we have shown that restricting attention to 1-Lipschitz
functions does not reduce the supremum of the dual optimal transport problem.
Next, we prove that lip( 𝑓 ) ≤ 1 implies that 𝑓 𝑐 = 𝑓 . Indeed, for any 𝑧ˆ ∈ Z we have
𝑓(𝑧̂) ≤ sup_{𝑧∈Z} 𝑓(𝑧) − 𝑑(𝑧, 𝑧̂) ≤ sup_{𝑧∈Z} 𝑓(𝑧̂) + 𝑑(𝑧, 𝑧̂) − 𝑑(𝑧, 𝑧̂) = 𝑓(𝑧̂),
where the two inequalities hold because 𝑑(ˆ𝑧 , 𝑧ˆ) = 0 and lip( 𝑓 ) ≤ 1, respectively.
This implies via (2.25) that 𝑓 (ˆ𝑧) = sup 𝑧 ∈Z 𝑓 (𝑧) − 𝑑(𝑧, 𝑧ˆ) = 𝑓 𝑐 (ˆ𝑧) for all 𝑧ˆ ∈ Z.
Hence, 𝑓 𝑐 coincides with 𝑓 whenever lip( 𝑓 ) ≤ 1, and thus the claim follows.
The 𝑝-Wasserstein ambiguity set of radius 𝑟 ≥ 0 around P̂ ∈ P(Z) is defined as
P = {P ∈ P(Z) : W 𝑝 (P, P̂) ≤ 𝑟}. (2.28)
Pflug, Pichler and Wozabal (2012) study robust portfolio selection problems, where
the uncertainty about the asset return distribution is captured by a 𝑝-Wasserstein
ball. They prove that—as 𝑟 approaches infinity—it becomes optimal to distribute
one’s capital equally among all available assets. Hence, this result reveals that
the popular 1/𝑁-investment strategy widely used in practice (DeMiguel, Garlappi
and Uppal 2009) is optimal under extreme ambiguity. Pflug et al. (2012), Pichler
(2013) and Wozabal (2014) further show that, for a broad range of convex risk
measures, the worst-case portfolio risk across all distributions in a 𝑝-Wasserstein
ball equals the nominal risk under P̂ plus a regularization term that scales with the
Wasserstein radius 𝑟; see also Section 8.3.
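This regularization effect can be illustrated with synthetic data: by Kantorovich-Rubinstein duality, the expected value of a Lipschitz loss under any perturbed distribution exceeds its nominal expectation by at most the Lipschitz modulus times the 1-Wasserstein distance. The sketch below checks this for a hypothetical 1-Lipschitz loss and synthetic samples.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Hypothetical 1-Lipschitz loss and an empirical nominal distribution Phat.
loss = np.abs                                      # lip(loss) = 1
z_hat = rng.normal(size=200)                       # samples from Phat
z = z_hat + rng.uniform(-0.3, 0.3, size=200)       # samples from a perturbed P

w1 = wasserstein_distance(z, z_hat)
# Kantorovich-Rubinstein bound: E_P[loss] <= E_Phat[loss] + lip(loss) * W1(P, Phat).
assert loss(z).mean() <= loss(z_hat).mean() + 1.0 * w1 + 1e-12
```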
The Wasserstein ambiguity set corresponding to 𝑝 = 1 enjoys particular prominence
in DRO. The Kantorovich-Rubinstein duality can be used to construct a
simple upper bound on the worst-case expectation of a Lipschitz continuous loss
function across all distributions in a 1-Wasserstein ball. This upper bound is given
by the sum of the expected loss under the nominal distribution P̂ plus a regular-
ization term that consists of the Lipschitz modulus of the loss function weighted
by the radius 𝑟 of the ambiguity set. Shafieezadeh-Abadeh, Mohajerin Esfahani
and Kuhn (2015) demonstrate that this upper bound is exact for distributionally
robust logistic regression problems. In fact, this exactness result extends to many
linear prediction models with convex (Chen and Paschalidis 2018, 2019,
Blanchet, Kang and Murthy 2019b, Shafieezadeh-Abadeh et al. 2019, Wu, Li and
Mao 2022) and even nonconvex loss functions (Gao et al. 2024, Ho-Nguyen and
Wright 2023). More generally, 1-Wasserstein ambiguity sets have found numerous
applications in diverse areas such as two-stage and multi-stage stochastic program-
ming (Zhao and Guan 2018, Hanasusanto and Kuhn 2018, Duque and Morton
2020, Bertsimas, Shtern and Sturt 2023), chance constrained programming (Chen,
Kuhn and Wiesemann 2024b, Xie 2021, Ho-Nguyen, Kılınç-Karzan, Küçükyavuz
and Lee 2022, Shen and Jiang 2023), inverse optimization (Mohajerin Esfahani,
Theorem 2.20 (Gelbrich Bound). Assume that Z is equipped with the Euclidean
metric 𝑑(𝑧, 𝑧̂) = ‖𝑧 − 𝑧̂‖₂. For any distributions P, P̂ ∈ P(Z) with finite mean
vectors 𝜇, 𝜇ˆ ∈ R𝑑 and covariance matrices Σ, Σ̂ ∈ S+𝑑 , respectively, we have
W₂(P, P̂)² = inf ‖𝜇 − 𝜇̂‖₂² + Tr[Σ + Σ̂ − 2𝐶]
            s.t. 𝛾 ∈ Γ(P, P̂), 𝐶 ∈ R^{𝑑×𝑑},
                 ∫_{Z×Z} [𝑧 − 𝜇; 𝑧̂ − 𝜇̂][𝑧 − 𝜇; 𝑧̂ − 𝜇̂]^⊤ d𝛾(𝑧, 𝑧̂) = [Σ, 𝐶; 𝐶^⊤, Σ̂], [Σ, 𝐶; 𝐶^⊤, Σ̂] ⪰ 0.
Note that the new decision variable 𝐶 is uniquely determined by the transportation
plan 𝛾, that is, it represents the cross-covariance matrix of 𝑍 and 𝑍ˆ under 𝛾. Thus,
its presence does not enlarge the feasible set. Note also that the linear matrix
inequality in the last expression is redundant because the second-order moment
matrix of 𝛾 is necessarily positive semidefinite. Thus, its presence does not reduce
the feasible set. Finally, note that the integral of the quadratic function
‖𝑧 − 𝑧̂‖₂² = ‖𝜇 − 𝜇̂‖₂² + ‖𝑧 − 𝜇‖₂² + ‖𝑧̂ − 𝜇̂‖₂² − 2(𝑧 − 𝜇)^⊤(𝑧̂ − 𝜇̂)
            + 2(𝜇 − 𝜇̂)^⊤(𝑧 − 𝜇) − 2(𝜇 − 𝜇̂)^⊤(𝑧̂ − 𝜇̂)
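Optimizing the trace term of Theorem 2.20 over the cross-covariance matrix 𝐶 yields the well-known closed-form Gelbrich bound ‖𝜇 − 𝜇̂‖₂² + Tr[Σ + Σ̂ − 2(Σ̂^{1/2} Σ Σ̂^{1/2})^{1/2}] on W₂(P, P̂)², which is tight for Gaussian distributions. The sketch below evaluates this closed form for hypothetical moment parameters; treat the specific numbers as illustrative.

```python
import numpy as np
from scipy.linalg import sqrtm

def gelbrich_distance(mu1, sigma1, mu2, sigma2):
    # sqrt of ||mu1 - mu2||^2 + Tr[S1 + S2 - 2 (S2^{1/2} S1 S2^{1/2})^{1/2}].
    root = sqrtm(sigma2)
    cross = np.real(sqrtm(root @ sigma1 @ root))
    val = np.sum((mu1 - mu2) ** 2) + np.trace(sigma1 + sigma2 - 2 * cross)
    return np.sqrt(max(val, 0.0))  # clip tiny negative round-off

# Hypothetical mean vectors and covariance matrices.
mu, mu_hat = np.zeros(2), np.array([1.0, 0.0])
sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
sigma_hat = np.eye(2)

g = gelbrich_distance(mu, sigma, mu_hat, sigma_hat)
assert g > 0.0
# Matching first and second moments make the bound collapse to zero.
assert gelbrich_distance(mu, sigma, mu, sigma) < 1e-5
```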
Also, for any fixed 𝑓, it is optimal to push 𝑔 down such that for all 𝑧̂ ∈ Z we have
𝑔(𝑧̂) = sup_{𝑧∈Z} 𝑓(𝑧) − 1_{𝑑(𝑧,𝑧̂)>𝑟} ⟹ sup_{𝑧∈Z} 𝑓(𝑧) − 1 ≤ 𝑔(𝑧̂) ≤ sup_{𝑧∈Z} 𝑓(𝑧). (2.31b)
Combining the upper bound on 𝑔(ˆ𝑧 ) in (2.31b) with the upper bound on 𝑓 (𝑧)
in (2.31a) further implies that 𝑔(ˆ𝑧 ) ≤ sup 𝑧 ∈Z 𝑓 (𝑧) ≤ 1+inf 𝑧 ′ ∈Z 𝑔(𝑧′ ). At optimality,
(2.31a) and (2.31b) must hold simultaneously, and thus we have
inf_{𝑧′∈Z} 𝑔(𝑧′) ≤ 𝑓(𝑧) ≤ 1 + inf_{𝑧′∈Z} 𝑔(𝑧′) and inf_{𝑧′∈Z} 𝑔(𝑧′) ≤ 𝑔(𝑧̂) ≤ 1 + inf_{𝑧′∈Z} 𝑔(𝑧′)
for all 𝑧, 𝑧ˆ ∈ Z. Note that, as both P and P̂ are probability distributions, the objective
function of the dual optimal transport problem (2.30) remains invariant under the
substitutions 𝑓 (𝑧) ← 𝑓 (𝑧) − inf 𝑧 ′ ∈Z 𝑔(𝑧′ ) and 𝑔(ˆ𝑧 ) ← 𝑔(ˆ𝑧 ) − inf 𝑧 ′ ∈Z 𝑔(𝑧′ ). In the
following, we may thus assume without loss of generality that 0 ≤ 𝑓 (𝑧) ≤ 1 for
all 𝑧 ∈ Z and that 0 ≤ 𝑔(ˆ𝑧) ≤ 1 for all 𝑧ˆ ∈ Z.
As 𝑓 and 𝑔 are now normalized to [0, 1], they admit the integral representations
𝑓(𝑧) = ∫₀¹ 1_{𝑓(𝑧)≥𝜏} d𝜏 ∀𝑧 ∈ Z and 𝑔(𝑧̂) = ∫₀¹ 1_{𝑔(𝑧̂)≥𝜏} d𝜏 ∀𝑧̂ ∈ Z.
Next, one can show that 𝑓 and 𝑔 satisfy the constraints in (2.30) if and only if
1 𝑓 (𝑧)≥ 𝜏 − 1𝑔( 𝑧ˆ )≥ 𝜏 ≤ 1𝑑(𝑧, 𝑧ˆ )>𝑟 ∀𝑧, 𝑧ˆ ∈ Z, ∀𝜏 ∈ [0, 1]. (2.32)
Note first that (2.32) is trivially satisfied unless its left hand side evaluates to 1 and its
right hand side evaluates to 0. This happens if and only if 𝑓 (𝑧) ≥ 𝜏 and 𝑔(ˆ𝑧) < 𝜏 for
some 𝜏 ∈ [0, 1] and 𝑧, 𝑧ˆ ∈ Z with 𝑑(𝑧, 𝑧ˆ) ≤ 𝑟. This is impossible, however, because
it implies that 𝑓 (𝑧) − 𝑔(ˆ𝑧) > 0 for some 𝑧, 𝑧ˆ with 𝑑(𝑧, 𝑧ˆ) ≤ 𝑟, thus contradicting the
constraints in (2.30). Hence, the constraints in (2.30) imply (2.32). The converse
implication follows immediately from the integral representations of 𝑓 and 𝑔.
Finally, note that 1 𝑓 (𝑧)≥ 𝜏 and 1𝑔( 𝑧ˆ )≥ 𝜏 are the characteristic functions of the Borel
sets B = {𝑧 ∈ Z : 𝑓 (𝑧) ≥ 𝜏} and C = { 𝑧ˆ ∈ Z : 𝑔(ˆ𝑧 ) ≥ 𝜏}, respectively. Note also
that (2.32) holds if and only if C ⊇ B𝑟 . Recalling their integral representations, we
may thus conclude that the functions 𝑓 and 𝑔 are feasible in (2.30) if and only if
they represent convex combinations of (infinitely many) characteristic functions of
the form 1 𝑧 ∈B and 1 𝑧ˆ ∈C for some Borel sets B and C with C ⊇ B𝑟 . As the objective
function of (2.30) is linear in 𝑓 and 𝑔, its supremum does not change if we restrict
the feasible set to such characteristic functions. Hence, (2.30) reduces to
OT𝑐𝑟 (P, P̂) = sup {P(B) − P̂(C) : B, C ⊆ Z are Borel sets with C ⊇ B𝑟}.
Clearly, it is always optimal to set C = B𝑟 , and thus the claim follows.
While Proposition 2.22 follows from (Strassen 1965, Theorem 11), the proof
shown here parallels that of (Villani 2003, Theorem 1.27). As a byproduct, Pro-
position 2.22 reveals that the Lévy-Prokhorov distance is symmetric, which is not
evident from its definition. Thus, it constitutes indeed a metric.
The Lévy-Prokhorov ambiguity set of radius 𝑟 ≥ 0 around P̂ ∈ P(Z) is defined as
P = {P ∈ P(Z) : LP(P, P̂) ≤ 𝑟}.
For our purposes, the most important implication of Proposition 2.22 is that P can
be viewed as a special instance of an optimal transport ambiguity set, that is, we have
P = {P ∈ P(Z) : OT𝑐𝑟 (P, P̂) ≤ 𝑟}
for any radius 𝑟 ≥ 0. Lévy-Prokhorov ambiguity sets were first introduced in the
context of chance-constrained programming (Erdoğan and Iyengar 2006). They
also naturally emerge in data-driven decision-making and the training of robust
machine learning models (Pydi and Jog 2021, Bennouna and Van Parys 2023, Ben-
nouna, Lucas and Van Parys 2023). We close this section with a useful corollary,
which follows immediately from the last part of the proof of Proposition 2.22.
where the essential supremum of 𝑑(𝑍, 𝑍̂) under 𝛾 is given by
ess sup_𝛾 𝑑(𝑍, 𝑍̂) = inf_{𝜏∈R} {𝜏 : 𝛾(𝑑(𝑍, 𝑍̂) > 𝜏) = 0}.
Definition 2.25 makes sense because the ∞-Wasserstein distance can be obtained
from the 𝑝-Wasserstein distance in the limit when 𝑝 tends to infinity.
Proposition 2.26 (Givens and Shortt (1984)). For any P, P̂ ∈ P(Z) we have
W∞(P, P̂) = lim_{𝑝→∞} W 𝑝 (P, P̂) = sup_{𝑝≥1} W 𝑝 (P, P̂).
Minimizing both sides of this inequality across all 𝛾 ∈ Γ(P, P̂) further implies that
W 𝑝 (P, P̂) ≤ W∞ (P, P̂) for all 𝑝 ∈ [1, ∞). In summary, we may thus conclude that
lim_{𝑝→∞} W 𝑝 (P, P̂) = sup_{𝑝≥1} W 𝑝 (P, P̂) ≤ W∞(P, P̂).
It remains to be shown that the last inequality holds in fact as an equality. To see
this, fix some tolerance 𝜀 > 0. For any 𝑝 ∈ N, let 𝛾𝑝 ∈ Γ(P, P̂) be a coupling
with E_{𝛾𝑝}[𝑑(𝑍, 𝑍̂)^𝑝]^{1/𝑝} = W 𝑝 (P, P̂). Note that 𝛾𝑝 exists because, as we will
see in Corollary 3.16 and Proposition 3.3 below, Γ(P, P̂) is weakly compact and
E_𝛾[𝑑(𝑍, 𝑍̂)^𝑝] is weakly lower semicontinuous in 𝛾. Next, let {𝛾𝑝(𝑗)}_{𝑗∈N} be a
subsequence that converges weakly to some coupling 𝛾∞ ∈ Γ(P, P̂), which exists
again because Γ(P, P̂) is weakly compact. We proceed by case distinction.
Case 1. If ess sup_{𝛾∞}[𝑑(𝑍, 𝑍̂)] is finite, define the open set
B = {(𝑧, 𝑧̂) ∈ Z × Z : 𝑑(𝑧, 𝑧̂) > ess sup_{𝛾∞}[𝑑(𝑍, 𝑍̂)] − 𝜀},
and note that 𝛾∞ (B) > 0 by the definition of the essential supremum. We then find
W_{𝑝(𝑗)}(P, P̂) ≥ ( ∫_B 𝑑(𝑧, 𝑧̂)^{𝑝(𝑗)} d𝛾_{𝑝(𝑗)}(𝑧, 𝑧̂) )^{1/𝑝(𝑗)}
              ≥ 𝛾_{𝑝(𝑗)}(B)^{1/𝑝(𝑗)} ( ess sup_{𝛾∞}[𝑑(𝑍, 𝑍̂)] − 𝜀 )
              ≥ 𝛾_{𝑝(𝑗)}(B)^{1/𝑝(𝑗)} ( W∞(P, P̂) − 𝜀 ).
Since B is open and 𝛾 𝑝( 𝑗) converges weakly to 𝛾∞ as 𝑗 grows, the Portmanteau
theorem (Billingsley 2013, Theorem 2.1 (iv)) implies that lim inf_{𝑗→∞} 𝛾_{𝑝(𝑗)}(B) ≥
𝛾∞ (B) > 0. Thus, 𝛾 𝑝( 𝑗) (B)1/ 𝑝( 𝑗) converges to 1 as 𝑗 grows, and we obtain
lim_{𝑝→∞} W 𝑝 (P, P̂) ≥ W∞(P, P̂) − 𝜀.
As this inequality holds for any tolerance 𝜀 > 0, the above reasoning finally implies
that W 𝑝 (P, P̂) converges indeed to W∞ (P, P̂) for large 𝑝.
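This convergence is easy to visualize for empirical distributions on the real line, where the sorted (comonotone) coupling is optimal for every 𝑝. The sketch below uses synthetic samples and checks that W 𝑝 is non-decreasing in 𝑝 and bounded by W∞.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two empirical distributions with 500 equally weighted atoms each.
z = np.sort(rng.normal(0.0, 1.0, size=500))
z_hat = np.sort(rng.normal(1.0, 2.0, size=500))

# In one dimension the comonotone (sorted) coupling is optimal, so W_p^p is
# the p-th moment of the sorted differences and W_inf is their maximum.
diff = np.abs(z - z_hat)
w = [np.mean(diff ** p) ** (1.0 / p) for p in (1, 2, 4, 8, 16)]
w_inf = diff.max()

assert all(a <= b + 1e-12 for a, b in zip(w, w[1:]))   # W_p is nondecreasing in p
assert all(x <= w_inf + 1e-12 for x in w)              # and bounded by W_inf
```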
where the first equality holds because OT𝑐𝑟 (P, P̂) is non-negative and because the
underlying optimal transport problem is solvable. The second equality follows
from the definitions of 𝑐𝑟 and the ∞-Wasserstein distance.
Combining Proposition 2.27 with Corollary 2.23 immediately yields the follow-
ing equivalent characterization of the ∞-Wasserstein distance.
Corollary 2.28 (Givens and Shortt (1984)). The ∞-Wasserstein distance satisfies
W∞(P, P̂) = inf {𝑟 ≥ 0 : P(B) ≤ P̂(B𝑟) for all Borel sets B ⊆ Z},
called Fréchet inequalities. Note that these Fréchet inequalities can be obtained by
minimizing or maximizing the probability of the composite event over all distribu-
tions in a Fréchet ambiguity set with Bernoulli marginals. More recently, there has
been growing interest in generalized Fréchet inequalities, which bound the risk of
general (not necessarily Boolean) functions of 𝑍 with respect to all distributions
in a Fréchet ambiguity set with general (not necessarily Bernoulli) marginals. For
example, a wealth of Fréchet inequalities for the risk of a sum of random variables
have emerged in finance and risk management (Rüschendorf 1983, 1991, Embrechts
and Puccetti 2006, Wang and Wang 2011, Wang, Peng and Yang 2013, Puccetti
and Rüschendorf 2013, Van Parys, Goulart and Embrechts 2016a, Blanchet, Lam,
Liu and Wang 2024a). In addition, Natarajan, Song and Teo (2009b) derive sharp
bounds for the worst-case expectation of a piecewise affine function over a Fréchet
ambiguity set. We highlight that Fréchet ambiguity sets are also relevant because
they coincide with the feasible sets of multi-marginal optimal transport problems,
which can sometimes be solved in polynomial time (Pass 2015, Altschuler and
Boix-Adsera 2023, Natarajan, Padmanabhan and Ramachandra 2023).
General marginal ambiguity sets specify the marginal distributions of several
(possibly overlapping) subsets of the set {𝑍𝑖 : 𝑖 ∈ [𝑑]} of random variables. How-
ever, checking whether such an ambiguity set is non-empty is NP-complete even if
each 𝑍𝑖 is a Bernoulli random variable and each subset accommodates merely two
elements (Honeyman, Ladner and Yannakakis 1980, Georgakopoulos, Kavvadias
and Papadimitriou 1988). Computing worst-case expectations over marginal am-
biguity sets is thus intractable unless the subsets of random variables with known
marginals are disjoint (Doan and Natarajan 2012) or if the corresponding overlap
graph displays a running intersection property (Doan, Li and Natarajan 2015).
Marginal ambiguity sets are attractive because, given limited statistical data, it
is far easier to estimate low-dimensional marginals than their global dependence
structure. However, even univariate marginals cannot be estimated exactly. For this
reason, several researchers study marginal ambiguity sets that provide only limited
information about the marginals such as bounds on marginal moments or marginal
dispersion measures (Bertsimas et al. 2004, Bertsimas, Natarajan and Teo 2006a,b,
Chen, Sim, Sun and Teo 2010, Mishra, Natarajan, Tao and Teo 2012, Natarajan,
Sim and Uichanco 2018).
A related stream of literature focuses on ambiguity sets under which the ran-
dom variables 𝑍𝑖 , 𝑖 ∈ [𝑑], are independent and governed by ambiguous marginal
distributions. For example, the Hoeffding ambiguity set contains all joint distri-
butions on a box with independent (and completely unknown) marginals, whereas
the Bernstein ambiguity set contains all distributions from within the Hoeffding
ambiguity set subject to marginal moment bounds (Nemirovski and Shapiro 2007,
Hanasusanto et al. 2015a). Bernstein ambiguity sets that constrain the mean
as well as the mean-absolute deviation of each marginal are used to derive safe
tractable approximations for distributionally robust chance constrained programs
(Postek, Ben-Tal, den Hertog and Melenberg 2018), two-stage integer programs
(Postek et al. 2018, Postek, Romeijnders, den Hertog and van der Vlerk 2019), and
queueing systems (Wang, Prasad, Hanasusanto and Hasenbein 2024d).
DRO with marginal ambiguity sets has close connections to submodularity and
to the theory of comonotonicity in risk management (Tchen 1980, Rüschendorf
2013, Bach 2013, 2019, Natarajan et al. 2023, Long, Qi and Zhang 2024). It
has diverse applications ranging from discrete choice modeling (Natarajan et al.
2009b, Mishra, Natarajan, Padmanabhan, Teo and Li 2014,
Chen, Ma, Natarajan, Simchi-Levi and Yan 2022, Ruan, Li, Murthy and Natara-
jan 2022), queuing theory (van Eekelen, den Hertog and van Leeuwaarden 2022),
transportation (Wang, Chen and Liu 2020, Shehadeh 2023), chance constrained
programming (Xie, Ahmed and Jiang 2022), scheduling (Mak et al. 2015), invent-
ory management (Liu, Chen, Wang and Wang 2024a), the analysis of complex
networks (Chen, Padmanabhan, Lim and Natarajan 2020, Van Leeuwaarden and
Stegehuis 2021, Brugman et al. 2022) and mechanism design (Carroll 2017, Gravin
and Lu 2018, Chen et al. 2024a, Wang, Liu and Zhang 2024b, Wang 2024).
For further details we refer to the comprehensive monograph by Natarajan (2021).
all distributions P ∈ P(R𝑑 ) that are point symmetric about the origin. This means
that P(𝑍 ∈ B) = P(−𝑍 ∈ B) for every Borel set B ⊆ R𝑑 . One can then show
that all extreme distributions of P are representable as P𝜃 = ½𝛿_{+𝜃} + ½𝛿_{−𝜃} for
some 𝜃 ∈ R𝑑 . Thus, P admits a Choquet representation of the form (2.36). As
another example, let P be the family of all distributions P ∈ P(R𝑑 ) that are 𝛼-
unimodal about the origin for some 𝛼 > 0. This means that 𝑡 𝛼 P(𝑍 ∈ B/𝑡) is
non-decreasing in 𝑡 > 0 for every Borel set B ⊆ R𝑑 . One can then show that
every extreme distribution of P is a distribution P 𝜃 supported on the line segment
from 0 to 𝜃 ∈ R𝑑 with the property that P𝜃(‖𝑍‖₂ ≤ 𝑡‖𝜃‖₂) = 𝑡^𝛼 for all 𝑡 ∈ [0, 1].
Thus, P admits again a Choquet representation of the form (2.36). We remark that
𝑑-unimodal distributions on R𝑑 are also called star-unimodal. One readily verifies
that a distribution with a continuous probability density function is star-unimodal
if and only if the density function is non-increasing along each ray emanating from
the origin. In addition, one can show that the family of all 𝛼-unimodal distributions
converges—in a precise sense—to the family of all possible distributions on R𝑑
as 𝛼 tends to infinity. For more information on structural distribution families and
their Choquet representations we refer to (Dharmadhikari and Joag-Dev 1988).
The moment ambiguity sets of Section 2.1 are known to contain discrete distribu-
tions with only very few atoms; see Section 7. However, uncertainties encountered
in real physical, technical or economic systems are unlikely to follow such discrete
distributions. Instead, they are often expected to be unimodal. Hence, an effective
means to eliminate the pathological discrete distributions from a moment ambiguity
set is to intersect it with the structural ambiguity set of all 𝛼-unimodal distributions
for some 𝛼 > 0. Popescu (2005) combines ideas from Choquet theory and sums-
of-squares polynomial optimization to approximate worst-case expectations over
the resulting intersection ambiguity sets by a hierarchy of increasingly accurate
bounds, each of which is computed by solving a tractable semidefinite program.
Van Parys, Goulart and Kuhn (2016b) and Van Parys, Goulart and Morari (2019)
extend this approach and establish exact semidefinite programming reformula-
tions for the worst-case probability of a polyhedron and the worst-case conditional
value-at-risk of a piecewise linear convex loss function across all 𝛼-unimodal dis-
tributions in a Chebyshev ambiguity set; see also (Hanasusanto, Roitch, Kuhn
and Wiesemann 2015b). Li, Jiang and Mathieu (2019a) demonstrate that these
semidefinite programming reformulations can sometimes be simplified to highly
tractable second-order cone programs. Complementing moment information with
structural information generally leads to less conservative DRO models as Li, Ji-
ang and Mathieu (2016) demonstrate in the context of a power system application.
Lam, Liu and Zhang (2021) consider another basic notion of distributional shape
known as orthounimodality and build a corresponding Choquet representation to
address multivariate extreme event estimation. More recently, Lam, Liu and Sing-
ham (2024) combine Choquet theory with importance sampling and likelihood
ratio techniques for modeling distribution shapes.
for all distributions P, P̂ ∈ P(Z) under which all test functions 𝑓 ∈ F are integrable.
The underlying maximization problem probes how well the test functions can
distinguish P from P̂. By construction, DF constitutes a pseudo-metric, that is, it is
non-negative and symmetric (because F = −F), vanishes if its arguments match,
and satisfies the triangle inequality. In addition, DF becomes a proper metric
if F separates distributions, in which case DF (P, P̂) vanishes only if P = P̂. The
ambiguity set of radius 𝑟 ≥ 0 around P̂ ∈ P(Z) with respect to DF is defined as
P = {P ∈ P(Z) : DF (P, P̂) ≤ 𝑟}.
The proof of Proposition 2.11 reveals that the total variation distance is the in-
tegral probability metric generated by all Borel functions 𝑓 : Z → [−1/2, 1/2];
see (2.16). The Kantorovich-Rubinstein duality established in Corollary 2.19
further shows that the 1-Wasserstein distance is the integral probability metric
generated by all Lipschitz continuous functions 𝑓 : Z → R with lip( 𝑓 ) ≤ 1. In
addition, if H is a reproducing kernel Hilbert space of Borel functions 𝑓 : Z → R
with Hilbert norm ‖·‖H, then the maximum mean discrepancy distance corresponding
to H is the integral probability metric generated by the standard unit ball
F = {𝑓 ∈ H : ‖𝑓‖H ≤ 1} in H. Maximum mean discrepancy ambiguity sets
are studied in (Staib and Jegelka 2019, Zhu, Jitkrittum, Diehl and Schölkopf 2020,
2021, Zeng and Lam 2022, Iyengar, Lam and Wang 2022). Husain (2020) un-
covers a deep connection between DRO problems and regularized empirical risk
minimization problems, which holds whenever the ambiguity set is defined via an
integral probability metric.
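A minimal sketch of the maximum mean discrepancy just described, using a Gaussian RBF kernel on synthetic one-dimensional samples; the kernel and bandwidth are illustrative choices, and the estimator below is the standard biased (V-statistic) form.

```python
import numpy as np

def mmd_rbf(x, y, bandwidth=1.0):
    # Biased (V-statistic) estimate of the maximum mean discrepancy between
    # samples x ~ P and y ~ Phat for the Gaussian RBF kernel (bandwidth assumed).
    def k(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * bandwidth ** 2))
    val = k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()
    return np.sqrt(max(val, 0.0))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=300)
y = rng.normal(0.5, 1.0, size=300)

assert mmd_rbf(x, x) == 0.0           # identical samples are indistinguishable
assert mmd_rbf(x, y) > mmd_rbf(x, x)  # a mean shift is detected
```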
There is a close link between the continuity properties of the expected value
of 𝑓 (𝑍) with respect to the distribution P and the continuity properties of 𝑓 . Recall
that a function 𝐹 : P(Z) → R is weakly continuous if lim𝑖→∞ 𝐹(P𝑖 ) = 𝐹(P) for
every sequence P𝑖 ∈ P(Z), 𝑖 ∈ N, that converges weakly to P. Weak lower and
upper semicontinuity are defined analogously in the obvious way.
Proposition 3.3 (Continuity of Expected Values). If 𝑓 : Z → [−∞, +∞] is lower
semicontinuous and bounded from below, then E P [ 𝑓 (𝑍)] is weakly lower semicon-
Here, both the second and the last equality follow from the monotone convergence
theorem, which applies because each 𝑓𝑖 is bounded and thus integrable with respect
to any probability distribution and because the 𝑓𝑖 , 𝑖 ∈ N, form a non-decreasing
sequence of non-negative functions. The inequality follows from the interchange
of the supremum over 𝑖 and the infimum over 𝑗, and the third equality holds
because P 𝑗 converges weakly to P and because 𝑓𝑖 is continuous and bounded. This
shows that E P [ 𝑓 (𝑍)] is weakly lower semicontinuous in P.
The proofs of the assertions regarding weak upper semicontinuity and weak
continuity are analogous and therefore omitted for brevity.
In the following we equip the family P(Z) of all probability distributions on Z
with the weak topology, which is generated by the open sets
𝑈_{𝑓,𝛿}(P′) = {P ∈ P(Z) : |E_P[𝑓(𝑍)] − E_{P′}[𝑓(𝑍)]| < 𝛿}
encoded by any distribution P′ ∈ P(Z), any continuous bounded function 𝑓 : Z → R and any tolerance 𝛿 > 0. The
weak topology on P(Z) is metrized by the Prokhorov metric (Billingsley 2013,
Theorem 6.8), and therefore the notions of sequential compactness and compactness
are equivalent on P(Z); see, e.g., (Munkres 2000, Theorem 28.2).
Definition 3.4 (Tightness). A family P ⊆ P(Z) of distributions is tight if for any
tolerance 𝜀 > 0 there is a compact set C ⊆ Z with P(𝑍 ∉ C) ≤ 𝜀 for all P ∈ P.
A classical result by Prokhorov asserts that a distribution family is weakly
compact if and only if it is tight and weakly closed. Prokhorov’s theorem is the key
tool to show that an ambiguity set is weakly compact. We state it without proof.
Theorem 3.5 (Billingsley (2013, Theorem 5.1)). A family P ⊆ P(Z) of distribu-
tions is weakly compact if and only if it is tight as well as weakly closed.
In the following we revisit the ambiguity sets of Section 2 one by one and
determine under what conditions they are tight, weakly closed and weakly compact.
where the equality holds because P 𝑗 is supported on Z for every 𝑗 ∈ N, and the first
inequality follows from weak lower semicontinuity. This implies that P ∈ P(Z),
and thus P(Z) is weakly closed. Conversely, assume that P(Z) is weakly closed,
and consider a sequence 𝑧 𝑗 ∈ Z, 𝑗 ∈ N, converging to 𝑧. Then, the sequence of
Dirac distributions 𝛿 𝑧 𝑗 , 𝑗 ∈ N, converges weakly to 𝛿 𝑧 , and thus we find
0 = lim inf_{𝑗→∞} E_{𝛿_{𝑧_𝑗}}[𝛿_Z(𝑍)] ≥ E_{𝛿_𝑧}[𝛿_Z(𝑍)] ≥ 0.
Here, the first inequality holds again because E P [𝛿Z (𝑍)] is weakly lower semicon-
tinuous in P. This implies that E 𝛿𝑧 [𝛿Z (𝑍)] = 0, which holds if and only if 𝑧 ∈ Z.
Thus, Z is closed. Given these insights, the claim follows from Theorem 3.5.
By using Proposition 3.6, we can now show that a moment ambiguity set of the
form (2.1) is weakly compact whenever the underlying support set Z is compact,
the moment function 𝑓 is continuous and the uncertainty set F is closed.
Proposition 3.7 (Moment Ambiguity Sets). If Z ⊆ R𝑑 is a compact support set,
𝑓 : Z → R𝑚 is a continuous moment function and F ⊆ R𝑚 is a closed uncertainty
set, then the moment ambiguity set P defined in (2.1) is weakly compact.
where the inequality holds because the quadratic function 𝑞(𝑧) = ‖𝑧‖₂² · 𝜀/𝑑 majorizes the characteristic function of Z\C. Hence, P is indeed tight. However, P
is not necessarily weakly closed. To see this, suppose that 𝑑 = 1 and that Z = R.
In this case the distributions P_𝑖 = 1/(2𝑖²) 𝛿_{−𝑖} + (𝑖² − 1)/𝑖² 𝛿₀ + 1/(2𝑖²) 𝛿_𝑖 have zero mean and unit
variance for all 𝑖 ∈ N. That is, they all belong to P. However, they converge weakly
to P = 𝛿0 , which is not an element of P. Thus, P fails to be weakly compact.
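The three-point distributions above can be checked numerically. The following sketch (illustrative only) verifies that every P_𝑖 has the prescribed moments while E_{P_𝑖}[𝑓(𝑍)] approaches 𝑓(0) for a bounded continuous test function, witnessing the weak convergence to 𝛿₀.

```python
import numpy as np

# P_i puts mass 1/(2 i^2) on each of -i and i, and mass (i^2 - 1)/i^2
# on 0. Every P_i has zero mean and unit variance, yet the sequence
# converges weakly to delta_0, which has variance 0.

def distribution(i):
    atoms = np.array([-float(i), 0.0, float(i)])
    probs = np.array([1 / (2 * i**2), (i**2 - 1) / i**2, 1 / (2 * i**2)])
    return atoms, probs

f = np.cos  # an arbitrary bounded continuous test function with f(0) = 1

gaps = []
for i in [2, 10, 1000]:
    atoms, probs = distribution(i)
    mean = probs @ atoms
    var = probs @ atoms**2 - mean**2
    assert abs(mean) < 1e-9 and abs(var - 1.0) < 1e-9
    # |E_{P_i}[f(Z)] - f(0)| shrinks as i grows
    gaps.append(abs(probs @ f(atoms) - f(0.0)))

print(gaps)  # decreasing towards 0, witnessing weak convergence to delta_0
```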
The family of all distributions on R𝑑 with bounded 𝑝-th-order moments is always
weakly compact even though ambiguity sets that fix the 𝑝-th-order moments to
prescribed values (e.g., the Chebyshev ambiguity set) may not be weakly compact.
for all 𝑡 ∈ [0, 1]. In the remainder we show that 𝑝 satisfies all desired properties.
By construction, 𝑝 depends only on 𝜙 and 𝑟 and coincides with the lower envelope
of infinitely many linear functions in 𝑡. Hence, 𝑝 is concave as well as upper
semicontinuous. By the definition of P and by Theorem 4.15 below, we also have
sup_{P∈P} P(𝑍 ∈ B) = sup_{P∈P(Z)} { E_P[1_B(𝑍)] : D_𝜙(P, P̂) ≤ 𝑟 }
                   = inf_{𝜆₀∈R, 𝜆∈R₊} 𝜆₀ + 𝜆𝑟 + E_P̂[(𝜙*)^𝜋(1_B(𝑍) − 𝜆₀, 𝜆)]      (3.1)
                   = 𝑝(P̂(𝑍 ∈ B)),
for any Borel set B, where the last equality follows from the definition of 𝑝. As
the worst-case probability on the left hand side of (3.1) falls within [0, 1] and as
P̂(𝑍 ∈ B) can adopt any value in [0, 1], it is clear that the range of 𝑝 is a subset
of [0, 1]. Next, we show that 𝑝 is continuous. To this end, note that the concavity
and finiteness of 𝑝 on [0, 1] imply via (Rockafellar 1970, Theorem 10.1) that 𝑝 is
continuous on (0, 1). In addition, its upper semicontinuity prevents 𝑝 from jumping
at 0 or at 1. Thus, 𝑝 is indeed continuous throughout [0, 1]. Finally, setting B = ∅
or B = Z in (3.1) shows that 𝑝(0) = 0 and 𝑝(1) = 1, respectively. Consequently,
we may conclude that 𝑝 is surjective. This observation completes the proof.
where the inequality follows from the monotonicity of 𝑝 and the choice of 𝑅. We have
thus shown that P(𝑍 ∉ C) ≤ 𝜀 for all P ∈ P, and thus P is tight.
It remains to be shown that P is weakly closed. To this end, recall first that P(Z)
is weakly closed because Z is closed; see Proposition 3.6. Next, recall from
Proposition 2.6 that any 𝜙-divergence admits a dual representation of the form
D_𝜙(P, P̂) = sup_{𝑓∈F} ∫_Z 𝑓(𝑧) dP(𝑧) − ∫_Z 𝜙*(𝑓(𝑧)) dP̂(𝑧),      (3.2)
where we may assume without loss of generality that the dominating measure 𝜌 ∈
M+ (Z) is given by 𝜌 = P + P̂. This establishes the desired upper bound. It remains
to be shown that this bound is attained even if 𝜙(0) or 𝜙∞ (1) evaluate to infinity.
To this end, suppose that P and P̂ are mutually singular. This means that there exist
disjoint Borel sets B, B̂ ⊆ Z with P(𝑍 ∈ B) = 1 and P̂(𝑍 ∈ B̂) = 1. We thus have
D_𝜙(P, P̂) = ∫_{B̂} (dP̂/d𝜌)(𝑧) 𝜙(0) d𝜌(𝑧) + ∫_B 0 𝜙((dP/d𝜌)(𝑧)/0) d𝜌(𝑧)
           = 𝜙(0) + ∫_B 𝜙^∞((dP/d𝜌)(𝑧)) d𝜌(𝑧) = 𝜙(0) + 𝜙^∞(1).
The first equality holds because (dP/d𝜌)(𝑧) = 0 for 𝜌-almost all 𝑧 ∈ B̂ and (dP̂/d𝜌)(𝑧) = 0 for 𝜌-almost all 𝑧 ∈ B. The second equality follows from the definition of the
perspective function and exploits that the restriction of 𝜌 to B̂ coincides with P̂.
The third equality, finally, holds because the restriction of 𝜌 to B coincides with P.
Note that the upper bound is attained even if 𝜙(0) = ∞ or 𝜙∞ (1) = ∞.
The following example reveals that 𝜙-divergence ambiguity sets fail to be weakly
compact if 𝜙∞ (1) < ∞ and if the set Z without the atoms of P̂ is unbounded.
Example 3.14 (𝜙-Divergence Ambiguity Sets). Consider an entropy function 𝜙 with 𝜙^∞(1) < ∞. By Lemma 3.13, D_𝜙(P, P̂) is bounded above by 𝑟̄ = 𝜙(0) + 𝜙^∞(1) for all P, P̂ ∈ P(Z). In addition, let P be the 𝜙-divergence ambiguity set with center P̂ ∈ P(Z) and radius 𝑟 ∈ (0, 𝑟̄) defined in (2.10). Assume that for every
𝑅 > 0 there exists 𝑧0 ∈ Z with k𝑧0 k 2 ≥ 𝑅 and P̂(𝑍 = 𝑧0 ) = 0. This assumption
holds, for example, whenever Z is unbounded and convex, and it implies that P
fails to be tight. To see this, fix an arbitrary compact set C ⊆ Z, and select
any point 𝑧0 ∈ Z\C with P̂(𝑍 = 𝑧0 ) = 0. Such a point exists by assumption.
Next, consider the distributions P 𝜃 = (1 − 𝜃) P̂ + 𝜃 𝛿 𝑧0 parametrized by 𝜃 ∈ [0, 1].
Note that P̂ and 𝛿_{𝑧₀} are mutually singular and that 𝑓(𝜃) = D_𝜙(P_𝜃, P̂) is a convex continuous bijective function from [0, 1] to [0, 𝑟̄]. Set now 𝜀 = 𝑓⁻¹(𝑟)/2. For 𝜃 = 𝑓⁻¹(𝑟), the distribution P_𝜃 satisfies D_𝜙(P_𝜃, P̂) = 𝑓(𝑓⁻¹(𝑟)) = 𝑟 and thus belongs to P. In addition, P_𝜃(𝑍 ∉ C) ≥ 𝑓⁻¹(𝑟) > 𝜀 because 𝑧₀ ∉ C. Note that 𝜀
is independent of C and 𝑧0 as long as P̂(𝑍 = 𝑧0 ) = 0. As the compact set C was
chosen arbitrarily, this implies that P fails to be tight and weakly compact.
Proof. We first show that the Fréchet ambiguity set is tight. For any 𝜀 > 0 and
𝑖 ∈ [𝑑], we can set 𝑧̲_𝑖 and 𝑧̄_𝑖 to the 𝜀/(2𝑑)-quantile and the (1 − 𝜀/(2𝑑))-quantile of the distribution function 𝐹_𝑖, respectively. Setting C = ×_{𝑖∈[𝑑]} [𝑧̲_𝑖, 𝑧̄_𝑖] yields
P(𝑍 ∉ C) ≤ Σ_{𝑖∈[𝑑]} P(𝑍_𝑖 ∉ [𝑧̲_𝑖, 𝑧̄_𝑖]) ≤ Σ_{𝑖∈[𝑑]} 𝜀/𝑑 = 𝜀,
where the inequality follows from the union bound. Thus, P is tight. It remains to
be shown that P is weakly closed. Note that the distribution function of 𝑍𝑖 under P
matches 𝐹𝑖 if and only if for every bounded continuous function 𝑓 we have
E_P[𝑓(𝑍_𝑖)] = ∫_{−∞}^{+∞} 𝑓(𝑧_𝑖) d𝐹_𝑖(𝑧_𝑖).
This is true because every Borel distribution on R constitutes a Radon measure. The
set of all P ∈ P(R𝑑 ) satisfying the above equality for any fixed bounded and con-
tinuous function 𝑓 and any fixed index 𝑖 ∈ [𝑑] is weakly closed by Proposition 3.3.
Hence, P is weakly closed because closedness is preserved by intersection.
Lemma 3.17 allows us to prove that the optimal transport discrepancy OT𝑐 (P, P̂)
constitutes a weakly lower semicontinuous function of its inputs P and P̂.
Lemma 3.18 (Weak Lower Semicontinuity of Optimal Transport Discrepancies).
The optimal transport discrepancy OT𝑐 (P, P̂) is weakly lower semicontinuous
jointly in P and P̂.
Proof. Assume that P 𝑗 and P̂ 𝑗 , 𝑗 ∈ N, converge weakly to P and P̂, respectively,
and define the countable ambiguity sets P = {P 𝑗 } 𝑗 ∈N and P̂ = {P̂ 𝑗 } 𝑗 ∈N . By the
definition of sequential compactness, the weak closures of P and P̂ are weakly
compact. Prokhorov’s theorem (see Theorem 3.5) thus implies that both P and P̂
are tight. Hence, for any 𝜀 > 0 there exist two compact sets C, Cˆ ⊆ R𝑑 with
P_𝑗(𝑍 ∉ C) ≤ 𝜀/2 and P̂_𝑗(𝑍̂ ∉ Ĉ) ≤ 𝜀/2 for all 𝑗 ∈ N.
Whenever 𝛾 ∈ Γ(P_𝑗, P̂_𝑗) for some 𝑗 ∈ N, we thus have
𝛾((𝑍, 𝑍̂) ∉ C × Ĉ) ≤ P_𝑗(𝑍 ∉ C) + P̂_𝑗(𝑍̂ ∉ Ĉ) ≤ 𝜀.
As C × Ĉ is compact and as 𝜀 was chosen arbitrarily, this reveals that the union
⋃_{𝑗∈N} Γ(P_𝑗, P̂_𝑗)      (3.3)
is tight, which in turn implies via Prokhorov’s theorem that its closure is weakly
compact. Let now 𝛾★𝑗 be an optimal coupling of P 𝑗 and P̂ 𝑗 , which solves prob-
lem (2.18), and which exists thanks to Lemma 3.17. As all these optimal couplings
belong to some weakly compact set (i.e., the weak closure of (3.3)), we may assume
without loss of generality that 𝛾★𝑗, 𝑗 ∈ N, converges weakly to some distribution 𝛾.
Otherwise, we can pass to a subsequence. Clearly, we have 𝛾 ∈ Γ(P, P̂). For 𝛾★ an
optimal coupling of P and P̂, we then find
lim inf_{𝑗→∞} OT_𝑐(P_𝑗, P̂_𝑗) = lim inf_{𝑗→∞} E_{𝛾★_𝑗}[𝑐(𝑍, 𝑍̂)] ≥ E_𝛾[𝑐(𝑍, 𝑍̂)] ≥ E_{𝛾★}[𝑐(𝑍, 𝑍̂)] = OT_𝑐(P, P̂),
where the two equalities follow from the definitions of 𝛾★𝑗 and 𝛾★, respectively.
The first inequality holds because E_𝛾[𝑐(𝑍, 𝑍̂)] is weakly lower semicontinuous in 𝛾
thanks to Proposition 3.3, and the second inequality follows from the suboptimality
of 𝛾 in (2.18). Thus, OT𝑐 (P, P̂) is weakly lower semicontinuous in P and P̂.
Lemma 3.18 is inspired by (Clément and Desch 2008, Lemma 5.2) and (Yue, Kuhn and Wiesemann 2022, Theorem 1). Next, we prove that Wasserstein ambiguity sets are weakly compact. Throughout this discussion we assume that the metric underlying the transportation cost function is induced by a norm ‖·‖ on R^𝑑. This assumption simplifies our derivations but could be relaxed. Recall that the 𝑝-Wasserstein distance W_𝑝(P, P̂) for 𝑝 ≥ 1 is the 𝑝-th root of OT_𝑐(P, P̂), where the transportation cost function is set to 𝑐(𝑧, 𝑧̂) = ‖𝑧 − 𝑧̂‖^𝑝; see Definition 2.18.
Theorem 3.19 (𝑝-Wasserstein Ambiguity Sets). Assume that the metric 𝑑(·, ·) on Z is induced by some norm ‖·‖ on the ambient space R^𝑑. If P̂ ∈ P(Z) has finite 𝑝-th moments (i.e., E_P̂[‖𝑍‖^𝑝] < ∞) for some exponent 𝑝 ≥ 1, then the 𝑝-Wasserstein ambiguity set P defined in (2.28) is weakly compact.
Proof. We first show that all distributions P ∈ P have uniformly bounded 𝑝-th moments. To this end, set 𝑟̂ = E_P̂[‖𝑍‖^𝑝]^{1/𝑝} < ∞, and note that any P ∈ P satisfies
E_P[‖𝑍‖^𝑝]^{1/𝑝} = W_𝑝(P, 𝛿₀) ≤ W_𝑝(P, P̂) + W_𝑝(P̂, 𝛿₀) = W_𝑝(P, P̂) + E_P̂[‖𝑍‖^𝑝]^{1/𝑝} ≤ 𝑟 + 𝑟̂.
Here, the first inequality holds because the 𝑝-Wasserstein distance is a metric and thus satisfies the triangle inequality, and the second inequality holds because P ∈ P. We therefore have E_P[‖𝑍‖^𝑝] ≤ (𝑟 + 𝑟̂)^𝑝 for every P ∈ P. In other words, the
Wasserstein ball P is a subset of the 𝑝-th-order moment ambiguity set discussed
in Example 3.10. This implies that P is tight. Note further that P is defined as a
sublevel set of the function 𝑓 (P) = W 𝑝 (P, P̂), which is weakly lower semicontinuous
thanks to Lemma 3.18. Hence, P is weakly closed.
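In one dimension the 𝑝-Wasserstein distance between two discrete distributions with equally many, equally weighted atoms reduces to matching sorted atoms (a standard fact for 𝑑 = 1), which makes the moment bound in the proof easy to check numerically. A sketch under these assumptions, with illustrative data:

```python
import numpy as np

def w_p(x, y, p):
    """p-Wasserstein distance between the uniform empirical distributions
    of x and y (same size), computed by matching sorted atoms (valid in 1-D)."""
    return (np.mean(np.abs(np.sort(x) - np.sort(y)) ** p)) ** (1 / p)

rng = np.random.default_rng(1)
p = 2
phat = rng.normal(0.0, 1.0, size=400)        # atoms of the center Phat
pp = phat + rng.normal(0.5, 0.2, size=400)   # atoms of a nearby P

# Moment bound from the proof of Theorem 3.19:
# E_P[|Z|^p]^{1/p} = W_p(P, delta_0) <= W_p(P, Phat) + E_Phat[|Z|^p]^{1/p}
lhs = np.mean(np.abs(pp) ** p) ** (1 / p)
rhs = w_p(pp, phat, p) + np.mean(np.abs(phat) ** p) ** (1 / p)
print(lhs, rhs)  # lhs <= rhs
```

The inequality is the triangle inequality in disguise, so it holds for any choice of the two samples.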
Finally, we prove that the ∞-Wasserstein ambiguity set is always weakly compact.
Corollary 3.20 (∞-Wasserstein Ambiguity Sets). Assume that the metric 𝑑(·, ·)
on Z is induced by some norm k · k on the ambient space R𝑑 . Then, the ∞-
Wasserstein ambiguity set defined in (2.34) is weakly compact for every P̂ ∈ P(Z).
Proof. We first show that P is tight. To this end, select any 𝜀 > 0 and any compact set Ĉ ⊆ Z with P̂(𝑍 ∉ Ĉ) ≤ 𝜀. Note that Ĉ is guaranteed to exist because P̂ is a probability distribution. Next, define C as the 𝑟-neighborhood Ĉ_𝑟 of Ĉ, that is, set
C = {𝑧 ∈ Z : ∃𝑧̂ ∈ Ĉ with ‖𝑧 − 𝑧̂‖ ≤ 𝑟};
see also (2.29). One readily verifies that C inherits compactness from Ĉ. Any distribution P ∈ P satisfies W_∞(P, P̂) ≤ 𝑟. Consequently, we find
P(𝑍 ∉ C) = P(𝑍 ∈ Z\C) ≤ P̂(𝑍 ∈ Z\Ĉ) = P̂(𝑍 ∉ Ĉ) ≤ 𝜀,
where the first inequality follows from Corollary 2.28 and the observation that the 𝑟-neighborhood of Z\C coincides with Z\Ĉ. The second inequality follows from the definition of Ĉ. As 𝜀 was chosen arbitrarily, P is tight. It remains to be shown
that P is weakly closed. Proposition 2.26 readily implies that W∞ (P, P̂) ≤ 𝑟 if and
only if W 𝑝 (P, P̂) ≤ 𝑟 for all 𝑝 ≥ 1. Thus, we may conclude that
P = ⋂_{𝑝≥1} {P ∈ P(R^𝑑) : W_𝑝(P, P̂) ≤ 𝑟}.
That is, the ∞-Wasserstein ambiguity set can be expressed as the intersection of all 𝑝-Wasserstein ambiguity sets for 𝑝 ≥ 1, all of which are weakly closed by Theorem 3.19. Hence, P is indeed weakly closed, and the claim follows.
Note that P represents a convex subset of the linear space of all finite signed Borel
measures on Z. Unless Z is finite, (4.1) thus constitutes an infinite-dimensional
convex program with a linear objective function. For this problem to be well-
defined, we assume that ℓ : Z → R is a Borel function. In line with (Rockafellar
and Wets 2009, Section 14.E), we define E P [ℓ(𝑍)] = −∞ if E P [max{ℓ(𝑍), 0}] = ∞
and E P [min{ℓ(𝑍), 0}] = −∞. This means that infeasibility trumps unboundedness.
More generally, throughout the rest of the paper, we assume that if the objective
function of a minimization (maximization) problem can be expressed as the dif-
ference of two terms, both of which evaluate to ∞, then the objective function
value should be interpreted as ∞ (−∞). This convention is in line with the rules of
extended arithmetic used in (Rockafellar and Wets 2009).
In the remainder we will show that (4.1) can be dualized by using elementary tools
from finite-dimensional convex analysis (Fenchel 1953, Rockafellar 1970) for a
broad class of finitely-parametrized ambiguity sets including all moment ambiguity
sets (Section 4.2), 𝜙-divergence ambiguity sets (Section 4.3) and optimal transport
ambiguity sets (Section 4.4). We broadly adopt the proof strategies developed
by Shapiro (2001) and Zhang et al. (2024b) for moment and optimal transport
ambiguity sets, respectively, and we extend them to 𝜙-divergence ambiguity sets.
set. By the definitions of the epigraph and the infimum operator, we find
epi(ℎ) = {(𝑢, 𝑡) ∈ U × R : ℎ(𝑢) ≤ 𝑡}
       = {(𝑢, 𝑡) ∈ U × R : ∀𝜀 > 0 ∃𝑣 ∈ V with 𝐻(𝑢, 𝑣) ≤ 𝑡 + 𝜀}
       = ⋂_{𝜀>0} {(𝑢, 𝑡) ∈ U × R : ∃𝑣 ∈ V with 𝐻(𝑢, 𝑣) − 𝜀 ≤ 𝑡}.
Thus, epi(ℎ) can be obtained by intersecting the projections of the sets epi(𝐻 − 𝜀), 𝜀 > 0, onto U × R. The claim then follows because epi(𝐻 − 𝜀) is convex for every 𝜀 > 0 thanks to the convexity of 𝐻 and because convexity is preserved under intersections and linear transformations;
see, e.g., (Rockafellar 1970, Theorems 2.1 & 5.7).
The following result marks a cornerstone of convex analysis. It states that the
biconjugate ℎ∗∗ (that is, the conjugate of ℎ∗ ) of a closed convex function ℎ coincides
with ℎ. Here, we adopt the standard convention that ℎ is closed if it is lower semi-
continuous and either ℎ(𝑢) > −∞ for all 𝑢 ∈ U or ℎ(𝑢) = −∞ for all 𝑢 ∈ U . We
use cl(ℎ) to denote the closure of ℎ, that is, the largest closed function below ℎ.
Lemma 4.2 (Fenchel–Moreau Theorem). For any convex function ℎ : R𝑑 → R,
we have ℎ ≥ ℎ∗∗ . The inequality becomes an equality on rint(dom(ℎ)).
Proof. By (Rockafellar 1970, Theorem 12.2), we have ℎ∗∗ = cl(ℎ) ≤ ℎ. In addition,
(Rockafellar 1970, Theorem 10.1) ensures that the convex function ℎ is continuous
on rint(dom(ℎ)) and thus coincides with cl(ℎ) there. Hence, the claim follows.
The main idea for dualizing the worst-case expectation problem (4.1) is to
represent its optimal value as −ℎ(𝑢), where ℎ(𝑢) = inf 𝑣 ∈V 𝐻(𝑢, 𝑣), U is a finite-
dimensional space of parameters 𝑢 that encode the ambiguity set P (such as a set of
prescribed moments or a size parameter), and V is an infinite-dimensional space of
finite signed measures on Z. In addition, 𝐻(𝑢, 𝑣) represents the negative expected
loss if the signed measure 𝑣 happens to be a probability measure in P ⊆ V and
evaluates to ∞ otherwise. If 𝐻(𝑢, 𝑣) is jointly convex in 𝑢 and 𝑣, then ℎ(𝑢) is
convex by virtue of Lemma 4.1. A problem dual to (4.1) can then be constructed
from the bi-conjugate ℎ∗∗ (𝑢). Lemma 4.2 provides conditions for strong duality.
If additionally E P [ℓ(𝑍)] > −∞ for every P ∈ P 𝑓 (Z), then ℎ∗∗ and ℎ match on the
cone generated by {1} × rint(C) except at the origin.
This establishes the desired formula for the bi-conjugate of ℎ. Assume now that
E P [ℓ(𝑍)] > −∞ for every P ∈ P 𝑓 (Z). It remains to be shown that ℎ(𝑢0 , 𝑢) =
ℎ∗∗ (𝑢0 , 𝑢) for all (𝑢0 , 𝑢) ≠ (0, 0) in the cone generated by {1} × rint(C). However,
this follows immediately from Lemma 4.2 and the observation that
rint(dom(ℎ)) = rint(cone({1} × C)) = cone({1} × rint(C))\{(0, 0)},
where the two equalities hold because of Lemma 4.3 and (Rockafellar 1970, Co-
rollary 6.8.1), respectively. Therefore, the claim follows.
Proposition 4.4 implies that ℎ(1, 𝑢) = ℎ∗∗ (1, 𝑢) for all 𝑢 ∈ rint(C). The following
main theorem exploits this relation to convert the maximization problem on the
right hand side of (4.2) to an equivalent dual minimization problem.
Theorem 4.5 (Duality Theory for Moment Ambiguity Sets). If P is the moment
ambiguity set (2.1), then the following weak duality relation holds.
sup_{P∈P} E_P[ℓ(𝑍)] ≤  inf   𝜆₀ + 𝛿*_F(𝜆)
                      s.t.  𝜆₀ ∈ R, 𝜆 ∈ R^𝑚                                    (4.4)
                            𝜆₀ + 𝑓(𝑧)^⊤𝜆 ≥ ℓ(𝑧)  ∀𝑧 ∈ Z.
If E P [ℓ(𝑍)] > −∞ for all P ∈ P 𝑓 (Z) and F ⊆ C is a convex and compact set with
rint(F) ⊆ rint(C), then strong duality holds, that is, (4.4) becomes an equality.
Here, the first inequality exploits Proposition 4.4 and Lemma 4.2, which ensures
that ℎ ≥ ℎ∗∗ , and the second inequality holds thanks to the max-min inequality.
The last equality follows from the definition of the support function 𝛿F ∗ . This
establishes the weak duality relation (4.4). Next, suppose that F is a convex
compact set with rint(F) ⊆ rint(C). Under this additional assumption, we have
sup_{P∈P} E_P[ℓ(𝑍)] = sup_{𝑢∈F} −ℎ(1, 𝑢) = sup_{𝑢∈rint(F)} −ℎ(1, 𝑢)
                    = sup_{𝑢∈rint(F)} inf_{(𝜆₀,𝜆)∈L} 𝜆₀ + 𝑢^⊤𝜆
                    = sup_{𝑢∈F} inf_{(𝜆₀,𝜆)∈L} 𝜆₀ + 𝑢^⊤𝜆 = inf_{(𝜆₀,𝜆)∈L} 𝜆₀ + 𝛿*_F(𝜆),
where the first equality exploits (4.2). The second equality follows from two obser-
vations. First, rint(F) is non-empty and convex (Rockafellar 1970, Theorem 6.2).
Second, −ℎ(1, 𝑢) is concave in 𝑢, which ensures that −ℎ(1, 𝑢) cannot jump up on
the boundary of its domain C and—in particular—on the boundary of F ⊆ C.
Taken together, these observations imply that we can restrict F to rint(F) without
reducing the supremum. The third equality follows from Proposition 4.4, which
allows us to replace ℎ with ℎ∗∗ on rint(F) ⊆ rint(C). The fourth equality holds
because −ℎ∗∗ (1, 𝑢) is concave in 𝑢, which allows us to change rint(F) back to F.
Finally, the fifth equality follows from Sion’s minimax theorem (Sion 1958, The-
orem 4.2), which applies because F is convex and compact, L is convex and
𝜆 0 + 𝑢⊤ 𝜆 is biaffine in 𝑢 and (𝜆 0 , 𝜆). Therefore, strong duality holds.
Theorem 4.5 shows that the worst-case expectation problem (4.1) over the mo-
ment ambiguity set (2.1) admits a semi-infinite dual. Indeed, the dual problem
on the right hand side of (4.4) accommodates finitely many decision variables but
infinitely many constraints parametrized by the uncertainty realizations 𝑧 ∈ Z.
The dual problem can also be interpreted as a robust optimization problem with
uncertainty set Z. Note that we did not assume Z to be convex. In addition, we
emphasize that compactness of F is not a necessary condition for strong duality.
Indeed, strong duality can also be established under Slater-type conditions (Zhen,
Kuhn and Wiesemann 2023). Finally, the condition rint(F) ⊆ rint(C) is equivalent to the seemingly weaker requirement that F intersects rint(C).
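When the support set Z is finite, both sides of (4.4) reduce to linear programs, so strong duality can be checked directly with an off-the-shelf solver. The following sketch is a toy instance of our own with 𝑓(𝑧) = 𝑧 and a singleton uncertainty set F = {𝜇}, for which the support function satisfies 𝛿*_F(𝜆) = 𝜇𝜆; all numbers are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

z = np.linspace(-1.0, 1.0, 21)     # finite support set Z
ell = np.abs(z) + 0.3 * z**2       # loss function ell(z)
mu = 0.2                           # prescribed mean E_P[Z] = mu

# Primal: max_p ell @ p  s.t.  p >= 0, sum(p) = 1, z @ p = mu
primal = linprog(-ell, A_eq=np.vstack([np.ones_like(z), z]),
                 b_eq=[1.0, mu], bounds=[(0, None)] * len(z))

# Dual of (4.4): min lam0 + mu * lam  s.t.  lam0 + z * lam >= ell(z) for all z
dual = linprog([1.0, mu],
               A_ub=np.column_stack([-np.ones_like(z), -z]),
               b_ub=-ell, bounds=[(None, None)] * 2)

print(-primal.fun, dual.fun)  # strong duality: the two values coincide
```

The dual here has only the finitely many constraints indexed by the grid points, which is exactly the semi-infinite constraint family of (4.4) specialized to a finite Z.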
Lemma 4.8 (Support Function of Gelbrich Uncertainty Sets). Let F be the Gelbrich
uncertainty set (2.8) of radius 𝑟 ≥ 0 around ( 𝜇, ˆ Σ̂) ∈ R𝑑 × S+𝑑 , where G is the
Gelbrich distance of Definition 2.1. For any (𝜆, Λ) ∈ R𝑑 × S𝑑 , we then have
𝛿*_F(𝜆, Λ) =  inf   𝛾(𝑟² − ‖𝜇̂‖² − Tr(Σ̂)) + Tr(𝐴) + 𝛼
             s.t.  𝛼, 𝛾 ∈ R₊, 𝐴 ∈ S₊^𝑑
                   [𝛾𝐼_𝑑 − Λ, 𝛾Σ̂^{1/2}; 𝛾Σ̂^{1/2}, 𝐴] ⪰ 0,
                   [𝛾𝐼_𝑑 − Λ, 𝛾𝜇̂ + 𝜆/2; (𝛾𝜇̂ + 𝜆/2)^⊤, 𝛼] ⪰ 0,
where [𝑋, 𝑌; 𝑌^⊤, 𝑊] denotes the symmetric block matrix with blocks 𝑋, 𝑌, 𝑌^⊤ and 𝑊.
Proof. By Proposition 2.3, which provides a semidefinite representation of the
Gelbrich uncertainty set F, the support function of F satisfies
𝛿*_F(𝜆, Λ) =  sup   𝜇^⊤𝜆 + Tr(𝑀Λ)
             s.t.  𝜇 ∈ R^𝑑, 𝑀, 𝑈 ∈ S₊^𝑑, 𝐶 ∈ R^{𝑑×𝑑}
                   Tr(𝑀 − 2𝜇𝜇̂^⊤ − 2𝐶) ≤ 𝑟² − ‖𝜇̂‖² − Tr(Σ̂)
                   [𝑀 − 𝑈, 𝐶; 𝐶^⊤, Σ̂] ⪰ 0,  [𝑈, 𝜇; 𝜇^⊤, 1] ⪰ 0.
By conic duality (Ben-Tal and Nemirovski 2001, Theorem 1.4.2), the maximization
problem in the above expression admits the dual minimization problem
inf 𝛾(𝑟² − ‖𝜇̂‖² − Tr(Σ̂)) + Tr(Σ̂𝐴₂₂) + 𝛼
Theorem 4.9 (Duality Theory for Gelbrich Ambiguity Sets). If P is the Chebyshev
ambiguity set (2.4) with F representing the Gelbrich uncertainty (2.8), then the
following weak duality relation holds.
sup_{P∈P} E_P[ℓ(𝑍)] ≤  inf   𝜆₀ + 𝛾(𝑟² − ‖𝜇̂‖² − Tr(Σ̂)) + Tr(𝐴) + 𝛼
                      s.t.  𝜆₀ ∈ R, 𝛼, 𝛾 ∈ R₊, 𝜆 ∈ R^𝑑, Λ ∈ S^𝑑, 𝐴 ∈ S₊^𝑑
                            𝜆₀ + 𝜆^⊤𝑧 + 𝑧^⊤Λ𝑧 ≥ ℓ(𝑧)  ∀𝑧 ∈ Z                  (4.6)
                            [𝛾𝐼_𝑑 − Λ, 𝛾Σ̂^{1/2}; 𝛾Σ̂^{1/2}, 𝐴] ⪰ 0
                            [𝛾𝐼_𝑑 − Λ, 𝛾𝜇̂ + 𝜆/2; (𝛾𝜇̂ + 𝜆/2)^⊤, 𝛼] ⪰ 0.
If E P [ℓ(𝑍)] > −∞ for all P ∈ P2 (Z) and 𝑟 > 0, then strong duality holds, that is,
the inequality (4.6) becomes an equality.
Proof. Weak duality follows immediately from the first claim of Theorem 4.6 and
Lemma 4.8. To prove strong duality, recall from Proposition 2.3 that the Gelbrich
uncertainty set F is convex and compact. In addition, recall from the proof of
Proposition 2.2 that the Gelbrich distance is continuous. As 𝑟 > 0, this implies that
rint(F) = {(𝜇, 𝑀) ∈ R^𝑑 × S₊^𝑑 : 𝑀 ≻ 𝜇𝜇^⊤, G((𝜇, 𝑀 − 𝜇𝜇^⊤), (𝜇̂, Σ̂)) < 𝑟},
which in turn ensures that 𝑀 ≻ 𝜇𝜇⊤ for all (𝜇, 𝑀) ∈ rint(F). Therefore, strong
duality follows from the second claim of Theorem 4.6.
We close this section with some historical remarks. The classical problem of
moments asks whether there exists a distribution on Z with a given sequence of
moments. In the language of this survey, the problem of moments thus seeks to
determine whether a given moment ambiguity set of the form (2.1) is non-empty,
where 𝑓 is a polynomial and F is a singleton. The analysis of moment problems
has a long and distinguished history in mathematics dating back to the 19th century.
Notable contributions were made by Chebyshev (1874), Markov (1884), Stieltjes
(1894), Hamburger (1920) and Hausdorff (1923); see (Shohat and Tamarkin 1950)
for an early survey. The study of moment problems with tools from mathematical
optimization—in particular semi-infinite duality theory—was pioneered by Isii
(1960, 1962). Shapiro (2001) formulates the worst-case expectation problem over
a family of distributions with prescribed moments as an infinite-dimensional conic
linear program and establishes conditions for strong duality.
where the three equalities follow from the definition of the perspective function 𝜙 𝜋 ,
the substitution 𝛼 ← 𝛼/𝛽 and the replacement of 𝑡 by 𝜆𝑡/𝜆, respectively. Note that
these manipulations are admissible because 𝛽, 𝜆 > 0. If 𝛽 = 0, then we have
sup_{𝛼∈R} 𝑡𝛼 − 𝜆𝜙^𝜋(𝛼, 𝛽) = sup_{𝛼∈R} 𝑡𝛼 − 𝜆𝜙^∞(𝛼)
                          = sup_{𝛼∈R} 𝑡𝛼 − 𝜆𝛿*_{dom(𝜙*)}(𝛼) = 𝜆𝛿_{cl(dom(𝜙*))}(𝑡/𝜆),
where the first equality holds again because of the definition of 𝜙 𝜋 , and the second
equality exploits (Rockafellar 1970, Theorem 13.3). The third equality replaces 𝑡
with 𝜆𝑡/𝜆 and exploits the elementary observation that the conjugate of the support
function of a convex set coincides with the indicator function of the closure of this
set (Rockafellar 1970, Theorem 13.2). Thus, the claim follows.
Lemma 4.12 (Domain of Conjugate Entropy Functions). If 𝜙 is an entropy function
in the sense of Definition 2.4, then we have
cl(dom(𝜙*)) = (−∞, 𝜙^∞(1)]  if 𝜙^∞(1) < ∞,   and   cl(dom(𝜙*)) = R  if 𝜙^∞(1) = ∞.
Proof. As 𝜙 is proper, convex and closed, (Rockafellar 1970, Theorem 8.5) implies
that its recession function 𝜙∞ is positive homogeneous. Recall that 𝜙(𝑠) = ∞ for
every 𝑠 < 0. We may thus conclude that 𝜙∞ (𝑡) = 𝑡 𝜙∞ (1) for 𝑡 > 0, 𝜙∞ (𝑡) = 0 for 𝑡 =
0 and 𝜙∞ (𝑡) = ∞ for 𝑡 < 0. In addition, (Rockafellar 1970, Theorem 13.3) implies
that the support function of dom(𝜙∗ ) coincides with the recession function 𝜙∞ . The
indicator function of cl(dom(𝜙∗ )) is known to coincide with the conjugate of the
support function of dom(𝜙∗ ), and therefore it satisfies
𝛿_{cl(dom(𝜙*))}(𝑠) = sup_{𝑡∈R₊} (𝑠 − 𝜙^∞(1)) 𝑡,  which equals 0 if 𝑠 ≤ 𝜙^∞(1) and ∞ otherwise.
This shows that cl(dom(𝜙∗ )) = (−∞, 𝜙∞ (1)] if 𝜙∞ (1) < ∞ and that cl(dom(𝜙∗ )) = R
otherwise. Hence, the claim follows.
Proposition 4.13 (Bi-conjugate of ℎ). Assume that E_P̂[ℓ(𝑍)] > −∞. Then, the bi-conjugate of ℎ defined in (4.7) satisfies
ℎ**(𝑢₀, 𝑢) =  sup   −𝜆₀𝑢₀ − 𝜆𝑢 − E_P̂[(𝜙*)^𝜋(ℓ(𝑍) − 𝜆₀, 𝜆)]
             s.t.  𝜆₀ ∈ R, 𝜆 ∈ R₊
                   𝜆₀ + 𝜆𝜙^∞(1) ≥ sup_{𝑧∈Z} ℓ(𝑧),
Divergence                  𝜙(𝑠) for 𝑠 ≥ 0                     𝜙^∞(1)      𝜙*(𝑡)
Kullback-Leibler            𝑠 log(𝑠) − 𝑠 + 1                   ∞           e^𝑡 − 1
Likelihood                  −log(𝑠) + 𝑠 − 1                    1           −log(1 − 𝑡)
Total variation             |𝑠 − 1|/2                          1/2         max{𝑡, −1/2} + 𝛿_{(−∞,1/2]}(𝑡)
Pearson 𝜒²                  (𝑠 − 1)²                           ∞           [𝑡/2 + 1]₊² − 1
Neyman 𝜒²                   (𝑠 − 1)²/𝑠                         1           2 − 2√(1 − 𝑡)
Cressie-Read, 𝛽 ∈ (0, 1)    (𝑠^𝛽 − 𝛽𝑠 + 𝛽 − 1)/(𝛽(𝛽 − 1))      1/(1 − 𝛽)   ([(𝛽 − 1)𝑡 + 1]₊^{𝛽/(𝛽−1)} − 1)/𝛽
Cressie-Read, 𝛽 > 1         (𝑠^𝛽 − 𝛽𝑠 + 𝛽 − 1)/(𝛽(𝛽 − 1))      ∞           ([(𝛽 − 1)𝑡 + 1]₊^{𝛽/(𝛽−1)} − 1)/𝛽
Table 4.1. Examples of entropy functions, their asymptotic slopes and their conjugates. Here, for any 𝑐 ∈ R, we use [𝑐]₊ as a shorthand for max{𝑐, 0}.
For any 𝜌 ∈ M₊(Z) and 𝛽 ∈ L¹(𝜌) with 𝛽 = dP̂/d𝜌 𝜌-almost surely, we then find
sup_{𝛼∈L¹(𝜌)} ∫_Z (ℓ(𝑧) − 𝜆₀) 𝛼(𝑧) − 𝜆𝜙^𝜋(𝛼(𝑧), 𝛽(𝑧)) d𝜌(𝑧)
            = ∫_Z sup_{𝛼∈R} (ℓ(𝑧) − 𝜆₀) 𝛼 − 𝜆𝜙^𝜋(𝛼, 𝛽(𝑧)) d𝜌(𝑧),      (4.9)
where the equality follows from (Rockafellar and Wets 2009, Theorem 14.60),
which applies because the objective function of the maximization problem in the
second line constitutes a normal integrand in the sense of (Rockafellar and Wets
2009, Definition 14.27). This can be verified by recalling that sums, perspectives
and concatenations of normal integrands are again normal integrands (Rockafellar
and Wets 2009, Section 14.E). Next, we partition Z into Z+ (𝛽) = {𝑧 ∈ Z : 𝛽(𝑧) >
0} and Z0 (𝛽) = {𝑧 ∈ Z : 𝛽(𝑧) = 0}. By Lemma 4.11, the integral (4.9) equals
∫_{Z₊(𝛽)} 𝜆𝜙*((ℓ(𝑧) − 𝜆₀)/𝜆) 𝛽(𝑧) d𝜌(𝑧) + ∫_{Z₀(𝛽)} 𝜆𝛿_{cl(dom(𝜙*))}((ℓ(𝑧) − 𝜆₀)/𝜆) d𝜌(𝑧).
As 𝛽 = dP̂/d𝜌 𝜌-almost surely, and as P̂(𝑍 ∈ Z₊(𝛽)) = 1, the first of these integrals
simply reduces to an expectation with respect to the reference distribution and is thus
independent of 𝛽. The second integral still depends on 𝛽 through the integration
domain Z0 (𝛽). Thus, partially maximizing over 𝛼 allows us to recast (4.8) as
ℎ*(−𝜆₀, −𝜆) = E_P̂[𝜆𝜙*((ℓ(𝑍) − 𝜆₀)/𝜆)]
            + sup { ∫_{Z₀(𝛽)} 𝜆𝛿_{cl(dom(𝜙*))}((ℓ(𝑧) − 𝜆₀)/𝜆) d𝜌(𝑧) : 𝜌 ∈ M₊(Z), 𝛽 ∈ L¹(𝜌), dP̂/d𝜌 = 𝛽 𝜌-a.s. }.
If there exists 𝑧0 ∈ Z with (ℓ(𝑧0 ) − 𝜆 0 )/𝜆 ∉ cl(dom(𝜙∗ )), then ℎ∗ (−𝜆 0 , −𝜆) = ∞.
To see this, assume first that 𝑧0 is an atom of P̂. In this case, the expectation in the
first line evaluates to ∞. If 𝑧0 is not an atom of P̂, then the supremum in the second
line evaluates to ∞ because we may set 𝜌 = P̂ + 𝛿 𝑧0 and define 𝛽 ∈ L1 (𝜌) through
𝛽(𝑧) = 1 if 𝑧 ≠ 𝑧0 and 𝛽(𝑧0 ) = 0. Hence, we may conclude that
ℎ*(−𝜆₀, −𝜆) = E_P̂[𝜆𝜙*((ℓ(𝑍) − 𝜆₀)/𝜆)]  if (ℓ(𝑧) − 𝜆₀)/𝜆 ∈ cl(dom(𝜙*)) for all 𝑧 ∈ Z, and ℎ*(−𝜆₀, −𝜆) = ∞ otherwise.
Note that this formula was derived under the assumption that 𝜆 > 0. Note also
that, by Lemma 4.12, the condition (ℓ(𝑧) − 𝜆 0 )/𝜆 ∈ cl(dom(𝜙∗ )) is equivalent to the
requirement that 𝜆 0 + 𝜆 𝜙∞ (1) is larger than or equal to sup 𝑧 ∈Z ℓ(𝑧). We claim that
ℎ*(−𝜆₀, −𝜆) = E_P̂[(𝜙*)^𝜋(ℓ(𝑍) − 𝜆₀, 𝜆)]  if 𝜆 ≥ 0 and 𝜆₀ + 𝜆𝜙^∞(1) ≥ sup_{𝑧∈Z} ℓ(𝑧),  and ℎ*(−𝜆₀, −𝜆) = ∞ otherwise,      (4.10)
for all 𝜆 0 , 𝜆 ∈ R. Indeed, the above reasoning and the definition of the perspective
function (𝜙∗ ) 𝜋 ensure that (4.10) holds whenever 𝜆 ≠ 0. Note that ℎ∗ is convex and
closed thanks to (Rockafellar 1970, Theorem 12.2). The expression on the right
hand side of (4.10) is also convex and closed in (𝜆 0 , 𝜆). In particular, it is lower
semicontinuous thanks to Fatou’s lemma, which applies because 𝜙(1) = 0 such that
(𝜙∗ ) 𝜋 (𝑡, 𝜆) ≥ 𝑡 for all 𝑡 ∈ R and 𝜆 ∈ R+ and because E P̂ [ℓ(𝑍)] > −∞. Observe
also that (𝜙∗ ) 𝜋 is proper, closed and convex thanks to (Rockafellar 1970, page 35,
page 67 & Theorem 13.3). Hence, (4.10) must indeed hold for all 𝜆 0 , 𝜆 ∈ R.
Given (4.10), we finally obtain
ℎ∗∗ (𝑢0 , 𝑢) = sup −𝜆 0 𝑢0 − 𝜆𝑢 − ℎ∗ (−𝜆 0 , −𝜆)
𝜆0 ,𝜆∈R
sup −𝜆 0 𝑢0 − 𝜆𝑢 − E P̂ [(𝜙∗ ) 𝜋 (ℓ(𝑍) − 𝜆 0 , 𝜆)]
𝜆0 ∈R,𝜆∈R+
=
s.t. 𝜆 0 + 𝜆 𝜙∞ (1) ≥ sup ℓ(𝑧),
𝑧 ∈Z
where the inequality holds because of Lemma 4.2. Weak duality thus follows from
the first claim in Proposition 4.13. If 𝜙 is additionally continuous at 1, and if 𝑟 > 0,
then strong duality follows from the second claim in Proposition 4.13.
Recall now that the restricted 𝜙-divergence ambiguity set (2.11) is defined as
P = {P ∈ P(Z) : P ≪ P̂, D_𝜙(P, P̂) ≤ 𝑟}.
That is, P contains all distributions from within the (unrestricted) 𝜙-divergence
ambiguity set (2.10) that are absolutely continuous with respect to P̂. The worst-
case expected loss over P can again be expressed as −ℎ(1, 𝑟), where ℎ(𝑢0 , 𝑢) is
now defined as the infimum of the optimization problem (4.7) with the additional
constraint 𝑣 ≪ P̂. One readily verifies that ℎ remains convex and that {1} × R++
is still contained in rint(dom(ℎ)) despite this restriction. Indeed, the proof of
Lemma 4.10 remains valid almost verbatim.
Theorem 4.15 (Duality Theory for Restricted 𝜙-Divergence Ambiguity Sets). As-
sume that E P̂ [ℓ(𝑍)] > −∞. If P is the restricted 𝜙-divergence ambiguity set (2.11),
then the following weak duality relation holds.
sup_{P∈P} E_P[ℓ(𝑍)] ≤ inf_{𝜆₀∈R, 𝜆∈R₊} 𝜆₀ + 𝜆𝑟 + E_P̂[(𝜙*)^𝜋(ℓ(𝑍) − 𝜆₀, 𝜆)].      (4.12)
If additionally 𝑟 > 0 and 𝜙 is continuous at 1, then strong duality holds, that is, the
inequality (4.12) collapses to an equality.
Note that if (𝜆 0 , 𝜆) is feasible in (4.12), then (ℓ(𝑍) − 𝜆 0 , 𝜆) belongs P̂-almost
surely to dom((𝜙∗ ) 𝜋 ). Otherwise, its objective function value equals ∞. In view of
Lemma 4.12, this implies that 𝜆 0 + 𝜆 𝜙∞ (1) ≥ ess sup P̂ [ℓ(𝑍)]. In contrast, if (𝜆 0 , 𝜆)
is feasible in (4.11), then it satisfies the constraint 𝜆 0 + 𝜆 𝜙∞ (1) ≥ sup 𝑧 ∈Z ℓ(𝑧),
which is more restrictive unless 𝜙∞ (1) = ∞. Hence, the dual problem in (4.12) has
a (weakly) larger feasible set and a (weakly) smaller infimum than the dual problem
in (4.11). This is perhaps unsurprising because (4.12) corresponds to the worst-
case expectation problem over the restricted 𝜙-divergence ambiguity set, which is
(weakly) smaller than the corresponding unrestricted 𝜙-divergence ambiguity set.
Note also that the solution of a worst-case expectation problem over an unrestricted
𝜙-divergence ambiguity set depends on Z and not just on the support of P̂.
Indeed, one can proceed as in the proof of Proposition 4.13. However, the reasoning
simplifies significantly because the additional constraint 𝑣 ≪ P̂ allows us to set
the dominating measure 𝜌 in the definition of D 𝜙 to P̂. Thus, the Radon-Nikodym
derivative 𝛽 = dP̂/d𝜌 is P̂-almost surely equal to 1. This in turn implies that the
calculation of ℎ∗ requires no case distinction, that is, the set Z0 (𝛽) is empty.
Given the bi-conjugate of ℎ, both weak and strong duality can then be established
exactly as in the proof of Theorem 4.14. Details are omitted for brevity.
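For the Kullback-Leibler divergence, one can check that minimizing the dual (4.12) over 𝜆₀ analytically yields 𝜆₀* = 𝜆 log E_P̂[e^{ℓ(𝑍)/𝜆}], which leaves a one-dimensional minimization over 𝜆 > 0. This makes strong duality easy to verify on a discrete reference distribution, where the primal worst case is attained by an exponentially tilted distribution whose divergence from P̂ equals 𝑟. A sketch with illustrative data (the tilting construction is the standard one for KL):

```python
import numpy as np
from scipy.optimize import brentq, minimize_scalar

rng = np.random.default_rng(7)
ell = rng.normal(size=6)            # losses ell(z) on six atoms
phat = np.full(6, 1 / 6)            # uniform reference distribution
r = 0.1                             # radius of the ambiguity set

def kl(q, p):
    return np.sum(q * np.log(q / p))

def tilt(t):
    """Exponentially tilted distribution q_t proportional to phat * exp(t*ell)."""
    w = phat * np.exp(t * ell)
    return w / w.sum()

# Primal: find the tilting parameter with KL(q_t, phat) = r and evaluate
t_star = brentq(lambda t: kl(tilt(t), phat) - r, 0.0, 50.0)
primal = tilt(t_star) @ ell

# Dual after eliminating lambda_0: inf_{lam > 0} lam*r + lam*log E[e^{ell/lam}]
def dual_obj(lam):
    return lam * r + lam * np.log(phat @ np.exp(ell / lam))

dual = minimize_scalar(dual_obj, bounds=(0.05, 50.0), method="bounded").fun

print(primal, dual)  # strong duality: the two values coincide
```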
Van Parys et al. (2021, Proposition 5) establish a strong duality result for worst-
case expectations over likelihood ambiguity sets as introduced in Section 2.2.2.
Theorem 4.14 extends this result to general 𝜙-divergence ambiguity sets with a
significantly shorter proof that only uses tools from convex analysis. Ben-Tal et al.
(2013) establish a strong duality result akin to Theorem 4.15 for restricted 𝜙-
divergence ambiguity sets under the assumption that the reference distribution P̂ is
discrete. Shapiro (2017) extends this result to general reference distributions by us-
ing tools from infinite-dimensional analysis. In contrast, our proof of Theorem 4.15
establishes the same duality result using finite-dimensional convex analysis.
As the objective and constraint functions of the minimization problem in (4.13) are
jointly convex in P and 𝑢, Lemma 4.1 implies that ℎ is convex. Recall also that 𝑐
is non-negative and satisfies 𝑐(𝑧, 𝑧) = 0 for all 𝑧 ∈ Z. If E_P̂[ℓ(Ẑ)] > −∞, it is
therefore easy to show that dom(ℎ) = R+ .
The following lemma will be instrumental for deriving the bi-conjugate of ℎ.
Recall that Γ(P, P̂) denotes the set of all couplings of P and P̂; see Definition 2.15.
Lemma 4.16 (Interchangeability Principle). If 𝑐 is a transportation cost function,
ℓ is upper semicontinuous and 𝜆 ≥ 0, then we have
sup_{P∈P(Z)} sup_{𝛾∈Γ(P,P̂)} E_𝛾[ℓ(𝑍) − 𝜆𝑐(𝑍, Ẑ)] = E_P̂[sup_{𝑧∈Z} ℓ(𝑧) − 𝜆𝑐(𝑧, Ẑ)].
One can show that Lemma 4.16 remains valid, for example, if Z is a Polish (sep-
arable metric) space equipped with its Borel 𝜎-algebra and even if 𝑐 and ℓ fail to be
lower and upper semicontinuous, respectively (Zhang et al. 2024b, Proposition 1).
Proof of Lemma 4.16. Define 𝐿 : Z → R through 𝐿(ẑ) = sup_{𝑧∈Z} ℓ(𝑧) − 𝜆𝑐(𝑧, ẑ).
If 𝜆 = 1, then 𝐿 reduces to the 𝑐-transform of ℓ defined in (2.25). Note first that 𝐿
constitutes a pointwise supremum of upper semicontinuous functions and is thus
also upper semicontinuous and, in particular, Borel-measurable.
Observe next that, by the definition of 𝐿, we have ℓ(𝑧) − 𝜆𝑐(𝑧, ẑ) ≤ 𝐿(ẑ) for
all 𝑧, ẑ ∈ Z. This inequality persists if we integrate both sides with respect to any
coupling 𝛾 ∈ Γ(P, P̂) for any distribution P ∈ P(Z), and therefore we obtain
sup_{P∈P(Z)} sup_{𝛾∈Γ(P,P̂)} E_𝛾[ℓ(𝑍) − 𝜆𝑐(𝑍, Ẑ)] ≤ E_P̂[𝐿(Ẑ)].
= sup_{P∈P(Z)} sup_{𝛾∈Γ(P,P̂)} E_P[ℓ(𝑍)] − 𝜆E_𝛾[𝑐(𝑍, Ẑ)]
= sup_{P∈P(Z)} sup_{𝛾∈Γ(P,P̂)} E_𝛾[ℓ(𝑍) − 𝜆𝑐(𝑍, Ẑ)]
= E_P̂[sup_{𝑧∈Z} ℓ(𝑧) − 𝜆𝑐(𝑧, Ẑ)],   (4.14)
where the second equality follows from Definition 2.15, the third equality holds
because the marginal distribution of 𝑍 under 𝛾 is given by P, and the fourth
equality exploits Lemma 4.16. The above reasoning implies that ℎ∗ (−𝜆) coincides
with (4.14) for all 𝜆 > 0. However, this formula remains valid at 𝜆 = 0. To see this,
note that ℎ∗ is convex and closed thanks to (Rockafellar 1970, Theorem 12.2). The
last expectation in (4.14) is also convex and closed in 𝜆 thanks to Fatou's lemma,
which applies because sup_{𝑧∈Z} ℓ(𝑧) − 𝜆𝑐(𝑧, ẑ) is larger than or equal to ℓ(ẑ) and
lower semicontinuous in 𝜆 for every ẑ ∈ Z and because E_P̂[ℓ(Ẑ)] > −∞. Hence,
the last expectation in (4.14) is indeed convex and lower semicontinuous in 𝜆, and
thus it coincides with ℎ∗(−𝜆) for all 𝜆 ∈ R+ .
Given (4.14), we finally obtain the following formula for the bi-conjugate of ℎ.
ℎ∗∗(𝑢) = sup_{𝜆≥0} −𝜆𝑢 − ℎ∗(−𝜆) = sup_{𝜆≥0} −𝜆𝑢 − E_P̂[sup_{𝑧∈Z} ℓ(𝑧) − 𝜆𝑐(𝑧, Ẑ)].
Here, the first equality holds because ℎ∗ (−𝜆) = ∞ whenever 𝜆 < 0. The second
equality follows from (4.14), which holds for any 𝜆 ≥ 0. This establishes the
desired formula for ℎ∗∗. Lemma 4.2 and our earlier observation that dom(ℎ) = R+
finally imply that ℎ(𝑢) = ℎ∗∗(𝑢) for all 𝑢 ∈ R++ .
The following main theorem uses Proposition 4.17 to dualize the worst-case
expectation problem (4.1) with an optimal transport ambiguity set.
Theorem 4.18 (Duality Theory for Optimal Transport Ambiguity Sets). Assume
that E_P̂[ℓ(Ẑ)] > −∞ and ℓ is upper semicontinuous. If P is the optimal transport
ambiguity set defined in (2.27), then the following weak duality relation holds.
sup_{P∈P} E_P[ℓ(𝑍)] ≤ inf_{𝜆∈R+} 𝜆𝑟 + E_P̂[sup_{𝑧∈Z} ℓ(𝑧) − 𝜆𝑐(𝑧, Ẑ)].   (4.15)
If 𝑟 > 0, then strong duality holds, that is, (4.15) collapses to an equality.
Proof. Recall first that
sup_{P∈P} E_P[ℓ(𝑍)] = −ℎ(𝑟) ≤ −ℎ∗∗(𝑟),
where the inequality holds because of Lemma 4.2. Weak duality thus follows from
the first claim in Proposition 4.17. If 𝑟 > 0, then strong duality follows from the
second claim in Proposition 4.17. This concludes the proof.
Mohajerin Esfahani and Kuhn (2018) and Zhao and Guan (2018) use semi-
infinite duality theory to prove Theorem 4.18 in the special case when OT𝑐 is the
1-Wasserstein distance and when the reference distribution P̂ is discrete. Blanchet
and Murthy (2019) and Gao and Kleywegt (2023) prove a generalization of The-
orem 4.18 by leveraging a Fenchel duality theorem in Banach spaces and by devising
a constructive argument, respectively. They both allow for arbitrary optimal trans-
port discrepancies as well as arbitrary reference distributions on Polish spaces. The
proof shown here, which exploits the interchangeability principle of Lemma 4.16
and elementary tools from convex analysis, is due to Zhang et al. (2024b).
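When both Z and P̂ are finite, both sides of (4.15) can be computed exactly: the primal becomes a linear program over couplings and the dual a one-dimensional convex minimization over 𝜆. The grid, loss, quadratic cost and radius in the following sketch are illustrative assumptions.

```python
# Numerical check of the optimal transport duality (4.15) on a finite grid.
import numpy as np
from scipy.optimize import linprog, minimize_scalar

z = np.linspace(-1.0, 1.0, 21)                  # finite support Z
ell = z ** 2                                    # loss ell(z)
p_hat = np.zeros_like(z)
p_hat[[5, 10, 15]] = [0.3, 0.4, 0.3]            # discrete reference distribution
cost = (z[:, None] - z[None, :]) ** 2           # transportation cost c(z, z_hat)
r = 0.05                                        # transport budget (r > 0)
n = len(z)

# Primal LP over couplings gamma >= 0 (gamma[i, j] moves mass from z_hat_j to z_i):
# maximize <gamma, ell> s.t. sum_i gamma[i, j] = p_hat[j], <gamma, cost> <= r.
res = linprog(-np.repeat(ell, n),
              A_ub=cost.reshape(1, -1), b_ub=[r],
              A_eq=np.tile(np.eye(n), n), b_eq=p_hat,
              bounds=(0, None), method="highs")
primal = -res.fun

# Dual: inf_{lam >= 0} lam*r + E_P_hat[max_z (ell(z) - lam*cost(z, z_hat))].
def dual_obj(lam):
    return lam * r + p_hat @ np.max(ell[:, None] - lam * cost, axis=0)

dual = minimize_scalar(dual_obj, bounds=(0.0, 100.0), method="bounded").fun

assert primal <= dual + 1e-6       # weak duality (4.15)
assert abs(primal - dual) < 1e-3   # strong duality since r > 0
```

The inner maximum over 𝑧 in the dual is a simple columnwise maximum here, which is exactly why the dual is attractive computationally once P̂ is discrete.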
at level 𝛽 represents the smallest number 𝜏★ that weakly exceeds the loss with
probability 1 − 𝛽. Thus, it coincides with the leftmost (1 − 𝛽)-quantile of the loss
distribution 𝐹. For later reference we remark that the 𝛽-VaR can be reformulated as
𝛽-VaRP [ℓ(𝑍)] = inf {𝜏 ∈ R : P(ℓ(𝑍) ≥ 𝜏) ≤ 𝛽} . (5.2)
However, the infimum in (5.2) may not be attained. Note that the VaR is well-defined
and finite for any loss function ℓ ∈ L(R𝑑 ) and for any distribution P ∈ P(R𝑑 ).
Nonetheless, other law-invariant risk measures are finite only for certain sub-classes
of loss functions and distributions. In the remainder of this paper we will often
study risk measures that display some or all of the following structural properties.
Definition 5.3 (Properties of Risk Measures). A law-invariant risk measure 𝜚 is
(i) translation-invariant if
𝜚P [ℓ(𝑍) + 𝑐] = 𝜚P [ℓ(𝑍)] + 𝑐 ∀ℓ ∈ L(R𝑑 ), ∀𝑐 ∈ R, ∀P ∈ P(R𝑑 );
(ii) scale-invariant if
𝜚P [𝑐ℓ(𝑍)] = 𝑐𝜚P [ℓ(𝑍)] ∀ℓ ∈ L(R𝑑 ), ∀𝑐 ∈ R+ , ∀P ∈ P(R𝑑 );
(iii) monotone if
𝜚P [ℓ1 (𝑍)] ≤ 𝜚P [ℓ2 (𝑍)]
∀ℓ1 , ℓ2 ∈ L(R𝑑 ) with ℓ1 (𝑍) ≤ ℓ2 (𝑍) P-a.s., ∀P ∈ P(R𝑑 );
(iv) convex if
𝜚P [𝜃ℓ1 (𝑍) + (1 − 𝜃)ℓ2 (𝑍)] ≤ 𝜃 𝜚P [ℓ1 (𝑍)] + (1 − 𝜃)𝜚P [ℓ2 (𝑍)]
∀ℓ1 , ℓ2 ∈ L(R𝑑 ), ∀𝜃 ∈ [0, 1], ∀P ∈ P(R𝑑 ).
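As a concrete instance, the conditional value-at-risk computed via the Rockafellar-Uryasev formula inf_𝜏 𝜏 + E[(𝐿 − 𝜏)+]/𝛽 satisfies all four properties; the snippet below probes them on randomly generated discrete distributions (the data are purely illustrative).

```python
# Probing the properties of Definition 5.3 for the CVaR of discrete distributions.
import numpy as np

def cvar(losses, probs, beta):
    # For a discrete distribution the infimum over tau in the Rockafellar-Uryasev
    # formula is attained at an atom, so scanning the atoms is exact.
    return min(t + probs @ np.maximum(losses - t, 0.0) / beta for t in losses)

rng = np.random.default_rng(0)
beta = 0.2
p = rng.dirichlet(np.ones(6))                    # random 6-point distribution
l1, l2 = rng.normal(size=6), rng.normal(size=6)  # two loss profiles on its support

# (i) translation invariance
assert np.isclose(cvar(l1 + 0.7, p, beta), cvar(l1, p, beta) + 0.7)
# (ii) scale invariance (positive homogeneity)
assert np.isclose(cvar(2.5 * l1, p, beta), 2.5 * cvar(l1, p, beta))
# (iii) monotonicity: min(l1, l2) <= l2 pointwise
assert cvar(np.minimum(l1, l2), p, beta) <= cvar(l2, p, beta) + 1e-12
# (iv) convexity of the mixture of the two loss profiles
theta = 0.3
lhs = cvar(theta * l1 + (1 - theta) * l2, p, beta)
rhs = theta * cvar(l1, p, beta) + (1 - theta) * cvar(l2, p, beta)
assert lhs <= rhs + 1e-12
```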
This problem seeks a decision 𝑥 that minimizes the worst-case risk of the random
loss ℓ(𝑥, 𝑍) with respect to all distributions of 𝑍 in the ambiguity set P. Below we
will show that the duality theory for worst-case expectation problems developed in
Section 4 has ramifications for a broad class of worst-case risk problems of the form
sup 𝜚P [ℓ(𝑍)]. (5.4)
P∈P
Here, we suppress as usual the dependence of the loss function on 𝑥 to avoid clutter.
and thus coincides with the optimal value of a penalty-based distributionally robust
optimization model with a 𝜙-divergence penalty. The equality in the above expres-
sion follows from Ben-Tal and Teboulle (2007, Theorem 4.2), which is reminiscent
of the strong duality theorem for worst-case expectation problems over restricted
𝜙-divergence ambiguity sets (see Theorem 4.15). The assumption that 𝜙∞ (1) = ∞
ensures indeed that D 𝜙 (P, P̂) is finite only if P ≪ P̂. We also remark that if 𝑔 is a
= (1/𝛽) log E_P[exp(𝛽ℓ(𝑍))] = 𝛽-ERM_P[ℓ(𝑍)].
The second equality holds because the unconstrained convex minimization problem
over 𝜏 is uniquely solved by 𝜏★ = 𝛽 −1 log(E P [exp(𝛽ℓ(𝑍))]), which can be verified
by inspecting the problem’s first-order optimality condition. In addition, as 𝛽 ∈
(0, 1), it is clear that the problem’s objective function is coercive in 𝜏.
Kupper and Schachermayer (2009) show that, with the exception of the expected
value, the entropic risk measure is the only relevant law-invariant risk measure that
obeys the tower property. That is, for any random vectors 𝑍1 and 𝑍2 it satisfies
𝛽-ERMP [ℓ(𝑍2 )] = 𝛽-ERMP [𝛽-ERMP [ℓ(𝑍2 )|𝑍1 ]],
where the conditional entropic risk measure 𝛽-ERM_P[ℓ(𝑍2)|𝑍1] is defined in the
obvious way by replacing the unconditional expectation in (5.8) with a conditional
expectation. The entropic risk measure is often used for modeling risk-aversion in
dynamic optimization problems, where the dynamic consistency of the decisions
taken at different points in time is a concern. For example, it occupies center stage
in finance (Föllmer and Schied 2008), risk-sensitive control (Whittle 1990, Başar
and Bernhard 1995) and economics (Hansen and Sargent 2008).
Proposition 5.14 (Dual Representation of the Entropic Risk Measure). Assume that
E P̂ [ℓ(𝑍)] > −∞. Then, the entropic risk measure admits the dual representation
𝛽-ERM_P̂[ℓ(𝑍)] = sup_{P∈P(Z)} E_P[ℓ(𝑍)] − (1/𝛽) KL(P, P̂).
Proof. Let 𝜙 be the entropy function of the Kullback-Leibler divergence. Thus, we
have 𝜙∗(𝑡) = e𝑡 − 1 for all 𝑡 ∈ R; see Table 4.1. By Proposition 5.13, the entropic
risk measure is the optimized certainty equivalent induced by the disutility function
𝑔(𝑡) = 𝛽 −1 (exp(𝛽𝑡) − 1) = 𝛽 −1 𝜙∗ (𝛽𝑡) = (𝛽 −1 𝜙)∗ (𝑡),
where the last equality uses (Rockafellar 1970, Theorem 16.1). This implies that
𝛽-ERM_P̂[ℓ(𝑍)] = inf_{𝜏∈R} 𝜏 + E_P̂[(𝛽⁻¹𝜙)∗(ℓ(𝑍) − 𝜏)]
= sup_{P∈P(Z)} E_P[ℓ(𝑍)] − D_{𝛽⁻¹𝜙}(P, P̂)
= sup_{P∈P(Z)} E_P[ℓ(𝑍)] − (1/𝛽) KL(P, P̂).
Here, the second equality follows from the strong duality relation (5.6), which
applies because E P̂ [ℓ(𝑍)] > −∞, and the third equality holds because the entropy
function 𝜙 was assumed to induce the Kullback-Leibler divergence.
We remark that Proposition 5.14 can also be proved by leveraging the Donsker-
Varadhan formula from Proposition 2.9 in lieu of the duality relation (5.6).
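On a finite support, Proposition 5.14 can be verified directly: the supremum is attained by the Gibbs distribution p★ ∝ p̂ e^{𝛽ℓ}, for which the dual objective reproduces the entropic risk exactly. The support, losses and 𝛽 below are illustrative assumptions.

```python
# Numerical check of the dual representation of the entropic risk measure.
import numpy as np

beta = 0.8
ell = np.array([0.5, 1.0, 3.0])
p_hat = np.array([0.6, 0.3, 0.1])

erm = np.log(p_hat @ np.exp(beta * ell)) / beta   # entropic risk under p_hat

w = p_hat * np.exp(beta * ell)
p_star = w / w.sum()                              # Gibbs distribution (maximizer)
kl = p_star @ np.log(p_star / p_hat)              # KL(p_star, p_hat)
assert np.isclose(erm, p_star @ ell - kl / beta)  # supremum is attained at p_star

# No other distribution does better (Donsker-Varadhan inequality).
rng = np.random.default_rng(1)
for _ in range(100):
    p = rng.dirichlet(np.ones(3))
    assert p @ ell - (p @ np.log(p / p_hat)) / beta <= erm + 1e-9
```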
One can show that every optimized certainty equivalent 𝜚 is translation-invariant
Proof. Let V∗ be the topological dual of V, and define the bilinear form ⟨·, ·⟩ :
V∗ × V → R through ⟨𝑣∗, 𝑣⟩ = 𝑣∗(𝑣). If we equip V∗ with the weak∗ topology
where the two equalities follow from the definitions of ℎ and 𝐹, respectively. In
where the first two equalities follow from the definitions of the bi-conjugate ℎ∗∗ and
the conjugate ℎ∗ , respectively, and the third equality exploits the definition of ℎ.
The fourth equality follows from the definition of the conjugate 𝐹 ∗, and the last
equality holds because 𝐹 ∗(𝑢, 𝑣) = −𝐻(𝑢, −𝑣). Thus, the desired minimax result
holds if we manage to prove that ℎ(0) = ℎ∗∗ (0).
By the definitions of ℎ∗ and ℎ and by the relation 𝐹 ∗(𝑢, 𝑣) = −𝐻(𝑢, −𝑣), we have
{𝑣 ∈ V : ℎ∗(𝑣) ≤ 𝛼} = {𝑣 ∈ V : sup_{𝑣∗∈V∗} ⟨𝑣∗, 𝑣⟩ − ℎ(𝑣∗) ≤ 𝛼}
= {𝑣 ∈ V : sup_{𝑢∈U} sup_{𝑣∗∈V∗} ⟨𝑣∗, 𝑣⟩ − 𝐹(𝑢, 𝑣∗) ≤ 𝛼}
= {𝑣 ∈ V : sup_{𝑢∈U} −𝐻(𝑢, −𝑣) ≤ 𝛼}
= − ⋂_{𝑢∈U} {𝑣 ∈ V : 𝐻(𝑢, 𝑣) ≥ −𝛼}
It also guarantees that if 𝐻0 (𝑢, 𝑣) is convex and closed in 𝑢 and concave in 𝑣, then
so is 𝐻(𝑢, 𝑣). Thus, the feasible sets in any convex-concave minimax problem can
always be extended to the underlying vector spaces without changing the problem.
We now leverage Corollary 5.16 to derive a minimax theorem for optimized
certainty equivalents. This result exploits the inf-compactness of the objective
function of problem (5.5) in 𝜏. Shafiee and Kuhn (2024) establish similar minimax
theorems for a more general class of regular risk and deviation measures introduced
by Rockafellar and Uryasev (2013).
Theorem 5.18 (Minimax Theorem for Optimized Certainty Equivalents). Suppose
that P ⊆ P(Z) is non-empty and convex, 𝜚 is any optimized certainty equivalent
induced by a disutility function 𝑔, supP∈P E P [𝑔(ℓ(𝑍))] < ∞, and E P [ℓ(𝑍)] > −∞
for all P ∈ P. Then, 𝐺(𝜏, P) = 𝜏 + E P [𝑔(ℓ(𝑍) − 𝜏)] for 𝜏 ∈ R and P ∈ P satisfies
sup_{P∈P} 𝜚P[ℓ(𝑍)] = sup_{P∈P} inf_{𝜏∈R} 𝐺(𝜏, P) = inf_{𝜏∈R} sup_{P∈P} 𝐺(𝜏, P).
Proof. Note first that 𝐺(𝜏, P) is convex in 𝜏 and concave (in fact, linear) in P. In
addition, 𝐺(𝜏, P) is closed in 𝜏. To see this, observe that
lim inf_{𝜏′→𝜏} 𝐺(𝜏′, P) = lim inf_{𝜏′→𝜏} E_P[𝜏′ + 𝑔(ℓ(𝑍) − 𝜏′)]
and the sublevel sets {𝑢 ∈ U : 𝐻(𝑢, 𝑣) ≤ 𝛼} are compact for every 𝛼 ∈ R provided
that 𝑣 ∈ P. The claim thus follows from Corollary 5.16.
Theorem 5.18 implies that if 𝛽 ∈ (0, 1), then the worst-case 𝛽-CVaR satisfies
sup_{P∈P} 𝛽-CVaRP[ℓ(𝑍)] = inf_{𝜏∈R} 𝜏 + (1/𝛽) sup_{P∈P} E_P[max{ℓ(𝑍) − 𝜏, 0}]   (5.9)
for any non-empty convex ambiguity set P ⊆ P(Z) provided that E P [|ℓ(𝑍)|] < ∞
for all P ∈ P. In the extant literature, the interchange of the supremum over P and
the infimum over 𝜏 is often justified with Sion’s minimax theorem (Sion 1958).
However, many studies overlook that Sion’s minimax theorem only applies if P
is weakly compact and E P [max{ℓ(𝑍) − 𝜏, 0}] is weakly upper semicontinuous
in P. As shown in Section 3, unfortunately, many popular ambiguity sets fail to
be weakly compact. In addition, E P [max{ℓ(𝑍) − 𝜏, 0}] fails to be weakly upper
semicontinuous unless the loss function ℓ is upper semicontinuous and bounded
on Z; see Proposition 3.3. All non-trivial convex loss functions on R𝑑 violate this
condition. In contrast, Theorem 5.18 offers a more general result that exploits the
inf-compactness in 𝜏 but obviates any restrictive topological conditions on P or ℓ.
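The identity (5.9) can be checked by brute force on a small example where P is the convex hull of two discrete distributions on a common support; the inner supremum on the right-hand side is then attained at a vertex because the expectation is linear in P. All data below are illustrative assumptions.

```python
# Brute-force check of the interchange (5.9) for the worst-case CVaR.
import numpy as np

beta = 0.5
ell = np.array([0.0, 1.0, 2.0, 5.0])     # losses on a four-point support
P1 = np.array([0.4, 0.3, 0.2, 0.1])      # two vertex distributions whose
P2 = np.array([0.1, 0.2, 0.3, 0.4])      # convex hull forms the ambiguity set

def cvar(p):  # exact CVaR of a discrete distribution (scan the atoms)
    return min(t + p @ np.maximum(ell - t, 0.0) / beta for t in ell)

# Left-hand side: sup of the CVaR over mixtures w*P1 + (1-w)*P2.
lhs = max(cvar(w * P1 + (1 - w) * P2) for w in np.linspace(0.0, 1.0, 2001))

# Right-hand side: inf over tau of tau + (1/beta) * max vertex expectation.
rhs = min(t + max(P1 @ np.maximum(ell - t, 0.0),
                  P2 @ np.maximum(ell - t, 0.0)) / beta
          for t in np.linspace(-1.0, 6.0, 7001))

assert lhs <= rhs + 1e-9      # any tau yields an upper bound on the left-hand side
assert abs(lhs - rhs) < 1e-3  # the interchange closes the gap
```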
The inner maximization problem in the resulting upper bound constitutes a worst-
case expectation problem. Hence, it is bounded above by the dual problem de-
rived in Theorem 4.5. Substituting this dual problem into the above expression
yields (5.10). Strong duality follows from the minimax theorem for optimized
certainty equivalents (Theorem 5.18) and the strong duality result for worst-case
expectation problems (Theorem 4.5), which apply under the given assumptions.
The semi-infinite constraint in (5.10) involves the composite function 𝑔(ℓ(𝑧)− 𝜏),
which fails to be concave in 𝑧 even if 𝑔 is non-decreasing and ℓ is concave. Thus,
checking whether a given (𝜏, 𝜆 0 , 𝜆) satisfies the semi-infinite constraint in (5.10) is
generically hard. In fact, Chen and Sim (2024, Theorem 1) prove that evaluating the
worst-case entropic risk is NP-hard even if ℓ is linear and P is a Markov ambiguity
set. Hence, while providing theoretical insights, Theorem 5.18 does not necessarily
pave the way towards an efficient method for solving worst-case risk problems of the
form (5.4). Nevertheless, Theorem 5.18 provides a concise reformulation for (5.4)
that is susceptible to approximate iterative solution procedures.
If supP∈P E P [𝑔(ℓ(𝑍))] < ∞, E P [ℓ(𝑍)] > −∞ for all P ∈ P, 𝑟 > 0 and 𝜙 is con-
tinuous at 1, then strong duality holds, that is, the inequality becomes an equality.
A duality result akin to Theorem 5.20 also holds for worst-case risk problems
over restricted 𝜙-divergence ambiguity sets of the form
P = { P ∈ P(Z) : P ≪ P̂, D𝜙(P, P̂) ≤ 𝑟 }.
The proof of the next theorem follows immediately from Theorems 4.15 and 5.18.
Theorem 5.21 (Duality Theory for Restricted 𝜙-Divergence Ambiguity Sets II).
Assume that E P̂ [ℓ(𝑍)] > −∞. If P is the restricted 𝜙-divergence ambiguity
set (2.11), and 𝜚 is an optimized certainty equivalent induced by a disutility func-
tion 𝑔, then the following weak duality relation holds.
sup_{P∈P} 𝜚P[ℓ(𝑍)] ≤ inf_{𝜏,𝜆0∈R, 𝜆∈R+} 𝜏 + 𝜆0 + 𝜆𝑟 + E_P̂[(𝜙∗)^𝜋(𝑔(ℓ(𝑍) − 𝜏) − 𝜆0, 𝜆)].
If supP∈P E P [𝑔(ℓ(𝑍))] < ∞, E P [ℓ(𝑍)] > −∞ for all P ∈ P, 𝑟 > 0 and 𝜙 is con-
tinuous at 1, then strong duality holds, that is, the inequality becomes an equality.
If supP∈P E P [𝑔(ℓ(𝑍))] < ∞, E P [ℓ(𝑍)] > −∞ for all P ∈ P and 𝑟 > 0, then strong
duality holds, that is, the inequality becomes an equality.
Worst-case risk problems with optimal transport ambiguity sets are studied by
Pflug and Wozabal (2007), Pichler (2013) and Wozabal (2014) in the context of
portfolio selection with linear loss functions and by Mohajerin Esfahani et al.
(2018) in the context of inverse optimization using the CVaR. Sadana, Delage and
Georghiou (2024) investigate worst-case entropic risk measures over ∞-Wasserstein
balls and establish tractable reformulations under standard convexity assumptions.
Kent, Li, Blanchet and Glynn (2021) and Sheriff and Mohajerin Esfahani (2023)
develop customized Frank-Wolfe algorithms in the space of probability distributions
to address worst-case risk problems involving generic loss functions and risk
measures. Specifically, Kent et al. (2021) work with Wasserstein gradient flows and
use the corresponding notions of smoothness to establish the convergence of their
Frank-Wolfe algorithm. In contrast, Sheriff and Mohajerin Esfahani (2023) work
with Gâteaux derivatives, which leads to a different notion of smoothness and thus
to a different convergence analysis. Both algorithms display sublinear convergence
rates. When the reference distribution P̂ is discrete or when only samples from
P̂ are used, the algorithms’ iterates represent discrete distributions with
progressively increasing bit sizes. Theorem 5.22 provides a compact, albeit potentially
nonconvex, reformulation of the worst-case risk problem. This reformulation is
amenable to primal-dual gradient methods in the finite-dimensional space of the
dual variables, which are guaranteed to converge to a stationary point.
Worst-case risk problems represent special instances of optimization problems
over spaces of probability distributions. The mainstream methods to address such
problems leverage the machinery of Wasserstein gradient flows (Ambrosio, Gigli
and Savaré 2008). Wasserstein gradient flows have recently been used in the context
of distributionally robust optimization problems (Lanzetti, Bolognani and Dörfler
2022, Lanzetti, Terpin and Dörfler 2024, Xu, Lee, Cheng and Xie 2024), nonconvex
optimization (Chizat and Bach 2018, Chizat 2022) or variational inference (Jiang,
Chewi and Pooladian 2024, Lambert, Chewi, Bach, Bonnabel and Rigollet 2022,
Diao, Balasubramanian, Chewi and Salim 2023, Zhang and Zhou 2020). The
results of this section are new and complementary to these existing works.
which maximizes the expected value of ℓ(𝑍) over the Markov ambiguity set of all
distributions supported on Z with mean 𝜇. The Markov ambiguity set is a moment
ambiguity set of the form (2.1) with 𝑓 (𝑧) = 𝑧 and F = {𝜇}. By Theorem 4.5 and
as the support function of F is linear, the problem dual to (6.1a) is given by
inf_{𝜆0∈R, 𝜆∈R𝑑} { 𝜆0 + 𝜆⊤𝜇 : 𝜆0 + 𝜆⊤𝑧 ≥ ℓ(𝑧) ∀𝑧 ∈ Z }.   (6.1b)
Intuitively, this dual problem aims to find an affine function 𝑎(𝑧) = 𝜆 0 + 𝜆⊤ 𝑧 that
majorizes the loss function ℓ(𝑧) on Z and has minimal expected value E P [𝑎(𝑍)]
under any distribution P feasible in the primal problem (6.1a).
Proposition 6.1 (Jensen Bound). Suppose that Z is convex, 𝜇 ∈ Z, ℓ is concave,
and 𝜆★ is any supergradient of ℓ at 𝜇. Then, the primal problem (6.1a) is solved
by P★ = 𝛿 𝜇 , and the dual problem (6.1b) is solved by (𝜆★0 , 𝜆★), where 𝜆★0 =
ℓ(𝜇) − 𝜇⊤𝜆★. In addition, the optimal values of (6.1a) and (6.1b) both equal ℓ(𝜇).
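A quick numerical illustration of Proposition 6.1: for a concave quadratic loss on Z = R², no distribution with mean 𝜇 improves on the Dirac distribution 𝛿𝜇, and the affine function built from a supergradient is dual feasible. The loss and mean are illustrative assumptions.

```python
# Numerical check of the Jensen bound for a concave loss on R^2.
import numpy as np

mu = np.array([0.3, -0.2])
a = np.array([1.0, 2.0])

def ell(z):                                    # concave loss ell(z) = -||z||^2 + a'z
    return -np.sum(np.asarray(z) ** 2, axis=-1) + np.asarray(z) @ a

grad = -2 * mu + a                             # supergradient of ell at mu
lam0 = ell(mu) - mu @ grad                     # dual solution (lam0*, lam*) of (6.1b)
assert np.isclose(lam0 + mu @ grad, ell(mu))   # the affine majorant touches ell at mu

rng = np.random.default_rng(2)
for _ in range(200):
    d = rng.normal(size=2)
    atoms = np.stack([mu + d, mu - d])         # two-point distribution with mean mu
    assert 0.5 * ell(atoms).sum() <= ell(mu) + 1e-12   # E_P[ell] <= ell(mu)
    z = rng.normal(size=2)
    assert lam0 + z @ grad >= ell(z) - 1e-12   # dual feasibility on all of R^2
```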
(Edmundson 1956, Madansky 1959), and it shows that (6.1b) is solved by an affine
function that touches ℓ at the vertices 𝑒𝑖 , 𝑖 ∈ [𝑑], of Z. We emphasize, however,
that Proposition 6.2 remains valid with minor modifications if Z is an arbitrary
regular simplex in R𝑑 , that is, the convex hull of 𝑑 + 1 affinely independent vectors
𝑣 𝑖 ∈ R𝑑 , 𝑖 ∈ [𝑑 + 1]; see (Birge and Wets 1986, Gassmann and Ziemba 1986).
If the loss function ℓ(𝑥, 𝑧) in (1.2) is convex in 𝑧 for any fixed 𝑥 ∈ X , then
Proposition 6.2 implies that the DRO problem (1.2) is equivalent to the stochastic
program inf 𝑥 ∈X E P★ [ℓ(𝑥, 𝑍)], where P★ is independent of 𝑥. As P★ is a discrete
distribution with 𝑑 atoms, this stochastic program is usually easy to solve.
then the dual problem (6.2b) is solved by (𝜆★0 , 𝜆★𝑣, 𝜆★𝑤 , Λ★), where 𝜆★0 = 0 and
𝜆★𝑣 = 0, while 𝜆★𝑤 has elements 𝜆★𝑤,𝑖 and Λ★ has columns Λ★𝑖, 𝑖 ∈ [𝑑 𝑤 ]. The
optimal values of (6.2a) and (6.2b) coincide and are both equal to

∑_{𝑖=1}^{𝑑𝑤} 𝜇𝑤,𝑖 ℓ(𝐶𝑒𝑖/𝑤̄𝑖, 𝑒𝑖).
and

E_P★[𝑉𝑊⊤] = ∑_{𝑖=1}^{𝑑𝑤} 𝑤̄𝑖 𝐶𝑒𝑖𝑒𝑖⊤/𝑤̄𝑖 = 𝐶.
for all 𝑣 ∈ V and 𝑤 ∈ W. The first inequality follows from the concavity of ℓ(𝑣, 𝑤)
in 𝑣 and the definition of Λ★𝑖 as a supergradient, while the second inequality follows
from the convexity of ℓ(𝑣, 𝑤) in 𝑤 and Jensen’s inequality. Hence, (𝜆★0 , 𝜆★𝑣, 𝜆★𝑤 , Λ★)
is indeed feasible in (6.2b). A similar calculation reveals that the objective function
value of (𝜆★0 , 𝜆★𝑣, 𝜆★𝑤 , Λ★) in (6.2b) is given by the formula in the proposition
statement. Consequently, by weak duality as established in Theorem 4.5, we have
shown that P★ is primal optimal and (𝜆★0 , 𝜆★𝑣, 𝜆★𝑤 , Λ★) is dual optimal.
which maximizes the expected value of ℓ(𝑍) over the family of all univariate
distributions supported on Z with mean 𝜇 and mean absolute deviation 𝜎. Note
that problem (6.3a) optimizes over a moment ambiguity set of the form (2.1)
with 𝑓 (𝑧) = (𝑧, |𝑧 − 𝜇|) and F = {𝜇} × {𝜎}. By Theorem 4.5 and as the support
function of F is linear, the problem dual to (6.3a) is given by
inf_{𝜆0,𝜆1,𝜆2∈R} { 𝜆0 + 𝜆1𝜇 + 𝜆2𝜎 : 𝜆0 + 𝜆1𝑧 + 𝜆2|𝑧 − 𝜇| ≥ ℓ(𝑧) ∀𝑧 ∈ Z }.   (6.3b)
Intuitively, this dual problem aims to approximate the loss function from above with
a piecewise linear continuous function that has a kink at 𝜇. The problems (6.3a)
and (6.3b) can be solved in closed form if ℓ is convex.
Proposition 6.4 (Ben-Tal and Hochman Bound). Assume that Z = [0, 1], 𝜇 ∈ (0, 1)
and 𝜎 ∈ [0, 2𝜇(1− 𝜇)]. Suppose also that ℓ is a real-valued convex function. Then,
the primal problem (6.3a) is solved by
P★ = (𝜎/(2𝜇)) 𝛿0 + (1 − 𝜎/(2𝜇) − 𝜎/(2(1 − 𝜇))) 𝛿𝜇 + (𝜎/(2(1 − 𝜇))) 𝛿1,
Proposition 6.4 readily extends to support sets of the form Z = [𝑎, 𝑏] for
any 𝑎, 𝑏 ∈ R with 𝑎 < 𝜇 < 𝑏 by applying a linear coordinate transformation.
If ℓ(𝑥, 𝑧) in (1.2) is convex in 𝑧 for any fixed 𝑥 ∈ X , then Proposition 6.4 implies that
the DRO problem (1.2) is equivalent to the stochastic program inf 𝑥 ∈X E P★ [ℓ(𝑥, 𝑍)],
where the three-point distribution P★ is independent of 𝑥. Traditionally, this
stochastic program is used as a conservative approximation for a stochastic program
of the form inf 𝑥 ∈X E P [ℓ(𝑥, 𝑍)], where P is a known continuous distribution (Ben-
Tal and Hochman 1972). Unlike the Jensen and Edmundson-Madansky bounds,
which only use information about the location of P, and unlike the barycentric
approximation, which only uses information about the location and certain cross-
moments of P, the Ben-Tal and Hochman bound uses information about the location
as well as the dispersion of P. Thus, it provides a tighter approximation.
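The three-point distribution of Proposition 6.4 is easy to validate numerically: it matches the prescribed mean and mean absolute deviation, and its expected loss dominates that of feasible two-point competitors for a convex loss. The moments and the loss below are illustrative assumptions.

```python
# Sanity check of the Ben-Tal and Hochman three-point worst-case distribution.
import numpy as np

mu, sigma = 0.4, 0.2                 # feasible since sigma <= 2*mu*(1 - mu) = 0.48
atoms = np.array([0.0, mu, 1.0])
probs = np.array([sigma / (2 * mu),
                  1 - sigma / (2 * mu) - sigma / (2 * (1 - mu)),
                  sigma / (2 * (1 - mu))])

assert np.isclose(probs.sum(), 1.0)
assert np.isclose(probs @ atoms, mu)                  # mean is mu
assert np.isclose(probs @ np.abs(atoms - mu), sigma)  # mean absolute deviation is sigma

def ell(z):                          # a convex loss on [0, 1]
    return (z - 0.7) ** 2 + np.exp(z)

best = probs @ ell(atoms)

# Two-point distributions with the same mean and MAD never do better.
rng = np.random.default_rng(3)
for _ in range(200):
    a = rng.uniform(0.0, mu)         # lower atom a < mu
    p = sigma / (2 * (mu - a))       # weight on a, chosen so that the MAD is sigma
    b = mu + sigma / (2 * (1 - p))   # upper atom restoring the mean to mu
    if p <= 1 and b <= 1:
        assert p * ell(a) + (1 - p) * ell(b) <= best + 1e-9
```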
If 𝑍 is a 𝑑-dimensional random vector with independent components 𝑍𝑖 , 𝑖 ∈ [𝑑],
each of which has a known mean and mean absolute deviation, then one can show
that the worst-case expected value of a convex loss function is attained by
P★ = ⊗_{𝑖=1}^{𝑑} P★𝑖, where each P★𝑖 is a three-point distribution constructed as in Proposition 6.4
which maximizes the expected value of ℓ(𝑍) over the Chebyshev ambiguity set of
all univariate distributions supported on Z with mean 0 and variance 𝜎 2 . This
Chebyshev ambiguity set is a moment ambiguity set of the form (2.1) with 𝑓 (𝑧) =
(𝑧, 𝑧2 ) and F = {0} × {𝜎 2 }. By Theorem 4.5 and as the support function of F is
linear, the problem dual to (6.4a) is given by
inf_{𝜆0,𝜆1,𝜆2∈R} { 𝜆0 + 𝜆2𝜎² : 𝜆0 + 𝜆1𝑧 + 𝜆2𝑧² ≥ ℓ(𝑧) ∀𝑧 ∈ Z }.   (6.4b)
Proposition 6.5 was first derived by Scarf (1958) in his pioneering treatise
on the distributionally robust newsvendor problem; see also (Jagannathan 1977,
Theorem 1). Note that if the mean of 𝑍 is known to equal 𝜇 ≠ 0 instead of 0, then
Scarf’s bound remains valid if we replace 𝑎 with 𝑎 − 𝜇. Gallego and Moon (1993)
extend Scarf’s bound to more general loss functions such as wedge functions or
ramp functions with a discontinuity, whereas Natarajan et al. (2018) extend Scarf’s
bound to more general ambiguity sets that not only contain information about the
mean and variance of 𝑍 but also about its semivariance. In addition, Das, Dhara
and Natarajan (2021) discuss variants of Scarf’s bound that rely on information
about the mean and the 𝛼-th moment of 𝑍 for any 𝛼 > 1.
Proposition 6.5 is often used to reformulate DRO problems of the form (1.2)
whose objective function is given by the expected value of a ramp function.
Examples include distributionally robust newsvendor, support vector machine or
mean-CVaR portfolio selection problems. In most of these applications, the
location 𝑎 of the kink of the ramp function is a decision variable or a function of
the decision variables. Thus, the worst-case distribution P★ is decision-dependent,
which means that Proposition 6.5 does not enable us to reduce the DRO prob-
lem (1.2) to a stochastic program with a single fixed worst-case distribution.
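As a numerical probe of Scarf's bound, consider the ramp loss ℓ(𝑧) = max{𝑧 − 𝑎, 0} with mean 0 and variance 𝜎². The classical closed form for the worst-case expected loss, (√(𝑎² + 𝜎²) − 𝑎)/2 for 𝑎 ≥ 0, is taken here as an assumption and recovered from the family of two-point distributions matching the two moments.

```python
# Recovering Scarf's bound from two-point distributions with mean 0, variance sigma^2.
import numpy as np

a, sigma = 0.5, 1.0
scarf = (np.sqrt(a ** 2 + sigma ** 2) - a) / 2   # assumed closed-form bound

# Atom sigma*sqrt((1-p)/p) with probability p, atom -sigma*sqrt(p/(1-p)) with
# probability 1-p: every such pair has mean 0 and variance sigma^2.
ps = np.linspace(1e-4, 1 - 1e-4, 200001)
hi = sigma * np.sqrt((1 - ps) / ps)
lo = -sigma * np.sqrt(ps / (1 - ps))
vals = ps * np.maximum(hi - a, 0.0) + (1 - ps) * np.maximum(lo - a, 0.0)
best = vals.max()

assert best <= scarf + 1e-9   # no feasible two-point distribution beats the bound
assert scarf - best < 1e-4    # and the bound is (nearly) attained among them
```

The near-attainment reflects the fact that the worst case in Scarf's analysis is a two-point distribution.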
which maximizes the probability of the event 𝑍 ∈ C over the Chebyshev ambiguity
set of all distributions on Z = R𝑑 with mean 0 and covariance matrix 𝐼 𝑑 . This
Chebyshev ambiguity set is a moment ambiguity set of the form (2.1) with 𝑓 (𝑧) =
(𝑧, 𝑧𝑧⊤ ) and F = {0} × {𝐼 𝑑 }. If we set ℓ to the characteristic function of C defined
through ℓ(𝑧) = 1 𝑧 ∈C for all 𝑧 ∈ Z, then the worst-case probability problem (6.5a)
can be recast as a worst-case expectation problem. By Theorem 4.5 and as the
support function of F is linear, the corresponding dual problem is thus given by
inf_{𝜆0∈R, 𝜆∈R𝑑, Λ∈S𝑑} { 𝜆0 + ⟨Λ, 𝐼𝑑⟩ : 𝜆0 + 𝜆⊤𝑧 + 𝑧⊤Λ𝑧 ≥ ℓ(𝑧) ∀𝑧 ∈ Z }.   (6.5b)
The problems (6.5a) and (6.5b) can be solved analytically if C is convex and closed.
Proposition 6.6 (Marshall and Olkin Bound). Suppose that Z = R𝑑 , C ⊆ R𝑑 is
convex and closed, and ℓ is the characteristic function of C. Set Δ = min𝑧 ∈C k𝑧k 2 ,
and let 𝑧0 ∈ R𝑑 be the unique minimizer of this problem. Then, the optimal values
of (6.5a) and (6.5b) are both equal to (1 + Δ2 ) −1 . If Δ = 0, then the supremum
of (6.5a) may not be attained. However, if Δ > 0, then (6.5a) is solved by
P★ = (1/(1 + Δ²)) 𝛿𝑧0 + (Δ²/(1 + Δ²)) Q,

where Q ∈ P(Z) is an arbitrary distribution with mean −𝑧0/Δ² and covariance
matrix ((1 + Δ²)/Δ²)(𝐼𝑑 − 𝑧0𝑧0⊤/Δ²). For any Δ ≥ 0, problem (6.5b) is solved by

𝜆★0 = 1/(1 + Δ²)², 𝜆★ = 2𝑧0/(1 + Δ²)² and Λ★ = 𝑧0𝑧0⊤/(1 + Δ²)².
Proof. Assume first that Δ = 0, that is, 0 ∈ C. For every 𝑗 ∈ N, let Q 𝑗 ∈ P(Z) be
any distribution with mean 0 and covariance matrix 𝑗 𝐼 𝑑 , and set
P 𝑗 = (1 − 1/ 𝑗) 𝛿0 + (1/ 𝑗) Q 𝑗 .
We thus have E P 𝑗 [𝑍] = 0 and E P 𝑗 [𝑍 𝑍 ⊤ ] = 𝐼𝑑 , which implies that P 𝑗 is feasible
in (6.5a). In addition, the objective function value of P 𝑗 in (6.5a) satisfies
P𝑗(𝑍 ∈ C) = 1 − 1/𝑗 + Q𝑗(𝑍 ∈ C)/𝑗 ≥ 1 − 1/𝑗.
Driving 𝑗 to infinity reveals that problem (6.5a) is trivial for Δ = 0 and that its
supremum equals 1. Assume now that Δ > 0, and let Q ∈ P(Z) be an arbitrary
distribution with mean −𝑧0/Δ² and covariance matrix ((1 + Δ²)/Δ²)(𝐼𝑑 − 𝑧0𝑧0⊤/Δ²).
Such a distribution is guaranteed to exist because 𝐼𝑑 ⪰ 𝑧0𝑧0⊤/Δ². In addition,
define P★ as
where P(𝜇, Σ) denotes the Chebyshev ambiguity set that contains all probability
distributions on R𝑑 with mean 𝜇 ∈ R𝑑 and covariance matrix Σ ∈ S+𝑑 .
We now describe a powerful tool for analyzing the Chebyshev risk with respect
to any law-, translation- and scale-invariant risk measure. To this end, recall that
if 𝑍 follows some distribution P on R𝑑 , then 𝐿 = ℓ(𝑍) follows the pushforward
distribution P ◦ ℓ −1 on R. If P is uncertain and only known to belong to some
ambiguity set P, then the distribution of 𝐿 = ℓ(𝑍) is also uncertain and only known
to belong to the pushforward ambiguity set P ◦ ℓ −1 = {P ◦ ℓ −1 : P ∈ P}. The
following proposition due to Popescu (2007) shows that linear pushforwards of
Chebyshev ambiguity sets are again Chebyshev ambiguity sets.
Proposition 6.7 (Pushforwards of Chebyshev Ambiguity Sets). If 𝜇 ∈ R𝑑 , Σ ∈ S+𝑑 ,
𝜃 ∈ R𝑑 , and ℓ : R𝑑 → R is the linear transformation defined through ℓ(𝑧) = 𝜃 ⊤ 𝑧,
then the pushforward of the Chebyshev ambiguity set P(𝜇, Σ) is the Chebyshev
ambiguity set of all distributions on R with mean 𝜃 ⊤ 𝜇 and variance 𝜃 ⊤ Σ𝜃, that is,
P(𝜇, Σ) ◦ ℓ −1 = P(𝜃 ⊤ 𝜇, 𝜃 ⊤ Σ𝜃).
Proof. Select first any distribution P ∈ P(𝜇, Σ). If the random vector 𝑍 follows P,
then the random variable 𝐿 = ℓ(𝑍) follows P ◦ ℓ −1 . Thus, we have
E P◦ℓ −1 [𝐿] = E P [ℓ(𝑍)] = E P [𝜃 ⊤ 𝑍] = 𝜃 ⊤ 𝜇,
where the first equality follows from the measure-theoretic change of variables
formula. Similarly, one can show that E P◦ℓ −1 [(𝐿 − 𝜃 ⊤ 𝜇)2 ] = 𝜃 ⊤ Σ𝜃. Thus, we find
P(𝜇, Σ) ◦ ℓ −1 ⊆ P(𝜃 ⊤ 𝜇, 𝜃 ⊤ Σ𝜃).
Next, select any Q 𝐿 ∈ P(𝜃 ⊤ 𝜇, 𝜃 ⊤ Σ𝜃). If 𝜃 ⊤ Σ𝜃 = 0, then Q 𝐿 = 𝛿 𝜃 ⊤ 𝜇 , which
coincides with the pushforward distribution P ◦ ℓ −1 for any P ∈ P(𝜇, Σ). In the
remainder of the proof we may thus assume that 𝜃 ⊤ Σ𝜃 ≠ 0. Let now 𝐿 be a random
variable governed by Q 𝐿 , and let 𝑀 be a 𝑑-dimensional random vector governed
by an arbitrary distribution Q 𝑀 ∈ P(R𝑑 ) with mean 𝜇 and covariance matrix Σ.
For example, we can set Q 𝑀 to the normal distribution N (𝜇, Σ). Assume 𝐿 and 𝑀
are independent. Then, the distribution P of the 𝑑-dimensional random vector
𝑍 = (1/(𝜃⊤Σ𝜃)) Σ𝜃 𝐿 + (𝐼𝑑 − (1/(𝜃⊤Σ𝜃)) Σ𝜃𝜃⊤) 𝑀
and
E_P[(𝑍 − 𝜇)(𝑍 − 𝜇)⊤]
= (1/(𝜃⊤Σ𝜃)) Σ𝜃𝜃⊤Σ + (𝐼𝑑 − (1/(𝜃⊤Σ𝜃)) Σ𝜃𝜃⊤) Σ (𝐼𝑑 − (1/(𝜃⊤Σ𝜃)) 𝜃𝜃⊤Σ) = Σ.
The first equality in the above expression holds because 𝐿 and 𝑀 are independent,
𝐿 has variance 𝜃 ⊤ Σ𝜃 and 𝑀 has covariance matrix Σ. By construction, we further
Proof. If 𝜃 ⊤ Σ𝜃 = 0, then
sup_{P∈P(𝜇,Σ)} 𝜚P[𝜃⊤𝑍] = 𝜃⊤𝜇 + sup_{P∈P(𝜇,Σ)} 𝜚P[𝜃⊤(𝑍 − 𝜇)] = 𝜃⊤𝜇 + sup_{P∈P(𝜇,Σ)} 𝜚P[0] = 𝜃⊤𝜇,
where the first equality holds because 𝜚 is translation invariant, whereas the second
equality holds because 𝜃 ⊤ (𝑍 − 𝜇) equals 0 in law under any P ∈ P(𝜇, Σ) and
because 𝜚 is law-invariant. Finally, the third equality follows from the scale-
invariance of 𝜚. If 𝜃 ⊤ Σ𝜃 > 0, on the other hand, then we have
sup_{P∈P(𝜇,Σ)} 𝜚P[𝜃⊤𝑍] = 𝜃⊤𝜇 + sup_{P∈P(𝜇,Σ)} 𝜚P[𝜃⊤(𝑍 − 𝜇)]
= 𝜃⊤𝜇 + sup_{P∈P(𝜇,Σ)} 𝜚P[(𝜃⊤(𝑍 − 𝜇)/√(𝜃⊤Σ𝜃)) √(𝜃⊤Σ𝜃)]
= 𝜃⊤𝜇 + 𝛼 √(𝜃⊤Σ𝜃),
where the first two equalities follow from the translation- and scale-invariance of 𝜚,
respectively. The third equality follows from Proposition 6.7, the law-invariance
of 𝜚 and the definition of 𝛼. Indeed, the pushforward of the multivariate Chebyshev
ambiguity set P(𝜇, Σ) under the transformation ℓ(𝑧) = 𝜃⊤(𝑧 − 𝜇)/√(𝜃⊤Σ𝜃) coincides
with the univariate standard Chebyshev ambiguity set P(0, 1).
The rest of the proof proceeds as follows. We first derive an analytical formula for
the worst-case 𝛽-VaR on the right hand side (Step 1). Next, we prove that the same
analytical formula provides an upper bound on the worst-case 𝛽-CVaR on the left
hand side (Step 2). The claim then follows from the above inequality.
Step 1. We first express the worst-case 𝛽-VaR as its smallest upper bound to find
sup_{Q∈P(0,1)} 𝛽-VaRQ[𝐿] = inf_{𝜏∈R} { 𝜏 : 𝛽-VaRQ[𝐿] ≤ 𝜏 ∀Q ∈ P(0, 1) }
2 The Chebyshev ambiguity set P(0, 1) is not weakly compact (see Example 3.9). Therefore, Sion’s
minimax theorem does not allow us to interchange the infimum over 𝜏 and the supremum over Q.
While we could instead invoke Theorem 5.18, this is actually not needed to prove Proposition 6.10.
where the first equality follows from Scarf’s bound derived in Proposition 6.5,
and the last equality is obtained by analytically solving the convex minimization
problem over 𝜏. The unique minimizer is given by
𝜏★ = (1 − 2𝛽) / (2√(𝛽(1 − 𝛽))).
This completes Step 2. The claim then follows by combining the analytical formula
for the worst-case 𝛽-VaR found in Step 1 and the analytical upper bound on the
worst-case 𝛽-CVaR found in Step 2 with the elementary inequality (6.6).
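The equality of the worst-case VaR and CVaR over P(0, 1) admits a simple numerical companion: both equal √((1 − 𝛽)/𝛽) (the Cantelli bound), and a two-point distribution attains this value. The closed form is taken from the discussion above; the level 𝛽 is an illustrative assumption.

```python
# A two-point distribution in P(0, 1) attaining the worst-case beta-CVaR.
import numpy as np

beta = 0.1
bound = np.sqrt((1 - beta) / beta)           # worst-case VaR and CVaR over P(0, 1)

hi = np.sqrt((1 - beta) / beta)              # atom with probability beta
lo = -np.sqrt(beta / (1 - beta))             # atom with probability 1 - beta
p = np.array([beta, 1 - beta])
atoms = np.array([hi, lo])
assert np.isclose(p @ atoms, 0.0)            # mean 0
assert np.isclose(p @ atoms ** 2, 1.0)       # variance 1

# Its beta-CVaR: the upper beta-tail is exactly the atom hi, so CVaR = hi = bound.
taus = np.linspace(-2.0, 4.0, 60001)
cvar = min(t + p @ np.maximum(atoms - t, 0.0) / beta for t in taus)
assert abs(cvar - bound) < 1e-3
```

Since 𝛽-VaR ≤ 𝛽-CVaR, attaining the bound with the CVaR shows that the same two-point distribution is essentially worst case for both risk measures.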
Propositions 6.9 and 6.10 provide an analytical formula for the Chebyshev risk
of a linear loss function provided that the underlying risk measure is the VaR or
the CVaR. The formula for the worst-case VaR was first derived in (Lanckriet et al.
2001, 2002, El Ghaoui et al. 2003); see also (Calafiore and El Ghaoui 2006). The
equality of the worst-case VaR and the worst-case CVaR was discovered in (Zymler
et al. 2013a). It not only holds for linear but also for arbitrary concave and arbitrary
quadratic (not necessarily concave) loss functions. Proposition 6.9 follows from
(Nguyen et al. 2021). The standard risk coefficient can be characterized in closed
form for a wealth of law-, translation- and scale-invariant risk measures other than
the VaR and the CVaR. It is available, for instance, for all spectral risk measures
and all risk measures that admit a Kusuoka representation (Li 2018) as well as all
distortion risk measures (Cai et al. 2023); see also (Nguyen et al. 2021).
By construction, $\mathcal{G}_r(\hat\mu, \hat\Sigma)$ is the union of all Chebyshev ambiguity sets $\mathcal{P}(\mu, \Sigma)$
corresponding to a mean-covariance pair $(\mu, \Sigma)$ with $\mathrm{G}((\mu, \Sigma), (\hat\mu, \hat\Sigma)) \le r$. This
decomposition of the Gelbrich ambiguity set into Chebyshev ambiguity sets allows
us via Proposition 6.9 to derive an analytical formula for the Gelbrich risk.
Proposition 6.11 (Gelbrich Risk). Assume that 𝜚 is a law-, translation- and scale-invariant risk measure with standard risk coefficient $\alpha \in \mathbb{R}_+$, there is $\theta \in \mathbb{R}^d$ with $\ell(z) = \theta^\top z$ for all $z \in \mathbb{R}^d$, and $\mathcal{G}_r(\hat\mu, \hat\Sigma)$ is the Gelbrich ambiguity set of all distributions whose mean-covariance pairs have Gelbrich distance at most $r$ from $(\hat\mu, \hat\Sigma)$. Then,
$$
\sup_{\mathbb{P}\in\mathcal{G}_r(\hat\mu,\hat\Sigma)} \varrho_{\mathbb{P}}\big[\theta^\top Z\big] = \hat\mu^\top\theta + \alpha\sqrt{\theta^\top\hat\Sigma\theta} + r\sqrt{1+\alpha^2}\,\|\theta\|_2. \tag{6.7}
$$
Proof. Assume first that Σ̂ ≻ 0. If 𝜃 = 0, then the claim holds trivially because 𝜚
is law- and scale-invariant. If 𝑟 = 0, then the claim follows immediately from
Proposition 6.9. We may thus assume that 𝜃 ≠ 0 and 𝑟 > 0. In this case, we have
$$
\begin{aligned}
\sup_{\mathbb{P}\in\mathcal{G}_r(\hat\mu,\hat\Sigma)} \varrho_{\mathbb{P}}\big[\theta^\top Z\big]
&= \sup\Big\{ \sup_{\mathbb{P}\in\mathcal{P}(\mu,\Sigma)} \varrho_{\mathbb{P}}\big[\theta^\top Z\big] \;:\; \mu\in\mathbb{R}^d,\ \Sigma\in\mathbb{S}_+^d,\ \mathrm{G}\big((\mu,\Sigma),(\hat\mu,\hat\Sigma)\big)\le r \Big\} \\
&= \sup\Big\{ \mu^\top\theta + \alpha\sqrt{\theta^\top\Sigma\theta} \;:\; \mu\in\mathbb{R}^d,\ \Sigma\in\mathbb{S}_+^d,\ \|\mu-\hat\mu\|_2^2 + \operatorname{Tr}\big(\Sigma+\hat\Sigma-2\big(\hat\Sigma^{\frac12}\Sigma\hat\Sigma^{\frac12}\big)^{\frac12}\big)\le r^2 \Big\},
\end{aligned}
$$
where the first equality exploits the decomposition of the Gelbrich ambiguity set
into Chebyshev ambiguity sets. The second equality follows from Proposition 6.9
and Definition 2.1. By dualizing the resulting convex optimization problem, we find
$$
\begin{aligned}
\sup_{\mathbb{P}\in\mathcal{G}_r(\hat\mu,\hat\Sigma)} \varrho_{\mathbb{P}}\big[\theta^\top Z\big]
= \inf_{\gamma\in\mathbb{R}_+}\ &\gamma\big(r^2-\operatorname{Tr}(\hat\Sigma)\big) + \sup_{\mu\in\mathbb{R}^d}\Big\{ \mu^\top\theta - \gamma\|\mu-\hat\mu\|_2^2 \Big\} \\
&+ \sup_{\Sigma\in\mathbb{S}_+^d}\Big\{ \alpha\sqrt{\theta^\top\Sigma\theta} - \gamma\operatorname{Tr}\big(\Sigma - 2\big(\hat\Sigma^{\frac12}\Sigma\hat\Sigma^{\frac12}\big)^{\frac12}\big) \Big\}.
\end{aligned} \tag{6.8}
$$
Strong duality holds because $r > 0$, which implies that $(\hat\mu, \hat\Sigma)$ constitutes a Slater
point for the primal maximization problem. If $\gamma = 0$, then the maximization
problems over $\mu$ and $\Sigma$ in (6.8) are unbounded. We may thus restrict $\gamma$ to be strictly
positive. For any fixed $\gamma > 0$, the maximization problem over $\mu$ can be solved in
closed form. Its optimal value is given by $\hat\mu^\top\theta + \|\theta\|_2^2/(4\gamma)$. By introducing an
auxiliary variable $t$, the maximization problem over $\Sigma$ can be reformulated as
$$
\begin{array}{cl}
\sup & \alpha t - \gamma\operatorname{Tr}\big(\Sigma - 2\big(\hat\Sigma^{\frac12}\Sigma\hat\Sigma^{\frac12}\big)^{\frac12}\big) \\[0.5ex]
\text{s.t.} & t\in\mathbb{R}_+,\ \Sigma\in\mathbb{S}_+^d,\ t^2-\theta^\top\Sigma\theta\le 0.
\end{array} \tag{6.9}
$$
Note that $t = 0$ and $\Sigma = \theta\theta^\top$ form a Slater point for (6.9) because $\theta \ne 0$. Thus,
problem (6.9) admits a strong dual. The variable substitution $B \leftarrow (\hat\Sigma^{\frac12}\Sigma\hat\Sigma^{\frac12})^{\frac12}$
allows us to reformulate this dual problem more concisely as
$$
\inf_{\lambda\in\mathbb{R}_+}\ \sup_{t\in\mathbb{R}_+}\big\{ \alpha t - \lambda t^2 \big\} + \sup_{B\in\mathbb{S}_+^d}\big\{ \operatorname{Tr}\big(B^2\Delta_\lambda\big) + 2\gamma\operatorname{Tr}(B) \big\}, \tag{6.10}
$$
where $\Delta_\lambda = \hat\Sigma^{-\frac12}(\lambda\theta\theta^\top - \gamma I_d)\hat\Sigma^{-\frac12}$ for any $\lambda \ge 0$. Note that $\Delta_\lambda$ is well-defined
because Σ̂ ≻ 0. Recall now that the standard risk coefficient 𝛼 was assumed to be
$$
= \inf_{0<\lambda<\gamma\|\theta\|_2^{-2}}\ \frac{\alpha^2}{4\lambda} + \gamma\operatorname{Tr}(\hat\Sigma) + \frac{\theta^\top\hat\Sigma\theta}{\lambda^{-1} - \|\theta\|_2^2/\gamma}
= \gamma\operatorname{Tr}(\hat\Sigma) + \frac{\alpha^2\|\theta\|_2^2}{4\gamma} + \alpha\sqrt{\theta^\top\hat\Sigma\theta}.
$$
Here, the first equality exploits the Sherman-Morrison formula (Bernstein 2009,
Corollary 2.8.8) to rewrite the inverse matrix, and the second equality is obtained
by solving the minimization problem over 𝜆 analytically. Indeed, the infimum is
attained at the unique solution 𝜆★ of the first-order condition
$$
\frac{1}{\lambda} = \frac{\|\theta\|_2^2}{\gamma} + \frac{2}{\alpha}\sqrt{\theta^\top\hat\Sigma\theta}
$$
in the interior of the feasible set. In summary, we have solved both embedded
subproblems in (6.8) analytically. Substituting their optimal values into (6.8) yields
$$
\begin{aligned}
\sup_{\mathbb{P}\in\mathcal{G}_r(\hat\mu,\hat\Sigma)} \varrho_{\mathbb{P}}\big[\theta^\top Z\big]
&= \inf_{\gamma\ge 0}\ \hat\mu^\top\theta + \alpha\sqrt{\theta^\top\hat\Sigma\theta} + \gamma r^2 + \frac{(1+\alpha^2)\,\|\theta\|_2^2}{4\gamma} \\
&= \hat\mu^\top\theta + \alpha\sqrt{\theta^\top\hat\Sigma\theta} + r\sqrt{1+\alpha^2}\,\|\theta\|_2.
\end{aligned}
$$
Here, the second equality is obtained by solving the minimization problem over 𝛾
in closed form. We have thus established the desired formula (6.7) for Σ̂ ≻ 0.
It remains to be shown that (6.7) remains valid even if Σ̂ is singular. To this end,
use 𝐽(Σ̂) as a shorthand for the Gelbrich risk as a function of Σ̂. By leveraging
Berge’s maximum theorem (Berge 1963, pp. 115–116) and the continuity of the
Gelbrich distance (see the discussion after Proposition 2.2), it is easy to show
that 𝐽(Σ̂) is continuous on S+𝑑 . The claim thus follows by noting that (6.7) holds
for all Σ̂ ≻ 0, that both sides of (6.7) are continuous in Σ̂ and that every Σ̂ ∈ S+𝑑
can be expressed as a limit of positive definite matrices.
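The final one-dimensional minimization over $\gamma$ in this proof is easy to verify numerically. The Python sketch below is ours and uses made-up input numbers: it compares a grid minimization of the dual objective with the closed-form Gelbrich risk formula.

```python
import math

def gelbrich_dual_value(mu_theta, cheb_term, r, alpha, theta_norm,
                        grid=400000, gamma_max=40.0):
    """Minimize the one-dimensional dual objective of (6.8) over gamma > 0."""
    best = float("inf")
    for i in range(1, grid + 1):
        gamma = gamma_max * i / grid
        val = (mu_theta + cheb_term + gamma * r * r
               + (1.0 + alpha * alpha) * theta_norm * theta_norm / (4.0 * gamma))
        best = min(best, val)
    return best

# Hypothetical inputs chosen for illustration: mu_hat' theta, the Chebyshev term
# alpha * sqrt(theta' Sigma_hat theta), the radius r, alpha, and ||theta||_2.
mu_theta, cheb_term, r, alpha, theta_norm = 1.0, 2.0, 0.5, 1.5, 2.0
numeric = gelbrich_dual_value(mu_theta, cheb_term, r, alpha, theta_norm)
closed_form = mu_theta + cheb_term + r * math.sqrt(1.0 + alpha * alpha) * theta_norm
```

The grid minimum matches $\hat\mu^\top\theta + \alpha\sqrt{\theta^\top\hat\Sigma\theta} + r\sqrt{1+\alpha^2}\,\|\theta\|_2$ up to the grid resolution.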
Proposition 6.11 is due to Nguyen et al. (2021). It shows that, for a broad class
of risk measures, the worst-case risk over a Gelbrich ambiguity set reduces to a
which maximizes the expected value of ℓ(𝑍) over the Kullback-Leibler ambiguity
set of all distributions supported on Z whose Kullback-Leibler divergence with
respect to P̂ ∈ P(Z) is at most 𝑟 ≥ 0. The Kullback-Leibler ambiguity set is a 𝜙-
divergence ambiguity set of the form (2.10), where 𝜙 satisfies 𝜙(𝑠) = 𝑠 log(𝑠) − 𝑠 + 1
for all 𝑠 ≥ 0. As 𝜙∞ (1) = +∞, we have KL(P, P̂) = ∞ unless P ≪ P̂. Hence,
problem (6.11a) maximizes only over distributions P that are absolutely continuous
with respect to $\hat{\mathbb{P}}$. Note that $\phi^*(t) = e^t - 1$ for all $t\in\mathbb{R}$. By Theorem 4.14 and the
definition of the perspective function, the problem dual to (6.11a) is thus given by
$$
\inf_{\lambda_0\in\mathbb{R},\,\lambda\in\mathbb{R}_+}\ \lambda_0 + \lambda(r-1) + \mathbb{E}_{\hat{\mathbb{P}}}\left[ \lambda\exp\left(\frac{\ell(Z)-\lambda_0}{\lambda}\right) \right]. \tag{6.11b}
$$
The problems (6.11a) and (6.11b) can be solved in closed form if the loss function ℓ
is linear and the nominal distribution P̂ is Gaussian.
Proposition 6.12 (Worst-Case Expectations over KL Ambiguity Sets). Suppose
that $\mathcal{Z} = \mathbb{R}^d$, $\hat{\mathbb{P}} \in \mathcal{P}(\mathcal{Z})$ is a normal distribution with mean $\hat\mu \in \mathbb{R}^d$ and covariance
matrix $\hat\Sigma \in \mathbb{S}_{++}^d$, and $r > 0$. Suppose also that $\ell$ is linear, that is, there exists $\theta \in \mathbb{R}^d$
with $\ell(z) = \theta^\top z$ for all $z \in \mathcal{Z}$. Then, the primal problem (6.11a) is solved by the
normal distribution $\mathbb{P}^\star$ with mean $\hat\mu + (2r)^{\frac12}\,\hat\Sigma\theta/(\theta^\top\hat\Sigma\theta)^{\frac12}$ and covariance matrix $\hat\Sigma$.
The dual problem (6.11b) is solved by $(\lambda_0^\star, \lambda^\star)$, where $\lambda^\star = (\theta^\top\hat\Sigma\theta)^{\frac12}/(2r)^{\frac12}$ and
$\lambda_0^\star = \lambda^\star \log\mathbb{E}_{\hat{\mathbb{P}}}\big[\exp\big(\ell(Z)/\lambda^\star\big)\big]$.
The optimal values of (6.11a) and (6.11b) are both equal to $\hat\mu^\top\theta + (2r)^{\frac12}(\theta^\top\hat\Sigma\theta)^{\frac12}$.
Proof. Focus first on the dual problem (6.11b), and fix any 𝜆 ≥ 0. Then, the partial
minimization problem over 𝜆 0 is solved by
𝜆★0 (𝜆) = 𝜆 log E P̂ [exp (ℓ(𝑍)/𝜆)] .
Substituting this parametric minimizer back into (6.11b) shows that the optimal
value of the dual problem (6.11b) is given by
$$
\inf_{\lambda\in\mathbb{R}_+}\ \lambda r + \lambda\log\mathbb{E}_{\hat{\mathbb{P}}}\left[\exp\left(\frac{\ell(Z)}{\lambda}\right)\right]
= \inf_{\lambda\in\mathbb{R}_+}\ \lambda r + \hat\mu^\top\theta + \frac{\theta^\top\hat\Sigma\theta}{2\lambda}
= \hat\mu^\top\theta + (2r)^{\frac12}(\theta^\top\hat\Sigma\theta)^{\frac12},
$$
where the first equality exploits the linearity of ℓ, the normality of P̂ and the formula
for the expected value of a log-normal distribution. The second equality holds
because the minimization problem over $\lambda \ge 0$ is solved by $\lambda^\star = (\theta^\top\hat\Sigma\theta)^{\frac12}/(2r)^{\frac12}$.
Next, define $\mathbb{P}^\star \in \mathcal{P}(\mathcal{Z})$ as the normal distribution with mean $\mu^\star = \hat\mu + \hat\Sigma\theta/\lambda^\star$ and
covariance matrix Σ★ = Σ̂. Comparing the density functions of P̂ and P★ shows that
$$
\frac{\mathrm{d}\mathbb{P}^\star}{\mathrm{d}\hat{\mathbb{P}}}(z) = \exp\left( \frac{\theta^\top(z-\hat\mu)}{\lambda^\star} - \frac{\theta^\top\hat\Sigma\theta}{2(\lambda^\star)^2} \right) \quad \forall z\in\mathcal{Z}.
$$
By Definition 2.8, we thus obtain
$$
\mathrm{KL}(\mathbb{P}^\star, \hat{\mathbb{P}}) = \int_{\mathcal{Z}} \log\left( \frac{\mathrm{d}\mathbb{P}^\star}{\mathrm{d}\hat{\mathbb{P}}}(z) \right) \mathrm{d}\mathbb{P}^\star(z) = \frac{\theta^\top\hat\Sigma\theta}{2(\lambda^\star)^2} = r,
$$
where the second and the third equalities follow readily from our formula for the
Radon-Nikodym derivative dP★/dP̂ and from basic algebra, respectively. Hence,
$\mathbb{P}^\star$ is feasible in (6.11a). In addition, its objective function value is given by
$$
\mathbb{E}_{\mathbb{P}^\star}[\ell(Z)] = \theta^\top\mu^\star = \hat\mu^\top\theta + (2r)^{\frac12}(\theta^\top\hat\Sigma\theta)^{\frac12}.
$$
As the objective function values of P★ and (𝜆★0 , 𝜆★) with 𝜆★0 = 𝜆★0 (𝜆★) match,
weak duality as established in Theorem 4.14 implies that P★ is primal optimal and
that (𝜆★0 , 𝜆★) is dual optimal. This observation completes the proof.
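Proposition 6.12 can be verified numerically without any optimization, because the KL divergence between two Gaussians with a shared covariance matrix is available in closed form. The sketch below is our own illustration: it checks that the candidate $\mathbb{P}^\star$ sits on the boundary of the KL ball and attains the claimed optimal value.

```python
import numpy as np

def kl_gaussians_same_cov(mu1, mu0, Sigma):
    # KL divergence between N(mu1, Sigma) and N(mu0, Sigma).
    d = mu1 - mu0
    return 0.5 * float(d @ np.linalg.solve(Sigma, d))

rng = np.random.default_rng(0)
d, r = 3, 0.7
theta = rng.standard_normal(d)
mu_hat = rng.standard_normal(d)
A = rng.standard_normal((d, d))
Sigma_hat = A @ A.T + np.eye(d)             # a positive definite nominal covariance

s = float(theta @ Sigma_hat @ theta)        # theta' Sigma_hat theta
lam = np.sqrt(s / (2.0 * r))                # dual solution lambda*
mu_star = mu_hat + Sigma_hat @ theta / lam  # mean of the worst-case distribution P*

# P* lies on the boundary of the KL ball ...
kl = kl_gaussians_same_cov(mu_star, mu_hat, Sigma_hat)
# ... and attains the closed-form optimal value.
value = float(theta @ mu_star)
closed_form = float(mu_hat @ theta) + np.sqrt(2.0 * r * s)
```

Both checks hold to machine precision for any choice of $\theta$, $\hat\mu$, $\hat\Sigma \succ 0$ and $r > 0$.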
which maximizes the expected value of ℓ(𝑍) over a total variation ball of radius 𝑟 ∈
[0, 1] around P̂ ∈ P(Z). Recall from Section 2.2.3 that the total variation distance
is a 𝜙-divergence and that the underlying entropy function satisfies $\phi(s) = \frac{1}{2}|s-1|$
for all $s \ge 0$ and $\phi(s) = \infty$ for all $s < 0$. Recall also that the total variation distance
between two distributions is bounded above by 1 and that this bound is attained
if the two distributions are mutually singular. An elementary calculation reveals
that the conjugate entropy function satisfies $\phi^*(t) = \max\{t+\frac{1}{2}, 0\} - \frac{1}{2}$ if $t \le \frac{1}{2}$ and
$\phi^*(t) = +\infty$ if $t > \frac{1}{2}$. By Theorem 4.14, the problem dual to (6.12a) is thus given by
$$
\begin{array}{cl}
\displaystyle\inf_{\lambda_0\in\mathbb{R},\,\lambda\in\mathbb{R}_+} & \lambda_0 + \lambda\left(r-\dfrac{1}{2}\right) + \mathbb{E}_{\hat{\mathbb{P}}}\left[ \max\left\{ \ell(Z) - \lambda_0 + \dfrac{\lambda}{2},\, 0 \right\} \right] \\[1ex]
\text{s.t.} & \lambda_0 + \lambda/2 \ge \sup_{z\in\mathcal{Z}} \ell(z).
\end{array} \tag{6.12b}
$$
The problems (6.12a) and (6.12b) can be solved in closed form if Z is compact.
Proposition 6.13 (Worst-Case Expectations over Total Variation Balls). Suppose
that Z ⊆ R𝑑 is compact, P̂ ∈ P(Z) and 𝑟 ∈ (0, 1), and define 𝛽𝑟 = 1 − 𝑟. In
addition, assume that E P̂ [ℓ(𝑍)] > −∞ and ℓ is upper semicontinuous. Then, the
optimal values of (6.12a) and (6.12b) are both equal to
$$
(1-\beta_r)\cdot\sup_{z\in\mathcal{Z}}\ell(z) + \beta_r\cdot\beta_r\text{-CVaR}_{\hat{\mathbb{P}}}[\ell(Z)]. \tag{6.13}
$$
The proof of Proposition 6.13 will reveal that (6.12a) and (6.12b) are both
solvable. Indeed, we will construct optimal solutions P★ and (𝜆★0 , 𝜆★) for (6.12a)
and (6.12b), respectively. A precise description of these optimizers is cumbersome
and thus omitted from the proposition statement. If the loss ℓ(𝑍) has a continuous
distribution under P̂, however, then P★ admits a simpler and more intuitive descrip-
tion. Indeed, in this case, P★ is obtained from P̂ by shifting the probability mass
of all outcomes 𝑧 ∈ Z associated with a high loss ℓ(𝑧) ≥ 𝛽𝑟 -VaRP̂ [ℓ(𝑍)] to some
outcome 𝑧 ∈ Z associated with the highest possible loss ℓ(𝑧) = max 𝑧 ′ ∈Z ℓ(𝑧′ ).
Proof of Proposition 6.13. For ease of notation, set $\overline{\ell} = \sup_{z\in\mathcal{Z}}\ell(z)$. Focus first on
the dual problem (6.12b), and fix any 𝜆 ≥ 0. Note that the dual objective function is
non-decreasing in 𝜆 0 . The partial minimization problem over 𝜆 0 is therefore solved
by $\lambda_0^\star(\lambda) = \overline{\ell} - \lambda/2$. Substituting this parametric minimizer back into (6.12b) shows
that the optimal value of the dual problem is given by
$$
\begin{aligned}
&\overline{\ell} + \inf_{\lambda\in\mathbb{R}_+}\ \lambda(r-1) + \mathbb{E}_{\hat{\mathbb{P}}}\big[\max\{\ell(Z)-\overline{\ell}+\lambda,\,0\}\big] \\
&\qquad= r\,\overline{\ell} + (1-r)\,\inf_{\tau\le\overline{\ell}}\ \tau + (1-r)^{-1}\,\mathbb{E}_{\hat{\mathbb{P}}}\big[\max\{\ell(Z)-\tau,\,0\}\big],
\end{aligned}
$$
which coincides with (6.13) because $\beta_r = 1-r$ and thanks to the variational definition of the CVaR.
To construct a primal maximizer, assume first that $\hat{\mathbb{P}}(\ell(Z) < \overline{\ell}) \le r$, which implies
that $\beta_r\text{-CVaR}_{\hat{\mathbb{P}}}[\ell(Z)] = \overline{\ell}$. Thus, the optimal value of the dual problem (6.12b)
simplifies to $\overline{\ell}$, which is attained by any distribution $\mathbb{P}^\star$ that is obtained from $\hat{\mathbb{P}}$ by
moving all probability mass from $\{z\in\mathcal{Z} : \ell(z) < \overline{\ell}\}$ to $\{z\in\mathcal{Z} : \ell(z) = \overline{\ell}\}$.
Next, assume that $\hat{\mathbb{P}}(\ell(Z) < \overline{\ell}) > r$, which implies that $\beta_r\text{-VaR}_{\hat{\mathbb{P}}}[\ell(Z)] < \overline{\ell}$. In
this case, we partition $\mathcal{Z}$ into the following four subsets.
$$
\begin{aligned}
\mathcal{Z}_1 &= \big\{ z\in\mathcal{Z} : \beta_r\text{-VaR}_{\hat{\mathbb{P}}}[\ell(Z)] > \ell(z) \big\} \\
\mathcal{Z}_2 &= \big\{ z\in\mathcal{Z} : \overline{\ell} > \ell(z) = \beta_r\text{-VaR}_{\hat{\mathbb{P}}}[\ell(Z)] \big\} \\
\mathcal{Z}_3 &= \big\{ z\in\mathcal{Z} : \overline{\ell} > \ell(z) > \beta_r\text{-VaR}_{\hat{\mathbb{P}}}[\ell(Z)] \big\} \\
\mathcal{Z}_4 &= \big\{ z\in\mathcal{Z} : \overline{\ell} = \ell(z) \big\}
\end{aligned}
$$
Note that Z1 and Z3 can be empty, whereas Z2 and Z4 must be non-empty. We
also define P̂𝑖 as the nominal distribution P̂ conditioned on the event 𝑍 ∈ Z𝑖 for
all 𝑖 ∈ [4], and we define UZ4 as the uniform distribution on Z4 . Next, we set
$$
\mathbb{P}^\star = \big( \beta_r - \hat{\mathbb{P}}(Z\in\mathcal{Z}_3) - \hat{\mathbb{P}}(Z\in\mathcal{Z}_4) \big)\cdot\hat{\mathbb{P}}_2\ +
$$
value of (6.12b). Weak duality as established in Theorem 4.14 thus implies that P★
solves the primal problem (6.12a). This observation completes the proof.
Jiang and Guan (2018) and Shapiro (2017) study a variant of problem (6.12a) that
maximizes over a restricted total variation ball. Thus, they additionally impose P ≪
P̂ in (6.12a). The supremum of the resulting restricted problem amounts to
(1 − 𝛽𝑟 ) · ess supP̂ [ℓ(𝑍)] + 𝛽𝑟 · 𝛽𝑟 -CVaRP̂ [ℓ(𝑍)] ,
which may be strictly smaller than (6.13). If additionally ℓ(𝑍) has a continuous
marginal distribution under P̂, then the supremum is no longer attained.
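On a finite support, the worst-case expectation over a total variation ball can be computed exactly by a greedy mass transfer, which makes (6.13) easy to test. The following sketch is ours; the greedy solver and the discrete CVaR routine are our own constructions, and it verifies the formula on a small discrete example.

```python
import numpy as np

def worst_case_tv(losses, p_hat, r):
    # Exact worst case over the TV ball on a fixed finite support: move up to r
    # probability mass from the lowest-loss atoms onto a highest-loss atom.
    order = np.argsort(losses)             # ascending losses
    p = p_hat.astype(float).copy()
    budget = r
    for i in order:
        if losses[i] == losses.max():
            break
        take = min(p[i], budget)
        p[i] -= take
        budget -= take
    p[np.argmax(losses)] += r - budget     # deposit the moved mass at the maximum
    return float(p @ losses)

def cvar(losses, p_hat, beta):
    # beta-CVaR of a discrete distribution: average of the worst beta mass.
    order = np.argsort(losses)[::-1]       # descending losses
    mass, total = beta, 0.0
    for i in order:
        take = min(p_hat[i], mass)
        total += take * losses[i]
        mass -= take
        if mass <= 0:
            break
    return total / beta

losses = np.array([0.0, 1.0, 2.0, 3.0])
p_hat = np.array([0.4, 0.3, 0.2, 0.1])
r = 0.25
beta_r = 1.0 - r
lhs = worst_case_tv(losses, p_hat, r)
rhs = (1.0 - beta_r) * losses.max() + beta_r * cvar(losses, p_hat, beta_r)
```

For this example both sides of (6.13) evaluate to $1.75$.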
which maximizes the expected value of ℓ(𝑍) over a Lévy-Prokhorov ball of ra-
dius 𝑟 ∈ [0, 1] around P̂ ∈ P(Z). We assume here that the Lévy-Prokhorov dis-
tance is induced by a norm k · k on R𝑑 . By Proposition 2.22, the Lévy-Prokhorov
ball of radius 𝑟 ∈ (0, 1) coincides with the optimal transport ambiguity set
$$
\mathcal{P} = \big\{ \mathbb{P}\in\mathcal{P}(\mathcal{Z}) : \mathrm{OT}_{c_r}(\mathbb{P}, \hat{\mathbb{P}}) \le r \big\},
$$
where the transportation cost function $c_r$ is defined through $c_r(z,\hat z) = \mathbb{1}_{\|z-\hat z\|>r}$.
Theorem 4.18 thus implies that the problem dual to (6.14a) is given by
$$
\inf_{\lambda\in\mathbb{R}_+}\ \lambda r + \mathbb{E}_{\hat{\mathbb{P}}}\Big[ \sup_{z\in\mathcal{Z}}\ \ell(z) - \lambda c_r(z, \hat Z) \Big] \tag{6.14b}
$$
Proof. For ease of notation we introduce two auxiliary functions $f$ and $g$ from $\mathcal{Z}$
to $\mathbb{R}$, which are defined through $f(z) = \ell(z) - \lambda\cdot\mathbb{1}_{\|z-\hat z\|>r}$ and $g(z) = \ell_r(z) - \lambda\cdot\mathbb{1}_{z\ne\hat z}$
for all $z\in\mathcal{Z}$. Note that both $f$ and $g$ are upper semicontinuous.
First, select $z^\star \in \arg\max_{z\in\mathcal{Z}} f(z)$, which exists because $\mathcal{Z}$ is compact and $f$ is
upper semicontinuous. If $\|z^\star - \hat z\| > r$, then the definition of $\ell_r$ implies that
$$
\sup_{z\in\mathcal{Z}} f(z) = f(z^\star) = \ell(z^\star) - \lambda \le \ell_r(z^\star) - \lambda = g(z^\star) \le \sup_{z\in\mathcal{Z}} g(z).
$$
Next, select $\tilde z \in \arg\max_{z\in\mathcal{Z}} g(z)$. If $\tilde z \ne \hat z$, then with $z^\star \in \arg\max_{z\in\mathcal{Z}} \ell(z)$ we have
$$
\sup_{z\in\mathcal{Z}} g(z) = g(\tilde z) = \ell_r(\tilde z) - \lambda \le \ell(z^\star) - \lambda \le f(z^\star) = \sup_{z\in\mathcal{Z}} f(z),
$$
where the inequalities follow from the definition of $z^\star$ and the non-negativity of $\lambda$.
Conversely, if $\tilde z = \hat z$, then with $z_r^\star \in \arg\max_{z'\in\mathcal{Z}}\{\ell(z') : \|z'-\hat z\| \le r\}$ we have
Proof of Proposition 6.14. Lemma 6.15 allows us to reformulate the dual prob-
lem (6.14b) in terms of the adversarial loss function ℓ𝑟 as
$$
\inf_{\lambda\in\mathbb{R}_+}\ \lambda r + \mathbb{E}_{\hat{\mathbb{P}}}\Big[ \sup_{z\in\mathcal{Z}}\ \ell_r(z) - \lambda\cdot\mathbb{1}_{z\ne\hat Z} \Big]. \tag{6.16}
$$
where the equality holds because $\mathrm{TV} = \mathrm{OT}_{c_0}$ as shown in Proposition 2.24. Since
$\sup_{z\in\mathcal{Z}}\ell_r(z) = \sup_{z\in\mathcal{Z}}\ell(z) = \overline{\ell}$, Proposition 6.13 readily implies that the supremum
of the resulting maximization problem over a total variation ball is given by
$$
(1-\beta_r)\cdot\overline{\ell} + \beta_r\cdot\beta_r\text{-CVaR}_{\hat{\mathbb{P}}}\big[\ell_r(\hat Z)\big],
$$
which exists thanks to (Rockafellar and Wets 2009, Corollary 14.6 and The-
orem 14.37), and define P̂ 𝜓 = P̂ ◦ 𝜓 −1 as the pushforward distribution of P̂ under 𝜓.
Next, we construct a primal maximizer under the assumption that $\hat{\mathbb{P}}_\psi(\ell(Z) < \overline{\ell}) > r$.
To this end, we partition $\mathcal{Z}$ into the following four subsets.
$$
\begin{aligned}
\mathcal{Z}_1 &= \big\{ z\in\mathcal{Z} : \beta_r\text{-VaR}_{\hat{\mathbb{P}}_\psi}[\ell(\hat Z)] > \ell(z) \big\} \\
\mathcal{Z}_2 &= \big\{ z\in\mathcal{Z} : \overline{\ell} > \ell(z) = \beta_r\text{-VaR}_{\hat{\mathbb{P}}_\psi}[\ell(\hat Z)] \big\} \\
\mathcal{Z}_3 &= \big\{ z\in\mathcal{Z} : \overline{\ell} > \ell(z) > \beta_r\text{-VaR}_{\hat{\mathbb{P}}_\psi}[\ell(\hat Z)] \big\} \\
\mathcal{Z}_4 &= \big\{ z\in\mathcal{Z} : \overline{\ell} = \ell(z) \big\}
\end{aligned}
$$
where the two equalities follow again from the proof of Proposition 6.13 and from
the measure-theoretic change of variables formula, respectively. As ℓ(𝜓(ˆ𝑧)) = ℓ𝑟 (ˆ𝑧)
for every 𝑧ˆ ∈ Z, the objective function value of P★ in (6.14a) matches the optimal
value of the dual problem (6.14b). Weak duality as established in Theorem 4.18
thus implies that $\mathbb{P}^\star$ solves the primal problem (6.14a). If $\hat{\mathbb{P}}_\psi(\ell(Z) < \overline{\ell}) \le r$, the
construction of a primal maximizer is simpler and thus omitted for brevity.
The results of this section were first obtained by Bennouna and Van Parys (2023)
under the assumption that the nominal distribution P̂ is discrete.
which maximizes the expected value of ℓ(𝑍) over an ∞-Wasserstein ball of ra-
dius 𝑟 ∈ R+ around P̂ ∈ P(Z). We assume here that the ∞-Wasserstein distance
is induced by a given norm k · k on R𝑑 . Recall from Proposition 2.27 that the
∞-Wasserstein ambiguity set coincides with the optimal transport ambiguity set
$$
\mathcal{P} = \big\{ \mathbb{P}\in\mathcal{P}(\mathcal{Z}) : \mathrm{OT}_{c_r}(\mathbb{P}, \hat{\mathbb{P}}) \le 0 \big\},
$$
where the transportation cost function $c_r$ is defined through $c_r(z,\hat z) = \mathbb{1}_{\|z-\hat z\|>r}$.
We emphasize that, while the radius of the ∞-Wasserstein ball under considera-
tion is 𝑟, the radius of the corresponding optimal transport ambiguity set P is 0.
Theorem 4.18 thus implies that the problem dual to (6.17a) is given by
$$
\inf_{\lambda\in\mathbb{R}_+}\ \mathbb{E}_{\hat{\mathbb{P}}}\Big[ \sup_{z\in\mathcal{Z}}\ \ell(z) - \lambda c_r(z, \hat Z) \Big] \tag{6.17b}
$$
Also, it is uniformly bounded above by sup 𝑧 ∈Z ℓ(𝑧), which is a finite constant thanks
to the compactness of Z and the upper semicontinuity of ℓ. By the monotone
convergence theorem, the optimal value of the dual problem (6.17b) thus satisfies
$$
\inf_{\lambda\in\mathbb{R}_+} \mathbb{E}_{\hat{\mathbb{P}}}\Big[ \sup_{z\in\mathcal{Z}}\ \ell(z) - \lambda c_r(z,\hat Z) \Big]
= \mathbb{E}_{\hat{\mathbb{P}}}\Big[ \inf_{\lambda\in\mathbb{R}_+}\ \sup_{z\in\mathcal{Z}}\ \ell(z) - \lambda c_r(z,\hat Z) \Big]
= \mathbb{E}_{\hat{\mathbb{P}}}\big[ \ell_r(\hat Z) \big],
$$
where the second equality holds because Z is compact. Weak duality as established
in Theorem 4.18 thus implies that P★ solves the primal problem (6.17a).
Proposition 6.16 shows that the worst-case expectation of the original loss $\ell(Z)$
with respect to an ∞-Wasserstein ball coincides with the crisp expectation of the
adversarial loss $\ell_r(\hat Z)$ with respect to the nominal distribution $\hat{\mathbb{P}}$. This result was
first discovered by Gao et al. (2017) for discrete nominal distributions and later
extended by Gao et al. (2024) to general nominal distributions. The loss function $\ell_r$
is routinely used in machine learning for the adversarial training of neural net-
works (Szegedy, Zaremba, Sutskever, Bruna, Erhan, Goodfellow and Fergus 2014,
Goodfellow et al. 2015). Proposition 6.16 thus reveals an intimate connection
between adversarial training and distributionally robust optimization with respect
to an ∞-Wasserstein ambiguity set. This connection has been further explored in
the context of adversarial classification by García Trillos and García Trillos (2022),
García Trillos and Murray (2022), García Trillos and Jacobs (2023), Bungert et al.
(2023, 2024), Pydi and Jog (2024), Frank and Niles-Weed (2024a) and Frank and
Niles-Weed (2024b).
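The adversarial loss $\ell_r$ is straightforward to compute for one-dimensional toy problems, which makes Proposition 6.16 easy to illustrate. The sketch below is ours: it evaluates $\mathbb{E}_{\hat{\mathbb{P}}}[\ell_r(\hat Z)]$ for a discrete nominal distribution, and for the chosen loss $\ell(z) = |z|$ with atoms bounded away from zero the adversarial perturbation simply pushes each atom away from the origin by $r$.

```python
import numpy as np

def adversarial_loss(loss, z_hat, r, grid=10001):
    # l_r(z_hat) = sup over the ball |z - z_hat| <= r, computed by a 1-d grid search.
    zs = np.linspace(z_hat - r, z_hat + r, grid)
    return float(loss(zs).max())

loss = lambda z: np.abs(z)                  # a 1-Lipschitz loss, chosen for illustration
atoms = np.array([-1.0, 0.5, 2.0])          # atoms of a discrete nominal distribution
probs = np.array([0.2, 0.5, 0.3])
r = 0.25

# Worst-case expectation over the infinity-Wasserstein ball of radius r equals the
# nominal expectation of the adversarial loss l_r (Proposition 6.16).
wc = sum(p * adversarial_loss(loss, z, r) for z, p in zip(atoms, probs))
# For loss(z) = |z| and atoms with |z| > r, l_r just shifts each atom away from 0 by r.
expected = float(probs @ (np.abs(atoms) + r))
```

Each atom's adversarial value is attained at an endpoint of its ball, so the grid search is exact here and both quantities equal $1.3$.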
which maximizes the expected value of ℓ(𝑍) over a 1-Wasserstein ball of radius 𝑟 ∈
R+ around P̂ ∈ P(Z). We assume here that the 1-Wasserstein distance is induced by
a given norm $\|\cdot\|$ on $\mathbb{R}^d$. Thus, the 1-Wasserstein ambiguity set coincides with the
optimal transport ambiguity set $\mathcal{P} = \{\mathbb{P}\in\mathcal{P}(\mathcal{Z}) : \mathrm{OT}_c(\mathbb{P},\hat{\mathbb{P}}) \le r\}$ corresponding to
the transportation cost function $c$ defined through $c(z,\hat z) = \|z-\hat z\|$. Theorem 4.18
thus implies that the problem dual to (6.18a) is given by
$$
\inf_{\lambda\ge 0}\ \lambda r + \mathbb{E}_{\hat{\mathbb{P}}}\Big[ \sup_{z\in\mathcal{Z}}\ \ell(z) - \lambda\|z-\hat Z\| \Big] \tag{6.18b}
$$
Lipschitz continuous, then the optimal values of (6.18a) and (6.18b) are equal to
$$
\mathbb{E}_{\hat{\mathbb{P}}}\big[\ell(\hat Z)\big] + r\,\mathrm{lip}(\ell).
$$
Under the conditions of Proposition 6.17, the supremum of the primal prob-
lem (6.18a) is usually not attained. The proof constructs a sequence of distributions
that attain the supremum asymptotically. These distributions move an increasingly
small portion of P̂ increasingly far along the direction of steepest increase of ℓ.
Intuitively, the amount of probability mass transported over a distance Δ must decay
as O(𝑟/Δ) as Δ grows. The dual problem (6.18b) is solved by 𝜆★ = lip(ℓ).
Proof of Proposition 6.17. As the convex function ℓ is Lipschitz continuous, it is
in particular proper and closed. By the Fenchel-Moreau theorem (Lemma 4.2) ℓ
thus admits the dual representation
$$
\ell(z) = \sup_{y\in\mathrm{dom}(\ell^*)} z^\top y - \ell^*(y),
$$
The maximum in the last expression is attained by some 𝑦★ ∈ R𝑑 because lip(ℓ) < ∞
by assumption. Next, define 𝑧★ as any optimal solution of max k 𝑧 k ≤1 (𝑦★)⊤ 𝑧. By
construction, we thus have $(y^\star)^\top z^\star = \|y^\star\|_*$. We also introduce a sequence $\{y_i\}_{i\in\mathbb{N}}$
in $\mathrm{dom}(\ell^*)$ that converges to $y^\star$, and we set $q_i = i^{-1}(1+|\ell^*(y_i)|)^{-1}$ for every $i\in\mathbb{N}$.
In addition, we define $f_i:\mathbb{R}^d\to\mathbb{R}^d$ through $f_i(z) = z + r z^\star/q_i$ for any $i\in\mathbb{N}$.
Thus, 𝑓𝑖 represents the translation that shifts each point in R𝑑 along the direction 𝑧★
by a distance equal to 𝑟/𝑞 𝑖 . We further define
$$
\mathbb{P}_i = (1-q_i)\,\hat{\mathbb{P}} + q_i\,\hat{\mathbb{P}}\circ f_i^{-1},
$$
where $\hat{\mathbb{P}}\circ f_i^{-1}$ stands for the pushforward distribution of $\hat{\mathbb{P}}$ under $f_i$. Intuitively, $\mathbb{P}_i$
is obtained by decomposing P̂ into two parts (1 − 𝑞 𝑖 )P̂ and 𝑞 𝑖 P̂ and then translating
the second part by 𝑟𝑧★/𝑞 𝑖 . By construction, we thus have OT𝑐 (P𝑖 , P̂) ≤ 𝑟 and
$$
\begin{aligned}
\mathbb{E}_{\mathbb{P}_i}[\ell(Z)] &= (1-q_i)\,\mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + q_i\,\mathbb{E}_{\hat{\mathbb{P}}}\big[\ell(Z + r z^\star/q_i)\big] \\
&\ge (1-q_i)\,\mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + q_i\,\mathbb{E}_{\hat{\mathbb{P}}}\big[(y_i)^\top(Z + r z^\star/q_i) - \ell^*(y_i)\big].
\end{aligned}
$$
Here, the inequality follows from the representation of ℓ in terms of its conjugate ℓ ∗ .
for all 𝑧ˆ ∈ Z, where the second inequality follows from the Lipschitz continuity
of ℓ, and the equality holds thanks to the definition of 𝜆★. Thus, the objective
function value of 𝜆★ in the dual problem (6.18b) is given by
$$
\lambda^\star r + \mathbb{E}_{\hat{\mathbb{P}}}\Big[ \sup_{z\in\mathcal{Z}}\ \ell(z) - \lambda^\star\|z-\hat Z\| \Big] = \mathbb{E}_{\hat{\mathbb{P}}}\big[\ell(\hat Z)\big] + r\,\mathrm{lip}(\ell).
$$
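The asymptotically optimal sequence $\mathbb{P}_i$ from this proof can be simulated directly. The sketch below is ours: it uses $\ell(z) = \|z\|_1$ on $\mathbb{R}^2$, which is convex and Lipschitz continuous with modulus $\sqrt{2}$ with respect to the 2-norm, and shows that the objective values of the perturbed distributions approach $\mathbb{E}_{\hat{\mathbb{P}}}[\ell(\hat Z)] + r\,\mathrm{lip}(\ell)$ from below as the shifted mass shrinks.

```python
import numpy as np

# Convex loss l(z) = ||z||_1 on R^2; it is Lipschitz continuous with modulus
# sqrt(2) with respect to the Euclidean norm, attained along z_star below.
ell = lambda z: np.abs(z).sum(axis=-1)
lip = np.sqrt(2.0)
z_star = np.array([1.0, 1.0]) / np.sqrt(2.0)

rng = np.random.default_rng(1)
sample = rng.standard_normal((500, 2))   # atoms of a discrete nominal distribution
r = 0.3
nominal = ell(sample).mean()

values = []
for q in [0.1, 0.01, 0.001]:
    # P_q keeps mass 1-q on the nominal atoms and shifts mass q by r*z_star/q,
    # so the total transport cost is q * (r/q) = r and P_q lies in the ball.
    shifted = sample + (r / q) * z_star
    values.append((1.0 - q) * nominal + q * ell(shifted).mean())

bound = nominal + r * lip                # worst-case value from Proposition 6.17
```

As $q$ shrinks, a vanishing fraction of the mass is transported increasingly far, exactly as described above, and the gap to the bound decays like $q$.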
where the first inequality holds because P may adapt to 𝛽 when the supremum is
evaluated inside the integral, and the second inequality follows from the standard
max-min inequality. The first equality follows from the results on worst-case
expectations over 1-Wasserstein balls in Section 6.13.
To derive the converse inequality, we assume first that 𝜎({1}) = 0. The general
case will be addressed later. Note that 𝜇 = inf P∈P E P [ℓ(𝑍)] is finite because ℓ is
Lipschitz continuous and because E P̂ [k𝑍 k] < ∞, which implies via the proof of
Theorem 3.19 that all distributions in P have uniformly bounded first moment. We
may assume without loss of generality that 𝜇 ≥ 0. Otherwise, we may replace ℓ(𝑧)
with ℓ(𝑧)− 𝜇, which simply increases the worst-case risk by −𝜇 because any spectral
risk measure is translation invariant. The assumption that 𝜇 ≥ 0 then implies that
𝛽-CVaRP [ℓ(𝑍)] ≥ E P [ℓ(𝑍)] ≥ 0 ∀𝛽 ∈ [0, 1], ∀P ∈ P.
Thus, we have
$$
\sup_{\mathbb{P}\in\mathcal{P}} \varrho_{\mathbb{P}}[\ell(Z)]
= \sup_{\mathbb{P}\in\mathcal{P}}\ \sup_{\delta>0} \int_\delta^{1-\delta} \beta\text{-CVaR}_{\mathbb{P}}[\ell(Z)]\,\mathrm{d}\sigma(\beta)
= \sup_{\delta>0}\ \sup_{\mathbb{P}\in\mathcal{P}} \int_\delta^{1-\delta} \beta\text{-CVaR}_{\mathbb{P}}[\ell(Z)]\,\mathrm{d}\sigma(\beta),
$$
where the first equality follows from the monotone convergence theorem and the
assumption that 𝜎({0}) = 𝜎({1}) = 0. Hence, for any 𝜀 > 0 there is 𝛿 > 0 with
$$
\sup_{\mathbb{P}\in\mathcal{P}} \varrho_{\mathbb{P}}[\ell(Z)] - \sup_{\mathbb{P}\in\mathcal{P}} \int_\delta^{1-\delta} \beta\text{-CVaR}_{\mathbb{P}}[\ell(Z)]\,\mathrm{d}\sigma(\beta) \le \varepsilon \tag{6.20a}
$$
and
$$
\int_0^1 \beta^{-1}\,\mathrm{d}\sigma(\beta) - \int_\delta^{1-\delta} \beta^{-1}\,\mathrm{d}\sigma(\beta) \le \varepsilon. \tag{6.20b}
$$
Recall now from Theorem 3.19 that $\mathcal{P}$ is weakly compact and thus tight. Hence,
there exists a compact set $\mathcal{C}\subseteq\mathbb{R}^d$ with $\mathbb{P}(Z\notin\mathcal{C})\le\delta/2$ for every $\mathbb{P}\in\mathcal{P}$. As $\mathcal{C}$
is compact, $\underline\tau = \min_{z\in\mathcal{C}}\ell(z)$ and $\overline\tau = \max_{z\in\mathcal{C}}\ell(z)$ are both finite. Using the trivial
bounds $\mathbb{P}(\ell(Z)\ge\underline\tau)\ge\mathbb{P}(Z\in\mathcal{C})$ and $\mathbb{P}(\ell(Z)\le\overline\tau)\ge\mathbb{P}(Z\in\mathcal{C})$ and noting that
$\mathbb{P}(Z\in\mathcal{C})\ge 1-\delta/2$ for every $\mathbb{P}\in\mathcal{P}$, one can then readily show that
for all $\beta\in[\delta,1-\delta]$ and for all $\mathbb{P}\in\mathcal{P}$. Next, define $y_i\in\mathrm{dom}(\ell^*)$, $q_i\in[0,1]$, the
function $f_i:\mathbb{R}^d\to\mathbb{R}^d$ and the distribution $\mathbb{P}_i = (1-q_i)\,\hat{\mathbb{P}} + q_i\,\hat{\mathbb{P}}\circ f_i^{-1}$ for $i\in\mathbb{N}$
as in Section 6.13. We then obtain
$$
\begin{aligned}
\sup_{\mathbb{P}\in\mathcal{P}} \varrho_{\mathbb{P}}[\ell(Z)]
&\ge \int_\delta^{1-\delta} \inf_{\tau\in\mathbb{R}}\ \tau + \frac{1}{\beta}\,\mathbb{E}_{\mathbb{P}_i}\big[\max\{\ell(Z)-\tau,\,0\}\big]\,\mathrm{d}\sigma(\beta) \\
&= \int_\delta^{1-\delta} \inf_{\tau\in[\underline\tau,\overline\tau]}\ \tau + \frac{1-q_i}{\beta}\,\mathbb{E}_{\hat{\mathbb{P}}}\big[\max\{\ell(Z)-\tau,\,0\}\big] \\
&\qquad\qquad\qquad + \frac{q_i}{\beta}\,\mathbb{E}_{\hat{\mathbb{P}}}\big[\max\{\ell(Z + r z^\star/q_i)-\tau,\,0\}\big]\,\mathrm{d}\sigma(\beta).
\end{aligned} \tag{6.21}
$$
The inequality in (6.21) holds because 𝛽-CVaRP [ℓ(𝑍)] ≥ 0 for all 𝛽 ∈ [0, 1] by
assumption and because P𝑖 ∈ P as shown in Section 6.13. The equality follows from
the definition of $\mathbb{P}_i$ and from (Rockafellar and Uryasev 2002, Theorem 10), which
ensures that the minimization problem over $\tau$ is solved by $\beta\text{-VaR}_{\mathbb{P}_i}[\ell(Z)] \in [\underline\tau, \overline\tau]$.
As ℓ is proper, convex and lower semicontinuous, and as 𝑦 𝑖 belongs to the domain
of ℓ ∗ , the Fenchel-Moreau theorem further implies that
Substituting this estimate into (6.21) and letting 𝑖 tend to infinity yields
$$
\begin{aligned}
\sup_{\mathbb{P}\in\mathcal{P}} \varrho_{\mathbb{P}}[\ell(Z)]
&\ge \lim_{i\to\infty} \int_\delta^{1-\delta} \inf_{\tau\in[\underline\tau,\overline\tau]}\ \tau + \frac{1-q_i}{\beta}\,\mathbb{E}_{\hat{\mathbb{P}}}\big[\max\{\ell(Z)-\tau,\,0\}\big]\,\mathrm{d}\sigma(\beta)
+ r\,\mathrm{lip}(\ell) \int_\delta^{1-\delta} \beta^{-1}\,\mathrm{d}\sigma(\beta) \\
&= \int_\delta^{1-\delta} \beta\text{-CVaR}_{\hat{\mathbb{P}}}[\ell(Z)]\,\mathrm{d}\sigma(\beta) + r\,\mathrm{lip}(\ell) \int_\delta^{1-\delta} \beta^{-1}\,\mathrm{d}\sigma(\beta),
\end{aligned}
$$
where we have used that $q_i$ as well as $q_i\,\ell^*(y_i)$ converge to 0 and that $y_i^\top z^\star$ converges
to $(y^\star)^\top z^\star = \mathrm{lip}(\ell)$ as $i$ tends to infinity; see also Section 6.13. The equality
follows from the monotone convergence theorem, which applies because 𝑞 𝑖 is
monotonically decreasing with 𝑖. Letting 𝜀 tend to 0 thus implies via (6.20) that
$$
\sup_{\mathbb{P}\in\mathcal{P}} \varrho_{\mathbb{P}}[\ell(Z)] \ge \int_0^1 \beta\text{-CVaR}_{\hat{\mathbb{P}}}[\ell(Z)]\,\mathrm{d}\sigma(\beta) + r\,\mathrm{lip}(\ell) \int_0^1 \beta^{-1}\,\mathrm{d}\sigma(\beta).
$$
This lower bound matches the upper bound derived in the first part of the proof, and
thus the claim follows, provided that 𝜎({1}) = 0. If the probability distribution 𝜎
has an atom at 1, then it can be decomposed as $\sigma = \hat\sigma + \sigma(\{1\})\cdot\delta_1$, where $\hat\sigma$ is a
non-negative measure on $(0,1)$. We can thus decompose the risk under $\mathbb{P}$ as
$$
\varrho_{\mathbb{P}}[\ell(Z)] = \int_0^1 \beta\text{-CVaR}_{\mathbb{P}}[\ell(Z)]\,\mathrm{d}\hat\sigma(\beta) + \sigma(\{1\})\cdot\mathbb{E}_{\mathbb{P}}[\ell(Z)].
$$
The first term in this decomposition can then be handled as above, and the second
term can be handled as in Section 6.13. Details are omitted for brevity.
mutually dual norms on $\mathbb{R}^d$ and that $p, q \in (1,\infty)$ are conjugate exponents with
$\frac{1}{p} + \frac{1}{q} = 1$. Define $\varphi(q) = (q-1)^{q-1}/q^q$. Then, the following statements hold.
(i) If $f(z) = \frac{1}{p}\|z\|^p$, then $f^*(y) = \frac{1}{q}\|y\|_*^q$.
(ii) If $g(z) = \|z-\hat z\|^p$, then $g^*(y) = y^\top\hat z + \varphi(q)\,\|y\|_*^q$.
Proof. As for assertion (i), fix any $z, y \in \mathbb{R}^d$. We then have
$$
z^\top y - \frac{1}{p}\|z\|^p \le \|z\|\,\|y\|_* - \frac{1}{p}\|z\|^p \le \max_{t\ge 0}\ t\,\|y\|_* - \frac{1}{p}t^p = \frac{1}{q}\|y\|_*^q,
$$
where the first inequality follows from the construction of the dual norm, and the
second inequality is obtained by maximizing over $t = \|z\|$. The equality holds
because the maximization problem is solved by $\tau = \|y\|_*^{1/(p-1)}$. Both inequalities
collapse to equalities if $z \in \arg\max_{\|z\|=\tau} z^\top y$. This allows us to conclude that
$$
f^*(y) = \sup_{z\in\mathbb{R}^d}\ z^\top y - \frac{1}{p}\|z\|^p = \frac{1}{q}\|y\|_*^q.
$$
As for assertion (ii), note that
$$
g^*(y) = \sup_{z\in\mathbb{R}^d}\ y^\top z - \|z-\hat z\|^p = y^\top\hat z + p\cdot\sup_{z\in\mathbb{R}^d}\ (y/p)^\top z - \frac{1}{p}\|z\|^p
= y^\top\hat z + \frac{p}{q}\,\|y/p\|_*^q = y^\top\hat z + \varphi(q)\,\|y\|_*^q,
$$
where the last two equalities exploit assertion (i) and the definition of $\varphi(q)$.
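Both assertions of Lemma 6.19 can be spot-checked numerically for the Euclidean norm, whose dual norm is the norm itself. The sketch below is ours: it evaluates the conjugate of $f$ by a one-dimensional search along the direction of $y$, which is optimal by the alignment argument in the proof, and checks the two equivalent expressions for $\varphi(q)$.

```python
import numpy as np

def conj_numeric(y, p, grid=200001, tmax=50.0):
    # f*(y) for f(z) = ||z||^p / p with the Euclidean norm: the optimal z is
    # aligned with y, so it suffices to search over z = t * y/||y|| with t >= 0.
    ny = np.linalg.norm(y)
    t = np.linspace(0.0, tmax, grid)
    return float((t * ny - t ** p / p).max())

p = 3.0
q = p / (p - 1.0)                           # conjugate exponent, 1/p + 1/q = 1
phi = (q - 1.0) ** (q - 1.0) / q ** q       # varphi(q) from Lemma 6.19 (ii)

y = np.array([1.2, -0.7, 2.0])
numeric = conj_numeric(y, p)
closed_form = np.linalg.norm(y) ** q / q    # Lemma 6.19 (i)
```

The same definition of $\varphi(q)$ also equals $p^{1-q}/q$, the constant produced by the proof of assertion (ii).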
We now show that the worst-case CVaR of a linear loss function ℓ(𝑧) = 𝜃 ⊤ 𝑧 over
a 𝑝-Wasserstein ball of radius 𝑟 around P̂ equals the sum of the nominal CVaR
under P̂ and a regularization term that scales with the norm of 𝜃 and with 𝑟.
Proposition 6.20 ($p$-Wasserstein Risk). Assume that $\hat{\mathbb{P}} \in \mathcal{P}(\mathbb{R}^d)$ with $\mathbb{E}_{\hat{\mathbb{P}}}[\|Z\|^p] < \infty$
for some $p \in (1,\infty)$ and for some norm $\|\cdot\|$ on $\mathbb{R}^d$. Define $\mathcal{P} = \{\mathbb{P}\in\mathcal{P}(\mathbb{R}^d) : \mathrm{W}_p(\mathbb{P},\hat{\mathbb{P}}) \le r\}$,
where $r \ge 0$ and $\mathrm{W}_p$ is the $p$-Wasserstein distance with
transportation cost function $c(z,\hat z) = \|z-\hat z\|^p$. If $\theta\in\mathbb{R}^d$ and $\beta\in(0,1)$, then
$$
\sup_{\mathbb{P}\in\mathcal{P}} \beta\text{-CVaR}_{\mathbb{P}}[\theta^\top Z] = \beta\text{-CVaR}_{\hat{\mathbb{P}}}[\theta^\top Z] + r\,\beta^{-1/p}\,\|\theta\|_*.
$$
Proof. By the definition of the CVaR by Rockafellar and Uryasev (2000), we have
$$
\sup_{\mathbb{P}\in\mathcal{P}} \beta\text{-CVaR}_{\mathbb{P}}[\theta^\top Z] \le \inf_{\tau\in\mathbb{R}}\ \tau + \frac{1}{\beta}\sup_{\mathbb{P}\in\mathcal{P}} \mathbb{E}_{\mathbb{P}}\big[\max\{\theta^\top Z - \tau,\,0\}\big], \tag{6.22}
$$
where the inequality is obtained by interchanging the supremum over P and the
infimum over 𝜏. The underlying worst-case expectation problem satisfies
$$
\sup_{\mathbb{P}\in\mathcal{P}} \mathbb{E}_{\mathbb{P}}\big[\max\{\theta^\top Z - \tau,\,0\}\big]
$$
" #
≤ inf 𝜆𝑟 + E P̂ sup max 𝜃 𝑧 − 𝜏, 0 − 𝜆k𝑧 − 𝑍ˆ k
𝑝 ⊤ 𝑝
𝜆≥0 𝑧 ∈R𝑑
" ( )#
= inf 𝜆𝑟 𝑝 + E P̂ max sup 𝜃 ⊤ 𝑧 − 𝜏 − 𝜆k𝑧 − 𝑍ˆ k 𝑝 , sup −𝜆k𝑧 − 𝑍ˆ k 𝑝
𝜆≥0 𝑧 ∈R𝑑 𝑧 ∈R𝑑
= inf 𝜆𝑟 + E P̂ max 𝜃 𝑍ˆ − 𝜏 + 𝜑(𝑞)𝜆
𝑝 ⊤ 𝑞
k𝜃/𝜆k ∗ ,0
𝜆≥0
where the inequality exploits weak duality, and the first equality is obtained by
interchanging the order of the two maximization operations. The second equality
follows from Lemma 6.19(ii). Substituting the resulting formula into (6.22) and
interchanging the infimum over 𝜏 with the infimum over 𝜆 then yields
$$
\begin{aligned}
\sup_{\mathbb{P}\in\mathcal{P}} \beta\text{-CVaR}_{\mathbb{P}}[\theta^\top Z]
&\le \inf_{\lambda\ge 0}\ \frac{\lambda r^p}{\beta} + \inf_{\tau\in\mathbb{R}}\ \tau + \frac{1}{\beta}\,\mathbb{E}_{\hat{\mathbb{P}}}\Big[ \max\big\{ \theta^\top\hat Z - \tau + \varphi(q)\,\lambda\,\|\theta/\lambda\|_*^q,\ 0 \big\} \Big] \\
&= \inf_{\lambda\ge 0}\ \frac{\lambda r^p}{\beta} + \beta\text{-CVaR}_{\hat{\mathbb{P}}}\big[ \theta^\top\hat Z + \varphi(q)\,\lambda\,\|\theta/\lambda\|_*^q \big] \\
&= \beta\text{-CVaR}_{\hat{\mathbb{P}}}\big[ \theta^\top\hat Z \big] + \inf_{\lambda\ge 0}\ \frac{\lambda r^p}{\beta} + \varphi(q)\,\lambda\,\|\theta/\lambda\|_*^q,
\end{aligned}
$$
where the equalities follow from the definition and the translation invariance of the
CVaR, respectively. Solving the minimization problem over 𝜆 analytically yields
$$
\sup_{\mathbb{P}\in\mathcal{P}} \beta\text{-CVaR}_{\mathbb{P}}[\theta^\top Z] \le \beta\text{-CVaR}_{\hat{\mathbb{P}}}\big[\theta^\top\hat Z\big] + r\,\beta^{-1/p}\,\|\theta\|_*.
$$
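The final minimization over $\lambda$ admits the same kind of numerical spot check. The sketch below is ours: it verifies for $p = 2$ that $\inf_{\lambda\ge 0}\ \lambda r^p/\beta + \varphi(q)\,\lambda^{1-q}\|\theta\|_*^q$ equals the regularization term $r\,\beta^{-1/p}\,\|\theta\|_*$.

```python
import math

def wc_cvar_regularizer(r, beta, theta_dual_norm, p, grid=400000, lam_max=40.0):
    """Numerically minimize lam * r^p / beta + phi(q) * lam^(1-q) * ||theta||_*^q."""
    q = p / (p - 1.0)
    phi = (q - 1.0) ** (q - 1.0) / q ** q
    best = float("inf")
    for i in range(1, grid + 1):
        lam = lam_max * i / grid
        best = min(best, lam * r ** p / beta
                         + phi * lam ** (1.0 - q) * theta_dual_norm ** q)
    return best

# Illustrative numbers: radius, CVaR level, dual norm of theta, Wasserstein exponent.
r, beta, theta_norm, p = 0.5, 0.2, 2.0, 2.0
numeric = wc_cvar_regularizer(r, beta, theta_norm, p)
closed_form = r * beta ** (-1.0 / p) * theta_norm   # regularizer in Proposition 6.20
```

For $p = 2$ the objective is $\lambda r^2/\beta + \|\theta\|_*^2/(4\lambda)$, whose minimum over $\lambda \ge 0$ is exactly $r\,\beta^{-1/2}\,\|\theta\|_*$.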
where the first equality follows from our convention that 0 · ∞ = ∞, which implies
that 0 𝑓 (𝑧) = 𝛿dom( 𝑓 ) (𝑧). The second equality follows from the definition of the
support function, and the third equality holds because 𝑓 is convex and closed,
which implies via Lemma 4.2 that 𝑓 = 𝑓 ∗∗ . Finally, the fourth equality follows
from (Rockafellar 1970, Theorem 13.3), and the last equality exploits the definition
of the perspective function for 𝛼 = 0. This completes the proof of assertion (i).
As for assertion (ii), assume first that $\alpha > 0$, and note that
$$
g^*(y) = \sup_{z\in\mathbb{R}^d}\ y^\top z - f^\pi(z,\alpha) = \alpha \sup_{z\in\mathbb{R}^d}\ y^\top(z/\alpha) - f(z/\alpha) = \alpha f^*(y) = \mathrm{cl}(\alpha f^*)(y).
$$
Here, the first equality exploits the definition of the perspective. The second and
the third equalities follow from (Rockafellar 1970, Theorem 13.3) and (Rockafellar
1970, Theorem 13.2), respectively. The last equality, finally, holds because 0 𝑓 ∗ =
𝛿dom( 𝑓 ∗ ) by our conventions of extended arithmetic. This proves assertion (ii).
The following lemma derives a formula for the conjugate of a sum of functions.
If there exists 𝑧¯ ∈ ∩𝑘 ∈ [𝐾 ] rint(dom( 𝑓 𝑘 )), then the inequality in the above expression
reduces to an equality, and the minimum is attained for every 𝑦 ∈ R𝑑 .
The infimum on the right hand side of (7.1) defines a function of 𝑦. This function
is called the infimal convolution of the functions 𝑓 𝑘∗ , 𝑘 ∈ [𝐾]. Thus, Lemma 7.2
asserts that, under a mild Slater-type condition, the conjugate of a sum of functions
coincides with the infimal convolution of the conjugates of these functions.
Proof of Lemma 7.2. By using a standard variable splitting trick and the max-min
inequality, one can show that the conjugate of 𝑓 admits the following upper bound.
$$
\begin{aligned}
f^*(y) &= \sup_{z, z_1,\dots,z_K\in\mathbb{R}^d}\ \Big\{ y^\top z - \sum_{k\in[K]} f_k(z_k) \;:\; z_k = z\ \ \forall k\in[K] \Big\} \\
&= \sup_{z, z_1,\dots,z_K\in\mathbb{R}^d}\ \inf_{y_1,\dots,y_K\in\mathbb{R}^d}\ y^\top z - \sum_{k\in[K]} \Big( f_k(z_k) + y_k^\top(z - z_k) \Big) \\
&\le \inf_{y_1,\dots,y_K\in\mathbb{R}^d}\ \sup_{z, z_1,\dots,z_K\in\mathbb{R}^d}\ y^\top z - \sum_{k\in[K]} \Big( f_k(z_k) + y_k^\top(z - z_k) \Big) \\
&= \inf_{y_1,\dots,y_K\in\mathbb{R}^d}\ \sup_{z\in\mathbb{R}^d}\Big\{ y^\top z - \sum_{k\in[K]} y_k^\top z \Big\} + \sum_{k\in[K]}\ \sup_{z_k\in\mathbb{R}^d}\Big\{ y_k^\top z_k - f_k(z_k) \Big\}
\end{aligned}
$$
The supremum over $z$ in the resulting expression evaluates to 0 if $\sum_{k\in[K]} y_k = y$
and to $\infty$ otherwise. In addition, the supremum over $z_k$ evaluates to $f_k^*(y_k)$ for
every $k\in[K]$. Substituting these analytical formulas into the last expression yields
$$
f^*(y) \le \inf_{y_1,\dots,y_K\in\mathbb{R}^d}\ \Big\{ \sum_{k\in[K]} f_k^*(y_k) \;:\; \sum_{k\in[K]} y_k = y \Big\}.
$$
The resulting lower bound involves the conjugate of a sum of several functions. By
Lemma 7.2, the conjugate of this sum is bounded below by the infimal convolution
of the conjugates of all functions in the sum. Consequently, we obtain
$$
\inf\,(\mathrm{P}) \ \ge \sup_{\substack{\alpha_1,\dots,\alpha_K\in\mathbb{R}_+ \\ \beta_0,\dots,\beta_K\in\mathbb{R}^d}}\ \Big\{ -f^*(\beta_0) - \sum_{k=1}^K (\alpha_k g_k)^*(\beta_k) \;:\; \sum_{k=0}^K \beta_k = 0 \Big\}. \tag{7.2}
$$
$$
z^\top Q_1 z + 2 q_1^\top z + r_1 \ge 0 \quad \forall z\in\mathbb{R}^d : z^\top Q_0 z + 2 q_0^\top z + r_0 \le 0
$$
holds if and only if there exists $\alpha\in\mathbb{R}_+$ with
$$
\begin{bmatrix} Q_1 + \alpha Q_0 & q_1 + \alpha q_0 \\ q_1^\top + \alpha q_0^\top & r_1 + \alpha r_0 \end{bmatrix} \succeq 0.
$$
set (2.4) with uncertain moments, the worst-case distributions constitute mixtures
of distributions with first and second moments that are determined by the optimal
solution of the finite bi-dual problem. For 𝜙-divergence ambiguity sets centered at
a discrete distribution P̂, Section 7.3 will show that the worst-case distributions are
supported on the atoms of P̂ and (if 𝜙 grows at most linearly) on arg max 𝑧 ∈Z ℓ(𝑧)
with probability weights determined by the optimal solution to the finite bi-dual
problem. Similarly, for the optimal transport ambiguity set (2.27) centered at a
discrete distribution P̂, Section 7.4 will show that the worst-case distributions con-
stitute mixtures of discrete distributions, with the locations and probability weights
of their atoms determined by the optimal solution to the finite bi-dual problem.
Here, the first equivalence holds thanks to Assumption 7.8 (i), and the second
equivalence follows from Proposition 7.7, which applies because 𝑧0 ∈ rint(Z)
constitutes a Slater point thanks to Assumption 7.8 (ii). In addition, as the loss
function is quadratic, strong duality follows readily from Theorem 4.6.
Recall next that the Gelbrich ambiguity set (2.8) is defined as an instance of
the Chebyshev ambiguity set (2.4) with moment uncertainty set
$$
\mathcal{F} = \Big\{ (\mu, M) \in \mathbb{R}^d\times\mathbb{S}_+^d \;:\; \exists\,\Sigma\in\mathbb{S}_+^d \text{ with } M = \Sigma + \mu\mu^\top,\ \mathrm{G}\big((\mu,\Sigma),(\hat\mu,\hat\Sigma)\big)\le r \Big\}.
$$
Here, $(\hat\mu, \hat\Sigma)$ is a nominal mean-covariance pair, and $r \ge 0$ is a size parameter. The
next result follows directly from Theorems 4.9 and 7.9. We thus omit its proof.
Theorem 7.10 (Finite Dual Reformulation for Gelbrich Ambiguity Sets). If P is
the Chebyshev ambiguity set (2.4) with F given by (2.8) and Assumption 7.8 holds,
then the worst-case expectation problem (4.1) satisfies the weak duality relation
$$
\sup_{\mathbb{P}\in\mathcal{P}} \mathbb{E}_{\mathbb{P}}[\ell(Z)]
$$
If 𝑟 > 0, then strong duality holds, that is, the above inequality becomes an equality.
$$
\begin{aligned}
\sup_{\mathbb{P}\in\mathcal{P}} \mathbb{E}_{\mathbb{P}}[\ell(Z)]
\le\ &\sup\ \sum_{j\in[J]} \operatorname{Tr}(Q_j \Theta_j) + 2 q_j^\top \theta_j + q_j^0\, p_j \\
&\ \text{s.t.}\ \ \mu\in\mathbb{R}^d,\ M\in\mathbb{S}_+^d,\ p_j\in\mathbb{R}_+,\ \theta_j\in\mathbb{R}^d,\ \Theta_j\in\mathbb{S}_+^d \quad \forall j\in[J] \\
&\qquad\ \begin{bmatrix} \Theta_j & \theta_j \\ \theta_j^\top & p_j \end{bmatrix} \succeq 0,\ \ \operatorname{Tr}(Q_0\Theta_j) - 2 z_0^\top Q_0 \theta_j + z_0^\top Q_0 z_0\, p_j \le p_j \quad \forall j\in[J] \\
&\qquad\ \sum_{j\in[J]} p_j = 1,\ \ \mu = \sum_{j\in[J]} \theta_j,\ \ M = \sum_{j\in[J]} \Theta_j,\ \ (\mu, M)\in\mathcal{F}.
\end{aligned} \tag{7.5}
$$
If F is a convex and compact set with 𝑀 ≻ 𝜇𝜇⊤ for all (𝜇, 𝑀) ∈ rint(F), then
strong duality holds, that is, the inequality (7.5) becomes an equality.
Proof. By decomposing the Gelbrich ambiguity set into Chebyshev ambiguity sets
of the form P(𝜇, 𝑀) = {P ∈ P(Z) : E_P[𝑍] = 𝜇, E_P[𝑍𝑍^⊤] = 𝑀}, we obtain

    sup_{P∈P} E_P[ℓ(𝑍)] = sup_{(𝜇,𝑀)∈F}  sup_{P∈P(𝜇,𝑀)} E_P[ℓ(𝑍)].     (7.6)

The inner maximization problem on the right hand side of (7.6) represents a worst-
case expectation problem over an instance of the ambiguity set (2.4) with the
moment uncertainty set being the singleton {(𝜇, 𝑀)}. The support function of this
singleton is given by 𝛿 ∗{(𝜇,𝑀)} (𝜆, Λ) = 𝜆⊤ 𝜇 + Tr(Λ𝑀). Thus, Theorem 7.9 implies
that the inner supremum on the right hand side of (7.6) is bounded above by
inf 𝜆 0 + 𝜆⊤ 𝜇 + Tr(Λ𝑀)
s.t. 𝜆 0 ∈ R, 𝜆 ∈ R𝑑 , Λ ∈ S𝑑 , 𝛼 ∈ R+𝐽
" #
1
Λ − 𝑄 𝑗 + 𝛼 𝑗 𝑄0 2 𝜆 − 𝑞 𝑗 − 𝛼 𝑗 𝑄 0 𝑧 0
1
⊤ 0 ∀ 𝑗 ∈ [𝐽].
2 𝜆 − 𝑞 𝑗 − 𝛼 𝑗 𝑄 0 𝑧 0 𝜆 0 − 𝑞 0𝑗 + 𝛼 𝑗 (𝑧⊤
0 𝑄 0 𝑧 0 − 1)
Strong duality holds because the primal minimization problem admits a Slater
point. Indeed, by defining Λ = 𝜆_0 𝐼_𝑑 and setting 𝜆_0 to a large value, one can ensure
that the linear matrix inequality in the primal problem holds strictly. Replacing the
inner supremum on the right hand side of (7.6) with the above dual semidefinite
program yields the upper bound in (7.5). If F is convex and compact with 𝑀 ≻ 𝜇𝜇⊤
for all (𝜇, 𝑀) ∈ rint(F), then (7.5) becomes an equality thanks to Theorem 7.9.
Note that the bi-dual reformulation in (7.5) is solvable whenever F is compact.
Indeed, its objective function is manifestly continuous. In addition, it is easy to
verify that its feasible region is compact provided that F is compact.
Theorem 7.12 (Finite Bi-Dual Reformulation for Gelbrich Ambiguity Sets). If P is
the Chebyshev ambiguity set (2.4) with F given by (2.8) and Assumption 7.8 holds,
then the worst-case expectation problem (4.1) satisfies the weak duality relation
    sup_{P∈P} E_P[ℓ(𝑍)]

    ≤   max   Σ_{𝑗∈[𝐽]} Tr(𝑄_𝑗 Θ_𝑗) + 2𝑞_𝑗^⊤ 𝜃_𝑗 + 𝑞_𝑗^0 𝑝_𝑗
        s.t.  𝜇 ∈ R^𝑑, 𝑀, 𝑈 ∈ S_+^𝑑, 𝐶 ∈ R^{𝑑×𝑑}
              𝑝_𝑗 ∈ R_+, 𝜃_𝑗 ∈ R^𝑑, Θ_𝑗 ∈ S_+^𝑑   ∀𝑗 ∈ [𝐽]
              [𝑀 − 𝑈, 𝐶; 𝐶^⊤, Σ̂] ⪰ 0,   [𝑈, 𝜇; 𝜇^⊤, 1] ⪰ 0,   [Θ_𝑗, 𝜃_𝑗; 𝜃_𝑗^⊤, 𝑝_𝑗] ⪰ 0   ∀𝑗 ∈ [𝐽]     (7.7)
              Tr(𝑄_0 Θ_𝑗) − 2𝑧_0^⊤ 𝑄_0 𝜃_𝑗 + 𝑧_0^⊤ 𝑄_0 𝑧_0 𝑝_𝑗 ≤ 𝑝_𝑗   ∀𝑗 ∈ [𝐽]
              Σ_{𝑗∈[𝐽]} 𝑝_𝑗 = 1,   𝜇 = Σ_{𝑗∈[𝐽]} 𝜃_𝑗,   𝑀 = Σ_{𝑗∈[𝐽]} Θ_𝑗
              ‖𝜇̂‖_2^2 − 2𝜇^⊤ 𝜇̂ + Tr(𝑀 + Σ̂ − 2𝐶) ≤ 𝑟^2.

If 𝑟 > 0, then strong duality holds, that is, the above inequality becomes an equality.
The proof of Theorem 7.12 follows from Proposition 2.3 and Theorem 7.11 and is
thus omitted. We are now ready to construct extremal distributions P★ ∈ P(Z) that
attain the supremum of the worst-case expectation problem (4.1) over the Chebyshev
ambiguity set (2.4). To this end, fix any maximizer (𝜇★, 𝑀 ★, 𝑝★, 𝜃★, Θ★) of the
bi-dual problem (7.5), which exists if F is compact. Next, define the index sets
    J^∞ = { 𝑗 ∈ [𝐽] : 𝑝★_𝑗 = 0, Θ★_𝑗 ≠ 0 }   and   J^+ = { 𝑗 ∈ [𝐽] : 𝑝★_𝑗 > 0 },
The inequality in the above expression holds because P ∈ P(Z), and the equality
holds because P ∼ (𝜇, 𝑀). The following lemma by Hanasusanto et al. (2015a,
Proposition 6.1) shows the reverse implication. That is, if 𝜇 and 𝑀 satisfy the
above inequality, then there is a (discrete) distribution P ∼ (𝜇, 𝑀) supported on Z.
Lemma 7.13 (Distributions on Ellipsoids with Given Moments). If Z is the ellipsoid
from Assumption 7.8 (ii), and if Tr(𝑄_0 𝑀) − 2𝑧_0^⊤ 𝑄_0 𝜇 + 𝑧_0^⊤ 𝑄_0 𝑧_0 ≤ 1 for some
𝑀 ∈ S_+^𝑑 and 𝜇 ∈ R^𝑑 with 𝑀 ⪰ 𝜇𝜇^⊤, then there exists a discrete distribution P ∈ P(Z)
with at most 2𝑑 atoms that satisfies P ∼ (𝜇, 𝑀).
The proof of Lemma 7.13 is simple but tedious and thus omitted.
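Although the proof is omitted, the moment-matching part of the construction is easy to replicate numerically. The sketch below implements the standard 2𝑑-atom construction (an assumption of this illustration; it ignores the ellipsoid-containment part of the lemma, and the helper name is hypothetical):

```python
import numpy as np

def two_d_atom_distribution(mu, M):
    """Return 2d atoms and weights of a discrete distribution P with
    E[Z] = mu and E[Z Z^T] = M (requires M - mu mu^T to be PSD).
    Standard construction: atoms mu +/- sqrt(d) * Sigma^{1/2} e_k."""
    d = len(mu)
    Sigma = M - np.outer(mu, mu)
    # symmetric square root of the covariance via eigendecomposition
    w, V = np.linalg.eigh(Sigma)
    root = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
    atoms = np.vstack([mu + np.sqrt(d) * root[:, k] for k in range(d)]
                      + [mu - np.sqrt(d) * root[:, k] for k in range(d)])
    weights = np.full(2 * d, 1.0 / (2 * d))
    return atoms, weights

mu = np.array([1.0, -2.0])
M = np.outer(mu, mu) + np.array([[1.0, 0.3], [0.3, 2.0]])
atoms, weights = two_d_atom_distribution(mu, M)
mean = weights @ atoms
second_moment = (atoms.T * weights) @ atoms
```

Since the symmetric square root satisfies Σ_k r_k r_k^⊤ = Σ, the 2𝑑 equally weighted atoms reproduce both prescribed moments exactly.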
Theorem 7.14 (Extremal Distributions of Chebyshev Ambiguity Sets). If all con-
ditions of Theorem 7.11 for weak as well as strong duality are satisfied and
(𝜇★, 𝑀★, 𝑝★, 𝜃★, Θ★) solves (7.5), then the following hold.
Proof. As for assertion (i), the constraints of problem (7.7) imply that
    Tr(𝑄_0 Θ★_𝑗/𝑝★_𝑗) − 2𝑧_0^⊤ 𝑄_0 𝜃★_𝑗/𝑝★_𝑗 + 𝑧_0^⊤ 𝑄_0 𝑧_0 ≤ 1
and
    [Θ★_𝑗, 𝜃★_𝑗; (𝜃★_𝑗)^⊤, 𝑝★_𝑗] ⪰ 0   ⟺   Θ★_𝑗/𝑝★_𝑗 ⪰ (𝜃★_𝑗/𝑝★_𝑗)(𝜃★_𝑗/𝑝★_𝑗)^⊤
for all 𝑗 ∈ J^+. Lemma 7.13 thus guarantees that there exist discrete distribu-
tions P★_𝑗 ∼ (𝜃★_𝑗/𝑝★_𝑗, Θ★_𝑗/𝑝★_𝑗), 𝑗 ∈ J^+, all of which are supported on Z. Con-
sequently, P★ = Σ_{𝑗∈J^+} 𝑝★_𝑗 P★_𝑗 is also supported on Z. In addition, we have

    E_{P★}[𝑍] = Σ_{𝑗∈J^+} 𝑝★_𝑗 · E_{P★_𝑗}[𝑍] = Σ_{𝑗∈J^+} 𝑝★_𝑗 · 𝜃★_𝑗/𝑝★_𝑗 = Σ_{𝑗∈J^+} 𝜃★_𝑗 = 𝜇★

and

    E_{P★}[𝑍𝑍^⊤] = Σ_{𝑗∈J^+} 𝑝★_𝑗 · E_{P★_𝑗}[𝑍𝑍^⊤] = Σ_{𝑗∈J^+} 𝑝★_𝑗 · Θ★_𝑗/𝑝★_𝑗 = Σ_{𝑗∈J^+} Θ★_𝑗 = 𝑀★,
that is, P★ ∼ (𝜇★ , 𝑀 ★). As (𝜇★, 𝑀 ★) ∈ F, it is now clear that P★ ∈ P and that
    E_{P★}[ℓ(𝑍)] ≤ sup_{P∈P} E_P[ℓ(𝑍)] = Σ_{𝑗∈[𝐽]} Tr(𝑄_𝑗 Θ★_𝑗) + 2𝑞_𝑗^⊤ 𝜃★_𝑗 + 𝑞_𝑗^0 𝑝★_𝑗,
where the equality follows from strong duality as established in Theorem 7.11. At
the same time, the definition of P★ as a mixture distribution and the definition of ℓ
in (7.3) as a pointwise maximum of quadratic component functions implies that
    E_{P★}[ℓ(𝑍)] ≥ Σ_{𝑗∈J^+} 𝑝★_𝑗 · E_{P★_𝑗}[ℓ_𝑗(𝑍)] = Σ_{𝑗∈[𝐽]} Tr(𝑄_𝑗 Θ★_𝑗) + 2𝑞_𝑗^⊤ 𝜃★_𝑗 + 𝑞_𝑗^0 𝑝★_𝑗.
Specifically, the inequality holds because ℓ ≥ ℓ 𝑗 for every 𝑗 ∈ [𝐽], and the equality
holds because 𝜃★𝑗 = 0 and Θ★𝑗 = 0 whenever 𝑝★𝑗 = 0. Indeed, if 𝑝★𝑗 = 0, then Θ★𝑗 = 0
because the index set J ∞ is empty, and the linear matrix inequality in (7.5) implies
that 𝜃★𝑗 = 0 whenever Θ★𝑗 = 0. The above inequalities thus ensure that P★ solves the
worst-case expectation problem (4.1). This completes the proof of assertion (i).
Next, we address assertion (ii). Similar arguments as in the proof of asser-
tion (i) can be used to show that P𝑚 ∈ P for every 𝑚 ≥ |J ∞ |. This implies that
E P𝑚 [ℓ(𝑍)] ≤ supP∈P E P [ℓ(𝑍)] whenever 𝑚 ≥ |J ∞ |. In addition, we observe that
    lim_{𝑚→∞} E_{P_𝑚}[ℓ(𝑍)] ≥ lim_{𝑚→∞} Σ_{𝑗∈J} 𝑝^𝑚_𝑗 · E_{P^𝑚_𝑗}[ℓ_𝑗(𝑍)] = Σ_{𝑗∈J} lim_{𝑚→∞} 𝑝^𝑚_𝑗 · E_{P^𝑚_𝑗}[ℓ_𝑗(𝑍)]

        = Σ_{𝑗∈[𝐽]} Tr(𝑄_𝑗 Θ★_𝑗) + 2𝑞_𝑗^⊤ 𝜃★_𝑗 + 𝑞_𝑗^0 𝑝★_𝑗 = sup_{P∈P} E_P[ℓ(𝑍)],
where the second equality exploits the definition of P𝑚 and the third equality follows
from strong duality as established in Theorem 7.11. This completes the proof.
Theorem 7.14 also applies to the Gelbrich ambiguity set, which constitutes a
Chebyshev ambiguity set of the form (2.4) with F given by (2.8). The extremal
distribution P★ identified in Theorem 7.14 (i) constitutes a mixture of different
for some 𝑁 ∈ N, where the probabilities 𝑝ˆ𝑖 , 𝑖 ∈ [𝑁], are strictly positive and sum
to 1, and where 𝑧ˆ𝑖 ∈ Z for every 𝑖 ∈ [𝑁]. In addition, ℓ(𝑧) ∈ R for all 𝑧 ∈ Z.
The requirement that 𝑝ˆ𝑖 be positive for every 𝑖 ∈ [𝑁] is non-restrictive because
atoms with zero probability can simply be eliminated without changing P̂.
Theorem 7.17 (Finite Dual Reformulation for 𝜙-Divergence Ambiguity Sets). If P
is the 𝜙-divergence ambiguity set (2.10) and Assumption 7.16 holds, then the worst-
case expectation problem (4.1) satisfies the weak duality relation

    sup_{P∈P} E_P[ℓ(𝑍)] ≤  inf   𝜆_0 + 𝜆𝑟 + Σ_{𝑖∈[𝑁]} 𝑝̂_𝑖 · (𝜙*)^𝜋(ℓ(𝑧̂_𝑖) − 𝜆_0, 𝜆)
                           s.t.  𝜆_0 ∈ R, 𝜆 ∈ R_+                                         (7.9)
                                 𝜆_0 + 𝜆𝜙^∞(1) ≥ sup_{𝑧∈Z} ℓ(𝑧),

where ℓ̄ is a shorthand for sup_{𝑧∈Z} ℓ(𝑧). The product 𝑝_0 𝜙^∞(1) is assumed to equal 0
if 𝑝_0 = 0 and 𝜙^∞(1) = ∞. Similarly, 𝑝_0 ℓ̄ is assumed to equal 0 if 𝑝_0 = 0 and ℓ̄ = ∞.
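To make the dual (7.9) concrete, the sketch below (a hypothetical numeric check) instantiates it for the Pearson 𝜒²-divergence, for which 𝜙(𝑠) = (𝑠 − 1)², 𝜙*(𝑡) = 𝑡²/4 + 𝑡 for 𝑡 ≥ −2 and 𝜙*(𝑡) = −1 otherwise. Since 𝜙^∞(1) = ∞, the explicit constraint in (7.9) is vacuous, and the dual infimum can be located with an off-the-shelf optimizer:

```python
import numpy as np
from scipy.optimize import minimize

loss = np.array([1.0, 2.0, 3.0])   # loss values at the atoms of the reference P̂
p_hat = np.full(3, 1.0 / 3.0)      # reference probability weights
r = 0.05                           # chi-square radius

def phi_conj(t):
    # conjugate of the Pearson entropy function phi(s) = (s - 1)^2
    return np.where(t >= -2.0, t * t / 4.0 + t, -1.0)

def dual_objective(v):
    lam0, lam = v[0], max(v[1], 1e-9)
    # perspective (phi*)^pi(loss - lam0, lam) = lam * phi*((loss - lam0) / lam)
    return lam0 + lam * r + p_hat @ (lam * phi_conj((loss - lam0) / lam))

res = minimize(dual_objective, x0=np.array([2.0, 1.0]), method="Nelder-Mead",
               options={"xatol": 1e-10, "fatol": 1e-12})
dual_value = res.fun

# for this instance the worst-case expectation is known in closed form:
# E[loss] + sqrt(r * Var[loss]) with E = 2 and Var = 2/3
closed_form = 2.0 + np.sqrt(0.05 * 2.0 / 3.0)
```

The agreement between the numerically computed dual infimum and the closed form illustrates that the inequality in (7.9) is tight for this instance, consistent with strong duality for 𝑟 > 0.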
The finite bi-dual reformulation (7.10) can readily be derived from the primal
worst-case expectation problem (4.1) or from its finite dual reformulation (7.9).
We find it insightful to derive (7.10) from (7.9). This is also more consistent with
the general proof strategy outlined in Section 7.1. We will briefly touch on the
derivation of (7.10) from the primal problem (4.1) after the proof.
Proof of Theorem 7.18. Assume first that 𝜙∞ (1) < ∞. Under the assumptions
stated in the theorem, the worst-case expectation problem (4.1) and its dual (7.9)
share the same optimal value thanks to Theorem 7.17. By dualizing the single
explicit constraint in (4.11) and using Lemma 7.1 (i), we thus find
    sup_{P∈P} E_P[ℓ(𝑍)] = inf_{𝜆_0∈R, 𝜆∈R_+}  𝜆_0 + 𝜆𝑟 + Σ_{𝑖∈[𝑁]} 𝑝̂_𝑖 sup_{𝑦_𝑖∈R_+} ( 𝑦_𝑖 (ℓ(𝑧̂_𝑖) − 𝜆_0) − 𝜆𝜙(𝑦_𝑖) )

                                                + sup_{𝑝_0∈R_+} ( ℓ̄ − 𝜆_0 − 𝜆𝜙^∞(1) ) 𝑝_0.
Interchanging the infima and suprema and rearranging terms further yields
    sup_{P∈P} E_P[ℓ(𝑍)]

    =   sup_{𝑝_0, 𝑦_1, …, 𝑦_𝑁 ∈ R_+}  𝑝_0 ℓ̄ + Σ_{𝑖∈[𝑁]} 𝑝̂_𝑖 𝑦_𝑖 ℓ(𝑧̂_𝑖) + inf_{𝜆_0∈R} ( 1 − 𝑝_0 − Σ_{𝑖∈[𝑁]} 𝑝̂_𝑖 𝑦_𝑖 ) 𝜆_0

                                       + inf_{𝜆∈R_+} ( 𝑟 − 𝑝_0 𝜙^∞(1) − Σ_{𝑖∈[𝑁]} 𝑝̂_𝑖 𝜙(𝑦_𝑖) ) 𝜆

    =   sup   𝑝_0 ℓ̄ + Σ_{𝑖∈[𝑁]} 𝑝̂_𝑖 𝑦_𝑖 ℓ(𝑧̂_𝑖)
        s.t.  𝑝_0, 𝑦_1, …, 𝑦_𝑁 ∈ R_+
              𝑝_0 + Σ_{𝑖∈[𝑁]} 𝑝̂_𝑖 𝑦_𝑖 = 1,   𝑝_0 𝜙^∞(1) + Σ_{𝑖∈[𝑁]} 𝑝̂_𝑖 𝜙(𝑦_𝑖) ≤ 𝑟.
The first equality in the above expression follows from strong duality, which holds
because 𝑟 > 0 and 𝜙 is continuous at 1. Indeed, these conditions ensure that the
resulting maximization problem admits a Slater point with 𝑝 0 = 0 and 𝑦 𝑖 = 1 for
all 𝑖 ∈ [𝑁]. The substitution 𝑝 𝑖 ← 𝑝ˆ 𝑖 𝑦 𝑖 , 𝑖 ∈ [𝑁], finally shows that the obtained
problem is equivalent to (7.10). This proves the claim for 𝜙∞ (1) < ∞.
Suppose next that 𝜙^∞(1) = ∞, in which case 0 · 𝜙^∞(1) evaluates to ∞. Hence, the
constraint in (4.11) is satisfied for any (𝜆_0, 𝜆) ∈ R × R_+ and is thus redundant.
Repeating the steps from the first part of the proof with obvious minor modifications
shows that (7.10) still holds if we assume that 𝑝_0 𝜙^∞(1) and 𝑝_0 ℓ̄ evaluate to 0
when 𝑝_0 = 0. Indeed, as 𝜙^∞(1) = ∞, any 𝑝_0 > 0 violates the divergence constraint,
whence 𝑝_0 = 0 is the only feasible choice in (7.10), and problem (7.10) can be
simplified by eliminating 𝑝_0 altogether.
The finite bi-dual reformulation on the right hand side of (7.10) has a linear
objective function and a compact convex feasible region. Therefore, it is solvable
thanks to Weierstrass’ maximum theorem. In particular, note that the feasible region
is a subset of the probability simplex in R^{𝑁+1}. If there exists a worst-case scenario
𝑧̂_0 ∈ arg max_{𝑧∈Z} ℓ(𝑧) (which must satisfy ℓ(𝑧̂_0) = ℓ̄), then any maximizer 𝑝★ of the
bi-dual can be used to construct an extremal distribution P★ = Σ_{𝑖=0}^𝑁 𝑝★_𝑖 𝛿_{𝑧̂_𝑖} for the
where the first equality exploits the definition of D 𝜙 , and the second equality
exploits our choice of the reference distribution 𝜌. In addition, the inequality
follows from the constraints of problem (7.10) and the observation that
    𝜙^𝜋(𝑝★_0, 0) = 𝜙^∞(𝑝★_0) = 𝑝★_0 𝜙^∞(1).
This confirms that P★ is feasible in (4.1). Also, its objective function value equals
    E_{P★}[ℓ(𝑍)] = Σ_{𝑖=0}^𝑁 𝑝★_𝑖 ℓ(𝑧̂_𝑖).
As ℓ(𝑧̂_0) = ℓ̄, we may conclude that E_{P★}[ℓ(𝑍)] coincides with the maximum of the
bi-dual reformulation in (7.10), which in turn matches the supremum of (4.1) by
virtue of Theorem 7.18. Hence, P★ is indeed a maximizer of problem (4.1).
Recall that if 𝜙∞ (1) = ∞, then D 𝜙 (P, P̂) = ∞ unless P ≪ P̂. Therefore,
every distribution P in a 𝜙-divergence ambiguity set around P̂ must be absolutely
continuous with respect to P̂. If 𝜙∞ (1) < ∞, on the other hand, then P can assign
a positive probability to points in Z that have zero probability under P̂. Note that
D 𝜙 (P, P̂) only depends on how much probability mass P removes from the support
of P̂, but it does not depend on where that probability mass is moved. As nature
aims to maximize the expected loss, it will move all of this probability mass to a
point with maximal loss within Z (i.e., to some point 𝑧̂_0 ∈ arg max_{𝑧∈Z} ℓ(𝑧)).
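This intuition can be verified numerically. The sketch below (hypothetical code) specializes the finite bi-dual to the total variation distance, for which 𝜙(𝑠) = |𝑠 − 1| and 𝜙^∞(1) = 1, and solves it as a linear program: auxiliary variables 𝑡_𝑖 ≥ |𝑝_𝑖 − 𝑝̂_𝑖| linearize the divergence constraint, and 𝑝_0 is the mass moved to a point with maximal loss ℓ̄ (set to 10 here by assumption):

```python
import numpy as np
from scipy.optimize import linprog

p_hat = np.array([0.5, 0.3, 0.2])   # reference weights on three atoms
loss = np.array([1.0, 2.0, 3.0])    # losses at the atoms
loss_bar = 10.0                     # assumed maximal loss over the support Z
r = 0.2                             # total variation radius

# variables x = (p_0, p_1, p_2, p_3, t_1, t_2, t_3); maximize via negated costs
c = -np.concatenate(([loss_bar], loss, np.zeros(3)))
A_eq = np.array([[1.0, 1, 1, 1, 0, 0, 0]])              # probabilities sum to 1
b_eq = np.array([1.0])
A_ub = np.vstack([
    np.concatenate(([1.0], np.zeros(3), np.ones(3))),       # p_0 + sum t_i <= r
    np.hstack([np.zeros((3, 1)), np.eye(3), -np.eye(3)]),   # p_i - t_i <= p_hat_i
    np.hstack([np.zeros((3, 1)), -np.eye(3), -np.eye(3)]),  # -p_i - t_i <= -p_hat_i
])
b_ub = np.concatenate(([r], p_hat, -p_hat))
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq)
worst_case = -res.fun
```

For this instance nature spends the entire budget taking mass from the lowest-loss atom, which yields the worst-case value 0.1 · 10 + 0.4 · 1 + 0.3 · 2 + 0.2 · 3 = 2.6, compared with a nominal expected loss of 1.7.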
If P is the restricted 𝜙-divergence ambiguity set (2.11), Assumption 7.16 holds,
𝑟 > 0 and 𝜙 is continuous at 1, then Theorem 7.18 remains valid with a minor
modification. That is, one must append the constraint 𝑝 0 = 0 to the finite bi-dual
reformulation on the right hand side of (7.10). Details are omitted for brevity.
where 𝑐𝑖 : Z → R is defined through 𝑐𝑖 (𝑧) = 𝑐(𝑧, 𝑧ˆ𝑖 ) for every 𝑖 ∈ [𝑁]. If 𝑟 > 0,
then strong duality holds, that is, the above inequality becomes an equality.
The dual minimization problem of Theorem 7.20 constitutes a finite convex
program because the conjugates (−ℓ 𝑗 )∗ , 𝑐∗𝑖 and 𝑔∗𝑘 and their perspectives are convex
functions. It accommodates O(𝑁 𝐽𝐾) decision variables and O(𝑁 𝐽) constraints.
For any fixed 𝑖 ∈ [𝑁] and 𝑗 ∈ [𝐽], Assumptions 7.19 (i) and 7.19 (ii) imply that the
embedded maximization problem over 𝑧 constitutes a convex program. In addition,
this problem admits a Slater point 𝑧ˆ𝑖 thanks to Assumptions 7.19 (i) and 7.19 (iv).
In order to dualize this convex program, we first recall from Lemma 7.2 that the
conjugate of 𝑓(𝑧) = −ℓ_𝑗(𝑧) + 𝜆𝑐_𝑖(𝑧) at 𝜁 ∈ R^𝑑 can be represented as

    𝑓*(𝜁) = min_{𝜁^ℓ_{𝑖𝑗}, 𝜁^𝑐_{𝑖𝑗} ∈ R^𝑑} { (−ℓ_𝑗)*(𝜁^ℓ_{𝑖𝑗}) + (𝑐_𝑖*)^𝜋(𝜁^𝑐_{𝑖𝑗}, 𝜆) : 𝜁^ℓ_{𝑖𝑗} + 𝜁^𝑐_{𝑖𝑗} = 𝜁 }.
where the second and the third equalities follow from (Rockafellar 1970, The-
orem 13.3) and from the definition of the perspective, respectively. Combining the
above observations proves the claim for 𝛼 = 0.
The next lemma derives a formula for the conjugate of a sum of scaled per-
spectives. It thus generalizes Lemma 7.21, which addresses only a single scaled
perspective, and it is also related to Lemma 7.2, which characterizes the conjugate
of a sum of arbitrary convex functions—not necessarily scaled perspectives.
Lemma 7.22 (Conjugates of Perspective Functions II). Suppose that 𝑓_𝑖 : R^𝑑 → R,
𝑖 ∈ [𝑚], are proper, convex and closed and that there is 𝑧̄ ∈ ∩_{𝑖∈[𝑚]} rint(dom(𝑓_𝑖)).
Let 𝑓(𝑧_1, …, 𝑧_𝑚, 𝜆) = Σ_{𝑖∈[𝑚]} 𝛼_𝑖 𝑓_𝑖^𝜋(𝑧_𝑖, 𝜆) be a weighted sum of the corresponding
perspective functions with weight vector 𝛼 ∈ R_+^𝑚. Then, the conjugate of 𝑓 satisfies

    𝑓*(𝑦_1, …, 𝑦_𝑚, 𝑦_0) = { 0   if ∃𝛽 ∈ R^𝑚 with Σ_{𝑖∈[𝑚]} 𝛽_𝑖 = 𝑦_0 and (𝑓_𝑖*)^𝜋(𝑦_𝑖, 𝛼_𝑖) ≤ −𝛽_𝑖 ∀𝑖 ∈ [𝑚],
                           { ∞   otherwise.
Proof. By using a variable splitting trick as in the proof of Lemma 7.2, we find
    𝑓*(𝑦_1, …, 𝑦_𝑚, 𝑦_0) = sup_{𝑧_1,…,𝑧_𝑚∈R^𝑑, 𝜆∈R_+}  𝑦_0 𝜆 + Σ_{𝑖∈[𝑚]} 𝑦_𝑖^⊤ 𝑧_𝑖 − Σ_{𝑖∈[𝑚]} 𝛼_𝑖 𝑓_𝑖^𝜋(𝑧_𝑖, 𝜆)

                         =   sup   𝑦_0 𝜆 + Σ_{𝑖∈[𝑚]} 𝑦_𝑖^⊤ 𝑧_𝑖 − 𝛼_𝑖 𝑓_𝑖^𝜋(𝑧_𝑖, 𝜆_𝑖)
                             s.t.  𝑧_1, …, 𝑧_𝑚 ∈ R^𝑑, 𝜆 ∈ R, 𝜆_1, …, 𝜆_𝑚 ∈ R_+
                                   𝜆_𝑖 = 𝜆   ∀𝑖 ∈ [𝑚].
The resulting convex maximization problem admits a Slater point. To see this,
recall that there exists 𝑧¯ ∈ ∩𝑖∈ [𝑚] rint(dom( 𝑓𝑖 )). As dom( 𝑓 𝜋 ) is contained in the
cone generated by dom( 𝑓 )× {1}, we may thus conclude that the solution with 𝜆 = 1,
𝜆 𝑖 = 1 and 𝑧𝑖 = 𝑧¯ for all 𝑖 ∈ [𝑚] constitutes a Slater point. Therefore, the above
maximization problem admits a strong Lagrangian dual, that is, we have
    𝑓*(𝑦_1, …, 𝑦_𝑚, 𝑦_0)

    = min_{𝛽_1,…,𝛽_𝑚∈R}  sup_{𝑧_1,…,𝑧_𝑚∈R^𝑑, 𝜆∈R, 𝜆_1,…,𝜆_𝑚∈R_+}  𝑦_0 𝜆 + Σ_{𝑖∈[𝑚]} ( 𝑦_𝑖^⊤ 𝑧_𝑖 − 𝛼_𝑖 𝑓_𝑖^𝜋(𝑧_𝑖, 𝜆_𝑖) + 𝛽_𝑖 (𝜆_𝑖 − 𝜆) )

    = min_{𝛽_1,…,𝛽_𝑚∈R}  { Σ_{𝑖∈[𝑚]} (𝛼_𝑖 𝑓_𝑖^𝜋)*(𝑦_𝑖, 𝛽_𝑖) : Σ_{𝑖∈[𝑚]} 𝛽_𝑖 = 𝑦_0 },
see also Theorem 7.4. By Lemma 7.21, we further have (𝛼_𝑖 𝑓_𝑖^𝜋)* = 𝛿_{C_𝑖}, where

    C_𝑖 = { (𝑦, 𝑦_0) ∈ R^𝑑 × R : (𝑓_𝑖*)^𝜋(𝑦, 𝛼_𝑖) ≤ −𝑦_0 }
for all 𝑖 ∈ [𝑚]. Substituting this alternative expression for (𝛼𝑖 𝑓𝑖𝜋 )∗ into the above
dual problem yields the desired formula. Thus, the claim follows.
We emphasize that Lemmas 7.21 and 7.22 are complementary to Lemma 4.11.
Indeed, while Lemma 4.11 evaluates the conjugate only with respect to the first
argument of a perspective function, Lemmas 7.21 and 7.22 do so with respect to
both arguments. We are now ready to derive a finite bi-dual reformulation of the
worst-case expectation problem over an optimal transport ambiguity set.
Theorem 7.23 (Finite Bi-Dual Reformulation for Optimal Transport Ambiguity
Sets). If P is the optimal transport ambiguity set (2.27) and Assumptions 7.16
and 7.19 hold, then the worst-case expectation problem (4.1) satisfies the weak
duality relation
    sup_{P∈P} E_P[ℓ(𝑍)]

    ≤   sup   Σ_{𝑖∈[𝑁]} Σ_{𝑗∈[𝐽]} −(−ℓ_𝑗)^𝜋(𝑝_{𝑖𝑗} 𝑧̂_𝑖 + 𝑧_{𝑖𝑗}, 𝑝_{𝑖𝑗})
        s.t.  𝑝_{𝑖𝑗} ∈ R_+, 𝑧_{𝑖𝑗} ∈ R^𝑑   ∀𝑖 ∈ [𝑁], 𝑗 ∈ [𝐽]
              𝑔_𝑘^𝜋(𝑝_{𝑖𝑗} 𝑧̂_𝑖 + 𝑧_{𝑖𝑗}, 𝑝_{𝑖𝑗}) ≤ 0   ∀𝑖 ∈ [𝑁], 𝑗 ∈ [𝐽], 𝑘 ∈ [𝐾]     (7.13)
              Σ_{𝑗∈[𝐽]} 𝑝_{𝑖𝑗} = 𝑝̂_𝑖   ∀𝑖 ∈ [𝑁]
              Σ_{𝑖∈[𝑁]} Σ_{𝑗∈[𝐽]} 𝑐_𝑖^𝜋(𝑝_{𝑖𝑗} 𝑧̂_𝑖 + 𝑧_{𝑖𝑗}, 𝑝_{𝑖𝑗}) ≤ 𝑟,
where 𝑐𝑖 : Z → R is defined through 𝑐𝑖 (𝑧) = 𝑐(𝑧, 𝑧ˆ𝑖 ) for every 𝑖 ∈ [𝑁]. If 𝑟 > 0,
then strong duality holds, that is, the above inequality becomes an equality.
Proof. We will show that (7.13) is obtained by dualizing the finite dual reformula-
tion (7.11) of problem (4.1). To see this, we assign Lagrange multipliers 𝑝 𝑖 𝑗 ∈ R+
and 𝑧𝑖 𝑗 ∈ R𝑑 , 𝑖 ∈ [𝑁], 𝑗 ∈ [𝐽], to the first and second constraint groups in (7.11),
respectively. The Lagrangian dual of (7.11) can then be represented compactly as
    sup_{𝑝≥0, 𝑧}  inf_{𝜆≥0, 𝛼≥0, 𝑠, 𝜁^ℓ, 𝜁^𝑐, 𝜁^𝑔}  𝐿_1(𝑠; 𝑝, 𝑧) + 𝐿_2(𝜁^ℓ; 𝑝, 𝑧) + 𝐿_3(𝜆, 𝜁^𝑐; 𝑝, 𝑧) + 𝐿_4(𝛼, 𝜁^𝑔; 𝑝, 𝑧),
where the Lagrangian is additively separable with respect to four disjoint groups
of primal decision variables, namely, 𝑠, 𝜁 ℓ , (𝜆, 𝜁 𝑐 ) and (𝛼, 𝜁 𝑔 ). The corresponding
partial Lagrangians are defined as follows.
    𝐿_1(𝑠; 𝑝, 𝑧) = Σ_{𝑖∈[𝑁]} 𝑝̂_𝑖 𝑠_𝑖 − Σ_{𝑖∈[𝑁]} Σ_{𝑗∈[𝐽]} 𝑝_{𝑖𝑗} 𝑠_𝑖

    𝐿_2(𝜁^ℓ; 𝑝, 𝑧) = Σ_{𝑖∈[𝑁]} Σ_{𝑗∈[𝐽]} 𝑝_{𝑖𝑗} · (−ℓ_𝑗)*(𝜁^ℓ_{𝑖𝑗}) − 𝑧_{𝑖𝑗}^⊤ 𝜁^ℓ_{𝑖𝑗}

    𝐿_3(𝜆, 𝜁^𝑐; 𝑝, 𝑧) = 𝜆𝑟 + Σ_{𝑖∈[𝑁]} Σ_{𝑗∈[𝐽]} 𝑝_{𝑖𝑗} · (𝑐_𝑖*)^𝜋(𝜁^𝑐_{𝑖𝑗}, 𝜆) − 𝑧_{𝑖𝑗}^⊤ 𝜁^𝑐_{𝑖𝑗}

    𝐿_4(𝛼, 𝜁^𝑔; 𝑝, 𝑧) = Σ_{𝑖∈[𝑁]} Σ_{𝑗∈[𝐽]} Σ_{𝑘∈[𝐾]} 𝑝_{𝑖𝑗} · (𝑔_𝑘*)^𝜋(𝜁^𝑔_{𝑖𝑗𝑘}, 𝛼_{𝑖𝑗𝑘}) − 𝑧_{𝑖𝑗}^⊤ 𝜁^𝑔_{𝑖𝑗𝑘}
These partial Lagrangians can be minimized separately with respect to the primal
decision variables. For example, an elementary calculation shows that

    inf_𝑠 𝐿_1(𝑠; 𝑝, 𝑧) = {  0   if Σ_{𝑗∈[𝐽]} 𝑝_{𝑖𝑗} = 𝑝̂_𝑖 ∀𝑖 ∈ [𝑁],
                         { −∞   otherwise.
Recall now that −ℓ 𝑗 is proper, convex and closed, which implies via Lemma 4.2 that
Similarly, recall that 𝑐_𝑖 is proper, convex and closed such that 𝑐_𝑖** = 𝑐_𝑖. Note also
that minimizing 𝐿_3(𝜆, 𝜁^𝑐; 𝑝, 𝑧) with respect to 𝜆 and 𝜁^𝑐 amounts to evaluating the
conjugate of a sum of perspective functions with one common argument. By using
Lemma 7.22 and applying a few elementary manipulations we thus find
    inf_{𝜆≥0, 𝜁^𝑐} 𝐿_3(𝜆, 𝜁^𝑐; 𝑝, 𝑧) = {  0   if ∃𝛽_{𝑖𝑗} ∈ R with Σ_{𝑖∈[𝑁]} Σ_{𝑗∈[𝐽]} 𝛽_{𝑖𝑗} = 𝑟 and
                                      {      𝑐_𝑖^𝜋(𝑧_{𝑖𝑗}, 𝑝_{𝑖𝑗}) ≤ 𝛽_{𝑖𝑗} ∀𝑖 ∈ [𝑁], 𝑗 ∈ [𝐽],
                                      { −∞   otherwise.
Finally, recall that 𝑔_𝑘 is proper, convex and closed such that 𝑔_𝑘** = 𝑔_𝑘. Note also
that minimizing 𝐿_4(𝛼, 𝜁^𝑔; 𝑝, 𝑧) with respect to 𝛼 and 𝜁^𝑔 amounts to evaluating the
conjugate of a sum of perspective functions with mutually different arguments. By
using Lemma 7.21 and applying a few elementary manipulations we thus find
    inf_{𝛼≥0, 𝜁^𝑔} 𝐿_4(𝛼, 𝜁^𝑔; 𝑝, 𝑧) = {  0   if 𝑔_𝑘^𝜋(𝑧_{𝑖𝑗}, 𝑝_{𝑖𝑗}) ≤ 0 ∀𝑖 ∈ [𝑁], 𝑗 ∈ [𝐽], 𝑘 ∈ [𝐾],
                                      { −∞   otherwise.
Substituting the infima of the partial Lagrangians into the dual objective yields the
following equivalent reformulation for the problem dual to (7.11).
    sup   Σ_{𝑖∈[𝑁]} Σ_{𝑗∈[𝐽]} −(−ℓ_𝑗)^𝜋(𝑧_{𝑖𝑗}, 𝑝_{𝑖𝑗})
    s.t.  𝑝_{𝑖𝑗} ∈ R_+, 𝛽_{𝑖𝑗} ∈ R, 𝑧_{𝑖𝑗} ∈ R^𝑑   ∀𝑖 ∈ [𝑁], 𝑗 ∈ [𝐽]
          𝑔_𝑘^𝜋(𝑧_{𝑖𝑗}, 𝑝_{𝑖𝑗}) ≤ 0   ∀𝑖 ∈ [𝑁], 𝑗 ∈ [𝐽], 𝑘 ∈ [𝐾]
          Σ_{𝑗∈[𝐽]} 𝑝_{𝑖𝑗} = 𝑝̂_𝑖   ∀𝑖 ∈ [𝑁]                                   (7.14)
          𝑐_𝑖^𝜋(𝑧_{𝑖𝑗}, 𝑝_{𝑖𝑗}) ≤ 𝛽_{𝑖𝑗}   ∀𝑖 ∈ [𝑁], 𝑗 ∈ [𝐽]
          Σ_{𝑖∈[𝑁]} Σ_{𝑗∈[𝐽]} 𝛽_{𝑖𝑗} = 𝑟.
Note that if the finite dual reformulation (7.11) of the worst-case expectation prob-
lem is viewed as an instance of the primal convex program (P), then problem (7.14)
represents the corresponding instance of the dual convex program (D). By As-
sumptions 7.16 and 7.19, problem (7.14) admits a Slater point with 𝑝 𝑖 𝑗 = 𝑝ˆ𝑖 /𝐽 and
𝑧𝑖 𝑗 = 𝑧ˆ𝑖 for all 𝑖 ∈ [𝑁] and 𝑗 ∈ [𝐽]. Thus, strong duality holds thanks to The-
orem 7.4 (i). It remains to be shown that (7.14) is equivalent to (7.13). To this end,
note first that the last constraint in (7.14) can be relaxed to a less-than-or-equal-to
inequality without increasing the problem’s supremum such that 𝛽𝑖 𝑗 = 𝑐𝑖𝜋 (𝑧𝑖 𝑗 , 𝑝 𝑖 𝑗 )
    𝑝^𝑚_{𝑖𝑗} = { (1 − |J_𝑖^∞|/𝑚) 𝑝★_{𝑖𝑗}   if 𝑗 ∈ J_𝑖^+,
              { 𝑝̂_𝑖/𝑚                      if 𝑗 ∈ J_𝑖^∞,

    and

    𝑧^𝑚_{𝑖𝑗} = { 𝑧̂_𝑖 + 𝑧★_{𝑖𝑗}/𝑝★_{𝑖𝑗}    if 𝑗 ∈ J_𝑖^+,
              { 𝑧̂_𝑖 + 𝑧★_{𝑖𝑗}/𝑝^𝑚_{𝑖𝑗}   if 𝑗 ∈ J_𝑖^∞.
Proof. In view of assertion (i), we first show that P★ defined in the statement of
the theorem is feasible in the worst-case expectation problem (4.1). To this end,
observe first that feasibility of (𝑝★, 𝑧★) in (7.13) implies that 𝑝★𝑖𝑗 ≥ 0 for all 𝑖 ∈ [𝑁]
and 𝑗 ∈ [𝐽], and that Σ_{𝑖∈[𝑁]} Σ_{𝑗∈J_𝑖^+} 𝑝★_{𝑖𝑗} = 1. Note also that 𝑧̂_𝑖 + 𝑧★_{𝑖𝑗}/𝑝★_{𝑖𝑗} ∈ Z for
all 𝑖 ∈ [𝑁] and 𝑗 ∈ J𝑖+ due to the second constraint in (7.13). This confirms that
P★ ∈ P(Z). The penultimate constraint group of problem (7.13) also implies that
    Σ_{𝑖∈[𝑁]} Σ_{𝑗∈J_𝑖^+} 𝑝★_{𝑖𝑗} 𝛿_{(𝑧̂_𝑖 + 𝑧★_{𝑖𝑗}/𝑝★_{𝑖𝑗}, 𝑧̂_𝑖)} ∈ Γ(P★, P̂)
constitutes a valid transportation plan for morphing P̂ into P★. Thus, we find
    OT_𝑐(P★, P̂) ≤ Σ_{𝑖∈[𝑁]} Σ_{𝑗∈J_𝑖^+} 𝑝★_{𝑖𝑗} · 𝑐(𝑧̂_𝑖 + 𝑧★_{𝑖𝑗}/𝑝★_{𝑖𝑗}, 𝑧̂_𝑖)

                 = Σ_{𝑖∈[𝑁]} Σ_{𝑗∈[𝐽]} 𝑐_𝑖^𝜋(𝑝★_{𝑖𝑗} 𝑧̂_𝑖 + 𝑧★_{𝑖𝑗}, 𝑝★_{𝑖𝑗}) ≤ 𝑟.
Here, the equality holds because all terms corresponding to 𝑖 ∈ [𝑁] and 𝑗 ∉ J𝑖+
vanish. Indeed, if 𝑗 ∉ J𝑖+ , then 𝑝★𝑖𝑗 = 0. As J𝑖∞ = ∅, this implies that 𝑧★𝑖𝑗 = 0.
Thus, we have 𝑐_𝑖^𝜋(𝑝★_{𝑖𝑗} 𝑧̂_𝑖 + 𝑧★_{𝑖𝑗}, 𝑝★_{𝑖𝑗}) = 𝑐_𝑖^𝜋(0, 0) = 𝑐_𝑖^∞(0) = 0 by the definitions of
the perspective and the recession function. The second inequality in the above
expression follows from the last constraint in (7.13). In summary, we have shown
that P★ is feasible in (4.1). As for the objective function value of P★, note that
    E_{P★}[ℓ(𝑍)] ≤ sup_{P∈P} E_P[ℓ(𝑍)] ≤ Σ_{𝑖∈[𝑁]} Σ_{𝑗∈[𝐽]} −(−ℓ_𝑗)^𝜋(𝑝★_{𝑖𝑗} 𝑧̂_𝑖 + 𝑧★_{𝑖𝑗}, 𝑝★_{𝑖𝑗}),
where the second inequality follows from the weak duality relation established in
Theorem 7.23. At the same time, however, the expected loss under P★ satisfies
    E_{P★}[ℓ(𝑍)] = Σ_{𝑖∈[𝑁]} Σ_{𝑗∈J_𝑖^+} 𝑝★_{𝑖𝑗} max_{𝑗′∈[𝐽]} ℓ_{𝑗′}( 𝑧̂_𝑖 + 𝑧★_{𝑖𝑗}/𝑝★_{𝑖𝑗} )

                 ≥ Σ_{𝑖∈[𝑁]} Σ_{𝑗∈J_𝑖^+} −(−ℓ_𝑗)^𝜋(𝑝★_{𝑖𝑗} 𝑧̂_𝑖 + 𝑧★_{𝑖𝑗}, 𝑝★_{𝑖𝑗})

                 = Σ_{𝑖∈[𝑁]} Σ_{𝑗∈[𝐽]} −(−ℓ_𝑗)^𝜋(𝑝★_{𝑖𝑗} 𝑧̂_𝑖 + 𝑧★_{𝑖𝑗}, 𝑝★_{𝑖𝑗}),
where the inequality uses the definition of the perspective function and the trivial
observation that 𝑗 ∈ J_𝑖^+ is a feasible choice for 𝑗′ ∈ [𝐽]. The last equality holds
once more because 𝑝★𝑖𝑗 = 0 implies 𝑧★𝑖𝑗 = 0 and (−ℓ 𝑗 ) 𝜋 (0, 0) = (−ℓ 𝑗 )∞ (0) = 0 by
the definition of the perspective and the recession function. In summary, the above
inequalities imply that P★ is optimal in (4.1). Hence, assertion (i) follows.
As for assertion (ii), we first show that P𝑚 ∈ P for any fixed 𝑚 ≥ max𝑖∈ [ 𝑁 ] |J𝑖∞ |.
The constraints of problem (7.13) imply that 𝑝^𝑚_{𝑖𝑗} ≥ 0 for all 𝑗 ∈ J_𝑖 and 𝑖 ∈ [𝑁] and
that Σ_{𝑖∈[𝑁]} Σ_{𝑗∈J_𝑖} 𝑝^𝑚_{𝑖𝑗} = 1. They also imply that 𝑧^𝑚_{𝑖𝑗} ∈ Z for every 𝑗 ∈ J_𝑖 and 𝑖 ∈
[𝑁]. This is easy to see if 𝑗 ∈ J𝑖+ . If 𝑗 ∈ J𝑖∞ , on the other hand, then 𝑝★𝑖𝑗 = 0,
𝑧★𝑖𝑗 ≠ 0 and 𝑔 𝑘𝜋 (𝑧★𝑖𝑗 , 0) ≤ 0 for all 𝑘 ∈ [𝐾], which implies via (Rockafellar 1970,
Theorem 8.6) that 𝑧★𝑖𝑗 is a recession direction of Z. Geometrically, this means that
the ray emanating from any point in Z along the direction 𝑧★𝑖𝑗 never leaves Z. Thus,
𝑧^𝑚_{𝑖𝑗} = 𝑧̂_𝑖 + 𝑚𝑧★_{𝑖𝑗}/𝑝̂_𝑖 ∈ Z for all 𝑖 ∈ [𝑁] and 𝑗 ∈ J_𝑖^∞. In addition, one verifies that
    Σ_{𝑖∈[𝑁]} Σ_{𝑗∈J_𝑖} 𝑝^𝑚_{𝑖𝑗} 𝛿_{(𝑧^𝑚_{𝑖𝑗}, 𝑧̂_𝑖)} ∈ Γ(P_𝑚, P̂)
where the first equality follows from the definitions of 𝑝^𝑚_{𝑖𝑗} and 𝑧^𝑚_{𝑖𝑗}. The second
inequality holds because the transportation cost function 𝑐(𝑧, 𝑧ˆ) is non-negative and
convex in 𝑧, which implies that both terms in the third line are non-decreasing in 𝑚.
The second equality follows from Assumption 7.24, which ensures that 𝑐(𝑧, 𝑧ˆ) is
real-valued such that the reference point in the definition of the recession function
of 𝑐(·, 𝑧ˆ𝑖 ) can be chosen freely. The third equality exploits the definition of the
perspective function 𝑐𝑖𝜋 and the observation that 𝑐𝑖𝜋 (0, 0) = 𝑐∞ 𝑖 (0) = 0. Finally,
the last inequality follows from the last constraint of problem (7.13). We have
thus shown that P_𝑚 is feasible in (4.1). In analogy to the analysis for P★, one can
show that the asymptotic expected loss lim_{𝑚→∞} E_{P_𝑚}[ℓ(𝑍)] is at least as large
as the optimal value Σ_{𝑖∈[𝑁]} Σ_{𝑗∈[𝐽]} −(−ℓ_𝑗)^𝜋(𝑝★_{𝑖𝑗} 𝑧̂_𝑖 + 𝑧★_{𝑖𝑗}, 𝑝★_{𝑖𝑗}) of the finite bi-dual
reformulation (7.13). However, as the suprema of (4.1) and (7.13) match, it is clear
that the distributions P𝑚 , 𝑚 ∈ N, must be asymptotically optimal in (4.1).
where the first equality uses the definition of the perspective function, and the
second equality holds because the transportation cost function grows superlinearly.
Thus, (𝑝★, 𝑧★) violates the last constraint of problem (7.13), which contradicts its
assumed feasibility. We may thus conclude that J𝑖∞ = ∅ and that (4.1) is solvable.
As for assertion (ii), assume now that Z is bounded. Without loss of generality,
we may also assume that 𝑝★𝑖𝑗 = 0 for some 𝑖 ∈ [𝑁] and 𝑗 ∈ [𝐽] for otherwise J𝑖∞ is
trivially empty. The constraints of problem (7.13) then ensure that 𝑔 𝑘𝜋 (𝑧★𝑖𝑗 , 0) ≤ 0
for all 𝑘 ∈ [𝐾], which implies via (Rockafellar 1970, Theorem 8.6) that 𝑧★𝑖𝑗 is a
recession direction of Z. As Z is compact, however, this implies that 𝑧★𝑖𝑗 = 0. We
may thus again conclude that J𝑖∞ = ∅ and that (4.1) is solvable.
problem, on the other hand, is less evident because this problem assumes somewhat
unrealistically that the decision-maker observes the distribution that governs 𝑍.
Nevertheless, the dual DRO problem has deep connections to robust statistics,
machine learning as well as several other disciplines as we explain below.
From the perspective of robust statistics, a minimizer 𝑥★ of the primal DRO
problem (1.2) can be interpreted as a robust estimator for the minimizer of the
stochastic program min 𝑥 ∈X E P0 [ℓ(𝑥, 𝑍)] corresponding to an unknown distribu-
tion P0 . When 𝑥★ and P★ satisfy the saddle point condition (7.16), then the robust
estimator 𝑥★ constitutes a best response to P★. Hence, it solves the stochastic pro-
gram corresponding to P★; see also (Lehmann and Casella 2006, Chapter 5). For
this reason, P★ is often referred to as the least favorable distribution. The existence
of P★ makes 𝑥★ a plausible estimator because it ensures that 𝑥★ is the minimizer of
a stochastic program corresponding to some distribution in the ambiguity set.
Algorithms for computing Nash equilibria of DRO problems are also relevant
for applications in machine learning. To see this, recall that adversarial training
aims to immunize machine learning models against adversarial perturbations of
the input data (Szegedy et al. 2014, Goodfellow et al. 2015, Mądry, Makelov,
Schmidt, Tsipras and Vladu 2018, Wang, Ma, Bailey, Yi, Zhou and Gu 2019,
Kurakin, Goodfellow and Bengio 2022). In this context, it is of interest to generate
adversarial examples, that is, maliciously designed inputs that mislead prediction
models encoded by parameters 𝑥 ∈ X . As a naïve approach to construct adversarial
examples, one could simply solve the worst-case expectation problem
    sup_{P∈P} E_P[ℓ(𝑥̂, 𝑍)],     (7.17)
which seeks a test distribution that maximizes the expected prediction loss of one
particular model encoded by 𝑥̂. Thus, any solution P★ of (7.17) can be viewed
as an adversarial attack, and samples drawn from P★ are naturally interpreted as
adversarial examples. In order to develop efficient strategies for attacking as well
as defending prediction models, however, it is desirable to construct adversarial
attacks that fool a broad spectrum of different models. Such attacks are called
transferable in the machine learning literature (Tramèr, Papernot, Goodfellow,
Boneh and McDaniel 2017, Demontis, Melis, Pintor, Jagielski, Biggio, Oprea,
Nita-Rotaru and Roli 2019, Kurakin et al. 2022). The dual DRO problem (7.15)
can be used to construct transferable attacks in a systematic manner. Indeed, the
solutions of (7.15) are not tailored to a particular model 𝑥ˆ ∈ X . Instead, they
aim to attack all models 𝑥 ∈ X simultaneously. If the primal DRO problem (1.2)
has a unique minimizer 𝑥★, then this minimizer can be recovered by solving the
stochastic program corresponding to the adversary’s Nash strategy P★.
To date, dual DRO problems have only been investigated in the context of specific
applications. For example, it is known that the least favorable distributions in dis-
tributionally robust estimation and Kalman filtering problems with a 2-Wasserstein
ambiguity set centered at a Gaussian reference distribution are themselves Gaus-
8. Regularization by Robustification
Classical stochastic optimization seeks decisions that perform well under a prob-
ability distribution P̂ estimated from training data. By ignoring any information
about estimation errors in P̂, however, stochastic optimization tends to output over-
fitted decisions that incur a low expected loss under P̂ but may perform poorly
under the unknown population distribution P. This problem becomes more acute if
training data is scarce. A key advantage of DRO vis-à-vis stochastic optimization
is that it has access to information about estimation errors. DRO uses this informa-
tion to prevent overfitting. Robustifying a stochastic optimization problem against
distributional uncertainty can thus be viewed as a form of implicit regularization.
We now show that there is often a deep connection between implicit regulariz-
ation (achieved by robustifying a problem against distributional uncertainty) and
explicit regularization (achieved by adding a penalty term to the problem’s objective
function). This discussion complements and extends several results from Section 6.
For example, in Section 6.9 we have seen that the worst-case expected value of a
linear loss function with respect to a Kullback-Leibler ambiguity set centered at
a Gaussian distribution coincides with the nominal expected loss and a variance
regularization term. Similarly, in Section 6.13 we have seen that the worst-case
expected value of a convex loss function with respect to a 1-Wasserstein ambiguity
set coincides with the nominal expected loss and a Lipschitz regularization term.
See also Sections 6.14 and 6.15 for some variants and generalizations of this result.
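The Lipschitz-regularization identity recalled above can be illustrated with a short computation. The sketch below (hypothetical code, assuming Z = R, the 1-Lipschitz loss ℓ(𝑧) = |𝑧| and a 1-Wasserstein ball of radius 𝑟) shifts one atom of a discrete reference distribution outward until the transport budget is exhausted, which raises the expected loss by exactly 𝑟 · Lip(ℓ) = 𝑟:

```python
import numpy as np

atoms = np.array([1.0, 2.0, 3.0])   # atoms of the reference distribution
p_hat = np.array([0.5, 0.3, 0.2])   # their probabilities
r = 0.4                             # 1-Wasserstein radius
loss = np.abs                       # 1-Lipschitz loss function

nominal = p_hat @ loss(atoms)

# candidate worst-case distribution: shift the last (positive) atom
# outward by r / p_hat[-1], which costs exactly r units of transport budget
shifted = atoms.copy()
shifted[-1] += r / p_hat[-1]
transport_cost = p_hat[-1] * (shifted[-1] - atoms[-1])
worst_case = p_hat @ loss(shifted)
```

The achieved value equals the nominal expected loss plus 𝑟, matching the upper bound E_P̂[ℓ(𝑍)] + 𝑟 · Lip(ℓ) from Section 6.13, so this candidate distribution is already worst-case optimal for the instance at hand.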
Hence, the worst-case expected loss with respect to a Pearson 𝜒²-divergence am-
biguity set of radius 𝑟 around P̂ is bounded above by the mean-standard deviation
risk measure with risk-aversion coefficient 𝑟^{1/2} evaluated under P̂. By slight abuse
of terminology, the scaled standard deviation 𝑟^{1/2} V_{P̂}[ℓ(𝑍)]^{1/2} is commonly referred
to as a variance regularizer. By leveraging Theorem 4.15, the above bound can be
extended to arbitrary (possibly unbounded) Borel loss functions. This extension
critically relies on the following lemma.
Lemma 8.1 (Variance Formula). For any reference distribution P̂ ∈ P(Z), size
parameter 𝑟 > 0 and Borel function ℓ ∈ L(R^𝑑) with E_{P̂}[|ℓ(𝑍)|] < ∞, we have

    inf_{𝜆_0∈R, 𝜆∈R_+}  𝜆𝑟 + E_{P̂}[(ℓ(𝑍) − 𝜆_0)²]/(4𝜆) = 𝑟^{1/2} V_{P̂}[ℓ(𝑍)]^{1/2}.     (8.1)
Proof. If E P̂ [ℓ(𝑍)2 ] = ∞, then both sides of (8.1) evaluate to ∞, and thus the claim
follows. In the remainder of the proof, we may thus assume that E P̂ [ℓ(𝑍)2 ] < ∞.
In this case, one readily verifies that the partial minimization problem over 𝜆 0 is
solved by 𝜆★0 = E P̂ [ℓ(𝑍)]. Substituting 𝜆★0 back into the objective function reveals
that the infimum on the left hand side of (8.1) equals inf_{𝜆∈R_+} 𝜆𝑟 + V_{P̂}[ℓ(𝑍)]/(4𝜆).
In order to prove (8.1), it suffices to realize that this minimization problem over 𝜆
is solved by 𝜆★ = (V_{P̂}[ℓ(𝑍)]/(4𝑟))^{1/2}. This observation completes the proof.
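The closed-form minimizers 𝜆★_0 = E_{P̂}[ℓ(𝑍)] and 𝜆★ = (V_{P̂}[ℓ(𝑍)]/(4𝑟))^{1/2} from this proof are easy to sanity-check numerically; the sketch below is a hypothetical verification for a discrete reference distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
ell = np.array([1.0, 4.0, 2.0, 7.0])     # loss values at the atoms of P̂
p_hat = np.array([0.1, 0.4, 0.3, 0.2])   # atom probabilities
r = 0.3                                  # size parameter

def objective(lam0, lam):
    # left-hand-side objective of (8.1)
    return lam * r + p_hat @ (ell - lam0) ** 2 / (4.0 * lam)

mean = p_hat @ ell
var = p_hat @ (ell - mean) ** 2
lam_star = np.sqrt(var / (4.0 * r))
min_value = objective(mean, lam_star)    # should equal sqrt(r * var)

# random perturbations of (lam0, lam) should never beat the closed form
perturbed = min(objective(mean + rng.normal(), lam_star * np.exp(rng.normal()))
                for _ in range(1000))
```

Plugging the closed-form minimizers into the objective reproduces 𝑟^{1/2} V_{P̂}[ℓ(𝑍)]^{1/2}, and no sampled perturbation of (𝜆_0, 𝜆) attains a smaller value.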
Theorem 8.2 (Variance Regularization). If P is the Pearson 𝜒2 -divergence ambi-
guity set (2.17) and E P̂ [|ℓ(𝑍)|] < ∞, then we have
    sup_{P∈P} E_P[ℓ(𝑍)] ≤ E_{P̂}[ℓ(𝑍)] + 𝑟^{1/2} V_{P̂}[ℓ(𝑍)]^{1/2}.
Proof. The claim trivially holds if 𝑟 = 0. We may thus assume that 𝑟 > 0.
Recall now that the entropy function 𝜙 inducing the Pearson 𝜒2 -divergence satisfies
𝜙(𝑠) = (𝑠 − 1)² if 𝑠 ≥ 0 and 𝜙(𝑠) = ∞ if 𝑠 < 0. Hence, the conjugate entropy
function 𝜙* satisfies 𝜙*(𝑡) = 𝑡²/4 + 𝑡 if 𝑡 ≥ −2 and 𝜙*(𝑡) = −1 if 𝑡 < −2, and its
domain is given by dom(𝜙∗ ) = R. As 𝜙∞ (1) = ∞, all distributions P ∈ P are
absolutely continuous with respect to P̂. Thus, Theorem 4.15 applies, and we find
    sup_{P∈P} E_P[ℓ(𝑍)] = inf_{𝜆_0∈R, 𝜆∈R_+}  𝜆_0 + 𝜆𝑟 + E_{P̂}[(𝜙*)^𝜋(ℓ(𝑍) − 𝜆_0, 𝜆)]

                        ≤ inf_{𝜆_0∈R, 𝜆∈R_+}  E_{P̂}[ℓ(𝑍)] + 𝜆𝑟 + E_{P̂}[(ℓ(𝑍) − 𝜆_0)²]/(4𝜆)

                        = E_{P̂}[ℓ(𝑍)] + 𝑟^{1/2} V_{P̂}[ℓ(𝑍)]^{1/2},

where the inequality holds because 𝜙*(𝑡) ≤ 𝑡²/4 + 𝑡, and the second equality follows
from Lemma 8.1. Thus, the claim follows.
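Theorem 8.2 can also be probed directly with a generic solver (a hypothetical check): for a discrete reference distribution, the worst-case expectation over the Pearson 𝜒² ball is a small convex program. For the instance below the bound is in fact attained, because the unconstrained maximizer keeps all probabilities non-negative:

```python
import numpy as np
from scipy.optimize import minimize

loss = np.array([1.0, 2.0, 3.0])   # loss values at the atoms of P̂
p_hat = np.full(3, 1.0 / 3.0)      # reference weights
r = 0.05                           # chi-square radius

# maximize sum p_i loss_i  s.t.  sum p = 1,  sum (p - p_hat)^2 / p_hat <= r,  p >= 0
res = minimize(lambda p: -(p @ loss), p_hat, method="SLSQP",
               bounds=[(0.0, 1.0)] * 3,
               constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0},
                            {"type": "ineq",
                             "fun": lambda p: r - ((p - p_hat) ** 2 / p_hat).sum()}])
worst_case = -res.fun

mean = p_hat @ loss
std = np.sqrt(p_hat @ (loss - mean) ** 2)
bound = mean + np.sqrt(r) * std    # variance-regularization bound of Theorem 8.2
```

The solver confirms both the inequality of Theorem 8.2 and, for this interior instance, its tightness.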
Most 𝜙-divergences are smooth and non-negative and thus resemble the Pearson
𝜒2 -divergence locally around 1 (Polyanskiy and Wu 2024, § 7.10). Accordingly,
one can use a Taylor expansion to show that robustification over a 𝜙-divergence am-
biguity set of sufficiently small size 𝑟 is often equivalent to variance regularization.
To formalize this result, we assume from now on that 𝜙 is differentiable.
Assumption 8.3 (Differentiability). The entropy function 𝜙 is twice continuously
differentiable on a neighborhood of 1 with 𝜙(1) = 𝜙′ (1) = 0 and 𝜙′′ (1) = 2.
The assumption that 𝜙′ (1) = 0 incurs no loss of generality. Indeed, any entropy function 𝜙 is equivalent to the transformed entropy function 𝜙̃ defined through 𝜙̃(𝑡) = 𝜙(𝑡) − 𝜙′ (1) · 𝑡 + 𝜙′ (1), which satisfies 𝜙̃′ (1) = 0. That is, both 𝜙 and 𝜙̃ induce the same
divergence. Note that all entropy functions listed in Table 2.1—except for the
one associated with the total variation—satisfy 𝜙′ (1) = 0. The assumption that
𝜙′′ (1) = 2 serves as an arbitrary normalization but will simplify calculations.
Recall now that the restricted 𝜙-divergence ambiguity set (2.11) is defined as
$$\mathcal{P} = \big\{\, \mathbb{P} \in \mathcal{P}(\mathcal{Z}) \;:\; \mathbb{P} \ll \hat{\mathbb{P}},\; D_\phi(\mathbb{P}, \hat{\mathbb{P}}) \le r \,\big\}.$$
Theorem 8.4 (Asymptotic Variance Regularization). Suppose that Assumption 8.3 holds, that P is the restricted 𝜙-divergence ambiguity set (2.11) of size 𝑟 ∈ R+, and that ℓ(𝑍) − E P̂ [ℓ(𝑍)] is essentially bounded under P̂. Then we have
$$\sup_{\mathbb{P}\in\mathcal{P}} \mathbb{E}_{\mathbb{P}}[\ell(Z)] \;=\; \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + r^{1/2}\,\mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{1/2} + o(r^{1/2}). \tag{8.2}$$
Proof. Note that (8.2) trivially holds if 𝑟 = 0. Similarly, if VP̂ [ℓ(𝑍)] = 0, then ℓ(𝑍) coincides P̂-almost surely with E P̂ [ℓ(𝑍)]. As P is a restricted 𝜙-divergence ambiguity set, this readily implies that E P [ℓ(𝑍)] = E P̂ [ℓ(𝑍)] for all P ∈ P. Indeed, any P ∈ P satisfies P ≪ P̂. Hence, (8.2) is again trivially satisfied. In the remainder of the proof we may therefore assume that 𝑟 > 0 and that VP̂ [ℓ(𝑍)] > 0.
Assumption 8.3 implies that 𝜙(1 + 𝑠) = 𝑠² + 𝑜(𝑠²) as 𝑠 → 0. By Taylor's theorem with Peano remainder, 𝜙 can thus be bounded from below (or above) locally around 1 by a quadratic function whose second derivative is slightly smaller (or larger) than 𝜙′′ (1) = 2. Thus, there exists a function 𝜅 : R+ → R+ with lim 𝜀↓0 𝜅(𝜀) = 0 and
$$\frac{1}{1+\kappa(\varepsilon)} \cdot s^2 \;\le\; \phi(1+s) \;\le\; (1+\kappa(\varepsilon)) \cdot s^2 \qquad \forall s \in [-\varepsilon, +\varepsilon] \tag{8.3}$$
for all sufficiently small 𝜀. The rest of the proof proceeds in two steps, both of which exploit (8.3). First, we show that the right-hand side of (8.2) provides a lower bound on the worst-case expected loss over P (Step 1). Next, we show that the right-hand side of (8.2) also provides an upper bound on the worst-case expected loss over P (Step 2). Taken together, Steps 1 and 2 will imply the claim.
Step 1. As every P ∈ P satisfies P ≪ P̂, we can identify each P ∈ P with its Radon–Nikodym density 𝑓 = dP/dP̂ with respect to P̂ (see Section 2.2). Thus, the worst-case expectation problem over P can be recast as
$$\sup_{\mathbb{P}\in\mathcal{P}} \mathbb{E}_{\mathbb{P}}[\ell(Z)] \;=\; \left\{
\begin{array}{cl}
\sup\limits_{f\in L^1(\hat{\mathbb{P}})} & \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z) f(Z)] \\[1mm]
\text{s.t.} & \hat{\mathbb{P}}(f(Z) \ge 0) = 1 \\
& \mathbb{E}_{\hat{\mathbb{P}}}[f(Z)] = 1 \\
& \mathbb{E}_{\hat{\mathbb{P}}}[\phi(f(Z))] \le r.
\end{array}\right.$$
Renaming 𝑓 (𝑧) − 1 as 𝑓 (𝑧) further yields
$$\sup_{\mathbb{P}\in\mathcal{P}} \mathbb{E}_{\mathbb{P}}[\ell(Z)] \;=\; \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + \left\{
\begin{array}{cl}
\sup\limits_{f\in L^1(\hat{\mathbb{P}})} & \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z) f(Z)] \\[1mm]
\text{s.t.} & \hat{\mathbb{P}}(f(Z) \ge -1) = 1 \\
& \mathbb{E}_{\hat{\mathbb{P}}}[f(Z)] = 0 \\
& \mathbb{E}_{\hat{\mathbb{P}}}[\phi(1 + f(Z))] \le r.
\end{array}\right. \tag{8.4}$$
Next, introduce an auxiliary function 𝜀 : R+ → R+ satisfying
$$\varepsilon(r) = 2 r^{1/2} \cdot \frac{\operatorname{ess\,sup}_{\hat{\mathbb{P}}}\big[\big|\ell(Z) - \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)]\big|\big]}{\mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{1/2}}.$$
In addition, for every 𝑟 ∈ R+, define the function 𝑓𝑟★ ∈ L¹(P̂) through
$$f_r^\star(z) = \frac{r^{1/2}}{(1+\kappa(\varepsilon(r)))^{1/2}} \cdot \frac{\ell(z) - \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)]}{\mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{1/2}}.$$
By construction, we may thus conclude that
$$\big|f_r^\star(Z)\big| \;\le\; r^{1/2} \cdot \frac{\big|\ell(Z) - \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)]\big|}{\mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{1/2}} \;\le\; \varepsilon(r) \qquad \hat{\mathbb{P}}\text{-a.s.} \tag{8.5}$$
for every 𝑟 ∈ R+ , where the two inequalities follow from the definitions of 𝑓𝑟★(𝑧)
and 𝜀(𝑟), respectively. In addition, we have E P̂ [ 𝑓𝑟★(𝑍)] = 0 and
$$\mathbb{E}_{\hat{\mathbb{P}}}\big[\phi(1 + f_r^\star(Z))\big] \;\le\; (1+\kappa(\varepsilon(r))) \cdot \mathbb{E}_{\hat{\mathbb{P}}}\big[f_r^\star(Z)^2\big] \;=\; r$$
for all sufficiently small 𝑟. The inequality in the above expression follows from (8.5)
and from the upper bound on 𝜙 in (8.3), which holds for all sufficiently small 𝜀. The
equality exploits the definition of 𝑓𝑟★. This shows that 𝑓𝑟★ constitutes a feasible solu-
tion for the maximization problem in (8.4) if 𝑟 is sufficiently small. Substituting 𝑓𝑟★
into (8.4) then yields the desired lower bound. Indeed, we have
$$\begin{aligned}
\sup_{\mathbb{P}\in\mathcal{P}} \mathbb{E}_{\mathbb{P}}[\ell(Z)]
&\ge \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + \mathbb{E}_{\hat{\mathbb{P}}}\big[\ell(Z) f_r^\star(Z)\big] \\
&= \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + \frac{r^{1/2}}{(1+\kappa(\varepsilon(r)))^{1/2}} \cdot \frac{\mathbb{E}_{\hat{\mathbb{P}}}\big[\ell(Z)\,(\ell(Z) - \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)])\big]}{\mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{1/2}} \\
&= \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + r^{1/2}\,\mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{1/2} + o(r^{1/2})
\end{aligned}$$
for all sufficiently small 𝑟, where the first equality follows from the definition
of 𝑓𝑟★. The second equality exploits the Taylor expansion of the inverse square root
function around 1 and the elementary observation that lim𝑟 ↓0 𝜅(𝜀(𝑟)) = 0.
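For the Pearson 𝜒²-divergence itself, the bound (8.3) holds with 𝜅 ≡ 0, so the density perturbation 𝑓𝑟★ from Step 1 can be written down explicitly. The sketch below builds the corresponding candidate worst-case distribution on a discrete example (all data hypothetical) and confirms that it is feasible and attains the value E[ℓ] + (𝑟 · V)^{1/2}.

```python
import numpy as np

q = np.array([0.25, 0.25, 0.25, 0.25])   # reference weights (hypothetical)
loss = np.array([0.0, 1.0, 2.0, 5.0])    # loss values at the atoms
r = 0.01                                  # small radius so that p stays >= 0

mean = q @ loss
var = q @ (loss - mean) ** 2

# Density perturbation f_r* from Step 1 (with kappa identically 0 for chi^2).
f = np.sqrt(r) * (loss - mean) / np.sqrt(var)
p = q * (1.0 + f)                         # candidate worst-case distribution

chi2 = q @ (p / q - 1.0) ** 2             # Pearson chi^2 divergence of p from q

print(np.all(p >= 0), abs(p.sum() - 1.0) < 1e-12,
      abs(chi2 - r) < 1e-12,
      abs(p @ loss - (mean + np.sqrt(r * var))) < 1e-12)
```

The perturbed weights are a valid distribution, sit exactly on the boundary of the divergence ball, and deliver the mean plus the standard-deviation term.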
Step 2. The Huber loss ℎ𝜀 : R → R with tuning parameter 𝜀 > 0 is defined through
$$h_\varepsilon(s) = \begin{cases} \tfrac{1}{2} s^2 & \text{if } |s| \le \varepsilon, \\ \varepsilon |s| - \tfrac{1}{2}\varepsilon^2 & \text{otherwise.} \end{cases}$$
By construction, ℎ𝜀 is continuously differentiable, depends quadratically on 𝑠 if |𝑠| ≤ 𝜀 and depends linearly on 𝑠 if |𝑠| > 𝜀. Its conjugate ℎ𝜀∗ : R → R satisfies
$$h_\varepsilon^*(t) = \begin{cases} \tfrac{1}{2} t^2 & \text{if } |t| \le \varepsilon, \\ \infty & \text{otherwise.} \end{cases}$$
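The conjugate formula for the Huber loss can be confirmed by brute force: for |𝑡| ≤ 𝜀, maximizing 𝑠𝑡 − ℎ𝜀(𝑠) over a fine grid should return approximately 𝑡²/2. A small sanity check (the parameter values are arbitrary):

```python
import numpy as np

def huber(s, eps):
    # Huber loss: quadratic on [-eps, eps], linear with slope eps outside.
    return np.where(np.abs(s) <= eps, 0.5 * s**2, eps * np.abs(s) - 0.5 * eps**2)

eps = 0.7
s = np.linspace(-50.0, 50.0, 2_000_001)  # wide grid: huber grows linearly, so
                                         # for |t| <= eps the sup sits at s = t
max_err = max(
    abs(np.max(t * s - huber(s, eps)) - 0.5 * t**2)
    for t in [-0.7, -0.3, 0.0, 0.5, 0.7]
)
print(max_err < 1e-6)
```

For |𝑡| > 𝜀 the objective grows without bound along the grid, matching the ∞ branch of the conjugate.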
The lower bound on 𝜙 in (8.3) and the convexity of 𝜙 imply that
$$\phi(s) \;\ge\; \frac{2}{1+\kappa(\varepsilon)}\, h_\varepsilon(s-1) \qquad \forall s \in \mathbb{R}$$
whenever 𝜀 is sufficiently small. This uniform lower bound on 𝜙 in terms of ℎ𝜀 gives rise to a uniform upper bound on 𝜙∗ in terms of ℎ𝜀∗. Indeed, we have
$$\phi^*(t) \;\le\; \sup_{s\in\mathbb{R}}\; st - \frac{2}{1+\kappa(\varepsilon)}\, h_\varepsilon(s-1)
\;=\; t + \frac{2}{1+\kappa(\varepsilon)}\, h_\varepsilon^*\!\left(\frac{t(1+\kappa(\varepsilon))}{2}\right)
\;=\; t + \begin{cases} \dfrac{(1+\kappa(\varepsilon))\,t^2}{4} & \text{if } |t| \le \dfrac{2\varepsilon}{1+\kappa(\varepsilon)}, \\[2mm] \infty & \text{otherwise,} \end{cases} \tag{8.6}$$
for all sufficiently small 𝜀. The first equality in (8.6) is obtained by applying the
variable transformation 𝑠 ← 𝑠 − 1 and by extracting the constant 2/(1 + 𝜅(𝜀)) from
the supremum. The second equality follows from the definition of ℎ∗𝜀 . By weak
duality as established in Theorem 4.15, we then find
$$\begin{aligned}
\sup_{\mathbb{P}\in\mathcal{P}} \mathbb{E}_{\mathbb{P}}[\ell(Z)]
&\le \inf_{\lambda_0\in\mathbb{R},\,\lambda\in\mathbb{R}_+} \;\lambda_0 + \lambda r + \mathbb{E}_{\hat{\mathbb{P}}}\big[(\phi^*)^\pi(\ell(Z)-\lambda_0,\lambda)\big] \\
&\le \left\{
\begin{array}{cl}
\inf\limits_{\lambda_0\in\mathbb{R},\,\lambda\in\mathbb{R}_+} & \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + \lambda r + \dfrac{1+\kappa(\varepsilon(r))}{4\lambda}\, \mathbb{E}_{\hat{\mathbb{P}}}\big[(\ell(Z)-\lambda_0)^2\big] \\[2mm]
\text{s.t.} & \hat{\mathbb{P}}\Big(|\ell(Z) - \lambda_0| \le \dfrac{2\varepsilon(r)\lambda}{1+\kappa(\varepsilon(r))}\Big) = 1,
\end{array}\right.
\end{aligned} \tag{8.7}$$
where the second inequality follows from the definition of the perspective function
and from (8.6), which holds for all sufficiently small 𝜀. Here, we have re-used the
function 𝜀(𝑟) introduced in Step 1. Next, we set 𝜆★0 = E P̂ [ℓ(𝑍)] and define
$$\lambda_r^\star = \frac{(1+\kappa(\varepsilon(r)))^{1/2}}{2\, r^{1/2}} \cdot \mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{1/2}$$
for any 𝑟 > 0. Note that (𝜆★0 , 𝜆★𝑟) is feasible in (8.7) provided that 𝑟 is sufficiently
small; in particular, 𝑟 must be small enough to satisfy 𝜅(𝜀(𝑟)) ≤ 3. Indeed, we have
$$\begin{aligned}
\hat{\mathbb{P}}\left(|\ell(Z) - \lambda_0^\star| \le \frac{2\varepsilon(r)\lambda_r^\star}{1+\kappa(\varepsilon(r))}\right)
&= \hat{\mathbb{P}}\left(\big|\ell(Z) - \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)]\big| \le \frac{\varepsilon(r)\,\mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{1/2}}{r^{1/2}\,(1+\kappa(\varepsilon(r)))^{1/2}}\right) \\
&= \hat{\mathbb{P}}\left(\big|\ell(Z) - \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)]\big| \le \frac{2 \operatorname{ess\,sup}_{\hat{\mathbb{P}}}\big|\ell(Z) - \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)]\big|}{(1+\kappa(\varepsilon(r)))^{1/2}}\right) = 1,
\end{aligned}$$
where the first equality follows from the definitions of 𝜆★0 and 𝜆★𝑟, the second
equality follows from the definition of 𝜀(𝑟), and the last equality holds because
𝜅(𝜀(𝑟)) ≤ 3. Substituting (𝜆★0 , 𝜆★𝑟) into (8.7) then yields the desired upper bound.
$$\begin{aligned}
\sup_{\mathbb{P}\in\mathcal{P}} \mathbb{E}_{\mathbb{P}}[\ell(Z)]
&\le \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + \lambda_r^\star r + \frac{1+\kappa(\varepsilon(r))}{4\lambda_r^\star}\, \mathbb{E}_{\hat{\mathbb{P}}}\big[(\ell(Z) - \lambda_0^\star)^2\big] \\
&= \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + \frac{(1+\kappa(\varepsilon(r)))^{1/2}}{2} \cdot r^{1/2}\, \mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{1/2} + \frac{(1+\kappa(\varepsilon(r)))^{1/2}}{2\,\mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{1/2}} \cdot r^{1/2}\, \mathbb{E}_{\hat{\mathbb{P}}}\big[(\ell(Z) - \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)])^2\big] \\
&= \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + r^{1/2}\, \mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{1/2} + o(r^{1/2}).
\end{aligned}$$
Here, the first equality follows from the definitions of 𝜆★0 and 𝜆★𝑟, and the second
equality holds because lim𝑟 ↓0 𝜅(𝜀(𝑟)) = 0. Hence, the claim follows.
allel line of research, Gotoh, Kim and Lim (2018, 2021) derive a Taylor expansion of the penalty-based worst-case expected loss sup P∈P(Z) E P [ℓ(𝑍)] − (1/𝑟) D 𝜙 (P, P̂). They focus again on discrete empirical reference distributions and provide both first- as well as higher-order terms of the corresponding Taylor expansion.
Maurer and Pontil (2009) show that variance-regularized empirical risk min-
imization may provide faster rates of convergence to the expected loss under the
population distribution compared to standard empirical risk minimization. This
improved convergence highlights the potential benefits of incorporating variance
regularization in the learning process. Unfortunately, simple stochastic optimiza-
tion problems with a mean-variance objective are NP-hard even if the underlying
loss function is convex in the decision variables (Ahmed 2006). In contrast, the
worst-case expectation with respect to any ambiguity set preserves the convexity of
the underlying loss function. Theorem 8.4 thus suggests that the worst-case expec-
ted loss over a restricted 𝜙-divergence ambiguity set provides a convex surrogate for
the nonconvex—but statistically attractive—variance-regularized empirical loss.
where the second equality exploits the symmetry of 𝐷ᵏℓ(ẑ) (Banach 1938, Satz 1). By slight abuse of notation, we use the same symbol ‖ · ‖ for the tensor norm as for the underlying vector norm ‖ · ‖. The following theorem generalizes Proposition 8.5 to any 𝑝 ∈ N. This result is due to Shafiee et al. (2023, Theorem 3.2).
Theorem 8.6 (Variation and Lipschitz Regularization). If P is the 𝑝-Wasserstein ambiguity set (2.28) for some 𝑝 ∈ N, where W𝑝 is induced by a norm ‖ · ‖ on Rᵈ, Z is convex and ℓ is 𝑝 − 1 times continuously differentiable, then we have
$$\sup_{\mathbb{P}\in\mathcal{P}} \mathbb{E}_{\mathbb{P}}[\ell(Z)] \;\le\; \mathbb{E}_{\hat{\mathbb{P}}}[\ell(\hat{Z})] + \sum_{k=1}^{p-1} \frac{r^k}{k!}\, \mathbb{E}_{\hat{\mathbb{P}}}\big[\|D^k \ell(\hat{Z})\|^{q_k}\big]^{\frac{1}{q_k}} + \frac{r^p}{p!}\, \mathrm{lip}(D^{p-1}\ell),$$
where 𝑝ₖ = 𝑝/𝑘 and 𝑞ₖ = 𝑝/(𝑝 − 𝑘) for all 𝑘 ∈ [𝑝 − 1].
Proof. Select any P ∈ P and any optimal coupling 𝛾★ ∈ Γ(P, P̂) with W𝑝(P, P̂) = E𝛾★[‖𝑍 − 𝑍̂‖ᵖ]^{1/𝑝}, which exists by Lemma 3.17. As 𝛾★ ∈ Γ(P, P̂), we have
$$\mathbb{E}_{\mathbb{P}}[\ell(Z)] - \mathbb{E}_{\hat{\mathbb{P}}}[\ell(\hat{Z})] \;=\; \mathbb{E}_{\gamma^\star}\big[\ell(Z) - \ell(\hat{Z})\big].$$
By (Krantz and Parks 2002, Theorem 2.2.5), we can expand ℓ(𝑧) − ℓ(ˆ𝑧) as a Taylor
series with Lagrange remainder. Thus, there exists a Borel function 𝑓 : Z ×Z → Z
that maps any pair (𝑧, 𝑧ˆ) to a point on the line segment between 𝑧 and 𝑧ˆ such that
$$\begin{aligned}
\ell(z) - \ell(\hat{z}) &= \sum_{k=1}^{p-1} \frac{1}{k!}\, D^k \ell(\hat{z})[z-\hat{z}]^k + \frac{1}{p!}\, D^p \ell(f(z,\hat{z}))[z-\hat{z}]^p \\
&\le \sum_{k=1}^{p-1} \frac{1}{k!}\, \|D^k \ell(\hat{z})\|\, \|z-\hat{z}\|^k + \frac{1}{p!}\, \|D^p \ell(f(z,\hat{z}))\|\, \|z-\hat{z}\|^p.
\end{aligned} \tag{8.9}$$
The inequality in (8.9) follows from the definition of the tensor norm. By Hölder’s
inequality, the expected value of the 𝑘-th term in (8.9) with respect to 𝛾★ satisfies
$$\mathbb{E}_{\gamma^\star}\big[\|D^k \ell(\hat{Z})\|\, \|Z-\hat{Z}\|^k\big]
\;\le\; \mathbb{E}_{\gamma^\star}\big[\|Z-\hat{Z}\|^{k p_k}\big]^{\frac{1}{p_k}}\; \mathbb{E}_{\gamma^\star}\big[\|D^k \ell(\hat{Z})\|^{q_k}\big]^{\frac{1}{q_k}}
\;\le\; r^k\, \mathbb{E}_{\hat{\mathbb{P}}}\big[\|D^k \ell(\hat{Z})\|^{q_k}\big]^{\frac{1}{q_k}},$$
where 𝑝ₖ = 𝑝/𝑘 and 𝑞ₖ = 𝑝/(𝑝 − 𝑘) represent conjugate exponents. The second inequality in the above expression holds because 𝛾★ ∈ Γ(P, P̂), which implies that
$$\mathbb{E}_{\gamma^\star}\big[\|Z-\hat{Z}\|^{k p_k}\big]^{\frac{1}{p_k}} = \mathbb{E}_{\gamma^\star}\big[\|Z-\hat{Z}\|^{p}\big]^{\frac{k}{p}} = W_p(\mathbb{P}, \hat{\mathbb{P}})^k \le r^k.$$
As Z is convex, we may conclude that 𝑓 (𝑧, 𝑧ˆ) ∈ Z for all 𝑧, 𝑧ˆ ∈ Z. Thus, the
expected value of the Lagrange remainder in (8.9) with respect to 𝛾★ satisfies
$$\mathbb{E}_{\gamma^\star}\big[\|D^p \ell(f(Z,\hat{Z}))\|\, \|Z-\hat{Z}\|^p\big]
\;\le\; \sup_{\hat{z}\in\mathcal{Z}} \|D^p \ell(\hat{z})\| \; \mathbb{E}_{\gamma^\star}\big[\|Z-\hat{Z}\|^p\big]
\;\le\; r^p \sup_{\hat{z}\in\mathcal{Z}} \|D^p \ell(\hat{z})\|
\;\le\; r^p\, \mathrm{lip}(D^{p-1}\ell),$$
where the second inequality exploits again Hölder’s inequality and the properties
of the optimal coupling 𝛾★. The third inequality follows from the mean value
theorem. The desired inequality is finally obtained by combining the upper bounds
on the expected values of all terms in (8.9) with respect to 𝛾★.
Theorem 8.6 shows that the worst-case expected loss over a 𝑝-Wasserstein ball
is bounded above by the sum of the expected loss under the reference distribution,
𝑝 − 1 variation regularization terms, and a Lipschitz regularization term. Note
that 𝑝 1 = 𝑝 and 𝑞 = 𝑞 1 = 𝑝/(𝑝 − 1) are Hölder conjugates and that 𝐷 1 ℓ = ∇ℓ.
Thus, the term corresponding to 𝑘 = 1 in the upper bound of Theorem 8.6 can be expressed more explicitly as E P̂ [‖∇ℓ(𝑍̂)‖^𝑞]^{1/𝑞}. The next theorem, which is adapted from (Bartl, Drapeau, Oblój and Wiesel 2021, Gao et al. 2024), reveals that this variation regularizer matches the leading term of a Taylor expansion of the worst-case expected loss in the radius 𝑟 of the 𝑝-Wasserstein ball for any 𝑝 > 1.
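For a linear loss ℓ(𝑧) = 𝜃ᵀ𝑧, only the 𝑘 = 1 term of Theorem 8.6 survives and the bound reduces to E P̂ [ℓ(𝑍̂)] + 𝑟‖𝜃‖∗. Shifting every atom of a discrete reference distribution by 𝑟 in the steepest-ascent direction shows that this value is attained, so the bound is tight in this case. A sketch with the Euclidean norm (the data are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0, 0.5])
atoms = rng.normal(size=(20, 3))       # atoms of an empirical reference P-hat
r = 0.3                                # Wasserstein radius (any p >= 1)

nominal = np.mean(atoms @ theta)
dual_norm = np.linalg.norm(theta)      # the Euclidean norm is self-dual

# Transport every atom by distance exactly r in the steepest-ascent
# direction of ℓ(z) = θᵀz; this keeps W_p(P, P-hat) <= r for every p.
shifted = atoms + r * theta / dual_norm
worst = np.mean(shifted @ theta)

# The shifted distribution attains the first-order upper bound.
print(abs(worst - (nominal + r * dual_norm)) < 1e-12)
```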
Recall that all norms on R𝑑 are topologically equivalent. Thus, in the smoothness
condition we could equivalently use the primal norm instead of the dual norm to
measure differences between gradients. However, working with the dual norm is
more convenient and will simplify the proof of Theorem 8.7.
Proof of Theorem 8.7. For any fixed 𝛿 ∈ R+ and 𝑧ˆ ∈ Z, we define the variation of
the loss function ℓ over a norm ball of radius 𝛿 around 𝑧ˆ as
$$V_\delta(\hat{z}) = \sup_{z\in\mathcal{Z}} \big\{ \ell(z) - \ell(\hat{z}) \;:\; \|z - \hat{z}\| \le \delta \big\}.$$
Note that 𝑉 𝛿 (ˆ𝑧 ) is finite because ℓ is continuous thanks to the smoothness condition.
As a preparation to prove the theorem, we first establish simple upper and lower
bounds on 𝑉 𝛿 (ˆ𝑧 ). As Z is convex, the line segment from 𝑧ˆ to any 𝑧 ∈ Z is contained
in Z. The mean value theorem then implies that there exists a point 𝑧¯ ∈ Z on this
line segment that satisfies ℓ(𝑧) − ℓ(ẑ) = ∇ℓ(𝑧̄)ᵀ(𝑧 − ẑ). Thus, we have
$$\ell(z) - \ell(\hat{z}) - \nabla\ell(\hat{z})^\top (z-\hat{z}) = \nabla\ell(\bar{z})^\top (z-\hat{z}) - \nabla\ell(\hat{z})^\top (z-\hat{z}) \le \|\nabla\ell(\bar{z}) - \nabla\ell(\hat{z})\|_*\, \|z-\hat{z}\| \le L \|z-\hat{z}\|^2,$$
where the two inequalities follow from the definition of the dual norm and from the smoothness condition, respectively. This implies that
$$\nabla\ell(\hat{z})^\top (z-\hat{z}) - L\|z-\hat{z}\|^2 \;\le\; \ell(z) - \ell(\hat{z}) \;\le\; \nabla\ell(\hat{z})^\top (z-\hat{z}) + L\|z-\hat{z}\|^2. \tag{8.11}$$
The first inequality in (8.11) gives rise to a lower bound on 𝑉𝛿(ẑ). Indeed, we find
$$\begin{aligned}
V_\delta(\hat{z}) &\ge \sup_{z\in\mathcal{Z}} \big\{ \nabla\ell(\hat{z})^\top (z-\hat{z}) - L\|z-\hat{z}\|^2 \;:\; \|z-\hat{z}\| \le \delta \big\} \\
&\ge \sup_{z\in\mathcal{Z}} \big\{ \nabla\ell(\hat{z})^\top (z-\hat{z}) \;:\; \|z-\hat{z}\| \le \delta \big\} - L\delta^2 \;=\; \|\nabla\ell(\hat{z})\|_*\, \delta - L\delta^2,
\end{aligned} \tag{8.12}$$
where the equality follows from the definition of the dual norm. Similarly, the
second inequality in (8.11) gives rise to the following upper bound on 𝑉 𝛿 (ˆ𝑧 ).
$$V_\delta(\hat{z}) \;\le\; \|\nabla\ell(\hat{z})\|_*\, \delta + L\delta^2 \qquad \forall \delta \in \mathbb{R}_+. \tag{8.13}$$
This upper bound grows quadratically with 𝛿 and is therefore too loose for our
purposes if 𝑝 < 2. In this case, we must establish an alternative upper bound that
grows only as 𝛿 𝑝 . This is possible thanks to the growth condition on ℓ. To see this,
define the worst-case variation of ℓ over any ball of radius 𝛿0 as
$$V = \sup\big\{ \ell(z) - \ell(\hat{z}) \;:\; z, \hat{z} \in \mathcal{Z},\; \|z - \hat{z}\| \le \delta_0 \big\}.$$
The second inequality follows from the growth condition on ℓ and the estimates ‖𝑧 − (𝑧 + 𝑑)‖ = 𝛿₀ and ‖(𝑧 + 𝑑) − ẑ‖ ≥ ‖𝑑‖ − ‖𝑧 − ẑ‖ ≥ 𝛿₀. Thus, ℓ(𝑧) − ℓ(ẑ) admits a finite upper bound independent of 𝑧 and ẑ, which confirms that 𝑉 is finite.
The growth condition on ℓ ensures that 𝑉𝛿(ẑ) ≤ max{𝑉, 𝑔𝛿ᵖ}. Combining this estimate with (8.13) and defining 𝑢(𝛿) = min{max{𝑉, 𝑔𝛿ᵖ}, 𝐿𝛿²} yields
$$V_\delta(\hat{z}) \;\le\; \min\Big\{ \max\{V,\, g\delta^p\},\; \|\nabla\ell(\hat{z})\|_*\, \delta + L\delta^2 \Big\} \;\le\; \|\nabla\ell(\hat{z})\|_*\, \delta + u(\delta).$$
Note that 𝑢(𝛿) = 𝑔𝛿 𝑝 for all sufficiently large 𝛿 and 𝑢(𝛿) = 𝐿𝛿2 for all sufficiently
small 𝛿. In between there is a (possibly empty) interval on which 𝑢(𝛿) = 𝑉 is
constant. Since 𝑝 ≤ 2, in all three regimes, 𝑢(𝛿) can be bounded above by 𝑔′ 𝛿 𝑝
for some growth parameter 𝑔′ ∈ R+ . Setting 𝐺 to the largest of these three growth
parameters, we may thus conclude that
$$V_\delta(\hat{z}) \;\le\; \|\nabla\ell(\hat{z})\|_*\, \delta + G\delta^p \qquad \forall \delta \in \mathbb{R}_+. \tag{8.14}$$
Thus, if 𝑝 ≤ 2, then 𝑉 𝛿 (ˆ𝑧) admits an upper bound that grows only as 𝛿 𝑝 .
The remainder of the proof proceeds in two steps. First, we show that the right
hand side of (8.10) provides a lower bound on the worst-case expected loss over P
(Step 1). Next, we show that the right hand side of (8.10) also provides an upper
bound on the worst-case expected loss over P (Step 2). This will prove the claim.
where the set Δ in (8.15b) represents the family of all Borel functions 𝛿 : Z → R+ .
The second inequality in the above expression can be justified as follows. Select
any 𝛿 ∈ Δ feasible in (8.15b), and define 𝑓 ∈ F as any Borel function satisfying
$$f(\hat{z}) \in \mathop{\arg\max}_{z\in\mathcal{Z}} \big\{ \ell(z) \;:\; \|z - \hat{z}\| \le \delta(\hat{z}) \big\} \qquad \forall \hat{z} \in \mathcal{Z}.$$
Such a Borel function exists thanks to (Rockafellar and Wets 2009, Corollary 14.6
and Theorem 14.37). As 𝛿 is feasible in (8.15b), this function 𝑓 satisfies
$$\mathbb{E}_{\hat{\mathbb{P}}}\big[\|f(\hat{Z}) - \hat{Z}\|^p\big] \;\le\; \mathbb{E}_{\hat{\mathbb{P}}}\big[\delta(\hat{Z})^p\big] \;\le\; r^p$$
and is thus feasible in (8.15a). Its objective function value in (8.15a) satisfies
$$\mathbb{E}_{\hat{\mathbb{P}}}\big[\ell(f(\hat{Z}))\big] \;=\; \mathbb{E}_{\hat{\mathbb{P}}}\big[\ell(\hat{Z})\big] + \mathbb{E}_{\hat{\mathbb{P}}}\big[V_{\delta(\hat{Z})}(\hat{Z})\big].$$
Hence, any feasible solution in (8.15b) gives rise to a feasible solution in (8.15a)
with the same objective function value. This proves the inequality in (8.15a).
Substituting the lower bound (8.12) on 𝑉𝛿(ẑ) into (8.15b) then yields the estimate
$$\sup_{\mathbb{P}\in\mathcal{P}} \mathbb{E}_{\mathbb{P}}[\ell(Z)] \;\ge\; \mathbb{E}_{\hat{\mathbb{P}}}\big[\ell(\hat{Z})\big] + \left\{
\begin{array}{cl}
\sup\limits_{\delta\in\Delta} & \mathbb{E}_{\hat{\mathbb{P}}}\big[\|\nabla\ell(\hat{Z})\|_*\, \delta(\hat{Z}) - L\,\delta(\hat{Z})^2\big] \\[1mm]
\text{s.t.} & \mathbb{E}_{\hat{\mathbb{P}}}\big[\delta(\hat{Z})^p\big] \le r^p.
\end{array}\right. \tag{8.16}$$
If ‖∇ℓ(𝑍̂)‖∗ = 0 P̂-almost surely, then we have established the desired lower bound. From now on we may thus assume that E P̂ [‖∇ℓ(𝑍̂)‖∗] > 0. Next, we construct a function 𝛿★ ∈ Δ feasible in the maximization problem in (8.16) and use its objective function value as a lower bound on the problem's supremum. Specifically, we set
$$\delta^\star(\hat{z}) = \frac{\|\nabla\ell(\hat{z})\|_*^{q-1}\, r}{\mathbb{E}_{\hat{\mathbb{P}}}\big[\|\nabla\ell(\hat{Z})\|_*^q\big]^{1/p}} \qquad \forall \hat{z} \in \mathcal{Z},$$
which is well-defined by the integrability condition. As 𝑞 − 1 = 𝑞/𝑝, we find
$$\mathbb{E}_{\hat{\mathbb{P}}}\big[\delta^\star(\hat{Z})^p\big] = r^p \qquad \text{and} \qquad \mathbb{E}_{\hat{\mathbb{P}}}\big[\|\nabla\ell(\hat{Z})\|_*\, \delta^\star(\hat{Z})\big] = r \cdot \mathbb{E}_{\hat{\mathbb{P}}}\big[\|\nabla\ell(\hat{Z})\|_*^q\big]^{\frac{1}{q}}.$$
where the first inequality follows from the estimate (8.13), and the second inequality
holds because the supremum over 𝛿 is duplicated. The resulting upper bound on
the worst-case expected loss thus coincides with the sum of two infima. One
readily verifies that the maximization problem over 𝛿 in (8.18a) is solved by 𝛿★ = (𝑝𝜆₁)^{−𝑞/𝑝} ‖∇ℓ(𝑍̂)‖∗^{𝑞/𝑝}. Thus, the infimum in (8.18a) equals
$$\inf_{\lambda_1\in\mathbb{R}_+} \lambda_1 r^p + \frac{1}{q}\,(\lambda_1 p)^{-q/p}\, \mathbb{E}_{\hat{\mathbb{P}}}\big[\|\nabla\ell(\hat{Z})\|_*^q\big] \;=\; r \cdot \mathbb{E}_{\hat{\mathbb{P}}}\big[\|\nabla\ell(\hat{Z})\|_*^q\big]^{\frac{1}{q}}, \tag{8.19a}$$
where the equality holds because the resulting minimization problem over 𝜆₁ is solved by 𝜆₁★ = 𝑝^{−1} 𝑟^{−𝑝/𝑞} E P̂ [‖∇ℓ(𝑍̂)‖∗^𝑞]^{1/𝑞}. Similarly, the maximization problem over 𝛿 in (8.18b) is solved by 𝛿★ = 𝐶₁ 𝜆₂^{−1/(𝑝−2)}, where 𝐶₁ represents a positive constant that only depends on 𝑝 and 𝐿. Thus, the infimum in (8.18b) equals
$$\inf_{\lambda_2\in\mathbb{R}_+} \lambda_2 r^p + C_2\, \lambda_2^{-\frac{2}{p-2}} \;=\; C_3\, r^2, \tag{8.19b}$$
where 𝐶2 and 𝐶3 are other positive constants depending on 𝑝 and 𝐿. The equality
in (8.19b) is obtained by solving the minimization problem over 𝜆 2 in closed form.
Replacing (8.18a) with (8.19a) and (8.18b) with (8.19b) finally yields
$$\sup_{\mathbb{P}\in\mathcal{P}} \mathbb{E}_{\mathbb{P}}[\ell(Z)] \;\le\; \mathbb{E}_{\hat{\mathbb{P}}}\big[\ell(\hat{Z})\big] + r \cdot \mathbb{E}_{\hat{\mathbb{P}}}\big[\|\nabla\ell(\hat{Z})\|_*^q\big]^{\frac{1}{q}} + \mathcal{O}(r^2).$$
Assume next that 𝑝 ≤ 2. In this case, we have
$$\begin{aligned}
\sup_{\mathbb{P}\in\mathcal{P}} \mathbb{E}_{\mathbb{P}}[\ell(Z)]
&\le \mathbb{E}_{\hat{\mathbb{P}}}\big[\ell(\hat{Z})\big] + \inf_{\lambda_1,\lambda_2\in\mathbb{R}_+} (\lambda_1+\lambda_2)\, r^p + \mathbb{E}_{\hat{\mathbb{P}}}\Big[\sup_{\delta\in\mathbb{R}_+} \|\nabla\ell(\hat{Z})\|_*\, \delta + G\delta^p - (\lambda_1+\lambda_2)\,\delta^p\Big] \\
&\le \mathbb{E}_{\hat{\mathbb{P}}}\big[\ell(\hat{Z})\big] + \inf_{\lambda_1\in\mathbb{R}_+} \Big\{\lambda_1 r^p + \mathbb{E}_{\hat{\mathbb{P}}}\Big[\sup_{\delta\in\mathbb{R}_+} \|\nabla\ell(\hat{Z})\|_*\, \delta - \lambda_1\delta^p\Big]\Big\} \qquad\qquad (8.20\text{a}) \\
&\quad\; + \inf_{\lambda_2\in\mathbb{R}_+} \Big\{\lambda_2 r^p + \sup_{\delta\in\mathbb{R}_+} G\delta^p - \lambda_2\delta^p\Big\}, \qquad\qquad\qquad\qquad\qquad\quad\;\; (8.20\text{b})
\end{aligned}$$
where the first inequality follows from the estimate (8.14). Note that the infimum
in (8.20a) is identical to that in (8.18a) and thus simplifies to (8.19a). Next, note
that the maximization problem over 𝛿 in (8.20b) is unbounded unless 𝜆 2 ≥ 𝐺.
This condition thus constitutes an implicit constraint for the minimization problem
over 𝜆 2 . Whenever 𝜆 2 satisfies this constraint, however, the supremum over 𝛿 eval-
uates to 0, and therefore the infimum over 𝜆 2 evaluates to 𝐺𝑟 𝑝 . Replacing (8.20a)
with (8.19a) and (8.20b) with 𝐺𝑟 𝑝 finally yields
$$\sup_{\mathbb{P}\in\mathcal{P}} \mathbb{E}_{\mathbb{P}}[\ell(Z)] \;\le\; \mathbb{E}_{\hat{\mathbb{P}}}\big[\ell(\hat{Z})\big] + r \cdot \mathbb{E}_{\hat{\mathbb{P}}}\big[\|\nabla\ell(\hat{Z})\|_*^q\big]^{\frac{1}{q}} + \mathcal{O}(r^p).$$
As both O(𝑟²) and O(𝑟ᵖ) with 𝑝 > 1 are of the order 𝑜(𝑟), the claim follows.
The proof of Theorem 8.7 reveals that the variation 𝑉𝛿(ẑ) equals ‖∇ℓ(ẑ)‖∗ 𝛿 to first order in 𝛿. Hence, it is natural to refer to the regularization term E P̂ [‖∇ℓ(𝑍̂)‖∗^𝑞]^{1/𝑞} appearing in (8.10) as the total variation.
Regularizers penalizing the Lipschitz moduli, gradients, Hessians or tensors of
higher-order partial derivatives are successfully used in the adversarial training of
neural networks (Lyu, Huang and Liang 2015, Jakubovitz and Giryes 2018, Finlay
and Oberman 2021, Bai, He, Jiang and Obloj 2017) and in the stabilizing training
of generative adversarial networks (Roth, Lucchi, Nowozin and Hofmann 2017,
Nagarajan and Kolter 2017, Gulrajani, Ahmed, Arjovsky, Dumoulin and Courville
2017). However, these regularizers introduce nonconvexity into an otherwise
convex optimization problem. Theorems 8.6 and 8.7 thus suggest that the worst-
case expected loss with respect to a Wasserstein ambiguity set provides a convex
surrogate for the empirical loss with Lipschitz and/or variation regularizers.
where the equality holds because the L𝑞 -norm is dual to the L 𝑝 -norm.
The results of this section also rely on the fundamentals of comonotonicity
theory, which we review next. For any Borel measurable function 𝑓 : R𝑑 → R the
distribution function 𝐹 : R → [0, 1] of the random variable 𝑓 (𝑍) under P is defined
through 𝐹(𝜏) = P( 𝑓 (𝑍) ≤ 𝜏) for every 𝜏 ∈ R, and the corresponding (left) quantile function 𝐹 ← : [0, 1] → R is defined through 𝐹 ← (𝑞) = inf{𝜏 ∈ R : 𝐹(𝜏) ≥ 𝑞}
for every 𝑞 ∈ [0, 1]. Note that if 𝐹 is invertible, then 𝐹 ← = 𝐹 −1 . Note also
that 𝐹 is generally right-continuous, whereas 𝐹 ← is generally left-continuous. The
definition of the quantile function 𝐹 ← also readily implies the equivalence
𝐹(𝜏) ≥ 𝑞 ⇐⇒ 𝜏 ≥ 𝐹 ← (𝑞) ∀𝜏 ∈ R, ∀𝑞 ∈ [0, 1]. (8.21)
Definition 8.11 (Comonotonicity). Two random variables 𝑓 (𝑍) and 𝑔(𝑍) induced
by Borel measurable functions 𝑓 , 𝑔 : R𝑑 → R are comonotonic under P if
P ( 𝑓 (𝑍) ≤ 𝜏1 ∧ 𝑔(𝑍) ≤ 𝜏2 ) = min {𝐹(𝜏1 ), 𝐺(𝜏2 )} ∀𝜏1 , 𝜏2 ∈ R,
where 𝐹 and 𝐺 denote the distribution functions of 𝑓 (𝑍) and 𝑔(𝑍) under P.
The following proposition sheds more light on Definition 8.11. It shows that
comonotonic random variables can essentially always be expressed as functions of
each other (McNeil, Frey and Embrechts 2015, Corollary 5.17).
Proposition 8.12 (Comonotonicity). Let 𝑓 (𝑍) and 𝑔(𝑍) be two random variables
with respective distribution functions 𝐹 and 𝐺 under P as in Definition 8.11. If 𝐹
is continuous, then 𝑓 (𝑍) and 𝑔(𝑍) are comonotonic under P if and only if
𝑔(𝑍) = 𝐺 ← (𝐹( 𝑓 (𝑍))) P-a.s.
Proof. Note first that 𝐹( 𝑓 (𝑍)) follows the standard uniform distribution on [0, 1]
under P. To see this, note that for any 𝑞 ∈ [0, 1] we have
P (𝐹( 𝑓 (𝑍)) ≤ 𝑞) = P 𝑓 (𝑍) ≤ 𝐹 ← (𝑞) = 𝐹 𝐹 ← (𝑞) = 𝑞,
where the first two equalities follow from the definitions of 𝐹 ← and 𝐹, respectively,
while the last equality holds because 𝐹 is continuous.
Assume now that 𝑓 (𝑍) and 𝑔(𝑍) are comonotonic under P. Hence, we have
P ( 𝑓 (𝑍) ≤ 𝜏1 ∧ 𝑔(𝑍) ≤ 𝜏2 ) = min {𝐹(𝜏1 ), 𝐺(𝜏2 )}
= P (𝐹( 𝑓 (𝑍)) ≤ min {𝐹(𝜏1 ), 𝐺(𝜏2 )})
= P (𝐹( 𝑓 (𝑍)) ≤ 𝐹(𝜏1 ) ∧ 𝐹( 𝑓 (𝑍)) ≤ 𝐺(𝜏2 ))
= P 𝐹 ← (𝐹( 𝑓 (𝑍))) ≤ 𝜏1 ∧ 𝐺 ← (𝐹( 𝑓 (𝑍))) ≤ 𝜏2
for all 𝜏1 , 𝜏2 ∈ R. Here, the second equality holds because 𝐹( 𝑓 (𝑍)) follows the
standard uniform distribution under P. The last equality holds thanks to (8.21). As
𝐹 ← (𝐹( 𝑓 (𝑍))) is P-almost surely equal to 𝑓 (𝑍), we thus have
P ( 𝑓 (𝑍) ≤ 𝜏1 ∧ 𝑔(𝑍) ≤ 𝜏2 ) = P 𝑓 (𝑍) ≤ 𝜏1 ∧ 𝐺 ← (𝐹( 𝑓 (𝑍))) ≤ 𝜏2 .
for all 𝜏1 , 𝜏2 ∈ R. Hence, ( 𝑓 (𝑍), 𝑔(𝑍)) and ( 𝑓 (𝑍), 𝐺 ← (𝐹( 𝑓 (𝑍)))) are equal in law
under P. This implies in particular that the distribution of 𝑔(𝑍) conditional on 𝑓 (𝑍)
coincides with the distribution of 𝐺 ← (𝐹( 𝑓 (𝑍))) conditional on 𝑓 (𝑍) under P. As
the latter distribution is given by the Dirac point mass at 𝐺 ← (𝐹( 𝑓 (𝑍))), we may
conclude that 𝑔(𝑍) is P-almost surely equal to 𝐺 ← (𝐹( 𝑓 (𝑍))).
Assume now that 𝑔(𝑍) = 𝐺 ← (𝐹( 𝑓 (𝑍))) P-almost surely. Thus, we have
$$\mathbb{P}\big( f(Z) \le \tau_1 \wedge g(Z) \le \tau_2 \big) = \mathbb{P}\big( f(Z) \le \tau_1 \wedge G^\leftarrow(F(f(Z))) \le \tau_2 \big) = \mathbb{P}\big( F(f(Z)) \le F(\tau_1) \wedge F(f(Z)) \le G(\tau_2) \big) = \min\{F(\tau_1),\, G(\tau_2)\}$$
for all 𝜏₁, 𝜏₂ ∈ R, where the second equality exploits (8.21) and the fact that the events { 𝑓 (𝑍) ≤ 𝜏₁} and {𝐹( 𝑓 (𝑍)) ≤ 𝐹(𝜏₁)} differ only by a P-null set, and the third equality holds because 𝐹( 𝑓 (𝑍)) follows the standard uniform distribution under P. Hence, 𝑓 (𝑍) and 𝑔(𝑍) are comonotonic under P.
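Proposition 8.12 suggests a concrete recipe for building a comonotonic pair from a sample: give 𝑔 the same ranks as 𝑓 while keeping its own marginal. Definition 8.11 can then be checked exactly for the resulting empirical distributions; the sketch below uses made-up samples.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)                 # values of f(Z) under an empirical P
y = rng.exponential(size=n)            # target marginal for g(Z)

# Quantile-transform construction: give g the same ranks as f,
# but the marginal distribution of y.
ranks = np.argsort(np.argsort(x))
g = np.sort(y)[ranks]

# Check Definition 8.11 on a grid of thresholds: the joint distribution
# function must equal the Frechet upper bound min{F, G} everywhere.
taus1 = np.quantile(x, np.linspace(0.05, 0.95, 19))
taus2 = np.quantile(y, np.linspace(0.05, 0.95, 19))
max_gap = 0.0
for t1 in taus1:
    for t2 in taus2:
        joint = np.mean((x <= t1) & (g <= t2))
        frechet = min(np.mean(x <= t1), np.mean(g <= t2))
        max_gap = max(max_gap, abs(joint - frechet))
print(max_gap)
```

Because the two samples share their ranks, the joint counts coincide with the Fréchet bound exactly, so the gap is zero.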
Next, we show that the correlation of two random variables with fixed marginals
is maximal if they are comonotonic (McNeil et al. 2015, Theorem 5.25).
Theorem 8.13 (Attainable Correlations). Let 𝑓 , 𝑓 ★, 𝑔 and 𝑔★ be real-valued Borel
measurable functions on R𝑑 . Assume that, if 𝑍 is governed by P, then 𝑓 (𝑍)
and 𝑓 ★(𝑍) have the same distribution function 𝐹, whereas 𝑔(𝑍) and 𝑔★(𝑍) have
the same distribution function 𝐺. If 𝑓 ★(𝑍) and 𝑔★(𝑍) are comonotonic, then
$$\mathbb{E}_{\mathbb{P}}[f(Z)\cdot g(Z)] \;\le\; \mathbb{E}_{\mathbb{P}}\big[f^\star(Z)\cdot g^\star(Z)\big].$$
Proof. Define the joint distribution function 𝐻 : R2 → [0, 1] of 𝑓 (𝑍) and 𝑔(𝑍)
under P via 𝐻(𝜏1 , 𝜏2 ) = P( 𝑓 (𝑍) ≤ 𝜏1 ∧ 𝑔(𝑍) ≤ 𝜏2 ) for all 𝜏1 , 𝜏2 ∈ R. By (McNeil
et al. 2015, Lemma 5.24), the covariance of 𝑓 (𝑍) and 𝑔(𝑍) under P satisfies
$$\mathrm{cov}_{\mathbb{P}}(f(Z), g(Z)) = \int_{-\infty}^{+\infty}\!\!\int_{-\infty}^{+\infty} \big(H(\tau_1,\tau_2) - F(\tau_1)\, G(\tau_2)\big)\, \mathrm{d}\tau_1\, \mathrm{d}\tau_2. \tag{8.22}$$
In addition, by the classical Fréchet bounds for copulas (McNeil et al. 2015,
Remark 5.8), we know that 𝐻(𝜏1 , 𝜏2 ) ≤ min{𝐹(𝜏1 ), 𝐺(𝜏2 )} for all 𝜏1 , 𝜏2 ∈ R. As
the marginal distribution functions 𝐹 and 𝐺 are fixed, it is evident from (8.22)
that the covariance of the random variables 𝑓 (𝑍) and 𝑔(𝑍) is maximized if their
joint distribution function 𝐻(𝜏1 , 𝜏2 ) coincides with its Fréchet upper bound. This,
however, happens if and only if 𝑓 (𝑍) and 𝑔(𝑍) are comonotonic under P. We have
thus shown that covP ( 𝑓 (𝑍), 𝑔(𝑍)) ≤ covP ( 𝑓 ★(𝑍), 𝑔★(𝑍)), which in turn implies that
$$\begin{aligned}
\mathbb{E}_{\mathbb{P}}[f(Z)\cdot g(Z)] &= \mathrm{cov}_{\mathbb{P}}(f(Z), g(Z)) + \mathbb{E}_{\mathbb{P}}[f(Z)]\cdot\mathbb{E}_{\mathbb{P}}[g(Z)] \\
&\le \mathrm{cov}_{\mathbb{P}}(f^\star(Z), g^\star(Z)) + \mathbb{E}_{\mathbb{P}}\big[f^\star(Z)\big]\cdot\mathbb{E}_{\mathbb{P}}\big[g^\star(Z)\big] \\
&= \mathbb{E}_{\mathbb{P}}\big[f^\star(Z)\cdot g^\star(Z)\big].
\end{aligned}$$
Here, the inequality exploits the assumption that 𝑓 (𝑍) equals 𝑓 ★(𝑍) in law and
that 𝑔(𝑍) equals 𝑔★(𝑍) in law under P. Hence, the claim follows.
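Theorem 8.13 is an integral analogue of the rearrangement inequality: among all couplings of two samples with fixed marginals, pairing the order statistics (the comonotonic coupling) maximizes the mean product. A brute-force check against random permutations (data hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
f_vals = rng.normal(size=50)
g_vals = rng.normal(size=50)

# Comonotonic coupling: pair the order statistics of the two samples.
best = np.mean(np.sort(f_vals) * np.sort(g_vals))

# Any other coupling with the same marginals pairs the values by some
# permutation; none of them should beat the sorted pairing.
trials = max(np.mean(f_vals * rng.permutation(g_vals)) for _ in range(1000))
print(trials <= best + 1e-12)
```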
We are now ready to show that if 𝜚 is a Lipschitz continuous L 𝑝 -risk measure
and ℓ is a Lipschitz continuous loss function, then the risk 𝜚P [ℓ(𝑍)] is Lipschitz
continuous in the distribution P with respect to the 𝑝-Wasserstein distance.
Theorem 8.14 (Lipschitz Continuity of Risk Measures). If ℓ : Rᵈ → R is a Lipschitz continuous loss function with respect to some norm ‖ · ‖ on Rᵈ, 𝑝 ≥ 1 and 𝜚 is a Lipschitz continuous and law-invariant convex L𝑝-risk measure, then
$$\big| \varrho_{\mathbb{P}}[\ell(Z)] - \varrho_{\hat{\mathbb{P}}}[\ell(\hat{Z})] \big| \;\le\; \mathrm{lip}(\varrho)\cdot\mathrm{lip}(\ell)\cdot W_p(\mathbb{P},\hat{\mathbb{P}})$$
for all P, P̂ ∈ P(Rᵈ). Here, W𝑝 is defined with respect to ‖ · ‖, and 1/𝑝 + 1/𝑞 = 1.
for all ℎ′ ∈ L𝑞 (P) (Rockafellar 1974, Theorem 5). The relation (8.23b) defines
a law-invariant convex risk measure 𝜚∗ . Indeed, 𝜚∗ is convex because pointwise
suprema of affine functions are convex. In addition, 𝜚∗ inherits law-invariance
from 𝜚. Note that ℎ ∈ L𝑞 (P) attains the supremum in (8.23a) at ℓ ′ = ℓ if and only if
𝜚P [ℓ(𝑍)] = E P [ℎ(𝑍) · ℓ(𝑍)] − 𝜚∗P [ℎ(𝑍)]
$$\hat{h}^\star(\hat{z}) = F^\leftarrow\big(\hat{F}(\ell(\hat{z}))\big) \qquad \forall \hat{z} \in \mathbb{R}^d.$$
Note that 𝐹̂ is continuous because P̂ is non-atomic and ℓ is (Lipschitz) continuous. By Proposition 8.12, the random variables ĥ★(𝑍̂) and ℓ(𝑍̂) are thus comonotonic and have distribution functions 𝐹 and 𝐹̂ under P̂, respectively. Hence, ĥ★ is feasible in the maximization problem in (8.24). In addition, by Theorem 8.13, ĥ★ is optimal.
Next, select any transportation plan 𝛾 ∈ Γ(P, P̂). As the marginal distributions
of 𝑍 and 𝑍̂ under 𝛾 are given by P and P̂, respectively, the above implies that
$$\varrho_{\mathbb{P}}[\ell(Z)] - \varrho_{\hat{\mathbb{P}}}[\ell(\hat{Z})] \;\le\; \mathbb{E}_{\gamma}[h(Z)\cdot\ell(Z)] \;-\; \left\{
\begin{array}{cl}
\sup\limits_{\hat{h}\in L^q(\gamma)} & \mathbb{E}_{\gamma}\big[\hat{h}(Z,\hat{Z})\cdot\ell(\hat{Z})\big] \\[1mm]
\text{s.t.} & \gamma\big(\hat{h}(Z,\hat{Z}) \le \tau\big) = F(\tau) \quad \forall \tau \in \mathbb{R}.
\end{array}\right. \tag{8.25}$$
Note that we have relaxed the maximization problem in (8.25) by allowing the function ĥ to depend both on 𝑍 and 𝑍̂. However, this extra flexibility does not result in a higher optimal value. Indeed, Theorem 8.13 ensures that the supremum is attained by any function ĥ for which the random variables ĥ(𝑍, 𝑍̂) and ℓ(𝑍̂) are comonotonic and for which ĥ(𝑍, 𝑍̂) has distribution function 𝐹. As we have seen before, there exists a function with these properties that does not depend on 𝑍. Hence, the right to select a function ĥ that depends on 𝑍 is worthless.
Observe now that the function ĥ(𝑍, 𝑍̂) = ℎ(𝑍) is feasible in (8.25). Thus, we find
$$\begin{aligned}
\varrho_{\mathbb{P}}[\ell(Z)] - \varrho_{\hat{\mathbb{P}}}[\ell(\hat{Z})]
&\le \mathbb{E}_{\gamma}[h(Z)\cdot\ell(Z)] - \mathbb{E}_{\gamma}\big[h(Z)\cdot\ell(\hat{Z})\big] \\
&\le \mathbb{E}_{\gamma}\big[h(Z)\cdot\big|\ell(Z) - \ell(\hat{Z})\big|\big] \\
&\le \mathbb{E}_{\gamma}\big[h(Z)\cdot\mathrm{lip}(\ell)\cdot\|Z-\hat{Z}\|\big] \\
&\le \mathrm{lip}(\ell)\cdot\mathbb{E}_{\gamma}\big[\|Z-\hat{Z}\|^p\big]^{\frac{1}{p}}\cdot\mathbb{E}_{\mathbb{P}}\big[h(Z)^q\big]^{\frac{1}{q}},
\end{aligned}$$
where the second inequality holds because all convex risk measures are monotonic,
which implies that the subgradient ℎ(𝑍) is P-almost surely non-negative. The third
inequality exploits the Lipschitz continuity of the loss function, and the fourth
inequality follows from Hölder’s inequality. As the resulting inequality holds for
all couplings 𝛾 ∈ Γ(P, P̂), the definition of the 𝑝-Wasserstein distance implies that
$$\varrho_{\mathbb{P}}[\ell(Z)] - \varrho_{\hat{\mathbb{P}}}[\ell(\hat{Z})] \;\le\; \mathrm{lip}(\ell)\cdot W_p(\mathbb{P},\hat{\mathbb{P}})\cdot\mathbb{E}_{\mathbb{P}}\big[h(Z)^q\big]^{\frac{1}{q}} \;\le\; \mathrm{lip}(\varrho)\cdot\mathrm{lip}(\ell)\cdot W_p(\mathbb{P},\hat{\mathbb{P}}),$$
where the second inequality follows from Lemma 8.10. The claim then follows by
interchanging the roles of P and P̂.
Recall now that we assumed P and P̂ are non-atomic. This assumption was needed
to show that the supremum in (8.24) is attained. In general, one can extend P to
a distribution P′ on R𝑑+1 under which (𝑍1 , . . . , 𝑍 𝑑 ) and 𝑍 𝑑+1 are independent and
have marginal distributions equal to P and to the uniform distribution on [0, 1],
respectively. In the same way, P̂ can be extended to a distribution P̂′ on R𝑑+1 . By
construction, P′ and P̂′ are non-atomic. As 𝜚 is law-invariant, we further have
$$\varrho_{\mathbb{P}}[\ell(Z)] - \varrho_{\hat{\mathbb{P}}}[\ell(\hat{Z})] \;=\; \varrho_{\mathbb{P}'}[\ell(Z)] - \varrho_{\hat{\mathbb{P}}'}[\ell(\hat{Z})].$$
The right hand side of this equation can now be bounded as above.
Theorem 8.14 and Corollary 8.15 are due to Pichler (2013). Corollary 8.15
shows that the worst-case risk over all distributions in a 𝑝-Wasserstein ball is upper
bounded by the sum of the nominal risk and a Lipschitz regularization term for
a broad spectrum of law-invariant convex risk measures. If the loss function ℓ is
linear, that is, if ℓ(𝑧) = 𝜃 ⊤ 𝑧 for some 𝜃 ∈ R𝑑 , then this upper bound is often tight
(Pflug et al. 2012, Wozabal 2014). In this case the Lipschitz modulus of ℓ simplifies
to k𝜃 k ∗ . For example, the CVaR at level 𝛽 ∈ (0, 1] is a law-invariant convex L 𝑝 -
risk measure, and it is Lipschitz continuous with Lipschitz modulus 𝛽 −1/ 𝑝 . Thus,
Corollary 8.15 applies. From Proposition 6.20 we know, however, that the upper
bound is exact in this case. If additionally 𝑝 = 1, then Proposition 6.18 implies that
the upper bound remains exact whenever ℓ is convex and Lipschitz continuous.
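For two equally weighted empirical distributions on R with the same number of atoms and ℓ(𝑧) = 𝑧, the bound of Corollary 8.15 with 𝑝 = 1 can be checked directly: W₁ is the mean absolute difference of the sorted samples, CVaR at level 𝛽 averages the 𝛽𝑛 largest values, and the Lipschitz modulus of CVaR is 𝛽⁻¹. A sketch with hypothetical samples:

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta = 400, 0.1
x = rng.normal(size=n)                  # atoms of P
y = rng.normal(loc=0.3, size=n)        # atoms of P-hat

k = int(beta * n)                       # beta * n is an integer here (40)

def cvar(sample):
    # CVaR at level beta of an equally weighted sample:
    # the mean of the k = beta * n largest values.
    return np.mean(np.sort(sample)[-k:])

# 1-Wasserstein distance between two n-atom empirical distributions on R
# equals the mean absolute difference of the order statistics.
w1 = np.mean(np.abs(np.sort(x) - np.sort(y)))

gap = abs(cvar(x) - cvar(y))
print(gap <= w1 / beta + 1e-12)
```

The gap between the two CVaR values never exceeds lip(CVaR) · W₁ = W₁/𝛽, in line with Theorem 8.14.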
that some mild regularity conditions hold. In this case, Theorem 4.5 implies that
$$\inf_{x\in\mathcal{X}} \sup_{\mathbb{P}\in\mathcal{P}} \mathbb{E}_{\mathbb{P}}[\ell(x,Z)] \;=\; \left\{
\begin{array}{cl}
\inf & \lambda_0 + \delta_{\mathcal{F}}^*(\lambda) \\[1mm]
\text{s.t.} & x \in \mathcal{X},\; \lambda_0 \in \mathbb{R},\; \lambda \in \mathbb{R}^m \\
& \lambda_0 + f(z)^\top \lambda \ge \ell(x,z) \quad \forall z \in \mathcal{Z}.
\end{array}\right.$$
If the support function 𝛿F∗(𝜆) is known in closed form, then the resulting minimization problem is a semi-infinite program.
Definition 9.2 (Noise Oracle). Given any fixed decision 𝑦˜ ∈ Y, a noise oracle
outputs a solution to the noise problem
sup max 𝑔 𝑗 ( 𝑦˜ , 𝑧). (9.3)
𝑧 ∈Z 𝑗 ∈ [𝑚]
In the following, we first survey the scenario approach, which replaces the
semi-infinite program (9.1) with a finite scenario problem that offers stochastic
approximation guarantees. This approach calls the scenario oracle only once. We
then review cutting plane techniques that iteratively call scenario and noise oracles
to generate a solution sequence that attains the optimal value of problem (9.1),
either within finitely many iterations or asymptotically. Next, we study online
convex optimization algorithms, which do not require expensive scenario and/or
noise oracles and instead solve only deterministic versions of problem (9.1) and
use cheap first-order updates of the candidate decisions and/or incumbent worst-
case parameter realizations. We close with an overview of specialized numerical
solution methods that are tailored to specific ambiguity sets.
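The interplay of scenario and noise oracles can be sketched on a toy robust program min_y f(y) s.t. g(y, z) ≤ 0 for all z ∈ Z. The grid-based oracles and the problem data below are purely illustrative stand-ins, not the algorithms discussed in this section:

```python
import numpy as np

Y = np.linspace(-5.0, 5.0, 10001)     # discretized decision set (toy)
Zgrid = np.linspace(0.0, 1.0, 1001)   # discretized uncertainty set (toy)

f = lambda y: (y - 3.0) ** 2          # objective (hypothetical)
g = lambda y, z: y * z - 1.0          # robust constraint g(y, z) <= 0 for all z

scenarios = [0.0]                      # start with a single scenario
for _ in range(20):
    # Scenario oracle: minimize f over decisions feasible for all
    # currently collected scenarios.
    feas = np.all([g(Y, z) <= 1e-9 for z in scenarios], axis=0)
    y_cur = Y[feas][np.argmin(f(Y[feas]))]
    # Noise oracle: find the most violated uncertainty realization.
    z_worst = Zgrid[np.argmax(g(y_cur, Zgrid))]
    if g(y_cur, z_worst) <= 1e-9:
        break                          # y_cur is robust feasible: stop
    scenarios.append(z_worst)

# The robust constraint reduces to y <= 1, so the optimum is y = 1, f = 4.
print(y_cur, f(y_cur))
```

On this example the loop terminates after adding a single cut (z = 1), which already pins down the robust feasible region.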
for a given tolerance 𝜀 > 0, where Y𝑐 = {𝑦 ∈ Y : 𝑓 (𝑦) ≤ 𝑐}. Checking (9.5) requires the solution of a saddle point problem. Note that the objective function of this saddle point problem is convex in 𝑦 but fails to be concave in 𝑧 when 𝑚 > 1.
Therefore, standard primal-dual algorithms from online convex optimization do
not apply. Nevertheless, Ho-Nguyen and Kılınç-Karzan (2018) construct an online
algorithm that outputs a trajectory of candidate solutions 𝑦˜ and uncertainty realiza-
tions 𝑧˜ that converge to a saddle point. This method uses a first-order algorithm A 𝑦
for solving the (parametric) minimization problem min 𝑦 ∈Y𝑐 max 𝑗 ∈ [𝑚] 𝑔 𝑗 (𝑦, 𝑧) as
well as a first-order algorithm A 𝑗 for solving the (parametric) maximization prob-
lem max𝑧 ∈Z 𝑔 𝑗 (𝑦, 𝑧) for each 𝑗 ∈ [𝑚] as subroutines. Specifically, A 𝑦 is assumed
to map any history of candidate solutions 𝑦̃¹, . . . , 𝑦̃ᵗ and uncertainty realizations 𝑧̃ⱼ¹, . . . , 𝑧̃ⱼᵗ ∈ Z for 𝑗 ∈ [𝑚] and 𝑡 ∈ N to a new candidate solution 𝑦̃ᵗ⁺¹ such that
$$\sum_{t\in[T]} \max_{j\in[m]} g_j(\tilde{y}^t, \tilde{z}_j^t) \;-\; \min_{y\in\mathcal{Y}_c} \sum_{t\in[T]} \max_{j\in[m]} g_j(y, \tilde{z}_j^t) \;\le\; \mathcal{R}_y(T) \qquad \forall T \in \mathbb{N},$$
The total regret bound in the above expression grows sublinearly with 𝑇. Under the usual convexity assumptions, Algorithms 3 and 5 can be combined into a joint algorithm that finds a 𝛿-optimal and 𝜀-feasible solution to the semi-infinite program (9.1) in O(𝜀⁻² log(1/𝛿)) iterations. Thus, the iteration complexity does not improve vis-à-vis the algorithm by Ben-Tal et al. (2015b). However, the
computational effort per iteration is significantly lower for Algorithm 5 than for
Algorithm 4. Indeed, Algorithm 4 solves a feasibility problem with 𝑚 uncertainty
realizations in each iteration, whereas Algorithm 5 only calls the algorithms A 𝑦
and A 𝑗 , 𝑗 ∈ [𝑚], which compute cheap first-order updates. For further details,
we refer to (Ho-Nguyen and Kılınç-Karzan 2018). In addition, Ho-Nguyen and
Kılınç-Karzan (2019) exploit structural properties of the objective and constraint
functions to reduce the overall iteration complexity to O(𝜀 −1 log(1/𝛿)).
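To convey the flavor of such online saddle point schemes, the following minimal sketch runs simultaneous subgradient descent on 𝑦 and gradient ascent on per-constraint adversaries 𝑧 𝑗 for a hypothetical two-constraint instance. The functions, step sizes, and iteration count are our own illustrative choices, not the algorithm analyzed by Ho-Nguyen and Kılınç-Karzan (2018).

```python
import math

# Toy instance: Y_c = [-2, 2], Z = [-1, 1], m = 2 constraints.
#   g_1(y, z) = y + z - z**2   (convex in y, concave in z)
#   g_2(y, z) = -y + z - z**2
# The saddle value min_y max_j max_z g_j is 1/4, attained at y = 0, z = 1/2.

def project(v, lo, hi):
    return max(lo, min(hi, v))

def online_saddle(T=4000):
    y, z = 0.9, [0.0, 0.0]
    y_avg = 0.0
    for t in range(1, T + 1):
        eta = 0.5 / math.sqrt(t)
        # ascent step for each adversary z_j on g_j(y, .)
        for j in range(2):
            z[j] = project(z[j] + eta * (1.0 - 2.0 * z[j]), -1.0, 1.0)
        # subgradient descent step for y on max_j g_j(., z_j)
        g1 = y + z[0] - z[0] ** 2
        g2 = -y + z[1] - z[1] ** 2
        grad = 1.0 if g1 >= g2 else -1.0
        y = project(y - eta * grad, -2.0, 2.0)
        y_avg += (y - y_avg) / t  # running average of the iterates
    return y_avg, z

y_avg, z = online_saddle()
```

Averaging the iterates is what the regret analysis rewards; the last iterate alone need not converge to the saddle point.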
Up until now, all the algorithms discussed in this section relied on the bisection
method to reduce the semi-infinite program (9.1) to a sequence of robust feasib-
ility problems (9.4). This introduces unnecessary computational overhead. As a
remedy, Postek and Shtern (2024) use primal-dual saddle point algorithms that ad-
dress the following perspective reformulation of problem (9.1), which was initially
introduced in (Ho-Nguyen and Kılınç-Karzan 2018, Appendix A).
min 𝑦∈Y max 𝑧∈Z, 𝜆∈Δ𝑚 𝑓 (𝑦) + Σ 𝑗∈[𝑚] 𝜆 𝑗 𝑔 𝑗 (𝑦, 𝑧/𝜆 𝑗 )
Here, Δ𝑚 = {𝜆 ∈ R+𝑚 : Σ 𝑗∈[𝑚] 𝜆 𝑗 = 1}, and 0 · 𝑔 𝑗 (𝑦, 𝑧/0) is interpreted as the
negative recession function of the convex function −𝑔 𝑗 (𝑦, ·). By construction, the
objective function of this reformulation is convex in 𝑦 and jointly concave in 𝑧 and 𝜆.
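As a small numerical illustration of the convention for 0 · 𝑔 𝑗 (𝑦, 𝑧/0), the sketch below evaluates the perspective of a toy concave function; the choice of 𝑔 and the numerical approximation of the recession limit are our own, purely for illustration.

```python
def perspective(g, y, z, lam, eps=1e-9):
    """Evaluate lam * g(y, z / lam) for lam >= 0, where g(y, .) is concave.
    At lam == 0 the convention is the negative recession function of the
    convex function -g(y, .), approximated here by the limit lam -> 0+."""
    if lam == 0.0:
        lam = eps  # crude numerical stand-in for the recession limit
    return lam * g(y, z / lam)

# toy function, concave in z: g(y, z) = y * z - z**2
g = lambda y, z: y * z - z ** 2
val = perspective(g, 1.0, 1.0, 0.5)   # 0.5 * (1 * 2 - 4) = -1.0
lim = perspective(g, 1.0, 1.0, 0.0)   # tends to -infinity since z != 0
```

For this 𝑔 the recession limit is 0 when 𝑧 = 0 and −∞ otherwise, which is exactly the behavior the convention encodes.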
While the primal-dual saddle point algorithm of Postek and Shtern (2024) typically
enjoys an iteration complexity of O(𝜀 −2 ), where 𝜀 now represents the primal-dual
gap in the saddle point formulation, it requires more sophisticated oracles than
those used by Ho-Nguyen and Kılınç-Karzan (2018, 2019). This is primarily
because the perspective transformation eliminates favorable properties such as
strong convexity and smoothness, and it also significantly degrades the Lipschitz
constants. To address this challenge while still relying on standard oracles as
in (Ho-Nguyen and Kılınç-Karzan 2018, 2019), Tu, Chen and Yue (2024) propose
a two-layer algorithm based on the following Lagrangian formulation of (9.1).
max 𝜆∈R+𝑚 min 𝑦∈Y max 𝑧∈Z 𝑓 (𝑦) + Σ 𝑗∈[𝑚] 𝜆 𝑗 𝑔 𝑗 (𝑦, 𝑧)
Tu et al. (2024) show that their algorithm has an iteration complexity of O(𝜀 −3 ) or
O(𝜀 −2 ), depending on the smoothness of the objective and constraint functions.
lems over optimal transport ambiguity sets with generic reference distributions.
Other stochastic optimization schemes leverage variance reduction techniques (Yu,
Lin, Mazumdar and Jordan 2022) and zeroth-order random reshuffling algorithms
(Maheshwari, Chiu, Mazumdar, Sastry and Ratliff 2022). These works typic-
ally rely on the duality results introduced in Section 4 and subsequently apply
stochastic subgradient descent, using subgradients of the regularized loss function
ℓ𝑐 (𝑥, 𝑧ˆ) = sup 𝑧 ∈Z ℓ(𝑥, 𝑧) − 𝜆𝑐(𝑧, 𝑧ˆ) with respect to 𝑥 and 𝜆. Ho-Nguyen and Wright
(2023) extend this approach to nonconvex robust binary classification problems.
Sinha et al. (2018) examine relaxed distributionally robust neural network training
problems, assuming that the required level of robustness against adversarial perturb-
ations is sufficiently small. This is tantamount to forcing 𝜆 to exceed a sufficiently
large threshold. If 𝑐(𝑧, 𝑧ˆ) = ‖𝑧 − 𝑧ˆ‖₂², this in turn ensures that the maximization
problem over 𝑧 that defines ℓ𝑐 (𝑥, 𝑧ˆ) has a strongly concave objective function and
is thus efficiently solvable. Stochastic subgradients of ℓ𝑐 (𝑥, 𝑧ˆ) are therefore readily
available thanks to Danskin’s theorem. Shafiee et al. (2023) leverage nonconvex
duality theorems, such as Toland’s duality principle, to solve distributionally robust
portfolio selection problems. Algorithms that minimize the variation-regularized
nominal loss, which is known to approximate the worst-case expected loss thanks
to Theorem 8.7, are explored by Li et al. (2022) and Bai et al. (2017). Finally, Wang
et al. (2021, 2024a) and Azizian et al. (2023b) introduce entropy and 𝜙-divergence
regularizers to improve the efficiency of algorithms for Wasserstein DRO prob-
lems, and Vincent, Azizian, Malick and Iutzeler (2024) provide a Python library
for training related distributionally robust machine learning models.
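The mechanism exploited by Sinha et al. (2018) can be sketched in a few lines: for a toy loss that is linear in 𝑧 and the quadratic transport cost, the inner problem defining ℓ𝑐 is strongly concave once 𝜆 > 0, so plain gradient ascent finds its unique maximizer. The loss and all parameter values below are our own illustrative choices, not those of the cited paper.

```python
def robust_loss(x, z_hat, lam, steps=200, lr=0.1):
    """Compute l_c(x, z_hat) = max_z [ l(x, z) - lam * (z - z_hat)**2 ]
    by gradient ascent, for the toy loss l(x, z) = x * z.
    For lam > 0 the objective is strongly concave in z, so ascent converges
    to the unique maximizer z* = z_hat + x / (2 * lam)."""
    z = z_hat
    for _ in range(steps):
        grad = x - 2.0 * lam * (z - z_hat)  # d/dz [x*z - lam*(z - z_hat)**2]
        z += lr * grad
    return x * z - lam * (z - z_hat) ** 2, z

val, z_star = robust_loss(2.0, 0.5, 1.0)
# closed form for this toy loss: z* = z_hat + x/(2*lam) = 1.5
# and l_c(x, z_hat) = x*z_hat + x**2/(4*lam) = 2.0
```

By Danskin's theorem, a gradient of ℓ𝑐 with respect to 𝑥 is then obtained by differentiating the loss at the maximizer, here simply 𝑧*.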
be interested in, as well as the two key performance criteria of excess risk and
out-of-sample disappointment. Subsequently, Section 10.2 surveys asymptotic
analyses, which are based on the laws of large numbers, the central limit theorem,
the empirical likelihood approach as well as the large and moderate deviations
principles. Finally, Section 10.3 reviews non-asymptotic analyses, which rely on
measure concentration bounds as well as generalization bounds.
Our review of the statistical properties of DRO omits several important topics.
For example, we do not cover domain adaptation guarantees (Farnia and Tse 2016,
Volpi, Namkoong, Sener, Duchi, Murino and Savarese 2018, Lee and Raginsky
2018, Lee, Park and Shin 2020, Sutter, Krause and Kuhn 2021, Taşkesen, Yue,
Blanchet, Kuhn and Nguyen 2021, Rychener et al. 2024), which ensure that a DRO
model trained on data from some source distribution generalizes to a different target
distribution. We also omit discussions of adversarial generalization bounds (Sinha
et al. 2018, Wang et al. 2019, Tu, Zhang and Tao 2019, Kwon, Kim, Won and Paik
2020, An and Gao 2021), which use DRO to analyze model robustness against
adversarial perturbations, as well as applications in high-dimensional statistical
learning (Aolaritei, Shafiee and Dörfler 2022b). Finally, we do not cover Bayesian
guarantees (Gupta 2019, Shapiro, Zhou and Lin 2023, Liu, Su and Xu 2024b),
which focus on average-case rather than worst-case performance guarantees.
Note that problem (10.1) constitutes a classical stochastic program. While (10.1) is
theoretically sound, it faces two significant practical limitations. First, the distribu-
tion P0 underlying a decision problem is rarely known in practice. Second, even if
P0 were known, evaluating the objective function of (10.1) requires the computation
of an integral, which is intractable in high dimensions even for simple nonlinear
loss functions (Dyer and Stougie 2006, Hanasusanto, Kuhn and Wiesemann 2016).
In practice, we often observe the true probability distribution P0 indirectly
through historical data. From now on we thus assume to have access to 𝑁 inde-
pendent training samples from P0 , denoted as 𝑍1 , . . . , 𝑍 𝑁 . The goal of data-driven
optimization is to construct a decision from the training samples. This decision
should perform well not just on the training data, but also on unseen test samples
from P0 . The performance of a data-driven decision on test data is also called its
out-of-sample performance. Formally, data-driven optimization aims to learn a de-
cision rule T 𝑁 : Z 𝑁 ⇒ X that maps training samples from the product space Z 𝑁
to a set of candidate decisions in the decision space X . Note that T 𝑁 constitutes a
formed from the training samples 𝑍1 , . . . , 𝑍 𝑁 and to construct the decision rule
T 𝑁 (𝑍1 , . . . , 𝑍 𝑁 ) = arg min 𝑥∈X E P̂ 𝑁 [ℓ(𝑥, 𝑍)]. (10.3)
As the empirical distribution is discrete, the SAA approach obviates the need to
evaluate high-dimensional integrals and is thus computationally attractive. Never-
theless, the performance of its optimal solutions on test data can be disappointing
even when the test data are independently sampled from the true distribution P0 .
This phenomenon has been observed across various application domains and has
been given different names depending on the context. In finance, Michaud (1989)
identifies this issue as the error maximization effect of portfolio optimization.
Statistics and machine learning recognize it as overfitting, a well-known challenge
where models perform well on training data but fail to generalize to new, unseen
test data. In the stochastic programming literature, Shapiro (2003) refers to this
phenomenon as the optimization bias, and in decision analysis the effect has been
described as the optimizer’s curse (Smith and Winkler 2006).
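The optimizer's curse is easy to reproduce in simulation: among several alternatives with identical true expected loss, the empirically best one systematically looks better in sample than it performs out of sample. The numbers below are our own toy setup, chosen only to make the bias visible.

```python
import random

random.seed(0)

# Optimizer's curse in miniature: k alternatives, all with true expected
# loss 0. Picking the empirically best one yields an in-sample estimate
# that is biased downward, so the decision disappoints out of sample.
k, N, trials = 10, 20, 2000
in_sample, out_sample = 0.0, 0.0
for _ in range(trials):
    # empirical mean loss of each alternative from N noisy samples
    means = [sum(random.gauss(0.0, 1.0) for _ in range(N)) / N for _ in range(k)]
    best = min(means)           # SAA picks the empirically best alternative
    in_sample += best / trials
    out_sample += 0.0 / trials  # true expected loss of every alternative is 0

# in_sample is markedly negative although the true loss of the chosen
# alternative is exactly 0: the SAA value is an optimistically biased
# estimate of out-of-sample performance.
```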
The disappointing out-of-sample performance of the SAA decisions prompted
statisticians and machine learners to add a regularization term to the objective
function in (10.3). The regularization term serves two purposes. It not only
combats overfitting to the training data, but it also encourages simpler decisions.
Such simplicity aligns with the principle of parsimony and reflects nature’s inherent
tendency towards simplicity. As Jeffreys and Wrinch (1921) aptly noted,
“The existence of simple laws is, then, apparently, to be regarded as a quality
of nature; and accordingly we may infer that it is justifiable to prefer a simple
law to a more complex one that fits our observations slightly better.”
which can be viewed as a variant of the regularized SAA decision rule. The cor-
responding data-dependent regularization function is called the DRO regularizer
and is given by
𝑅ˆ 𝑁 (𝑥) = sup P∈ P̂ 𝑁 E P [ℓ(𝑥, 𝑍)] − E P̂ 𝑁 [ℓ(𝑥, 𝑍)]. (10.4)
Thus, it depends on both the decision 𝑥 and the observed training data 𝑍1 , . . . , 𝑍 𝑁 .
The regularizer (10.4) quantifies how much the worst-case expected loss across all
distributions P ∈ P̂ 𝑁 can exceed the in-sample expected loss E P̂ 𝑁 [ℓ(𝑥, 𝑍)].
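For a concrete feel of (10.4), recall the classical dual representation of the worst-case expectation over a Kullback–Leibler ball around a discrete empirical distribution, sup {E P [ℓ] : KL(P, P̂) ≤ 𝑟} = inf 𝛼>0 𝛼𝑟 + 𝛼 log E P̂ [exp(ℓ/𝛼)]. The sketch below evaluates this dual on a crude grid; the losses and radii are toy values and the grid search is deliberately unsophisticated.

```python
import math

def wc_expectation_kl(losses, r, grid=2000):
    """Worst-case expectation of `losses` (weighted uniformly, i.e. under
    the empirical distribution) over the KL ball {P : KL(P, P_hat) <= r},
    computed via the dual  inf_{a > 0} a*r + a*log(mean(exp(loss/a))),
    evaluated on a crude grid over a (a minimal sketch, not tuned)."""
    n = len(losses)
    best = max(losses)  # the dual value tends to max(losses) as a -> 0+
    for i in range(1, grid + 1):
        a = 0.01 * i
        m = max(losses)
        # log-sum-exp trick for numerical stability
        lse = m / a + math.log(sum(math.exp((l - m) / a) for l in losses) / n)
        best = min(best, a * r + a * lse)
    return best

losses = [0.0, 1.0, 2.0, 3.0]
emp = sum(losses) / len(losses)
v1 = wc_expectation_kl(losses, 0.1)
v2 = wc_expectation_kl(losses, 0.5)
dro_reg = v1 - emp  # the DRO regularizer (10.4) for this toy example
```

The regularizer is positive, grows with the radius 𝑟, and never pushes the worst-case expectation beyond the largest observed loss.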
The performance of decision rules in data-driven optimization is primarily meas-
ured by two criteria, each of which is aligned with a different field of study and
addresses a different set of practical concerns. The first criterion, excess risk, is pre-
dominantly used in statistics. It quantifies the distance of a data-driven decision 𝑋ˆ 𝑁
to an optimal decision 𝑥0 . The second criterion, out-of-sample disappointment, is
more commonly employed in operations research. It provides a measure of how
much the out-of-sample risk of a data-driven decision 𝑋ˆ 𝑁 exceeds the in-sample
risk of 𝑋ˆ 𝑁 . In the following, we formally define both criteria.
Excess Risk. Let 𝜂 ∈ (0, 1) be a significance level, T 𝑁 be a decision rule, and
Δ : X × X0 → R+ be a performance function. Suppose that 𝑋ˆ 𝑁 ∈ T 𝑁 (𝑍1 , . . . , 𝑍 𝑁 ).
The excess risk criterion offers the guarantee that for any size 𝑁 ≥ 𝑁(X , Z, 𝜂) of
the training set, we have
P0𝑁 [Δ( 𝑋ˆ 𝑁 , 𝑥0 ) ≤ 𝛿ˆ 𝑁 ] ≥ 1 − 𝜂 (10.5)
for some (possibly data-dependent) error certificate 𝛿ˆ 𝑁 . In statistical learning
theory, performance functions often measure the regret in terms of the loss function
ℓ under the true distribution P0 . Specifically, for any feasible candidate decisions
𝑥 ∈ X and any optimal decision 𝑥0 ∈ X0 , the regret takes the form
Δ(𝑥, 𝑥0 ) = E P0 [ℓ(𝑥, 𝑍)] − E P0 [ℓ(𝑥0 , 𝑍)] = E P0 [ℓ(𝑥, 𝑍)] − min 𝑥∈X E P0 [ℓ(𝑥, 𝑍)] ≥ 0.
In compressed sensing and M-estimation problems with linear models, performance
is often defined as the estimation error in the decision space, and it takes the form
Δ(𝑥, 𝑥0 ) = ‖𝑥 − 𝑥0 ‖₂².
Here, we assume for simplicity that the minimizer 𝑥0 is unique. We refer to (Mendel-
son 2003, Bousquet, Boucheron and Lugosi 2004) for an introduction to statistical
learning theory. For more advanced treatments, we refer to (Anthony and Bart-
lett 1999, Koltchinskii 2011, Vapnik 2013, Shalev-Shwartz and Ben-David 2014,
Vershynin 2018, Wainwright 2019).
Out-of-Sample Disappointment. Let 𝜂 ∈ (0, 1) be a significance level and T 𝑁
be a decision rule. Suppose that 𝑋ˆ 𝑁 ∈ T 𝑁 (𝑍1 , . . . , 𝑍 𝑁 ). The out-of-sample
disappointment criterion offers the guarantee that for any size 𝑁 ≥ 𝑁(X , Z, 𝜂) of
the training set, we have
P0𝑁 [E P0 [ℓ( 𝑋ˆ 𝑁 , 𝑍)] ≤ 𝐿ˆ 𝑁 ] ≥ 1 − 𝜂 (10.6)
for some (possibly data-dependent) loss certificate 𝐿ˆ 𝑁 . Alternatively, one can
express (10.6) as a probabilistic bound on the difference between the out-of-sample
performance and the in-sample performance,
P0𝑁 [E P0 [ℓ( 𝑋ˆ 𝑁 , 𝑍)] − E P̂ 𝑁 [ℓ( 𝑋ˆ 𝑁 , 𝑍)] ≤ 𝛿ˆ 𝑁 ] ≥ 1 − 𝜂,
for some error certificate 𝛿ˆ 𝑁 . Both criteria become equivalent when we set 𝐿ˆ 𝑁 =
E P̂ 𝑁 [ℓ( 𝑋ˆ 𝑁 , 𝑍)] + 𝛿ˆ 𝑁 . Unlike the excess risk bound (10.5), the out-of-sample
disappointment bound (10.6) does not require explicit knowledge of an optimal
decision 𝑥0 and solely leverages the statistical properties of P0 . As we will see in
the following sections, 𝐿ˆ 𝑁 and 𝛿ˆ 𝑁 typically correspond to the optimal value of the
DRO problem (1.2) and the DRO regularizer (10.4), respectively.
The next sections focus on ambiguity sets that are centered at the empirical dis-
tribution P̂ 𝑁 defined in (10.2). Specifically, we consider ambiguity sets constructed
using a discrepancy measure D : P(Z) × P(Z) → [0, ∞]:
P̂ 𝑁 = {P ∈ P(Z) : D(P, P̂ 𝑁 ) ≤ 𝑟 𝑁 }. (10.7)
The discrepancy measure D could be a 𝜙-divergence or a Wasserstein distance.
We will explain how the radius 𝑟 𝑁 should scale with the training sample size 𝑁 to
obtain the least conservative statistical guarantees.
regularity conditions, the laws of large numbers guarantee that the empirical loss
E P̂ 𝑁 [ℓ(𝑥, 𝑍)] converges P0 -almost surely to the true expected loss E P0 [ℓ(𝑥, 𝑍)],
uniformly on X (see, e.g., Shapiro et al. 2009, § 7.2.5). This implies that the optimal
value and the set of optimal solutions of the SAA problem exhibit asymptotic
consistency, that is, they both converge to their counterparts in the stochastic
program under P0 as the sample size 𝑁 approaches infinity. The central limit
theorem, on the other hand, implies that the scaled difference between the
empirical loss (under P̂ 𝑁 ) and true expected loss (under P0 ) converges weakly to a
normal distribution with mean zero and variance equal to the true variance of the
loss under P0 (see, e.g., Shapiro et al. 2009, § 5.1.2). Thus, the optimal value of
the SAA problem also exhibits asymptotic normality. The asymptotic properties
of the SAA decision rule have been studied extensively, see, e.g., (Cramér 1946,
Huber 1967, Dupacová and Wets 1988, Shapiro 1989, 1990, 1991, 1993, King and
Wets 1991, King and Rockafellar 1993, Van der Vaart 1998, Lam 2021).
Building on these foundations, we will next review the asymptotic consistency
and normality of DRO decision rules. While studying these asymptotic behavi-
ors, different theoretical frameworks provide distinct insights. The central limit
theorem and empirical likelihood approaches characterize the typical fluctuations
around the mean under an appropriate scaling. The central limit theorem establishes
Gaussian convergence, whereas the empirical likelihood theory provides a nonpara-
metric framework for constructing likelihood ratio tests with asymptotic 𝜒2 -limits,
enabling hypothesis testing without specific parametric assumptions. In contrast,
large deviations theory examines the tail behavior of distribution sequences. Rather
than focusing on typical fluctuations, it characterizes the exponential decay rate of
probabilities associated with rare events far from the mean. Moderate deviations
theory bridges the gap between the typical and rare event analyses provided by the
aforementioned frameworks. It studies the asymptotic behavior of distribution se-
quences at intermediate scales, thus investigating larger deviations than the central
limit theorem but smaller deviations than large deviations theory.
The analysis employs the functional central limit theorem alongside careful Taylor
expansions of the worst-case expectation akin to those presented in Section 8. In
particular, the authors establish that, under appropriate regularity conditions, the
limiting distributions are normal with explicitly characterized means and variances.
For some 𝑟 ∈ R+ . Thus, the set Cˆ𝑁 is the image of a 𝜙-divergence neighborhood
around the empirical distribution P̂ 𝑁 under 𝜃. The key tool to establish probabilistic
bounds is the so-called profile divergence 𝜋 𝑁 : R → R+ , which is defined through
𝜋 𝑁 (𝜏) = inf P∈P(Z) { D 𝜙 (P, P̂ 𝑁 ) : 𝜃(P) = 𝜏 }. (10.9)
where 𝜒𝑑20 denotes the 𝜒2 -distribution with 𝑑0 degrees of freedom. Thus, Cˆ𝑁
constitutes an asymptotically exact (1 − 𝜂)-confidence interval for 𝜃(P0 ) if we set 𝑟
in (10.8) to the (1 − 𝜂)-quantile of a 𝜒2 -distribution with 𝑑0 degrees of freedom.
In the context of stochastic programming problems, the statistical quantity of
interest is typically the optimal value of the stochastic program, that is, 𝜃(P) =
inf 𝑥 ∈X E P [ℓ(𝑥, 𝑍)]. In this case, the set Cˆ𝑁 becomes the interval
Cˆ𝑁 = [ inf P∈ P̂ 𝑁 inf 𝑥∈X E P [ℓ(𝑥, 𝑍)], sup P∈ P̂ 𝑁 inf 𝑥∈X E P [ℓ(𝑥, 𝑍)] ],
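A hypothetical miniature of this construction takes 𝜃(P) = E P [𝑍], chooses the 𝜙-divergence to be the Kullback–Leibler divergence, and restricts P to a fixed finite support. By convex duality the profile divergence is then the Legendre transform of the cumulant generating function of P̂ 𝑁 . The support, radius, and grids below are our own illustrative choices.

```python
import math

def profile_divergence(support, p_hat, tau, eta_grid=2000):
    """pi_N(tau) = inf { KL(P, P_hat) : E_P[Z] = tau } over distributions
    on a fixed finite support. By convex duality this equals the Legendre
    transform of the cumulant generating function of P_hat:
        pi_N(tau) = sup_eta [ eta * tau - log sum_i p_hat_i * exp(eta * z_i) ],
    evaluated here on a crude grid over eta."""
    best = 0.0  # eta = 0 always yields the value 0
    for k in range(-eta_grid, eta_grid + 1):
        eta = 0.01 * k
        cgf = math.log(sum(p * math.exp(eta * z) for p, z in zip(p_hat, support)))
        best = max(best, eta * tau - cgf)
    return best

support = [0.0, 1.0, 2.0, 3.0]
p_hat = [0.25, 0.25, 0.25, 0.25]  # empirical distribution with mean 1.5
r = 0.1                            # divergence budget
taus = [0.05 * i for i in range(1, 60)]  # candidate values of theta(P)
interval = [t for t in taus if profile_divergence(support, p_hat, t) <= r]
lo, hi = min(interval), max(interval)  # confidence interval endpoints
```

Setting 𝑟 to the appropriately scaled 𝜒²-quantile then turns [lo, hi] into an asymptotically exact confidence interval for 𝜃(P0 ), as described above.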
for all 𝜃 ∈ Θ and for all Borel sets B ⊆ Θ. Here, we assume that the sequence 𝑏 𝑁 ,
𝑁 ∈ N, tends monotonically towards infinity. If (10.10) holds, one can show under
mild conditions that 𝐼(𝜃, 𝜃) = 0 because 𝜃ˆ 𝑁 converges to 𝜃 in probability under P 𝜃 .
It is therefore natural to interpret 𝐼(𝜃 ′ , 𝜃) as a discrepancy function that quantifies the
dissimilarity between the estimator realization 𝜃 ′ and the probabilistic model 𝜃. As 𝐼
is lower semicontinuous, the minimization problems on the left and on the right hand
side of (10.10) share the same infimum 𝑟 = inf 𝜃 ′ ∈int(B) 𝐼(𝜃 ′ , 𝜃) = inf 𝜃 ′ ∈cl(B) 𝐼(𝜃 ′ , 𝜃)
for most Borel sets B of interest. In these cases, the inequalities in (10.10) collapse
to equalities, and (10.10) simplifies to the more intuitive statement
P 𝜃 (𝜃ˆ 𝑁 ∈ B) = exp (−𝑟𝑏 𝑁 + 𝑜(𝑏 𝑁 )) .
That is, the probability of the estimator 𝜃ˆ𝑁 falling into the set B decays exponentially
at rate 𝑟 with speed 𝑏 𝑁 , where 𝑟 can be interpreted as the 𝐼-distance from 𝜃 to B.
Several statistics of practical interest satisfy large deviations principles. For
example, if Z is finite and {P 𝜃 : 𝜃 ∈ Θ} is the family of all distributions on Z
encoded by the corresponding probability vectors 𝜃 ∈ Θ, where Θ is the probability
simplex of appropriate dimension, then the empirical distribution P̂ 𝑁 correspond-
ing to the empirical probability vector 𝜃ˆ 𝑁 is an estimator for the data-generating
distribution P 𝜃 . In this case, Sanov’s theorem (Cover and Thomas 2006, The-
orem 11.4.1) asserts that 𝜃ˆ 𝑁 satisfies a large deviations principle with rate function
𝐼(𝜃 ′ , 𝜃) = KL(P 𝜃 ′ , P 𝜃 ) and speed 𝑏 𝑁 = 𝑁. Similarly, if {P 𝜃 : 𝜃 ∈ Θ} is any
distribution family parametrized by its unknown mean vector 𝜃 = E P 𝜃 [𝑍] and
if the log-moment generating function Λ 𝜃 (𝑡) = log(E P 𝜃 [exp(𝑡 ⊤ 𝑍)]) is finite for
all 𝑡, 𝜃 ∈ R𝑑 , then the sample mean 𝜃ˆ 𝑁 = (1/𝑁) Σ𝑖∈[𝑁 ] 𝑍𝑖 is an estimator for 𝜃. In this
case, Cramér’s theorem (Cramér 1938) asserts that 𝜃ˆ 𝑁 satisfies a large deviations
principle with rate function 𝐼(𝜃 ′ , 𝜃) = Λ∗𝜃 (𝜃 ′ ) and speed 𝑏 𝑁 = 𝑁. Note that the log-
moment generating function Λ 𝜃 as well as its conjugate Λ∗𝜃 are both convex. We
remark that a large deviations principle with sublinear speed (lim 𝑁 →∞ 𝑏 𝑁 /𝑁 = 0)
is sometimes referred to as a moderate deviations principle. For an example of a
moderate deviations principle we refer to (Jongeneel, Sutter and Kuhn 2022).
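Sanov's theorem can be watched in action with a toy Monte Carlo experiment: for Bernoulli data, the probability that the empirical mean exceeds a level 𝑏 > 𝜃 decays like exp(−𝑁 KL(𝑏, 𝜃)) up to a polynomial prefactor. All parameter choices below are illustrative.

```python
import math, random

random.seed(7)

def kl_bernoulli(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

# Sanov / Cramer in the simplest possible case: Z_i ~ Bernoulli(theta) and
# the rare event is {empirical mean >= b} with b > theta. The large
# deviations principle predicts P(event) ~ exp(-N * KL(b, theta)).
theta, b, N, trials = 0.3, 0.5, 50, 100_000
hits = 0
for _ in range(trials):
    mean = sum(random.random() < theta for _ in range(N)) / N
    if mean >= b:
        hits += 1

empirical_rate = -math.log(hits / trials) / N
predicted_rate = kl_bernoulli(b, theta)
# empirical_rate slightly exceeds predicted_rate at finite N because of the
# polynomial prefactor that the large deviations principle suppresses
```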
Van Parys et al. (2021) leverage Sanov’s theorem to show that the optimal
value of the DRO problem with a likelihood ambiguity set of radius 𝑟 around the
empirical distribution P̂ 𝑁 yields the least conservative confidence bound on the
optimal value of the true stochastic program, asymptotically as the sample size 𝑁
grows large, with significance level 𝜂 decaying exponentially as 𝑒 −𝑟 𝑁 . More
generally, Sutter, Van Parys and Kuhn (2024) assume that P0 is known to belong
to a parametric distribution family {P 𝜃 : 𝜃 ∈ Θ} and that 𝜃 admits an estimator 𝜃ˆ 𝑁
that satisfies a large deviations principle with rate function 𝐼 and speed 𝑏 𝑁 = 𝑁.
Under some regularity conditions, they then show that the optimal value of the
DRO problem with ambiguity set P̂ 𝑁 = {P 𝜃 : 𝜃 ∈ Θ, 𝐼(𝜃ˆ 𝑁 , 𝜃) ≤ 𝑟} yields again
the least conservative confidence bound on the optimal value of the true stochastic
program with significance level 𝜂 ∝ 𝑒 −𝑟 𝑁 . Similar statistical optimality results can
sometimes be obtained even when the training samples are serially dependent, e.g.,
when they are generated by a Markov process with unknown transition probability
matrix or certain autoregressive processes (Sutter et al. 2024).
The DRO estimators by Van Parys et al. (2021) and Sutter et al. (2024) lack
asymptotic consistency because they exploit large deviations principles with linear
speed 𝑏 𝑁 = 𝑁. Bennouna and Van Parys (2021) show that asymptotic consistency
can be recovered by relying on moderate deviations principles with sublinear speed.
This line of research has seen significant recent developments. The use of large
and moderate deviations principles has also been extended to various learning
and control settings such as distributionally robust Markov decision processes (Li,
Sutter and Kuhn 2021), bandit problems (Van Parys and Golrezaei 2024), bootstrap-
based methods (Bertsimas and Van Parys 2022), optimal learning (Ganguly and
Sutter 2023, Liu et al. 2023), control (Jongeneel, Sutter and Kuhn 2021, Jongeneel
et al. 2022), contextual learning (Srivastava, Wang, Hanasusanto and Ho 2021),
and robust statistics (Chan, Van Parys and Bennouna 2024).
P0𝑁 P0 ∈ P̂ 𝑁 ≥ 1 − 𝜂.
(10.11)
We then have
P0𝑁 E P0 [ℓ(𝑥, 𝑍)] ≤ sup E P [ℓ(𝑥, 𝑍)] ∀𝑥 ∈ X ≥ 1 − 𝜂. (10.12a)
P∈ P̂ 𝑁
The proof of (10.12a) and (10.12b) readily follows from the measure concentra-
tion bound (10.11) and is therefore omitted. Theorem 10.1 asserts that the worst-
case expected loss provides an upper confidence bound on the true expected loss
under the unknown data-generating distribution uniformly across all loss functions.
Moreover, it also asserts that the optimal value of the DRO problem (1.2) provides
an upper confidence bound on the out-of-sample performance of its optimizers.
When using 𝜙-divergences to construct P̂ 𝑁 as in (10.7), the probabilistic require-
ment (10.11) only applies to underlying distributions P0 that are discrete (Polyanskiy
and Wu 2024, § 7). In contrast, the Wasserstein distance applies to generic distri-
butions P0 . This area of study has a rich history, with seminal contributions from
Dudley (1969), Ajtai, Komlós and Tusnády (1984), and Dobrić and Yukich (1995).
More recent advancements have been made by Bolley, Guillin and Villani (2007),
Boissard and Le Gouic (2014), Dereich, Scheutzow and Schottstedt (2013), and
Fournier and Guillin (2015). Of particular importance to our discussion is the
following measure concentration result, which serves as the foundation for finite
sample guarantees in DRO over 𝑝-Wasserstein ambiguity sets.
Theorem 10.2 (Measure Concentration (Fournier and Guillin 2015, Theorem 2)).
Suppose that P̂ 𝑁 is the empirical distribution constructed from 𝑁 independent
samples from P0 , 𝑝 ≠ 𝑑/2, and that P0 is light-tailed in the sense that there exist
𝛼 > 𝑝 and 𝐴 > 0 such that E P0 (exp(k𝑍 k 𝛼 )) ≤ 𝐴. Then, there are constants
𝑐1 , 𝑐2 > 0 that depend on P0 only through 𝛼, 𝐴, and 𝑑 such that for any 𝜂 ∈ (0, 1],
the concentration inequality P0𝑁 (𝑊 𝑝 (P0 , P̂ 𝑁 ) ≤ 𝑟 𝑁 ) ≥ 1 − 𝜂 holds whenever 𝑟 𝑁 exceeds

𝑟(𝑑, 𝑁, 𝜂) = (log(𝑐1 /𝜂)/(𝑐2 𝑁))^min{1/𝑑, 1/2} if 𝑁 ≥ log(𝑐1 /𝜂)/𝑐2 , and (10.13)
𝑟(𝑑, 𝑁, 𝜂) = (log(𝑐1 /𝜂)/(𝑐2 𝑁))^(1/𝛼) if 𝑁 < log(𝑐1 /𝜂)/𝑐2 .
The result remains valid for 𝑝 = 𝑑/2 but with a more complicated formula
for 𝑟(𝑑, 𝑁, 𝜂) (Fournier and Guillin 2015, Theorem 2). Intuitively, Theorem 10.2
asserts that any 𝑝-Wasserstein ball P̂ 𝑁 of radius 𝑟 𝑁 ≥ 𝑟(𝑑, 𝑁, 𝜂) around P̂ 𝑁 represents
a (1 − 𝜂)-confidence set for the unknown data-generating distribution P0 . For
uncertainty dimensions 𝑑 > 2, the critical radius 𝑟(𝑑, 𝑁, 𝜂) of this confidence set
decays as O(𝑁 −1/𝑑 ). In other words, to reduce the critical radius by 50%, the
sample size must increase by 2𝑑 . Unfortunately, this curse of dimensionality is
fundamental, and the decay rate of 𝑟(𝑑, 𝑁, 𝜂) is essentially optimal (Fournier and
Guillin 2015, § 1.3). Explicit constants 𝑐1 and 𝑐2 are provided by Fournier (2022).
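The piecewise radius (10.13) and the resulting 2^𝑑 sample-size scaling are easy to tabulate. Note that the constants 𝑐1 and 𝑐2 below are placeholder values chosen only for illustration, not the explicit constants of Fournier (2022).

```python
import math

def critical_radius(d, N, eta, alpha, c1=2.0, c2=1.0):
    """Critical Wasserstein radius r(d, N, eta) from (10.13). The constants
    c1, c2 depend on P0 through alpha, A, and d; the defaults here are
    purely illustrative placeholders."""
    t = math.log(c1 / eta) / c2
    if N >= t:
        return (t / N) ** min(1.0 / d, 0.5)  # large-sample branch
    return (t / N) ** (1.0 / alpha)          # small-sample branch

# the curse of dimensionality: halving the radius needs 2**d times more data
r1 = critical_radius(d=10, N=10**4, eta=0.05, alpha=2.0)
r2 = critical_radius(d=10, N=(2**10) * 10**4, eta=0.05, alpha=2.0)
```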
Generic measure concentration bounds suffer from a curse of dimensionality.
Shafieezadeh-Abadeh et al. (2019) and Wu et al. (2022) show that this curse can
be overcome in the context of linear prediction models by projecting 𝑍 to a one-dimensional random variable, yielding the parametric convergence rate O(𝑁 −1/2 ).
Nietert et al. (2024a) develop a similar approach for rank-𝑘 linear models, where
2 < 𝑘 < 𝑑, and achieve an improved rate of O(𝑁 −1/𝑘 ) based on 𝑘-sliced Wasserstein
distances. The 1-sliced Wasserstein distance is also used by Olea et al. (2022) to
obtain the parametric rate O(𝑁 −1/2 ) for a class of regression problems.
We conclude this section by highlighting that the DRO approach admits instance-
dependent regret bounds, which essentially depend on no complexity measures of
the decision space or the loss function. Instead, they only depend on the complexity
of the optimal solution 𝑥0 through the DRO regularizer 𝑅ˆ 𝑁 (𝑥0 ). Zeng and Lam
(2022, Theorem 4.1) and Nietert et al. (2024a, Theorem 1) establish such bounds
for DRO problems over the ambiguity set (10.7) when D is the maximum mean
discrepancy and the (outlier-robust) Wasserstein distance, respectively. Similar
instance-dependent guarantees for DRO problems with Wasserstein ambiguity sets
are developed by Hou, Kassraie, Kratsios, Krause and Rothfuss (2023).
where the loss certificate 𝐿ˆ 𝑁 (𝑥) depends on the decision 𝑥 ∈ X . For example, a
guarantee of the form (10.14) can be obtained by combining empirical Bernstein
inequalities (Maurer and Pontil 2009) and a DRO model with a 𝜒2 -divergence
ambiguity set (Duchi and Namkoong 2019, Theorem 2). In this case, the certi-
ficate 𝐿ˆ 𝑁 (𝑥) reduces to the sum of the expected loss under P̂ 𝑁 and a variance
regularizer under P0 . Alternatively, a guarantee of the form (10.14) can also be
obtained by combining transport inequalities (Marton 1986, Talagrand 1996) and a
DRO model with a Wasserstein ambiguity set (Gao 2023, Theorem 1). In this case,
𝐿ˆ 𝑁 (𝑥) reduces to the sum of the expected loss under P̂ 𝑁 and a variation regularizer
under P0 . The second step consists in converting the individual guarantee (10.14)
to a uniform guarantee. For example, if X is finite, this can easily be achieved by
using the union bound. If X is uncountable, one may use one of several standard
techniques. If the loss function is Lipschitz continuous in 𝑥 ∈ X uniformly across
all 𝑧 ∈ Z and X is compact, then one can discretize X by uniform gridding. In
this case, the loss at an arbitrary point is uniformly approximated by the loss at
the nearest grid point, and a uniform guarantee can again be obtained by using the
union bound. However, the number of grid points needed for an 𝜀-approximation
is of the order O((1/𝜀)𝑑 ), which is impractical in high dimensions 𝑑. A more
sophisticated approach to discretize X exploits structural knowledge of the loss
function at multiple scales. However, obtaining tight approximations in high
dimensions remains challenging. In order to mitigate the computational burden related
to discretization, one may exploit several complexity measures that quantify the
expressiveness of the functions ℓ(𝑥, ·) for all 𝑥 ∈ X such as the VC dimension or
the Rademacher complexity as well as its local version. Nonetheless, Rademacher
complexities can themselves be challenging to compute. For full details we
refer to (Boucheron, Lugosi and Massart 2013, Vershynin 2018, Wainwright 2019).
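The O((1/𝜀)^𝑑) grid-size arithmetic mentioned above is worth spelling out, since it is what drives the search for covering- and complexity-based alternatives (illustrative numbers only):

```python
def grid_points(eps, d):
    """Number of points in a uniform eps-grid of a unit cube in dimension d,
    which grows as (1/eps)**d."""
    per_axis = round(1.0 / eps)  # grid points needed per coordinate axis
    return per_axis ** d

sizes = {d: grid_points(0.1, d) for d in (1, 2, 5, 10, 20)}
# a 0.1-grid needs 10 points in dimension 1 but 10**20 points in dimension 20
```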
The last step consists in approximating the certificate 𝐿ˆ 𝑁 (𝑥) by the worst-case
expected loss over a data-driven ambiguity set P̂ 𝑁 based on the 𝜒2 -divergence
or a Wasserstein distance. The corresponding approximation error can be con-
trolled by leveraging Taylor approximations as in Theorems 8.4 and 8.7 together
with appropriate concentration inequalities. In summary, this procedure shows
that the optimal value of a data-driven DRO problem over a 𝜒2 -divergence or a
Wasserstein ambiguity set provides a finite-sample upper confidence bound on the
corresponding stochastic program under the unknown true distribution P0 .
Duchi and Namkoong (2019) and Gao (2023) derive generalization bounds of this
kind for 𝜒2 -divergence and Wasserstein ambiguity sets, respectively, while Azizian,
Iutzeler and Malick (2023a) extend their analysis to entropic regularized optimal
transport ambiguity sets. All these bounds exhibit the parametric rate O(𝑁 −1/2 ). In
addition, Duchi and Namkoong (2019) demonstrate that, under certain curvature
conditions, 𝜒2 -divergence decision rules can achieve the fast rate O(𝑁 −1 ).
Philipp Schneider, Buse Sen, Bradley Sturt and Man-Chung Yue for their valuable
feedback on the paper. We are responsible for all remaining errors.
References
C. Acerbi (2002), Spectral measures of risk: A coherent representation of subjective risk
aversion, Journal of Banking & Finance 26(7), 1505–1518.
A. Ahmadi-Javid (2012), Entropic value-at-risk: A new coherent risk measure, Journal of
Optimization Theory and Applications 155(3), 1105–1123.
S. Ahmed (2006), Convexity and decomposition of mean-risk stochastic programs, Math-
ematical Programming 106(3), 433–446.
M. Ajtai, J. Komlós and G. Tusnády (1984), On optimal matchings, Combinatorica 4(4),
259–264.
F. Al Taha, S. Yan and E. Bitar (2023), A distributionally robust approach to regret optimal
control using the Wasserstein distance, in IEEE Conference on Decision and Control,
pp. 2768–2775.
S. M. Ali and S. D. Silvey (1966), A general class of coefficients of divergence of one
distribution from another, Journal of the Royal Statistical Society: Series B 28(1),
131–142.
J. M. Altschuler and E. Boix-Adsera (2023), Polynomial-time algorithms for multimarginal
optimal transport problems with structure, Mathematical Programming 199(1), 1107–
1178.
L. Ambrosio, N. Gigli and G. Savaré (2008), Gradient Flows: In Metric Spaces and in the
Space of Probability Measures, Springer.
Y. An and R. Gao (2021), Generalization bounds for (Wasserstein) robust optimization, in
Advances in Neural Information Processing Systems, pp. 10382–10392.
B. Analui and G. C. Pflug (2014), On distributionally robust multiperiod stochastic optim-
ization, Computational Management Science 11, 197–220.
M. Anthony and P. L. Bartlett (1999), Neural Network Learning: Theoretical Foundations,
Cambridge University Press.
J. Anunrojwong, S. R. Balseiro and O. Besbes (2024), On the robustness of second-price
auctions in prior-independent mechanism design, Operations Research (Forthcoming).
L. Aolaritei, N. Lanzetti, H. Chen and F. Dörfler (2022a), Uncertainty propagation via
optimal transport ambiguity sets, arXiv:2205.00343.
L. Aolaritei, S. Shafiee and F. Dörfler (2022b), Wasserstein distributionally robust estim-
ation in high dimensions: Performance analysis and optimal hyperparameter tuning,
arXiv:2206.13269.
R. Arora and R. Gao (2022), Data-driven multistage distributionally robust linear optimiz-
ation with nested distance, Available from Optimization Online.
P. Artzner, F. Delbaen, J.-M. Eber and D. Heath (1999), Coherent measures of risk,
Mathematical Finance 9(3), 203–228.
C. Atkinson and A. F. Mitchell (1981), Rao’s distance measure, Sankhyā: The Indian
Journal of Statistics, Series A 43(3), 345–365.
W. Azizian, F. Iutzeler and J. Malick (2023a), Exact generalization guarantees for (regu-
larized) Wasserstein distributionally robust models, in Advances in Neural Information
Processing Systems, pp. 14584–14596.
D. Boskos, J. Cortés and S. Martínez (2020), Data-driven ambiguity sets with probabilistic
guarantees for dynamic processes, IEEE Transactions on Automatic Control 66(7),
2991–3006.
P. Bossaerts, P. Ghirardato, S. Guarnaschelli and W. R. Zame (2010), Ambiguity in asset
markets: Theory and experiment, The Review of Financial Studies 23(4), 1325–1359.
S. Boucheron, G. Lugosi and P. Massart (2013), Concentration Inequalities: A Nonasymp-
totic Theory of Independence, Oxford University Press.
O. Bousquet, S. Boucheron and G. Lugosi (2004), Introduction to statistical learning
theory, in Advanced Lectures on Machine Learning (O. Bousquet, U. von Luxburg and
G. Rätsch, eds), Springer, pp. 169–207.
G. E. Box (1953), Non-normality and tests on variances, Biometrika 40(3-4), 318–335.
G. E. Box (1979), Robustness in the strategy of scientific model building, in Robustness in
Statistics (R. L. Launer and G. N. Wilkinson, eds), Academic Press, pp. 201–236.
Y. Brenier (1991), Polar factorization and monotone rearrangement of vector-valued func-
tions, Communications on Pure and Applied Mathematics 44(4), 375–417.
H. Brezis (2011), Functional Analysis, Sobolev Spaces and Partial Differential Equations,
Springer.
J. Brugman, J. S. Van Leeuwaarden and C. Stegehuis (2022), Sharpest possible clustering
bounds using robust random graph analysis, Physical Review E 106(6), 064311.
M. Buckert, C. Schwieren, B. M. Kudielka and C. J. Fiebach (2014), Acute stress affects
risk taking but not ambiguity aversion, Frontiers in Neuroscience 8, 82.
N. Bui, D. Nguyen and V. A. Nguyen (2022), Counterfactual plans under distributional
ambiguity, in International Conference on Learning Representations.
L. Bungert, N. García Trillos and R. Murray (2023), The geometry of adversarial training in
binary classification, Information and Inference: A Journal of the IMA 12(2), 921–968.
L. Bungert, T. Laux and K. Stinson (2024), A mean curvature flow arising in adversarial
training, arXiv:2404.14402.
L. Cabantous (2007), Ambiguity aversion in the field of insurance: Insurers’ attitude to
imprecise and conflicting probability estimates, Theory and Decision 62(3), 219–240.
J. Cai, J. Y.-M. Li and T. Mao (2023), Distributionally robust optimization under distorted
expectations, Operations Research (Forthcoming).
G. C. Calafiore (2007), Ambiguous risk measures and optimal robust portfolios, SIAM
Journal on Optimization 18(3), 853–877.
G. C. Calafiore and M. C. Campi (2005), Uncertain convex programs: Randomized solu-
tions and confidence levels, Mathematical Programming 102(1), 25–46.
G. C. Calafiore and M. C. Campi (2006), The scenario approach to robust control design,
IEEE Transactions on Automatic Control 51(5), 742–753.
G. C. Calafiore and L. El Ghaoui (2006), On distributionally robust chance-constrained
linear programs, Journal of Optimization Theory and Applications 130(1), 1–22.
G. C. Calafiore, F. Dabbene and R. Tempo (2011), Research on probabilistic methods for
control system design, Automatica 47(7), 1279–1293.
M. C. Campi and A. Caré (2013), Random convex programs with 𝐿1-regularization: Sparsity
and generalization, SIAM Journal on Control and Optimization 51(5), 3532–3557.
M. C. Campi and S. Garatti (2008), The exact feasibility of randomized solutions of
uncertain convex programs, SIAM Journal on Optimization 19(3), 1211–1230.
N. García Trillos, M. Jacobs and J. Kim (2023), The multimarginal optimal transport for-
mulation of adversarial multiclass classification, Journal of Machine Learning Research
24(45), 1–56.
H. Gassmann and W. Ziemba (1986), A tight upper bound for the expectation of a con-
vex function of a multivariate random variable, in Stochastic Programming 84 Part I
(A. Prékopa and R. J.-B. Wets, eds), Vol. 27, Springer, pp. 39–53.
M. Gelbrich (1990), On a formula for the 𝐿2 Wasserstein metric between measures on
Euclidean and Hilbert spaces, Mathematische Nachrichten 147(1), 185–203.
G. Georgakopoulos, D. Kavvadias and C. H. Papadimitriou (1988), Probabilistic satisfiab-
ility, Journal of Complexity 4(1), 1–11.
R. Ghanem, D. Higdon and H. Owhadi (2017), Handbook of Uncertainty Quantification,
Springer.
S. Ghosh, M. Squillante and E. Wollega (2021), Efficient stochastic gradient descent for
learning with distributionally robust optimization, in Advances in Neural Information
Processing Systems, pp. 28310–28322.
I. Gilboa and D. Schmeidler (1989), Maxmin expected utility with a non-unique prior,
Journal of Mathematical Economics 18(2), 141–153.
C. Givens and R. Shortt (1984), A class of Wasserstein metrics for probability distributions,
The Michigan Mathematical Journal 31(2), 231–240.
M. Goerigk and J. Kurtz (2023), Data-driven robust optimization using deep neural net-
works, Computers & Operations Research 151, Article 106087.
I. J. Goodfellow, J. Shlens and C. Szegedy (2015), Explaining and harnessing adversarial
examples, in International Conference on Learning Representations.
J.-y. Gotoh, M. J. Kim and A. E. Lim (2018), Robust empirical optimization is almost the
same as mean–variance optimization, Operations Research Letters 46(4), 448–452.
J.-y. Gotoh, M. J. Kim and A. E. Lim (2021), Calibration of distributionally robust empirical
optimization models, Operations Research 69(5), 1630–1650.
N. Gravin and P. Lu (2018), Separation in correlation-robust monopolist problem with
budget, in SIAM Symposium on Discrete Algorithms, pp. 2069–2080.
M. Green and D. J. N. Limebeer (1995), Linear Robust Control, Prentice Hall.
G. Gül and A. M. Zoubir (2017), Minimax robust hypothesis testing, IEEE Transactions
on Information Theory 63(9), 5572–5587.
I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin and A. Courville (2017), Improved
training of Wasserstein GANs, in Advances in Neural Information Processing Systems,
pp. 5769–5779.
V. Gupta (2019), Near-optimal Bayesian ambiguity sets for distributionally robust optim-
ization, Management Science 65(9), 4242–4260.
M. Gürbüzbalaban, A. Ruszczyński and L. Zhu (2022), A stochastic subgradient method for
distributionally robust non-convex and non-smooth learning, Journal of Optimization
Theory and Applications 194(3), 1014–1041.
J. Hajar, T. Kargin and B. Hassibi (2023), Wasserstein distributionally robust regret-optimal
control under partial observability, in Allerton Conference on Communication, Control,
and Computing, pp. 1–6.
A. Hakobyan and I. Yang (2024), Wasserstein distributionally robust control of partially
observable linear stochastic systems, IEEE Transactions on Automatic Control 69(9),
6121–6136.
N. Ho-Nguyen and F. Kılınç-Karzan (2018), Online first-order framework for robust convex
optimization, Operations Research 66(6), 1670–1692.
N. Ho-Nguyen and F. Kılınç-Karzan (2019), Exploiting problem structure in optimization
under uncertainty via online convex optimization, Mathematical Programming 177(1),
113–147.
N. Ho-Nguyen and S. J. Wright (2023), Adversarial classification via distributional robust-
ness with Wasserstein ambiguity, Mathematical Programming 198(2), 1411–1447.
N. Ho-Nguyen, F. Kılınç-Karzan, S. Küçükyavuz and D. Lee (2022), Distributionally
robust chance-constrained programs with right-hand side uncertainty under Wasserstein
ambiguity, Mathematical Programming 196(1–2), 641–672.
P. Honeyman, R. E. Ladner and M. Yannakakis (1980), Testing the universal instance
assumption, Information Processing Letters 10(1), 14–19.
L. J. Hong, Z. Huang and H. Lam (2021), Learning-based robust optimization: Procedures
and statistical guarantees, Management Science 67(6), 3447–3467.
R. A. Horn and C. R. Johnson (1985), Matrix Analysis, Cambridge University Press.
S. Hou, P. Kassraie, A. Kratsios, A. Krause and J. Rothfuss (2023), Instance-dependent
generalization bounds via optimal transport, Journal of Machine Learning Research
24(1), 16815–16865.
M. Hsu, M. Bhatt, R. Adolphs, D. Tranel and C. F. Camerer (2005), Neural systems
responding to degrees of uncertainty in human decision-making, Science 310(5754),
1680–1683.
Y. Hu, X. Chen and N. He (2021), On the bias-variance-cost tradeoff of stochastic optim-
ization, in Advances in Neural Information Processing Systems, pp. 22119–22131.
Y. Hu, J. Wang, X. Chen and N. He (2024), Multi-level Monte-Carlo gradient methods for
stochastic optimization with biased oracles, arXiv:2408.11084.
Z. Hu and L. J. Hong (2013), Kullback-Leibler divergence constrained distributionally
robust optimization, Available from Optimization Online.
Z. Hu, L. J. Hong and A. M.-C. So (2013), Ambiguous probabilistic programs, Available
from Optimization Online.
K. Huang, H. Yang, I. King, M. R. Lyu and L. Chan (2004), The minimum error minimax
probability machine, Journal of Machine Learning Research 5, 1253–1286.
P. Huber (1981), Robust Statistics, Wiley.
P. J. Huber (1964), Robust estimation of a location parameter, The Annals of Mathematical
Statistics 35(1), 73–101.
P. J. Huber (1967), The behavior of maximum likelihood estimates under nonstandard
conditions, in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics
and Probability, pp. 221–233.
P. J. Huber (1968), Robust confidence limits, Zeitschrift für Wahrscheinlichkeitstheorie
und verwandte Gebiete 10(4), 269–278.
H. Husain (2020), Distributional robustness with IPMs and links to regularization and
GANs, in Advances in Neural Information Processing Systems, pp. 11816–11827.
K. Isii (1960), The extrema of probability determined by generalized moments (I) Bounded
random variables, Annals of the Institute of Statistical Mathematics 12(2), 119–134.
K. Isii (1962), On sharpness of Tchebycheff-type inequalities, Annals of the Institute of
Statistical Mathematics 14(1), 185–197.
J. E. Kelley, Jr (1960), The cutting-plane method for solving convex programs, Journal of
the Society for Industrial and Applied Mathematics 8(4), 703–712.
C. Kent, J. Li, J. Blanchet and P. W. Glynn (2021), Modified Frank Wolfe in probability
space, in Advances in Neural Information Processing Systems, pp. 14448–14462.
J. M. Keynes (1921), A Treatise on Probability, Macmillan.
L. G. Khachiyan (1979), A polynomial algorithm in linear programming, Doklady Akademii
Nauk 244(5), 1093–1096.
H. K. Khalil (1996), Nonlinear Systems, Prentice Hall.
A. J. King and R. T. Rockafellar (1993), Asymptotic theory for solutions in statistical
estimation and stochastic programming, Mathematics of Operations Research 18(1),
148–162.
A. J. King and R. J.-B. Wets (1991), Epi-consistency of convex stochastic programs,
Stochastics and Stochastic Reports 34(1-2), 83–92.
D. Klabjan, D. Simchi-Levi and M. Song (2013), Robust stochastic lot-sizing by means of
histograms, Production and Operations Management 22(3), 691–710.
F. H. Knight (1921), Risk, Uncertainty and Profit, Houghton Mifflin.
Ç. Koçyiğit, G. Iyengar, D. Kuhn and W. Wiesemann (2020), Distributionally robust
mechanism design, Management Science 66(1), 159–189.
Ç. Koçyiğit, N. Rujeerapaiboon and D. Kuhn (2022), Robust multidimensional pricing:
Separation without regret, Mathematical Programming 196(1–2), 841–874.
V. Koltchinskii (2011), Oracle Inequalities in Empirical Risk Minimization and Sparse
Recovery Problems, Springer.
P. Kouvelis and G. Yu (1997), Robust Discrete Optimization and Its Applications, Springer.
A. L. Krain, A. M. Wilson, R. Arbuckle, F. X. Castellanos and M. P. Milham (2006),
Distinct neural mechanisms of risk and ambiguity: A meta-analysis of decision-making,
NeuroImage 32(1), 477–484.
S. G. Krantz and H. R. Parks (2002), A Primer of Real Analytic Functions, Springer.
D. Kuhn (2005), Generalized Bounds for Convex Multistage Stochastic Programs, Springer.
D. Kuhn, P. Mohajerin Esfahani, V. A. Nguyen and S. Shafieezadeh-Abadeh (2019),
Wasserstein distributionally robust optimization: Theory and applications in machine
learning, INFORMS Tutorials in Operations Research pp. 130–166.
S. Kullback (1959), Information Theory and Statistics, Wiley.
M. Kupper and W. Schachermayer (2009), Representation results for law invariant time
consistent functions, Mathematics and Financial Economics 2(3), 189–210.
A. Kurakin, I. J. Goodfellow and S. Bengio (2022), Adversarial machine learning at scale,
in International Conference on Learning Representations.
S. Kusuoka (2001), On law invariant coherent risk measures, in Advances in Mathematical
Economics (S. Kusuoka and T. Maruyama, eds), Springer, pp. 83–95.
Y. Kwon, W. Kim, J.-H. Won and M. C. Paik (2020), Principled learning method for
Wasserstein distributionally robust optimization with local perturbations, in Interna-
tional Conference on Machine Learning, pp. 5567–5576.
D. N. Lal (1955), A note on a form of Tchebycheff’s inequality for two or more variables,
Sankhyā: The Indian Journal of Statistics 15(3), 317–320.
H. Lam (2016), Robust sensitivity analysis for stochastic systems, Mathematics of Opera-
tions Research 41(4), 1248–1275.
B. C. Levy (2008), Robust hypothesis testing with a relative entropy tolerance, IEEE
Transactions on Information Theory 55(1), 413–421.
B. C. Levy and R. Nikoukhah (2004), Robust least-squares estimation with a relative
entropy constraint, IEEE Transactions on Information Theory 50(1), 89–104.
B. C. Levy and R. Nikoukhah (2012), Robust state space filtering under incremental model
perturbations subject to a relative entropy tolerance, IEEE Transactions on Automatic
Control 58(3), 682–695.
D. Levy, Y. Carmon, J. C. Duchi and A. Sidford (2020), Large-scale methods for distri-
butionally robust optimization, in Advances in Neural Information Processing Systems,
pp. 8847–8860.
B. Li, R. Jiang and J. L. Mathieu (2016), Distributionally robust risk-constrained optimal
power flow using moment and unimodality information, in IEEE Conference on Decision
and Control, pp. 2425–2430.
B. Li, R. Jiang and J. L. Mathieu (2019a), Ambiguous risk constraints with moment and
unimodality information, Mathematical Programming 173(1-2), 151–192.
C. Li, U. Turmunkh and P. P. Wakker (2019b), Trust as a decision under ambiguity,
Experimental Economics 22(1), 51–75.
D. Li and S. Martínez (2020), Data assimilation and online optimization with performance
guarantees, IEEE Transactions on Automatic Control 66(5), 2115–2129.
J. Li, C. Chen and A. M.-C. So (2020), Fast epigraphical projection-based incremental
algorithms for Wasserstein distributionally robust support vector machine, in Advances
in Neural Information Processing Systems, pp. 4029–4039.
J. Li, S. Huang and A. M.-C. So (2019c), A first-order algorithmic framework for Wasser-
stein distributionally robust logistic regression, in Advances in Neural Information Pro-
cessing Systems, pp. 3937–3947.
J. Li, S. Lin, J. Blanchet and V. A. Nguyen (2022), Tikhonov regularization is optimal trans-
port robust under martingale constraints, in Advances in Neural Information Processing
Systems, pp. 17677–17689.
J. Y.-M. Li (2018), Closed-form solutions for worst-case law invariant risk measures with
application to robust portfolio optimization, Operations Research 66(6), 1533–1541.
J. Y.-M. Li and T. Mao (2022), A general Wasserstein framework for data-driven distribu-
tionally robust optimization: Tractability and applications, arXiv:2207.09403.
M. Li, T. Sutter and D. Kuhn (2021), Distributionally robust optimization with Markovian
data, in International Conference on Machine Learning, pp. 6493–6503.
Z. Li, R. Ding and C. A. Floudas (2011), A comparative theoretical and computational
study on robust counterpart optimization: I. Robust linear optimization and robust
mixed integer linear optimization, Industrial & Engineering Chemistry Research 50(18),
10567–10603.
F. Liese and I. Vajda (1987), Convex Statistical Distances, Teubner.
S. Lin, J. Blanchet, P. Glynn and V. A. Nguyen (2024), Small sample behavior of
Wasserstein projections, connections to empirical likelihood, and other applications,
arXiv:2408.11753.
F. Liu, Z. Chen, R. Wang and S. Wang (2024a), Newsvendor under mean-variance ambi-
guity and misspecification, arXiv:2405.07008.
J. Liu, Z. Su and H. Xu (2024b), Bayesian distributionally robust Nash equilibrium and its
application, arXiv:2410.20364.
Z. Liu and P.-L. Loh (2023), Robust W-GAN-based estimation under Wasserstein contam-
ination, Information and Inference: A Journal of the IMA 12(1), 312–362.
Z. Liu, B. P. Van Parys and H. Lam (2023), Smoothed 𝑓 -divergence distribution-
ally robust optimization: Exponential rate efficiency and complexity-free calibration,
arXiv:2306.14041.
D. Z. Long, J. Qi and A. Zhang (2024), Supermodularity in two-stage distributionally
robust optimization, Management Science 70(3), 1394–1409.
C. Lyu, K. Huang and H.-N. Liang (2015), A unified gradient regularization family for
adversarial examples, in International Conference on Data Mining, pp. 301–309.
A. Madansky (1959), Bounds on the expectation of a convex function of a multivariate
random variable, The Annals of Mathematical Statistics 30(3), 743–746.
A. Mądry, A. Makelov, L. Schmidt, D. Tsipras and A. Vladu (2018), Towards deep learn-
ing models resistant to adversarial attacks, in International Conference on Learning
Representations.
C. Maheshwari, C.-Y. Chiu, E. Mazumdar, S. Sastry and L. Ratliff (2022), Zeroth-order
methods for convex-concave min-max problems: Applications to decision-dependent
risk minimization, in International Conference on Artificial Intelligence and Statistics,
pp. 6702–6734.
H.-Y. Mak, Y. Rong and J. Zhang (2015), Appointment scheduling with limited distribu-
tional information, Management Science 61(2), 316–334.
A. Markov (1884), On certain applications of algebraic continued fractions, PhD thesis, St
Petersburg (in Russian).
A. W. Marshall and I. Olkin (1960), A one-sided inequality of the Chebyshev type, The
Annals of Mathematical Statistics 31(2), 488–491.
K. Marton (1986), A simple proof of the blowing-up lemma, IEEE Transactions on In-
formation Theory 32(3), 445–446.
A. Maurer and M. Pontil (2009), Empirical Bernstein bounds and sample variance penal-
ization, in Conference on Learning Theory.
R. D. McAllister and P. Mohajerin Esfahani (2023), Distributionally robust model predictive
control: Closed-loop guarantees and scalable algorithms, arXiv:2309.12758.
A. McNeil, R. Frey and P. Embrechts (2015), Quantitative Risk Management: Concepts,
Techniques and Tools, Princeton University Press.
S. Mendelson (2003), A few notes on statistical learning theory, in Advanced Lectures on
Machine Learning (S. Mendelson and A. J. Smola, eds), Springer, pp. 1–40.
R. O. Michaud (1989), The Markowitz optimization enigma: Is ‘optimized’ optimal?,
Financial Analysts Journal 45(1), 31–42.
J. Milz and M. Ulbrich (2020), An approximation scheme for distributionally robust non-
linear optimization, SIAM Journal on Optimization 30(3), 1996–2025.
J. Milz and M. Ulbrich (2022), An approximation scheme for distributionally robust PDE-
constrained optimization, SIAM Journal on Control and Optimization 60(3), 1410–1435.
V. K. Mishra, K. Natarajan, D. Padmanabhan, C.-P. Teo and X. Li (2014), On theoretical
and empirical aspects of marginal distribution choice models, Management Science
60(6), 1511–1531.
V. K. Mishra, K. Natarajan, H. Tao and C.-P. Teo (2012), Choice prediction with semidefin-
ite optimization when utilities are correlated, IEEE Transactions on Automatic Control
57(10), 2450–2463.
S. Peng (1997), Backward SDE and related G-expectation, in Backward Stochastic Dif-
ferential Equations in Finance (N. El Karoui, S. Peng and M. C. Quenez, eds), Wiley,
pp. 141–160.
S. Peng (2007a), G-Brownian motion and dynamic risk measure under volatility uncer-
tainty, arXiv:0711.2834.
S. Peng (2007b), G-expectation, G-Brownian motion and related stochastic calculus of Itô
type, in Stochastic Analysis and Applications (F. E. Benth, G. Di Nunno, T. Lindstrom,
B. Oksendal and T. Zhang, eds), Springer, pp. 541–567.
S. Peng (2019), Nonlinear Expectations and Stochastic Calculus under Uncertainty: With
Robust CLT and G-Brownian Motion, Springer.
S. Peng (2023), G-Gaussian processes under sublinear expectations and q-Brownian motion
in quantum mechanics, Numerical Algebra, Control and Optimization 13(3-4), 583–603.
G. Perakis and G. Roels (2008), Regret in the newsvendor model with partial information,
Operations Research 56(1), 188–203.
S. Pesenti, Q. Wang and R. Wang (2024), Optimizing distortion riskmetrics with distribu-
tional uncertainty, Mathematical Programming (Forthcoming).
G. C. Pflug and A. Pichler (2014), Multistage Stochastic Optimization, Springer.
G. C. Pflug and D. Wozabal (2007), Ambiguity in portfolio selection, Quantitative Finance
7(4), 435–442.
G. C. Pflug, A. Pichler and D. Wozabal (2012), The 1/𝑁 investment strategy is optimal
under high model ambiguity, Journal of Banking & Finance 36(2), 410–417.
R. R. Phelps (1965), Lectures on Choquet’s Theorem, van Nostrand Mathematical Studies.
A. B. Philpott, V. L. de Matos and L. Kapelevich (2018), Distributionally robust SDDP,
Computational Management Science 15, 431–454.
A. Pichler (2013), Evaluations of risk measures for different probability measures, SIAM
Journal on Optimization 23(1), 530–551.
I. Pinelis (2016), On the extreme points of moments sets, Mathematical Methods of Oper-
ations Research 83(3), 325–349.
I. Pólik and T. Terlaky (2007), A survey of the S-lemma, SIAM Review 49(3), 371–418.
Y. Polyanskiy and Y. Wu (2024), Information Theory: From Coding to Learning, Cam-
bridge University Press.
I. Popescu (2005), A semidefinite programming approach to optimal-moment bounds for
convex classes of distributions, Mathematics of Operations Research 30(3), 632–657.
I. Popescu (2007), Robust mean-covariance solutions for stochastic optimization, Opera-
tions Research 55(1), 98–112.
K. Postek and S. Shtern (2024), First-order algorithms for robust optimization problems via
convex-concave saddle-point Lagrangian reformulation, INFORMS Journal on Comput-
ing (Forthcoming).
K. Postek, A. Ben-Tal, D. den Hertog and B. Melenberg (2018), Robust optimization with
ambiguous stochastic constraints under mean and dispersion information, Operations
Research 66(3), 814–833.
K. Postek, D. den Hertog and B. Melenberg (2016), Computationally tractable counterparts
of distributionally robust constraints on risk measures, SIAM Review 58(4), 603–650.
K. Postek, W. Romeijnders, D. den Hertog and M. H. van der Vlerk (2019), An approxima-
tion framework for two-stage ambiguous stochastic integer programs under mean-MAD
information, European Journal of Operational Research 274(2), 432–444.
G. Puccetti and L. Rüschendorf (2013), Sharp bounds for sums of dependent risks, Journal
of Applied Probability 50(1), 42–53.
M. S. Pydi and V. Jog (2021), Adversarial risk via optimal transport and optimal couplings,
IEEE Transactions on Information Theory 67(9), 6031–6052.
M. S. Pydi and V. Jog (2024), The many faces of adversarial risk: An expanded study,
IEEE Transactions on Information Theory 70(1), 550–570.
H. Rahimian and S. Mehrotra (2022), Frameworks and results in distributionally robust
optimization, Open Journal of Mathematical Optimization 3, 1–85.
H. Rahimian, G. Bayraksan and T. Homem-de-Mello (2019a), Controlling risk and demand
ambiguity in newsvendor models, European Journal of Operational Research 279(3),
854–868.
H. Rahimian, G. Bayraksan and T. Homem-de-Mello (2019b), Identifying effective scen-
arios in distributionally robust stochastic programs with total variation distance, Math-
ematical Programming 173(1), 393–430.
H. Rahimian, G. Bayraksan and T. Homem-de-Mello (2022), Effective scenarios in
multistage distributionally robust optimization with a focus on total variation distance,
SIAM Journal on Optimization 32(3), 1698–1727.
M. D. Reid and R. C. Williamson (2011), Information, divergence and risk for binary
experiments, Journal of Machine Learning Research 12(22), 731–817.
H. Richter (1957), Parameterfreie Abschätzung und Realisierung von Erwartungswerten,
Blätter der DGVFM 3(2), 147–162.
R. T. Rockafellar (1970), Convex Analysis, Princeton University Press.
R. T. Rockafellar (1974), Conjugate Duality and Optimization, SIAM.
R. T. Rockafellar and J. O. Royset (2013), Superquantiles and their applications to risk,
random variables, and regression, INFORMS Tutorials in Operations Research pp. 151–
167.
R. T. Rockafellar and J. O. Royset (2014), Random variables, monotone relations, and
convex analysis, Mathematical Programming 148(1-2), 297–331.
R. T. Rockafellar and J. O. Royset (2015), Measures of residual risk with connections to re-
gression, risk tracking, surrogate models, and ambiguity, SIAM Journal on Optimization
25(2), 1179–1208.
R. T. Rockafellar and S. Uryasev (2000), Optimization of conditional value-at-risk, Journal
of Risk 2(3), 21–41.
R. T. Rockafellar and S. Uryasev (2002), Conditional value-at-risk for general loss distri-
butions, Journal of Banking & Finance 26(7), 1443–1471.
R. T. Rockafellar and S. Uryasev (2013), The fundamental risk quadrangle in risk man-
agement, optimization and statistical estimation, Surveys in Operations Research and
Management Science 18(1-2), 33–53.
R. T. Rockafellar and R. J.-B. Wets (2009), Variational Analysis, Springer.
R. T. Rockafellar, S. Uryasev and M. Zabarankin (2006), Generalized deviations in risk
analysis, Finance and Stochastics 10(1), 51–74.
R. T. Rockafellar, S. Uryasev and M. Zabarankin (2008), Risk tuning with generalized
linear regression, Mathematics of Operations Research 33(3), 712–729.
W. W. Rogosinski (1958), Moments of non-negative mass, Proceedings of the Royal Society
of London. Series A. Mathematical and Physical Sciences 245(1240), 1–27.
N. Rontsis, M. A. Osborne and P. J. Goulart (2020), Distributionally ambiguous optimiz-
ation for batch Bayesian optimization, Journal of Machine Learning Research 21(149),
1–26.
S. Shafiee and D. Kuhn (2024), Minimax theorems and Nash equilibria in distributionally
robust optimization problems, Working Paper.
S. Shafiee, L. Aolaritei, F. Dörfler and D. Kuhn (2023), New perspectives on regulariz-
ation and computation in optimal transport-based distributionally robust optimization,
arXiv:2303.03900.
S. Shafieezadeh-Abadeh, D. Kuhn and P. Mohajerin Esfahani (2019), Regularization via
mass transportation, Journal of Machine Learning Research 20(103), 1–68.
S. Shafieezadeh-Abadeh, P. Mohajerin Esfahani and D. Kuhn (2015), Distributionally
robust logistic regression, in Advances in Neural Information Processing Systems,
pp. 1576–1584.
S. Shafieezadeh-Abadeh, V. A. Nguyen, D. Kuhn and P. Mohajerin Esfahani (2018),
Wasserstein distributionally robust Kalman filtering, in Advances in Neural Information
Processing Systems, pp. 8474–8483.
S. Shalev-Shwartz (2012), Online learning and online convex optimization, Foundations
and Trends in Machine Learning 4(2), 107–194.
S. Shalev-Shwartz and S. Ben-David (2014), Understanding Machine Learning: From
Theory to Algorithms, Cambridge University Press.
A. Shapiro (1989), Asymptotic properties of statistical estimators in stochastic program-
ming, The Annals of Statistics 17(2), 841–858.
A. Shapiro (1990), On differential stability in stochastic programming, Mathematical
Programming 47(1-3), 107–116.
A. Shapiro (1991), Asymptotic analysis of stochastic programs, Annals of Operations
Research 30(1), 169–186.
A. Shapiro (1993), Asymptotic behavior of optimal solutions in stochastic programming,
Mathematics of Operations Research 18(4), 829–845.
A. Shapiro (2001), On duality theory of conic linear problems, in Semi-Infinite Program-
ming (M. Á. Goberna and M. A. López, eds), Kluwer Academic Publishers, pp. 135–165.
A. Shapiro (2003), Monte Carlo sampling methods, in Stochastic Programming
(A. Ruszczyński and A. Shapiro, eds), Elsevier, pp. 353–425.
A. Shapiro (2013), On Kusuoka representation of law invariant risk measures, Mathematics
of Operations Research 38(1), 142–152.
A. Shapiro (2017), Distributionally robust stochastic programming, SIAM Journal on
Optimization 27(4), 2258–2275.
A. Shapiro and A. Kleywegt (2002), Minimax analysis of stochastic problems, Optimization
Methods and Software 17(3), 523–542.
A. Shapiro, D. Dentcheva and A. Ruszczyński (2009), Lectures on Stochastic Program-
ming: Modeling and Theory, SIAM.
A. Shapiro, E. Zhou and Y. Lin (2023), Bayesian distributionally robust optimization, SIAM
Journal on Optimization 33(2), 1279–1304.
K. S. Shehadeh (2023), Distributionally robust optimization approaches for a stochastic
mobile facility fleet sizing, routing, and scheduling problem, Transportation Science
57(1), 197–229.
K. S. Shehadeh, A. E. Cohn and R. Jiang (2020), A distributionally robust optimiza-
tion approach for outpatient colonoscopy scheduling, European Journal of Operational
Research 283(2), 549–561.
H. Shen and R. Jiang (2023), Chance-constrained set covering with Wasserstein ambiguity,
Mathematical Programming 198(1), 621–674.
B. Taşkesen, M.-C. Yue, J. Blanchet, D. Kuhn and V. A. Nguyen (2021), Sequential domain
adaptation by synthesizing distributionally robust experts, in International Conference
on Machine Learning, pp. 10162–10172.
A. H. Tchen (1980), Inequalities for distributions with given marginals, The Annals of
Probability 8(4), 814–827.
A. Terpin, N. Lanzetti and F. Dörfler (2024), Dynamic programming in probability spaces
via optimal transport, SIAM Journal on Control and Optimization 62(2), 1183–1206.
A. Terpin, N. Lanzetti, B. Yardim, F. Dörfler and G. Ramponi (2022), Trust region policy
optimization with optimal transport discrepancies: Duality and algorithm for continuous
actions, in Advances in Neural Information Processing Systems, pp. 19786–19797.
Y. L. Tong (1980), Probability Inequalities in Multivariate Distributions, Academic Press.
F. Tramèr, N. Papernot, I. Goodfellow, D. Boneh and P. McDaniel (2017), The space of
transferable adversarial examples, arXiv:1704.03453.
M. Y. Tsang and K. S. Shehadeh (2024), On the trade-off between distributional belief
and ambiguity: Conservatism, finite-sample guarantees, and asymptotic properties,
arXiv:2410.19234.
K. Tu, Z. Chen and M.-C. Yue (2024), A max-min-max algorithm for large-scale robust
optimization, arXiv:2404.05377.
Z. Tu, J. Zhang and D. Tao (2019), Theoretical analysis of adversarial learning: A minimax
approach, in Advances in Neural Information Processing Systems, pp. 12280–12290.
A. van der Vaart and J. A. Wellner (2000), Preservation theorems for Glivenko-Cantelli
and uniform Glivenko-Cantelli classes, in High Dimensional Probability II (E. Giné,
D. M. Mason and J. A. Wellner, eds), Springer, pp. 115–133.
A. W. van der Vaart (1998), Asymptotic Statistics, Cambridge University Press.
W. J. van Eekelen, D. den Hertog and J. S. van Leeuwaarden (2022), MAD dispersion
measure makes extremal queue analysis simple, INFORMS Journal on Computing 34(3),
1681–1692.
W. J. van Eekelen, G. A. Hanasusanto, J. J. Hasenbein and J. S. van Leeuwaarden (2023),
Second-order bounds for the M/M/s queue with random arrival rate, arXiv:2310.09995.
J. S. Van Leeuwaarden and C. Stegehuis (2021), Robust subgraph counting with
distribution-free random graph analysis, Physical Review E 104(4), 044313.
B. P. Van Parys (2024), Efficient data-driven optimization with noisy data, Operations
Research Letters 54, Article 107089.
B. P. Van Parys and N. Golrezaei (2024), Optimal learning for structured bandits, Manage-
ment Science 70(6), 3951–3998.
B. P. Van Parys, P. J. Goulart and P. Embrechts (2016a), Fréchet inequalities via convex
optimization, Available from Optimization Online.
B. P. Van Parys, P. J. Goulart and D. Kuhn (2016b), Generalized Gauss inequalities via
semidefinite programming, Mathematical Programming 156(1-2), 271–302.
B. P. Van Parys, P. J. Goulart and M. Morari (2019), Distributionally robust expectation
inequalities for structured distributions, Mathematical Programming 173(1-2), 251–280.
B. P. Van Parys, D. Kuhn, P. J. Goulart and M. Morari (2015), Distributionally robust
control of constrained stochastic systems, IEEE Transactions on Automatic Control
61(2), 430–442.
B. P. Van Parys, P. Mohajerin Esfahani and D. Kuhn (2021), From data to decisions:
Distributionally robust optimization is optimal, Management Science 67(6), 3387–3402.
V. Vapnik (2013), The Nature of Statistical Learning Theory, Springer.
J.-J. Zhu, W. Jitkrittum, M. Diehl and B. Schölkopf (2021), Kernel distributionally robust
optimization: Generalized duality theorem and stochastic approximation, in Interna-
tional Conference on Artificial Intelligence and Statistics, pp. 280–288.
L. Zhu, M. Gürbüzbalaban and A. Ruszczyński (2023), Distributionally robust learn-
ing with weakly convex losses: Convergence rates and finite-sample guarantees,
arXiv:2301.06619.
S. Zhu, L. Xie, M. Zhang, R. Gao and Y. Xie (2022b), Distributionally robust weighted
𝑘-nearest neighbors, in Advances in Neural Information Processing Systems, pp. 29088–
29100.
M. Zorzi (2014), Multivariate spectral estimation based on the concept of optimal predic-
tion, IEEE Transactions on Automatic Control 60(6), 1647–1652.
M. Zorzi (2016), Robust Kalman filtering under model perturbations, IEEE Transactions
on Automatic Control 62(6), 2902–2907.
M. Zorzi (2017a), Convergence analysis of a family of robust Kalman filters based on the
contraction principle, SIAM Journal on Control and Optimization 55(5), 3116–3131.
M. Zorzi (2017b), On the robustness of the Bayes and Wiener estimators under model
uncertainty, Automatica 83, 133–140.
L. F. Zuluaga and J. F. Peña (2005), A conic programming approach to generalized
Tchebycheff inequalities, Mathematics of Operations Research 30(2), 369–388.
S. Zymler, D. Kuhn and B. Rustem (2013a), Distributionally robust joint chance constraints
with second-order moment information, Mathematical Programming 137(1-2), 167–
198.
S. Zymler, D. Kuhn and B. Rustem (2013b), Worst-case value at risk of nonlinear portfolios,
Management Science 59(1), 172–188.