
Distributionally Robust Optimization

arXiv:2411.02549v1 [math.OC] 4 Nov 2024

Daniel Kuhn
Risk Analytics and Optimization Chair,
École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
E-mail: daniel.kuhn@epfl.ch

Soroosh Shafiee
School of Operations Research and Information Engineering,
Cornell University, Ithaca, NY, USA
E-mail: shafiee@cornell.edu

Wolfram Wiesemann
Imperial College Business School,
Imperial College London, London, United Kingdom
E-mail: ww@imperial.ac.uk
Distributionally robust optimization (DRO) studies decision problems under uncer-
tainty where the probability distribution governing the uncertain problem parameters
is itself uncertain. A key component of any DRO model is its ambiguity set, that
is, a family of probability distributions consistent with any available structural or
statistical information. DRO seeks decisions that perform best under the worst dis-
tribution in the ambiguity set. This worst-case criterion is supported by findings
in psychology and neuroscience, which indicate that many decision-makers have a
low tolerance for distributional ambiguity. DRO is rooted in statistics, operations re-
search and control theory, and recent research has uncovered its deep connections to
regularization techniques and adversarial training in machine learning. This survey
presents the key findings of the field in a unified and self-contained manner.

CONTENTS
1 Introduction
2 Ambiguity Sets
3 Topological Properties of Ambiguity Sets
4 Duality Theory for Worst-Case Expectation Problems
5 Duality Theory for Worst-Case Risk Problems
6 Analytical Solutions of Nature’s Subproblem
7 Finite Convex Reformulations of Nature’s Subproblem
8 Regularization by Robustification
9 Numerical Solution Methods for DRO Problems
10 Statistical Guarantees
References

1. Introduction
Traditionally, mathematical optimization studies problems of the form
inf_{𝑥∈X} ℓ(𝑥),

where a decision 𝑥 is sought from the set X ⊆ R𝑛 of feasible solutions that minim-
izes a loss function ℓ : R𝑛 → R. With its early roots in the development of calculus
by Isaac Newton, Gottfried Wilhelm Leibniz, Pierre de Fermat and others in the
late 17th century, mathematical optimization has a rich history that involves con-
tributions from numerous mathematicians, economists, engineers, and scientists.
The birth of modern mathematical optimization is commonly credited to George
Dantzig, whose simplex algorithm developed in 1947 solves linear optimization
problems where ℓ is affine and X is a polyhedron (Dantzig 1956). Subsequent mile-
stones include the development of the rich theory of convex analysis (Rockafellar
1970) as well as the discovery of polynomial-time solution methods for linear
(Khachiyan 1979, Karmarkar 1984) and broad classes of nonlinear convex optim-
ization problems (Nesterov and Nemirovskii 1994).
Classical optimization problems are deterministic, that is, all problem data are as-
sumed to be known with certainty. However, most decision problems encountered
in practice depend on parameters that are corrupted by measurement errors or that
are revealed only after a decision must be determined and committed. A naïve
approach to model uncertainty-affected decision problems as deterministic optim-
ization problems would be to replace all uncertain parameters with their expected
values or with appropriate point predictions. However, it has long been known
and well-documented that decision-makers who replace an uncertain parameter of
an optimization problem with its mean value fall victim to the ‘flaw of averages’
(Savage, Scholtes and Zweidler 2006, Savage 2012). In order to account for un-
certainty realizations that deviate from the mean value, Beale (1955) and Dantzig
(1955) independently introduced stochastic programs of the form
inf_{𝑥∈X} E_P [ℓ(𝑥, 𝑍)] ,    (1.1)

which explicitly model the uncertain problem parameters 𝑍 as a random vector that is governed by a probability distribution P, and where a decision is sought that
performs best in expectation (or, subsequently, according to some risk measure).
Since then, stochastic programming has grown into a mature field (Birge and
Louveaux 2011, Shapiro, Dentcheva and Ruszczyński 2009), and it provides the
theoretical underpinnings of the empirical risk minimization principle in machine
learning (Bishop 2006, Hastie, Tibshirani and Friedman 2009).
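The ‘flaw of averages’ mentioned above is easy to see numerically. The following sketch contrasts the naïve mean-substitution approach with the stochastic program (1.1), approximated here by a sample average; the newsvendor-style loss and all cost numbers are illustrative choices, not taken from the survey.

```python
import numpy as np

rng = np.random.default_rng(0)

# Newsvendor-style loss for order quantity x and demand z: unmet demand
# costs 4 per unit, leftover stock costs 1 per unit (illustrative numbers).
def loss(x, z):
    return 4.0 * np.maximum(z - x, 0.0) + 1.0 * np.maximum(x - z, 0.0)

z_samples = rng.exponential(scale=10.0, size=20_000)  # demand, E[Z] = 10
x_grid = np.linspace(0.0, 40.0, 401)

# 'Flaw of averages': replace Z by its mean and solve deterministically.
x_naive = x_grid[np.argmin([loss(x, 10.0) for x in x_grid])]

# Stochastic program (1.1), approximated by a sample average.
x_saa = x_grid[np.argmin([loss(x, z_samples).mean() for x in x_grid])]

print(x_naive, x_saa)  # the mean-based solution under-orders markedly
```

The mean-substitution solution orders exactly the mean demand, whereas the stochastic program hedges against demand spikes by ordering near the 0.8-quantile of the demand distribution.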
Despite their success in theory and practice, stochastic programs suffer from at
least two shortcomings. Firstly, the assumption that the probability distribution P is
known precisely is unrealistic in many practical settings, and stochastic programs
can be sensitive to mis-specifications of this distribution. This effect has been
described by different communities as the optimizer’s curse (Smith and Winkler 2006), the error-maximization effect of optimization (Michaud 1989, DeMiguel and Nogales 2009), the optimization bias (Shapiro 2003) or overfitting (Bishop 2006, Hastie et al. 2009). Secondly, evaluating the expected loss of a fixed decision
requires computing a multi-dimensional integral, which is provably hard already for
embarrassingly simple loss functions and distributions. Hence, stochastic programs
suffer from a curse of dimensionality, that is, their computational complexity
generically displays an exponential dependence on the dimension of the random
vector 𝑍. To alleviate both shortcomings, Soyster (1973) proposed to model
uncertainty-affected decision problems as robust optimization problems of the form
inf_{𝑥∈X} sup_{𝑧∈Z} ℓ(𝑥, 𝑧).

Robust optimization replaces the probabilistic description of the uncertain problem parameters with a set-based description and seeks decisions that perform best
in view of the worst anticipated parameter realization 𝑧 from within an uncertainty
set Z. After an extended period of neglect, the ideas of Soyster (1973) have been
revisited and substantially extended from the late nineties onwards by Kouvelis and Yu
(1997), El Ghaoui, Oustry and Lebret (1998), El Ghaoui and Lebret (1998a,b), Ben-
Tal and Nemirovski (1999b, 1998, 1999a), Bertsimas and Sim (2004) and others.
For reviews of the robust optimization literature, we refer to Ben-Tal, El Ghaoui
and Nemirovski (2009), Rustem and Howe (2009) and Bertsimas and den Hertog
(2022). We point out that similar ideas have been developed independently in the
areas of robust stability (Horn and Johnson 1985, Doyle, Glover, Khargonekar and
Francis 1989, Green and Limebeer 1995), which investigates whether a system
remains stable in the face of parameter variations, and robust control (Zames 1966,
Khalil 1996, Zhou, Doyle and Glover 1996), which designs systems that maintain
a desirable performance in the presence of parameter variations. For textbook
introductions to robust stability and control, we refer to Zhou and Doyle (1999) and
Dullerud and Paganini (2001). Hansen and Sargent (2008) adapt robust control
techniques to economic problems affected by model uncertainty, where they design
policies that perform well across a range of possible model mis-specifications.
While robust optimization reduces the informational and computational burden
that plagues stochastic programs, its equal treatment of all parameter realizations
within the uncertainty set and its exclusive focus on worst-case scenarios can
make it overly conservative for practical applications. These concerns prompted
researchers to study distributionally robust optimization problems of the form
inf_{𝑥∈X} sup_{P∈P} E_P [ℓ(𝑥, 𝑍)] ,    (1.2)

which model the uncertain problem parameters 𝑍 as a random vector that is gov-
erned by some distribution P from within an ambiguity set P, and where a decision
is sought that performs best in view of its expected value under the worst distribution
P ∈ P. Distributionally robust optimization (DRO) thus blends the distributional
perspective of stochastic programming with the worst-case focus of robust optim-
4 D. Kuhn, S. Shafiee, and W. Wiesemann

ization. Herbert E. Scarf is commonly credited with pioneering this approach in


his study on newsvendor problems where the uncertain demand distribution is only
characterized through its mean and variance (Scarf 1958). Subsequently, Dupačová
(1966, 1987, 1994) and Shapiro and Kleywegt (2002) have studied DRO problems
whose ambiguity sets specify the support, some lower-order moments, independ-
ence patterns or other structural properties of the unknown probability distribution.
Ermoliev, Gaivoronski and Nedeva (1985) and Gaivoronski (1991) have developed
early solution approaches for DRO problems over moment ambiguity sets. The
advent of modern DRO is often attributed to the works of Bertsimas and Popescu
(2002, 2005), who derive probability inequalities under partial distributional in-
formation and apply their techniques to option pricing problems, of El Ghaoui,
Oks and Oustry (2003) and Calafiore and El Ghaoui (2006), who study DRO prob-
lems where a quantile of the objective function should be minimized, or a set of
uncertainty-affected constraints should be satisfied with high probability, across all
probability distributions with known moment bounds, and of Delage and Ye (2010),
who study similar DRO problems with a worst-case expected value objective.
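The structure of the DRO problem (1.2) can be illustrated on a deliberately tiny instance. The sketch below assumes a finite support set and an ambiguity set containing just two candidate distributions (both invented for illustration), so that nature’s inner supremum can be evaluated by enumeration.

```python
import numpy as np

# Toy instance of (1.2): Z has finite support, the ambiguity set holds
# two candidate distributions, and the decision x is scalar.
support = np.array([0.0, 1.0, 2.0])
ambiguity = [np.array([0.5, 0.3, 0.2]),   # candidate distribution P1
             np.array([0.2, 0.3, 0.5])]   # candidate distribution P2

def loss(x, z):
    return (x - z) ** 2                   # quadratic tracking loss

def worst_case(x):                        # nature's subproblem, by enumeration
    return max(p @ loss(x, support) for p in ambiguity)

x_grid = np.linspace(0.0, 2.0, 201)
x_dro = x_grid[np.argmin([worst_case(x) for x in x_grid])]
print(x_dro)
```

The two candidate distributions have means 0.7 and 1.3, so the distributionally robust decision hedges between them and lands in the middle, unlike a stochastic program under either single distribution.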
Early research on DRO has primarily focused on moment ambiguity sets, which
contain all distributions on a prescribed support set Z that satisfy finitely many mo-
ment constraints. In contrast to stochastic programs, DRO problems with moment
ambiguity sets sometimes exhibit favorable scaling with respect to the dimension
of the random vector 𝑍. However, strikingly different distributions can share
identical moments. As a consequence, moment ambiguity sets always include a
wide range of distributions, including some implausible ones that can safely be
ruled out when ample historical data is available. This prompted Ben-Tal, den
Hertog, De Waegenaere, Melenberg and Rennen (2013) and Wang, Glynn and Ye
(2016) to introduce ambiguity sets that contain all distributions in some neighbor-
hood of a prescribed reference distribution (typically the empirical distribution that
is formed from historical data). These neighborhoods can be defined with respect
to a discrepancy function between probability distributions such as a 𝜙-divergence
(Csiszár 1963) or a Wasserstein distance (Villani 2008). Unlike moment ambigu-
ity sets, discrepancy-based ambiguity sets have a tunable size parameter (e.g., a
radius) and can thus be shrunk to a singleton that contains only the reference dis-
tribution. If the reference distribution converges to the unknown true distribution
and the size parameter decays to 0 as more historical data becomes available, then
the DRO problem eventually reduces to the classical stochastic program under the
true distribution. Early work on discrepancy-based ambiguity sets relies on the as-
sumption that 𝑍 is a discrete random vector with a finite support set Z. Extensions
to discrepancy-based DRO problems with generic (possibly continuous) random
vectors are due to Mohajerin Esfahani and Kuhn (2018), Zhao and Guan (2018),
Blanchet and Murthy (2019), Zhang, Yang and Gao (2024b) and Gao and Kleywegt
(2023), who construct ambiguity sets using optimal transport discrepancies. We
refer to Kuhn, Mohajerin Esfahani, Nguyen and Shafieezadeh-Abadeh (2019) and
Rahimian and Mehrotra (2022) for prior surveys of the DRO literature.

Historically, the term ‘distributional robustness’ has its roots in robust statistics.
The term was coined by Huber (1981) to describe methods aimed at making
robust decisions in the presence of outlier data points. This idea expanded upon
earlier works by Box (1953, 1979), who explores robustness in situations where the
underlying distribution deviates from normality, a common assumption underlying
many statistical models. To address the challenges posed by outliers, statisticians
have developed several contamination models, each offering a unique approach
to mitigating data irregularities. The Huber contamination model, introduced by
Huber (1964, 1968) and further developed by Hampel (1968, 1971), assumes that
the observed data is drawn from a mixture of the true distribution and an arbitrary
contaminating distribution. Neighborhood contamination models define deviations
from the true distribution in terms of statistical distances such as the total variation
(Donoho and Liu 1988) or Wasserstein distances (Zhu, Jiao and Steinhardt 2022a,
Liu and Loh 2023). More recently, data-dependent adaptive contamination models
allow for a fraction of the observed data points to be replaced with points drawn
from an arbitrary distribution (Diakonikolas, Kamath, Kane, Li, Moitra and Stewart
2019, Zhu et al. 2022a). Interestingly, the optimistic counterpart of a DRO model,
which optimizes in view of the best (as opposed to the worst) distribution in the
ambiguity set, recovers many estimators from robust statistics (Blanchet, Li, Lin and
Zhang 2024b, Jiang and Xie 2024). For a survey of recent advances in algorithmic
robust statistics we refer to Diakonikolas and Kane (2023).
Robust and distributionally robust optimization have found manifold applications
in machine learning. For example, popular regularizers from the machine learning
literature are known to admit a robustness interpretation, which offers theoretical
insights into the strong empirical performance of regularization in practice (Xu,
Caramanis and Mannor 2009, Shafieezadeh-Abadeh, Kuhn and Mohajerin Esfa-
hani 2019, Li, Lin, Blanchet and Nguyen 2022, Gao, Chen and Kleywegt 2024).
Likewise, optimistic counterparts of DRO models that optimize in view of the
best (as opposed to the worst) distribution in the ambiguity set give rise to upper
confidence bound algorithms that are ubiquitous in the bandit and reinforcement
learning literature (Blanchet et al. 2024b, Jiang and Xie 2024). DRO is also related
to adversarial training, which aims to improve the generalization performance of a
machine learning model by training it in view of adversarial examples (Goodfellow,
Shlens and Szegedy 2015). Adversarial examples are perturbations of existing data
points that are designed to mislead a model into making incorrect predictions.
There are also deep connections between DRO and extensions of stochastic (dy-
namic) programming that replace the expected value with coherent risk measures.
Similar to the expected value, a risk measure maps random variables to exten-
ded real numbers. In contrast to the expected value, which is risk-neutral since
it weighs positive and negative outcomes equally, risk measures most commonly
assign greater weights to negative outcomes and thus account for the risk aversion
frequently observed among decision-makers. Artzner, Delbaen, Eber and Heath
(1999) and Delbaen (2002) show that risk measures satisfying the axioms of coherence as well as a Fatou property can be equivalently represented as worst-case expectations over specific sets of distributions. In other words, there is a direct
link between optimizing worst-case expectations (as done in DRO) and optimizing
coherent risk measures. A similar representation theorem has been developed for
a class of nonlinear expectations, the so-called 𝐺-expectations that are based on
the solution of a backward stochastic differential equation, in the financial math-
ematics literature (Peng 1997, 2007a,b, 2019). Peng (2023) shows that sublinear
𝐺-expectations are equivalent to worst-case expectations over specific families of
distributions, thus creating a bridge between the theory of 𝐺-expectations and DRO.
Philosophically, DRO is related to the principle of ambiguity aversion, under
which individuals prefer known risks over unknown risks even when the unknown
risks promise potentially higher rewards. In the economics literature, the distinction
between risky outcomes whose probabilities are known and ambiguous outcomes
whose probabilities are (partially) unknown goes back to at least Keynes (1921) and
Knight (1921). The concept of ambiguity aversion has been widely popularized
through the Ellsberg paradox (Ellsberg 1961), a thought experiment under which
people are asked to choose between betting on an urn with a known distribution of
colored balls (e.g., 50 red and 50 blue) and an urn with an unknown distribution
of the same colored balls (i.e., the proportion of red to blue is unknown). Despite
the potential for equal or better odds, many people prefer to bet on the urn with
the known distribution, that is, they display ambiguity aversion. The Ellsberg
paradox challenges classical expected utility theory, and it has led to extensions
such as the maxmin expected utility theory (Gilboa and Schmeidler 1989) that serve
as theoretical underpinnings of DRO. Ambiguity aversion has subsequently been
identified in countless empirical economic studies across financial markets (Epstein
and Miao 2003, Bossaerts, Ghirardato, Guarnaschelli and Zame 2010), insurance
markets (Cabantous 2007), individual decision-making (Dimmock, Kouwenberg
and Wakker 2016), macroeconomic policy (Hansen and Sargent 2010), auctions
(Salo and Weber 1995) and games of trust (Li, Turmunkh and Wakker 2019b).
There is also substantial medical and neuroscientific evidence that supports the
presence of ambiguity aversion. Hsu, Bhatt, Adolphs, Tranel and Camerer (2005)
found that the amygdala, a key emotional processing center in the brain, becomes
more active when individuals are confronted with ambiguity compared to situations
with known probabilities, indicating its role in driving ambiguity aversion. A meta-
analysis by Krain, Wilson, Arbuckle, Castellanos and Milham (2006) highlights
the involvement of the prefrontal cortex, which is responsible for higher-order cog-
nitive control, rational decision-making, and emotional regulation, in processing
ambiguity. In addition, a meta-analysis of Wu, Sun, Camilleri, Eickhoff and Yu
(2021) shows that processing risk and ambiguity both rely on the anterior insula.
Risk processing additionally activates the dorsomedial prefrontal cortex and vent-
ral striatum, whereas ambiguity processing specifically engages the dorsolateral
prefrontal cortex, inferior parietal lobe, and right anterior insula. This supports the
notion that distinct neural mechanisms are engaged when individuals face ambiguous versus risky decisions. Genetic factors may influence an individual’s tendency
toward ambiguity aversion. He, Xue, Chen, Lu, Dong, Lei, Ding, Li, Li, Chen, Li,
Moyzis and Bechara (2010) link certain genetic polymorphisms to the perform-
ance of individuals in decision-making under risk and ambiguity. In a separate
study, Buckert, Schwieren, Kudielka and Fiebach (2014) examine how hormonal
changes, such as higher cortisol levels which are linked to stress and anxiety, affect
decision-making under risk and ambiguity. These findings collectively suggest that
perceptions of risk and ambiguity are not just a cognitive phenomenon but also in-
fluenced by brain structures and genetic and hormonal factors that shape individual
differences in decision-making under ambiguity. Finally, we mention Hartley and
Somerville (2015) and Blankenstein, Crone, van den Bos and van Duijvenvoorde
(2016), who examine how ambiguity aversion differs between children, adoles-
cents and adults, and Hayden, Heilbronner and Platt (2010), who observed that
rhesus macaque monkeys also exhibit ambiguity aversion when offered the choice
between risky and ambiguous games of large and small juice outcomes.
The remainder of this survey is structured as follows. A significant part of our
analysis is dedicated to studying the worst-case expectation sup_{P∈P} E_P [ℓ(𝑥, 𝑍)],
which constitutes the objective function of the DRO problem (1.2). Evaluating this
expression typically requires the solution of a semi-infinite optimization problem
over infinitely many variables that characterize the probability distribution P, sub-
ject to finitely many constraints imposed by the ambiguity set P. This problem,
which we refer to as nature’s subproblem, is the key feature that distinguishes
the DRO problem (1.2) from deterministic, stochastic, and robust optimization
problems. Sections 2 and 3 review commonly studied ambiguity sets P and their
topological properties, focusing especially on conditions under which nature’s sub-
problem attains its optimal value. Sections 4 and 5 develop a duality theory for
nature’s subproblem that allows us to upper bound or equivalently reformulate the
worst-case expectation with a semi-infinite optimization problem over finitely many
dual decision variables that are subject to infinitely many constraints. This duality
framework lays the foundations for the analytical solution of nature’s subproblem
in Section 6, which relies on constructing primal and dual feasible solutions that
yield the same objective value and thus enjoy strong duality. Sections 7 and 8 lever-
age the same duality theory to develop equivalent reformulations and conservative
approximations of nature’s subproblem as well as the overall DRO problem (1.2).
Section 9 demonstrates how the duality theory gives rise to numerical solution
techniques for nature’s subproblem and the full DRO problem. Finally, Section 10
reviews the statistical guarantees enjoyed by different ambiguity sets.
Length restrictions dictated difficult trade-offs in the choice of topics covered
by this survey. We decided to focus on the most commonly used ambiguity sets
and to only briefly review other possible choices, such as marginal ambiguity
sets, ambiguity sets with structural constraints (including, e.g., symmetry and
unimodality), Sinkhorn ambiguity sets or conditional relative entropy ambiguity
sets. Likewise, we do not cover the important but somewhat more advanced topics
of distributionally favourable optimization and decision randomization. Finally, we
focus on single-stage problems where the uncertainty is fully resolved after the here-
and-now decision 𝑥 ∈ X is taken; two-stage and multi-stage DRO problems, where
uncertainty unfolds over time and recourse decisions are possible, are reviewed by
Delage and Iancu (2015) and Yanıkoğlu, Gorissen and den Hertog (2019).

1.1. Notation
All vector spaces considered in this paper are defined over the real numbers. For
brevity, we simply refer to them as ‘vector spaces’ instead of ‘real vector spaces.’
We use R̄ = R ∪ {−∞, ∞} to denote the extended reals. The effective domain of
a function 𝑓 : R𝑑 → R is defined as dom( 𝑓 ) = {𝑧 ∈ R𝑑 : 𝑓 (𝑧) < ∞}, and the
epigraph of 𝑓 is defined as epi( 𝑓 ) = {(𝑧, 𝛼) ∈ R𝑑 × R : 𝑓 (𝑧) ≤ 𝛼}. We say that 𝑓
is proper if dom( 𝑓 ) ≠ ∅ and 𝑓 (𝑧) > −∞ for all 𝑧 ∈ R𝑑 . The convex conjugate of
𝑓 is the function 𝑓 ∗ : R𝑑 → R defined through 𝑓 ∗ (𝑦) = sup 𝑧 ∈R𝑑 𝑦 ⊤ 𝑧 − 𝑓 (𝑧). A
convex function 𝑓 is called closed if it is proper and lower semicontinuous or if it
is identically equal to +∞ or to −∞. One can show that 𝑓 is closed if and only if it
coincides with its bi-conjugate 𝑓 ∗∗ , that is, with the conjugate of 𝑓 ∗ . If 𝑓 is proper,
convex and lower semicontinuous, then its recession function 𝑓 ∞ : R𝑑 → R is
defined through 𝑓 ∞ (𝑧) = lim 𝛼→∞ 𝛼 −1 ( 𝑓 (𝑧0 + 𝛼𝑧) − 𝑓 (𝑧0 )), where 𝑧0 is any point
in dom( 𝑓 ) (Rockafellar 1970, Theorem 8.5). The perspective of 𝑓 is the function
𝑓 𝜋 : R𝑑 × R → R defined through 𝑓 𝜋 (𝑧, 𝑡) = 𝑡 𝑓 (𝑧/𝑡) if 𝑡 > 0, 𝑓 𝜋 (𝑧, 𝑡) = 𝑓 ∞ (𝑧)
if 𝑡 = 0 and 𝑓 𝜋 (𝑧, 𝑡) = ∞ if 𝑡 < 0. One can show that 𝑓 𝜋 is proper, convex
and lower semicontinuous (Rockafellar 1970, page 67). When there is no risk
of confusion, we occasionally use 𝑡 𝑓 (𝑧/𝑡) to denote 𝑓 𝜋 (𝑧, 𝑡) even if 𝑡 = 0. The
indicator function 𝛿Z : R𝑑 → R of a set Z ⊆ R𝑑 is defined through 𝛿Z (𝑧) = 0
if 𝑧 ∈ Z and 𝛿Z (𝑧) = ∞ if 𝑧 ∉ Z. The conjugate 𝛿Z∗ of 𝛿Z is called the support function of Z. Thus, it satisfies 𝛿Z∗ (𝑦) = sup_{𝑧∈Z} 𝑦⊤𝑧. Random objects are denoted
by capital letters (e.g., 𝑍) and their realizations are denoted by the corresponding
lowercase letters (e.g., 𝑧). For any closed set Z ⊆ R𝑑 , we use M(Z) to denote the
space of all finite signed Borel measures on Z, while M+ (Z) stands for the convex
cone of all (non-negative) Borel measures in M(Z), and P(Z) stands for the convex
set of all probability distributions in M+ (Z). The expectation operator with respect to P ∈ P(Z) is defined through E_P [𝑓 (𝑍)] = ∫_Z 𝑓 (𝑧) dP(𝑧) for any Borel function
𝑓 : Z → R. If the integrals of the positive and the negative parts of 𝑓 both evaluate
to ∞, then we define E P [ 𝑓 (𝑍)] ‘adversarially.’ That is, we set E P [ 𝑓 (𝑍)] = ∞ (−∞)
if the integral appears in the objective function of a minimization (maximization)
problem. The Dirac probability distribution that assigns unit probability to 𝑧 ∈ Z is
denoted as 𝛿 𝑧 . The Dirac distribution 𝛿 𝑧 should not be confused with the indicator
function 𝛿{𝑧} of the singleton {𝑧}. For any P ∈ P(Z) and any Borel measurable transformation 𝑓 : Z → Z′ between Borel sets Z ⊆ R𝑑 and Z′ ⊆ R𝑑′, we denote
by P ◦ 𝑓 −1 the pushforward distribution of P under 𝑓 . Thus, if 𝑍 is a random vector
on Z governed by P, then 𝑓 (𝑍) is a random vector on Z′ governed by P ◦ 𝑓 −1. The closure, the interior and the relative interior of a set Z ⊆ R𝑑 are denoted by cl(Z),
int(Z) and rint(Z), respectively. We use R𝑑+ and R𝑑++ to denote the non-negative orthant in R𝑑 and its interior. In addition, we use S𝑑 to denote the space of all symmetric matrices in R𝑑×𝑑. The cone of positive semidefinite matrices in S𝑑 is denoted by S𝑑+, and S𝑑++ stands for its interior, that is, the set of all positive definite matrices in S𝑑. The truth value 1E of a logical statement evaluates to 1 if E is true
and to 0 otherwise. The set of all natural numbers {1, 2, 3, . . .} is denoted by N,
and [𝑛] = {1, . . . , 𝑛} stands for the set of all integers up to 𝑛 ∈ N.
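As a sanity check on the definitions above, the support function of a box can be verified numerically: for Z = [−1, 1]², the conjugate of the indicator 𝛿Z is the ℓ1-norm of 𝑦. The grid-based maximization below is a brute-force sketch of sup_{𝑧∈Z} 𝑦⊤𝑧, not part of the survey.

```python
import numpy as np

# Support function of the box Z = [-1, 1]^2, i.e. the conjugate of the
# indicator delta_Z, compared against its closed form |y_1| + |y_2|.
grid = np.linspace(-1.0, 1.0, 201)
zs = np.array(np.meshgrid(grid, grid)).reshape(2, -1).T  # points of Z

def support_numeric(y):
    return float(np.max(zs @ y))          # sup_{z in Z} y^T z on the grid

y = np.array([0.7, -2.5])
print(support_numeric(y), np.abs(y).sum())  # both are 3.2
```

The maximum is attained at the corner of the box whose signs match 𝑦, which is exactly why the closed form is the ℓ1-norm.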

2. Ambiguity Sets
An ambiguity set P is a family of probability distributions on a common measurable
space. Throughout this paper we assume that P ⊆ P(Z), where P(Z) denotes
the entirety of all Borel probability distributions on a closed set Z ⊆ R𝑑 . This
section reviews popular classes of ambiguity sets. For each class, we first give a
formal definition and provide historical background information. Subsequently, we
exemplify important instances of ambiguity sets and highlight how they are used.

2.1. Moment Ambiguity Sets


A moment ambiguity set is a family of probability distributions that satisfy finitely
many (generalized) moment conditions. Formally, it can thus be represented as
P = {P ∈ P(Z) : E P [ 𝑓 (𝑍)] ∈ F } , (2.1)
where 𝑓 : Z → R𝑚 is a Borel measurable moment function, and F ⊆ R𝑚 is an
uncertainty set. By definition, the moment ambiguity set (2.1) thus contains all
probability distributions P supported on Z whose generalized moments E P [ 𝑓 (𝑍)]
are well-defined and belong to the uncertainty set F. Ambiguity sets of the
type (2.1) were first studied by Isii (1960, 1962) and Karlin and Studden (1966)
to establish the sharpness of generalized Chebyshev inequalities. The following
subsections review popular instances of the moment ambiguity set.

2.1.1. Support-Only Ambiguity Sets


The support-only ambiguity set contains all probability distributions supported on
Z ⊆ R𝑑 , that is, P = P(Z). It can be viewed as an instance of (2.1) with 𝑓 (𝑧) = 1
and F = {1}. Any DRO problem with ambiguity set P(Z) is ostensibly equivalent
to a classical robust optimization problem with uncertainty set Z, that is,
inf_{𝑥∈X} sup_{P∈P(Z)} E_P [ℓ(𝑥, 𝑍)] = inf_{𝑥∈X} sup_{𝑧∈Z} ℓ(𝑥, 𝑧).

For a comprehensive review of the theory and applications of robust optimization
we refer to (Ben-Tal and Nemirovski 1998, 1999a, 2000, 2002, Bertsimas and Sim
2004, Ben-Tal et al. 2009, Bertsimas, Brown and Caramanis 2011, Ben-Tal, den
Hertog and Vial 2015a, Bertsimas and den Hertog 2022).
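The equivalence above reflects that the expectation is linear in P: over a finite support set, the worst-case expectation is a linear program over the probability simplex whose optimum sits at a vertex, i.e. at a Dirac distribution. A quick Monte Carlo sketch (with invented loss values) illustrates this.

```python
import numpy as np

# For finite Z, no distribution in P(Z) can make the expected loss
# exceed the robust value sup_z l(x, z); the Dirac on the worst point
# attains it.
rng = np.random.default_rng(1)
losses = rng.normal(size=50)                 # values l(x, z) on 50 points of Z

random_dists = rng.dirichlet(np.ones(50), size=10_000)
worst_random = (random_dists @ losses).max() # best that random P's achieve

print(worst_random, losses.max())            # the Dirac bound dominates
```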

If the uncertainty set Z covers a fraction 1 − 𝜀 of the total probability mass of some distribution P, then the worst-case loss sup_{𝑧∈Z} ℓ(𝑥, 𝑧) is guaranteed to exceed
the (1 − 𝜀)-quantile of ℓ(𝑥, 𝑍) under P. This can be achieved by leveraging prior
structural information or statistical data from P. For example, P(𝑍 ∈ Z) ≥ 1 − 𝜀
may hold (with certainty) if Z is an appropriately sized intersection of halfspaces
and ellipsoids and if 𝑍 has independent, symmetric, unimodal and/or sub-Gaussian
components under P (Bertsimas and Sim 2004, Janak, Lin and Floudas 2007, Ben-
Tal et al. 2009, Li, Ding and Floudas 2011, Bertsimas, den Hertog and Pauphilet
2021). Alternatively, it may hold (with high confidence) if Z is constructed from
independent samples from P by using statistical hypothesis tests (Postek, den Hertog
and Melenberg 2016, Bertsimas, Gupta and Kallus 2018b,a), quantile estimation
(Hong, Huang and Lam 2021), or learning-based methods (Han, Shang and Huang
2021, Goerigk and Kurtz 2023, Wang, Becker, Van Parys and Stellato 2023).

2.1.2. Markov Ambiguity Sets


Markov’s inequality provides an upper bound on the probability that a non-negative
univariate random variable 𝑍 with mean 𝜇 ≥ 0 exceeds a positive threshold 𝜏 > 0.
Formally, it states that P(𝑍 ≥ 𝜏) ≤ 𝜇/𝜏 for every possible probability distribution
of 𝑍 in the ambiguity set P = {P ∈ P(R+ ) : E P [𝑍] = 𝜇}. If 𝜇 ≤ 𝜏, then
Markov’s inequality is sharp, that is, there exists a probability distribution P★ ∈ P
for which the inequality holds as an equality. Indeed, the distribution P★ = (1 −
𝜇/𝜏)𝛿0 + (𝜇/𝜏)𝛿𝜏, where 𝛿𝑧 is the Dirac distribution that places a point mass at 𝑧 ∈ R,
is an element of P and satisfies P(𝑍 ≥ 𝜏) = 𝜇/𝜏. These insights imply that
supP∈P P(𝑍 ≥ 𝜏) = 𝜇/𝜏 and that the supremum is attained by P★ whenever 𝜇 ≤ 𝜏.
Thus, Markov’s bound can be interpreted as the optimal value of a DRO problem.
It is therefore common to refer to P as a Markov ambiguity set. More generally,
we define the Markov ambiguity set corresponding to a closed support set Z ⊆ R𝑑
and a mean vector 𝜇 ∈ R𝑑 as a family of multivariate distributions of the form

P = {P ∈ P(Z) : E P [𝑍] = 𝜇} . (2.2)

Thus, the Markov ambiguity set (2.2) contains all distributions supported on Z that
share the same mean vector 𝜇. However, these distributions may have dramatically
different shapes and higher-order moments. Worst-case expectations over Markov
ambiguity sets are sometimes used as efficiently computable upper bounds on
the expected cost-to-go functions in stochastic programming. If the cost-to-go
functions are concave in the uncertain problem parameters, then these worst-case
expectations are closely related to Jensen’s inequality (Jensen 1906); see also
Section 6.1. If the cost-to-go functions are convex and Z is a polyhedron, on
the other hand, then these worst-case expectations are related to the Edmundson-
Madansky inequality (Edmundson 1956, Madansky 1959); see also Section 6.2.
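The sharpness argument for Markov’s inequality can be checked numerically. The sketch below (with illustrative values 𝜇 = 2, 𝜏 = 5) verifies that the two-point distribution P★ lies in the Markov ambiguity set and attains the bound, and that randomly generated feasible distributions never exceed it.

```python
import numpy as np

# Sharpness of Markov's inequality: P* = (1 - mu/tau) delta_0 +
# (mu/tau) delta_tau has mean mu, lives on R_+, and attains
# P(Z >= tau) = mu/tau.
mu, tau = 2.0, 5.0
atoms = np.array([0.0, tau])
probs = np.array([1.0 - mu / tau, mu / tau])

assert np.isclose(probs @ atoms, mu)                    # mean constraint
assert np.isclose(probs[atoms >= tau].sum(), mu / tau)  # bound attained

# Random feasible distributions: rescale random discrete distributions
# on R_+ so that their mean is exactly mu, then check the bound.
rng = np.random.default_rng(0)
for _ in range(1000):
    pts = rng.uniform(0.1, 10.0, size=5)
    w = rng.dirichlet(np.ones(5))
    pts = pts * (mu / (w @ pts))          # enforce E[Z] = mu, keep Z >= 0
    assert w[pts >= tau].sum() <= mu / tau + 1e-12
```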

2.1.3. Chebyshev Ambiguity Sets


Chebyshev’s inequality provides an upper bound on the probability that a univari-
ate random variable 𝑍 with finite mean 𝜇 ∈ R and variance 𝜎 2 > 0 deviates
from its mean by more than 𝑘 > 0 standard deviations. Formally, it states that
P (|𝑍 − 𝜇| ≥ 𝑘𝜎) ≤ 1/𝑘 2 for every possible probability distribution of 𝑍 in the
ambiguity set P = {P ∈ P(R) : E P [𝑍] = 𝜇, E P [𝑍 2 ] = 𝜎 2 + 𝜇2 }. Chebyshev’s
inequality is sharp if 𝑘 ≥ 1. Indeed, one readily verifies that the distribution
P★ = (1/(2𝑘²)) 𝛿_{𝜇−𝑘𝜎} + (1 − 1/𝑘²) 𝛿_𝜇 + (1/(2𝑘²)) 𝛿_{𝜇+𝑘𝜎}

is an element of P and satisfies P★(|𝑍 − 𝜇| ≥ 𝑘𝜎) = 1/𝑘². These insights imply that
sup_{P∈P} P(|𝑍 − 𝜇| ≥ 𝑘𝜎) = 1/𝑘² and that the supremum is attained whenever 𝑘 ≥ 1. Thus,
Chebyshev’s bound can be interpreted as the optimal value of a DRO problem. It is
therefore common to refer to P as a Chebyshev ambiguity set. More generally, we
define the Chebyshev ambiguity set corresponding to a closed support set Z ⊆ R𝑑 ,
mean vector 𝜇 ∈ R𝑑 and second-order moment matrix 𝑀 ∈ S+𝑑 with 𝑀 ⪰ 𝜇𝜇⊤ as

P = {P ∈ P(Z) : E P [𝑍] = 𝜇, E P [𝑍𝑍⊤] = 𝑀} .   (2.3)
Thus, the Chebyshev ambiguity set (2.3) contains all distributions supported on Z
that share the same mean vector 𝜇 and second-order moment matrix 𝑀 (and thus
also the same covariance matrix Σ = 𝑀 − 𝜇𝜇⊤ ∈ S+𝑑 ). However, these distributions
may have dramatically different shapes and higher-order moments.
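Chebyshev's extremal three-point distribution can be checked in the same way as Markov's. The sketch below (illustrative values 𝜇 = 1, 𝜎 = 2, 𝑘 = 3, not from the text) verifies the two moment constraints and the attained tail probability:

```python
from fractions import Fraction

def chebyshev_extremal(mu, sigma, k):
    """Three-point distribution attaining Chebyshev's bound for k >= 1:
    mass 1/(2k^2) at mu -/+ k*sigma and mass 1 - 1/k^2 at mu."""
    q = Fraction(1, 2 * k * k)
    return {mu - k * sigma: q, mu: 1 - 2 * q, mu + k * sigma: q}

mu, sigma, k = 1, 2, 3
P = chebyshev_extremal(mu, sigma, k)

mean = sum(z * p for z, p in P.items())
second = sum(z * z * p for z, p in P.items())
tail = sum(p for z, p in P.items() if abs(z - mu) >= k * sigma)

assert mean == mu                      # first moment matches
assert second == sigma**2 + mu**2      # E[Z^2] = sigma^2 + mu^2
assert tail == Fraction(1, k * k)      # the bound 1/k^2 is attained
```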
The Chebyshev ambiguity set (2.3) captures the distributional information rel-
evant for multivariate Chebyshev inequalities (Lal 1955, Marshall and Olkin 1960,
Tong 1980, Rujeerapaiboon, Kuhn and Wiesemann 2018). In operations research,
Chebyshev ambiguity sets are routinely used since the seminal work of Scarf (1958)
on the distributionally robust newsvendor, which is widely perceived as the first
paper on DRO. Since then a wealth of DRO models with Chebyshev ambiguity
sets have emerged in the context of newsvendor and portfolio selection problems.
These models involve a wide range of different decision criteria such as the expec-
ted value (Gallego and Moon 1993, Natarajan and Linyi 2007, Popescu 2007), the
value-at-risk (El Ghaoui et al. 2003, Xu, Caramanis and Mannor 2012b, Zymler,
Kuhn and Rustem 2013a,b, Rujeerapaiboon, Kuhn and Wiesemann 2016, Yang and
Xu 2016, Zhang, Jiang and Shen 2018), the conditional value-at-risk (Natarajan,
Sim and Uichanco 2010, Chen, He and Zhang 2011, Zymler et al. 2013b, Hanas-
usanto, Kuhn, Wallace and Zymler 2015a), spectral risk measures (Li 2018) and
distortion risk measures (Cai, Li and Mao 2023, Pesenti, Wang and Wang 2024),
as well as minimax regret criteria (Yue, Chen and Wang 2006, Perakis and Roels
2008). Besides this, Chebyshev ambiguity sets have found numerous applications
in option and stock pricing (Bertsimas and Popescu 2002), statistics and machine
learning (Lanckriet, El Ghaoui, Bhattacharyya and Jordan 2001, 2002, Strohmann
and Grudic 2002, Huang, Yang, King, Lyu and Chan 2004, Bhattacharyya 2004,
Farnia and Tse 2016, Nguyen, Shafieezadeh-Abadeh, Yue, Kuhn and Wiesemann
2019, Rontsis, Osborne and Goulart 2020), stochastic programming (Birge and
Wets 1986, Dulá and Murthy 1992, Dokov and Morton 2005, Bertsimas, Doan,
Natarajan and Teo 2010, Natarajan, Teo and Zheng 2011), control (Van Parys, Kuhn,
Goulart and Morari 2015, Yang 2018, Xin and Goldberg 2021, 2022), the operation
of power systems (Xie and Ahmed 2017, Zhao and Jiang 2017), complex network
analysis (Van Leeuwaarden and Stegehuis 2021, Brugman, Van Leeuwaarden and
Stegehuis 2022), queuing systems (van Eekelen, Hanasusanto, Hasenbein and van
Leeuwaarden 2023), healthcare (Mak, Rong and Zhang 2015, Shehadeh, Cohn and
Jiang 2020), and extreme event analysis (Lam and Mottet 2017), among others.
2.1.4. Chebyshev Ambiguity Sets with Uncertain Moments
Working with Chebyshev ambiguity sets is appropriate when the first- and second-
order moments of P are known, while all higher-order moments are unknown. In
practice, however, even the first- and second-order moments are never known with
absolute certainty. Instead, they must be estimated from statistical data and are thus
subject to estimation errors. This prompted El Ghaoui et al. (2003) to introduce a
Chebyshev ambiguity set with uncertain moments, which can be represented as

P = {P ∈ P(Z) : (E P [𝑍], E P [𝑍𝑍⊤]) ∈ F} .   (2.4)
Here, F ⊆ R𝑑 × S+𝑑 is a convex set that captures the moment uncertainty. Clearly,
P can be expressed as a union of crisp Chebyshev ambiguity sets, that is, we have
P = ⋃_{(𝜇,𝑀)∈F} {P ∈ P(Z) : E P [𝑍] = 𝜇, E P [𝑍𝑍⊤] = 𝑀} .
Note that the Chebyshev ambiguity set with uncertain moments encapsulates the
support-only ambiguity set, the Markov ambiguity set, and the Chebyshev ambigu-
ity set as special cases. They are recovered by setting F = R𝑑 × S+𝑑 , F = {𝜇} × S+𝑑 ,
and F = {𝜇} × {𝑀 }, respectively.
El Ghaoui et al. (2003) capture the uncertainty in the moments using the box
F = {(𝜇, 𝑀) ∈ R𝑑 × S+𝑑 : 𝜇̲ ≤ 𝜇 ≤ 𝜇̄, 𝑀̲ ⪯ 𝑀 ⪯ 𝑀̄}

parametrized by the moment bounds 𝜇̲, 𝜇̄ ∈ R𝑑 and 𝑀̲, 𝑀̄ ∈ S+𝑑 .
Given noisy estimates 𝜇̂ and Σ̂ for the unknown mean vector and covariance matrix of P, respectively, Delage and Ye (2010) propose the ambiguity set

P = {P ∈ P(Z) : (E P [𝑍] − 𝜇̂)⊤ Σ̂⁻¹ (E P [𝑍] − 𝜇̂) ≤ 𝛾1, E P [(𝑍 − 𝜇̂)(𝑍 − 𝜇̂)⊤] ⪯ 𝛾2 Σ̂} .
By construction, P contains all distributions on Z whose first-order moments reside in an ellipsoid with center 𝜇̂ and whose second-order moments (relative to 𝜇̂) reside in a semidefinite cone with apex 𝛾2 Σ̂. An elementary calculation reveals that

E P [(𝑍 − 𝜇̂)(𝑍 − 𝜇̂)⊤] = E P [𝑍𝑍⊤] − E P [𝑍] 𝜇̂⊤ − 𝜇̂ E P [𝑍]⊤ + 𝜇̂𝜇̂⊤ .
Thus, P can be viewed as a Chebyshev ambiguity set with uncertain moments. Indeed, P is an instance of (2.4) if we define the moment uncertainty set as

F = {(𝜇, 𝑀) ∈ R𝑑 × S+𝑑 : (𝜇 − 𝜇̂)⊤ Σ̂⁻¹ (𝜇 − 𝜇̂) ≤ 𝛾1, 𝑀 − 𝜇𝜇̂⊤ − 𝜇̂𝜇⊤ + 𝜇̂𝜇̂⊤ ⪯ 𝛾2 Σ̂} .
Delage and Ye (2010) show that if 𝜇ˆ and Σ̂ are set to the sample mean and the
sample covariance matrix constructed from a finite number of independent samples
from P, respectively, then one can tune the size parameters 𝛾1 ≥ 0 and 𝛾2 ≥ 1 to
ensure that P belongs to P with any desired confidence.
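Membership of a discrete candidate distribution in the Delage-Ye ambiguity set can be checked mechanically: an ellipsoid test for the mean and an eigenvalue test for the Loewner bound. A NumPy sketch (all numerical values below are illustrative, not from the text):

```python
import numpy as np

def in_delage_ye(atoms, probs, mu_hat, Sigma_hat, gamma1, gamma2):
    """Check whether the discrete distribution placing probs[i] on atoms[i]
    lies in the Delage-Ye ambiguity set with parameters gamma1, gamma2."""
    mean = probs @ atoms
    dev = atoms - mu_hat                      # atoms centered at mu_hat
    M = (probs[:, None] * dev).T @ dev        # E[(Z - mu_hat)(Z - mu_hat)^T]
    d = mean - mu_hat
    ellipsoid_ok = d @ np.linalg.solve(Sigma_hat, d) <= gamma1
    # gamma2 * Sigma_hat - M must be positive semidefinite
    loewner_ok = np.min(np.linalg.eigvalsh(gamma2 * Sigma_hat - M)) >= -1e-9
    return bool(ellipsoid_ok and loewner_ok)

mu_hat = np.zeros(2)
Sigma_hat = np.eye(2)
atoms = np.array([[1.0, 0.0], [-1.0, 0.0]])
probs = np.array([0.5, 0.5])   # mean 0, second moment diag(1, 0)

assert in_delage_ye(atoms, probs, mu_hat, Sigma_hat, gamma1=0.1, gamma2=1.0)
```

Shrinking 𝛾2 below 1 in this example violates the semidefinite condition, so the same candidate is rejected.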
Chebyshev as well as Markov ambiguity sets with uncertain moments have
found various applications ranging from control (Nakao, Jiang and Shen 2021)
to integer stochastic programming (Bertsimas, Natarajan and Teo 2004, Cheng,
Delage and Lisser 2014), portfolio optimization (Natarajan et al. 2010), extreme
event analysis (Bai, Lam and Zhang 2023) and mechanism design and pricing
(Bergemann and Schlag 2008, Bandi and Bertsimas 2014, Koçyiğit, Iyengar, Kuhn
and Wiesemann 2020, Koçyiğit, Rujeerapaiboon and Kuhn 2022, Chen, Hu and
Wang 2024a, Bayrak, Koçyiğit, Kuhn and Pınar 2022, Anunrojwong, Balseiro and
Besbes 2024), among many others.
The uncertainty set F for the first- and second-order moments of P often corres-
ponds to a neighborhood of a nominal mean-covariance pair ( 𝜇, ˆ Σ̂) with respect to
some measure of discrepancy. For example, matrix norms such as the Frobenius
norm, the spectral norm or the nuclear norm (Bernstein 2009, § 9) provide nat-
ural measures to quantify the dissimilarity of covariance matrices. The discrep-
ancy between two mean-covariance pairs (𝜇, Σ) and ( 𝜇, ˆ Σ̂) can also be defined
as the discrepancy between the normal distributions N (𝜇, Σ) and N ( 𝜇, ˆ Σ̂) with
respect to a probability metric or an information-theoretic divergence such as the
Kullback-Leibler divergence (Kullback 1959), the Fisher-Rao distance (Atkinson
and Mitchell 1981) or other spectral divergences (Zorzi 2014).
As we will discuss in more detail in Section 2.3, the 2-Wasserstein distance
between two normal distributions N (𝜇, Σ) and N ( 𝜇,
ˆ Σ̂) coincides with the Gelbrich
distance between the underlying mean-covariance pairs (𝜇, Σ) and ( 𝜇, ˆ Σ̂). In the
following, we first provide a formal definition of the Gelbrich distance and then
exemplify how it can be used to define a moment uncertainty set F.
Definition 2.1 (Gelbrich Distance). The Gelbrich distance between two mean-covariance pairs (𝜇, Σ) and (𝜇̂, Σ̂) in R𝑑 × S+𝑑 is given by

G((𝜇, Σ), (𝜇̂, Σ̂)) = ( ‖𝜇 − 𝜇̂‖₂² + Tr(Σ + Σ̂ − 2(Σ̂^{1/2} Σ Σ̂^{1/2})^{1/2}) )^{1/2} .
The Gelbrich distance is non-negative, symmetric and subadditive, and it vanishes if and only if (𝜇, Σ) = (𝜇̂, Σ̂). Thus, it represents a metric on R𝑑 × S+𝑑 (Givens and Shortt 1984, p. 239). When 𝜇 = 𝜇̂, the Gelbrich distance collapses to the Bures distance between Σ and Σ̂, which was conceived as a measure of dissimilarity between density matrices in quantum information theory. The Bures distance is known to induce a Riemannian metric on the space of positive semidefinite matrices (Bhatia, Jain and Lim 2018, 2019). When Σ and Σ̂ are simultaneously
diagonalizable, then their Bures distance coincides with the Hellinger distance
between their spectra. The Hellinger distance is closely related to the Fisher-Rao
metric ubiquitous in information theory (Liese and Vajda 1987). Even though the
Gelbrich distance is nonconvex, the squared Gelbrich distance is jointly convex in
both of its arguments. This is an immediate consequence of the following propos-
ition, which can be found in (Olkin and Pukelsheim 1982, Dowson and Landau
1982, Givens and Shortt 1984, Panaretos and Zemel 2020).
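Definition 2.1 translates directly into code. The following NumPy sketch (the matrices are illustrative choices, not from the text) evaluates the Gelbrich distance via an eigendecomposition-based matrix square root and checks the metric properties just mentioned:

```python
import numpy as np

def sqrtm_psd(A):
    """Matrix square root of a symmetric positive semidefinite matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def gelbrich(mu1, S1, mu2, S2):
    """Gelbrich distance between mean-covariance pairs (Definition 2.1)."""
    R = sqrtm_psd(S2)
    cross = sqrtm_psd(R @ S1 @ R)     # (S2^{1/2} S1 S2^{1/2})^{1/2}
    inner = np.sum((mu1 - mu2) ** 2) + np.trace(S1 + S2 - 2 * cross)
    return np.sqrt(max(inner, 0.0))

mu1, S1 = np.array([0.0, 0.0]), np.diag([4.0, 1.0])
mu2, S2 = np.array([3.0, 0.0]), np.diag([1.0, 1.0])

# For commuting covariance matrices the trace term reduces to a Hellinger-type
# expression: here it equals (2 - 1)^2 + (1 - 1)^2 = 1, and the mean term is 9.
assert abs(gelbrich(mu1, S1, mu2, S2) - np.sqrt(10.0)) < 1e-9
# Symmetry, and vanishing on identical pairs.
assert abs(gelbrich(mu1, S1, mu2, S2) - gelbrich(mu2, S2, mu1, S1)) < 1e-9
assert gelbrich(mu1, S1, mu1, S1) < 1e-9
```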
Proposition 2.2 (SDP Representation of the Gelbrich Distance). For any mean-covariance pairs (𝜇, Σ) and (𝜇̂, Σ̂) in R𝑑 × S+𝑑 , we have

G²((𝜇, Σ), (𝜇̂, Σ̂)) = min { ‖𝜇 − 𝜇̂‖₂² + Tr(Σ + Σ̂ − 2𝐶) : 𝐶 ∈ R^{𝑑×𝑑}, [[Σ, 𝐶], [𝐶⊤, Σ̂]] ⪰ 0 } .   (2.5)
Proof. Throughout the proof we keep 𝜇, 𝜇ˆ and Σ fixed and treat Σ̂ as a parameter.
We also use 𝑓 (Σ̂) as a shorthand for the left hand side of (2.5) and 𝑔(Σ̂) as a
shorthand for the right hand side of (2.5). Elementary manipulations show that

𝑔(Σ̂) = ‖𝜇 − 𝜇̂‖₂² + Tr(Σ + Σ̂) − max { Tr(2𝐶) : 𝐶 ∈ R^{𝑑×𝑑}, [[Σ, 𝐶], [𝐶⊤, Σ̂]] ⪰ 0 } .   (2.6)
The maximization problem in (2.6) is dual to the following minimization problem.

inf { Tr(𝐴11 Σ) + Tr(𝐴22 Σ̂) : 𝐴11, 𝐴22 ∈ S𝑑 , [[𝐴11, 𝐼𝑑 ], [𝐼𝑑 , 𝐴22]] ⪰ 0 }
Strong duality holds because 𝐴11 = 𝐴22 = 2𝐼𝑑 constitutes a Slater point for the dual
problem (Ben-Tal and Nemirovski 2001, Theorem 2.4.1). The existence of a Slater
point further implies that the primal maximization problem in (2.6) as well as the
minimization problem in (2.5) are solvable. By (Bernstein 2009, Corollary 8.2.2),
both 𝐴11 and 𝐴22 must be positive definite in order to be dual feasible. Thus, they
are invertible. We can therefore employ a Schur complement argument (Ben-Tal
and Nemirovski 2001, Lemma 4.2.1) to simplify the dual problem to
inf { Tr(𝐴11 Σ) + Tr(𝐴22 Σ̂) : 𝐴22 ≻ 0, 𝐴11 ⪰ 𝐴22⁻¹ } = inf { Tr(𝐴22⁻¹ Σ) + Tr(𝐴22 Σ̂) : 𝐴22 ≻ 0 },   (2.7)

where the equality holds because Σ ⪰ 0. The optimal value of the resulting minim-
ization problem is concave and upper semicontinuous in Σ̂ because it constitutes a
pointwise infimum of affine functions of Σ̂. Thus, 𝑔(Σ̂) is convex and lower semi-
continuous. We now show that if Σ̂ ≻ 0, then the convex minimization problem
over 𝐴22 in (2.7) can be solved in closed form. To this end, we construct a positive definite matrix 𝐴★22 that satisfies the problem's first-order optimality condition

Σ̂ − 𝐴22⁻¹ Σ 𝐴22⁻¹ = 0  ⟺  𝐴22 Σ̂ 𝐴22 − Σ = 0.

Indeed, multiplying the quadratic equation on the right from both sides with Σ̂^{1/2} yields the equivalent equation (Σ̂^{1/2} 𝐴22 Σ̂^{1/2})² = Σ̂^{1/2} Σ Σ̂^{1/2}. As Σ̂ ≻ 0, this equation is uniquely solved by 𝐴★22 = Σ̂^{−1/2} (Σ̂^{1/2} Σ Σ̂^{1/2})^{1/2} Σ̂^{−1/2}. Substituting 𝐴★22 into (2.7) reveals that the optimal value of the dual minimization problem is given by 2 Tr((Σ̂^{1/2} Σ Σ̂^{1/2})^{1/2}).
Substituting this value into (2.6) then shows that 𝑔(Σ̂) = 𝑓 (Σ̂) whenever Σ̂ ≻ 0.
It remains to be shown that 𝑔(Σ̂) = 𝑓 (Σ̂) if Σ̂ is singular. To this end, we re-
call from (Nguyen, Shafieezadeh-Abadeh, Kuhn and Mohajerin Esfahani 2023,
Lemma A.2) that the matrix square root is continuous on S+𝑑 , which implies
that 𝑓 (Σ̂) is continuous on S+𝑑 . For any singular Σ̂ ⪰ 0, we thus have

𝑓 (Σ̂) = lim inf_{Σ̂′→Σ̂, Σ̂′≻0} 𝑓 (Σ̂′) = lim inf_{Σ̂′→Σ̂, Σ̂′≻0} 𝑔(Σ̂′) = 𝑔(Σ̂).
Here, the first equality exploits the continuity of 𝑓 , and the second equality holds
because 𝑓 (Σ̂′ ) = 𝑔(Σ̂′ ) for every Σ̂′ ≻ 0. The third equality follows from the
convexity and lower semicontinuity of 𝑔, which imply that the limit inferior can
neither be smaller nor larger than 𝑔(Σ̂), respectively. This completes the proof. □
Proposition 2.2 shows that the squared Gelbrich distance coincides with the
optimal value of a tractable semidefinite program. This makes the Gelbrich distance
attractive for computation. As a byproduct, the proof of Proposition 2.2 reveals
that the squared Gelbrich distance is convex as well as continuous on its domain.
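The closed-form dual solution from the proof can be sanity-checked numerically: for any Σ, Σ̂ ≻ 0, every positive definite 𝐴22 should give a dual objective of at least 2 Tr((Σ̂^{1/2} Σ Σ̂^{1/2})^{1/2}), with equality at 𝐴★22. A NumPy sketch (random illustrative matrices, not from the text):

```python
import numpy as np

def sqrtm_psd(A):
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def dual_objective(A, Sigma, Sigma_hat):
    """Objective of the dual problem (2.7): Tr(A^{-1} Sigma) + Tr(A Sigma_hat)."""
    return np.trace(np.linalg.solve(A, Sigma)) + np.trace(A @ Sigma_hat)

rng = np.random.default_rng(0)
d = 3
B = rng.standard_normal((d, d)); Sigma = B @ B.T + np.eye(d)
C = rng.standard_normal((d, d)); Sigma_hat = C @ C.T + np.eye(d)

R = sqrtm_psd(Sigma_hat)
Rinv = np.linalg.inv(R)
A_star = Rinv @ sqrtm_psd(R @ Sigma @ R) @ Rinv   # candidate optimizer A*_22
opt = 2 * np.trace(sqrtm_psd(R @ Sigma @ R))      # claimed optimal value

# A*_22 attains the claimed value, and random feasible A never do better.
assert abs(dual_objective(A_star, Sigma, Sigma_hat) - opt) < 1e-6
for _ in range(5):
    D = rng.standard_normal((d, d))
    A = D @ D.T + 0.1 * np.eye(d)
    assert dual_objective(A, Sigma, Sigma_hat) >= opt - 1e-6
```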
Following Nguyen, Shafiee, Filipović and Kuhn (2021), we can now introduce
the Gelbrich ambiguity set as an instance of the Chebyshev ambiguity set (2.4)
with uncertain moments. The corresponding moment uncertainty set is given by

F = {(𝜇, 𝑀) ∈ R𝑑 × S+𝑑 : ∃Σ ∈ S+𝑑 with 𝑀 = Σ + 𝜇𝜇⊤, G((𝜇, Σ), (𝜇̂, Σ̂)) ≤ 𝑟} ,   (2.8)

where (𝜇̂, Σ̂) is a nominal mean-covariance pair, and the radius 𝑟 ≥ 0 serves as a
tunable size parameter. Below we refer to F as the Gelbrich uncertainty set. The
next proposition establishes basic topological and computational properties of F.
Proposition 2.3 (Gelbrich Uncertainty Set). The uncertainty set F defined in (2.8) is convex and compact. In addition, it admits the semidefinite representation

F = { (𝜇, 𝑀) ∈ R𝑑 × S+𝑑 : ∃𝐶 ∈ R^{𝑑×𝑑} and 𝑈 ∈ S+𝑑 with ‖𝜇̂‖₂² − 2𝜇⊤𝜇̂ + Tr(𝑀 + Σ̂ − 2𝐶) ≤ 𝑟², [[𝑀 − 𝑈, 𝐶], [𝐶⊤, Σ̂]] ⪰ 0, [[𝑈, 𝜇], [𝜇⊤, 1]] ⪰ 0 } .
Proof. The proof exploits the semidefinite representation of the squared Gelbrich distance established in Proposition 2.2. Note first that if 𝑀 = Σ + 𝜇𝜇⊤, then

‖𝜇 − 𝜇̂‖₂² + Tr(Σ + Σ̂ − 2𝐶) = ‖𝜇̂‖₂² − 2𝜇⊤𝜇̂ + Tr(𝑀 + Σ̂ − 2𝐶).
By Proposition 2.2, the Gelbrich uncertainty set F can thus be represented as

F = { (𝜇, 𝑀) ∈ R𝑑 × S+𝑑 : ∃𝐶 ∈ R^{𝑑×𝑑} with ‖𝜇̂‖₂² − 2𝜇⊤𝜇̂ + Tr(𝑀 + Σ̂ − 2𝐶) ≤ 𝑟², [[𝑀 − 𝜇𝜇⊤, 𝐶], [𝐶⊤, Σ̂]] ⪰ 0 } .
A standard Schur complement argument further reveals that

[[𝑀 − 𝜇𝜇⊤, 𝐶], [𝐶⊤, Σ̂]] ⪰ 0  ⟺  ∃𝑈 ∈ S+𝑑 with [[𝑀 − 𝑈, 𝐶], [𝐶⊤, Σ̂]] ⪰ 0, [[𝑈, 𝜇], [𝜇⊤, 1]] ⪰ 0.
Hence, the Gelbrich uncertainty set F admits the semidefinite representation given
in the proposition statement. Convexity is evident from this representation, which
expresses F as the projection of a set defined by conic inequalities in a lifted space.
It remains to be shown that F is compact. To this end, we define

V = {(𝜇, Σ) ∈ R𝑑 × S+𝑑 : G((𝜇, Σ), (𝜇̂, Σ̂)) ≤ 𝑟}

as the ball of radius 𝑟 around (𝜇̂, Σ̂) with respect to the Gelbrich distance. Note
that F = 𝑓 (V), where the transformation 𝑓 : R𝑑 × S+𝑑 → R𝑑 × S+𝑑 is defined
through 𝑓 (𝜇, Σ) = (𝜇, Σ + 𝜇𝜇⊤ ). We will now prove that V is compact. As 𝑓 is
continuous and as compactness is preserved under continuous transformations, this
will readily imply that F is compact. Clearly, V is closed because the Gelbrich
distance is continuous. To show that V is also bounded, fix any (𝜇, Σ) ∈ V. By the
definition of the Gelbrich distance, we have ‖𝜇 − 𝜇̂‖₂ ≤ 𝑟. In addition, we find

Tr((Σ̂^{1/2} Σ Σ̂^{1/2})^{1/2}) = max { Tr(𝐶) : 𝐶 ∈ R^{𝑑×𝑑}, [[Σ, 𝐶], [𝐶⊤, Σ̂]] ⪰ 0 }
  ≤ max { Tr(𝐶) : 𝐶 ∈ R^{𝑑×𝑑}, 𝐶𝑖𝑗² ≤ Σ𝑖𝑖 Σ̂𝑗𝑗 ∀𝑖, 𝑗 ∈ [𝑑] } ≤ (Tr(Σ) Tr(Σ̂))^{1/2} ,
where the equality has been established in the proof of Proposition 2.2. The two
inequalities follow from a relaxation of the linear matrix inequality, which exploits
the observation that all second principal minors of a positive semidefinite matrix
are non-negative, and from the Cauchy-Schwarz inequality. Thus, Σ satisfies

( Tr(Σ)^{1/2} − Tr(Σ̂)^{1/2} )² ≤ Tr( Σ + Σ̂ − 2(Σ̂^{1/2} Σ Σ̂^{1/2})^{1/2} ) ≤ 𝑟² ,
where the second inequality holds because (𝜇, Σ) ∈ V. We may therefore conclude that Tr(Σ) ≤ (𝑟 + Tr(Σ̂)^{1/2})², which in turn implies that 0 ⪯ Σ ⪯ (𝑟 + Tr(Σ̂)^{1/2})² 𝐼𝑑 . In
summary, we have shown that both 𝜇 and Σ belong to bounded sets. As (𝜇, Σ) ∈ V
was chosen arbitrarily, this proves that V is indeed bounded and thus compact. □
Proposition 2.3 shows that the uncertainty set F is convex. This is surprising because F = 𝑓 (V), where the Gelbrich ball V in the space of mean-covariance pairs is convex thanks to Proposition 2.2 and where 𝑓 is a quadratic bijection. Indeed, convexity is usually only preserved under affine transformations.
Gelbrich ambiguity sets were introduced by Nguyen et al. (2021) in the context
of robust portfolio optimization. They have also found use in machine learning
(Bui, Nguyen and Nguyen 2022, Vu, Tran, Yue and Nguyen 2021, Nguyen, Bui and
Nguyen 2022a), estimation (Nguyen et al. 2023), filtering (Shafieezadeh-Abadeh,
Nguyen, Kuhn and Mohajerin Esfahani 2018, Kargin, Hajar, Malik and Hassibi
2024b) and control (McAllister and Mohajerin Esfahani 2023, Al Taha, Yan and
Bitar 2023, Hajar, Kargin and Hassibi 2023, Hakobyan and Yang 2024, Taşkesen,
Iancu, Koçyiğit and Kuhn 2024, Kargin, Hajar, Malik and Hassibi 2024a,c,d).
2.1.5. Mean-Dispersion Ambiguity Sets
If K ⊆ R𝑘 is a proper convex cone and 𝑣1 , 𝑣2 ∈ R𝑘 , then the inequality 𝑣1 ⪯_K 𝑣2 means that 𝑣2 − 𝑣1 ∈ K. Also, a function 𝐺 : R𝑚 → R𝑘 is called K-convex if

𝐺(𝜃𝑣1 + (1 − 𝜃)𝑣2 ) ⪯_K 𝜃𝐺(𝑣1 ) + (1 − 𝜃)𝐺(𝑣2 )   ∀𝑣1 , 𝑣2 ∈ R𝑚 , ∀𝜃 ∈ [0, 1].
The mean-dispersion ambiguity set corresponding to a convex closed support
set Z ⊆ R𝑑 , a mean vector 𝜇 ∈ R𝑑 , a K-convex dispersion function 𝐺 : R𝑚 → R 𝑘
and a dispersion bound 𝑔 ∈ R 𝑘 is defined as
P = {P ∈ P(Z) : E P [𝑍] = 𝜇, E P [𝐺(𝑍)] ⪯_K 𝑔} .   (2.9)
The mean-dispersion ambiguity set is highly expressive, that is, it can model various
stylized features of the unknown probability distribution. For example, if k · k is a
norm on R𝑑 , 𝐺(𝑧) = k𝑧 − 𝜇k is convex in the usual sense, and 𝑔 = 𝜎 ∈ R+ , then all
distributions P ∈ P have a mean absolute deviation from the mean that is bounded
by 𝜎. Alternatively, if 𝐺(𝑧) = (𝑧 − 𝜇)(𝑧 − 𝜇)⊤ is S+𝑑 -convex and 𝑔 = Σ ∈ S+𝑑 , then P
reduces to a Chebyshev ambiguity set with moment uncertainty. Specifically, the
covariance matrix of any P ∈ P is bounded by Σ in Loewner order. Wiesemann,
Kuhn and Sim (2014) show that the ambiguity set P, which contains distributions
of the 𝑑-dimensional random vector 𝑍, is closely related to the lifted ambiguity set

Q = {Q ∈ P(C) : E Q [𝑍] = 𝜇, E Q [𝑈] = 𝑔}

with support set C = {(𝑧, 𝑢) ∈ Z × R𝑘 : 𝐺(𝑧) ⪯_K 𝑢}, which contains joint distributions of 𝑍 and an auxiliary 𝑘-dimensional random vector 𝑈. Indeed, one
can prove that P = {Q 𝑍 : Q ∈ Q}, where Q 𝑍 denotes the marginal distribution
of 𝑍 under Q. As the loss function depends only on 𝑍 but not on 𝑈, this reasoning
implies that the inner worst-case expectation problem in (1.2) satisfies
sup_{P∈P} E P [ℓ(𝑥, 𝑍)] = sup_{Q∈Q} E Q [ℓ(𝑥, 𝑍)] .
Hence, one can replace the original ambiguity set P with the lifted ambiguity set Q.
This is useful because Q constitutes a simple Markov ambiguity set that specifies
only the support set C and the mean (𝜇, 𝑔) of the joint random vector (𝑍, 𝑈). In
addition, one can show that C is convex because Z is convex and 𝐺 is K-convex.
In summary, DRO problems with mean-dispersion ambiguity sets of the form (2.9)
can systematically be reduced to DRO problems with Markov ambiguity sets.
A more general class of mean-dispersion ambiguity sets can be used to shape the
moment generating function of 𝑍 under P. Specifically, Chen, Sim and Xu (2019)
introduce the entropic dominance ambiguity set

P = {P ∈ P(Z) : E P [𝑍] = 𝜇, log E P [exp(𝜃⊤(𝑍 − 𝜇))] ≤ 𝑔(𝜃) ∀𝜃 ∈ R𝑑} ,

where 𝑔 : R𝑑 → R is a convex and twice continuously differentiable function
satisfying 𝑔(0) = 0 and ∇𝑔(0) = 0. The constraints parametrized by 𝜃 impose
a continuum of upper bounds on the cumulant generating function (that is, the
logarithmic moment generating function) of the centered random variable 𝑍 − 𝜇
under P. The choice of 𝑔 determines the specific class of distributions included
in the ambiguity set. For example, if 𝑔(𝜃) = 𝜎 2 𝜃 ⊤ 𝜃/2 for some 𝜎 > 0, then the
ambiguity set contains only sub-Gaussian distributions with variance proxy 𝜎 2 .
Sub-Gaussian distributions are probability distributions whose tails are bounded
by the tails of a Gaussian distribution. They play a significant role in probability
theory and statistics, particularly in the study of concentration inequalities and
high-dimensional phenomena (Vershynin 2018, Wainwright 2019).
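As a concrete check of the sub-Gaussian case, consider a Rademacher random variable (uniform on {−1, +1}): its cumulant generating function is log cosh(𝜃), which is dominated by 𝑔(𝜃) = 𝜃²/2, so this distribution belongs to the entropic dominance ambiguity set with 𝜎 = 1. A minimal sketch verifying the domination on a grid:

```python
import math

def cgf_rademacher(theta):
    """Cumulant generating function of a centered Rademacher variable:
    log E[exp(theta * Z)] = log cosh(theta)."""
    return math.log(math.cosh(theta))

# Entropic dominance with g(theta) = theta^2 / 2 (sigma = 1) holds everywhere,
# so the Rademacher distribution lies in the ambiguity set.
for i in range(-50, 51):
    theta = 0.1 * i
    assert cgf_rademacher(theta) <= theta**2 / 2 + 1e-12
```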
The entropic dominance ambiguity set imposes infinitely many constraints on P.
Chen et al. (2019) show that worst-case expectation problems over this ambiguity
set can be reformulated as semi-infinite conic programs. They propose a cutting
plane algorithm to solve these conic programs efficiently. The entropic dominance
ambiguity set has also found applications in the study of nonlinear and PDE-
constrained DRO problems (Milz and Ulbrich 2020, 2022). Generalized entropic
dominance ambiguity sets are considered by Chen, Fu, Si, Sim and Xiong (2023).
2.1.6. Higher-order Moment Ambiguity Sets
Markov and Chebyshev ambiguity sets only impose conditions on the first- and/or
second-order moments of P. DRO problems with such ambiguity sets are often
tractable. In this section we briefly comment on moment ambiguity sets that impose
conditions on higher-order (polynomial) moments of P, which generically lead to
NP-hard DRO problems (Popescu 2005, Propositions 4.5 and 4.6).
Assume now that Z is a closed semialgebraic set defined as the feasible set of finitely many polynomial inequalities. In addition, define the monomial of order 𝛼 ∈ Z+𝑑 in 𝑧 ∈ R𝑑 as the function ∏_{𝑖=1}^{𝑑} 𝑧𝑖^{𝛼𝑖}, which we denote more compactly as 𝑧^𝛼. The higher-order moment ambiguity set induced by a finite index set A ⊆ Z+𝑑
and the moment bounds 𝑚 𝛼 ∈ R, 𝛼 ∈ A, is then given by
P = {P ∈ P(Z) : E P [𝑍 𝛼 ] ≤ 𝑚 𝛼 ∀𝛼 ∈ A} .
Evaluating the worst-case expectation of a polynomial function (or the characteristic
function of a semialgebraic set) over all distributions in P thus amounts to solving
a generalized moment problem. This moment problem as well as its dual constitute
semi-infinite linear programs, which can be recast as finite-dimensional conic
optimization problems over certain moment cones and the corresponding dual cones
of non-negative polynomials (Karlin and Studden 1966, Zuluaga and Pena 2005).
Even though NP-hard in general, these conic problems can be approximated by
increasingly tight sequences of tractable semidefinite programs by using tools from
polynomial optimization (Parrilo 2000, 2003, Lasserre 2001, 2009). This general
technique gives rise to worst-case expectation bounds and generalized Chebyshev
inequalities with respect to the ambiguity set P (Bertsimas and Sethuraman 2000,
Lasserre 2002, Popescu 2005, Lasserre 2008). In addition, it leads to tight bounds
on worst-case risk measures (Natarajan, Pachamanova and Sim 2009a).
2.2. 𝜙-Divergence Ambiguity Sets
The dissimilarity between two probability distributions is often quantified in terms
of a 𝜙-divergence, which is uniquely determined by an entropy function 𝜙.
Definition 2.4 (Entropy Functions). An entropy function 𝜙 : R → R is a lower
semicontinuous convex function with 𝜙(1) = 0 and 𝜙(𝑠) = +∞ for all 𝑠 < 0.
Note that any entropy function 𝜙 is continuous relative to its domain. In fact, this
is true for any univariate convex lower semicontinuous function. We emphasize,
however, that multivariate convex lower semicontinuous functions can have points
of discontinuity within their domains (Rockafellar and Wets 2009, Example 2.38).
The notion of a 𝜙-divergence relies on the perspective 𝜙 𝜋 of the entropy function 𝜙.
Definition 2.5 (𝜙-Divergences (Csiszár 1963, 1967, Ali and Silvey 1966)). The
(generalized) 𝜙-divergence of P ∈ P(Z) with respect to P̂ ∈ P(Z) is given by

D𝜙(P, P̂) = ∫_Z 𝜙^𝜋( (dP/d𝜌)(𝑧), (dP̂/d𝜌)(𝑧) ) d𝜌(𝑧),
where 𝜙 is an entropy function and 𝜌 ∈ M+ (Z) is any dominating measure. This
means that P and P̂ are absolutely continuous with respect to 𝜌, that is, P, P̂ ≪ 𝜌.
By the definition of 𝜙 𝜋 and our convention that 0𝜙(𝑠/0) should be interpreted as
the recession function 𝜙∞ (𝑠), D𝜙(P, P̂) can be recast as

D𝜙(P, P̂) = ∫_Z (dP̂/d𝜌)(𝑧) · 𝜙( (dP/d𝜌)(𝑧) / (dP̂/d𝜌)(𝑧) ) d𝜌(𝑧).
A dominating measure 𝜌 always exists, but it must depend on P and P̂. For
example, one may set 𝜌 = P + P̂. The absolute continuity conditions P ≪ 𝜌 and
P̂ ≪ 𝜌 ensure that the Radon-Nikodym derivatives dP/d𝜌 and dP̂/d𝜌 are well-defined,
respectively. The following proposition derives a dual representation of a generic
𝜙-divergence, which reveals that D 𝜙 (P, P̂) is in fact independent of the choice of 𝜌.
Proposition 2.6 (Dual Representation of 𝜙-Divergences). We have

D𝜙(P, P̂) = sup_{𝑓 ∈F} ∫_Z 𝑓 (𝑧) dP(𝑧) − ∫_Z 𝜙∗( 𝑓 (𝑧)) dP̂(𝑧),
where F denotes the family of all bounded Borel functions 𝑓 : Z → dom(𝜙∗ ).
Proof. As the entropy function 𝜙(𝑠) is proper, convex and lower semicontinuous
on R and as 0𝜙(𝑠/0) is interpreted as the recession function 𝜙∞ (𝑠), the perspective
function 𝜙 𝜋 (𝑠, 𝑡) = 𝑡𝜙(𝑠/𝑡) is proper, convex and lower semicontinuous on R × R+ .
By (Rockafellar 1970, Theorem 12.2), 𝜙 𝜋 (𝑠, 𝑡) can therefore be expressed as the
conjugate of its conjugate. Note that the conjugate of 𝜙 𝜋 (𝑠, 𝑡) satisfies
(𝜙^𝜋)∗( 𝑓 , 𝑔) = sup_{𝑠∈R, 𝑡∈R+} 𝑓 𝑠 + 𝑔𝑡 − 𝑡𝜙(𝑠/𝑡) = sup_{𝑡∈R+} 𝑔𝑡 + 𝑡𝜙∗( 𝑓 )
             = 0 if 𝑓 ∈ dom(𝜙∗) and 𝑔 + 𝜙∗( 𝑓 ) ≤ 0, and +∞ otherwise,
for all 𝑓 , 𝑔 ∈ R. The second equality in the above expression follows from
(Rockafellar 1970, Theorem 16.1). As 𝜙 𝜋 (𝑠, 𝑡) = sup 𝑓 ,𝑔∈R 𝑠 𝑓 + 𝑡𝑔 − (𝜙 𝜋 )∗ ( 𝑓 , 𝑔)
by virtue of (Rockafellar 1970, Theorem 12.2), the 𝜙-divergence is thus given by

D𝜙(P, P̂) = ∫_Z sup_{𝑓 ,𝑔∈R} { (dP/d𝜌)(𝑧) · 𝑓 + (dP̂/d𝜌)(𝑧) · 𝑔 − (𝜙^𝜋)∗( 𝑓 , 𝑔) } d𝜌(𝑧)
         = ∫_Z sup_{𝑓 ∈dom(𝜙∗)} { (dP/d𝜌)(𝑧) · 𝑓 − (dP̂/d𝜌)(𝑧) · 𝜙∗( 𝑓 ) } d𝜌(𝑧)
         = sup_{𝑓 ∈F} ∫_Z (dP/d𝜌)(𝑧) · 𝑓 (𝑧) − (dP̂/d𝜌)(𝑧) · 𝜙∗( 𝑓 (𝑧)) d𝜌(𝑧),
where the second equality exploits our explicit formula for (𝜙 𝜋 )∗ derived above,
while the third equality follows from (Rockafellar and Wets 2009, Theorem 14.60).
This theorem applies because the function ℎ : dom(𝜙∗ ) × Z → R defined through
ℎ( 𝑓 , 𝑧) = (dP/d𝜌)(𝑧) · 𝑓 − (dP̂/d𝜌)(𝑧) · 𝜙∗( 𝑓 )
is continuous in 𝑓 and Borel measurable in 𝑧, thus constituting a Carathéodory
integrand in the sense of (Rockafellar and Wets 2009, Example 14.29). The claim
then follows immediately from the definition of Radon-Nikodym derivatives. □
Proposition 2.6 reveals that D𝜙(P, P̂) is jointly convex in P and P̂. If 𝜙(𝑠) grows superlinearly with 𝑠, that is, if the asymptotic growth rate 𝜙∞(1) is infinite, then D𝜙(P, P̂) is finite if and only if (dP/d𝜌)(𝑧) = 0 for 𝜌-almost all 𝑧 ∈ Z with (dP̂/d𝜌)(𝑧) = 0. Put differently, D𝜙(P, P̂) is finite if and only if P ≪ P̂. In this special case, the chain rule for Radon-Nikodym derivatives implies that (dP/d𝜌)/(dP̂/d𝜌) = dP/dP̂. If 𝜙∞(1) = ∞, the 𝜙-divergence thus admits the more common (but less general) representation

D𝜙(P, P̂) = ∫_Z 𝜙( (dP/dP̂)(𝑧) ) dP̂(𝑧)  if P ≪ P̂,  and  D𝜙(P, P̂) = +∞ otherwise.

Divergence                  𝜙(𝑠) (𝑠 ≥ 0)                       𝜓(𝑠) (𝑠 ≥ 0)                       𝜙∞(1)       𝜓∞(1)
Kullback-Leibler            𝑠 log(𝑠) − 𝑠 + 1                   −log(𝑠) + 𝑠 − 1                    ∞           1
Likelihood                  −log(𝑠) + 𝑠 − 1                    𝑠 log(𝑠) − 𝑠 + 1                   1           ∞
Total variation             |𝑠 − 1|/2                          |𝑠 − 1|/2                          1/2         1/2
Pearson 𝜒²                  (𝑠 − 1)²                           (𝑠 − 1)²/𝑠                         ∞           1
Neyman 𝜒²                   (𝑠 − 1)²/𝑠                         (𝑠 − 1)²                           1           ∞
Cressie-Read, 𝛽 ∈ (0, 1)    (𝑠^𝛽 − 𝛽𝑠 + 𝛽 − 1)/(𝛽(𝛽 − 1))      (𝑠^{1−𝛽} − 𝛽 + 𝛽𝑠 − 𝑠)/(𝛽(𝛽 − 1))  1/(1 − 𝛽)   1/𝛽
Cressie-Read, 𝛽 > 1         (𝑠^𝛽 − 𝛽𝑠 + 𝛽 − 1)/(𝛽(𝛽 − 1))      (𝑠^{1−𝛽} − 𝛽 + 𝛽𝑠 − 𝑠)/(𝛽(𝛽 − 1))  ∞           1/𝛽

Table 2.1. Examples of entropy functions and their Csiszár duals.
We are now ready to define the 𝜙-divergence ambiguity set as

P = {P ∈ P(Z) : D𝜙(P, P̂) ≤ 𝑟} .   (2.10)
This set contains all probability distributions P supported on Z whose 𝜙-divergence
with respect to some prescribed reference distribution P̂ is at most 𝑟 ≥ 0.
Remark 2.7 (Csiszár Duals). The family of generalized 𝜙-divergences (which may
adopt finite values even if P fails to be absolutely continuous with respect to P̂) is invariant under permutations of P and P̂. Formally, we have D𝜙(P, P̂) = D𝜓(P̂, P), where 𝜓 denotes the Csiszár dual of 𝜙 defined through 𝜓(𝑠) = 𝜙^𝜋(1, 𝑠) = 𝑠𝜙(1/𝑠) (Ben-Tal, Ben-Israel and Teboulle 1991,
Lemma 2.3). One readily verifies that if 𝜙 is a valid entropy function in the sense
of Definition 2.4, then 𝜓 is also a valid entropy function. This relationship shows
that, even though 𝜙-divergences are generically asymmetric, we do not sacrifice
generality by focusing on divergence ambiguity sets of the form (2.10), with the
nominal distribution P̂ being the second argument of the divergence. From the
discussion after Proposition 2.6 it is clear that if 𝜙∞ (1) = ∞, then all distributions P
in the 𝜙-divergence ambiguity set (2.10) satisfy P ≪ P̂. Similarly, if the Csiszár dual
of 𝜙 satisfies 𝜓 ∞ (1) = ∞, then all distributions P in the 𝜙-divergence ambiguity set
satisfy P̂ ≪ P. Table 2.1 lists common entropy functions and their Csiszár duals.
We emphasize that the family of Cressie-Read divergences includes the (scaled)
Pearson 𝜒2 -divergence for 𝛽 = 2, the Kullback-Leibler divergence for 𝛽 → 1 and
the likelihood divergence for 𝛽 → 0 as special cases.
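The duality identity D𝜙(P, P̂) = D𝜓(P̂, P) of Remark 2.7 is easy to confirm for discrete distributions; the following sketch does so for the Pearson/Neyman pair from Table 2.1 (the probability vectors are illustrative):

```python
def phi_div(phi, P, Q):
    """D_phi(P, Q) for strictly positive discrete distributions on a common support."""
    return sum(q * phi(p / q) for p, q in zip(P, Q))

pearson = lambda s: (s - 1) ** 2        # phi for the Pearson chi^2 divergence
neyman = lambda s: (s - 1) ** 2 / s     # its Csiszar dual psi(s) = s * phi(1/s)

P = [0.2, 0.3, 0.5]
Q = [0.4, 0.4, 0.2]

# D_phi(P, Q) = D_psi(Q, P), as stated in Remark 2.7, and every
# phi-divergence vanishes when both arguments coincide.
assert abs(phi_div(pearson, P, Q) - phi_div(neyman, Q, P)) < 1e-12
assert phi_div(pearson, Q, Q) == 0.0
```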
The DRO literature often focuses on the restricted 𝜙-divergence ambiguity set

P = {P ∈ P(Z) : P ≪ P̂, D𝜙(P, P̂) ≤ 𝑟}   (2.11)

introduced by Ben-Tal et al. (2013). Unlike the standard 𝜙-divergence ambiguity set (2.10), it contains only distributions that are absolutely continuous with respect
to the reference distribution P̂ even if 𝜙∞ (1) < ∞. Ben-Tal et al. (2013) study
DRO problems over restricted 𝜙-divergence ambiguity sets under the assumption
that the reference distribution P̂ is discrete. In this case, the absolute continuity
constraint P ≪ P̂ ensures that the ambiguity set contains only discrete distributions
supported on the atoms of P̂, and thus nature’s worst-case expectation problem
reduces to a finite convex program. Ben-Tal et al. (2013) further develop a duality
theory for this problem class. Shapiro (2017) extends this duality theory to general
reference distributions P̂ that are not necessarily discrete. Hu, Hong and So (2013)
and Jiang and Guan (2016) show that any distributionally robust individual chance
constraint with respect to a restricted 𝜙-divergence ambiguity set is equivalent to
a classical chance constraint under the reference distribution P̂ but with a rescaled
confidence level. A classification of various 𝜙-divergences and an analysis of the
structural properties of the corresponding 𝜙-divergence ambiguity sets is provided
by Bayraksan and Love (2015) under the assumption that Z is finite. Below we
review popular instances of the standard and restricted 𝜙-divergence ambiguity sets.
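For a discrete reference distribution, nature's worst-case expectation problem over a restricted Kullback-Leibler ball is indeed a finite convex program, and its optimizer is well known to be an exponential tilting of P̂. The following sketch (a simplified bisection on the tilting parameter; data are illustrative, and it assumes the KL constraint binds, i.e., 𝑟 is positive but smaller than the KL divergence of the point mass at the largest loss) computes the worst-case expected loss:

```python
import math

def worst_case_kl(loss, p_hat, r):
    """sup { E_P[loss] : KL(P, p_hat) <= r } over discrete P on the atoms of p_hat.
    Sketch: the optimizer is an exponential tilting p_i ~ p_hat_i * exp(loss_i / alpha);
    we bisect on alpha > 0 until the KL constraint becomes active."""
    m = max(loss)
    def tilt(alpha):
        w = [p * math.exp((l - m) / alpha) for p, l in zip(p_hat, loss)]
        s = sum(w)
        return [x / s for x in w]
    def kl(p, q):
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    lo, hi = 1e-6, 1e6
    for _ in range(200):
        mid = math.sqrt(lo * hi)          # geometric bisection on alpha
        if kl(tilt(mid), p_hat) > r:
            lo = mid                      # tilt too aggressive: raise alpha
        else:
            hi = mid
    p = tilt(hi)
    return sum(pi * li for pi, li in zip(p, loss))

p_hat = [0.25, 0.25, 0.25, 0.25]
loss = [1.0, 2.0, 3.0, 4.0]
v0 = worst_case_kl(loss, p_hat, r=1e-9)   # tiny ball: close to the nominal mean 2.5
v1 = worst_case_kl(loss, p_hat, r=0.1)    # larger ball: strictly larger expectation
assert abs(v0 - 2.5) < 1e-3
assert 2.5 < v1 < 4.0                     # between the nominal mean and the max loss
```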
2.2.1. Kullback-Leibler Ambiguity Sets
The Kullback-Leibler divergence is the 𝜙-divergence corresponding to the entropy
function that satisfies 𝜙(𝑠) = 𝑠 log(𝑠) − 𝑠 + 1 for all 𝑠 ≥ 0; see also Table 2.1. As
𝜙∞ (1) = +∞, it thus admits the following equivalent definition.
Definition 2.8 (Kullback-Leibler Divergence). The Kullback-Leibler divergence
of P ∈ P(Z) with respect to P̂ ∈ P(Z) is given by

KL(P, P̂) = ∫_Z log( (dP/dP̂)(𝑧) ) dP(𝑧)  if P ≪ P̂,  and  KL(P, P̂) = +∞ otherwise.
We now review a famous variational formula for the Kullback-Leibler divergence.
Proposition 2.9 (Donsker and Varadhan (1983)). The Kullback-Leibler divergence
of P with respect to P̂ satisfies

KL(P, P̂) = sup_{𝑓 ∈F} ∫_Z 𝑓 (𝑧) dP(𝑧) − log( ∫_Z e^{𝑓 (𝑧)} dP̂(𝑧) ),   (2.12)

where F denotes the family of all bounded Borel functions 𝑓 : Z → R.
Proof. The convex conjugate of the entropy function 𝜙 inducing the Kullback-
Leibler divergence satisfies 𝜙∗ (𝑡) = exp(𝑡) − 1 with dom(𝜙∗ ) = R. Thus, the dual
representation of generic 𝜙-divergences established in Proposition 2.6 implies that
KL(P, P̂) = sup_{𝑓 ∈F} ∫_Z 𝑓 (𝑧) dP(𝑧) − ∫_Z (e^{𝑓 (𝑧)} − 1) dP̂(𝑧),
where F denotes the family of all bounded Borel functions 𝑓 : Z → R. Note that F
is invariant under constant shifts. That is, if 𝑓 (𝑧) is a bounded Borel function, then
so is 𝑓 (𝑧) + 𝑐 for any constant 𝑐 ∈ R. Without loss of generality, we may thus
optimize over both 𝑓 ∈ F and 𝑐 ∈ R in the above maximization problem to obtain
    KL(P, P̂) = sup_{𝑓∈F} sup_{𝑐∈R} ∫_Z ( 𝑓(𝑧) + 𝑐 ) dP(𝑧) − ∫_Z ( e^{𝑓(𝑧)+𝑐} − 1 ) dP̂(𝑧).

For any fixed 𝑓 ∈ F, the inner maximization problem over 𝑐 is uniquely solved by
    𝑐★ = − log( ∫_Z e^{𝑓(𝑧)} dP̂(𝑧) ).
Substituting this expression back into the objective function yields (2.12). 
Proposition 2.9 establishes a link between the Kullback-Leibler divergence and
the entropic risk measure. This connection will become useful in Section 4.3.
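On a finite support, the variational formula (2.12) reduces to a finite-dimensional maximization and can be verified numerically. The following Python sketch (the two distributions are arbitrary illustrative choices) checks that every test function 𝑓 yields a lower bound on KL(P, P̂) and that the maximizer 𝑓★ = log(dP/dP̂) attains it.

```python
import numpy as np

# Sanity check of the Donsker-Varadhan formula (2.12) on a three-point space.
# The distributions p (for P) and q (for P̂) are illustrative; P ≪ P̂ holds.
rng = np.random.default_rng(1)
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.3, 0.4])
kl = float(np.sum(p * np.log(p / q)))   # KL(P, P̂)

# Every bounded test function yields a lower bound on the KL divergence ...
for _ in range(1000):
    f = rng.normal(size=3)
    assert f @ p - np.log(q @ np.exp(f)) <= kl + 1e-12

# ... and the maximizer f* = log(dP/dP̂) attains it.
f_star = np.log(p / q)
gap = f_star @ p - np.log(q @ np.exp(f_star)) - kl
assert abs(gap) < 1e-12
```

The same check applies verbatim to any pair of discrete distributions with P ≪ P̂.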
The Kullback-Leibler ambiguity set of radius 𝑟 ≥ 0 around P̂ ∈ P(Z) is given by
    P = { P ∈ P(Z) : KL(P, P̂) ≤ 𝑟 }.      (2.13)
As 𝜙∞ (1) = +∞, all distributions P ∈ P are absolutely continuous with respect to P̂.
Thus, P coincides with the restricted Kullback-Leibler ambiguity set. El Ghaoui
et al. (2003) derive a closed-form expression for the worst-case value-at-risk of a
linear loss function when P̂ is a Gaussian distribution. Hu and Hong (2013) use
similar techniques to show that any distributionally robust individual chance
constraint with respect to a Kullback-Leibler ambiguity set is equivalent to a classical
chance constraint with a rescaled confidence level. Calafiore (2007) studies worst-
case mean-risk portfolio selection problems when P̂ is a discrete distribution. The
Kullback-Leibler ambiguity set has also found applications in least-squares
estimation (Levy and Nikoukhah 2004), hypothesis testing (Levy 2008, Gül and Zoubir
2017), filtering (Levy and Nikoukhah 2012, Zorzi 2016, 2017a,b), the theory of
risk measures (Ahmadi-Javid 2012, Postek et al. 2016) and extreme value analysis
(Blanchet, He and Murthy 2020), among many others.

2.2.2. Likelihood Ambiguity Sets


As the Kullback-Leibler divergence fails to be symmetric, it gives rise to two strictly
different ambiguity sets. The Kullback-Leibler ambiguity set from Section 2.2.1 is
obtained by fixing the second argument of the Kullback-Leibler divergence to the
reference distribution P̂ and considering all distributions P with KL(P, P̂) ≤ 𝑟. An
alternative ambiguity set is obtained by using P̂ as the first argument and setting
    P = { P ∈ P(Z) : KL(P̂, P) ≤ 𝑟 }.      (2.14)
We refer to P as the likelihood ambiguity set centered at P̂ ∈ P(Z). Indeed, the
likelihood or Burg-entropy divergence of P ∈ P(Z) with respect to P̂ ∈ P(Z) is
usually defined as the reverse Kullback-Leibler divergence KL(P̂, P). This
terminology is based on the following intuition. If Z is a discrete set and
P̂ = (1/𝑁) Σ_{𝑖=1}^{𝑁} 𝛿_{𝑧ˆ𝑖} is the empirical distribution corresponding to 𝑁
independent samples {𝑧ˆ𝑖}_{𝑖=1}^{𝑁} from an unknown distribution on Z, then it is
natural to construct the family of all distributions on Z that make the observed data
achieve a prescribed level of likelihood. This distribution family corresponds to a
superlevel set of the likelihood function L(P) = Π_{𝑖=1}^{𝑁} P(𝑍 = 𝑧ˆ𝑖) over P(Z).
One can show that any such superlevel
set coincides with a sublevel set of the likelihood divergence KL(P̂, P). Thus, it
constitutes a likelihood ambiguity set of the form (2.14). We emphasize that this
correspondence does not easily carry over to situations where Z fails to be discrete.
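For a finite sample space, this correspondence is easy to make explicit: since KL(P̂, P) = −H(P̂) − (1/𝑁) log L(P), where H(P̂) denotes the entropy of the empirical distribution, the sublevel set {P : KL(P̂, P) ≤ 𝑟} coincides with the superlevel set {P : L(P) ≥ exp(−𝑁(H(P̂) + 𝑟))}. The following Python sketch (hypothetical data and radius) verifies this equivalence on randomly drawn candidate distributions.

```python
import numpy as np

# Hypothetical dataset on Z = {0, 1, 2}; the radius r is illustrative.
rng = np.random.default_rng(2)
samples = np.array([0, 1, 1, 2, 0, 1])
N = len(samples)
counts = np.bincount(samples, minlength=3)
p_hat = counts / N                             # empirical distribution P̂
H_hat = -float(np.sum(p_hat[p_hat > 0] * np.log(p_hat[p_hat > 0])))
r = 0.1

def kl(a, b):
    """Kullback-Leibler divergence of a with respect to b (finite support)."""
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

checked = 0
for _ in range(1000):
    p = rng.dirichlet(np.ones(3))              # candidate distribution P
    likelihood = float(np.prod(p ** counts))   # L(P) = Π_i P(Z = ẑ_i)
    if abs(kl(p_hat, p) - r) > 1e-9:           # skip numerical boundary cases
        assert (kl(p_hat, p) <= r) == (likelihood >= np.exp(-N * (H_hat + r)))
        checked += 1
```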
Likelihood ambiguity sets were originally introduced by Wang et al. (2016) in the
context of static DRO, and they were used by Wiesemann, Kuhn and Rustem (2013)
in the context of robust Markov decision processes. Bertsimas et al. (2018a,b)
show that the likelihood ambiguity set contains all distributions that pass a G-test
of goodness-of-fit at a prescribed significance level.
Likelihood ambiguity sets display several statistical optimality properties even
if Z is uncountable. To explain these properties, we consider the task of evaluating
a (1 − 𝜂)-upper confidence bound on the expected value of some loss function
under an unknown distribution P when 𝑁 independent samples from P are given.
Leveraging the empirical likelihood theorem by Owen (1988), Lam (2019) shows
a desirable property of the likelihood ambiguity set centered around the empirical
distribution P̂: The associated worst-case expected loss provides the least conser-
vative confidence bound for a constant significance level 𝜂 asymptotically when
the radius 𝑟 decays at the rate 1/𝑁. Similar guarantees for a broader class of
𝜙-divergences are reported by Duchi, Glynn and Namkoong (2021). In addition,
Van Parys, Mohajerin Esfahani and Kuhn (2021) leverage Sanov’s large deviation
principle (Cover and Thomas 2006, Theorem 11.4.1) to prove that the worst-case
expected loss with respect to a likelihood ambiguity set of constant radius 𝑟 around P̂
provides the least conservative confidence bound for a decaying significance level
𝜂 ∝ e^{−𝑟𝑁} asymptotically for large 𝑁. Gupta (2019) further shows that a
likelihood ambiguity set of radius 𝑟 ∝ 𝑁^{−1/2} around P̂ represents the smallest convex
ambiguity set that satisfies a Bayesian robustness guarantee.

2.2.3. Total Variation Ambiguity Sets


The total variation distance of two distributions P, P̂ ∈ P(Z) is the maximum
absolute difference between the probabilities assigned to any event by P and P̂.
Definition 2.10 (Total Variation Distance). The total variation distance is the
function TV : P(Z) × P(Z) → [0, 1] defined through

    TV(P, P̂) = sup { |P(B) − P̂(B)| : B ⊆ Z is a Borel set }.
The total variation distance is evidently symmetric and satisfies the identity
of indiscernibles as well as the triangle inequality. Thus, it constitutes a metric
on P(Z). In addition, the total variation distance is an instance of a 𝜙-divergence.

Proposition 2.11. The total variation distance coincides with the 𝜙-divergence
induced by the entropy function with 𝜙(𝑠) = |𝑠 − 1|/2 for all 𝑠 ≥ 0.
Proof. The conjugate entropy function evaluates to 𝜙∗(𝑡) = max{𝑡, −1/2} if 𝑡 ≤ 1/2
and to 𝜙∗(𝑡) = +∞ if 𝑡 > 1/2. By Proposition 2.6, the 𝜙-divergence corresponding to
the given entropy function thus admits the dual representation
    D𝜙(P, P̂) = sup_{𝑓∈F} ∫_Z 𝑓(𝑧) dP(𝑧) − ∫_Z max{ 𝑓(𝑧), −1/2 } dP̂(𝑧),      (2.15)

where F denotes the family of all bounded Borel functions 𝑓 : Z → (−∞, 1/2]. As
clipping any 𝑓 ∈ F from below at −1/2 creates a new function in F with a non-inferior
objective value, we can in fact restrict attention to Borel functions 𝑓 : Z → [−1/2, 1/2].
The objective function in (2.15) then simplifies to ∫_Z 𝑓(𝑧) dP(𝑧) − ∫_Z 𝑓(𝑧) dP̂(𝑧). This
simplified objective function remains unchanged when 𝑓 is shifted by a constant.
In summary, we may therefore conclude that (2.15) is equivalent to
    D𝜙(P, P̂) = sup_{𝑓∈F′} ∫_Z 𝑓(𝑧) dP(𝑧) − ∫_Z 𝑓(𝑧) dP̂(𝑧),      (2.16)

where F ′ denotes the family of all Borel functions 𝑓 : Z → [0, 1]. Moreover, as
the objective function of the maximization problem in (2.16) is linear in 𝑓 , we can
further restrict F ′ to contain only binary Borel functions 𝑓 : Z → {0, 1} without
sacrificing optimality. As there is a one-to-one correspondence between Borel sets
and their characteristic functions, we finally obtain the desired identity
    D𝜙(P, P̂) = sup { P(B) − P̂(B) : B ⊆ Z is a Borel set }.
Hence, the claim follows. 
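Both characterizations of the total variation distance can be evaluated exactly when Z is finite. The following Python sketch (two illustrative distributions on four points) confirms that the supremum over all 2⁴ events matches the 𝜙-divergence value, which for 𝜙(𝑠) = |𝑠 − 1|/2 is half the ℓ1-distance between the probability vectors.

```python
import numpy as np
from itertools import chain, combinations

# Two illustrative distributions on a four-point space.
p = np.array([0.1, 0.4, 0.2, 0.3])
q = np.array([0.25, 0.25, 0.25, 0.25])

# TV via Definition 2.10: maximize |P(B) - P̂(B)| over all events B.
events = chain.from_iterable(combinations(range(4), k) for k in range(5))
tv_sup = max(abs(p[list(B)].sum() - q[list(B)].sum()) for B in events)

# TV via the φ-divergence with φ(s) = |s - 1|/2, i.e. half the L1 distance.
tv_l1 = 0.5 * float(np.abs(p - q).sum())

assert abs(tv_sup - tv_l1) < 1e-12
```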
The total variation ambiguity set of radius 𝑟 ≥ 0 around P̂ ∈ P(Z) is given by
    P = { P ∈ P(Z) : TV(P, P̂) ≤ 𝑟 }.
Most of the existing literature focuses on the restricted total variation ambiguity
set, which contains all distributions P ∈ P that satisfy P ≪ P̂. Jiang and Guan
(2018, Theorem 1) and Shapiro (2017, Example 3.7) show that the worst-case
expected loss with respect to a restricted total variation ambiguity set coincides with
a combination of a conditional value-at-risk and the essential supremum of the loss
with respect to P̂; see also Section 6.10. Rahimian, Bayraksan and Homem-de-Mello
(2019a,b, 2022) study the worst-case distributions of DRO problems over
unrestricted total variation ambiguity sets when Z is finite. The total variation
ambiguity set is related to Huber’s contamination model from robust statistics (Huber
1981), which assumes that a fraction 𝑟 ∈ (0, 1) of all samples in a statistical dataset
are drawn from an arbitrary contaminating distribution. Hence, the total variation
distance between the target distribution to be estimated and the contaminated
data-generating distribution is at most 𝑟. It is thus natural to use a total variation
ambiguity set of radius 𝑟 around some estimated distribution as the search space for
the target distribution (Nishimura and Ozaki 2004, 2006, Bose and Daripa 2009,
Duchi, Hashimoto and Namkoong 2023, Tsang and Shehadeh 2024).

2.2.4. 𝜒2 -Divergence Ambiguity Set


The 𝜒2 -divergence is the 𝜙-divergence corresponding to the entropy function that
satisfies 𝜙(𝑠) = (𝑠 − 1)2 for all 𝑠 ≥ 0; see also Table 2.1. As 𝜙∞ (1) = +∞, it thus
admits the following equivalent definition.
Definition 2.12 (𝜒2 -Divergence). The 𝜒2 -divergence of P ∈ P(Z) with respect
to P̂ ∈ P(Z) is given by
 ∫  2

 dP

 (𝑧) − 1 dP̂(𝑧) if P ≪ P̂,
𝜒2 (P, P̂) = Z dP̂


 +∞ otherwise.

The 𝜒2 -divergence admits the following dual representation.
Proposition 2.13. The 𝜒2 -divergence of P with respect to P̂ satisfies
    𝜒²(P, P̂) = sup_{𝑓∈F} ( E_P[𝑓(𝑍)] − E_P̂[𝑓(𝑍)] )² / V_P̂[𝑓(𝑍)],
where F is a shorthand for the family of all bounded Borel functions 𝑓 : Z → R,
and VP̂ [ 𝑓 (𝑍)] stands for the variance of 𝑓 (𝑍) under P̂. If VP̂ [ 𝑓 (𝑍)] = 0, then the
above fraction is interpreted as 0 if E P [ 𝑓 (𝑍)] = E P̂ [ 𝑓 (𝑍)] and as +∞ otherwise.
Proof. The convex conjugate of the entropy function inducing the 𝜒²-divergence
satisfies 𝜙∗(𝑡) = 𝑡²/4 + 𝑡 if 𝑡 ≥ −2 and 𝜙∗(𝑡) = −1 if 𝑡 < −2, and its domain is given
by dom(𝜙∗ ) = R. Consequently, Proposition 2.6 implies that
    𝜒²(P, P̂) = sup_{𝑓∈F} ∫_Z 𝑓(𝑧) dP(𝑧) − ∫_Z ( 𝑓(𝑧)²/4 + 𝑓(𝑧) ) dP̂(𝑧),
where F denotes the family of all bounded Borel functions 𝑓 : Z → R. Note
that we have replaced 𝜙∗ ( 𝑓 (𝑧)) with 𝑓 (𝑧)2 /4 + 𝑓 (𝑧) in the second integral. This
may be done without loss of generality. Indeed, if the function 𝑓 (𝑧) adopts values
below −2, then it is (weakly) dominated by the function 𝑓 ′ (𝑧) = max{ 𝑓 (𝑧), −2}.
Note also that F is invariant under constant shifts. That is, if 𝑓 (𝑧) is a bounded
Borel function, then so is 𝑓 (𝑧)+𝑐 for any constant 𝑐 ∈ R. An elementary calculation
reveals that, for any fixed 𝑓 ∈ F, the optimal shift is 𝑐★ = −E P̂ [ 𝑓 (𝑍)]. Hence, we
may replace 𝑓 (𝑧) with 𝑓 (𝑧) − E P̂ [ 𝑓 (𝑍)] in the above expression, which yields
    𝜒²(P, P̂) = sup_{𝑓∈F} E_P[𝑓(𝑍)] − E_P̂[𝑓(𝑍)] − V_P̂[𝑓(𝑍)]/4.
Note that the set F is also invariant under scaling. That is, if 𝑓(𝑧) is a bounded
Borel function, then so is 𝑐𝑓(𝑧) for any constant 𝑐 ∈ R. We may thus optimize
separately over 𝑓 ∈ F and 𝑐 ∈ R in the above maximization problem to obtain

    𝜒²(P, P̂) = sup_{𝑓∈F} sup_{𝑐∈R} ( E_P[𝑓(𝑍)] − E_P̂[𝑓(𝑍)] ) 𝑐 − V_P̂[𝑓(𝑍)] 𝑐²/4

              = sup_{𝑓∈F} ( E_P[𝑓(𝑍)] − E_P̂[𝑓(𝑍)] )² / V_P̂[𝑓(𝑍)].
Note that the inner maximization problem over 𝑐 simply evaluates the conjugate
of the convex quadratic function VP̂ [ 𝑓 (𝑍)]𝑐2 /4 at E P [ 𝑓 (𝑍)] − E P̂ [ 𝑓 (𝑍)], which is
available in closed form. Thus, the claim follows. 
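The dual representation of Proposition 2.13 can likewise be verified numerically on a finite space. The following Python sketch (illustrative distributions with P ≪ P̂) checks that the fraction never exceeds the 𝜒²-divergence and that it is attained at 𝑓 = dP/dP̂.

```python
import numpy as np

# Illustrative distributions on a three-point space.
rng = np.random.default_rng(3)
p = np.array([0.5, 0.2, 0.3])                   # P
q = np.array([0.3, 0.4, 0.3])                   # P̂
chi2 = float(np.sum((p / q - 1.0) ** 2 * q))    # χ²(P, P̂)

def ratio(f):
    """(E_P[f(Z)] - E_P̂[f(Z)])² / V_P̂[f(Z)] for a test function f."""
    return float((f @ p - f @ q) ** 2 / (q @ (f - f @ q) ** 2))

# Random test functions never exceed the χ²-divergence ...
for _ in range(1000):
    assert ratio(rng.normal(size=3)) <= chi2 + 1e-9

# ... and f* = dP/dP̂ attains the supremum.
assert abs(ratio(p / q) - chi2) < 1e-12
```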
As the 𝜒2 -divergence fails to be symmetric, it gives rise to two complementary
ambiguity sets, which differ according to whether the reference distribution P̂ ∈
P(Z) is used as the first or the second argument of the 𝜒2 -divergence. Lam (2018)
defines the Pearson 𝜒2 -ambiguity set of radius 𝑟 ≥ 0 around P̂ as
    P = { P ∈ P(Z) : 𝜒²(P, P̂) ≤ 𝑟 }      (2.17)
in order to analyze operations and service systems with dependent data. Philpott,
de Matos and Kapelevich (2018) develop a stochastic dual dynamic programming
algorithm for solving distributionally robust multistage stochastic programs with a
Pearson ambiguity set. In the context of static DRO, Duchi and Namkoong (2019)
show that robustification with respect to a Pearson ambiguity set is closely related
to variance regularization. Note that as 𝜙∞ (1) = +∞, the Pearson ambiguity set
coincides with its restricted version, which contains only distributions P ≪ P̂.
Klabjan, Simchi-Levi and Song (2013) define the Neyman 𝜒2 -ambiguity set as
    P = { P ∈ P(Z) : 𝜒²(P̂, P) ≤ 𝑟 }
in order to formulate robust lot-sizing problems. Hanasusanto and Kuhn (2013) use
a Neyman ambiguity set with finite Z in the context of robust data-driven dynamic
programming. Finally, Hanasusanto et al. (2015a) use the same ambiguity set to
model the uncertainty in the mixture weights of multimodal demand distributions.

2.3. Optimal Transport Ambiguity Sets


Optimal transport theory offers a natural way to quantify the difference between
probability distributions and gives rise to a rich family of ambiguity sets. To explain
this, we first introduce the notion of a transportation cost function.
Definition 2.14 (Transportation Cost Function). A lower semicontinuous function
𝑐 : Z ×Z → [0, +∞] with 𝑐(𝑧, 𝑧) = 0 for all 𝑧 ∈ Z is a transportation cost function.
Every transportation cost function induces an optimal transport discrepancy.
Definition 2.15 (Optimal Transport Discrepancy). The optimal transport
discrepancy OT𝑐 : P(Z) × P(Z) → [0, +∞] associated with any given transportation

cost function 𝑐 is defined through


    OT𝑐(P, P̂) = inf_{𝛾∈Γ(P,P̂)} E𝛾[𝑐(𝑍, 𝑍ˆ)],      (2.18)

where Γ(P, P̂) represents the set of all couplings 𝛾 of P and P̂, that is, all joint
probability distributions of 𝑍 and 𝑍ˆ with marginals P and P̂, respectively.
By definition, we have 𝛾 ∈ Γ(P, P̂) if and only if 𝛾((𝑍, 𝑍ˆ) ∈ B × Z) = P(𝑍 ∈ B)
and 𝛾((𝑍, 𝑍ˆ) ∈ Z × B̂) = P̂(𝑍ˆ ∈ B̂) for all Borel sets B, B̂ ⊆ Z. If the probability
distributions P and P̂ are visualized as two piles of sand, then any coupling 𝛾 ∈ Γ(P, P̂)
can be interpreted as a transportation plan, that is, an instruction for morphing P̂ into
the shape of P by moving sand between various origin-destination pairs in Z. Indeed,
for any fixed origin 𝑧ˆ ∈ Z, the conditional probability 𝛾(𝑧 ≤ 𝑍 ≤ 𝑧 + d𝑧 | 𝑍ˆ = 𝑧ˆ)
determines the proportion of the sand located at 𝑧ˆ that should be moved to (an
infinitesimally small rectangle at) the destination 𝑧.
of probability mass from 𝑧ˆ to 𝑧 amounts to 𝑐(𝑧, 𝑧ˆ), then OT𝑐 (P, P̂) is the minimal
amount of money that is needed to morph P̂ into P. We now provide a dual
representation for generic optimal transport discrepancies.
Proposition 2.16 (Kantorovich Duality I). We have
    OT𝑐(P, P̂) = sup { ∫_Z 𝑓(𝑧) dP(𝑧) − ∫_Z 𝑔(𝑧ˆ) dP̂(𝑧ˆ) : 𝑓 ∈ L1(P), 𝑔 ∈ L1(P̂),
                       𝑓(𝑧) − 𝑔(𝑧ˆ) ≤ 𝑐(𝑧, 𝑧ˆ) ∀𝑧, 𝑧ˆ ∈ Z },      (2.19)

where L1 (P) and L1 (P̂) denote the sets of all Borel functions from Z to R that are
integrable with respect to P and P̂, respectively.
The dual problem (2.19) represents the profit maximization problem of a third
party that redistributes the sand from P̂ to P on behalf of the problem owner by
buying sand at the origin 𝑧ˆ at unit price 𝑔(ˆ𝑧) and selling sand at the destination 𝑧
at unit price 𝑓 (𝑧). The constraints ensure that it is cheaper for the problem owner
to use the services of the third party instead of moving the sand without external
help at the transportation cost 𝑐(𝑧, 𝑧ˆ) for every origin-destination pair (ˆ𝑧 , 𝑧). The
optimal price functions 𝑓 ★ and 𝑔★, if they exist, are termed Kantorovich potentials.

Proof of Proposition 2.16. For a general proof we refer to (Villani 2008,
Theorem 5.10 (i)). We prove the claim under the simplifying assumption that Z is
compact. In this case, the family C(Z × Z) of all continuous (and thus bounded)
functions 𝑓 : Z × Z → R equipped with the supremum norm constitutes a Banach
space. Its topological dual is the space M(Z ×Z) of all finite signed Borel measures
on Z × Z equipped with the total variation norm (Folland 1999, Corollary 7.18).
This means that for every continuous linear
∫ functional 𝜑 : C(Z × Z) → R there
exists 𝛾 ∈ M(Z × Z) such that 𝜑( 𝑓 ) = Z ×Z 𝑓 (𝑧, 𝑧ˆ) d𝛾(𝑧, 𝑧ˆ) for all 𝑓 ∈ C(Z × Z).

We first use the Fenchel–Rockafellar duality theorem to show that


    OT𝑐(P, P̂) = sup { ∫_Z 𝑓(𝑧) dP(𝑧) − ∫_Z 𝑔(𝑧ˆ) dP̂(𝑧ˆ) : 𝑓, 𝑔 ∈ C(Z),
                       𝑓(𝑧) − 𝑔(𝑧ˆ) ≤ 𝑐(𝑧, 𝑧ˆ) ∀𝑧, 𝑧ˆ ∈ Z },      (2.20)
that is, we prove that strong duality holds if the price functions 𝑓 and 𝑔 in the dual
problem are restricted to the space C(Z) of continuous functions from Z to R. To
this end, we re-express the maximization problem in (2.20) more compactly as
    sup_{ℎ∈C(Z×Z)} −𝜙(ℎ) − 𝜓(ℎ),      (2.21)

where the convex functions 𝜙, 𝜓 : C(Z × Z) → (−∞, +∞] are defined through
    𝜙(ℎ) = 0  if −ℎ(𝑧, 𝑧ˆ) ≤ 𝑐(𝑧, 𝑧ˆ) ∀𝑧, 𝑧ˆ ∈ Z,   and   𝜙(ℎ) = +∞  otherwise,
and
    𝜓(ℎ) = ∫_Z ∫_Z ℎ(𝑧, 𝑧ˆ) dP(𝑧) dP̂(𝑧ˆ)  if ∃ 𝑓, 𝑔 ∈ C(Z) with ℎ(𝑧, 𝑧ˆ) = 𝑔(𝑧ˆ) − 𝑓(𝑧)
    ∀𝑧, 𝑧ˆ ∈ Z,   and   𝜓(ℎ) = +∞  otherwise.
Note that (2.21) can be viewed as the conjugate of 𝜙 + 𝜓 with respect to the
pairing of C(Z × Z) and M(Z × Z) evaluated at the zero measure. Note also
that 𝜙 is continuous at the constant function ℎ0 ≡ 1 because the transportation
cost function 𝑐 is non-negative. In addition, ℎ0 belongs to the domain of 𝜓. The
Fenchel–Rockafellar duality theorem (Brezis 2011, Theorem 1.12) thus ensures
that the conjugate of the sum of the proper convex functions 𝜙 and 𝜓 coincides
with the infimal convolution of their conjugates 𝜙∗ and 𝜓 ∗ . Hence, (2.21) equals
    (𝜙 + 𝜓)∗(0) = inf_{𝛾∈M(Z×Z)} 𝜙∗(−𝛾) + 𝜓∗(𝛾).      (2.22)

It remains to evaluate the conjugates of 𝜙 and 𝜓. For any 𝛾 ∈ M(Z × Z) we have


    𝜙∗(−𝛾) = sup { −∫_{Z×Z} ℎ(𝑧, 𝑧ˆ) d𝛾(𝑧, 𝑧ˆ) : ℎ ∈ C(Z×Z), −ℎ(𝑧, 𝑧ˆ) ≤ 𝑐(𝑧, 𝑧ˆ) ∀𝑧, 𝑧ˆ ∈ Z }

            = ∫_{Z×Z} 𝑐(𝑧, 𝑧ˆ) d𝛾(𝑧, 𝑧ˆ)  if 𝛾 ∈ M+(Z × Z),   and   𝜙∗(−𝛾) = +∞  otherwise,
where M+ (Z × Z) stands for the cone of finite Borel measures on Z × Z. Indeed, if
𝛾 ∈ M+ (Z × Z), then the second equality follows from the monotone convergence
theorem, which applies because 𝑐 is lower semicontinuous and can thus be written
as the pointwise limit of a non-decreasing sequence of continuous functions (see
also Lemma 3.1 below). On the other hand, if 𝛾 ∉ M+ (Z × Z), then the second
equality holds because every 𝛾 ∈ M(Z ×Z) is a Radon measure, which ensures that

the measure of any Borel set can be approximated with the integral of a continuous
function. Similarly, for any 𝛾 ∈ M(Z × Z) one readily verifies that
    𝜓∗(𝛾) = 0  if 𝛾 ∈ Γ(P, P̂),   and   𝜓∗(𝛾) = +∞  otherwise.
Substituting the above formulas for 𝜙∗ and 𝜓 ∗ into (2.22) yields (2.20).
Relaxing the requirement 𝑓 , 𝑔 ∈ C(Z) to 𝑓 ∈ L1 (P) and 𝑔 ∈ L1 (P̂) on the right
hand side of (2.20) immediately leads to the upper bound
    OT𝑐(P, P̂) ≤ sup { ∫_Z 𝑓(𝑧) dP(𝑧) − ∫_Z 𝑔(𝑧ˆ) dP̂(𝑧ˆ) : 𝑓 ∈ L1(P), 𝑔 ∈ L1(P̂),
                       𝑓(𝑧) − 𝑔(𝑧ˆ) ≤ 𝑐(𝑧, 𝑧ˆ) ∀𝑧, 𝑧ˆ ∈ Z }.      (2.23)
On the other hand, it is clear that


    OT𝑐(P, P̂) = inf_{𝛾∈M+(Z×Z)} sup_{𝑓∈L1(P), 𝑔∈L1(P̂)} ∫_{Z×Z} ( 𝑐(𝑧, 𝑧ˆ) − 𝑓(𝑧) + 𝑔(𝑧ˆ) ) d𝛾(𝑧, 𝑧ˆ)
                                                       + ∫_Z 𝑓(𝑧) dP(𝑧) − ∫_Z 𝑔(𝑧ˆ) dP̂(𝑧ˆ).
Interchanging the order of minimization and maximization in the above expression
and then evaluating the inner infimum in closed form yields
    OT𝑐(P, P̂) ≥ sup { ∫_Z 𝑓(𝑧) dP(𝑧) − ∫_Z 𝑔(𝑧ˆ) dP̂(𝑧ˆ) : 𝑓 ∈ L1(P), 𝑔 ∈ L1(P̂),
                       𝑓(𝑧) − 𝑔(𝑧ˆ) ≤ 𝑐(𝑧, 𝑧ˆ) ∀𝑧, 𝑧ˆ ∈ Z }.      (2.24)
Combining (2.23) with (2.24) proves (2.19), and thus the claim follows. 
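When both distributions have finite supports, the primal problem (2.18) is a small linear program and weak duality in (2.19) becomes transparent. The following Python sketch (all data are illustrative; it relies on scipy.optimize.linprog) solves the primal coupling problem and checks that a feasible price pair (𝑓, 𝑔) yields a lower bound on OT𝑐(P, P̂).

```python
import numpy as np
from scipy.optimize import linprog

# Discrete marginals p (for P) and q (for P̂) with atoms on the real line;
# the transportation cost is c(z, ẑ) = |z - ẑ|. All data are illustrative.
rng = np.random.default_rng(4)
m, n = 4, 5
p = rng.random(m); p /= p.sum()
q = rng.random(n); q /= q.sum()
z, z_hat = rng.random(m), rng.random(n)
C = np.abs(z[:, None] - z_hat[None, :])

# Primal problem (2.18): minimize <C, γ> over couplings γ ≥ 0 whose row sums
# equal p and whose column sums equal q.
A_eq = np.vstack([np.kron(np.eye(m), np.ones(n)),   # row sums of γ
                  np.kron(np.ones(m), np.eye(n))])  # column sums of γ
res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([p, q]),
              bounds=(0, None))
assert res.status == 0
ot_value = float(res.fun)

# Weak duality in (2.19): f(z_i) = min_j C[i, j] together with g ≡ 0 is dual
# feasible, since f(z) - g(ẑ) ≤ c(z, ẑ), and hence lower bounds OT_c(P, P̂).
f = C.min(axis=1)
assert p @ f <= ot_value + 1e-9
```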

The dual optimal transport problem (2.19) constitutes a linear program over the
price functions 𝑓 ∈ L1 (P) and 𝑔 ∈ L1 (P̂), and its objective function is linear in P
and P̂. As pointwise suprema of linear functions are convex, OT𝑐 (P, P̂) is thus
jointly convex in P and P̂. Problem (2.19) can be further simplified by invoking the
𝑐-transform 𝑓 𝑐 : Z → (−∞, +∞] of the price function 𝑓 , which is defined through
    𝑓^𝑐(𝑧ˆ) = sup_{𝑧∈Z} 𝑓(𝑧) − 𝑐(𝑧, 𝑧ˆ).      (2.25)

The constraints of the dual problem (2.19) can now be re-expressed as


𝑔(ˆ𝑧 ) ≥ 𝑓 (𝑧) − 𝑐(𝑧, 𝑧ˆ) ∀𝑧, 𝑧ˆ ∈ Z ⇐⇒ 𝑔(ˆ𝑧) ≥ 𝑓 𝑐 (ˆ𝑧 ) ∀ˆ𝑧 ∈ Z.
Note that problem (2.19) seeks a price function 𝑔 that is as small as possible. As 𝑔 is
lower bounded by 𝑓 𝑐 , this suggests that 𝑔 = 𝑓 𝑐 at optimality. Conversely, defining
the 𝑐-transform 𝑔 𝑐 : Z → [−∞, +∞) of the price function 𝑔 through
    𝑔^𝑐(𝑧) = inf_{𝑧ˆ∈Z} 𝑔(𝑧ˆ) + 𝑐(𝑧, 𝑧ˆ),      (2.26)

the constraint of problem (2.19) can be re-expressed as

𝑓 (𝑧) ≤ 𝑔(ˆ𝑧) + 𝑐(𝑧, 𝑧ˆ) ∀𝑧, 𝑧ˆ ∈ Z ⇐⇒ 𝑓 (𝑧) ≤ 𝑔 𝑐 (𝑧) ∀𝑧 ∈ Z.

This suggests that 𝑓 = 𝑔^𝑐 at optimality. Note that 𝑓^𝑐 and 𝑔^𝑐 constitute pointwise
suprema of upper semicontinuous functions and are therefore also upper
semicontinuous. In addition, note that 𝑓^𝑐 and 𝑔^𝑐 may fail to be integrable with
respect to P̂ and P, respectively. If 𝑓 ∈ L1(P) and 𝑔 ∈ L1(P̂), however, then one can
verify that the integrals ∫_Z 𝑓^𝑐(𝑧ˆ) dP̂(𝑧ˆ) < +∞ and ∫_Z 𝑔^𝑐(𝑧) dP(𝑧) > −∞ exist as
extended real numbers. The above insights culminate in the following corollary, which we state
numbers. The above insights culminate in the following corollary, which we state
without proof. For details see (Villani 2008, Theorem 5.10 (i)).

Corollary 2.17 (Kantorovich Duality II). We have


    OT𝑐(P, P̂) = sup_{𝑓∈L1(P)} ∫_Z 𝑓(𝑧) dP(𝑧) − ∫_Z 𝑓^𝑐(𝑧ˆ) dP̂(𝑧ˆ)

               = sup_{𝑔∈L1(P̂)} ∫_Z 𝑔^𝑐(𝑧) dP(𝑧) − ∫_Z 𝑔(𝑧ˆ) dP̂(𝑧ˆ),

where the 𝑐-transforms 𝑓^𝑐 and 𝑔^𝑐 are defined in (2.25) and (2.26), respectively.
In addition, the first (second) supremum does not change if we require that 𝑓 = 𝑔^𝑐
(𝑔 = 𝑓^𝑐) for some function 𝑔 : Z → (−∞, +∞] (𝑓 : Z → [−∞, +∞)).

Given any transportation cost function 𝑐, reference distribution P̂ ∈ P(Z) and


transportation budget 𝑟 ≥ 0, the optimal transport ambiguity set is defined as
    P = { P ∈ P(Z) : OT𝑐(P, P̂) ≤ 𝑟 }.      (2.27)
By construction, P contains all probability distributions P that can be obtained by
reshaping the reference distribution P̂ at a finite cost of at most 𝑟 ≥ 0. The optimal
transport ambiguity set was first studied by Pflug and Wozabal (2007), who propose
a successive linear programming algorithm to solve robust mean-risk portfolio
selection problems when Z is finite. Postek et al. (2016) leverage tools from
conjugate duality theory to develop an exact solution method for the same problem
class. Wozabal (2012) and Pflug and Pichler (2014, § 7.1) reformulate DRO
problems with optimal transport ambiguity sets over uncountable support sets Z ⊆
R𝑑 as finite-dimensional nonconvex programs and address them with methods
from global optimization. Mohajerin Esfahani and Kuhn (2018) and Zhao and
Guan (2018) use specialized duality results to show that these DRO problems are
in fact equivalent to generalized moment problems that admit exact reformulations
as finite-dimensional convex programs. Blanchet and Murthy (2019), Gao and
Kleywegt (2023) as well as Zhang et al. (2024b) show that the underlying duality
results remain valid even when Z is a Polish space. For recent surveys of the theory
and applications of DRO with optimal transport ambiguity sets we refer to Kuhn
et al. (2019) and Blanchet, Murthy and Nguyen (2021).

2.3.1. 𝑝-Wasserstein Ambiguity Sets


It is common to set the transportation cost function 𝑐 in Definition 2.15 to the 𝑝-th
power of some metric on Z. In this case, the 𝑝-th root of the optimal transport
discrepancy is termed the 𝑝-Wasserstein distance.
Definition 2.18 (𝑝-Wasserstein Distance). Assume that 𝑑(·, ·) is a metric on Z
and 𝑝 ∈ [1, +∞) is a prescribed exponent. Then, the 𝑝-Wasserstein distance
W 𝑝 : P(Z) × P(Z) → [0, +∞] corresponding to 𝑑 and 𝑝 is defined via
    W𝑝(P, P̂) = inf_{𝛾∈Γ(P,P̂)} E𝛾[𝑑(𝑍, 𝑍ˆ)^𝑝]^{1/𝑝}.
Definition 2.18 implies that if 𝑐(𝑧, 𝑧ˆ) = 𝑑(𝑧, 𝑧ˆ)^𝑝, then W𝑝(P, P̂)^𝑝 = OT𝑐(P, P̂). In
the following we use P𝑝(Z) = {P ∈ P(Z) : E_P[𝑑(𝑍, 𝑧ˆ0)^𝑝] < ∞} to denote the
family of all distributions on Z with finite 𝑝-th moment. As 𝑑 is a metric, P 𝑝 (Z) is
independent of the choice of the reference point 𝑧ˆ0 ∈ Z. The 𝑝-Wasserstein distance
constitutes a metric on P 𝑝 (Z). Indeed, it is evident that 𝑊 𝑝 (P, P̂) is symmetric
and vanishes if and only if P = P̂. The proof that 𝑊 𝑝 (P, P̂) obeys the triangle
inequality requires a gluing lemma for transportation plans and is therefore more
intricate; see, e.g., (Villani 2008, § 1). The 𝑝-Wasserstein distance further metrizes
the weak convergence of distributions and the convergence of their 𝑝-th moments.
This means that 𝑊 𝑝 (P, P̂ 𝑁 ) converges to 0 if and only if P̂ 𝑁 converges weakly
to P and E P̂ 𝑁 [𝑑(𝑍, 𝑧ˆ0 ) 𝑝 ] converges to E P [𝑑(𝑍, 𝑧ˆ0 ) 𝑝 ] as 𝑁 grows (Villani 2008,
Theorem 6.9). Furthermore, the 𝑝-Wasserstein distance enjoys attractive measure
concentration properties. Specifically, if P̂ 𝑁 represents the empirical distribution
obtained from 𝑁 independent samples from P, then the rate at which P̂ 𝑁 converges
to P in 𝑝-Wasserstein distance admits sharp asymptotic and finite-sample bounds
(Fournier and Guillin 2015, Weed and Bach 2019).
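For two empirical distributions on the real line with the same number 𝑁 of equally weighted atoms, the 𝑝-Wasserstein distance is particularly easy to compute: it is well known that the monotone (sorted-sample) coupling is optimal, so W𝑝(P, P̂)^𝑝 is the average 𝑝-th power distance between order statistics. The following Python sketch (illustrative samples) verifies this against a brute-force search over all 𝑁! one-to-one couplings, which contain an optimal solution of the coupling problem by the Birkhoff-von Neumann theorem.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(5)
N, p_exp = 5, 2                       # number of atoms and exponent p
x = rng.normal(size=N)                # atoms of P, each with weight 1/N
y = rng.normal(size=N)                # atoms of P̂, each with weight 1/N

# Quantile (sorted-sample) coupling ...
w_p_pow = float(np.mean(np.abs(np.sort(x) - np.sort(y)) ** p_exp))

# ... matches the best of all N! permutation couplings.
brute = min(float(np.mean(np.abs(x - y[list(perm)]) ** p_exp))
            for perm in permutations(range(N)))
assert abs(w_p_pow - brute) < 1e-12

w_p = w_p_pow ** (1 / p_exp)          # the p-Wasserstein distance itself
```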
As the 𝑝-Wasserstein distance constitutes the 𝑝-th root of an optimal transport
discrepancy, Proposition 2.16 and Corollary 2.17 readily imply that it admits a dual
representation. For 𝑝 = 1 this dual representation becomes particularly simple.
Indeed, one can show that the 1-Wasserstein distance coincides with the integral
probability metric generated by all test functions that are Lipschitz continuous with
respect to the metric 𝑑 and have Lipschitz modulus at most 1.
Corollary 2.19 (Kantorovich-Rubinstein Duality). We have
    W1(P, P̂) = sup { ∫_Z 𝑓(𝑧) dP(𝑧) − ∫_Z 𝑓(𝑧ˆ) dP̂(𝑧ˆ) : 𝑓 ∈ L1(P), lip(𝑓) ≤ 1 }.

Proof. Corollary 2.17 implies that


    W1(P, P̂) = sup_{𝑓∈L1(P)} ∫_Z 𝑓(𝑧) dP(𝑧) − ∫_Z 𝑓^𝑐(𝑧ˆ) dP̂(𝑧ˆ).

In addition, it ensures that the supremum does not change if we restrict the search
space to functions that are representable as 𝑓 = 𝑔 𝑐 for some 𝑔 : Z → (−∞, +∞].

By (2.26), we thus have 𝑓(𝑧) = inf_{𝑧ˆ∈Z} 𝑔(𝑧ˆ) + 𝑑(𝑧, 𝑧ˆ). For any fixed 𝑧ˆ ∈ Z, the
auxiliary function 𝑓𝑧ˆ(𝑧) = 𝑔(𝑧ˆ) + 𝑑(𝑧, 𝑧ˆ) is evidently 1-Lipschitz with respect to
the metric 𝑑. As infima of 1-Lipschitz functions remain 1-Lipschitz, we thus find
lip(𝑓) ≤ 1. In summary, we have shown that restricting attention to 1-Lipschitz
functions does not reduce the supremum of the dual optimal transport problem.
Next, we prove that lip( 𝑓 ) ≤ 1 implies that 𝑓 𝑐 = 𝑓 . Indeed, for any 𝑧ˆ ∈ Z we have
    𝑓(𝑧ˆ) ≤ sup_{𝑧∈Z} 𝑓(𝑧) − 𝑑(𝑧, 𝑧ˆ) ≤ sup_{𝑧∈Z} 𝑓(𝑧ˆ) + 𝑑(𝑧, 𝑧ˆ) − 𝑑(𝑧, 𝑧ˆ) = 𝑓(𝑧ˆ),

where the two inequalities hold because 𝑑(ˆ𝑧 , 𝑧ˆ) = 0 and lip( 𝑓 ) ≤ 1, respectively.
This implies via (2.25) that 𝑓 (ˆ𝑧) = sup 𝑧 ∈Z 𝑓 (𝑧) − 𝑑(𝑧, 𝑧ˆ) = 𝑓 𝑐 (ˆ𝑧) for all 𝑧ˆ ∈ Z.
Hence, 𝑓 𝑐 coincides with 𝑓 whenever lip( 𝑓 ) ≤ 1, and thus the claim follows. 
The 𝑝-Wasserstein ambiguity set of radius 𝑟 ≥ 0 around P̂ ∈ P(Z) is defined as
    P = { P ∈ P(Z) : W𝑝(P, P̂) ≤ 𝑟 }.      (2.28)
Pflug, Pichler and Wozabal (2012) study robust portfolio selection problems, where
the uncertainty about the asset return distribution is captured by a 𝑝-Wasserstein
ball. They prove that—as 𝑟 approaches infinity—it becomes optimal to distribute
one’s capital equally among all available assets. Hence, this result reveals that
the popular 1/𝑁-investment strategy widely used in practice (DeMiguel, Garlappi
and Uppal 2009) is optimal under extreme ambiguity. Pflug et al. (2012), Pichler
(2013) and Wozabal (2014) further show that, for a broad range of convex risk
measures, the worst-case portfolio risk across all distributions in a 𝑝-Wasserstein
ball equals the nominal risk under P̂ plus a regularization term that scales with the
Wasserstein radius 𝑟; see also Section 8.3.
The Wasserstein ambiguity set corresponding to 𝑝 = 1 enjoys particular
prominence in DRO. The Kantorovich-Rubinstein duality can be used to construct a
simple upper bound on the worst-case expectation of a Lipschitz continuous loss
function across all distributions in a 1-Wasserstein ball. This upper bound is given
by the sum of the expected loss under the nominal distribution P̂ plus a
regularization term that consists of the Lipschitz modulus of the loss function weighted
by the radius 𝑟 of the ambiguity set. Shafieezadeh-Abadeh, Mohajerin Esfahani
and Kuhn (2015) demonstrate that this upper bound is exact for distributionally
robust logistic regression problems. However, this exactness result extends in fact
to many linear prediction models with convex (Chen and Paschalidis 2018, 2019,
Blanchet, Kang and Murthy 2019b, Shafieezadeh-Abadeh et al. 2019, Wu, Li and
Mao 2022) and even nonconvex loss functions (Gao et al. 2024, Ho-Nguyen and
Wright 2023). More generally, 1-Wasserstein ambiguity sets have found numerous
applications in diverse areas such as two-stage and multi-stage stochastic
programming (Zhao and Guan 2018, Hanasusanto and Kuhn 2018, Duque and Morton
2020, Bertsimas, Shtern and Sturt 2023), chance constrained programming (Chen,
Kuhn and Wiesemann 2024b, Xie 2021, Ho-Nguyen, Kılınç-Karzan, Küçükyavuz
and Lee 2022, Shen and Jiang 2023), inverse optimization (Mohajerin Esfahani,

Shafieezadeh-Abadeh, Hanasusanto and Kuhn 2018), statistical learning (Blanchet,


Glynn, Yan and Zhou 2019a, Zhu, Xie, Zhang, Gao and Xie 2022b), hypothesis
testing (Gao, Xie, Xie and Xu 2018), contextual stochastic optimization (Zhang,
Yang and Gao 2024a), transportation (Sun, Xie and Witten 2023), control
(Cherukuri and Cortés 2019, Yang 2020, Boskos, Cortés and Martínez 2020, Li and
Martínez 2020, Coulson, Lygeros and Dörfler 2021, Aolaritei, Lanzetti, Chen and
Dörfler 2022a, Terpin, Lanzetti, Yardim, Dörfler and Ramponi 2022, Terpin,
Lanzetti and Dörfler 2024), and power systems analysis (Wang, Gao, Wei, Shafie-khah,
Bi and Catalao 2018, Ordoudis, Nguyen, Kuhn and Pinson 2021), among others.
The Wasserstein ambiguity set corresponding to 𝑝 = 2 also enjoys wide
popularity. Before reviewing its various uses, we highlight an interesting connection
between the 2-Wasserstein distance and the Gelbrich distance introduced in
Section 2.1.4 (see Definition 2.1). As pointed out by Gelbrich (1990, Theorem 2.1),
the 2-Wasserstein distance between two probability distributions provides an upper
bound on the Gelbrich distance between their mean-covariance pairs.

Theorem 2.20 (Gelbrich Bound). Assume that Z is equipped with the Euclidean
metric 𝑑(𝑧, 𝑧ˆ) = ‖𝑧 − 𝑧ˆ‖2. For any distributions P, P̂ ∈ P(Z) with finite mean
vectors 𝜇, 𝜇ˆ ∈ R𝑑 and covariance matrices Σ, Σ̂ ∈ S+𝑑, respectively, we have

    W2(P, P̂) ≥ G((𝜇, Σ), (𝜇ˆ, Σ̂)).

Proof. By definition, the squared 2-Wasserstein distance satisfies

    W2²(P, P̂) = inf_{𝛾∈Γ(P,P̂)} ∫_{Z×Z} ‖𝑧 − 𝑧ˆ‖2² d𝛾(𝑧, 𝑧ˆ)

               = inf { ‖𝜇 − 𝜇ˆ‖2² + Tr[Σ + Σ̂ − 2𝐶] : 𝛾 ∈ Γ(P, P̂), 𝐶 ∈ R𝑑×𝑑,
                       ∫_{Z×Z} [𝑧 − 𝜇; 𝑧ˆ − 𝜇ˆ][𝑧 − 𝜇; 𝑧ˆ − 𝜇ˆ]⊤ d𝛾(𝑧, 𝑧ˆ) = [Σ, 𝐶; 𝐶⊤, Σ̂] ⪰ 0 }.

Note that the new decision variable 𝐶 is uniquely determined by the transportation
plan 𝛾, that is, it represents the cross-covariance matrix of 𝑍 and 𝑍ˆ under 𝛾. Thus,
its presence does not enlarge the feasible set. Note also that the linear matrix
inequality in the last expression is redundant because the second-order moment
matrix of 𝛾 is necessarily positive semidefinite. Thus, its presence does not reduce
the feasible set. Finally, note that the integral of the quadratic function

    ‖𝑧 − 𝑧ˆ‖2² = ‖𝜇 − 𝜇ˆ‖2² + ‖𝑧 − 𝜇‖2² + ‖𝑧ˆ − 𝜇ˆ‖2² − 2(𝑧 − 𝜇)⊤(𝑧ˆ − 𝜇ˆ)
                 + 2(𝜇 − 𝜇ˆ)⊤(𝑧 − 𝜇) − 2(𝜇 − 𝜇ˆ)⊤(𝑧ˆ − 𝜇ˆ)

with respect to 𝛾 is uniquely determined by the first- and second-order moments
of 𝛾 and evaluates to ‖𝜇 − 𝜇ˆ‖2² + Tr[Σ + Σ̂ − 2𝐶]. Relaxing the last optimization

problem by removing all constraints that involve 𝛾 then yields

    W2²(P, P̂) ≥ min { ‖𝜇 − 𝜇ˆ‖2² + Tr[Σ + Σ̂ − 2𝐶] : 𝐶 ∈ R𝑑×𝑑, [Σ, 𝐶; 𝐶⊤, Σ̂] ⪰ 0 }.
By Proposition 2.2, the optimal value of the resulting semidefinite program amounts
to G²((𝜇, Σ), (𝜇ˆ, Σ̂)). The claim follows by taking square roots on both sides. 
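Theorem 2.20 can be illustrated numerically on the real line, where both sides of the inequality are available in closed form: for equally weighted empirical distributions, W2 is obtained from the sorted-sample coupling, whereas the Gelbrich distance only depends on the means and standard deviations, with G² = (𝜇 − 𝜇ˆ)² + (𝜎 − 𝜎ˆ)² in one dimension. The following Python sketch uses illustrative samples.

```python
import numpy as np

# Two illustrative empirical distributions on R with N atoms of weight 1/N.
rng = np.random.default_rng(6)
N = 200
x = rng.normal(1.0, 2.0, size=N)
y = rng.normal(0.0, 1.0, size=N)

# Exact W2 between the empirical distributions (quantile coupling on R).
w2 = float(np.sqrt(np.mean((np.sort(x) - np.sort(y)) ** 2)))

# Gelbrich distance between the corresponding mean-variance pairs.
gelbrich = float(np.sqrt((x.mean() - y.mean()) ** 2 + (x.std() - y.std()) ** 2))

assert w2 >= gelbrich - 1e-12         # the Gelbrich bound of Theorem 2.20
```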
The proof of Theorem 2.20 reveals that the squared Gelbrich distance coincides
with the minimum of a relaxed optimal transport problem, which only requires
the marginals of the transportation plan 𝛾 to have the same first- and second-order
moments as P and P̂, respectively. Gelbrich’s inequality may be useful when the
exact 2-Wasserstein distance is inaccessible. Indeed, computing the 2-Wasserstein
distance between a discrete and a continuous distribution is #P-hard already when
the discrete distribution has only two atoms (Taşkesen, Shafieezadeh-Abadeh and
Kuhn 2023a). Computing the 2-Wasserstein distance may even be #P-hard when
both distributions are discrete (Taşkesen, Shafieezadeh-Abadeh, Kuhn and Natara-
jan 2023b). If both P and P̂ are Gaussian, then Gelbrich’s inequality collapses to
an equality. Thus, the 2-Wasserstein distance between two Gaussian distributions
matches the Gelbrich distance between their mean vectors and covariance matrices
(Givens and Shortt 1984, Proposition 7). This classical result, which actually pred-
ates Gelbrich’s inequality, is nowadays recognized as an immediate consequence
of a celebrated optimality condition for optimal transport problems by Brenier
(1991). Using Brenier’s optimality condition, one can prove more generally that if
P̂ is a positive semidefinite affine pushforward of P, that is, if there exists an affine
function 𝑓 (𝑧) = 𝐴𝑧 + 𝑏 with 𝐴 ∈ S+𝑑 and 𝑏 ∈ R𝑑 such that P̂ = P ◦ 𝑓 −1 , then the 2-
Wasserstein distance between P and P̂ matches again the Gelbrich distance between
their mean vectors and covariance matrices (Nguyen et al. 2021, Theorem 2).
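The following Python sketch (our own illustration, not part of the survey) evaluates the Gelbrich distance via its well-known closed form G²((𝜇, Σ), (𝜇̂, Σ̂)) = ‖𝜇 − 𝜇̂‖₂² + Tr[Σ + Σ̂ − 2(Σ̂^{1/2} Σ Σ̂^{1/2})^{1/2}] and checks the bound of Theorem 2.20 numerically. On the real line, the 2-Wasserstein distance between two equally weighted empirical distributions is attained by the sorted (quantile) coupling, so both sides of the inequality can be computed exactly; the helper names below are ours.

```python
import numpy as np

def psd_sqrt(A):
    # Symmetric square root of a PSD matrix via an eigendecomposition.
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

def gelbrich(mu, Sigma, mu_hat, Sigma_hat):
    # Closed form of the Gelbrich distance between (mu, Sigma) and (mu_hat, Sigma_hat).
    S = psd_sqrt(Sigma_hat)
    cross = psd_sqrt(S @ Sigma @ S)
    sq = np.sum((mu - mu_hat) ** 2) + np.trace(Sigma + Sigma_hat - 2.0 * cross)
    return float(np.sqrt(max(sq, 0.0)))

def w2_empirical_1d(x, y):
    # Exact W2 between equally weighted empirical distributions on R:
    # the monotone (sorted) coupling is optimal.
    return float(np.sqrt(np.mean((np.sort(x) - np.sort(y)) ** 2)))

rng = np.random.default_rng(0)
x = rng.exponential(size=2000)            # skewed, non-Gaussian sample
y = rng.normal(2.0, 1.5, size=2000)
w2 = w2_empirical_1d(x, y)
g = gelbrich(np.array([x.mean()]), np.array([[x.var()]]),
             np.array([y.mean()]), np.array([[y.var()]]))
# Theorem 2.20 guarantees w2 >= g for the empirical distributions,
# whose means and variances are exactly x.mean(), x.var(), etc.
```

Since both quantities are computed for the same pair of empirical distributions, the inequality holds exactly up to floating-point error; for Gaussian (or commuting affine pushforward) pairs it would collapse to an equality.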
The 2-Wasserstein ambiguity set has found applications in machine learning
(Sinha, Namkoong and Duchi 2018, Blanchet et al. 2019b, Blanchet, Murthy
and Si 2022b, Blanchet, Murthy and Zhang 2022c), inverse optimization (Mo-
hajerin Esfahani et al. 2018), two-stage stochastic programming (Hanasusanto and
Kuhn 2018), estimation and filtering (Shafieezadeh-Abadeh et al. 2018, Nguyen
et al. 2023, Kargin et al. 2024b), portfolio optimization (Blanchet, Chen and Zhou
2022a, Nguyen et al. 2021) as well as control theory (Al Taha et al. 2023, Hajar et al.
2023, Hakobyan and Yang 2024, Taşkesen et al. 2024, Kargin et al. 2024a,c,d).

2.3.2. Lévy-Prokhorov Ambiguity Sets


The Lévy-Prokhorov distance is one of the most widely used probability metrics
because it metrizes the topology of weak convergence on P(Z). We assume below
that 𝑑(·, ·) is a continuous metric on Z. For any set B ⊆ Z and 𝑟 ≥ 0, we use
B𝑟 = {𝑧 ∈ Z : ∃𝑧′ ∈ B with 𝑑(𝑧, 𝑧′ ) ≤ 𝑟} (2.29)

to denote the 𝑟-neighborhood of B. The dependence of B𝑟 on the metric 𝑑 is


notationally suppressed because 𝑑 is usually obvious from the context. With these
preparations, we are now ready to define the Lévy-Prokhorov distance.
Definition 2.21 (Lévy-Prokhorov Distance). For any metric 𝑑(·, ·) on Z, the Lévy-
Prokhorov distance LP : P(Z) × P(Z) → [0, 1] induced by 𝑑 is defined via

LP(P, P̂) = inf{𝑟 ≥ 0 : P(B) ≤ P̂(B𝑟 ) + 𝑟 for all Borel sets B ⊆ Z},
where B𝑟 is defined in (2.29).
The Lévy-Prokhorov distance is bounded by 1 and vanishes if and only if its
arguments match. In addition, one can easily show that it satisfies the triangle
inequality. However, its defining formula appears to be asymmetric in P and P̂. The next proposition reveals
that the Lévy-Prokhorov distance is closely linked to the theory of optimal transport.
Proposition 2.22 (Strassen (1965)). If the transportation cost function 𝑐𝑟 corres-
ponding to 𝑟 ≥ 0 is defined through 𝑐𝑟 (𝑧, 𝑧ˆ) = 1𝑑(𝑧, 𝑧ˆ )>𝑟 for all 𝑧, 𝑧ˆ ∈ Z, then

LP(P, P̂) = inf{𝑟 ≥ 0 : OT𝑐𝑟 (P, P̂) ≤ 𝑟}.
Proof. Note that 𝑐𝑟 is lower semicontinuous because the metric 𝑑 is continuous by
assumption. By Proposition 2.16, OT𝑐𝑟 (P, P̂) thus admits the dual representation
\[
  \begin{array}{cl}
    \displaystyle \sup_{f \in \mathcal{L}^1(\mathbb{P}),\, g \in \mathcal{L}^1(\hat{\mathbb{P}})}
      & \displaystyle \int_{\mathcal{Z}} f(z) \, \mathrm{d}\mathbb{P}(z) - \int_{\mathcal{Z}} g(\hat{z}) \, \mathrm{d}\hat{\mathbb{P}}(\hat{z}) \\[2ex]
    \mathrm{s.t.} & f(z) - g(\hat{z}) \leq \mathbb{1}_{d(z, \hat{z}) > r} \quad \forall z, \hat{z} \in \mathcal{Z}
  \end{array}
  \tag{2.30}
\]
Here, for any fixed 𝑔, it is optimal to push 𝑓 up such that for all 𝑧 ∈ Z we have
\[
  f(z) = \inf_{\hat{z} \in \mathcal{Z}} g(\hat{z}) + \mathbb{1}_{d(z, \hat{z}) > r}
  \;\Longrightarrow\;
  \inf_{\hat{z} \in \mathcal{Z}} g(\hat{z}) \leq f(z) \leq 1 + \inf_{\hat{z} \in \mathcal{Z}} g(\hat{z}).
  \tag{2.31a}
\]
Also, for any fixed 𝑓, it is optimal to push 𝑔 down such that for all ẑ ∈ Z we have
\[
  g(\hat{z}) = \sup_{z \in \mathcal{Z}} f(z) - \mathbb{1}_{d(z, \hat{z}) > r}
  \;\Longrightarrow\;
  \sup_{z \in \mathcal{Z}} f(z) - 1 \leq g(\hat{z}) \leq \sup_{z \in \mathcal{Z}} f(z).
  \tag{2.31b}
\]
Combining the upper bound on 𝑔(ẑ) in (2.31b) with the upper bound on 𝑓(𝑧)
in (2.31a) further implies that 𝑔(ẑ) ≤ sup_{𝑧∈Z} 𝑓(𝑧) ≤ 1 + inf_{𝑧′∈Z} 𝑔(𝑧′). At optimality,
(2.31a) and (2.31b) must hold simultaneously, and thus we have
\[
  \inf_{z' \in \mathcal{Z}} g(z') \leq f(z) \leq 1 + \inf_{z' \in \mathcal{Z}} g(z')
  \quad \text{and} \quad
  \inf_{z' \in \mathcal{Z}} g(z') \leq g(\hat{z}) \leq 1 + \inf_{z' \in \mathcal{Z}} g(z')
\]

for all 𝑧, 𝑧ˆ ∈ Z. Note that, as both P and P̂ are probability distributions, the objective
function of the dual optimal transport problem (2.30) remains invariant under the
substitutions 𝑓 (𝑧) ← 𝑓 (𝑧) − inf 𝑧 ′ ∈Z 𝑔(𝑧′ ) and 𝑔(ˆ𝑧 ) ← 𝑔(ˆ𝑧 ) − inf 𝑧 ′ ∈Z 𝑔(𝑧′ ). In the
following, we may thus assume without loss of generality that 0 ≤ 𝑓 (𝑧) ≤ 1 for
all 𝑧 ∈ Z and that 0 ≤ 𝑔(ˆ𝑧) ≤ 1 for all 𝑧ˆ ∈ Z.

As 𝑓 and 𝑔 are now normalized to [0, 1], they admit the integral representations
\[
  f(z) = \int_0^1 \mathbb{1}_{f(z) \geq \tau} \, \mathrm{d}\tau \quad \forall z \in \mathcal{Z}
  \qquad \text{and} \qquad
  g(\hat{z}) = \int_0^1 \mathbb{1}_{g(\hat{z}) \geq \tau} \, \mathrm{d}\tau \quad \forall \hat{z} \in \mathcal{Z}.
\]
Next, one can show that 𝑓 and 𝑔 satisfy the constraints in (2.30) if and only if
\[
  \mathbb{1}_{f(z) \geq \tau} - \mathbb{1}_{g(\hat{z}) \geq \tau} \leq \mathbb{1}_{d(z, \hat{z}) > r}
  \quad \forall z, \hat{z} \in \mathcal{Z},\; \forall \tau \in [0, 1].
  \tag{2.32}
\]
Note first that (2.32) is trivially satisfied unless its left hand side evaluates to 1 and its
right hand side evaluates to 0. This happens if and only if 𝑓 (𝑧) ≥ 𝜏 and 𝑔(ˆ𝑧) < 𝜏 for
some 𝜏 ∈ [0, 1] and 𝑧, 𝑧ˆ ∈ Z with 𝑑(𝑧, 𝑧ˆ) ≤ 𝑟. This is impossible, however, because
it implies that 𝑓 (𝑧) − 𝑔(ˆ𝑧) > 0 for some 𝑧, 𝑧ˆ with 𝑑(𝑧, 𝑧ˆ) ≤ 𝑟, thus contradicting the
constraints in (2.30). Hence, the constraints in (2.30) imply (2.32). The converse
implication follows immediately from the integral representations of 𝑓 and 𝑔.
Finally, note that 1 𝑓 (𝑧)≥ 𝜏 and 1𝑔( 𝑧ˆ )≥ 𝜏 are the characteristic functions of the Borel
sets B = {𝑧 ∈ Z : 𝑓 (𝑧) ≥ 𝜏} and C = { 𝑧ˆ ∈ Z : 𝑔(ˆ𝑧 ) ≥ 𝜏}, respectively. Note also
that (2.32) holds if and only if C ⊇ B𝑟 . Recalling their integral representations, we
may thus conclude that the functions 𝑓 and 𝑔 are feasible in (2.30) if and only if
they represent convex combinations of (infinitely many) characteristic functions of
the form 1 𝑧 ∈B and 1 𝑧ˆ ∈C for some Borel sets B and C with C ⊇ B𝑟 . As the objective
function of (2.30) is linear in 𝑓 and 𝑔, its supremum does not change if we restrict
the feasible set to such characteristic functions. Hence, (2.30) reduces to

OT𝑐𝑟 (P, P̂) = sup{P(B) − P̂(C) : B, C ⊆ Z are Borel sets with C ⊇ B𝑟}.
Clearly, it is always optimal to set C = B𝑟 , and thus the claim follows. 
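For finite discrete distributions on the real line, Definition 2.21 can be evaluated by brute force: P(B) only depends on the atoms of P contained in B, so it suffices to scan subsets of supp(P), and the threshold inf{𝑟 ≥ 0 : …} can be located by bisection because the excess sup_B P(B) − P̂(B𝑟) is non-increasing in 𝑟. The sketch below (our own illustration, with hypothetical helper names) also confirms numerically the symmetry of LP implied by Proposition 2.22.

```python
from itertools import combinations

def lp_distance(xs, ps, ys, qs, tol=1e-9):
    """Levy-Prokhorov distance between finite discrete distributions on R,
    computed by brute force directly from Definition 2.21."""
    def excess(r):
        # sup_B P(B) - Phat(B_r), with B ranging over subsets of supp(P)
        worst = 0.0
        for k in range(1, len(xs) + 1):
            for B in combinations(range(len(xs)), k):
                pB = sum(ps[i] for i in B)
                qBr = sum(q for y, q in zip(ys, qs)
                          if any(abs(y - xs[i]) <= r + tol for i in B))
                worst = max(worst, pB - qBr)
        return worst
    # LP = inf { r >= 0 : excess(r) <= r }; excess(.) is non-increasing,
    # so the threshold can be found by bisection.
    lo, hi = 0.0, 1.0 + max(abs(x - y) for x in xs for y in ys)
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if excess(mid) <= mid + tol:
            hi = mid
        else:
            lo = mid
    return hi

d_near = lp_distance([0.0], [1.0], [0.3], [1.0])  # two Dirac distributions
d_far = lp_distance([0.0], [1.0], [5.0], [1.0])   # LP is capped at 1
d_ab = lp_distance([0.0, 1.0], [0.5, 0.5], [0.0], [1.0])
d_ba = lp_distance([0.0], [1.0], [0.0, 1.0], [0.5, 0.5])  # symmetry check
```

For Dirac distributions one can verify by hand that LP(𝛿ₓ, 𝛿ᵧ) = min{1, 𝑑(𝑥, 𝑦)}, which the first two examples reproduce.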

While Proposition 2.22 follows from (Strassen 1965, Theorem 11), the proof
shown here parallels that of (Villani 2003, Theorem 1.27). As a byproduct, Pro-
position 2.22 reveals that the Lévy-Prokhorov distance is symmetric, which is not
evident from its definition. Thus, it constitutes indeed a metric.
The Lévy-Prokhorov ambiguity set of radius 𝑟 ≥ 0 around P̂ ∈ P(Z) is defined as

P = {P ∈ P(Z) : LP(P, P̂) ≤ 𝑟}.
For our purposes, the most important implication of Proposition 2.22 is that P can
be viewed as a special instance of an optimal transport ambiguity set, that is, we have

P = {P ∈ P(Z) : OT𝑐𝑟 (P, P̂) ≤ 𝑟}
for any radius 𝑟 ≥ 0. Lévy-Prokhorov ambiguity sets were first introduced in the
context of chance-constrained programming (Erdoğan and Iyengar 2006). They
also naturally emerge in data-driven decision-making and the training of robust
machine learning models (Pydi and Jog 2021, Bennouna and Van Parys 2023, Ben-
nouna, Lucas and Van Parys 2023). We close this section with a useful corollary,
which follows immediately from the last part of the proof of Proposition 2.22.

Corollary 2.23. If the transportation cost function 𝑐𝑟 corresponding to 𝑟 ≥ 0 is


defined through 𝑐𝑟 (𝑧, 𝑧ˆ) = 1𝑑(𝑧, 𝑧ˆ )>𝑟 for all 𝑧, 𝑧ˆ ∈ Z, then we have

OT𝑐𝑟 (P, P̂) = sup{P(B) − P̂(B𝑟 ) : B ⊆ Z is a Borel set},
where the 𝑟-neighborhood B𝑟 is defined in (2.29).

2.3.3. Total Variation Ambiguity Sets Revisited


In Section 2.2.3 we showed that the total variation distance constitutes an instance
of a 𝜙-divergence; see Proposition 2.11. We can now demonstrate that the total
variation distance is also an instance of an optimal transport discrepancy.
Proposition 2.24. If 𝑐(𝑧, 𝑧ˆ) = 1 𝑧≠ 𝑧ˆ for all 𝑧, 𝑧ˆ ∈ Z, then we have
\[
  \mathrm{TV}(\mathbb{P}, \hat{\mathbb{P}}) = \mathrm{OT}_c(\mathbb{P}, \hat{\mathbb{P}})
  = \inf_{\gamma \in \Gamma(\mathbb{P}, \hat{\mathbb{P}})} \gamma(Z \neq \hat{Z}).
\]

Proof. By Definition 2.10, the total variation distance satisfies
\[
  \mathrm{TV}(\mathbb{P}, \hat{\mathbb{P}})
  = \sup \big\{ |\mathbb{P}(\mathcal{B}) - \hat{\mathbb{P}}(\mathcal{B})| : \mathcal{B} \subseteq \mathcal{Z} \text{ is a Borel set} \big\}
  = \sup \big\{ \mathbb{P}(\mathcal{B}) - \hat{\mathbb{P}}(\mathcal{B}) : \mathcal{B} \subseteq \mathcal{Z} \text{ is a Borel set} \big\}
  = \mathrm{OT}_c(\mathbb{P}, \hat{\mathbb{P}}),
\]
where the second equality holds because the complement of any Borel set is again
a Borel set. The third equality follows from Corollary 2.23 for 𝑟 = 0, which
applies because 𝑐(𝑧, 𝑧ˆ) = 1𝑑(𝑧, 𝑧ˆ )>0 for any (continuous) metric 𝑑 on Z. Since
𝑐(𝑧, ẑ) = 1_{𝑧≠ẑ}, we also have
\[
  \mathrm{OT}_c(\mathbb{P}, \hat{\mathbb{P}})
  = \inf_{\gamma \in \Gamma(\mathbb{P}, \hat{\mathbb{P}})} \mathbb{E}_\gamma\big[ \mathbb{1}_{Z \neq \hat{Z}} \big]
  = \inf_{\gamma \in \Gamma(\mathbb{P}, \hat{\mathbb{P}})} \gamma(Z \neq \hat{Z}).
\]

This observation completes the proof. 


Proposition 2.24 readily implies that any total variation ambiguity set can also
be viewed as a special instance of an optimal transport ambiguity set.
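Proposition 2.24 can be checked numerically for discrete distributions on a common finite support. The sketch below (our own illustration, using scipy's LP solver) solves the optimal transport problem with 0-1 cost as a linear program and compares its value with sup_B P(B) − P̂(B), which for discrete distributions is attained by the set B = {𝑖 : 𝑝ᵢ > 𝑞ᵢ}.

```python
import numpy as np
from scipy.optimize import linprog

# Two discrete distributions on the common support {0, 1, 2, 3}.
p = np.array([0.5, 0.2, 0.2, 0.1])
q = np.array([0.1, 0.4, 0.3, 0.2])
n = len(p)

# Optimal transport with the 0-1 cost c(z, zhat) = 1_{z != zhat}:
# minimize the off-diagonal mass of a coupling gamma with marginals p and q.
cost = (1.0 - np.eye(n)).ravel()
A_eq, b_eq = [], []
for i in range(n):                       # row marginals: sum_j gamma_ij = p_i
    row = np.zeros((n, n)); row[i, :] = 1.0
    A_eq.append(row.ravel()); b_eq.append(p[i])
for j in range(n):                       # column marginals: sum_i gamma_ij = q_j
    col = np.zeros((n, n)); col[:, j] = 1.0
    A_eq.append(col.ravel()); b_eq.append(q[j])
res = linprog(cost, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, None))

tv_ot = res.fun                          # inf_gamma gamma(Z != Zhat)
tv_sup = np.maximum(p - q, 0.0).sum()    # sup_B P(B) - Phat(B)
```

The optimal coupling keeps the mass min{𝑝ᵢ, 𝑞ᵢ} in place and only moves the remainder, so both quantities equal 1 − Σᵢ min{𝑝ᵢ, 𝑞ᵢ}.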

2.3.4. ∞-Wasserstein Ambiguity Sets


Section 2.3.1 focuses exclusively on 𝑝-Wasserstein distances corresponding to finite
exponents 𝑝 ∈ [1, ∞). The ∞-Wasserstein distance requires special treatment.
Definition 2.25 (∞-Wasserstein Distance). The ∞-Wasserstein distance W∞ :
P(Z) × P(Z) → [0, ∞] corresponding to a continuous metric 𝑑(·, ·) on Z is
\[
  \mathrm{W}_\infty(\mathbb{P}, \hat{\mathbb{P}})
  = \inf_{\gamma \in \Gamma(\mathbb{P}, \hat{\mathbb{P}})} \operatorname{ess\,sup}_\gamma \big[ d(Z, \hat{Z}) \big],
  \tag{2.33}
\]
where the essential supremum of 𝑑(𝑍, Ẑ) under 𝛾 is given by
\[
  \operatorname{ess\,sup}_\gamma \big[ d(Z, \hat{Z}) \big]
  = \inf_{\tau \in \mathbb{R}} \big\{ \tau : \gamma\big( d(Z, \hat{Z}) > \tau \big) = 0 \big\}.
\]

Definition 2.25 makes sense because the ∞-Wasserstein distance can be obtained
from the 𝑝-Wasserstein distance in the limit when 𝑝 tends to infinity.

Proposition 2.26 (Givens and Shortt (1984)). For any P, P̂ ∈ P(Z) we have
\[
  \mathrm{W}_\infty(\mathbb{P}, \hat{\mathbb{P}})
  = \lim_{p \to \infty} \mathrm{W}_p(\mathbb{P}, \hat{\mathbb{P}})
  = \sup_{p \geq 1} \mathrm{W}_p(\mathbb{P}, \hat{\mathbb{P}}).
\]

Proof. If 𝑝 ≥ 𝑞 ≥ 1, then 𝑓 (𝑡) = 𝑡 𝑞/ 𝑝 is concave on R+ . This implies that


\[
  \mathrm{W}_p(\mathbb{P}, \hat{\mathbb{P}})
  = \inf_{\gamma \in \Gamma(\mathbb{P}, \hat{\mathbb{P}})} \Big( \mathbb{E}_\gamma\big[ d(Z, \hat{Z})^p \big]^{\frac{q}{p}} \Big)^{\frac{1}{q}}
  \geq \inf_{\gamma \in \Gamma(\mathbb{P}, \hat{\mathbb{P}})} \mathbb{E}_\gamma\big[ d(Z, \hat{Z})^q \big]^{\frac{1}{q}}
  = \mathrm{W}_q(\mathbb{P}, \hat{\mathbb{P}})
\]

thanks to Jensen’s inequality. Hence, W 𝑝 (P, P̂) is non-decreasing in the exponent 𝑝


as long as 𝑝 ∈ [1, ∞). In addition, for any transportation plan 𝛾 ∈ Γ(P, P̂) and
exponent 𝑝 ∈ [1, ∞), the definition of the essential supremum readily implies that
\[
  \mathbb{E}_\gamma\big[ d(Z, \hat{Z})^p \big]^{\frac{1}{p}}
  \leq \operatorname{ess\,sup}_\gamma \big[ d(Z, \hat{Z})^p \big]^{\frac{1}{p}}
  = \operatorname{ess\,sup}_\gamma \big[ d(Z, \hat{Z}) \big].
\]

Minimizing both sides of this inequality across all 𝛾 ∈ Γ(P, P̂) further implies that
W 𝑝 (P, P̂) ≤ W∞ (P, P̂) for all 𝑝 ∈ [1, ∞). In summary, we may thus conclude that
\[
  \lim_{p \to \infty} \mathrm{W}_p(\mathbb{P}, \hat{\mathbb{P}})
  = \sup_{p \geq 1} \mathrm{W}_p(\mathbb{P}, \hat{\mathbb{P}})
  \leq \mathrm{W}_\infty(\mathbb{P}, \hat{\mathbb{P}}).
\]

It remains to be shown that the last inequality holds in fact as an equality. To see
this, fix some tolerance 𝜀 > 0. For any 𝑝 ∈ N, let 𝛾 𝑝 ∈ Γ(P, P̂) be a coupling
with E 𝛾 𝑝 [𝑑(𝑍, 𝑍)ˆ 𝑝 ] 1/ 𝑝 = W 𝑝 (P, P̂). Note that 𝛾 𝑝 exists because, as we will
see in Corollary 3.16 and Proposition 3.3 below, Γ(P, P̂) is weakly compact and
ˆ 𝑝 ] is weakly lower semicontinuous in 𝛾. Next, let {𝛾 𝑝( 𝑗) } 𝑗 ∈N be a
E 𝛾 [𝑑(𝑍, 𝑍)
subsequence that converges weakly to some coupling 𝛾∞ ∈ Γ(P, P̂), which exists
again because Γ(P, P̂) is weakly compact. We proceed by case distinction.
Case 1. If ess sup_{𝛾∞}[𝑑(𝑍, Ẑ)] is finite, define the open set
\[
  \mathcal{B} = \big\{ (z, \hat{z}) \in \mathcal{Z} \times \mathcal{Z} :
  d(z, \hat{z}) > \operatorname{ess\,sup}_{\gamma_\infty}\big[ d(Z, \hat{Z}) \big] - \varepsilon \big\},
\]
and note that 𝛾∞ (B) > 0 by the definition of the essential supremum. We then find
\[
  \begin{aligned}
    \mathrm{W}_{p(j)}(\mathbb{P}, \hat{\mathbb{P}})
    &\geq \bigg( \int_{\mathcal{B}} d(z, \hat{z})^{p(j)} \, \mathrm{d}\gamma_{p(j)}(z, \hat{z}) \bigg)^{\frac{1}{p(j)}} \\
    &\geq \gamma_{p(j)}(\mathcal{B})^{\frac{1}{p(j)}} \Big( \operatorname{ess\,sup}_{\gamma_\infty}\big[ d(Z, \hat{Z}) \big] - \varepsilon \Big) \\
    &\geq \gamma_{p(j)}(\mathcal{B})^{\frac{1}{p(j)}} \big( \mathrm{W}_\infty(\mathbb{P}, \hat{\mathbb{P}}) - \varepsilon \big).
  \end{aligned}
\]
Since B is open and 𝛾 𝑝( 𝑗) converges weakly to 𝛾∞ as 𝑗 grows, the Portmanteau
theorem (Billingsley 2013, Theorem 2.1 (iv)) implies that lim inf 𝑗→∞ 𝛾 𝑝( 𝑗) (B) ≥
𝛾∞ (B) > 0. Thus, 𝛾 𝑝( 𝑗) (B)1/ 𝑝( 𝑗) converges to 1 as 𝑗 grows, and we obtain
\[
  \lim_{p \to \infty} \mathrm{W}_p(\mathbb{P}, \hat{\mathbb{P}}) \geq \mathrm{W}_\infty(\mathbb{P}, \hat{\mathbb{P}}) - \varepsilon.
\]

As this inequality holds for any tolerance 𝜀 > 0, the above reasoning finally implies
that W 𝑝 (P, P̂) converges indeed to W∞ (P, P̂) for large 𝑝.

Case 2. If ess sup_{𝛾∞}[𝑑(𝑍, Ẑ)] = ∞, then we replace ess sup_{𝛾∞}[𝑑(𝑍, Ẑ)] in the
definition of the open set B with an arbitrarily large constant. Proceeding as in
Case 1 eventually reveals that lim_{𝑝→∞} W𝑝(P, P̂) = W∞(P, P̂) = ∞. □
To develop some intuition for Proposition 2.26, consider the optimal transport
problem in the definition of W 𝑝 (P, P̂). If 𝑝 > 1, then the cost 𝑐(𝑧, 𝑧ˆ) = 𝑑(𝑧, 𝑧ˆ) 𝑝
of transporting one unit of probability mass from 𝑧ˆ to 𝑧 grows superlinearly with
the distance 𝑑(𝑧, 𝑧ˆ). Hence, parts of the distribution P̂ that are transported further
under an optimal transportation plan contribute more to W 𝑝 (P, P̂). In addition, as 𝑝
tends to infinity, eventually only the portion of the distribution P̂ that is transported
the furthest has an impact on W∞ (P, P̂). Even more, only the largest transportation
distance matters, whereas the amount of probability mass transported is irrelevant.
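This limiting behavior can be observed numerically. For two equally weighted empirical distributions on the real line, the monotone (sorted) coupling is optimal for every finite order 𝑝 ≥ 1, so W𝑝 and its limit are available in closed form. The sketch below (our own illustration) confirms that W𝑝 is non-decreasing in 𝑝 and approaches the largest transportation distance, as Proposition 2.26 asserts.

```python
import numpy as np

# Two equally weighted empirical distributions on R; the sorted coupling is
# optimal for every p >= 1, hence
#   W_p^p = (1/n) * sum_i |x_(i) - y_(i)|^p   and   W_inf = max_i |x_(i) - y_(i)|.
rng = np.random.default_rng(1)
x = np.sort(rng.normal(0.0, 1.0, 500))
y = np.sort(rng.normal(1.0, 2.0, 500))
diffs = np.abs(x - y)

def w_p(p):
    return float(np.mean(diffs ** p) ** (1.0 / p))

w_inf = float(diffs.max())
ws = [w_p(p) for p in (1, 2, 4, 8, 16, 32, 64)]
# ws is non-decreasing in p and converges to w_inf from below
```

Only the largest matched distance survives in the limit: the mass 1/𝑛 carried by the furthest-moved atom enters W𝑝 through the vanishing factor (1/𝑛)^{1/𝑝}.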
Despite Proposition 2.26, the optimal transport problems in the definitions of
the Wasserstein distances of order 𝑝 < ∞ and of order 𝑝 = ∞ are fundamentally
different. Indeed, if 𝑝 < ∞, then the objective function E𝛾[𝑑(𝑍, Ẑ)^𝑝] of the
optimal transport problem is linear in the transportation plan 𝛾. If 𝑝 = ∞, on the
other hand, then the objective function ess sup_𝛾[𝑑(𝑍, Ẑ)] is not even convex, but
rather quasi-convex, in 𝛾 (Jylhä 2015, Lemma 2.2); see also (Champion, De Pascale
and Juutinen 2008). Thus, ∞-Wasserstein distances require a more subtle treatment.
The next proposition relates the ∞-Wasserstein distance to a standard optimal
transport problem. Therefore, it has computational relevance.
Proposition 2.27. If the transportation cost function 𝑐𝑟 corresponding to 𝑟 ≥ 0 is
defined through 𝑐𝑟 (𝑧, 𝑧ˆ) = 1𝑑(𝑧, 𝑧ˆ )>𝑟 for all 𝑧, 𝑧ˆ ∈ Z, then we have

W∞ (P, P̂) = inf{𝑟 ≥ 0 : OT𝑐𝑟 (P, P̂) ≤ 0}.
Proof. Recall that OT𝑐𝑟 (P, P̂) = inf{E𝛾[𝑐𝑟(𝑍, Ẑ)] : 𝛾 ∈ Γ(P, P̂)}. Note that
the underlying optimal transport problem is solvable because Γ(P, P̂) is weakly
compact and because E𝛾[𝑐𝑟(𝑍, Ẑ)] is weakly lower semicontinuous in 𝛾 thanks to
Corollary 3.16 and Proposition 3.3 below, respectively. Therefore, we have

\[
  \begin{aligned}
    & \inf \big\{ r \geq 0 : \mathrm{OT}_{c_r}(\mathbb{P}, \hat{\mathbb{P}}) \leq 0 \big\} \\
    & \quad = \inf \big\{ r \geq 0 : \exists \gamma \in \Gamma(\mathbb{P}, \hat{\mathbb{P}}) \text{ with } \mathbb{E}_\gamma\big[ c_r(Z, \hat{Z}) \big] = 0 \big\} \\
    & \quad = \inf_{\gamma \in \Gamma(\mathbb{P}, \hat{\mathbb{P}}),\, r \in \mathbb{R}_+} \big\{ r : \gamma\big[ d(Z, \hat{Z}) > r \big] = 0 \big\}
      = \mathrm{W}_\infty(\mathbb{P}, \hat{\mathbb{P}}),
  \end{aligned}
\]

where the first equality holds because OT𝑐𝑟 (P, P̂) is non-negative and because the
underlying optimal transport problem is solvable. The second equality follows
from the definitions of 𝑐𝑟 and the ∞-Wasserstein distance. 
Combining Proposition 2.27 with Corollary 2.23 immediately yields the follow-
ing equivalent characterization of the ∞-Wasserstein distance.
Corollary 2.28 (Givens and Shortt (1984)). The ∞-Wasserstein distance satisfies

W∞ (P, P̂) = inf{𝑟 ≥ 0 : P(B) ≤ P̂(B𝑟 ) for all Borel sets B ⊆ Z},

where the 𝑟-neighborhood B𝑟 is defined in (2.29).


The ∞-Wasserstein ambiguity set of radius 𝑟 ≥ 0 around P̂ ∈ P(Z) is defined as

P = {P ∈ P(Z) : W∞ (P, P̂) ≤ 𝑟}. (2.34)
Proposition 2.27 implies that P coincides with an optimal transport ambiguity set
with transportation cost function 𝑐𝑟 (𝑧, 𝑧ˆ) = 1𝑑(𝑧, 𝑧ˆ )>𝑟 , that is, we have

P = {P ∈ P(Z) : OT𝑐𝑟 (P, P̂) ≤ 0}.
DRO with ∞-Wasserstein ambiguity sets has strong connections to adversarial
machine learning (Gao, Chen and Kleywegt 2017, García Trillos and García Trillos
2022, García Trillos and Murray 2022, García Trillos and Jacobs 2023, Bungert,
García Trillos and Murray 2023, Bungert, Laux and Stinson 2024, Gao et al. 2024,
Pydi and Jog 2024, Frank and Niles-Weed 2024a,b) and kernel density estimation
(Xu, Caramanis and Mannor 2012a). In addition, ∞-Wasserstein ambiguity sets
are used in two- and multi-stage stochastic programming (Xie 2020, Bertsimas,
Shtern and Sturt 2022, Bertsimas et al. 2023), portfolio optimization (Nguyen,
Zhang, Wang, Blanchet, Delage and Ye 2024), and robust learning (Nguyen, Zhang,
Blanchet, Delage and Ye 2020, Wang, Nguyen and Hanasusanto 2024c).

2.4. Other Ambiguity Sets


There exist several ambiguity sets that cannot be classified as moment, 𝜙-divergence
or optimal transport ambiguity sets. In the following we offer a brief overview of
these ambiguity sets without providing extensive mathematical details.

2.4.1. Marginal Ambiguity Sets


Marginal ambiguity sets specify the marginal distributions of multiple subvectors
of 𝑍 without detailing their joint distribution. The simplest example of a marginal
ambiguity set is the Fréchet ambiguity set, which specifies the marginal distribu-
tions of all individual components of 𝑍 but provides no information about their
copula. Thus, the Fréchet ambiguity set is parametrized by 𝑑 marginal cumulative
distribution functions 𝐹𝑖 : R → [0, 1], 𝑖 ∈ [𝑑], and can be represented as

P = {P ∈ P(R𝑑 ) : P(𝑍𝑖 ≤ 𝑧𝑖 ) = 𝐹𝑖 (𝑧𝑖 ) ∀𝑧𝑖 ∈ R, ∀𝑖 ∈ [𝑑]}. (2.35)
Here, 𝐹𝑖 is an arbitrary cumulative distribution function, that is, a right-continuous,
non-decreasing function with lim 𝑧𝑖 →−∞ 𝐹𝑖 (𝑧𝑖 ) = 0 and lim 𝑧𝑖 →+∞ 𝐹𝑖 (𝑧𝑖 ) = 1.
Fréchet ambiguity sets are relevant for probabilistic logic. Imagine that each 𝑍𝑖
represents a binary variable that evaluates to 1 if a certain event occurs and to 0
otherwise, and assume that the probability of each event is known, whereas the joint
distribution of all events is unknown. In this setting, Boole (1854) was interested in
computing bounds on the probability of a composite event encoded by a Boolean
function of the variables 𝑍𝑖 , 𝑖 ∈ [𝑑]. Almost a century later, Fréchet (1935) derived
explicit inequalities for the probabilities of such composite events, which are now

called Fréchet inequalities. Note that these Fréchet inequalities can be obtained by
minimizing or maximizing the probability of the composite event over all distribu-
tions in a Fréchet ambiguity set with Bernoulli marginals. More recently, there has
been growing interest in generalized Fréchet inequalities, which bound the risk of
general (not necessarily Boolean) functions of 𝑍 with respect to all distributions
in a Fréchet ambiguity set with general (not necessarily Bernoulli) marginals. For
example, a wealth of Fréchet inequalities for the risk of a sum of random variables
have emerged in finance and risk management (Rüschendorf 1983, 1991, Embrechts
and Puccetti 2006, Wang and Wang 2011, Wang, Peng and Yang 2013, Puccetti
and Rüschendorf 2013, Van Parys, Goulart and Embrechts 2016a, Blanchet, Lam,
Liu and Wang 2024a). In addition, Natarajan, Song and Teo (2009b) derive sharp
bounds for the worst-case expectation of a piecewise affine function over a Fréchet
ambiguity set. We highlight that Fréchet ambiguity sets are also relevant because
they coincide with the feasible sets of multi-marginal optimal transport problems,
which can sometimes be solved in polynomial time (Pass 2015, Altschuler and
Boix-Adsera 2023, Natarajan, Padmanabhan and Ramachandra 2023).
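In the Bernoulli setting described above, a Fréchet inequality is exactly the optimal value of a small linear program over all joint distributions with the prescribed marginals. The sketch below (our own illustration, using scipy's LP solver and example probabilities 𝑝₁ = 0.7, 𝑝₂ = 0.6) recovers the classical bounds max{0, 𝑝₁ + 𝑝₂ − 1} ≤ P(𝑍₁ = 1, 𝑍₂ = 1) ≤ min{𝑝₁, 𝑝₂}.

```python
import numpy as np
from scipy.optimize import linprog

# Bernoulli marginals P(Z1 = 1) = p1 and P(Z2 = 1) = p2; the joint law on
# {0,1}^2 has weights (q00, q01, q10, q11) constrained only by its marginals.
p1, p2 = 0.7, 0.6
A_eq = np.array([[1.0, 1.0, 1.0, 1.0],    # total probability mass equals 1
                 [0.0, 0.0, 1.0, 1.0],    # q10 + q11 = P(Z1 = 1)
                 [0.0, 1.0, 0.0, 1.0]])   # q01 + q11 = P(Z2 = 1)
b_eq = np.array([1.0, p1, p2])
c = np.array([0.0, 0.0, 0.0, 1.0])        # objective: q11 = P(Z1 = 1, Z2 = 1)

lo = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None)).fun
hi = -linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None)).fun
# Frechet inequalities: max(0, p1 + p2 - 1) <= q11 <= min(p1, p2)
```

Minimizing or maximizing the probability of other Boolean composite events over the same feasible set yields the corresponding Fréchet inequalities in the same way.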
General marginal ambiguity sets specify the marginal distributions of several
(possibly overlapping) subsets of the set {𝑍𝑖 : 𝑖 ∈ [𝑑]} of random variables. How-
ever, checking whether such an ambiguity set is non-empty is NP-complete even if
each 𝑍𝑖 is a Bernoulli random variable and each subset accommodates merely two
elements (Honeyman, Ladner and Yannakakis 1980, Georgakopoulos, Kavvadias
and Papadimitriou 1988). Computing worst-case expectations over marginal am-
biguity sets is thus intractable unless the subsets of random variables with known
marginals are disjoint (Doan and Natarajan 2012) or if the corresponding overlap
graph displays a running intersection property (Doan, Li and Natarajan 2015).
Marginal ambiguity sets are attractive because, given limited statistical data, it
is far easier to estimate low-dimensional marginals than their global dependence
structure. However, even univariate marginals cannot be estimated exactly. For this
reason, several researchers study marginal ambiguity sets that provide only limited
information about the marginals such as bounds on marginal moments or marginal
dispersion measures (Bertsimas et al. 2004, Bertsimas, Natarajan and Teo 2006a,b,
Chen, Sim, Sun and Teo 2010, Mishra, Natarajan, Tao and Teo 2012, Natarajan,
Sim and Uichanco 2018).
A related stream of literature focuses on ambiguity sets under which the ran-
dom variables 𝑍𝑖 , 𝑖 ∈ [𝑑], are independent and governed by ambiguous marginal
distributions. For example, the Hoeffding ambiguity set contains all joint distri-
butions on a box with independent (and completely unknown) marginals, whereas
the Bernstein ambiguity set contains all distributions from within the Hoeffding
ambiguity set subject to marginal moment bounds (Nemirovski and Shapiro 2007,
Hanasusanto et al. 2015a). Bernstein ambiguity sets that constrain the mean
as well as the mean-absolute deviation of each marginal are used to derive safe
tractable approximations for distributionally robust chance constrained programs
(Postek, Ben-Tal, den Hertog and Melenberg 2018), two-stage integer programs

(Postek et al. 2018, Postek, Romeijnders, den Hertog and van der Vlerk 2019), and
queueing systems (Wang, Prasad, Hanasusanto and Hasenbein 2024d).
DRO with marginal ambiguity sets has close connections to submodularity and
to the theory of comonotonicity in risk management (Tchen 1980, Rüschendorf
2013, Bach 2013, 2019, Natarajan et al. 2023, Long, Qi and Zhang 2024). It
has a broad range of diverse applications ranging from discrete choice model-
ing (Natarajan et al. 2009b, Mishra, Natarajan, Padmanabhan, Teo and Li 2014,
Chen, Ma, Natarajan, Simchi-Levi and Yan 2022, Ruan, Li, Murthy and Natara-
jan 2022), queuing theory (van Eekelen, den Hertog and van Leeuwaarden 2022),
transportation (Wang, Chen and Liu 2020, Shehadeh 2023), chance constrained
programming (Xie, Ahmed and Jiang 2022), scheduling (Mak et al. 2015), invent-
ory management (Liu, Chen, Wang and Wang 2024a), the analysis of complex
networks (Chen, Padmanabhan, Lim and Natarajan 2020, Van Leeuwaarden and
Stegehuis 2021, Brugman et al. 2022) to mechanism design (Carroll 2017, Gravin
and Lu 2018, Chen et al. 2024a, Wang, Liu and Zhang 2024b, Wang 2024).
For further details we refer to the comprehensive monograph by Natarajan (2021).

2.4.2. Mixture Ambiguity Sets and Structural Ambiguity Sets


Let Θ ⊆ R𝑚 be a Borel set and P 𝜃 ∈ P(Z) a parametric distribution that is uniquely
determined by 𝜃 ∈ Θ. Assume that P 𝜃 (𝑍 ∈ B) is a Borel measurable function of 𝜃
for every fixed Borel set B ⊆ Z. The parametric distribution family {P 𝜃 : 𝜃 ∈ Θ}
can then be used as a mixture family, which induces the mixture ambiguity set
\[
  \mathcal{P} = \bigg\{ \int_\Theta \mathbb{P}_\theta \, \mathrm{d}\mathbb{Q}(\theta) \,:\, \mathbb{Q} \in \mathcal{P}(\Theta) \bigg\}.
  \tag{2.36}
\]
Thus, P contains all distributions that can be represented as mixtures of the distri-
butions P𝜃, 𝜃 ∈ Θ. Put differently, for every P ∈ P there exists a mixture distribu-
tion Q ∈ P(Θ) with P(𝑍 ∈ B) = ∫_Θ P𝜃(𝑍 ∈ B) dQ(𝜃) for all Borel sets B ⊆ Z. This
tion Q ∈ P(Θ) with P(𝑍 ∈ B) = Θ P 𝜃 (𝑍 ∈ B) dQ(𝜃) for all Borel sets B ⊆ Z. This
construction ensures that P ⊆ P(Z) is convex. For example, if P 𝜃 is a Gaussian
distribution whose mean and covariance matrix are encoded by 𝜃, then P contains
(possibly continuous) mixtures of Gaussians. Mixture ambiguity sets correspond-
ing to compact parameter sets Θ are studied by Lasserre and Weisser (2021), who
develop a semidefinite programming-based hierarchy of increasingly tight inner
approximations for the feasible set of a distributionally robust chance constraint.
Note that P can be viewed as the convex hull of the parametric distribution family
{P 𝜃 : 𝜃 ∈ Θ}. A classical result in convex analysis due to Minkowski asserts that
any compact convex subset of a Euclidean vector space coincides with the convex
hull of its extreme points. Choquet theory (Phelps 1965) seeks similar extreme
point representations for convex compact subsets of topological vector spaces. For
example, if {P 𝜃 : 𝜃 ∈ Θ} is the set of all extreme distributions of a weakly compact
convex ambiguity set P, then (2.36) constitutes a Choquet representation of P.
Families of distributions that share certain structural properties sometimes admit
a Choquet representation of the form (2.36). For example, let P be the family of

all distributions P ∈ P(R𝑑 ) that are point symmetric about the origin. This means
that P(𝑍 ∈ B) = P(−𝑍 ∈ B) for every Borel set B ⊆ R𝑑 . One can then show
that all extreme distributions of P are representable as P𝜃 = ½𝛿+𝜃 + ½𝛿−𝜃 for
some 𝜃 ∈ R𝑑 . Thus, P admits a Choquet representation of the form (2.36). As
another example, let P be the family of all distributions P ∈ P(R𝑑 ) that are 𝛼-
unimodal about the origin for some 𝛼 > 0. This means that 𝑡 𝛼 P(𝑍 ∈ B/𝑡) is
non-decreasing in 𝑡 > 0 for every Borel set B ⊆ R𝑑 . One can then show that
every extreme distribution of P is a distribution P 𝜃 supported on the line segment
from 0 to 𝜃 ∈ R𝑑 with the property that P 𝜃 (k𝑍 k 2 ≤ 𝑡 k𝜃 k 2 ) = 𝑡 𝛼 for all 𝑡 ∈ [0, 1].
Thus, P admits again a Choquet representation of the form (2.36). We remark that
𝑑-unimodal distributions on R𝑑 are also called star-unimodal. One readily verifies
that a distribution with a continuous probability density function is star-unimodal
if and only if the density function is non-increasing along each ray emanating from
the origin. In addition, one can show that the family of all 𝛼-unimodal distributions
converges—in a precise sense—to the family of all possible distributions on R𝑑
as 𝛼 tends to infinity. For more information on structural distribution families and
their Choquet representations we refer to (Dharmadhikari and Joag-Dev 1988).
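The extreme 𝛼-unimodal distributions described above are easy to simulate, which gives a quick sanity check of their defining property. One convenient parametrization (our own, consistent with the stated property) is 𝑍 = 𝑈^{1/𝛼} 𝜃 with 𝑈 uniform on [0, 1]: this distribution is supported on the segment from 0 to 𝜃 and satisfies P(‖𝑍‖₂ ≤ 𝑡‖𝜃‖₂) = 𝑡^𝛼 for all 𝑡 ∈ [0, 1].

```python
import numpy as np

# Monte Carlo check: Z = U^(1/alpha) * theta with U ~ Uniform[0,1] satisfies
# P(||Z||_2 <= t * ||theta||_2) = t^alpha, since ||Z||_2 = U^(1/alpha) ||theta||_2.
rng = np.random.default_rng(7)
alpha = 2.0
theta = np.array([3.0, 4.0])              # ||theta||_2 = 5
U = rng.uniform(size=200_000)
Z = U[:, None] ** (1.0 / alpha) * theta

t = 0.5
freq = float(np.mean(np.linalg.norm(Z, axis=1) <= t * np.linalg.norm(theta)))
# freq should be close to t**alpha = 0.25
```

As 𝛼 grows, 𝑈^{1/𝛼} concentrates near 1, so the extreme distributions approach the Dirac distribution 𝛿_𝜃, consistent with the remark that the 𝛼-unimodal family expands towards all distributions as 𝛼 tends to infinity.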

The moment ambiguity sets of Section 2.1 are known to contain discrete distribu-
tions with only very few atoms; see Section 7. However, uncertainties encountered
in real physical, technical or economic systems are unlikely to follow such discrete
distributions. Instead, they are often expected to be unimodal. Hence, an effective
means to eliminate the pathological discrete distributions from a moment ambiguity
set is to intersect it with the structural ambiguity set of all 𝛼-unimodal distributions
for some 𝛼 > 0. Popescu (2005) combines ideas from Choquet theory and sums-
of-squares polynomial optimization to approximate worst-case expectations over
the resulting intersection ambiguity sets by a hierarchy of increasingly accurate
bounds, each of which is computed by solving a tractable semidefinite program.
Van Parys, Goulart and Kuhn (2016b) and Van Parys, Goulart and Morari (2019)
extend this approach and establish exact semidefinite programming reformula-
tions for the worst-case probability of a polyhedron and the worst-case conditional
value-at-risk of a piecewise linear convex loss function across all 𝛼-unimodal dis-
tributions in a Chebyshev ambiguity set; see also (Hanasusanto, Roitch, Kuhn
and Wiesemann 2015b). Li, Jiang and Mathieu (2019a) demonstrate that these
semidefinite programming reformulations can sometimes be simplified to highly
tractable second-order cone programs. Complementing moment information with
structural information generally leads to less conservative DRO models as Li, Ji-
ang and Mathieu (2016) demonstrate in the context of a power system application.
Lam, Liu and Zhang (2021) consider another basic notion of distributional shape
known as orthounimodality and build a corresponding Choquet representation to
address multivariate extreme event estimation. More recently, Lam, Liu and Sing-
ham (2024) combine Choquet theory with importance sampling and likelihood
ratio techniques for modeling distribution shapes.

2.4.3. Non-Standard 𝜙-Divergence and Optimal Transport Ambiguity Sets


A wealth of non-standard 𝜙-divergences and optimal transport discrepancies have
been proposed to measure the dissimilarity between probability distributions. They
offer great flexibility in designing ambiguity sets with complementary computa-
tional and statistical properties. Non-standard distance measures notably include
smoothed 𝜙-divergences (Zeitouni and Gutman 1991, Yang and Chen 2018, Liu,
Van Parys and Lam 2023) as well as combinations of 𝜙-divergences and op-
timal transport discrepancies (Reid and Williamson 2011, Dupuis and Mao 2022,
Van Parys 2024). In addition, they include coherent Wasserstein distances (Li
and Mao 2022) and Sinkhorn divergences (Wang, Gao and Xie 2021) as well as
divergences based on causal optimal transport (Analui and Pflug 2014, Pflug and
Pichler 2014, Yang, Zhang, Chen, Gao and Hu 2022, Arora and Gao 2022, Jiang
and Obloj 2024), outlier-robust optimal transport (Nietert, Goldfeld and Shafiee
2024a,b), mixed-feature optimal transport (Selvi, Belbasi, Haugh and Wiesemann
2022, Belbasi, Selvi and Wiesemann 2023), cluster-based optimal transport (Wang,
Becker, Van Parys and Stellato 2022), partial optimal transport (Esteban-Pérez and
Morales 2022), sliced optimal transport (Olea, Rush, Velez and Wiesel 2022),
multi-marginal optimal transport (Lau and Liu 2022, García Trillos, Jacobs and
Kim 2023, Rychener, Esteban-Pérez, Morales and Kuhn 2024), and constrained
conditional moment optimal transport (Li et al. 2022, Blanchet, Kuhn, Li and
Taşkesen 2023, Sauldubois and Touzi 2024).

2.4.4. Ambiguity Sets Based on Integral Probability Metrics


Let F be a family of Borel measurable test functions 𝑓 : Z → R such that 𝑓 ∈ F if
and only if − 𝑓 ∈ F. The integral probability metric generated by F is defined via
\[
  \mathcal{D}_{\mathcal{F}}(\mathbb{P}, \hat{\mathbb{P}})
  = \sup_{f \in \mathcal{F}} \int_{\mathcal{Z}} f(z) \, \mathrm{d}\mathbb{P}(z) - \int_{\mathcal{Z}} f(\hat{z}) \, \mathrm{d}\hat{\mathbb{P}}(\hat{z})
\]

for all distributions P, P̂ ∈ P(Z) under which all test functions 𝑓 ∈ F are integrable.
The underlying maximization problem probes how well the test functions can
distinguish P from P̂. By construction, DF constitutes a pseudo-metric, that is, it is
non-negative and symmetric (because F = −F), vanishes if its arguments match,
and satisfies the triangle inequality. In addition, DF becomes a proper metric
if F separates distributions, in which case DF (P, P̂) vanishes only if P = P̂. The
ambiguity set of radius 𝑟 ≥ 0 around P̂ ∈ P(Z) with respect to DF is defined as

P = {P ∈ P(Z) : DF (P, P̂) ≤ 𝑟}.
The proof of Proposition 2.11 reveals that the total variation distance is the in-
tegral probability metric generated by all Borel functions 𝑓 : Z → [−1/2, 1/2];
see (2.16). The Kantorovich-Rubinstein duality established in Corollary 2.19
further shows that the 1-Wasserstein distance is the integral probability metric
generated by all Lipschitz continuous functions 𝑓 : Z → R with lip( 𝑓 ) ≤ 1. In
addition, if H is a reproducing kernel Hilbert space of Borel functions 𝑓 : Z → R

with Hilbert norm k · k H , then the maximum mean discrepancy distance corres-
ponding to H is the integral probability metric generated by the standard unit ball
F = {𝑓 ∈ H : ‖𝑓‖H ≤ 1} in H. Maximum mean discrepancy ambiguity sets
are studied in (Staib and Jegelka 2019, Zhu, Jitkrittum, Diehl and Schölkopf 2020,
2021, Zeng and Lam 2022, Iyengar, Lam and Wang 2022). Husain (2020) un-
covers a deep connection between DRO problems and regularized empirical risk
minimization problems, which holds whenever the ambiguity set is defined via an
integral probability metric.
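For the maximum mean discrepancy, the supremum over the unit ball of the reproducing kernel Hilbert space admits the well-known kernel expansion MMD²(P, P̂) = E[𝑘(𝑋, 𝑋′)] + E[𝑘(𝑌, 𝑌′)] − 2E[𝑘(𝑋, 𝑌)], which is straightforward to estimate from samples. The sketch below (our own illustration with a Gaussian kernel and the standard biased V-statistic) is not tied to any specific library.

```python
import numpy as np

def mmd_sq(x, y, sigma=1.0):
    """Biased (V-statistic) estimate of the squared maximum mean discrepancy
    between one-dimensional samples x and y under a Gaussian kernel."""
    def gram(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * sigma ** 2))
    return float(gram(x, x).mean() + gram(y, y).mean() - 2.0 * gram(x, y).mean())

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, 400)
y = rng.normal(0.5, 1.0, 400)
d_same = mmd_sq(x, x)       # vanishes when the two samples coincide
d_diff = mmd_sq(x, y)       # strictly positive for distinct samples
```

The V-statistic is always non-negative, mirroring the fact that the maximum mean discrepancy is a pseudo-metric on P(Z).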

3. Topological Properties of Ambiguity Sets


A fundamental question of theoretical as well as practical interest is whether nature’s
subproblem in (1.2) is solvable or, in other words, whether the inner supremum
in (1.2) is attained. In this section we will investigate under what conditions the
Weierstrass extreme value theorem applies to nature’s subproblem. That is, we will
develop easily checkable conditions under which the ambiguity set P is weakly
compact and the expected loss E P [ℓ(𝑥, 𝑍)] is weakly lower semicontinuous in P.
Throughout this discussion, we assume that Z is a closed subset of R𝑑 .
A classical result by Baire asserts that a function on the real line is lower
semicontinuous if and only if it can be represented as the pointwise supremum of
a non-decreasing sequence of continuous functions (Baire 1905). Below we will
use the following multivariate generalization of this result.
Lemma 3.1 (Stromberg (2015, p. 132)). A function 𝑓 : Z → (−∞, +∞] is lower
semicontinuous if and only if there is a non-decreasing sequence of continuous
functions 𝑓𝑖 : Z → R, 𝑖 ∈ N, with 𝑓 (𝑧) = sup𝑖∈N 𝑓𝑖 (𝑧) for all 𝑧 ∈ Z.
If 𝑓 is bounded from below, then the continuous functions 𝑓𝑖 can be assumed to be
uniformly bounded. Indeed, if 𝑓 (𝑧) ≥ 0, say, then the continuous function 𝑓𝑖 (𝑧) can
be replaced with the bounded continuous function 𝑓𝑖′ (𝑧) = min{max{ 𝑓𝑖 (𝑧), 0}, 𝑖}.
The sequence 𝑓𝑖′ , 𝑖 ∈ N, is still non-decreasing and converges pointwise to 𝑓 .
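One standard construction behind Lemma 3.1 is the Lipschitz regularization 𝑓𝑖 (𝑧) = inf𝑤 𝑓 (𝑤) + 𝑖 · |𝑧 − 𝑤|, which produces a non-decreasing sequence of continuous functions converging pointwise to any lower semicontinuous 𝑓 bounded below. The following Python sketch checks this behavior numerically for an illustrative step function, with the infimum discretized over a grid (an approximation, not the exact infimum).

```python
def lipschitz_approx(f, grid, i):
    """i-Lipschitz lower approximation f_i(z) = inf_w f(w) + i * |z - w|,
    with the infimum discretized over the given grid."""
    return [min(f(w) + i * abs(z - w) for w in grid) for z in grid]

f = lambda z: 1.0 if z > 0 else 0.0            # lower semicontinuous step function
grid = [k / 100.0 for k in range(-100, 101)]   # grid on [-1, 1]
f1 = lipschitz_approx(f, grid, 1)
f10 = lipschitz_approx(f, grid, 10)
f100 = lipschitz_approx(f, grid, 100)
# the approximations are non-decreasing in i and approach f from below
assert all(a <= b <= c for a, b, c in zip(f1, f10, f100))
assert all(c <= f(z) for c, z in zip(f100, grid))
```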
Definition 3.2 (Weak Convergence of Probability Distributions). A sequence of
probability distributions P 𝑗 ∈ P(Z), 𝑗 ∈ N, converges weakly to P ∈ P(Z) if for
every bounded and continuous function 𝑓 : Z → R we have
lim 𝑗 ∈N E P 𝑗 [ 𝑓 (𝑍)] = E P [ 𝑓 (𝑍)] .

There is a close link between the continuity properties of the expected value
of 𝑓 (𝑍) with respect to the distribution P and the continuity properties of 𝑓 . Recall
that a function 𝐹 : P(Z) → R is weakly continuous if lim𝑖→∞ 𝐹(P𝑖 ) = 𝐹(P) for
every sequence P𝑖 ∈ P(Z), 𝑖 ∈ N, that converges weakly to P. Weak lower and
upper semicontinuity are defined analogously in the obvious way.
Proposition 3.3 (Continuity of Expected Values). If 𝑓 : Z → [−∞, +∞] is lower
semicontinuous and bounded from below, then E P [ 𝑓 (𝑍)] is weakly lower semicontinuous
in P ∈ P(Z). Conversely, if 𝑓 is upper semicontinuous and bounded from
above, then E P [ 𝑓 (𝑍)] is weakly upper semicontinuous in P ∈ P(Z). Finally, if 𝑓 is
continuous and bounded, then E P [ 𝑓 (𝑍)] is weakly continuous in P ∈ P(Z).
Proof. Assume first that 𝑓 is lower semicontinuous and bounded from below. In
the following, we assume without loss of generality that 𝑓 is in fact non-negative.
Then, by Lemma 3.1, there is a non-decreasing sequence of bounded, continuous
and non-negative functions 𝑓𝑖 , 𝑖 ∈ N, with 𝑓 (𝑧) = sup𝑖∈N 𝑓𝑖 (𝑧). If P 𝑗 ∈ P(Z),
𝑗 ∈ N, is any sequence of distributions that converges weakly to P, then we find
 
lim inf 𝑗 ∈N E P 𝑗 [ 𝑓 (𝑍)] = sup 𝑘∈N inf 𝑗 ≥𝑘 E P 𝑗 [ sup𝑖∈N 𝑓𝑖 (𝑍) ]
                        = sup 𝑘∈N inf 𝑗 ≥𝑘 sup𝑖∈N E P 𝑗 [ 𝑓𝑖 (𝑍)]
                        ≥ sup𝑖∈N sup 𝑘∈N inf 𝑗 ≥𝑘 E P 𝑗 [ 𝑓𝑖 (𝑍)]
                        = sup𝑖∈N E P [ 𝑓𝑖 (𝑍)] = E P [ 𝑓 (𝑍)] .

Here, both the second and the last equality follow from the monotone convergence
theorem, which applies because each 𝑓𝑖 is bounded and thus integrable with respect
to any probability distribution and because the 𝑓𝑖 , 𝑖 ∈ N, form a non-decreasing
sequence of non-negative functions. The inequality follows from the interchange
of the supremum over 𝑖 and the infimum over 𝑗, and the third equality holds
because P 𝑗 converges weakly to P and because 𝑓𝑖 is continuous and bounded. This
shows that E P [ 𝑓 (𝑍)] is weakly lower semicontinuous in P.
The proofs of the assertions regarding weak upper semicontinuity and weak
continuity are analogous and therefore omitted for brevity. 
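The semicontinuity in Proposition 3.3 can be strict. The following Python sketch (with illustrative test functions) traces expectations along the weakly convergent sequence P 𝑗 = 𝛿1/𝑗 → 𝛿0 : for a lower semicontinuous function only an inequality survives in the limit, while for a bounded continuous function the limit is exact.

```python
f_lsc = lambda z: 1.0 if z > 0 else 0.0   # lower semicontinuous and bounded
f_cont = lambda z: min(abs(z), 1.0)       # continuous and bounded

# E_{P_j}[f] for P_j = delta_{1/j}, which converges weakly to delta_0
exp_lsc = [f_lsc(1.0 / j) for j in range(1, 1001)]
exp_cont = [f_cont(1.0 / j) for j in range(1, 1001)]

# liminf_j E_{P_j}[f_lsc] = 1 > 0 = E_{delta_0}[f_lsc]: lower semicontinuity only
assert min(exp_lsc) == 1.0 and f_lsc(0.0) == 0.0
# lim_j E_{P_j}[f_cont] = 0 = E_{delta_0}[f_cont]: full weak continuity
assert exp_cont[-1] < 0.01
```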
In the following we equip the family P(Z) of all probability distributions on Z
with the weak topology, which is generated by the open sets
𝑈 𝑓 , 𝛿 = {P ∈ P(Z) : |E P [ 𝑓 (𝑍)]| < 𝛿}
encoded by any continuous bounded function 𝑓 : Z → R and tolerance 𝛿 > 0. The
weak topology on P(Z) is metrized by the Prokhorov metric (Billingsley 2013,
Theorem 6.8), and therefore the notions of sequential compactness and compactness
are equivalent on P(Z); see, e.g., (Munkres 2000, Theorem 28.2).
Definition 3.4 (Tightness). A family P ⊆ P(Z) of distributions is tight if for any
tolerance 𝜀 > 0 there is a compact set C ⊆ Z with P(𝑍 ∉ C) ≤ 𝜀 for all P ∈ P.
A classical result by Prokhorov asserts that a distribution family is weakly
compact if and only if it is tight and weakly closed. Prokhorov’s theorem is the key
tool to show that an ambiguity set is weakly compact. We state it without proof.
Theorem 3.5 (Billingsley (2013, Theorem 5.1)). A family P ⊆ P(Z) of distribu-
tions is weakly compact if and only if it is tight as well as weakly closed.

In the following we revisit the ambiguity sets of Section 2 one by one and
determine under what conditions they are tight, weakly closed and weakly compact.

3.1. Moment Ambiguity Sets


The support-only ambiguity sets arguably form the simplest class of moment am-
biguity sets because they impose no moment conditions at all. In fact, all other
ambiguity sets considered in this paper are subsets of a support-only ambiguity set.
Proposition 3.6 (Support-Only Ambiguity Sets). The set P(Z) of all distributions
supported on Z ⊆ R𝑑 is weakly compact if and only if Z is compact.
Proof. Note first that P(Z) is tight if and only if Z is bounded. Indeed, if Z is
bounded, then it is compact because Z is closed thanks to our blanket assumption.
Given any 𝜀 > 0, we may thus set C = Z, which ensures that P(𝑍 ∉ C) = 0 ≤ 𝜀
for all P ∈ P(Z). Hence, P(Z) is tight. If Z is unbounded, on the other hand,
then P(Z) trivially fails to be tight. Indeed, for any compact set C ⊆ Z, the
complement Z\C is non-empty because C is bounded and Z is not. Hence, there
exists a probability distribution P ∈ P(Z) supported on Z\C such that P(𝑍 ∉ C) = 1.
Next, note that P(Z) is weakly closed if and only if Z is closed. To see this,
assume first that Z is closed, and note that the indicator function 𝛿Z defined
through 𝛿Z (𝑧) = 0 if 𝑧 ∈ Z and 𝛿Z (𝑧) = +∞ otherwise is lower semicontinuous
and bounded below. By Proposition 3.3, E P [𝛿Z (𝑍)] is therefore weakly lower
semicontinuous in P. If P 𝑗 ∈ P(Z), 𝑗 ∈ N, converges weakly to P, we then have
0 = lim inf 𝑗 ∈N E P 𝑗 [𝛿Z (𝑍)] ≥ E P [𝛿Z (𝑍)] ≥ 0,

where the equality holds because P 𝑗 is supported on Z for every 𝑗 ∈ N, and the first
inequality follows from weak lower semicontinuity. This implies that P ∈ P(Z),
and thus P(Z) is weakly closed. Conversely, assume that P(Z) is weakly closed,
and consider a sequence 𝑧 𝑗 ∈ Z, 𝑗 ∈ N, converging to 𝑧. Then, the sequence of
Dirac distributions 𝛿 𝑧 𝑗 , 𝑗 ∈ N, converges weakly to 𝛿 𝑧 , and thus we find
0 = lim inf 𝑗 ∈N E 𝛿𝑧 𝑗 [𝛿Z (𝑍)] ≥ E 𝛿𝑧 [𝛿Z (𝑍)] ≥ 0.

Here, the first inequality holds again because E P [𝛿Z (𝑍)] is weakly lower semicon-
tinuous in P. This implies that E 𝛿𝑧 [𝛿Z (𝑍)] = 0, which holds if and only if 𝑧 ∈ Z.
Thus, Z is closed. Given these insights, the claim follows from Theorem 3.5. 

By using Proposition 3.6, we can now show that a moment ambiguity set of the
form (2.1) is weakly compact whenever the underlying support set Z is compact,
the moment function 𝑓 is continuous and the uncertainty set F is closed.
Proposition 3.7 (Moment Ambiguity Sets). If Z ⊆ R𝑑 is a compact support set,
𝑓 : Z → R𝑚 is a continuous moment function and F ⊆ R𝑚 is a closed uncertainty
set, then the moment ambiguity set P defined in (2.1) is weakly compact.

Proof. As Z is compact, the support-only ambiguity set P(Z) is weakly compact


by virtue of Proposition 3.6. Consequently, P(Z) is tight and weakly closed. This
readily implies that P is tight, as a subset of a tight set remains tight. Proposition 3.3
further implies that E P [ 𝑓 (𝑍)] is weakly continuous in P. As F is closed and as
the pre-image of any closed set under a continuous transformation is closed, we
may conclude that P 𝑓 = {P ∈ P(R𝑑 ) : E P [ 𝑓 (𝑍)] ∈ F } is weakly closed. Hence,
P = P(Z) ∩ P 𝑓 is weakly closed as the intersection of two weakly closed sets.
Given these insights, the claim follows readily from Theorem 3.5. 
The conditions of Proposition 3.7 are only sufficient but not necessary for weak
compactness. The next examples show that moment ambiguity sets can be tight or
weakly compact even if the support set Z or the moment function 𝑓 are unbounded.
Example 3.8 (Markov Ambiguity Sets). The Markov ambiguity set (2.2) fails to be
tight if Z = R𝑑 . For example, if Z = R and 𝜇 = 0, then for every compact set C ⊆ R
there is a constant 𝑅 > 0 such that the two-point distribution P = (1/2) 𝛿 −𝑅 + (1/2) 𝛿 𝑅 is
fully supported on the complement of C. However, the Markov ambiguity set P
becomes tight if Z = R+ and 𝜇 = 1. Indeed, in this case Markov’s inequality
implies that P(𝑍 ∉ C) ≤ 𝜀 for every P ∈ P and 𝜀 > 0 if we define C as the compact
interval [0, 1/𝜀]. Even in this case, however, P fails to be weakly closed. Indeed,
the distributions P𝑖 = 𝑖/(𝑖 + 1) 𝛿0 + 1/(𝑖 + 1) 𝛿 𝑖+1 belong to P for all 𝑖 ∈ N, but their weak
limit P = 𝛿0 is no member of P. If Z is convex, one can extend this reasoning in
the obvious way to show that P is weakly compact if and only if Z is compact.
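Both phenomena from this example can be traced numerically on finite discrete distributions (an illustrative stand-in for the general statement): two-point distributions with mean 0 place their mass outside every bounded set, and the distributions P𝑖 = 𝑖/(𝑖 + 1) 𝛿0 + 1/(𝑖 + 1) 𝛿 𝑖+1 all have mean 1 even though their weak limit 𝛿0 does not.

```python
def mean(dist):
    """Mean of a finite discrete distribution {atom: probability}."""
    return sum(p * z for z, p in dist.items())

# two-point distributions with mean 0 whose mass escapes every bounded set
for R in (10.0, 100.0, 1000.0):
    assert mean({-R: 0.5, R: 0.5}) == 0.0

# on Z = R_+ with mu = 1: members of the Markov ambiguity set
masses = []
for i in range(1, 50):
    P_i = {0.0: i / (i + 1), i + 1.0: 1.0 / (i + 1)}
    assert abs(mean(P_i) - 1.0) < 1e-12
    masses.append(P_i[0.0])
# the mass at 0 tends to 1, so the weak limit is delta_0, which has mean 0
assert masses[-1] > 0.97
```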
The next example shows that Chebyshev ambiguity sets are tight irrespective
of Z. Nevertheless, they are not always weakly compact.
Example 3.9 (Chebyshev Ambiguity Sets). The Chebyshev ambiguity set P defined
in (2.3) is always tight. To see this, assume without loss of generality that 𝜇 = 0
and 𝑀 = 𝐼𝑑 , which can always be enforced by applying an affine coordinate
transformation.
Given any 𝜀 > 0, we can define a compact set C = {𝑧 ∈ Z :
k𝑧k 2 ≤ √(𝑑/𝜀)}. It is then easy to see that any distribution P ∈ P satisfies
P(𝑍 ∉ C) = P( k𝑍 k 2 > √(𝑑/𝜀) ) ≤ E P [k𝑧k 22 ] · 𝜀/𝑑 = 𝜀,

where the inequality holds because the quadratic function 𝑞(𝑧) = k𝑧k 22 · 𝜀/𝑑 ma-
jorizes the characteristic function of Z\C. Hence, P is indeed tight. However, P
is not necessarily weakly closed. To see this, suppose that 𝑑 = 1 and that Z = R.
In this case the distributions P𝑖 = 1/(2𝑖²) 𝛿 −𝑖 + (𝑖² − 1)/𝑖² 𝛿 0 + 1/(2𝑖²) 𝛿 𝑖 have zero mean and unit
variance for all 𝑖 ∈ N. That is, they all belong to P. However, they converge weakly
to P = 𝛿0 , which is not an element of P. Thus, P fails to be weakly compact.
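A quick numerical check of this construction on finite discrete distributions (an illustrative sketch):

```python
def mean_and_variance(dist):
    """First two moments of a finite discrete distribution {atom: probability}."""
    m = sum(p * z for z, p in dist.items())
    v = sum(p * (z - m) ** 2 for z, p in dist.items())
    return m, v

for i in range(2, 100):
    P_i = {-float(i): 1.0 / (2 * i**2),
           0.0: (i**2 - 1.0) / i**2,
           float(i): 1.0 / (2 * i**2)}
    m, v = mean_and_variance(P_i)
    assert abs(m) < 1e-9 and abs(v - 1.0) < 1e-9
    # the mass 1/i^2 at the atoms +-i vanishes, so P_i converges weakly to delta_0
assert mean_and_variance({0.0: 1.0}) == (0.0, 0.0)   # the weak limit loses the variance
```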
The family of all distributions on R𝑑 with bounded 𝑝-th-order moments is always
weakly compact even though ambiguity sets that fix the 𝑝-th-order moments to
prescribed values (e.g., the Chebyshev ambiguity set) may not be weakly compact.

Example 3.10 (𝑝-th-Order Moment Ambiguity Sets). The ambiguity set


P = {P ∈ P(Z) : E P [k𝑍 k 𝑝 ] ≤ 𝑅}
induced by any norm k · k on R𝑑 and two parameters 𝑝, 𝑅 > 0 is weakly compact.
Using a similar reasoning as in Example 3.9, one can show that for any 𝜀 > 0
there exists a compact set, namely C = {𝑧 ∈ Z : k𝑧k ≤ (𝑅/𝜀)1/ 𝑝 }, which satisfies
P(𝑍 ∉ C) ≤ 𝜀. Thus, P is tight. To see that P is also weakly closed, note that
𝑓 (𝑧) = k𝑧k 𝑝 is continuous and bounded below. By Proposition 3.3, the expected
value E P [k𝑍 k 𝑝 ] is therefore weakly lower semicontinuous in P and has weakly
closed sublevel sets. Therefore, P is weakly compact by virtue of Theorem 3.5.

3.2. 𝜙-Divergence Ambiguity Sets


In this section we show that 𝜙-divergence ambiguity sets of the form (2.10) are
weakly compact whenever the entropy function 𝜙 grows superlinearly. Otherwise,
if 𝜙 grows at most linearly, then the corresponding 𝜙-divergence ambiguity sets
generically fail to be weakly compact. Recall that an entropy function 𝜙 in the sense
of Definition 2.4 grows superlinearly if and only if 𝜙∞ (1) = ∞; see also Table 2.1.
Lemma 3.11 (Worst-Case Probability Maps). Let P be the 𝜙-divergence ambiguity
set of radius 𝑟 > 0 around P̂ ∈ P(Z) defined in (2.10), and assume that 𝜙 is
continuous at 1 and that 𝜙∞ (1) = ∞. Then, there is a continuous, concave and
surjective function 𝑝 : [0, 1] → [0, 1] that depends only on 𝜙 and 𝑟 such that
sup P∈P P(𝑍 ∈ B) = 𝑝(P̂(𝑍 ∈ B))

for every Borel set B ⊆ Z.


Proof. The proof is constructive. That is, we define the function 𝑝 through
𝑝(𝑡) = inf 𝜆0 ∈R, 𝜆∈R+ 𝜆 0 + 𝜆𝑟 + 𝑡 · (𝜙∗ ) 𝜋 (1 − 𝜆 0 , 𝜆) + (1 − 𝑡) · (𝜙∗ ) 𝜋 (−𝜆 0 , 𝜆)

for all 𝑡 ∈ [0, 1]. In the remainder we show that 𝑝 satisfies all desired properties.
By construction, 𝑝 depends only on 𝜙 and 𝑟 and coincides with the lower envelope
of infinitely many linear functions in 𝑡. Hence, 𝑝 is concave as well as upper
semicontinuous. By the definition of P and by Theorem 4.15 below, we also have

sup P∈P P(𝑍 ∈ B) = sup P∈P(Z) { E P [1B (𝑍)] : D 𝜙 (P, P̂) ≤ 𝑟 }
                 = inf 𝜆0 ∈R, 𝜆∈R+ 𝜆 0 + 𝜆𝑟 + E P̂ [(𝜙∗ ) 𝜋 (1B (𝑍) − 𝜆 0 , 𝜆)]        (3.1)
                 = 𝑝(P̂(𝑍 ∈ B)),
for any Borel set B, where the last equality follows from the definition of 𝑝. As
the worst-case probability on the left hand side of (3.1) falls within [0, 1] and as
P̂(𝑍 ∈ B) can adopt any value in [0, 1], it is clear that the range of 𝑝 is a subset
of [0, 1]. Next, we show that 𝑝 is continuous. To this end, note that the concavity

and finiteness of 𝑝 on [0, 1] imply via (Rockafellar 1970, Theorem 10.1) that 𝑝 is
continuous on (0, 1). In addition, its upper semicontinuity prevents 𝑝 from jumping
at 0 or at 1. Thus, 𝑝 is indeed continuous throughout [0, 1]. Finally, setting B = ∅
or B = Z in (3.1) shows that 𝑝(0) = 0 and 𝑝(1) = 1, respectively. Consequently,
we may conclude that 𝑝 is surjective. This observation completes the proof. 
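For a concrete instance, consider the Kullback-Leibler divergence with entropy function 𝜙(𝑠) = 𝑠 log 𝑠 − 𝑠 + 1, for which 𝜙∞ (1) = ∞. On a two-point support with P̂(𝑍 ∈ B) = 𝑡, the supremum defining 𝑝(𝑡) can be computed directly by bisection, since the divergence is increasing in P(𝑍 ∈ B) above 𝑡. The following Python sketch is an illustrative numerical scheme (not the dual formula from the proof) and recovers the properties established in Lemma 3.11.

```python
import math

def kl(q, t):
    """KL divergence between Bernoulli(q) and Bernoulli(t), 0 < t < 1."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(q, t) + term(1.0 - q, 1.0 - t)

def worst_case_prob(t, r, tol=1e-10):
    """sup {P(Z in B) : KL(P, Phat) <= r} on a two-point space with Phat(Z in B) = t."""
    if kl(1.0, t) <= r:          # even the Dirac distribution on B is feasible
        return 1.0
    lo, hi = t, 1.0              # kl(q, t) is increasing in q on [t, 1]
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if kl(mid, t) <= r else (lo, mid)
    return lo

ts = [0.1, 0.3, 0.5, 0.7, 0.9]
ps = [worst_case_prob(t, 0.1) for t in ts]
assert all(p >= t for p, t in zip(ps, ts))      # the worst case dominates the nominal
assert all(a <= b for a, b in zip(ps, ps[1:]))  # the map is non-decreasing in t
```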

As P̂ ∈ P, the worst-case probability map 𝑝 from Lemma 3.11 satisfies 𝑝(𝑡) ≥ 𝑡


for all 𝑡 ∈ [0, 1], that is, the worst-case probability is never smaller than the nominal
probability. We remark that the map 𝑝 also emerges in the study of distributionally
robust chance constraints over 𝜙-divergence ambiguity sets with 𝜙∞ (1) = ∞. In-
deed, any such distributionally robust chance constraint with violation probability
𝜀 ∈ (0, 1) is equivalent to a classical chance constraint under the reference distri-
bution P̂ with (smaller) violation probability 𝑝 −1 (𝜀); see (El Ghaoui et al. 2003,
Jiang and Guan 2016, Shapiro 2017). We can now show that divergence ambiguity
sets corresponding to superlinear entropy functions are weakly compact.
Proposition 3.12 (𝜙-Divergence Ambiguity Sets). If 𝜙 is an entropy function with
𝜙∞ (1) = ∞, then the corresponding 𝜙-divergence ambiguity set P defined in (2.10)
is weakly compact for any closed set Z ⊆ R𝑑 , distribution P̂ ∈ P(Z) and 𝑟 ≥ 0.
Proof. We first show that P is tight. To this end, select any 𝜀 ∈ (0, 1), and
define 𝑝 −1 (𝜀) as the unique 𝑡 ∈ (0, 1] satisfying 𝑝(𝑡) = 𝜀, where 𝑝 represents the
worst-case probability map from Lemma 3.11. Note that 𝑝 −1 (𝜀) is well-defined
because 𝑝 is concave and surjective and because 𝑝(0) = 0 and 𝑝(1) = 1. Note also
that 𝑝 −1 (𝜀) ≤ 𝜀 because 𝑝(𝑡) ≥ 𝑡. Next, select a sufficiently large 𝑅 > 0 such
that P̂(k𝑍 k 2 > 𝑅) ≤ 𝑝 −1 (𝜀), and define a compact set C = {𝑧 ∈ Z : k𝑧k 2 ≤ 𝑅}.
Lemma 3.11 applied to B = Z\C then allows us to conclude that
sup P∈P P(𝑍 ∉ C) = 𝑝(P̂(𝑍 ∉ C)) ≤ 𝑝(𝑝 −1 (𝜀)) = 𝜀,

where the inequality follows from the monotonicity of 𝑝 and choice of 𝑅. We have
thus shown that P(𝑍 ∉ C) ≤ 𝜀 for all P ∈ P, and thus P is tight.
It remains to be shown that P is weakly closed. To this end, recall first that P(Z)
is weakly closed because Z is closed; see Proposition 3.6. Next, recall from
Proposition 2.6 that any 𝜙-divergence admits a dual representation of the form
D 𝜙 (P, P̂) = sup 𝑓 ∈F ∫Z 𝑓 (𝑧) dP(𝑧) − ∫Z 𝜙∗ ( 𝑓 (𝑧)) dP̂(𝑧),        (3.2)

where F denotes the family of all bounded Borel functions 𝑓 : Z → dom(𝜙∗ ).


In fact, F can be restricted to the space F 𝑐 of all continuous bounded functions
without reducing the supremum in (3.2). This is a direct consequence of Lusin’s
theorem, which ensures that for any 𝛿 > 0 and 𝑓 ∈ F there exists a compact
set A ⊆ Z with P̂(𝑍 ∉ A) ≤ 𝛿 and a bounded continuous function 𝑓 𝛿 ∈ F 𝑐 that
coincides with 𝑓 on A and satisfies sup 𝑧 ∈Z | 𝑓 𝛿 (𝑧)| ≤ sup 𝑧 ∈Z | 𝑓 (𝑧)| = k 𝑓 k ∞ . As

the convex lower semicontinuous function 𝜙∗ is continuous on its domain, both


𝜙∗𝑙 = inf 𝑠∈dom(𝜙∗ ) {𝜙∗ (𝑠) : |𝑠| ≤ k 𝑓 k ∞ } and 𝜙∗𝑢 = sup 𝑠∈dom(𝜙∗ ) {𝜙∗ (𝑠) : |𝑠| ≤ k 𝑓 k ∞ }

are finite. Therefore, we have


∫Z 𝑓 𝛿 (𝑧) dP(𝑧) − ∫Z 𝜙∗ ( 𝑓 𝛿 (𝑧)) dP̂(𝑧)
    ≥ ∫Z 𝑓 (𝑧) dP(𝑧) − ∫Z 𝜙∗ ( 𝑓 (𝑧)) dP̂(𝑧) − 2k 𝑓 k ∞ P(𝑍 ∉ A) − (𝜙∗𝑢 − 𝜙∗𝑙 ) P̂(𝑍 ∉ A).

As 𝜙∞ (1) = ∞ implies P ≪ P̂ and as P̂(𝑍 ∉ A) ≤ 𝛿, both P(𝑍 ∉ A) and P̂(𝑍 ∉ A)


decay to 0 as 𝛿 is reduced. Thus, the objective function value of 𝑓 𝛿 in problem (3.2)
is asymptotically non-inferior to that of 𝑓 . This confirms that restricting F to F 𝑐
has no impact on the supremum in (3.2). Recall now from Proposition 3.3 that,
for any bounded continuous function 𝑓 ∈ F 𝑐 , the first integral in (3.2) is weakly
continuous in P. Thus, D 𝜙 (P, P̂) is weakly lower semicontinuous in P as a pointwise
supremum of weakly continuous functions. This implies that any sublevel set of the
function 𝑓 (P) = D 𝜙 (P, P̂) is weakly closed. We thus conclude that the divergence
ambiguity set is weakly closed. The claim then follows from Theorem 3.5. 
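The dual representation (3.2) can be verified numerically on a finite support. For the Kullback-Leibler divergence with entropy function 𝜙(𝑠) = 𝑠 log 𝑠 − 𝑠 + 1 one has 𝜙∗ (𝑠) = e^𝑠 − 1, and the supremum is attained at 𝑓 = log(dP/dP̂). The distributions in the following Python sketch are illustrative.

```python
import math

P = [0.2, 0.5, 0.3]       # an illustrative distribution on three atoms
P_hat = [0.4, 0.4, 0.2]   # an illustrative reference distribution

primal = sum(p * math.log(p / q) for p, q in zip(P, P_hat))  # KL divergence

def dual_objective(f):
    """E_P[f(Z)] - E_Phat[phi*(f(Z))] with phi*(s) = exp(s) - 1 (KL case)."""
    return (sum(p * s for p, s in zip(P, f))
            - sum(q * (math.exp(s) - 1.0) for q, s in zip(P_hat, f)))

f_star = [math.log(p / q) for p, q in zip(P, P_hat)]   # the optimal test function
assert abs(dual_objective(f_star) - primal) < 1e-12    # the supremum is attained
assert dual_objective([0.0, 0.0, 0.0]) <= primal       # other functions do worse
```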
The proof of Proposition 3.12 critically relies on the assumption that 𝜙∞ (1) = ∞,
which ensures that the divergence ambiguity set contains only distributions that
are absolutely continuous with respect to P̂. Below we show that if the entropy
function 𝜙 grows at most linearly (that is, if 𝜙∞ (1) < ∞) and Z is unbounded,
then the corresponding divergence ambiguity set fails to be weakly compact. As a
preparation, we first establish an upper bound on any 𝜙-divergence on P(Z)×P(Z).
Lemma 3.13 (Upper Bounds on 𝜙-Divergences). If 𝜙 is an entropy function and
Z ⊆ R𝑑 a closed set, then we have D 𝜙 (P, P̂) ≤ 𝜙(0) + 𝜙∞ (1) for all P, P̂ ∈ P(Z).
This upper bound is attained if P and P̂ are mutually singular, that is, if P ⊥ P̂.
Proof. In the first part of the proof we derive the desired upper bound. To this end,
assume that 𝜙(0) < ∞ and 𝜙∞ (1) < ∞ for otherwise the upper bound is trivially
satisfied. As the entropy function is convex, we then have
𝜙(𝑠) ≤ Δ/(𝑠 + Δ) · 𝜙(0) + 𝑠/(𝑠 + Δ) · 𝜙(𝑠 + Δ) ⇐⇒ 𝜙(𝑠) ≤ 𝜙(0) + 𝑠 · (𝜙(𝑠 + Δ) − 𝜙(0))/(𝑠 + Δ)
for every 𝑠, Δ ≥ 0. Letting Δ tend to infinity, this implies that 𝜙(𝑠) ≤ 𝜙(0) + 𝑠 𝜙∞ (1)
for all 𝑠 ≥ 0. The 𝜙-divergence between any P, P̂ ∈ P(Z) thus satisfies
 
D 𝜙 (P, P̂) = ∫Z (dP̂/d𝜌)(𝑧) 𝜙( (dP/d𝜌)(𝑧) / (dP̂/d𝜌)(𝑧) ) d𝜌(𝑧)
    ≤ ∫Z (dP̂/d𝜌)(𝑧) 𝜙(0) d𝜌(𝑧) + ∫Z (dP/d𝜌)(𝑧) 𝜙∞ (1) d𝜌(𝑧) = 𝜙(0) + 𝜙∞ (1),

where we may assume without loss of generality that the dominating measure 𝜌 ∈
M+ (Z) is given by 𝜌 = P + P̂. This establishes the desired upper bound. It remains
to be shown that this bound is attained even if 𝜙(0) or 𝜙∞ (1) evaluate to infinity.
To this end, suppose that P and P̂ are mutually singular. This means that there exist
disjoint Borel sets B, B̂ ⊆ Z with P(𝑍 ∈ B) = 1 and P̂(𝑍 ∈ B̂) = 1. We thus have
D 𝜙 (P, P̂) = ∫B̂ (dP̂/d𝜌)(𝑧) 𝜙(0) d𝜌(𝑧) + ∫B 0 · 𝜙( (dP/d𝜌)(𝑧) / 0 ) d𝜌(𝑧)
    = 𝜙(0) + ∫B 𝜙∞ ( (dP/d𝜌)(𝑧) ) d𝜌(𝑧) = 𝜙(0) + 𝜙∞ (1).
The first equality holds because (dP/d𝜌)(𝑧) = 0 for 𝜌-almost all 𝑧 ∈ B̂ and (dP̂/d𝜌)(𝑧) = 0
for 𝜌-almost all 𝑧 ∈ B. The second equality follows from the definition of the
perspective function and exploits that the restriction of 𝜌 to B̂ coincides with P̂.
The third equality, finally, holds because the restriction of 𝜌 to B coincides with P.
Note that the upper bound is attained even if 𝜙(0) = ∞ or 𝜙∞ (1) = ∞. 
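For the total variation distance, assuming the normalization 𝜙(𝑠) = |𝑠 − 1|/2 so that 𝜙(0) + 𝜙∞ (1) = 1, the bound of Lemma 3.13 and its attainment under mutual singularity can be checked directly on finite discrete distributions. The following Python sketch is an illustration under this normalization convention.

```python
def tv(P, Q):
    """Total variation distance between finite discrete distributions {atom: prob}."""
    support = set(P) | set(Q)
    return 0.5 * sum(abs(P.get(z, 0.0) - Q.get(z, 0.0)) for z in support)

P = {0: 0.3, 1: 0.7}
Q = {2: 0.5, 3: 0.5}              # disjoint supports: P and Q are mutually singular
assert tv(P, Q) == 1.0            # the upper bound phi(0) + phi_inf(1) = 1 is attained
assert tv(P, P) == 0.0
assert tv(P, {0: 0.5, 1: 0.5}) < 1.0   # overlapping supports stay below the bound
```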
The following example reveals that 𝜙-divergence ambiguity sets fail to be weakly
compact if 𝜙∞ (1) < ∞ and if the set Z without the atoms of P̂ is unbounded.
Example 3.14 (𝜙-Divergence Ambiguity Sets). Consider an entropy function 𝜙
with 𝜙∞ (1) < ∞. By Lemma 3.13, D 𝜙 (P, P̂) is bounded above by 𝑟̄ = 𝜙(0) + 𝜙∞ (1)
for all P, P̂ ∈ P(Z). In addition, let P be the 𝜙-divergence ambiguity set with
center P̂ ∈ P(Z) and radius 𝑟 ∈ (0, 𝑟̄) defined in (2.10). Assume that for every
𝑅 > 0 there exists 𝑧0 ∈ Z with k𝑧0 k 2 ≥ 𝑅 and P̂(𝑍 = 𝑧0 ) = 0. This assumption
holds, for example, whenever Z is unbounded and convex, and it implies that P
fails to be tight. To see this, fix an arbitrary compact set C ⊆ Z, and select
any point 𝑧0 ∈ Z\C with P̂(𝑍 = 𝑧0 ) = 0. Such a point exists by assumption.
Next, consider the distributions P 𝜃 = (1 − 𝜃) P̂ + 𝜃 𝛿 𝑧0 parametrized by 𝜃 ∈ [0, 1].
Note that P̂ and 𝛿 𝑧0 are mutually singular and that 𝑓 (𝜃) = D 𝜙 (P 𝜃 , P̂) is a convex
continuous bijective function from [0, 1] to [0, 𝑟̄]. Set now 𝜀 = (1/2) 𝑓 −1 (𝑟). For
𝜃 = 𝑓 −1 (𝑟), the distribution P 𝜃 satisfies D 𝜙 (P 𝜃 , P̂) = 𝑓 ( 𝑓 −1 (𝑟)) = 𝑟 and thus
belongs to P. In addition, P 𝜃 (𝑍 ∉ C) ≥ 𝑓 −1 (𝑟) > 𝜀 because 𝑧0 ∉ C. Note that 𝜀
is independent of C and 𝑧0 as long as P̂(𝑍 = 𝑧0 ) = 0. As the compact set C was
chosen arbitrarily, this implies that P fails to be tight and weakly compact.
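For the total variation distance (again assuming the normalization 𝜙(𝑠) = |𝑠 − 1|/2, so that the maximal divergence equals 1), the function 𝑓 (𝜃) in this example reduces to 𝑓 (𝜃) = 𝜃, as the following Python sketch confirms on an illustrative discrete reference distribution.

```python
def tv(P, Q):
    """Total variation distance between finite discrete distributions {atom: prob}."""
    support = set(P) | set(Q)
    return 0.5 * sum(abs(P.get(z, 0.0) - Q.get(z, 0.0)) for z in support)

P_hat = {z: 0.25 for z in range(4)}   # reference distribution without an atom at z0
z0 = 100                              # a remote point outside a given compact set
for theta in (0.0, 0.1, 0.25, 0.5, 1.0):
    P_theta = {z: (1.0 - theta) * p for z, p in P_hat.items()}
    P_theta[z0] = theta
    assert abs(tv(P_theta, P_hat) - theta) < 1e-12
# f(theta) = theta maps [0, 1] onto [0, 1], the full range allowed by Lemma 3.13
```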

3.3. Marginal Ambiguity Sets


As a preparation towards exploring the topological properties of optimal transport
ambiguity sets, we first study marginal ambiguity sets. The following proposition
shows that Fréchet ambiguity sets, which prescribe the marginal distributions of
all 𝑑 individual components of 𝑍, are always weakly compact.
Proposition 3.15 (Fréchet Ambiguity Sets). The Fréchet ambiguity set P defined
in (2.35) is weakly compact for any cumulative distribution functions 𝐹𝑖 , 𝑖 ∈ [𝑑].

Proof. We first show that the Fréchet ambiguity set is tight. For any 𝜀 > 0 and
𝑖 ∈ [𝑑], we can set 𝑧𝑖 and 𝑧̄𝑖 to the 𝜀/(2𝑑)-quantile and the (1 − 𝜀/(2𝑑))-quantile
of the distribution function 𝐹𝑖 , respectively. Setting C = ×𝑖∈ [𝑑] [𝑧𝑖 , 𝑧̄𝑖 ] yields
P(𝑍 ∉ C) ≤ Σ𝑖∈ [𝑑] P(𝑍𝑖 ∉ [𝑧𝑖 , 𝑧̄𝑖 ]) = Σ𝑖∈ [𝑑] 𝜀/𝑑 = 𝜀,

where the inequality follows from the union bound. Thus, P is tight. It remains to
be shown that P is weakly closed. Note that the distribution function of 𝑍𝑖 under P
matches 𝐹𝑖 if and only if for every bounded continuous function 𝑓 we have
E P [ 𝑓 (𝑍𝑖 )] = ∫R 𝑓 (𝑧𝑖 ) d𝐹𝑖 (𝑧𝑖 ).
This is true because every Borel distribution on R constitutes a Radon measure. The
set of all P ∈ P(R𝑑 ) satisfying the above equality for any fixed bounded and con-
tinuous function 𝑓 and any fixed index 𝑖 ∈ [𝑑] is weakly closed by Proposition 3.3.
Hence, P is weakly closed because closedness is preserved by intersection. 

It is straightforward to generalize Proposition 3.15 from Fréchet ambiguity sets


to generic marginal ambiguity sets as discussed in Section 2.4.1, which prescribe
multivariate marginal distributions. Details are omitted for brevity.

3.4. Optimal Transport Ambiguity Sets


Recall that Γ(P, P̂) denotes the family of all transportation plans linking the prob-
ability distributions P, P̂ ∈ P(Z). Thus, Γ(P, P̂) contains all joint distributions 𝛾
of 𝑍 and 𝑍ˆ with marginals P and P̂, respectively. The set Γ(P, P̂) appears in the
definition of the optimal transport discrepancy OT𝑐 (P, P̂); see Definition 2.15. The
reasoning in Section 3.3 immediately implies that Γ(P, P̂) is weakly compact be-
cause it constitutes a marginal ambiguity set. This insight is formalized in the
following simple corollary of Proposition 3.15. Its proof is omitted for brevity.
Corollary 3.16 (Transportation Plans). The set of all transportation plans Γ(P, P̂)
with marginal distributions P, P̂ ∈ P(Z) is weakly compact.
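The set Γ(P, P̂) typically contains many transportation plans. The following Python sketch (an illustration on uniform marginals over three atoms) checks that both the independent and the comonotone coupling have the prescribed marginals.

```python
# two couplings of the same uniform marginals on {0, 1, 2}
independent = {(i, j): 1.0 / 9 for i in range(3) for j in range(3)}
comonotone = {(i, i): 1.0 / 3 for i in range(3)}

def marginals(gamma):
    """Marginal distributions of a finite discrete coupling {(i, j): prob}."""
    first, second = {}, {}
    for (i, j), p in gamma.items():
        first[i] = first.get(i, 0.0) + p
        second[j] = second.get(j, 0.0) + p
    return first, second

for gamma in (independent, comonotone):
    m1, m2 = marginals(gamma)
    assert all(abs(m1[i] - 1.0 / 3) < 1e-12 for i in range(3))
    assert all(abs(m2[j] - 1.0 / 3) < 1e-12 for j in range(3))
```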
Corollary 3.16 enables us to show that the optimal transport problem in (2.18) is
solvable as the transportation cost function is assumed to be lower semicontinuous.
Lemma 3.17 (Solvability of Optimal Transport Problems). The infimum in (2.18)
is attained.
Proof. By Corollary 3.16, the set Γ(P, P̂) is weakly compact. In addition, the
transportation cost function 𝑐(𝑧, 𝑧ˆ) is lower semicontinuous and bounded below.
By Proposition 3.3, the expected value E 𝛾 [𝑐(𝑍, 𝑍ˆ )] is therefore weakly lower
semicontinuous in 𝛾. Thus, the optimal transport problem in (2.18) is solvable
thanks to Weierstrass’ theorem, and its infimum is attained. 

Lemma 3.17 allows us to prove that the optimal transport discrepancy OT𝑐 (P, P̂)
constitutes a weakly lower semicontinuous function of its inputs P and P̂.
Lemma 3.18 (Weak Lower Semicontinuity of Optimal Transport Discrepancies).
The optimal transport discrepancy OT𝑐 (P, P̂) is weakly lower semicontinuous
jointly in P and P̂.
Proof. Assume that P 𝑗 and P̂ 𝑗 , 𝑗 ∈ N, converge weakly to P and P̂, respectively,
and define the countable ambiguity sets P = {P 𝑗 } 𝑗 ∈N and P̂ = {P̂ 𝑗 } 𝑗 ∈N . By the
definition of sequential compactness, the weak closures of P and P̂ are weakly
compact. Prokhorov’s theorem (see Theorem 3.5) thus implies that both P and P̂
are tight. Hence, for any 𝜀 > 0 there exist two compact sets C, Cˆ ⊆ R𝑑 with
P 𝑗 (𝑍 ∉ C) ≤ 𝜀/2 and P̂ 𝑗 ( 𝑍ˆ ∉ Cˆ ) ≤ 𝜀/2 ∀ 𝑗 ∈ N.
Whenever 𝛾 ∈ Γ(P 𝑗 , P̂ 𝑗 ) for some 𝑗 ∈ N, we thus have
𝛾( (𝑍, 𝑍ˆ ) ∉ C × Cˆ ) ≤ P 𝑗 (𝑍 ∉ C) + P̂ 𝑗 ( 𝑍ˆ ∉ Cˆ ) ≤ 𝜀.
As C × Cˆ is compact and as 𝜀 was chosen arbitrarily, this reveals that the union
∪ 𝑗 ∈N Γ(P 𝑗 , P̂ 𝑗 )        (3.3)

is tight, which in turn implies via Prokhorov’s theorem that its closure is weakly
compact. Let now 𝛾★𝑗 be an optimal coupling of P 𝑗 and P̂ 𝑗 , which solves prob-
lem (2.18), and which exists thanks to Lemma 3.17. As all these optimal couplings
belong to some weakly compact set (i.e., the weak closure of (3.3)), we may assume
without loss of generality that 𝛾★𝑗, 𝑗 ∈ N, converges weakly to some distribution 𝛾.
Otherwise, we can pass to a subsequence. Clearly, we have 𝛾 ∈ Γ(P, P̂). Letting 𝛾★ denote an
optimal coupling of P and P̂, we then find
lim inf 𝑗→∞ OT𝑐 (P 𝑗 , P̂ 𝑗 ) = lim inf 𝑗→∞ E 𝛾★𝑗 [𝑐(𝑍, 𝑍ˆ )]
    ≥ E 𝛾 [𝑐(𝑍, 𝑍ˆ )] ≥ E 𝛾★ [𝑐(𝑍, 𝑍ˆ )] = OT𝑐 (P, P̂),
where the two equalities follow from the definitions of 𝛾★𝑗 and 𝛾★, respectively.
The first inequality holds because E 𝛾 [𝑐(𝑍, 𝑍ˆ )] is weakly lower semicontinuous in 𝛾
thanks to Proposition 3.3, and the second inequality follows from the suboptimality
of 𝛾 in (2.18). Thus, OT𝑐 (P, P̂) is weakly lower semicontinuous in P and P̂. 
Lemma 3.18 is inspired by (Clément and Desch 2008, Lemma 5.2) and (Yue,
Kuhn and Wiesemann 2022, Theorem 1). Next, we prove that Wasserstein am-
biguity sets are weakly compact. Throughout this discussion we assume that the
metric underlying the transportation cost function is induced by a norm k · k on R𝑑 .
This assumption simplifies our derivations but could be relaxed. Recall that the
𝑝-Wasserstein distance W 𝑝 (P, P̂) for 𝑝 ≥ 1 is the 𝑝-th root of OT𝑐 (P, P̂), where the
transportation cost function is set to 𝑐(𝑧, 𝑧ˆ) = k𝑧 − 𝑧ˆ k 𝑝 ; see Definition 2.18.

Theorem 3.19 (𝑝-Wasserstein Ambiguity Sets). Assume that the metric 𝑑(·, ·) on Z
is induced by some norm k · k on the ambient space R𝑑 . If P̂ ∈ P(Z) has finite 𝑝-th
moments (i.e., E P̂ [k𝑍 k 𝑝 ] < ∞) for some exponent 𝑝 ≥ 1, then the 𝑝-Wasserstein
ambiguity set P defined in (2.28) is weakly compact.
Proof. We first show that all distributions P ∈ P have uniformly bounded 𝑝-th
moments. To this end, set 𝑟ˆ = E P̂ [k𝑍 k 𝑝 ] 1/𝑝 < ∞, and note that any P ∈ P satisfies
E P [k𝑍 k 𝑝 ] 1/𝑝 = W 𝑝 (P, 𝛿0 ) ≤ W 𝑝 (P, P̂) + W 𝑝 (P̂, 𝛿0 ) = W 𝑝 (P, P̂) + E P̂ [k𝑍 k 𝑝 ] 1/𝑝 ≤ 𝑟 + 𝑟ˆ.
Here, the first inequality holds because the 𝑝-Wasserstein distance is a metric and
thus satisfies the triangle inequality, and the second inequality holds because P ∈ P.
We therefore have E P [k𝑍 k 𝑝 ] ≤ (𝑟 + 𝑟ˆ ) 𝑝 for every P ∈ P. In other words, the
Wasserstein ball P is a subset of the 𝑝-th-order moment ambiguity set discussed
in Example 3.10. This implies that P is tight. Note further that P is defined as a
sublevel set of the function 𝑓 (P) = W 𝑝 (P, P̂), which is weakly lower semicontinuous
thanks to Lemma 3.18. Hence, P is weakly closed. 
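On the real line, the 𝑝-Wasserstein distance between two equally weighted empirical distributions is obtained by sorting both samples, since the comonotone coupling is optimal in one dimension. The following Python sketch uses this fact to verify the uniform moment bound from the proof on illustrative samples.

```python
def w_p(xs, ys, p):
    """p-Wasserstein distance between two equally weighted empirical distributions
    on the real line (the optimal coupling matches the sorted samples)."""
    assert len(xs) == len(ys)
    diffs = [abs(a - b) ** p for a, b in zip(sorted(xs), sorted(ys))]
    return (sum(diffs) / len(xs)) ** (1.0 / p)

p = 2
hat = [0.0, 1.0, 2.0, 3.0]              # reference sample
other = [0.5, 1.5, 2.0, 4.0]            # a nearby distribution
r = w_p(hat, other, p)                  # its Wasserstein distance from the reference
r_hat = w_p(hat, [0.0] * len(hat), p)   # = E_Phat[|Z|^p]^(1/p), distance to delta_0
moment = (sum(abs(z) ** p for z in other) / len(other)) ** (1.0 / p)
assert moment <= r + r_hat + 1e-12      # uniform p-th moment bound from the proof
```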
Finally, we prove that the ∞-Wasserstein ambiguity set is always weakly compact.
Corollary 3.20 (∞-Wasserstein Ambiguity Sets). Assume that the metric 𝑑(·, ·)
on Z is induced by some norm k · k on the ambient space R𝑑 . Then, the ∞-
Wasserstein ambiguity set defined in (2.34) is weakly compact for every P̂ ∈ P(Z).
Proof. We first show that P is tight. To this end, select any 𝜀 > 0 and any compact
set Cˆ ⊆ Z with P̂(𝑍 ∉ Cˆ ) ≤ 𝜀. Note that Cˆ is guaranteed to exist because P̂ is a
probability distribution. Next, define C as the 𝑟-neighborhood Cˆ 𝑟 of Cˆ , that is, set
C = {𝑧 ∈ Z : ∃𝑧ˆ ∈ Cˆ with k𝑧 − 𝑧ˆ k ≤ 𝑟 },
see also (2.29). One readily verifies that C inherits compactness from Cˆ . Any
distribution P ∈ P satisfies W∞ (P, P̂) ≤ 𝑟. Consequently, we find
P(𝑍 ∉ C) = P(𝑍 ∈ Z\C) ≤ P̂(𝑍 ∈ Z\Cˆ ) = P̂(𝑍 ∉ Cˆ ) ≤ 𝜀,
where the first inequality follows from Corollary 2.28 and the observation that the
𝑟-neighborhood of Z\C coincides with Z\Cˆ . The second inequality follows from
the definition of Cˆ . As 𝜀 was chosen arbitrarily, P is tight. It remains to be shown
that P is weakly closed. Proposition 2.26 readily implies that W∞ (P, P̂) ≤ 𝑟 if and
only if W 𝑝 (P, P̂) ≤ 𝑟 for all 𝑝 ≥ 1. Thus, we may conclude that
P = ∩ 𝑝≥1 {P ∈ P(R𝑑 ) : W 𝑝 (P, P̂) ≤ 𝑟 }.

That is, the ∞-Wasserstein ambiguity set can be expressed as the intersection of
all 𝑝-Wasserstein ambiguity sets for 𝑝 ≥ 1, all of which are weakly closed by
Theorem 3.19. Hence, P is indeed weakly closed, and the claim follows. 
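The intersection representation used above reflects that W 𝑝 is non-decreasing in 𝑝 and approaches W∞ as 𝑝 → ∞. On the real line with equally weighted empirical distributions, both quantities are available by sorting the samples, which the following Python sketch exploits on illustrative data.

```python
def w_p(xs, ys, p):
    """p-Wasserstein distance between equally weighted 1D empirical distributions."""
    diffs = [abs(a - b) ** p for a, b in zip(sorted(xs), sorted(ys))]
    return (sum(diffs) / len(xs)) ** (1.0 / p)

xs = [0.0, 1.0, 2.0]
ys = [0.5, 1.0, 4.0]
w_inf = max(abs(a - b) for a, b in zip(sorted(xs), sorted(ys)))  # infinity-Wasserstein
values = [w_p(xs, ys, p) for p in (1, 2, 4, 8, 16, 32)]
assert all(a <= b + 1e-9 for a, b in zip(values, values[1:]))    # non-decreasing in p
assert values[-1] <= w_inf and w_inf - values[-1] < 0.1          # approaches W_inf
```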

4. Duality Theory for Worst-Case Expectation Problems


The DRO problem (1.2) is often interpreted as a zero-sum game between the
decision-maker and a fictitious adversary. The decision-maker moves first and thus
selects 𝑥 before seeing P. Therefore, 𝑥 is optimized against all distributions P ∈ P.
In contrast, the adversary moves second and thus selects P after seeing 𝑥. Therefore,
P is only optimized against one particular decision 𝑥 ∈ X . Put differently, the
adversary’s choice may adapt to the decision-maker’s choice but not vice versa.
In this section we develop a duality theory for the adversary’s subproblem, which
aims to maximize the expected loss of a fixed decision 𝑥 across all distributions in
a convex ambiguity set P. To avoid clutter, we suppress the dependence of the loss
function ℓ on the fixed decision 𝑥 throughout this discussion, that is, we write ℓ(𝑧)
instead of ℓ(𝑥, 𝑧). We thus address worst-case expectation problems of the form
sup P∈P E P [ℓ(𝑍)].        (4.1)

Note that P represents a convex subset of the linear space of all finite signed Borel
measures on Z. Unless Z is finite, (4.1) thus constitutes an infinite-dimensional
convex program with a linear objective function. For this problem to be well-
defined, we assume that ℓ : Z → R is a Borel function. In line with (Rockafellar
and Wets 2009, Section 14.E), we define E P [ℓ(𝑍)] = −∞ if E P [max{ℓ(𝑍), 0}] = ∞
and E P [min{ℓ(𝑍), 0}] = −∞. This means that infeasibility trumps unboundedness.
More generally, throughout the rest of the paper, we assume that if the objective
function of a minimization (maximization) problem can be expressed as the dif-
ference of two terms, both of which evaluate to ∞, then the objective function
value should be interpreted as ∞ (−∞). This convention is in line with the rules of
extended arithmetic used in (Rockafellar and Wets 2009).
In the remainder we will show that (4.1) can be dualized by using elementary tools
from finite-dimensional convex analysis (Fenchel 1953, Rockafellar 1970) for a
broad class of finitely-parametrized ambiguity sets including all moment ambiguity
sets (Section 4.2), 𝜙-divergence ambiguity sets (Section 4.3) and optimal transport
ambiguity sets (Section 4.4). We broadly adopt the proof strategies developed
by Shapiro (2001) and Zhang et al. (2024b) for moment and optimal transport
ambiguity sets, respectively, and we extend them to 𝜙-divergence ambiguity sets.

4.1. General Proof Strategy


In order to outline the high-level ideas for dualizing (4.1), we recall a basic result
on the convexity of parametric infima; see, e.g., (Rockafellar 1974, Theorem 1).
Lemma 4.1 (Convexity of Optimal Value Functions). If U and V are arbitrary
real vector spaces and 𝐻 : U × V → R is a convex function, then the optimal value
function ℎ : U → R defined through ℎ(𝑢) = inf 𝑣 ∈V 𝐻(𝑢, 𝑣) is convex.
Proof. Note that ℎ is a convex function if and only if its epigraph epi(ℎ) is a convex

set. By the definitions of the epigraph and the infimum operator, we find
epi(ℎ) = {(𝑢, 𝑡) ∈ U × R : ℎ(𝑢) ≤ 𝑡}
= {(𝑢, 𝑡) ∈ U × R : ∃𝑣 ∈ V with 𝐻(𝑢, 𝑣) ≤ 𝑡 + 𝜀 ∀𝜀 > 0}
= ∩ 𝜀>0 {(𝑢, 𝑡) ∈ U × R : ∃𝑣 ∈ V with 𝐻(𝑢, 𝑣) − 𝜀 ≤ 𝑡}.
Thus, epi(ℎ) can be obtained by projecting ∩ 𝜀>0 epi(𝐻 −𝜀) to U ×R. The claim then
follows because epi(𝐻 − 𝜀) is convex for every 𝜀 > 0 thanks to the convexity of 𝐻
and because convexity is preserved under intersections and linear transformations;
see, e.g., (Rockafellar 1970, Theorems 2.1 & 5.7). 
The following result marks a cornerstone of convex analysis. It states that the
biconjugate ℎ∗∗ (that is, the conjugate of ℎ∗ ) of a closed convex function ℎ coincides
with ℎ. Here, we adopt the standard convention that ℎ is closed if it is lower semi-
continuous and either ℎ(𝑢) > −∞ for all 𝑢 ∈ U or ℎ(𝑢) = −∞ for all 𝑢 ∈ U . We
use cl(ℎ) to denote the closure of ℎ, that is, the largest closed function below ℎ.
Lemma 4.2 (Fenchel–Moreau Theorem). For any convex function ℎ : R𝑑 → R,
we have ℎ ≥ ℎ∗∗ . The inequality becomes an equality on rint(dom(ℎ)).
Proof. By (Rockafellar 1970, Theorem 12.2), we have ℎ∗∗ = cl(ℎ) ≤ ℎ. In addition,
(Rockafellar 1970, Theorem 10.1) ensures that the convex function ℎ is continuous
on rint(dom(ℎ)) and thus coincides with cl(ℎ) there. Hence, the claim follows. 
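The Fenchel–Moreau relation of Lemma 4.2 can likewise be illustrated numerically. The sketch below approximates the conjugate and bi-conjugate of an illustrative closed convex function by grid search and checks that ℎ∗∗ recovers ℎ on a test interval (the function and all grids are our own choices, not from the text):

```python
def h(u):
    # An illustrative closed convex function (our choice, not from the text).
    return abs(u) + u ** 2

grid = [i / 100.0 for i in range(-500, 501)]      # points and slopes in [-5, 5]

# h*(s) = sup_u s*u - h(u), approximated on the grid for every slope s.
conj = [max(s * u - h(u) for u in grid) for s in grid]

def biconj(u):
    # h**(u) = sup_s s*u - h*(s)
    return max(s * u - hs for s, hs in zip(grid, conj))

for i in range(-100, 101):                        # test points in [-1, 1]
    u = i / 100.0
    assert abs(biconj(u) - h(u)) <= 1e-2
print("h** matches h on [-1, 1] up to the grid tolerance")
```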
The main idea for dualizing the worst-case expectation problem (4.1) is to
represent its optimal value as −ℎ(𝑢), where ℎ(𝑢) = inf 𝑣 ∈V 𝐻(𝑢, 𝑣), U is a finite-
dimensional space of parameters 𝑢 that encode the ambiguity set P (such as a set of
prescribed moments or a size parameter), and V is an infinite-dimensional space of
finite signed measures on Z. In addition, 𝐻(𝑢, 𝑣) represents the negative expected
loss if the signed measure 𝑣 happens to be a probability measure in P ⊆ V and
evaluates to ∞ otherwise. If 𝐻(𝑢, 𝑣) is jointly convex in 𝑢 and 𝑣, then ℎ(𝑢) is
convex by virtue of Lemma 4.1. A problem dual to (4.1) can then be constructed
from the bi-conjugate ℎ∗∗ (𝑢). Lemma 4.2 provides conditions for strong duality.

4.2. Moment Ambiguity Sets


Recall from Section 2.1 that the generic moment ambiguity set (2.1) is defined as

P = { P ∈ P 𝑓 (Z) : E P [ 𝑓 (𝑍)] ∈ F },
where Z ⊆ R𝑑 is a closed support set, 𝑓 : Z → R𝑚 is a Borel measurable moment
function, F ⊆ R𝑚 is a closed moment uncertainty set, and P 𝑓 (Z) denotes the
family of all distributions P ∈ P(Z) for which E P [ 𝑓 (𝑍)] is finite.1 We may assume
1 Clearly, E P [ 𝑓 (𝑍)] must be finite to belong to the compact set F. Therefore, we may replace P(Z)
with P 𝑓 (Z) in the definition of P without loss of generality. However, working with P 𝑓 (Z) is
more convenient when we dualize the worst-case expectation problem (4.1) over P.
without loss of generality that F is covered by the convex set

C = { E P [ 𝑓 (𝑍)] : P ∈ P 𝑓 (Z) }
of all possible moments of any distribution on Z. To rule out trivial special cases,
we make the blanket assumption that Z and F are non-empty.
Clearly, problem (4.1) over the moment ambiguity set (2.1) can be recast as

sup_{P∈P} E P [ℓ(𝑍)] = sup_{𝑢∈F} sup_{P∈P 𝑓 (Z)} { E P [ℓ(𝑍)] : E P [ 𝑓 (𝑍)] = 𝑢 } = sup_{𝑢∈F} −ℎ(1, 𝑢),   (4.2)

where the auxiliary function ℎ : R × R𝑚 → R is defined through


ℎ(𝑢0 , 𝑢) = inf_{𝑣∈M 𝑓 ,+ (Z)} { −∫_Z ℓ(𝑧) d𝑣(𝑧) : ∫_Z d𝑣(𝑧) = 𝑢0 , ∫_Z 𝑓 (𝑧) d𝑣(𝑧) = 𝑢 }.   (4.3)
Here, the set M 𝑓 ,+ (Z) stands for the family of all Borel measures 𝑣 ∈ M+ (Z) for
which the integral ∫_Z 𝑓 (𝑧) d𝑣(𝑧) is finite. Put differently, M 𝑓 ,+ (Z) represents the
convex cone generated by P 𝑓 (Z). As the objective and constraint functions of the
minimization problem in (4.3) are all jointly convex in 𝑣, 𝑢0 and 𝑢, Lemma 4.1
implies that ℎ is convex. Under a reasonable regularity condition, one can further
show that the domain of ℎ coincides with the cone generated by {1} × C.
Lemma 4.3 (Domain of ℎ). If E P [ℓ(𝑍)] > −∞ for every P ∈ P 𝑓 (Z), then we have
dom(ℎ) = cone({1} × C).
Proof. It is clear that (𝑢0 , 𝑢) ∈ dom(ℎ) if and only if ℎ(𝑢0 , 𝑢) < ∞, which is the case
if and only if the minimization problem in (4.3) is feasible. Thus, it remains to be
shown that the problem in (4.3) is feasible if and only if (𝑢0 , 𝑢) ∈ cone({1} × C). To
this end, assume first that the problem in (4.3) is feasible at (𝑢0 , 𝑢). This implies that
there is 𝑣 ∈ M 𝑓 ,+ (Z) with ∫_Z d𝑣(𝑧) = 𝑢0 and ∫_Z 𝑓 (𝑧) d𝑣(𝑧) = 𝑢. Hence, 𝑢0 ≥ 0.
If 𝑢0 = 0, then we must have 𝑢 = 0. If 𝑢0 > 0, on the other hand, then 𝑣/𝑢0 must be
a probability measure in P 𝑓 (Z), which implies that 𝑢/𝑢0 ∈ C. In either case, (𝑢0 , 𝑢)
is a non-negative multiple of a point in {1} × C and thus belongs to cone({1} × C).
Next, assume that (𝑢0 , 𝑢) ∈ cone({1} × C). If 𝑢0 = 0, then 𝑢 = 0, and indeed,
the zero measure in M 𝑓 ,+ (Z) is feasible in (4.3). If 𝑢0 > 0, on the other hand,
then 𝑢/𝑢0 ∈ C. By the definition of C, there exists a distribution P ∈ P 𝑓 (Z) with
E P [ 𝑓 (𝑍)] = 𝑢/𝑢0 . As E P [ℓ(𝑍)] > −∞, this implies that 𝑣 = 𝑢0 P is feasible in (4.3).
We have thus shown that (4.3) is feasible if and only if (𝑢0 , 𝑢) ∈ cone({1} × C).
This observation completes the proof. 
The following proposition characterizes the bi-conjugate of ℎ.
Proposition 4.4 (Bi-conjugate of ℎ). The bi-conjugate of ℎ defined in (4.3) satisfies

ℎ∗∗ (𝑢0 , 𝑢) = sup_{𝜆0 ∈R, 𝜆∈R𝑚} { −𝑢0 𝜆 0 − 𝑢⊤ 𝜆 : 𝜆 0 + 𝑓 (𝑧)⊤ 𝜆 ≥ ℓ(𝑧) ∀𝑧 ∈ Z }.

If additionally E P [ℓ(𝑍)] > −∞ for every P ∈ P 𝑓 (Z), then ℎ∗∗ and ℎ match on the
cone generated by {1} × rint(C) except at the origin.
Proof. For any fixed (𝜆 0 , 𝜆) ∈ R × R𝑚 , the convex conjugate of ℎ satisfies


ℎ∗ (−𝜆 0 , −𝜆) = sup_{𝑢0 ∈R, 𝑢∈R𝑚} −𝑢0 𝜆 0 − 𝑢⊤ 𝜆 − ℎ(𝑢0 , 𝑢)

               = sup { −𝑢0 𝜆 0 − 𝑢⊤ 𝜆 + ∫_Z ℓ(𝑧) d𝑣(𝑧) :
                       𝑢0 ∈ R, 𝑢 ∈ R𝑚 , 𝑣 ∈ M 𝑓 ,+ (Z), ∫_Z d𝑣(𝑧) = 𝑢0 , ∫_Z 𝑓 (𝑧) d𝑣(𝑧) = 𝑢 }

               = sup_{𝑣∈M 𝑓 ,+ (Z)} ∫_Z (ℓ(𝑧) − 𝜆 0 − 𝑓 (𝑧)⊤ 𝜆) d𝑣(𝑧)

               = { 0 if ℓ(𝑧) − 𝜆 0 − 𝑓 (𝑧)⊤ 𝜆 ≤ 0 ∀𝑧 ∈ Z,
                 { ∞ otherwise,
where the last equality holds because M 𝑓 ,+ (Z) contains all weighted Dirac meas-
ures on Z. Thus, for any fixed (𝑢0 , 𝑢) ∈ R × R𝑚 , the conjugate of ℎ∗ satisfies
ℎ∗∗ (𝑢0 , 𝑢) = sup_{𝜆0 ∈R, 𝜆∈R𝑚} −𝑢0 𝜆 0 − 𝑢⊤𝜆 − ℎ∗ (−𝜆 0 , −𝜆)
             = sup_{𝜆0 ∈R, 𝜆∈R𝑚} { −𝑢0 𝜆 0 − 𝑢⊤𝜆 : 𝜆 0 + 𝑓 (𝑧)⊤ 𝜆 ≥ ℓ(𝑧) ∀𝑧 ∈ Z }.

This establishes the desired formula for the bi-conjugate of ℎ. Assume now that
E P [ℓ(𝑍)] > −∞ for every P ∈ P 𝑓 (Z). It remains to be shown that ℎ(𝑢0 , 𝑢) =
ℎ∗∗ (𝑢0 , 𝑢) for all (𝑢0 , 𝑢) ≠ (0, 0) in the cone generated by {1} × rint(C). However,
this follows immediately from Lemma 4.2 and the observation that
rint(dom(ℎ)) = rint(cone({1} × C)) = cone({1} × rint(C))\{(0, 0)},
where the two equalities hold because of Lemma 4.3 and (Rockafellar 1970, Co-
rollary 6.8.1), respectively. Therefore, the claim follows. 

Proposition 4.4 implies that ℎ(1, 𝑢) = ℎ∗∗ (1, 𝑢) for all 𝑢 ∈ rint(C). The following
main theorem exploits this relation to convert the maximization problem on the
right hand side of (4.2) to an equivalent dual minimization problem.
Theorem 4.5 (Duality Theory for Moment Ambiguity Sets). If P is the moment
ambiguity set (2.1), then the following weak duality relation holds.
sup_{P∈P} E P [ℓ(𝑍)] ≤ inf { 𝜆 0 + 𝛿F∗ (𝜆) : 𝜆 0 ∈ R, 𝜆 ∈ R𝑚 , 𝜆 0 + 𝑓 (𝑧)⊤ 𝜆 ≥ ℓ(𝑧) ∀𝑧 ∈ Z }.   (4.4)

If E P [ℓ(𝑍)] > −∞ for all P ∈ P 𝑓 (Z) and F ⊆ C is a convex and compact set with
rint(F) ⊆ rint(C), then strong duality holds, that is, (4.4) becomes an equality.
Proof. For ease of exposition, we introduce



L = { (𝜆 0 , 𝜆) ∈ R × R𝑚 : 𝜆 0 + 𝑓 (𝑧)⊤ 𝜆 ≥ ℓ(𝑧) ∀𝑧 ∈ Z }
as a shorthand for the dual feasible set. Using the decomposition (4.2), we find
sup_{P∈P} E P [ℓ(𝑍)] = sup_{𝑢∈F} −ℎ(1, 𝑢) ≤ sup_{𝑢∈F} inf_{(𝜆0 ,𝜆)∈L} 𝜆 0 + 𝑢⊤ 𝜆
                                           ≤ inf_{(𝜆0 ,𝜆)∈L} sup_{𝑢∈F} 𝜆 0 + 𝑢⊤ 𝜆
                                           = inf_{(𝜆0 ,𝜆)∈L} 𝜆 0 + 𝛿F∗ (𝜆).

Here, the first inequality exploits Proposition 4.4 and Lemma 4.2, which ensures
that ℎ ≥ ℎ∗∗ , and the second inequality holds thanks to the max-min inequality.
The last equality follows from the definition of the support function 𝛿F∗ . This

establishes the weak duality relation (4.4). Next, suppose that F is a convex
compact set with rint(F) ⊆ rint(C). Under this additional assumption, we have
sup_{P∈P} E P [ℓ(𝑍)] = sup_{𝑢∈F} −ℎ(1, 𝑢) = sup_{𝑢∈rint(F )} −ℎ(1, 𝑢)
                     = sup_{𝑢∈rint(F )} inf_{(𝜆0 ,𝜆)∈L} 𝜆 0 + 𝑢⊤ 𝜆
                     = sup_{𝑢∈F} inf_{(𝜆0 ,𝜆)∈L} 𝜆 0 + 𝑢⊤ 𝜆 = inf_{(𝜆0 ,𝜆)∈L} 𝜆 0 + 𝛿F∗ (𝜆),

where the first equality exploits (4.2). The second equality follows from two obser-
vations. First, rint(F) is non-empty and convex (Rockafellar 1970, Theorem 6.2).
Second, −ℎ(1, 𝑢) is concave in 𝑢, which ensures that −ℎ(1, 𝑢) cannot jump up on
the boundary of its domain C and—in particular—on the boundary of F ⊆ C.
Taken together, these observations imply that we can restrict F to rint(F) without
reducing the supremum. The third equality follows from Proposition 4.4, which
allows us to replace ℎ with ℎ∗∗ on rint(F) ⊆ rint(C). The fourth equality holds
because −ℎ∗∗ (1, 𝑢) is concave in 𝑢, which allows us to change rint(F) back to F.
Finally, the fifth equality follows from Sion’s minimax theorem (Sion 1958, The-
orem 4.2), which applies because F is convex and compact, L is convex and
𝜆 0 + 𝑢⊤ 𝜆 is biaffine in 𝑢 and (𝜆 0 , 𝜆). Therefore, strong duality holds. 
Theorem 4.5 shows that the worst-case expectation problem (4.1) over the mo-
ment ambiguity set (2.1) admits a semi-infinite dual. Indeed, the dual problem
on the right hand side of (4.4) accommodates finitely many decision variables but
infinitely many constraints parametrized by the uncertainty realizations 𝑧 ∈ Z.
The dual problem can also be interpreted as a robust optimization problem with
uncertainty set Z. Note that we did not assume Z to be convex. In addition, we
emphasize that compactness of F is not a necessary condition for strong duality.
Indeed, strong duality can also be established under Slater-type conditions (Zhen,
Kuhn and Wiesemann 2023). Finally, the condition rint(F) ⊆ rint(C) is equi-
valent to the—seemingly weaker—requirement that F intersects rint(C). Indeed,
if F ∩ rint(C) ≠ ∅, then F is not entirely contained in the relative boundary of C,


which implies via (Rockafellar 1970, Corollary 6.5.2) that rint(F) ⊆ rint(C).
In the remainder of this section, we use Theorem 4.5 to dualize worst-case
expectations problems corresponding to popular classes of moment ambiguity
sets. Recall from Section 2.1.4 that the Chebyshev ambiguity set (2.4) is defined as

P = { P ∈ P2 (Z) : ∃(𝜇, 𝑀) ∈ F with E P [𝑍] = 𝜇, E P [𝑍 𝑍 ⊤ ] = 𝑀 },
where F ⊆ R𝑑 × S+𝑑 is a closed moment uncertainty set, and P2 (Z) denotes the set
of all distributions in P(Z) with finite second moments. Note that P is an instance
of the generic moment ambiguity set (2.1) with moment function 𝑓 (𝑧) = (𝑧, 𝑧𝑧⊤ ).
Theorem 4.6 (Duality Theory for Chebyshev Ambiguity Sets). If P is the Cheby-
shev ambiguity set (2.4), then the following weak duality relation holds.
sup_{P∈P} E P [ℓ(𝑍)] ≤ inf { 𝜆 0 + 𝛿F∗ (𝜆, Λ) : 𝜆 0 ∈ R, 𝜆 ∈ R𝑑 , Λ ∈ S𝑑 , 𝜆 0 + 𝜆⊤ 𝑧 + 𝑧⊤ Λ𝑧 ≥ ℓ(𝑧) ∀𝑧 ∈ Z }.   (4.5)

If E P [ℓ(𝑍)] > −∞ for all P ∈ P2 (Z) and F is a convex compact set with 𝑀 ≻ 𝜇𝜇⊤
for all (𝜇, 𝑀) ∈ rint(F), then strong duality holds, that is, (4.5) becomes an equality.
Theorem 4.6 is a direct corollary of Theorem 4.5. Thus, we omit its proof. Recall
that the Chebyshev ambiguity (2.4) set with uncertain moments encapsulates the
support-only ambiguity set P(Z), the Markov ambiguity set (2.2), and the Cheby-
shev ambiguity set (2.3) with fixed moments as special cases. They are recovered
by setting F = R𝑑 × S𝑑 , F = {𝜇} × S𝑑 and F = {𝜇} × {𝑀 }, respectively. The
following lemma characterizes the support functions of these moment uncertainty
sets in closed form. The proof is elementary and is thus omitted.
Lemma 4.7 (Support Functions of Elementary Sets). The following hold.
(i) If F = R𝑑 × S𝑑 , then 𝛿F∗ (𝜆, Λ) = 𝛿 {(0,0)} (𝜆, Λ).
(ii) If F = {𝜇} × S𝑑 , then 𝛿F∗ (𝜆, Λ) = 𝜆⊤ 𝜇 + 𝛿 {0} (Λ).
(iii) If F = {𝜇} × {𝑀 }, then 𝛿F∗ (𝜆, Λ) = 𝜆⊤ 𝜇 + Tr(Λ𝑀).
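As a concrete illustration of Theorem 4.6 combined with the fixed-moment case of Lemma 4.7, the classical Cantelli bound arises from a quadratic majorant of the indicator loss ℓ(𝑧) = 1{𝑧 ≥ 𝑎}. The sketch below, with illustrative one-dimensional data of our own choosing, verifies dual feasibility and exhibits a two-point distribution that attains the bound:

```python
# One-dimensional toy instance with fixed moments F = {(mu, M)} = {(0, sigma^2)}
# and loss l(z) = 1{z >= a}. The quadratic q(z) = ((z + b) / (a + b))^2
# majorizes l on R for b > 0, so (l0, l1, L) = (b^2, 2b, 1) / (a + b)^2 is
# dual feasible with objective value (sigma^2 + b^2) / (a + b)^2; the choice
# b = sigma^2 / a recovers Cantelli's bound sigma^2 / (sigma^2 + a^2).

sigma, a = 1.0, 2.0
b = sigma ** 2 / a
bound = (sigma ** 2 + b ** 2) / (a + b) ** 2
assert abs(bound - sigma ** 2 / (sigma ** 2 + a ** 2)) <= 1e-12

# Dual feasibility: q majorizes the indicator loss on a test grid.
for i in range(-1000, 1001):
    z = i / 100.0
    assert ((z + b) / (a + b)) ** 2 >= (1.0 if z >= a else 0.0) - 1e-12

# A two-point distribution with mean 0 and second moment sigma^2 that
# attains the bound, certifying strong duality for this instance.
p_hi = sigma ** 2 / (sigma ** 2 + a ** 2)        # mass at z = a
p_lo, z_lo = 1.0 - p_hi, -sigma ** 2 / a         # remaining mass at z_lo
assert abs(p_hi * a + p_lo * z_lo) <= 1e-12                         # mean
assert abs(p_hi * a ** 2 + p_lo * z_lo ** 2 - sigma ** 2) <= 1e-12  # 2nd moment
assert abs(p_hi - bound) <= 1e-12                # P(Z >= a) equals the bound
print("Cantelli bound:", bound)
```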
When combined with Theorem 4.5, Lemma 4.7 immediately leads to duality
theorems for support-only, Markov, and Chebyshev ambiguity sets. For brevity,
we omit the details. In Section 2.1.4, we have also defined the Gelbrich ambiguity
set as a Chebyshev ambiguity set with uncertain moments of the form (2.4) with F
representing the Gelbrich uncertainty set (2.8) defined as
 
F = { (𝜇, 𝑀) ∈ R𝑑 × S+𝑑 : ∃Σ ∈ S+𝑑 with 𝑀 = Σ + 𝜇𝜇⊤ , G((𝜇, Σ), ( 𝜇ˆ , Σ̂)) ≤ 𝑟 },
where G is the Gelbrich distance of Definition 2.1. In the following we derive the
support function 𝛿F∗ of the Gelbrich uncertainty set F.
Lemma 4.8 (Support Function of Gelbrich Uncertainty Sets). Let F be the Gelbrich
uncertainty set (2.8) of radius 𝑟 ≥ 0 around ( 𝜇ˆ , Σ̂) ∈ R𝑑 × S+𝑑 , where G is the
Gelbrich distance of Definition 2.1. For any (𝜆, Λ) ∈ R𝑑 × S𝑑 , we then have
𝛿F∗ (𝜆, Λ) = inf  𝛾 (𝑟 ^2 − ‖ 𝜇ˆ ‖ ^2 − Tr(Σ̂)) + Tr(𝐴) + 𝛼
            s.t. 𝛼, 𝛾 ∈ R+ , 𝐴 ∈ S+𝑑
                 [ 𝛾𝐼 𝑑 − Λ    𝛾 Σ̂^{1/2} ]           [ 𝛾𝐼 𝑑 − Λ        𝛾 𝜇ˆ + 𝜆/2 ]
                 [ 𝛾 Σ̂^{1/2}   𝐴         ] ⪰ 0,      [ (𝛾 𝜇ˆ + 𝜆/2)⊤   𝛼         ] ⪰ 0.
Proof. By Proposition 2.3, which provides a semidefinite representation of the
Gelbrich uncertainty set F, the support function of F satisfies


𝛿F∗ (𝜆, Λ) = sup  𝜇⊤ 𝜆 + Tr(𝑀Λ)
            s.t. 𝜇 ∈ R𝑑 , 𝑀, 𝑈 ∈ S+𝑑 , 𝐶 ∈ R𝑑×𝑑
                 Tr(𝑀 − 2𝜇 𝜇ˆ ⊤ − 2𝐶) ≤ 𝑟 ^2 − ‖ 𝜇ˆ ‖ ^2 − Tr(Σ̂)
                 [ 𝑀 − 𝑈   𝐶 ]           [ 𝑈    𝜇 ]
                 [ 𝐶 ⊤     Σ̂ ] ⪰ 0,      [ 𝜇⊤   1 ] ⪰ 0.
By conic duality (Ben-Tal and Nemirovski 2001, Theorem 1.4.2), the maximization
problem in the above expression admits the dual minimization problem
inf  𝛾 (𝑟 ^2 − ‖ 𝜇ˆ ‖ ^2 − Tr(Σ̂)) + Tr(Σ̂ 𝐴22 ) + 𝛼
s.t. 𝛼, 𝛾 ∈ R+ , 𝐴11 , 𝐴22 , 𝐵 ∈ S+𝑑
     [ 𝐴11    𝛾𝐼 𝑑 ]           [ 𝐵               𝛾 𝜇ˆ + 𝜆/2 ]
     [ 𝛾𝐼 𝑑   𝐴22  ] ⪰ 0,      [ (𝛾 𝜇ˆ + 𝜆/2)⊤   𝛼         ] ⪰ 0,      𝛾𝐼 𝑑 − Λ ⪰ 𝐴11 ⪰ 𝐵.
Strong duality holds because 𝛼 = ‖2𝛾 𝜇ˆ + 𝜆‖ ^2 , 𝛾 = max{𝜆 max (Λ), 0} + 4, 𝐴11 = 2𝐼,
𝐴22 = 𝛾 ^2 𝐼 and 𝐵 = 𝐼 represents a Slater point for the dual problem. At optimality,
we have 𝛾𝐼 𝑑 − Λ = 𝐴11 = 𝐵. Hence, the dual problem can be further simplified to
inf  𝛾 (𝑟 ^2 − ‖ 𝜇ˆ ‖ ^2 − Tr(Σ̂)) + Tr(Σ̂ 𝐴22 ) + 𝛼
s.t. 𝛼, 𝛾 ∈ R+ , 𝐴22 ∈ S+𝑑
     [ 𝛾𝐼 𝑑 − Λ   𝛾𝐼 𝑑 ]           [ 𝛾𝐼 𝑑 − Λ        𝛾 𝜇ˆ + 𝜆/2 ]
     [ 𝛾𝐼 𝑑       𝐴22  ] ⪰ 0,      [ (𝛾 𝜇ˆ + 𝜆/2)⊤   𝛼         ] ⪰ 0.
The substitution 𝐴 ← Σ̂^{1/2} 𝐴22 Σ̂^{1/2} and the equivalence
[ 𝛾𝐼 𝑑 − Λ   𝛾𝐼 𝑑 ]              [ 𝐼 𝑑   0       ] [ 𝛾𝐼 𝑑 − Λ   𝛾𝐼 𝑑 ] [ 𝐼 𝑑   0       ]
[ 𝛾𝐼 𝑑       𝐴22  ] ⪰ 0   ⇐⇒    [ 0     Σ̂^{1/2} ] [ 𝛾𝐼 𝑑       𝐴22  ] [ 0     Σ̂^{1/2} ] ⪰ 0
then yield the desired semidefinite program. Thus, the optimal value of this semi-
definite program equals indeed 𝛿F∗ (𝜆, Λ). 
Armed with Theorem 4.6 and Lemma 4.8, we are now prepared to dualize the
worst-case expectation problem over a Gelbrich ambiguity set.
Theorem 4.9 (Duality Theory for Gelbrich Ambiguity Sets). If P is the Chebyshev
ambiguity set (2.4) with F representing the Gelbrich uncertainty (2.8), then the
following weak duality relation holds.
sup_{P∈P} E P [ℓ(𝑍)] ≤ inf  𝜆 0 + 𝛾 (𝑟 ^2 − ‖ 𝜇ˆ ‖ ^2 − Tr(Σ̂)) + Tr(𝐴) + 𝛼
                      s.t. 𝜆 0 ∈ R, 𝛼, 𝛾 ∈ R+ , 𝜆 ∈ R𝑑 , Λ ∈ S𝑑 , 𝐴 ∈ S+𝑑
                           𝜆 0 + 𝜆⊤ 𝑧 + 𝑧⊤ Λ𝑧 ≥ ℓ(𝑧) ∀𝑧 ∈ Z                         (4.6)
                           [ 𝛾𝐼 𝑑 − Λ    𝛾 Σ̂^{1/2} ]           [ 𝛾𝐼 𝑑 − Λ        𝛾 𝜇ˆ + 𝜆/2 ]
                           [ 𝛾 Σ̂^{1/2}   𝐴         ] ⪰ 0,      [ (𝛾 𝜇ˆ + 𝜆/2)⊤   𝛼         ] ⪰ 0.
If E P [ℓ(𝑍)] > −∞ for all P ∈ P2 (Z) and 𝑟 > 0, then strong duality holds, that is,
the inequality (4.6) becomes an equality.
Proof. Weak duality follows immediately from the first claim of Theorem 4.6 and
Lemma 4.8. To prove strong duality, recall from Proposition 2.3 that the Gelbrich
uncertainty set F is convex and compact. In addition, recall from the proof of
Proposition 2.2 that the Gelbrich distance is continuous. As 𝑟 > 0, this implies that

rint(F) = { (𝜇, 𝑀) ∈ R𝑑 × S+𝑑 : 𝑀 ≻ 𝜇𝜇⊤ , G((𝜇, 𝑀 − 𝜇𝜇⊤ ), ( 𝜇ˆ , Σ̂)) < 𝑟 },
which in turn ensures that 𝑀 ≻ 𝜇𝜇⊤ for all (𝜇, 𝑀) ∈ rint(F). Therefore, strong
duality follows from the second claim of Theorem 4.6. 
We close this section with some historical remarks. The classical problem of
moments asks whether there exists a distribution on Z with a given sequence of
moments. In the language of this survey, the problem of moments thus seeks to
determine whether a given moment ambiguity set of the form (2.1) is non-empty,
where 𝑓 is a polynomial and F is a singleton. The analysis of moment problems
has a long and distinguished history in mathematics dating back to the 19th century.
Notable contributions were made by Chebyshev (1874), Markov (1884), Stieltjes
(1894), Hamburger (1920) and Hausdorff (1923); see (Shohat and Tamarkin 1950)
for an early survey. The study of moment problems with tools from mathematical
optimization—in particular semi-infinite duality theory—was pioneered by Isii
(1960, 1962). Shapiro (2001) formulates the worst-case expectation problem over
a family of distributions with prescribed moments as an infinite-dimensional conic
linear program and establishes conditions for strong duality.

4.3. 𝜙-Divergence Ambiguity Sets


Recall from Section 2.2 that the 𝜙-divergence ambiguity set (2.10) is defined as

P = { P ∈ P(Z) : D 𝜙 (P, P̂) ≤ 𝑟 }.
Here, Z is a closed support set, 𝑟 ≥ 0 is a size parameter, 𝜙 is an entropy function
in the sense of Definition 2.4, D 𝜙 is the corresponding 𝜙-divergence in the sense
of Definition 2.5, and P̂ ∈ P(Z) is a reference distribution. It is expedient to
extend D 𝜙 to arbitrary measures. By slight abuse of notation, we thus define the


𝜙-divergence of 𝑣 ∈ M+ (Z) with respect to 𝑣ˆ ∈ M+ (Z) as

D 𝜙 (𝑣, 𝑣ˆ ) = ∫_Z 𝜙 𝜋 ( d𝑣/d𝜌 (𝑧), d𝑣ˆ /d𝜌 (𝑧) ) d𝜌(𝑧),
where 𝜌 ∈ M+ (Z) is a dominating measure with 𝑣, 𝑣ˆ ≪ 𝜌. An obvious generaliz-
ation of Proposition 2.6 implies that D 𝜙 (𝑣, 𝑣ˆ ) is convex in (𝑣, 𝑣ˆ ) and independent of
the choice of 𝜌. By using the extension of D 𝜙 to general measures, the worst-case
expectation problem (4.1) over the ambiguity set (2.10) can now be recast as
sup_{P∈P} E P [ℓ(𝑍)] = −ℎ(1, 𝑟),

where the auxiliary function ℎ : R2 → R is defined through


ℎ(𝑢0 , 𝑢) = inf_{𝑣∈M+ (Z)} { −∫_Z ℓ(𝑧) d𝑣(𝑧) : ∫_Z d𝑣(𝑧) = 𝑢0 , D 𝜙 (𝑣, P̂) ≤ 𝑢 }.   (4.7)
As the objective and constraint functions of the minimization problem in (4.7) are
jointly convex in 𝑣, 𝑢0 and 𝑢, Lemma 4.1 implies that ℎ is convex. Clearly, we
have dom(ℎ) ⊆ R2+ . Under mild regularity conditions, one can additionally show
that {1} × R++ ⊆ rint(dom(ℎ)).
Lemma 4.10 (Domain of ℎ). If E P̂ [ℓ(𝑍)] > −∞ and 𝜙 is continuous at 1, then
{1} × R++ ⊆ rint(dom(ℎ)).
Proof. If 𝑢0 = 1 and 𝑢 > 0, then 𝑣 = P̂ is feasible in (4.7). Indeed, P̂ obeys
both constraints, and its objective function value satisfies E P̂ [ℓ(𝑍)] > −∞. If we
perturb 𝑢0 and 𝑢 locally, then 𝑢0 P̂ satisfies the equality constraint, and the objective
function does not evaluate to −∞ for all 𝑢0 ≥ 0. The inequality constraint, on the
other hand, is satisfied for all 𝑢 > 0 and all 𝑢0 that are sufficiently close to 1 because
D 𝜙 (𝑢0 P̂, P̂) = 𝜙 𝜋 (𝑢0 , 1) = 𝜙(𝑢0 ) < 𝑢.
Here, the first equality follows from the definition of D 𝜙 with 𝜌 = P̂, the second
equality follows from the definition of the perspective function 𝜙 𝜋 , and the inequal-
ity holds because 𝜙(1) = 0, 𝑢 > 0 and 𝜙(𝑢0 ) is continuous at 𝑢0 = 1. This confirms
that (1, 𝑢) ∈ rint(dom(ℎ)) for every 𝑢 > 0, and thus the claim follows. 
The following two lemmas are instrumental to derive the bi-conjugate of ℎ.
Lemma 4.11 (Conjugates of Scaled Perspective Functions). If 𝜙 is an entropy
function in the sense of Definition 2.4, 𝑡 ∈ R, 𝛽 ∈ R+ and 𝜆 ∈ R++ , then we have

sup_{𝛼∈R} 𝑡𝛼 − 𝜆𝜙 𝜋 (𝛼, 𝛽) = { 𝛽𝜆𝜙∗ (𝑡/𝜆)              if 𝛽 > 0,
                             { 𝜆𝛿cl(dom(𝜙∗ )) (𝑡/𝜆)    if 𝛽 = 0.
Proof. If 𝛽 > 0, then we have
sup_{𝛼∈R} 𝑡𝛼 − 𝜆𝜙 𝜋 (𝛼, 𝛽) = sup_{𝛼∈R} 𝑡𝛼 − 𝜆𝛽𝜙(𝛼/𝛽) = 𝛽 sup_{𝛼∈R} 𝑡𝛼 − 𝜆𝜙(𝛼) = 𝛽𝜆𝜙∗ (𝑡/𝜆),
where the three equalities follow from the definition of the perspective function 𝜙 𝜋 ,
the substitution 𝛼 ← 𝛼/𝛽 and the replacement of 𝑡 by 𝜆𝑡/𝜆, respectively. Note that
these manipulations are admissible because 𝛽, 𝜆 > 0. If 𝛽 = 0, then we have
sup_{𝛼∈R} 𝑡𝛼 − 𝜆𝜙 𝜋 (𝛼, 𝛽) = sup_{𝛼∈R} 𝑡𝛼 − 𝜆𝜙∞ (𝛼)
                           = sup_{𝛼∈R} 𝑡𝛼 − 𝜆𝛿∗dom(𝜙∗ ) (𝛼) = 𝜆𝛿cl(dom(𝜙∗ )) (𝑡/𝜆),
where the first equality holds again because of the definition of 𝜙 𝜋 , and the second
equality exploits (Rockafellar 1970, Theorem 13.3). The third equality replaces 𝑡
with 𝜆𝑡/𝜆 and exploits the elementary observation that the conjugate of the support
function of a convex set coincides with the indicator function of the closure of this
set (Rockafellar 1970, Theorem 13.2). Thus, the claim follows. 
Lemma 4.12 (Domain of Conjugate Entropy Functions). If 𝜙 is an entropy function
in the sense of Definition 2.4, then we have
cl(dom(𝜙∗ )) = { (−∞, 𝜙∞ (1)]   if 𝜙∞ (1) < ∞,
               { R              if 𝜙∞ (1) = ∞.
Proof. As 𝜙 is proper, convex and closed, (Rockafellar 1970, Theorem 8.5) implies
that its recession function 𝜙∞ is positive homogeneous. Recall that 𝜙(𝑠) = ∞ for
every 𝑠 < 0. We may thus conclude that 𝜙∞ (𝑡) = 𝑡 𝜙∞ (1) for 𝑡 > 0, 𝜙∞ (𝑡) = 0 for 𝑡 =
0 and 𝜙∞ (𝑡) = ∞ for 𝑡 < 0. In addition, (Rockafellar 1970, Theorem 13.3) implies
that the support function of dom(𝜙∗ ) coincides with the recession function 𝜙∞ . The
indicator function of cl(dom(𝜙∗ )) is known to coincide with the conjugate of the
support function of dom(𝜙∗ ), and therefore it satisfies

𝛿cl(dom(𝜙∗ )) (𝑠) = sup_{𝑡∈R+} (𝑠 − 𝜙∞ (1)) 𝑡 = { 0   if 𝑠 ≤ 𝜙∞ (1),
                                               { ∞   otherwise.
This shows that cl(dom(𝜙∗ )) = (−∞, 𝜙∞ (1)] if 𝜙∞ (1) < ∞ and that cl(dom(𝜙∗ )) = R
otherwise. Hence, the claim follows. 

Proposition 4.13 (Bi-conjugate of ℎ). Assume that E P̂ [ℓ(𝑍)] > −∞. Then, the
bi-conjugate of ℎ defined in (4.7) satisfies



ℎ∗∗ (𝑢0 , 𝑢) = sup_{𝜆0 ∈R, 𝜆∈R+} { −𝜆 0 𝑢0 − 𝜆𝑢 − E P̂ [(𝜙∗ ) 𝜋 (ℓ(𝑍) − 𝜆 0 , 𝜆)] : 𝜆 0 + 𝜆 𝜙∞ (1) ≥ sup_{𝑧∈Z} ℓ(𝑧) },

where the product 𝜆 𝜙∞ (1) is assumed to evaluate to ∞ if 𝜆 = 0 and 𝜙∞ (1) = ∞.


If 𝜙 is continuous at 1, then ℎ∗∗ coincides with ℎ on {1} × R++ .


As 𝜙(1) = 0, we have 𝜙∗ (𝜏) = sup 𝛼∈R 𝜏𝛼 − 𝜙(𝛼) ≥ 𝜏 for all 𝜏 ∈ R. This readily
implies that (𝜙∗ ) 𝜋 (𝜏, 𝜆) ≥ 𝜏 for all 𝜏, 𝜆 ∈ R. Hence, E P̂ [(𝜙∗ ) 𝜋 (ℓ(𝑍) − 𝜆 0 , 𝜆)] ≥
E P̂ [ℓ(𝑍) − 𝜆 0 ]. In addition, 𝜙∗ is non-decreasing because dom(𝜙) ⊆ R+ . Examples
of common entropy functions and their conjugates are listed in Table 4.1.
Divergence                     𝜙(𝑠) (𝑠 ≥ 0)                      𝜙∞ (1)   𝜙∗ (𝑡)

Kullback-Leibler               𝑠 log(𝑠) − 𝑠 + 1                   ∞        e^𝑡 − 1
Likelihood                     − log(𝑠) + 𝑠 − 1                   1        − log(1 − 𝑡)
Total variation                |𝑠 − 1|/2                          1/2      max{𝑡, −1/2} + 𝛿(−∞,1/2] (𝑡)
Pearson 𝜒^2                    (𝑠 − 1)^2                          ∞        [𝑡/2 + 1]+^2 − 1
Neyman 𝜒^2                     (𝑠 − 1)^2 /𝑠                       1        2 − 2√(1 − 𝑡)
Cressie-Read for 𝛽 ∈ (0, 1)    (𝑠^𝛽 − 𝛽𝑠 + 𝛽 − 1)/(𝛽(𝛽 − 1))     1        ([(𝛽 − 1)𝑡 + 1]+^{𝛽/(𝛽−1)} − 1)/𝛽
Cressie-Read for 𝛽 > 1         (𝑠^𝛽 − 𝛽𝑠 + 𝛽 − 1)/(𝛽(𝛽 − 1))     ∞        ([(𝛽 − 1)𝑡 + 1]+^{𝛽/(𝛽−1)} − 1)/𝛽

Table 4.1. Examples of entropy functions, their asymptotic slopes and their con-
jugates. Here, for any 𝑐 ∈ R, we use [𝑐] + as a shorthand for max{𝑐, 0}.

Proof of Proposition 4.13. For any fixed (𝜆 0 , 𝜆) ∈ R2 , the conjugate of ℎ satisfies


ℎ∗ (−𝜆 0 , −𝜆) = sup_{(𝑢0 ,𝑢)∈R^2} −𝜆 0 𝑢0 − 𝜆𝑢 − ℎ(𝑢0 , 𝑢)
              = sup_{𝑢∈R+ , 𝑣∈M+ (Z)} { −𝜆𝑢 + ∫_Z (ℓ(𝑧) − 𝜆 0 ) d𝑣(𝑧) : D 𝜙 (𝑣, P̂) ≤ 𝑢 },

where the second equality holds because Z d𝑣(𝑧) = 𝑢0 and D 𝜙 (𝑣, P̂) ≥ 0.
As E P̂ [ℓ(𝑍)] > −∞, the resulting maximization problem over 𝑢 is unbounded
whenever 𝜆 < 0. If 𝜆 > 0, on the other hand, then we find


ℎ∗ (−𝜆 0 , −𝜆) = sup_{𝑣∈M+ (Z)} ∫_Z (ℓ(𝑧) − 𝜆 0 ) d𝑣(𝑧) − 𝜆D 𝜙 (𝑣, P̂)

              = sup { ∫_Z (ℓ(𝑧) − 𝜆 0 ) d𝑣/d𝜌 (𝑧) − 𝜆𝜙 𝜋 ( d𝑣/d𝜌 (𝑧), dP̂/d𝜌 (𝑧) ) d𝜌(𝑧) :
                      𝑣, 𝜌 ∈ M+ (Z), 𝑣 ≪ 𝜌, P̂ ≪ 𝜌 },
where the second equality exploits the definition of D 𝜙 . Note that d𝑣/d𝜌 (𝑧) and dP̂/d𝜌 (𝑧)
belong to the space L1 (𝜌) of all 𝜌-integrable Borel functions that can be represented
as the Radon-Nikodym derivative of some measure in M+ (Z) with respect to 𝜌.
Introducing auxiliary decision variables 𝛼, 𝛽 ∈ L1 (𝜌) for the Radon-Nikodym
derivatives of 𝑣 and P̂, respectively, and eliminating the measure 𝑣 yields




ℎ∗ (−𝜆 0 , −𝜆) = sup  ∫_Z (ℓ(𝑧) − 𝜆 0 ) 𝛼(𝑧) − 𝜆𝜙 𝜋 (𝛼(𝑧), 𝛽(𝑧)) d𝜌(𝑧)
               s.t. 𝜌 ∈ M+ (Z), 𝛼, 𝛽 ∈ L1 (𝜌)                             (4.8)
                    dP̂/d𝜌 = 𝛽  𝜌-a.s.
For any 𝜌 ∈ M+ (Z) and 𝛽 ∈ L1 (𝜌) with 𝛽 = dP̂/d𝜌 𝜌-almost surely, we then find

sup_{𝛼∈L1 (𝜌)} ∫_Z (ℓ(𝑧) − 𝜆 0 ) 𝛼(𝑧) − 𝜆𝜙 𝜋 (𝛼(𝑧), 𝛽(𝑧)) d𝜌(𝑧)
             = ∫_Z sup_{𝛼∈R} (ℓ(𝑧) − 𝜆 0 ) 𝛼 − 𝜆𝜙 𝜋 (𝛼, 𝛽(𝑧)) d𝜌(𝑧),      (4.9)
where the equality follows from (Rockafellar and Wets 2009, Theorem 14.60),
which applies because the objective function of the maximization problem in the
second line constitutes a normal integrand in the sense of (Rockafellar and Wets
2009, Definition 14.27). This can be verified by recalling that sums, perspectives
and concatenations of normal integrands are again normal integrands (Rockafellar
and Wets 2009, Section 14.E). Next, we partition Z into Z+ (𝛽) = {𝑧 ∈ Z : 𝛽(𝑧) >
0} and Z0 (𝛽) = {𝑧 ∈ Z : 𝛽(𝑧) = 0}. By Lemma 4.11, the integral (4.9) equals
∫_{Z+ (𝛽)} 𝜆𝜙∗ ((ℓ(𝑧) − 𝜆 0 )/𝜆) 𝛽(𝑧) d𝜌(𝑧) + ∫_{Z0 (𝛽)} 𝜆𝛿cl(dom(𝜙∗ )) ((ℓ(𝑧) − 𝜆 0 )/𝜆) d𝜌(𝑧).
As 𝛽 = dP̂/d𝜌 𝜌-almost surely, and as P̂(𝑍 ∈ Z+ (𝛽)) = 1, the first of these integrals
simply reduces to an expectation with respect to the reference distribution and is thus
independent of 𝛽. The second integral still depends on 𝛽 through the integration
domain Z0 (𝛽). Thus, partially maximizing over 𝛼 allows us to recast (4.8) as
  
ℎ∗ (−𝜆 0 , −𝜆) = E P̂ [𝜆𝜙∗ ((ℓ(𝑍) − 𝜆 0 )/𝜆)]
   + sup_{𝜌∈M+ (Z), 𝛽∈L1 (𝜌)} { ∫_{Z0 (𝛽)} 𝜆𝛿cl(dom(𝜙∗ )) ((ℓ(𝑧) − 𝜆 0 )/𝜆) d𝜌(𝑧) : dP̂/d𝜌 = 𝛽 𝜌-a.s. }.

If there exists 𝑧0 ∈ Z with (ℓ(𝑧0 ) − 𝜆 0 )/𝜆 ∉ cl(dom(𝜙∗ )), then ℎ∗ (−𝜆 0 , −𝜆) = ∞.
To see this, assume first that 𝑧0 is an atom of P̂. In this case, the expectation in the
first line evaluates to ∞. If 𝑧0 is not an atom of P̂, then the supremum in the second
line evaluates to ∞ because we may set 𝜌 = P̂ + 𝛿 𝑧0 and define 𝛽 ∈ L1 (𝜌) through
𝛽(𝑧) = 1 if 𝑧 ≠ 𝑧0 and 𝛽(𝑧0 ) = 0. Hence, we may conclude that
ℎ∗ (−𝜆 0 , −𝜆) = { E P̂ [𝜆𝜙∗ ((ℓ(𝑍) − 𝜆 0 )/𝜆)]   if (ℓ(𝑧) − 𝜆 0 )/𝜆 ∈ cl(dom(𝜙∗ )) ∀𝑧 ∈ Z,
                 { ∞                             otherwise.
Note that this formula was derived under the assumption that 𝜆 > 0. Note also
that, by Lemma 4.12, the condition (ℓ(𝑧) − 𝜆 0 )/𝜆 ∈ cl(dom(𝜙∗ )) is equivalent to the
requirement that 𝜆 0 + 𝜆 𝜙∞ (1) is larger than or equal to sup 𝑧 ∈Z ℓ(𝑧). We claim that
ℎ∗ (−𝜆 0 , −𝜆) = { E P̂ [(𝜙∗ ) 𝜋 (ℓ(𝑍) − 𝜆 0 , 𝜆)]   if 𝜆 ≥ 0 and 𝜆 0 + 𝜆 𝜙∞ (1) ≥ sup_{𝑧∈Z} ℓ(𝑧),      (4.10)
                 { ∞                               otherwise,

for all 𝜆 0 , 𝜆 ∈ R. Indeed, the above reasoning and the definition of the perspective
function (𝜙∗ ) 𝜋 ensure that (4.10) holds whenever 𝜆 ≠ 0. Note that ℎ∗ is convex and
closed thanks to (Rockafellar 1970, Theorem 12.2). The expression on the right
hand side of (4.10) is also convex and closed in (𝜆 0 , 𝜆). In particular, it is lower
semicontinuous thanks to Fatou’s lemma, which applies because 𝜙(1) = 0 such that
(𝜙∗ ) 𝜋 (𝑡, 𝜆) ≥ 𝑡 for all 𝑡 ∈ R and 𝜆 ∈ R+ and because E P̂ [ℓ(𝑍)] > −∞. Observe
also that (𝜙∗ ) 𝜋 is proper, closed and convex thanks to (Rockafellar 1970, page 35,
page 67 & Theorem 13.3). Hence, (4.10) must indeed hold for all 𝜆 0 , 𝜆 ∈ R.
Given (4.10), we finally obtain
ℎ∗∗ (𝑢0 , 𝑢) = sup_{𝜆0 ,𝜆∈R} −𝜆 0 𝑢0 − 𝜆𝑢 − ℎ∗ (−𝜆 0 , −𝜆)
             = sup_{𝜆0 ∈R, 𝜆∈R+} { −𝜆 0 𝑢0 − 𝜆𝑢 − E P̂ [(𝜙∗ ) 𝜋 (ℓ(𝑍) − 𝜆 0 , 𝜆)] : 𝜆 0 + 𝜆 𝜙∞ (1) ≥ sup_{𝑧∈Z} ℓ(𝑧) },

which establishes the desired formula for the bi-conjugate of ℎ. It remains to be


shown that if 𝜙 is continuous at 1, then ℎ(1, 𝑢) = ℎ∗∗ (1, 𝑢) for all 𝑢 ∈ R++ . However,
this follows immediately from Lemmas 4.2 and 4.10. 
The following main theorem uses Proposition 4.13 to dualize the worst-case
expectation problem (4.1) with a 𝜙-divergence ambiguity set.
Theorem 4.14 (Duality Theory for 𝜙-Divergence Ambiguity Sets). Assume that
E P̂ [ℓ(𝑍)] > −∞. If P is the 𝜙-divergence ambiguity set (2.10), then the following
weak duality relation holds.



sup_{P∈P} E P [ℓ(𝑍)] ≤ inf_{𝜆0 ∈R, 𝜆∈R+} { 𝜆 0 + 𝜆𝑟 + E P̂ [(𝜙∗ ) 𝜋 (ℓ(𝑍) − 𝜆 0 , 𝜆)] : 𝜆 0 + 𝜆 𝜙∞ (1) ≥ sup_{𝑧∈Z} ℓ(𝑧) }   (4.11)

Here, the product 𝜆 𝜙∞ (1) is assumed to evaluate to ∞ if 𝜆 = 0 and 𝜙∞ (1) = ∞. If


additionally 𝑟 > 0 and 𝜙 is continuous at 1, then strong duality holds, that is, the
inequality (4.11) collapses to an equality.
Proof. Recall first that
sup_{P∈P} E P [ℓ(𝑍)] = −ℎ(1, 𝑟) ≤ −ℎ∗∗ (1, 𝑟),

where the inequality holds because of Lemma 4.2. Weak duality thus follows from
the first claim in Proposition 4.13. If 𝜙 is additionally continuous at 1, and if 𝑟 > 0,
then strong duality follows from the second claim in Proposition 4.13. 
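For the Kullback-Leibler divergence on a finite support set, Theorem 4.14 can be illustrated end to end (the data below and the reduction to a one-dimensional dual are our own illustrative choices): since 𝜙∞ (1) = ∞, the dual constraint is vacuous for 𝜆 > 0, and minimizing over 𝜆 0 in closed form gives 𝜆 0 = 𝜆 log E P̂ [e^{ℓ(𝑍)/𝜆}], leaving a one-dimensional dual whose minimizer induces an exponentially tilted distribution that is primal feasible and attains the dual value.

```python
import math

# One-dimensional KL dual g(lam) = lam * r + lam * log E_Phat[e^{loss/lam}].
# Its minimizer lam_star induces P* propto Phat * e^{loss/lam_star}, which
# should satisfy KL(P*, Phat) <= r and E_P*[loss] = g(lam_star).

phat = [0.3, 0.4, 0.3]             # illustrative reference distribution
loss = [1.0, -0.5, 2.0]            # illustrative losses
r = 0.1                            # radius of the KL ambiguity set

def g(lam):
    return lam * r + lam * math.log(sum(p * math.exp(l / lam)
                                        for p, l in zip(phat, loss)))

lams = [i / 1000.0 for i in range(10, 50001)]    # lam in [0.01, 50]
lam_star = min(lams, key=g)
dual = g(lam_star)

w = [p * math.exp(l / lam_star) for p, l in zip(phat, loss)]
pstar = [wi / sum(w) for wi in w]                # exponential tilting
kl = sum(p * math.log(p / q) for p, q in zip(pstar, phat))
exp_loss = sum(p * l for p, l in zip(pstar, loss))

assert kl <= r + 1e-3                            # primal feasibility
assert abs(exp_loss - dual) <= 1e-3              # attains the dual value
print("dual =", dual, "tilted expected loss =", exp_loss, "KL =", kl)
```

That the tilted distribution exactly exhausts the KL budget at the dual minimizer is consistent with the strong duality asserted by the theorem for 𝑟 > 0.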
Recall now that the restricted 𝜙-divergence ambiguity set (2.11) is defined as

P = { P ∈ P(Z) : P ≪ P̂, D 𝜙 (P, P̂) ≤ 𝑟 }.
That is, P contains all distributions from within the (unrestricted) 𝜙-divergence
ambiguity set (2.10) that are absolutely continuous with respect to P̂. The worst-
case expected loss over P can again be expressed as −ℎ(1, 𝑟), where ℎ(𝑢0 , 𝑢) is
now defined as the infimum of the optimization problem (4.7) with the additional
constraint 𝑣 ≪ P̂. One readily verifies that ℎ remains convex and that {1} × R++
is still contained in rint(dom(ℎ)) despite this restriction. Indeed, the proof of
Lemma 4.10 remains valid almost verbatim.
Theorem 4.15 (Duality Theory for Restricted 𝜙-Divergence Ambiguity Sets). As-
sume that E P̂ [ℓ(𝑍)] > −∞. If P is the restricted 𝜙-divergence ambiguity set (2.11),
then the following weak duality relation holds.
sup_{P∈P} E P [ℓ(𝑍)] ≤ inf_{𝜆0 ∈R, 𝜆∈R+} 𝜆 0 + 𝜆𝑟 + E P̂ [(𝜙∗ ) 𝜋 (ℓ(𝑍) − 𝜆 0 , 𝜆)]   (4.12)

If additionally 𝑟 > 0 and 𝜙 is continuous at 1, then strong duality holds, that is, the
inequality (4.12) collapses to an equality.
Note that if (𝜆 0 , 𝜆) is feasible in (4.12), then (ℓ(𝑍) − 𝜆 0 , 𝜆) belongs P̂-almost
surely to dom((𝜙∗ ) 𝜋 ). Otherwise, its objective function value equals ∞. In view of
Lemma 4.12, this implies that 𝜆 0 + 𝜆 𝜙∞ (1) ≥ ess sup P̂ [ℓ(𝑍)]. In contrast, if (𝜆 0 , 𝜆)
is feasible in (4.11), then it satisfies the constraint 𝜆 0 + 𝜆 𝜙∞ (1) ≥ sup 𝑧 ∈Z ℓ(𝑧),
which is more restrictive unless 𝜙∞ (1) = ∞. Hence, the dual problem in (4.12) has
a (weakly) larger feasible set and a (weakly) smaller infimum than the dual problem
in (4.11). This is perhaps unsurprising because (4.12) corresponds to the worst-
case expectation problem over the restricted 𝜙-divergence ambiguity set, which is
(weakly) smaller than the corresponding unrestricted 𝜙-divergence ambiguity set.
Note also that the solution of a worst-case expectation problem over an unrestricted
𝜙-divergence ambiguity set depends on Z and not just on the support of P̂.

Proof of Theorem 4.15. If ℎ(𝑢0 , 𝑢) is defined as the infimum of the optimization


problem (4.7) with the additional constraint 𝑣 ≪ P̂, then one can show that
ℎ∗∗ (𝑢0 , 𝑢) = sup_{𝜆0 ,𝜆∈R} −𝜆 0 𝑢0 − 𝜆𝑢 − E P̂ [(𝜙∗ ) 𝜋 (ℓ(𝑍) − 𝜆 0 , 𝜆)].

Indeed, one can proceed as in the proof of Proposition 4.13. However, the reasoning
simplifies significantly because the additional constraint 𝑣 ≪ P̂ allows us to set
the dominating measure 𝜌 in the definition of D 𝜙 to P̂. Thus, the Radon-Nikodym
derivative 𝛽 = dP̂/d𝜌 is P̂-almost surely equal to 1. This in turn implies that the
calculation of ℎ∗ requires no case distinction, that is, the set Z0 (𝛽) is empty.
Given the bi-conjugate of ℎ, both weak and strong duality can then be established
exactly as in the proof of Theorem 4.14. Details are omitted for brevity. 

Van Parys et al. (2021, Proposition 5) establish a strong duality result for worst-
case expectations over likelihood ambiguity sets as introduced in Section 2.2.2.
Theorem 4.14 extends this result to general 𝜙-divergence ambiguity sets with a
significantly shorter proof that only uses tools from convex analysis. Ben-Tal et al.
(2013) establish a strong duality result akin to Theorem 4.15 for restricted 𝜙-
divergence ambiguity sets under the assumption that the reference distribution P̂ is
discrete. Shapiro (2017) extends this result to general reference distributions by us-
ing tools from infinite-dimensional analysis. In contrast, our proof of Theorem 4.15
establishes the same duality result using finite-dimensional convex analysis.

4.4. Optimal Transport Ambiguity Sets


Recall from Section 2.3 that the optimal transport ambiguity set (2.27) is defined as

P = { P ∈ P(Z) : OT𝑐 (P, P̂) ≤ 𝑟 }.
Here, Z is a closed support set, 𝑟 ≥ 0 is a size parameter, 𝑐 is a transportation
cost function in the sense of Definition 2.14, OT𝑐 is the corresponding optimal
transport discrepancy in the sense of Definition 2.15, and P̂ ∈ P(Z) is a reference
distribution. In analogy to Section 4.3, the worst-case expectation problem (4.1)
over the ambiguity set (2.27) can now be reformulated as
sup_{P∈P} E P [ℓ(𝑍)] = −ℎ(𝑟),

where the auxiliary function ℎ : R → R is defined through



ℎ(𝑢) = inf_{P∈P(Z)} { −E P [ℓ(𝑍)] : OT𝑐 (P, P̂) ≤ 𝑢 }.   (4.13)

As the objective and constraint functions of the minimization problem in (4.13) are
jointly convex in P and 𝑢, Lemma 4.1 implies that ℎ is convex. Recall also that 𝑐
is non-negative and satisfies 𝑐(𝑧, 𝑧) = 0 for all 𝑧 ∈ Z. If E_P̂[ℓ(Ẑ)] > −∞, it is
therefore easy to show that dom(ℎ) = R+ .
The following lemma will be instrumental for deriving the bi-conjugate of ℎ.
Recall that Γ(P, P̂) denotes the set of all couplings of P and P̂; see Definition 2.15.
Lemma 4.16 (Interchangeability Principle). If 𝑐 is a transportation cost function,
ℓ is upper semicontinuous and 𝜆 ≥ 0, then we have
 
 
sup_{P∈P(Z)} sup_{𝛾∈Γ(P,P̂)} E_𝛾[ ℓ(𝑍) − 𝜆𝑐(𝑍, Ẑ) ] = E_P̂[ sup_{𝑧∈Z} ℓ(𝑧) − 𝜆𝑐(𝑧, Ẑ) ].

One can show that Lemma 4.16 remains valid, for example, if Z is a Polish (sep-
arable metric) space equipped with its Borel 𝜎-algebra and even if 𝑐 and ℓ fail to be
lower and upper semicontinuous, respectively (Zhang et al. 2024b, Proposition 1).
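In the finite-support case, Lemma 4.16 can be verified directly: the supremum over P and couplings becomes a linear program whose optimum decomposes along the columns of the coupling matrix. The following minimal numerical sketch (with illustrative data and SciPy's `linprog` solver, not code from the survey) compares the two sides for a fixed 𝜆.

```python
import numpy as np
from scipy.optimize import linprog

# Toy discrete setting (all values are illustrative, not from the survey).
z = np.array([-1.0, 0.0, 1.5, 3.0])      # support points of Z
p_hat = np.array([0.1, 0.4, 0.3, 0.2])   # reference distribution P-hat
ell = z**2                               # loss values ell(z_i)
C = (z[:, None] - z[None, :])**2         # transport cost c(z_i, z_j)
lam = 0.7                                # fixed multiplier lambda >= 0
n = len(z)

# Left-hand side: since P is free, the sup over P and all couplings gamma in
# Gamma(P, P-hat) is an LP over gamma >= 0 with column sums equal to p_hat.
payoff = ell[:, None] - lam * C          # entry (i, j): ell(z_i) - lam*c(z_i, z_j)
A_eq = np.tile(np.eye(n), n)             # row j sums gamma[:, j] (row-major flattening)
res = linprog(-payoff.ravel(), A_eq=A_eq, b_eq=p_hat, bounds=(0, None))
lhs = -res.fun

# Right-hand side: expectation under P-hat of the column-wise supremum.
rhs = float(p_hat @ payoff.max(axis=0))
print(lhs, rhs)
```

Because the LP decomposes column by column, the two printed values coincide up to solver tolerance.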

Proof of Lemma 4.16. Define 𝐿 : Z → R through 𝐿(ẑ) = sup_{𝑧∈Z} ℓ(𝑧) − 𝜆𝑐(𝑧, ẑ).
If 𝜆 = 1, then 𝐿 reduces to the 𝑐-transform of ℓ defined in (2.25). Note first that 𝐿
constitutes a pointwise supremum of upper semicontinuous functions and is thus
also upper semicontinuous and, in particular, Borel-measurable.
Observe next that, by the definition of 𝐿, we have ℓ(𝑧) − 𝜆𝑐(𝑧, ẑ) ≤ 𝐿(ẑ) for
all 𝑧, ẑ ∈ Z. This inequality persists if we integrate both sides with respect to any
coupling 𝛾 ∈ Γ(P, P̂) for any distribution P ∈ P(Z), and therefore we obtain
   
sup_{P∈P(Z)} sup_{𝛾∈Γ(P,P̂)} E_𝛾[ ℓ(𝑍) − 𝜆𝑐(𝑍, Ẑ) ] ≤ E_P̂[𝐿(Ẑ)].

It remains to prove the reverse inequality. To this end, observe that


 
   
E_P̂[𝐿(Ẑ)] = E_P̂[ sup_{𝑧∈Z} ℓ(𝑧) − 𝜆𝑐(𝑧, Ẑ) ] = sup_{𝑓∈F} E_P̂[ ℓ(𝑓(Ẑ)) − 𝜆𝑐(𝑓(Ẑ), Ẑ) ]
        ≤ sup_{P∈P(Z)} sup_{𝛾∈Γ(P,P̂)} E_𝛾[ ℓ(𝑍) − 𝜆𝑐(𝑍, Ẑ) ],

where F denotes the family of all Borel functions 𝑓 : Z → Z. The second


equality follows from (Rockafellar and Wets 2009, Theorem 14.60), which applies
because ℓ(𝑧) − 𝜆𝑐(𝑧, 𝑧ˆ) is upper semicontinuous in (𝑧, 𝑧ˆ) and thus constitutes a
normal integrand thanks to (Rockafellar and Wets 2009, Example 14.31). Note
that the joint distribution of 𝑓(Ẑ) and Ẑ under P̂ coincides with the pushforward
distribution 𝛾 = P̂ ◦ 𝑔^{-1}, where 𝑔 : Z → Z × Z is defined through 𝑔(ẑ) = (𝑓(ẑ), ẑ).
By construction, we have 𝛾 ∈ Γ(P̂ ◦ 𝑓^{-1}, P̂). The inequality in the above expression
therefore holds because P̂ ◦ 𝑓^{-1} ∈ P(Z). This observation completes the proof. 
Proposition 4.17 (Bi-conjugate of ℎ). Assume that E_P̂[ℓ(Ẑ)] > −∞ and that ℓ is
upper semicontinuous. Then, the bi-conjugate of ℎ defined in (4.13) satisfies
ℎ^{**}(𝑢) = sup_{𝜆≥0} −𝜆𝑢 − E_P̂[ sup_{𝑧∈Z} ℓ(𝑧) − 𝜆𝑐(𝑧, Ẑ) ].
In addition, ℎ^{**} coincides with ℎ on R_{++}.


Proof. For any fixed 𝜆 ∈ R, the conjugate of ℎ satisfies
ℎ^*(−𝜆) = sup_{𝑢∈R} −𝜆𝑢 − ℎ(𝑢)
        = sup_{𝑢∈R_+, P∈P(Z)} { −𝜆𝑢 + E_P[ℓ(𝑍)] : OT𝑐 (P, P̂) ≤ 𝑢 },

where the second equality holds because OT𝑐 (P, P̂) ≥ 0. As E_P̂[ℓ(Ẑ)] > −∞, the
resulting maximization problem is unbounded if 𝜆 < 0. If 𝜆 > 0, then we find
ℎ^*(−𝜆) = sup_{P∈P(Z)} E_P[ℓ(𝑍)] − 𝜆 OT𝑐 (P, P̂)
        = sup_{P∈P(Z)} sup_{𝛾∈Γ(P,P̂)} E_P[ℓ(𝑍)] − 𝜆 E_𝛾[𝑐(𝑍, Ẑ)]
        = sup_{P∈P(Z)} sup_{𝛾∈Γ(P,P̂)} E_𝛾[ ℓ(𝑍) − 𝜆𝑐(𝑍, Ẑ) ]
        = E_P̂[ sup_{𝑧∈Z} ℓ(𝑧) − 𝜆𝑐(𝑧, Ẑ) ],   (4.14)
where the second equality follows from Definition 2.15, the third equality holds
because the marginal distribution of 𝑍 under 𝛾 is given by P, and the fourth
equality exploits Lemma 4.16. The above reasoning implies that ℎ∗ (−𝜆) coincides
with (4.14) for all 𝜆 > 0. However, this formula remains valid at 𝜆 = 0. To see this,
note that ℎ∗ is convex and closed thanks to (Rockafellar 1970, Theorem 12.2). The
last expectation in (4.14) is also convex and closed in 𝜆 thanks to Fatou’s lemma,
which applies because sup_{𝑧∈Z} ℓ(𝑧) − 𝜆𝑐(𝑧, ẑ) is larger than or equal to ℓ(ẑ) and
lower semicontinuous in 𝜆 for every ẑ ∈ Z and because E_P̂[ℓ(Ẑ)] > −∞. Hence,
the last expectation in (4.14) is indeed convex and lower semicontinuous in 𝜆, and
thus it coincides with ℎ^*(−𝜆) for all 𝜆 ∈ R_+.
Given (4.14), we finally obtain the following formula for the bi-conjugate of ℎ.
 
ℎ^{**}(𝑢) = sup_{𝜆≥0} −𝜆𝑢 − ℎ^*(−𝜆) = sup_{𝜆≥0} −𝜆𝑢 − E_P̂[ sup_{𝑧∈Z} ℓ(𝑧) − 𝜆𝑐(𝑧, Ẑ) ].

Here, the first equality holds because ℎ∗ (−𝜆) = ∞ whenever 𝜆 < 0. The second
equality follows from (4.14), which holds for any 𝜆 ≥ 0. This establishes the
desired formula for ℎ∗∗ . Lemma 4.2 and our earlier observation that dom(ℎ) = R+
finally imply that ℎ(𝑢) = ℎ^{**}(𝑢) for all 𝑢 ∈ R_{++}. 
The following main theorem uses Proposition 4.17 to dualize the worst-case
expectation problem (4.1) with an optimal transport ambiguity set.
Theorem 4.18 (Duality Theory for Optimal Transport Ambiguity Sets). Assume
that E_P̂[ℓ(Ẑ)] > −∞ and ℓ is upper semicontinuous. If P is the optimal transport
ambiguity set defined in (2.27), then the following weak duality relation holds.
sup_{P∈P} E_P[ℓ(𝑍)] ≤ inf_{𝜆∈R_+} 𝜆𝑟 + E_P̂[ sup_{𝑧∈Z} ℓ(𝑧) − 𝜆𝑐(𝑧, Ẑ) ]. (4.15)

If 𝑟 > 0, then strong duality holds, that is, (4.15) collapses to an equality.
Proof. Recall first that
sup_{P∈P} E_P[ℓ(𝑍)] = −ℎ(𝑟) ≤ −ℎ^{**}(𝑟),

where the inequality holds because of Lemma 4.2. Weak duality thus follows from
the first claim in Proposition 4.17. If 𝑟 > 0, then strong duality follows from the
second claim in Proposition 4.17. This concludes the proof. 
Mohajerin Esfahani and Kuhn (2018) and Zhao and Guan (2018) use semi-
infinite duality theory to prove Theorem 4.18 in the special case when OT𝑐 is the
1-Wasserstein distance and when the reference distribution P̂ is discrete. Blanchet
and Murthy (2019) and Gao and Kleywegt (2023) prove a generalization of The-
orem 4.18 by leveraging a Fenchel duality theorem in Banach spaces and by devising
a constructive argument, respectively. They both allow for arbitrary optimal trans-
port discrepancies as well as arbitrary reference distributions on Polish spaces. The
proof shown here, which exploits the interchangeability principle of Lemma 4.16
and elementary tools from convex analysis, is due to Zhang et al. (2024b).
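When P̂ is discrete with finite support, both sides of (4.15) are finite linear programs, so the strong duality asserted by Theorem 4.18 can be checked numerically. The sketch below (illustrative data, not code from the survey) solves the primal worst-case expectation over the optimal transport ball and the epigraph reformulation of the dual.

```python
import numpy as np
from scipy.optimize import linprog

z = np.array([-1.0, 0.0, 1.5, 3.0])      # finite support Z (illustrative)
p_hat = np.array([0.1, 0.4, 0.3, 0.2])   # reference distribution P-hat
ell = z**2                               # loss ell(z)
C = (z[:, None] - z[None, :])**2         # cost c(z_i, z_j)
r = 1.0                                  # radius of the ambiguity set
n = len(z)

# Primal: max sum_ij gamma_ij * ell_i  s.t. column sums of gamma equal p_hat,
#         sum_ij gamma_ij * c_ij <= r, gamma >= 0.
A_eq = np.tile(np.eye(n), n)             # fixes the second marginal to p_hat
res_p = linprog(-np.repeat(ell, n), A_eq=A_eq, b_eq=p_hat,
                A_ub=C.ravel()[None, :], b_ub=[r], bounds=(0, None))
primal = -res_p.fun

# Dual (4.15) as an LP in (lam, s_1, ..., s_n):
#   min lam*r + sum_j p_hat_j * s_j  s.t.  s_j >= ell_i - lam*c_ij, lam >= 0.
rows, rhs_ub = [], []
for i in range(n):
    for j in range(n):
        row = np.zeros(n + 1)
        row[0] = -C[i, j]                # -c_ij * lam
        row[1 + j] = -1.0                # -s_j
        rows.append(row)
        rhs_ub.append(-ell[i])           # ... <= -ell_i
res_d = linprog(np.concatenate(([r], p_hat)), A_ub=np.array(rows), b_ub=rhs_ub,
                bounds=[(0, None)] + [(None, None)] * n)
dual = res_d.fun
print(primal, dual)
```

The two LPs form a primal-dual pair, so the printed values coincide up to solver tolerance.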

5. Duality Theory for Worst-Case Risk Problems


The standard DRO problem (1.2) assumes that the decision-maker is risk-neutral
and ambiguity-averse. Risk-neutrality means that if the distribution of 𝑍 is known,
then decisions are ranked by their expected loss. Ambiguity-aversion means that
if the distribution of 𝑍 is ambiguous, then expectations are evaluated under a
distribution in the ambiguity set P that is most detrimental to the decision-maker.
If low-probability events have a disproportionate negative impact on the decision-
maker, then it is inappropriate to use the expected loss as a decision criterion even
if the distribution of 𝑍 is known. Instead, it is expedient to rank decisions by the
risk of their loss with respect to a law-invariant risk measure. A law-invariant risk
measure 𝜚 assigns each (univariate) loss distribution in P(R) a riskiness index. If
the loss is representable as ℓ(𝑍), where ℓ : R𝑑 → R is a Borel function and 𝑍 is a
𝑑-dimensional random vector with probability distribution P, then the distribution
of the loss ℓ(𝑍) is given by the pushforward distribution P ◦ ℓ^{-1}. Throughout this
paper, we use 𝜚_P[ℓ(𝑍)] to denote the risk 𝜚(P ◦ ℓ^{-1}) of such a loss distribution.
These conventions are formalized in the following definition. Here and in the
remainder we use L(R𝑑 ) to denote the family of all Borel functions ℓ : R𝑑 → R.
Definition 5.1 (Law-Invariant Risk Measure). A law-invariant risk measure is a
function 𝜚 : P(R) → R. We use 𝜚_P[ℓ(𝑍)] to denote 𝜚(P ◦ ℓ^{-1}) for any Borel
function ℓ ∈ L(R𝑑 ), Borel distribution P ∈ P(R𝑑 ) and dimension 𝑑 ∈ N.
A law-invariant risk measure 𝜚 has the property that if P_1 ◦ ℓ_1^{-1} = P_2 ◦ ℓ_2^{-1} for
two different Borel functions ℓ_1 and ℓ_2 and two different distributions P_1 and P_2
on R^{𝑑_1} and R^{𝑑_2}, respectively, then 𝜚_{P_1}[ℓ_1(𝑍_1)] = 𝜚_{P_2}[ℓ_2(𝑍_2)]. In fact, this
property is the very reason why 𝜚 is called ‘law-invariant.’
The notation 𝜚P [ℓ(𝑍)] is consistent with our usual conventions for the expected
value E P [ℓ(𝑍)], which is a special instance of a law-invariant risk measure. Also,
it makes the dependence of the risk on P explicit, which is necessary when P is
ambiguous. We stress that, in contrast to most of the literature on risk measures,
our definition of a law-invariant risk measure 𝜚 is not tied to a particular probability
space. A prime example of a law-invariant risk measure is the value-at-risk.
Definition 5.2 (Value-at-Risk). The value-at-risk (VaR) at level 𝛽 ∈ (0, 1) of an
uncertain loss ℓ(𝑍) with ℓ ∈ L(R𝑑 ) and 𝑍 ∼ P ∈ P(R𝑑 ) is given by
𝛽-VaRP [ℓ(𝑍)] = inf {𝜏 ∈ R : P(ℓ(𝑍) ≤ 𝜏) ≥ 1 − 𝛽} . (5.1)
The VaR is indeed law-invariant because P(ℓ(𝑍) ≤ 𝜏) = 𝐹(𝜏) depends on ℓ
and P only indirectly through the cumulative distribution function 𝐹 associated
with the pushforward distribution P ◦ ℓ^{-1}. Note that the infimum in (5.1) is attained
because 𝐹 is non-decreasing and right-continuous. By construction, the VaR
Distributionally Robust Optimization 75

at level 𝛽 represents the smallest number 𝜏★ that weakly exceeds the loss with
probability 1 − 𝛽. Thus, it coincides with the leftmost (1 − 𝛽)-quantile of the loss
distribution 𝐹. For later reference we remark that the 𝛽-VaR can be reformulated as
𝛽-VaRP [ℓ(𝑍)] = inf {𝜏 ∈ R : P(ℓ(𝑍) ≥ 𝜏) ≤ 𝛽} . (5.2)
However, the infimum in (5.2) may not be attained. Note that the VaR is well-defined
and finite for any loss function ℓ ∈ L(R𝑑 ) and for any distribution P ∈ P(R𝑑 ).
Nonetheless, other law-invariant risk measures are finite only for certain sub-classes
of loss functions and distributions. In the remainder of this paper we will often
study risk measures that display some or all of the following structural properties.
Definition 5.3 (Properties of Risk Measures). A law-invariant risk measure 𝜚 is

(i) translation-invariant if
𝜚P [ℓ(𝑍) + 𝑐] = 𝜚P [ℓ(𝑍)] + 𝑐 ∀ℓ ∈ L(R𝑑 ), ∀𝑐 ∈ R, ∀P ∈ P(R𝑑 );

(ii) scale-invariant if
𝜚P [𝑐ℓ(𝑍)] = 𝑐𝜚P [ℓ(𝑍)] ∀ℓ ∈ L(R𝑑 ), ∀𝑐 ∈ R+ , ∀P ∈ P(R𝑑 );

(iii) monotone if
𝜚P [ℓ1 (𝑍)] ≤ 𝜚P [ℓ2 (𝑍)]
∀ℓ1 , ℓ2 ∈ L(R𝑑 ) with ℓ1 (𝑍) ≤ ℓ2 (𝑍) P-a.s., ∀P ∈ P(R𝑑 );

(iv) convex if
𝜚P [𝜃ℓ1 (𝑍) + (1 − 𝜃)ℓ2 (𝑍)] ≤ 𝜃 𝜚P [ℓ1 (𝑍)] + (1 − 𝜃)𝜚P [ℓ2 (𝑍)]
∀ℓ1 , ℓ2 ∈ L(R𝑑 ), ∀𝜃 ∈ [0, 1], ∀P ∈ P(R𝑑 ).

A coherent risk measure is translation-invariant, scale-invariant, monotone as


well as convex (Artzner et al. 1999). In addition, a convex risk measure is
translation-invariant, monotone and convex (but not necessarily scale-invariant).
Any law-invariant risk measure 𝜚 gives rise to a risk-averse DRO problem
inf_{𝑥∈X} sup_{P∈P} 𝜚_P[ℓ(𝑥, 𝑍)]. (5.3)

This problem seeks a decision 𝑥 that minimizes the worst-case risk of the random
loss ℓ(𝑥, 𝑍) with respect to all distributions of 𝑍 in the ambiguity set P. Below we
will show that the duality theory for worst-case expectation problems developed in
Section 4 has ramifications for a broad class of worst-case risk problems of the form
sup_{P∈P} 𝜚_P[ℓ(𝑍)]. (5.4)

Here, we suppress as usual the dependence of the loss function on 𝑥 to avoid clutter.
76 D. Kuhn, S. Shafiee, and W. Wiesemann

5.1. Optimized Certainty Equivalents


We now describe a class of law-invariant risk measures for which the risk-averse
DRO problem (5.3) can be converted to an equivalent risk-neutral DRO problem of
the form (1.2). This will show that many risk-averse DRO problems are susceptible
to methods developed for risk-neutral problems. The risk measures studied in this
section are induced by disutility functions in the sense of the following definition.
Definition 5.4 (Disutility Function). A disutility function 𝑔 : R → R is a convex
(and therefore continuous) function with 𝑔(0) = 0 and 𝑔(𝜏) > 𝜏 for all 𝜏 ≠ 0.
Ben-Tal and Teboulle (1986) use disutility functions to construct a class of law-
invariant risk measures, which they term optimized certainty equivalents. Recall
that if the objective function of a minimization (maximization) problem can be
expressed as the difference of two terms, both of which evaluate to ∞ (e.g., the
positive and negative parts of an integral), then it should be interpreted as ∞ (−∞).
Definition 5.5 (Optimized Certainty Equivalent). The optimized certainty equival-
ent induced by the disutility function 𝑔 is the law-invariant risk measure 𝜚 with
𝜚_P[ℓ(𝑍)] = inf_{𝜏∈R} 𝜏 + E_P[𝑔(ℓ(𝑍) − 𝜏)]. (5.5)

The expected disutility E P [𝑔(ℓ(𝑍))] represents a deterministic present loss that


the decision-maker considers to be equally (un)desirable as the random future
loss ℓ(𝑍). If it is possible to shift a deterministic portion 𝜏 of the loss ℓ(𝑍) to the
present, then the decision-maker will solve the minimization problem in (5.5) in
order to strike an optimal trade-off between present and future losses. Hence, it is
natural to interpret 𝜚P [ℓ(𝑍)] as an ‘optimized certainty equivalent.’
There is also an intimate relation between optimized certainty equivalents and
a class of 𝜙-divergences. To see this, let 𝜙 be an entropy function in the sense
of Definition 2.4 with 𝜙∞ (1) = ∞. Assume also that 𝜙 is twice continuously
differentiable on a neighborhood of 1 with 𝜙′ (1) = 0 and 𝜙′′ (1) > 0. Under
these conditions, 𝜙∗ constitutes a disutility function in the sense of Definition 5.4.
Indeed, 𝜙^* is real-valued because 𝜙_∞(1) = ∞ and satisfies 𝜙^*(𝑡) ≥ 𝑡 for all 𝑡 ∈ R
because 𝜙(1) = 0. Finally, we have 𝜙^*(0) = 0 because 𝜙′(1) = 0 and 𝜙^*(𝑡) > 𝑡
for all 𝑡 ≠ 0 because 𝜙′′(1) > 0. If E_P̂[ℓ(𝑍)] > −∞, then the optimized certainty
equivalent induced by the disutility function 𝑔 = 𝜙∗ satisfies
inf_{𝜆_0∈R} 𝜆_0 + E_P̂[𝜙^*(ℓ(𝑍) − 𝜆_0)] = sup_{P∈P(Z)} E_P[ℓ(𝑍)] − D_𝜙(P, P̂) (5.6)

and thus coincides with the optimal value of a penalty-based distributionally robust
optimization model with a 𝜙-divergence penalty. The equality in the above expres-
sion follows from Ben-Tal and Teboulle (2007, Theorem 4.2), which is reminiscent
of the strong duality theorem for worst-case expectation problems over restricted
𝜙-divergence ambiguity sets (see Theorem 4.15). The assumption that 𝜙∞ (1) = ∞
ensures indeed that D 𝜙 (P, P̂) is finite only if P ≪ P̂. We also remark that if 𝑔 is a
disutility function in the sense of Definition 5.4 and if 𝑔 is non-decreasing, then 𝑔∗


constitutes an entropy function in the sense of Definition 2.4.
We will see below that the optimized certainty equivalents encapsulate several
widely used risk measures as special cases. Notable examples include the mean-
variance risk measure, the mean-median risk measure, the conditional value-at-risk
or the entropic risk measure. More generally, Rockafellar, Uryasev and Zabarankin
(2006, 2008) show that virtually any regular risk measure admits a representation of
the form (5.5) provided that the expected disutility is replaced with a more general
measure of regret; see also (Rockafellar and Royset 2014, 2015) and the survey
papers (Rockafellar and Royset 2013, Royset 2022).
Definition 5.6 (Mean-Variance Risk Measure). The mean-variance risk measure
with risk-aversion coefficient 𝛽 ∈ (0, ∞) is the law-invariant risk measure 𝜚 with
𝜚_P[ℓ(𝑍)] = E_P[ℓ(𝑍)] + 𝛽 · V_P[ℓ(𝑍)],
where VP [ℓ(𝑍)] denotes the variance of ℓ(𝑍) under P.
We call a function 𝑓 : R → R coercive if lim𝑖→∞ 𝑓 (𝜏𝑖 ) = ∞ for every sequence
{𝜏𝑖 }𝑖∈N with lim𝑖→∞ |𝜏𝑖 | = ∞. Coercivity will play a key role in re-expressing
worst-case optimized certainty equivalents in terms of worst-case expectations.
Proposition 5.7 (Mean-Variance Risk Measure). The mean-variance risk meas-
ure 𝜚 with risk-aversion coefficient 𝛽 ∈ (0, ∞) is the optimized certainty equivalent
induced by the disutility function 𝑔(𝜏) = 𝜏 + 𝛽𝜏 2 . The objective function of prob-
lem (5.5) is coercive in 𝜏 and is uniquely minimized by 𝜏★ = E P [ℓ(𝑍)].
Proof. The objective function of problem (5.5) corresponding to the disutility
function 𝑔 is given by E_P[ℓ(𝑍) + 𝛽(ℓ(𝑍) − 𝜏)^2]. This function is evidently coercive
in 𝜏 and is minimized by 𝜏★ = E P [ℓ(𝑍)]. Substituting 𝜏★ back into the objective
function shows that the optimized certainty equivalent induced by 𝑔 coincides
indeed with the mean-variance risk measure with risk-aversion coefficient 𝛽. 
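Proposition 5.7 admits a quick numerical sanity check. The minimal sketch below (with made-up data; SciPy's scalar minimizer stands in for the closed-form argument) minimizes the objective of (5.5) for 𝑔(𝜏) = 𝜏 + 𝛽𝜏² and compares the optimal value and minimizer with the mean-variance formula.

```python
import numpy as np
from scipy.optimize import minimize_scalar

losses = np.array([1.0, 2.0, 4.0, 7.0])     # support of ell(Z) (illustrative)
probs = np.array([0.2, 0.35, 0.25, 0.2])    # distribution P
beta = 0.5                                  # risk-aversion coefficient

def oce_objective(tau):
    # tau + E_P[g(ell(Z) - tau)] with g(t) = t + beta * t**2
    d = losses - tau
    return tau + probs @ (d + beta * d**2)

res = minimize_scalar(oce_objective)
mean = probs @ losses
var = probs @ (losses - mean)**2
oce = res.fun
print(oce, mean + beta * var, res.x)   # OCE = mean + beta*var, minimizer = mean
```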
Definition 5.8 (Mean-MAD Risk Measure). The mean-median absolute deviation
(MAD) risk measure with risk-aversion coefficient 𝛽 ∈ (0, ∞) is the law-invariant
risk measure 𝜚 with
 
𝜚_P[ℓ(𝑍)] = E_P[ℓ(𝑍)] + 𝛽 · E_P[ |ℓ(𝑍) − M_P[ℓ(𝑍)]| ],
where MP [ℓ(𝑍)] denotes the median of ℓ(𝑍) under P.
Proposition 5.9 (Mean-MAD Risk Measure). The mean-MAD risk measure 𝜚 with
risk-aversion coefficient 𝛽 ∈ (0, ∞) is the optimized certainty equivalent induced
by the disutility function 𝑔(𝜏) = 𝜏 + 𝛽|𝜏|. The objective function of problem (5.5)
is coercive in 𝜏 and is minimized by 𝜏★ = MP [ℓ(𝑍)].
Proof. The objective function of problem (5.5) corresponding to the disutility
function 𝑔 is given by E_P[ℓ(𝑍) + 𝛽|ℓ(𝑍) − 𝜏|]. This function is evidently coercive
in 𝜏 and is minimized by 𝜏★ = MP [ℓ(𝑍)]. Substituting 𝜏★ back into the objective


function yields the mean-MAD risk measure with risk-aversion coefficient 𝛽. 
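An analogous check applies to Proposition 5.9. Since the objective of (5.5) with 𝑔(𝜏) = 𝜏 + 𝛽|𝜏| is piecewise linear in 𝜏, a minimizer lies on the support of the loss distribution, so a scan over the support suffices. The data below are illustrative and chosen so that the median is unambiguous.

```python
import numpy as np

losses = np.array([1.0, 2.0, 4.0, 7.0])    # support of ell(Z) (illustrative)
probs = np.array([0.2, 0.35, 0.25, 0.2])   # distribution P
beta = 0.5

def oce_objective(tau):
    # tau + E_P[g(ell(Z) - tau)] with g(t) = t + beta * |t|
    d = losses - tau
    return tau + probs @ (d + beta * np.abs(d))

# The objective is piecewise linear in tau, so a minimizer lies on the support.
vals = np.array([oce_objective(t) for t in losses])
tau_star = losses[vals.argmin()]
oce = vals.min()

median = 2.0   # here P(ell <= 2) = 0.55 >= 1/2 and P(ell >= 2) = 0.8 >= 1/2
mean = probs @ losses
mad = probs @ np.abs(losses - median)
print(oce, mean + beta * mad, tau_star)   # OCE = mean + beta*MAD, minimizer = median
```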
Definition 5.10 (Conditional Value-at-Risk). The conditional VaR (CVaR) at level
𝛽 ∈ (0, 1) is the law-invariant risk measure denoted as 𝛽-CVaR with
𝛽-CVaR_P[ℓ(𝑍)] = inf_{𝜏∈R} 𝜏 + (1/𝛽) E_P[max{ℓ(𝑍) − 𝜏, 0}]. (5.7)
Note that 𝛽-CVaRP [ℓ(𝑍)] converges to E P [ℓ(𝑍)] as 𝛽 tends to 1. One can further
show that it converges to the essential supremum ess sup P [ℓ(𝑍)] as 𝛽 tends to 0.
Proposition 5.11 (CVaR). The CVaR at level 𝛽 ∈ (0, 1) is the optimized certainty
equivalent induced by the disutility function 𝑔(𝜏) = 𝛽^{-1} max{𝜏, 0}. The objective
function of problem (5.5) is coercive in 𝜏 and is minimized by 𝜏★ = 𝛽-VaRP [ℓ(𝑍)].
Proof. It is evident that problem (5.7) is an instance of problem (5.5) corresponding
to the given disutility function 𝑔. In addition, as 𝛽 ∈ (0, 1), it is evident that the
objective function of problem (5.7) is coercive in 𝜏. Finally, one readily verifies that
𝜏★ = 𝛽-VaRP [ℓ(𝑍)] solves the first-order optimality condition of the unconstrained
convex program (5.7) and thus constitutes a minimizer. 
By substituting 𝜏★ = 𝛽-VaRP [ℓ(𝑍)] into the objective function of problem (5.7),
it becomes now clear that 𝛽-CVaRP [ℓ(𝑍)] ≥ 𝛽-VaRP [ℓ(𝑍)]. If the loss ℓ(𝑍) has a
continuous distribution under P, then one can further use (5.7) to show that
𝛽-CVaRP [ℓ(𝑍)] = E P [ℓ(𝑍) |ℓ(𝑍) ≥ 𝛽-VaRP [ℓ(𝑍)] ] .
Hence, the CVaR at level 𝛽 coincides with the expectation of the upper 𝛽-tail of the
loss distribution, which implies that 𝛽-CVaRP [ℓ(𝑍)] is generically strictly larger
than 𝛽-VaRP [ℓ(𝑍)]. For details we refer to (Rockafellar and Uryasev 2000, 2002).
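The relations between (5.1) and (5.7) can be illustrated on a small discrete example (made-up data): the minimizer of the CVaR objective coincides with the 𝛽-VaR, and the optimal value equals the expectation of the upper 𝛽-tail of the loss distribution.

```python
import numpy as np

losses = np.array([0.0, 1.0, 2.0, 10.0])    # support of ell(Z) (illustrative)
probs = np.array([0.5, 0.3, 0.15, 0.05])    # distribution P
beta = 0.1

# beta-VaR from (5.1): the leftmost (1 - beta)-quantile of the loss distribution.
cdf = np.cumsum(probs)
var = losses[np.searchsorted(cdf, 1 - beta)]

# beta-CVaR from (5.7): the objective is piecewise linear in tau, so it
# suffices to scan the support points for a minimizer.
def cvar_objective(tau):
    return tau + probs @ np.maximum(losses - tau, 0.0) / beta

vals = np.array([cvar_objective(t) for t in losses])
cvar = vals.min()
tau_star = losses[vals.argmin()]
print(var, tau_star, cvar)   # tau* coincides with the beta-VaR
```

Here the upper 0.1-tail consists of the point 10 (mass 0.05) and part of the point 2, so the CVaR also matches the tail-expectation formula above.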
Definition 5.12 (Entropic Risk Measure). The entropic risk measure with risk-aver-
sion parameter 𝛽 ∈ (0, ∞) is the law-invariant risk measure denoted as 𝛽-ERM with
𝛽-ERM_P[ℓ(𝑍)] = (1/𝛽) log E_P[exp(𝛽ℓ(𝑍))]. (5.8)
Using a Taylor expansion, one can show that 𝛽-ERMP [ℓ(𝑍)] converges to the ex-
pected value E P [ℓ(𝑍)] as 𝛽 tends to 0. Similarly, one can show that 𝛽-ERMP [ℓ(𝑍)]
converges to the essential supremum ess sup P [ℓ(𝑍)] as 𝛽 tends to ∞.
Proposition 5.13 (Entropic Risk Measure). The entropic risk measure with risk-
aversion parameter 𝛽 ∈ (0, ∞) is the optimized certainty equivalent induced by the
disutility function 𝑔(𝜏) = 𝛽^{-1}(exp(𝛽𝜏) − 1). The objective function of problem (5.5)
is coercive in 𝜏 and is minimized by 𝜏★ = 𝛽^{-1} log(E_P[exp(𝛽ℓ(𝑍))]).
Proof. By the definition of 𝑔, we have
inf_{𝜏∈R} 𝜏 + E_P[𝑔(ℓ(𝑍) − 𝜏)] = inf_{𝜏∈R} 𝜏 + (1/𝛽) E_P[ exp(𝛽(ℓ(𝑍) − 𝜏)) − 1 ]
        = (1/𝛽) log E_P[exp(𝛽ℓ(𝑍))] = 𝛽-ERM_P[ℓ(𝑍)].
The second equality holds because the unconstrained convex minimization problem
over 𝜏 is uniquely solved by 𝜏★ = 𝛽^{-1} log(E_P[exp(𝛽ℓ(𝑍))]), which can be verified
by inspecting the problem’s first-order optimality condition. In addition, as 𝛽 > 0,
it is clear that the problem’s objective function is coercive in 𝜏. 
Kupper and Schachermayer (2009) show that, with the exception of the expected
value, the entropic risk measure is the only relevant law-invariant risk measure that
obeys the tower property. That is, for any random vectors 𝑍1 and 𝑍2 it satisfies
𝛽-ERMP [ℓ(𝑍2 )] = 𝛽-ERMP [𝛽-ERMP [ℓ(𝑍2 )|𝑍1 ]],
where the conditional entropic risk measure 𝛽-ERM_P[ℓ(𝑍_2)|𝑍_1] is defined in the
obvious way by replacing the unconditional expectation in (5.8) with a conditional
expectation. The entropic risk measure is often used for modeling risk-aversion in
dynamic optimization problems, where the dynamic consistency of the decisions
taken at different points in time is a concern. For example, it occupies center stage
in finance (Föllmer and Schied 2008), risk-sensitive control (Whittle 1990, Başar
and Bernhard 1995) and economics (Hansen and Sargent 2008).
Proposition 5.14 (Dual Representation of the Entropic Risk Measure). Assume that
E P̂ [ℓ(𝑍)] > −∞. Then, the entropic risk measure admits the dual representation
𝛽-ERM_P̂[ℓ(𝑍)] = sup_{P∈P(Z)} E_P[ℓ(𝑍)] − (1/𝛽) · KL(P, P̂).
Proof. Let 𝜙 be the entropy function of the Kullback-Leibler divergence. Thus, we
have 𝜙^*(𝑡) = e^𝑡 − 1 for all 𝑡 ∈ R; see Table 4.1. By Proposition 5.13, the entropic
risk measure is the optimized certainty equivalent induced by the disutility function
𝑔(𝑡) = 𝛽^{-1}(exp(𝛽𝑡) − 1) = 𝛽^{-1}𝜙^*(𝛽𝑡) = (𝛽^{-1}𝜙)^*(𝑡),
where the last equality uses (Rockafellar 1970, Theorem 16.1). This implies that
 
𝛽-ERM_P̂[ℓ(𝑍)] = inf_{𝜏∈R} 𝜏 + E_P̂[(𝛽^{-1}𝜙)^*(ℓ(𝑍) − 𝜏)]
        = sup_{P∈P(Z)} E_P[ℓ(𝑍)] − D_{𝛽^{-1}𝜙}(P, P̂)
        = sup_{P∈P(Z)} E_P[ℓ(𝑍)] − 𝛽^{-1} KL(P, P̂).

Here, the second equality follows from the strong duality relation (5.6), which
applies because E P̂ [ℓ(𝑍)] > −∞, and the third equality holds because the entropy
function 𝜙 was assumed to induce the Kullback-Leibler divergence. 
We remark that Proposition 5.14 can also be proved by leveraging the Donsker-
Varadhan formula from Proposition 2.9 in lieu of the duality relation (5.6).
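For a discrete reference distribution, the supremum in Proposition 5.14 is attained by the exponentially tilted distribution P* proportional to P̂ exp(𝛽ℓ), which is the essence of the Donsker-Varadhan formula. The sketch below (illustrative data, not from the survey) confirms that P* achieves the entropic risk measure while other feasible distributions do not exceed it.

```python
import numpy as np

rng = np.random.default_rng(0)
p_hat = np.array([0.3, 0.4, 0.2, 0.1])   # discrete reference distribution (illustrative)
ell = np.array([-1.0, 0.5, 2.0, 4.0])    # loss values
beta = 0.8

erm = np.log(p_hat @ np.exp(beta * ell)) / beta

def dual_objective(p):
    kl = p @ np.log(p / p_hat)           # KL(P, P-hat)
    return p @ ell - kl / beta

# The supremum is attained by the exponentially tilted (Gibbs) distribution
# P* proportional to p_hat * exp(beta * ell).
w = p_hat * np.exp(beta * ell)
p_star = w / w.sum()
print(erm, dual_objective(p_star))       # the two values coincide

# Any other feasible P yields a smaller objective value.
for _ in range(100):
    p = rng.dirichlet(np.ones(len(p_hat)))
    assert dual_objective(p) <= erm + 1e-9
```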
One can show that every optimized certainty equivalent 𝜚 is translation-invariant
and convex. If the underlying disutility function 𝑔 is non-decreasing, then 𝜚 is also


monotone, and if 𝑔 is positive homogeneous, then 𝜚 is also scale-invariant.
In the remainder we will show that if 𝜚 is any optimized certainty equivalent, then
the worst-case risk problem (5.4) can be reduced to a worst-case expectation problem
of the form (4.1). This reduction is predicated on a lopsided minimax theorem
to be derived below, and it allows us to extend the duality theory for worst-case
expectation problems of Section 4 to a rich class of worst-case risk problems.

5.2. Lopsided Minimax Theorems


A generic minimax problem can be represented as
inf_{𝑢∈U} sup_{𝑣∈V} 𝐻(𝑢, 𝑣),

where U and V are arbitrary spaces, and 𝐻 : U × V → R is an arbitrary function.


A minimax theorem provides conditions under which the infimum and supremum
operators can be interchanged without changing the problem’s optimal value. The
following minimax theorem inspired by (Rockafellar 1974, Example 13) will be
essential for solving worst-case risk problems with optimized certainty equivalents.
Recall from Section 4.1 that a convex function is closed if it is either proper and
lower semicontinuous or identically equal to −∞.
Theorem 5.15 (Lopsided Minimax Theorem). Suppose that U is an arbitrary
vector space and V is a locally convex topological vector space. Assume also that
the function 𝐻 : U × V → R is such that 𝐻(𝑢, 𝑣) is convex in 𝑢 and such that
−𝐻(𝑢, 𝑣) is convex and closed in 𝑣. If sup𝑣 ∈V inf 𝑢∈U 𝐻(𝑢, 𝑣) > −∞ and for every
𝛼 ∈ R there exists 𝑢 ∈ U such that {𝑣 ∈ V : 𝐻(𝑢, 𝑣) ≥ 𝛼} is compact, then we have
inf_{𝑢∈U} sup_{𝑣∈V} 𝐻(𝑢, 𝑣) = sup_{𝑣∈V} inf_{𝑢∈U} 𝐻(𝑢, 𝑣).

Proof. Let V^* be the topological dual of V, and define the bilinear form ⟨·, ·⟩ :
V^* × V → R through ⟨𝑣^*, 𝑣⟩ = 𝑣^*(𝑣). If we equip V^* with the weak topology
induced by V, then ⟨·, 𝑣⟩ is a continuous linear functional on V^* for every 𝑣 ∈ V,
and every continuous linear functional on V^* can be represented in this way.
Define 𝐹 : U × V^* → R through 𝐹(𝑢, 𝑣^*) = sup_{𝑣∈V} 𝐻(𝑢, 𝑣) − ⟨𝑣^*, 𝑣⟩, which is
jointly convex in 𝑢 and 𝑣^* thanks to Lemma 4.1. Thus, 𝐹(𝑢, 𝑣^*) = (−𝐻)^*(𝑢, −𝑣^*),
where the conjugate of −𝐻(𝑢, 𝑣) is evaluated with respect to its second argument 𝑣
only. As −𝐻(𝑢, 𝑣) is convex and closed in 𝑣, this implies via Lemma 4.2 that
𝐹^*(𝑢, 𝑣) = −𝐻(𝑢, −𝑣). Here, again, the conjugate of 𝐹(𝑢, 𝑣^*) is evaluated with
respect to its second argument 𝑣^* only. In addition, define ℎ : V^* → R through
ℎ(𝑣^*) = inf_{𝑢∈U} 𝐹(𝑢, 𝑣^*), which is convex in 𝑣^*. Thus, we find
ℎ(0) = inf_{𝑢∈U} 𝐹(𝑢, 0) = inf_{𝑢∈U} sup_{𝑣∈V} 𝐻(𝑢, 𝑣),

where the two equalities follow from the definitions of ℎ and 𝐹, respectively. In
Distributionally Robust Optimization 81

addition, we also have


ℎ^{**}(0) = sup_{𝑣∈V} −ℎ^*(−𝑣) = sup_{𝑣∈V} inf_{𝑣^*∈V^*} ⟨𝑣^*, 𝑣⟩ + ℎ(𝑣^*)
        = sup_{𝑣∈V} inf_{𝑢∈U} inf_{𝑣^*∈V^*} ⟨𝑣^*, 𝑣⟩ + 𝐹(𝑢, 𝑣^*)
        = sup_{𝑣∈V} inf_{𝑢∈U} −𝐹^*(𝑢, −𝑣) = sup_{𝑣∈V} inf_{𝑢∈U} 𝐻(𝑢, 𝑣),

where the first two equalities follow from the definitions of the bi-conjugate ℎ∗∗ and
the conjugate ℎ∗ , respectively, and the third equality exploits the definition of ℎ.
The fourth equality follows from the definition of the conjugate 𝐹 ∗, and the last
equality holds because 𝐹 ∗(𝑢, 𝑣) = −𝐻(𝑢, −𝑣). Thus, the desired minimax result
holds if we manage to prove that ℎ(0) = ℎ∗∗ (0).
By the definitions of ℎ^* and ℎ and by the relation 𝐹^*(𝑢, 𝑣) = −𝐻(𝑢, −𝑣), we have
{𝑣 ∈ V : ℎ^*(𝑣) ≤ 𝛼} = { 𝑣 ∈ V : sup_{𝑣^*∈V^*} ⟨𝑣^*, 𝑣⟩ − ℎ(𝑣^*) ≤ 𝛼 }
        = { 𝑣 ∈ V : sup_{𝑢∈U} sup_{𝑣^*∈V^*} ⟨𝑣^*, 𝑣⟩ − 𝐹(𝑢, 𝑣^*) ≤ 𝛼 }
        = { 𝑣 ∈ V : sup_{𝑢∈U} −𝐻(𝑢, −𝑣) ≤ 𝛼 }
        = − ⋂_{𝑢∈U} {𝑣 ∈ V : 𝐻(𝑢, 𝑣) ≥ −𝛼}

for any 𝛼 ∈ R. Hence, {𝑣 ∈ V : ℎ^*(𝑣) ≤ 𝛼} is representable as an intersection
of closed sets, at least one of which is compact. Therefore, the intersection is
also compact. Selecting 𝛼 > inf_{𝑣∈V} ℎ^*(𝑣), which is possible because
sup𝑣 ∈V inf 𝑢∈U 𝐻(𝑢, 𝑣) > −∞ implies that inf 𝑣 ∈V ℎ∗ (𝑣) < ∞, we further ensure that
the compact set {𝑣 ∈ V : ℎ∗ (𝑣) ≤ 𝛼} is non-empty. This implies via (Rockafellar
1974, Theorem 10 (b)) that ℎ∗∗ (𝑣 ∗ ) and ℎ(𝑣 ∗ ) are both bounded above on a neigh-
borhood of 0. By (Rockafellar 1974, Theorem 17 (a)), this in turn implies that
ℎ(0) = ℎ∗∗ (0), which establishes the desired minimax equality. 
Swapping the roles of 𝑢 and 𝑣 leads to the following immediate corollary.
Corollary 5.16 (Reverse Lopsided Minimax Theorem). Suppose that U is a locally
convex topological vector space and V is an arbitrary vector space. Assume also
that the function 𝐻 : U × V → R is such that 𝐻(𝑢, 𝑣) is convex and closed in 𝑢
and such that −𝐻(𝑢, 𝑣) is convex in 𝑣. If inf 𝑢∈U sup 𝑣 ∈V 𝐻(𝑢, 𝑣) < ∞ and for every
𝛼 ∈ R there exists 𝑣 ∈ V such that {𝑢 ∈ U : 𝐻(𝑢, 𝑣) ≤ 𝛼} is compact, then we have
inf_{𝑢∈U} sup_{𝑣∈V} 𝐻(𝑢, 𝑣) = sup_{𝑣∈V} inf_{𝑢∈U} 𝐻(𝑢, 𝑣).

A function ℎ 𝑣 (𝑢) = 𝐻(𝑢, 𝑣) whose sublevel sets {𝑢 ∈ U : ℎ 𝑣 (𝑢) ≤ 𝛼} are all


compact is commonly referred to as inf-compact (Hartung 1982). The following
lemma provides an easily checkable sufficient condition for the inf-compactness of


ℎ_𝑣(𝑢) in case U is a Euclidean space. To this end, recall that a function ℎ_𝑣 is coercive
if for every sequence {𝑢_𝑖}_{𝑖∈N} with lim_{𝑖→∞} ‖𝑢_𝑖‖_2 = ∞, we have lim_{𝑖→∞} ℎ_𝑣(𝑢_𝑖) = ∞.
Lemma 5.17 (Inf-Compactness). Suppose that U is a Euclidean space and 𝐻 :
U × V → R is lower semicontinuous and coercive in its first argument. Then, the
sublevel sets {𝑢 ∈ U : 𝐻(𝑢, 𝑣) ≤ 𝛼} are compact for all 𝑣 ∈ V and 𝛼 ∈ R.
Proof. To show that the sublevel set U 𝛼 (𝑣) = {𝑢 ∈ U : 𝐻(𝑢, 𝑣) ≤ 𝛼} is compact,
note first that U 𝛼 (𝑣) is closed because 𝐻(𝑢, 𝑣) is lower semicontinuous in 𝑢. In order
to prove that U 𝛼 (𝑣) is also bounded, assume for the sake of contradiction that there
exists a sequence {𝑢_𝑖}_{𝑖∈N} ⊆ U_𝛼(𝑣) with lim_{𝑖→∞} ‖𝑢_𝑖‖_2 = ∞. As 𝐻(𝑢, 𝑣) is coercive
in 𝑢, we have lim_{𝑖→∞} 𝐻(𝑢_𝑖, 𝑣) = ∞. However, this contradicts the assumption that
𝐻(𝑢_𝑖, 𝑣) ≤ 𝛼 for all 𝑖 ∈ N. Thus, U_𝛼(𝑣) must be bounded and hence compact. 
Note that if 𝐻0 : U0 × V0 → R is defined on convex sets U0 ⊆ U and V0 ⊆ V,
then it can be extended to a function 𝐻 : U × V → R on the underlying vector
spaces U and V by setting

𝐻(𝑢, 𝑣) = 𝐻_0(𝑢, 𝑣)  if 𝑢 ∈ U_0 and 𝑣 ∈ V_0,
          +∞         if 𝑢 ∉ U_0 and 𝑣 ∈ V_0,
          −∞         if 𝑣 ∉ V_0.
This construction guarantees that
inf_{𝑢∈U} sup_{𝑣∈V} 𝐻(𝑢, 𝑣) = inf_{𝑢∈U_0} sup_{𝑣∈V_0} 𝐻_0(𝑢, 𝑣) and sup_{𝑣∈V} inf_{𝑢∈U} 𝐻(𝑢, 𝑣) = sup_{𝑣∈V_0} inf_{𝑢∈U_0} 𝐻_0(𝑢, 𝑣).

It also guarantees that if 𝐻0 (𝑢, 𝑣) is convex and closed in 𝑢 and concave in 𝑣, then
so is 𝐻(𝑢, 𝑣). Thus, the feasible sets in any convex-concave minimax problem can
always be extended to the underlying vector spaces without changing the problem.
We now leverage Corollary 5.16 to derive a minimax theorem for optimized
certainty equivalents. This result exploits the inf-compactness of the objective
function of problem (5.5) in 𝜏. Shafiee and Kuhn (2024) establish similar minimax
theorems for a more general class of regular risk and deviation measures introduced
by Rockafellar and Uryasev (2013).
Theorem 5.18 (Minimax Theorem for Optimized Certainty Equivalents). Suppose
that P ⊆ P(Z) is non-empty and convex, 𝜚 is any optimized certainty equivalent
induced by a disutility function 𝑔, supP∈P E P [𝑔(ℓ(𝑍))] < ∞, and E P [ℓ(𝑍)] > −∞
for all P ∈ P. Then, 𝐺(𝜏, P) = 𝜏 + E P [𝑔(ℓ(𝑍) − 𝜏)] for 𝜏 ∈ R and P ∈ P satisfies
sup_{P∈P} 𝜚_P[ℓ(𝑍)] = sup_{P∈P} inf_{𝜏∈R} 𝐺(𝜏, P) = inf_{𝜏∈R} sup_{P∈P} 𝐺(𝜏, P).

Proof. Note first that 𝐺(𝜏, P) is convex in 𝜏 and concave (in fact, linear) in P. In
addition, 𝐺(𝜏, P) is closed in 𝜏. To see this, observe that
lim inf_{𝜏′→𝜏} 𝐺(𝜏′, P) = lim inf_{𝜏′→𝜏} E_P[𝜏′ + 𝑔(ℓ(𝑍) − 𝜏′)]
        ≥ E_P[ lim inf_{𝜏′→𝜏} 𝜏′ + 𝑔(ℓ(𝑍) − 𝜏′) ]
        ≥ E_P[𝜏 + 𝑔(ℓ(𝑍) − 𝜏)] = 𝐺(𝜏, P),
where the two inequalities follow from Fatou’s lemma and the continuity of 𝑔,
respectively. Fatou’s lemma applies because any disutility function satisfies 𝑔(𝜏) ≥
𝜏 for all 𝜏 ∈ R, which implies that 𝜏 + 𝑔(ℓ(𝑧) − 𝜏) ≥ ℓ(𝑧) for all 𝑧 ∈ Z and 𝜏 ∈ R.
Note also that E P [ℓ(𝑍)] is finite by assumption. Next, we show that 𝐺(𝜏, P) is
inf-compact in 𝜏. To this end, recall that 𝑔(0) = 0 and 𝑔(𝜏) > 𝜏 for all 𝜏 ≠ 0. As 𝑔
is also convex, this implies that 𝑔(𝜏) must grow faster than 𝜏 as 𝜏 tends to +∞ and
that 𝑔(𝜏) must decay slower than 𝜏 as 𝜏 tends to −∞. Hence, there exists 𝜀 > 0
with 𝑔(𝜏) ≥ (1 + 𝜀)𝜏 − 1 and 𝑔(𝜏) ≥ (1 − 𝜀)𝜏 − 1 for all 𝜏 ∈ R. For a formal proof
of this assertion we refer to (Zhen et al. 2023, Lemma C.10). This implies that
𝐺(𝜏, P) ≥ 𝜏 + (1 + 𝜀) (E P [ℓ(𝑍)] − 𝜏) − 1 = −𝜀𝜏 + (1 + 𝜀)E P [ℓ(𝑍)] − 1
and
𝐺(𝜏, P) ≥ 𝜏 + (1 − 𝜀)(E_P[ℓ(𝑍)] − 𝜏) − 1 = 𝜀𝜏 + (1 − 𝜀)E_P[ℓ(𝑍)] − 1
for all 𝜏 ∈ R, and thus {𝜏 ∈ R : 𝐺(𝜏, P) ≤ 𝛼} is compact for every 𝛼 ∈ R.
Next, set U = R, and define V = M(R𝑑 ) as the space of all finite signed Borel
measures on R𝑑 . In addition, define the function 𝐻 : U × V → R through

𝐻(𝑢, 𝑣) = 𝐺(𝑢, 𝑣)  if 𝑣 ∈ P,
          −∞       if 𝑣 ∉ P.
By construction, 𝐻(𝑢, 𝑣) is convex and closed in 𝑢 and concave in 𝑣. Recall
from Section 4.1 that a convex function is closed if it is either proper and lower
semicontinuous or identically equal to −∞. In addition, we have
sup_{𝑣∈V} 𝐻(0, 𝑣) = sup_{P∈P} 𝐺(0, P) = sup_{P∈P} E_P[𝑔(ℓ(𝑍))] < ∞,
and the sublevel sets {𝑢 ∈ U : 𝐻(𝑢, 𝑣) ≤ 𝛼} are compact for every 𝛼 ∈ R provided
that 𝑣 ∈ P. The claim thus follows from Corollary 5.16. 
Theorem 5.18 implies that if 𝛽 ∈ (0, 1), then the worst-case 𝛽-CVaR satisfies
sup_{P∈P} 𝛽-CVaR_P[ℓ(𝑍)] = inf_{𝜏∈R} 𝜏 + (1/𝛽) sup_{P∈P} E_P[max{ℓ(𝑍) − 𝜏, 0}] (5.9)
for any non-empty convex ambiguity set P ⊆ P(Z) provided that E P [|ℓ(𝑍)|] < ∞
for all P ∈ P. In the extant literature, the interchange of the supremum over P and
the infimum over 𝜏 is often justified with Sion’s minimax theorem (Sion 1958).
However, many studies overlook that Sion’s minimax theorem only applies if P
is weakly compact and E P [max{ℓ(𝑍) − 𝜏, 0}] is weakly upper semicontinuous
in P. As shown in Section 3, unfortunately, many popular ambiguity sets fail to
be weakly compact. In addition, E P [max{ℓ(𝑍) − 𝜏, 0}] fails to be weakly upper
semicontinuous unless the loss function ℓ is upper semicontinuous and bounded on Z; see Proposition 3.3. All non-trivial convex loss functions on R𝑑 violate this
condition. In contrast, Theorem 5.18 offers a more general result that exploits the
inf-compactness in 𝜏 but obviates any restrictive topological conditions on P or ℓ.
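The identity (5.9) can be verified numerically on a toy instance. The sketch below (our own illustrative data; Python with NumPy assumed) takes P to be the convex hull of two discrete distributions on a common finite support, so that the inner supremum on the right-hand side of (5.9) is attained at a vertex of the ambiguity set, and approximates both sides by grid search over 𝜏 and over the mixture weight.

```python
import numpy as np

# Toy instance for identity (5.9): the ambiguity set is conv{P1, P2}, two
# discrete distributions on a common finite support (all data made up here).
losses = np.array([-1.0, 0.0, 2.0, 5.0])   # values of ell(Z) on the support
P1 = np.array([0.4, 0.3, 0.2, 0.1])
P2 = np.array([0.1, 0.2, 0.3, 0.4])
beta = 0.2
taus = np.linspace(losses.min(), losses.max(), 601)

def cvar_objective(tau, p):
    # tau + E_p[max{ell(Z) - tau, 0}]/beta; minimizing over tau gives beta-CVaR
    return tau + p @ np.maximum(losses - tau, 0.0) / beta

# Left-hand side of (5.9): sup of the beta-CVaR over mixtures of P1 and P2.
lhs = max(min(cvar_objective(t, lam * P1 + (1 - lam) * P2) for t in taus)
          for lam in np.linspace(0.0, 1.0, 101))

# Right-hand side of (5.9): the inner sup of a linear functional over
# conv{P1, P2} is attained at a vertex, so it reduces to a maximum of two terms.
rhs = min(t + max(P1 @ np.maximum(losses - t, 0.0),
                  P2 @ np.maximum(losses - t, 0.0)) / beta for t in taus)

print(round(lhs, 4), round(rhs, 4))
assert abs(lhs - rhs) < 1e-6
```

On this instance the two grid searches return the same value, in line with Theorem 5.18; for continuous supports the grids would only approximate the optima.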

5.3. Moment Ambiguity Sets
Recall that the generic moment ambiguity set (2.1) is defined as
P = {P ∈ P 𝑓 (Z) : E P [ 𝑓 (𝑍)] ∈ F},
where Z ⊆ R𝑑 is a non-empty closed support set, 𝑓 : Z → R𝑚 is a Borel measurable moment function, F ⊆ R𝑚 is a non-empty closed moment uncertainty set, and P 𝑓 (Z) denotes the family of all distributions P ∈ P(Z) for which E P [ 𝑓 (𝑍)] is finite. Recall also that C = {E P [ 𝑓 (𝑍)] : P ∈ P 𝑓 (Z)} represents the family of all possible moments of any distribution on Z. The next theorem establishes a duality
possible moments of any distribution on Z. The next theorem establishes a duality
result for the worst-case risk problem (5.4) with a moment ambiguity set.
Theorem 5.19 (Duality Theory for Moment Ambiguity Sets II). If P is the moment
ambiguity set (2.1) and 𝜚 is an optimized certainty equivalent induced by a disutility
function 𝑔, then the following weak duality relation holds.
sup_{P∈P} 𝜚P [ℓ(𝑍)] ≤ inf 𝜏 + 𝜆0 + 𝛿F∗ (𝜆)
                      s.t. 𝜏, 𝜆0 ∈ R, 𝜆 ∈ R𝑚                               (5.10)
                           𝜆0 + 𝑓 (𝑧)⊤ 𝜆 ≥ 𝑔(ℓ(𝑧) − 𝜏) ∀𝑧 ∈ Z
If supP∈P E P [𝑔(ℓ(𝑍))] < ∞, E P [ℓ(𝑍)] > −∞ for all P ∈ P 𝑓 (Z), and F ⊆ C is a
convex and compact set with rint(F) ⊆ rint(C), then strong duality holds, that is,
the inequality (5.10) becomes an equality.
Proof. The max-min inequality implies that
sup_{P∈P} 𝜚P [ℓ(𝑍)] = sup_{P∈P} inf_{𝜏∈R} 𝜏 + E P [𝑔(ℓ(𝑍) − 𝜏)] ≤ inf_{𝜏∈R} sup_{P∈P} 𝜏 + E P [𝑔(ℓ(𝑍) − 𝜏)].
The inner maximization problem in the resulting upper bound constitutes a worst-
case expectation problem. Hence, it is bounded above by the dual problem de-
rived in Theorem 4.5. Substituting this dual problem into the above expression
yields (5.10). Strong duality follows from the minimax theorem for optimized
certainty equivalents (Theorem 5.18) and the strong duality result for worst-case
expectation problems (Theorem 4.5), which apply under the given assumptions. 
The semi-infinite constraint in (5.10) involves the composite function 𝑔(ℓ(𝑧)− 𝜏),
which fails to be concave in 𝑧 even if 𝑔 is non-decreasing and ℓ is concave. Thus,
checking whether a given (𝜏, 𝜆 0 , 𝜆) satisfies the semi-infinite constraint in (5.10) is
generically hard. In fact, Chen and Sim (2024, Theorem 1) prove that evaluating the
worst-case entropic risk is NP-hard even if ℓ is linear and P is a Markov ambiguity
set. Hence, while providing theoretical insights, Theorem 5.19 does not necessarily pave the way towards an efficient method for solving worst-case risk problems of the form (5.4). Nevertheless, Theorem 5.19 provides a concise reformulation for (5.4) that is amenable to approximate iterative solution procedures.

5.4. 𝜙-Divergence Ambiguity Sets
Recall that the 𝜙-divergence ambiguity set (2.10) is defined as
P = {P ∈ P(Z) : D 𝜙 (P, P̂) ≤ 𝑟},
where Z is a closed support set, 𝑟 ≥ 0 is a size parameter, 𝜙 is an entropy
function in the sense of Definition 2.4, D 𝜙 is the corresponding 𝜙-divergence in
the sense of Definition 2.5, and P̂ ∈ P(Z) is a reference distribution. The next
theorem establishes a duality result for worst-case risk problems over 𝜙-divergence
ambiguity sets. The proof follows from Theorems 4.14 and 5.18 and is thus omitted.
Theorem 5.20 (Duality Theory for 𝜙-Divergence Ambiguity Sets II). Assume that
E P̂ [ℓ(𝑍)] > −∞. If P is the 𝜙-divergence ambiguity set (2.10), and 𝜚 is an
optimized certainty equivalent induced by a disutility function 𝑔, then the following
weak duality relation holds.
sup_{P∈P} 𝜚P [ℓ(𝑍)] ≤ inf_{𝜏,𝜆0 ∈R, 𝜆∈R+} 𝜏 + 𝜆0 + 𝜆𝑟 + E P̂ [(𝜙∗ ) 𝜋 (𝑔(ℓ(𝑍) − 𝜏) − 𝜆0 , 𝜆)]
                      s.t. 𝜆0 + 𝜆 𝜙∞ (1) ≥ sup_{𝑧∈Z} 𝑔(ℓ(𝑧) − 𝜏)
If supP∈P E P [𝑔(ℓ(𝑍))] < ∞, E P [ℓ(𝑍)] > −∞ for all P ∈ P, 𝑟 > 0 and 𝜙 is con-
tinuous at 1, then strong duality holds, that is, the inequality becomes an equality.
A duality result akin to Theorem 5.20 also holds for worst-case risk problems
over restricted 𝜙-divergence ambiguity sets of the form
P = {P ∈ P(Z) : P ≪ P̂, D 𝜙 (P, P̂) ≤ 𝑟}.
The proof of the next theorem follows immediately from Theorems 4.15 and 5.18.
Theorem 5.21 (Duality Theory for Restricted 𝜙-Divergence Ambiguity Sets II).
Assume that E P̂ [ℓ(𝑍)] > −∞. If P is the restricted 𝜙-divergence ambiguity
set (2.11), and 𝜚 is an optimized certainty equivalent induced by a disutility func-
tion 𝑔, then the following weak duality relation holds.
sup_{P∈P} 𝜚P [ℓ(𝑍)] ≤ inf_{𝜏,𝜆0 ∈R, 𝜆∈R+} 𝜏 + 𝜆0 + 𝜆𝑟 + E P̂ [(𝜙∗ ) 𝜋 (𝑔(ℓ(𝑍) − 𝜏) − 𝜆0 , 𝜆)].
If supP∈P E P [𝑔(ℓ(𝑍))] < ∞, E P [ℓ(𝑍)] > −∞ for all P ∈ P, 𝑟 > 0 and 𝜙 is con-
tinuous at 1, then strong duality holds, that is, the inequality becomes an equality.

5.5. Optimal Transport Ambiguity Sets

Recall that the optimal transport ambiguity set (2.27) is defined as
P = {P ∈ P(Z) : OT𝑐 (P, P̂) ≤ 𝑟},
where Z is a closed support set, 𝑟 ≥ 0 is a size parameter, 𝑐 is a transportation cost function in the sense of Definition 2.14, OT𝑐 is the corresponding optimal
transport discrepancy in the sense of Definition 2.15, and P̂ ∈ P(Z) is a reference
distribution. The next theorem establishes a duality result for worst-case risk
problems over optimal transport ambiguity sets. Its proof follows immediately
from Theorems 4.18 and 5.18 and is thus omitted.
Theorem 5.22 (Duality Theory for Optimal Transport Ambiguity Sets II). Assume that E P̂ [ℓ(𝑍̂)] > −∞ and ℓ is upper semicontinuous. If P is the optimal transport ambiguity set defined in (2.27) and 𝜚 is an optimized certainty equivalent induced by a disutility function 𝑔, then the following weak duality relation holds.
sup_{P∈P} 𝜚P [ℓ(𝑍)] ≤ inf_{𝜏∈R, 𝜆∈R+} 𝜏 + 𝜆𝑟 + E P̂ [ sup_{𝑧∈Z} 𝑔(ℓ(𝑧) − 𝜏) − 𝜆𝑐(𝑧, 𝑍̂) ].
If supP∈P E P [𝑔(ℓ(𝑍))] < ∞, E P [ℓ(𝑍)] > −∞ for all P ∈ P and 𝑟 > 0, then strong
duality holds, that is, the inequality becomes an equality.
Worst-case risk problems with optimal transport ambiguity sets are studied by
Pflug and Wozabal (2007), Pichler (2013) and Wozabal (2014) in the context of
portfolio selection with linear loss functions and by Mohajerin Esfahani et al.
(2018) in the context of inverse optimization using the CVaR. Sadana, Delage and
Georghiou (2024) investigate worst-case entropic risk measures over ∞-Wasserstein
balls and establish tractable reformulations under standard convexity assumptions.
Kent, Li, Blanchet and Glynn (2021) and Sheriff and Mohajerin Esfahani (2023)
develop customized Frank-Wolfe algorithms in the space of probability distributions to address worst-case risk problems involving generic loss functions and risk
measures. Specifically, Kent et al. (2021) work with Wasserstein gradient flows and
use the corresponding notions of smoothness to establish the convergence of their
Frank-Wolfe algorithm. In contrast, Sheriff and Mohajerin Esfahani (2023) work
with Gâteaux derivatives, which leads to a different notion of smoothness and thus
to a different convergence analysis. Both algorithms display sublinear convergence
rates. When the reference distribution P̂ is discrete or when only samples from
P̂ are used, the algorithms’ iterates represent discrete distributions with progress-
ively increasing bit sizes. Theorem 5.22 provides a compact, albeit potentially
nonconvex, reformulation of the worst-case risk problem. This reformulation is
amenable to primal-dual gradient methods in the finite-dimensional space of the
dual variables, which are guaranteed to converge to a stationary point.
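As a quick sanity check of Theorem 5.22, the sketch below (our own toy data; Python with NumPy assumed) specializes the bound to a case that can be solved by hand: 𝜚 is the expectation (so 𝜏 cancels from the dual), ℓ(𝑧) = 𝑧, 𝑐(𝑧, 𝑧̂) = |𝑧 − 𝑧̂| and P̂ is discrete, in which case the worst-case expectation over a ball of radius 𝑟 equals E P̂ [𝑍̂] + 𝑟. The inner supremum is evaluated analytically in the code.

```python
import numpy as np

# Dual of Theorem 5.22 for rho = expectation, ell(z) = z, c(z, zhat) = |z - zhat|,
# and a discrete reference distribution (the data below are our own example).
zhat = np.array([-0.5, 0.3, 1.2, 2.0])   # atoms of Phat, with equal weights
r = 0.7                                   # radius of the ambiguity set

def inner_sup(lam, z0):
    # sup_z (z - lam * |z - z0|) equals z0 if lam >= 1 and +infinity otherwise
    return z0 if lam >= 1.0 else np.inf

dual = min(lam * r + np.mean([inner_sup(lam, z0) for z0 in zhat])
           for lam in np.linspace(0.0, 5.0, 5001))

# Primal check: shifting every atom of Phat by r has transport cost exactly r
# and raises the mean by r, so sup_P E_P[Z] = mean(Phat) + r.
primal = zhat.mean() + r

print(round(dual, 6), round(primal, 6))
assert abs(dual - primal) < 1e-9
```

The dual objective is infinite for 𝜆 < 1 and attains its minimum at 𝜆 = 1, reproducing the primal value exactly.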
Worst-case risk problems represent special instances of optimization problems
over spaces of probability distributions. The mainstream methods to address such
problems leverage the machinery of Wasserstein gradient flows (Ambrosio, Gigli
and Savaré 2008). Wasserstein gradient flows have recently been used in the context
of distributionally robust optimization problems (Lanzetti, Bolognani and Dörfler
2022, Lanzetti, Terpin and Dörfler 2024, Xu, Lee, Cheng and Xie 2024), nonconvex
optimization (Chizat and Bach 2018, Chizat 2022) or variational inference (Jiang,
Chewi and Pooladian 2024, Lambert, Chewi, Bach, Bonnabel and Rigollet 2022,
Diao, Balasubramanian, Chewi and Salim 2023, Zhang and Zhou 2020). The
results of this section are new and complementary to these existing works.

6. Analytical Solutions of Nature’s Subproblem
A key challenge in DRO is to handle the worst-case expectation problem embedded
in (1.2). This problem is solved by the fictitious adversary—commonly thought of
as nature— once the decision-maker has committed to an 𝑥 ∈ X . It maximizes a
linear function over a convex subset of an infinite-dimensional space of measures
and thus appears to be intractable. Therefore, considerable research effort has been
devoted to identifying conditions under which this problem is efficiently solvable.
We now show that it can actually be solved analytically in interesting situations.
The duality theory derived in Section 4 motivates the following simple strategy
for finding analytical solutions of nature’s subproblem. Construct feasible solutions
for the primal worst-case expectation problem and its dual, and show that their
objective function values match. If such matching solutions can be found, then
both of them must be optimal in their respective optimization problems thanks to
weak duality. As we will see below, this simple strategy succeeds surprisingly
often. In addition, we will see that analytical solutions for worst-case expectation
problems can sometimes be generalized to analytical solutions for worst-case risk
problems of the form (5.4). The material reviewed in this section covers several
decades of research in DRO from the 1950s until the present day.

6.1. Jensen Bound
Consider the worst-case expectation problem
sup_{P∈P(Z)} {E P [ℓ(𝑍)] : E P [𝑍] = 𝜇} ,   (6.1a)
which maximizes the expected value of ℓ(𝑍) over the Markov ambiguity set of all
distributions supported on Z with mean 𝜇. The Markov ambiguity set is a moment
ambiguity set of the form (2.1) with 𝑓 (𝑧) = 𝑧 and F = {𝜇}. By Theorem 4.5 and
as the support function of F is linear, the problem dual to (6.1a) is given by
inf_{𝜆0 ∈R, 𝜆∈R𝑑} {𝜆0 + 𝜆⊤ 𝜇 : 𝜆0 + 𝜆⊤ 𝑧 ≥ ℓ(𝑧) ∀𝑧 ∈ Z} .   (6.1b)
Intuitively, this dual problem aims to find an affine function 𝑎(𝑧) = 𝜆 0 + 𝜆⊤ 𝑧 that
majorizes the loss function ℓ(𝑧) on Z and has minimal expected value E P [𝑎(𝑍)]
under any distribution P feasible in the primal problem (6.1a).
Proposition 6.1 (Jensen Bound). Suppose that Z is convex, 𝜇 ∈ Z, ℓ is concave,
and 𝜆★ is any supergradient of ℓ at 𝜇. Then, the primal problem (6.1a) is solved
by P★ = 𝛿 𝜇 , and the dual problem (6.1b) is solved by (𝜆★0 , 𝜆★), where 𝜆★0 =
ℓ(𝜇) − 𝜇⊤𝜆★. In addition, the optimal values of (6.1a) and (6.1b) both equal ℓ(𝜇).
Proof. By construction, P★ is feasible in the primal worst-case expectation problem, and its objective function value amounts to ℓ(𝜇). In addition, (𝜆★0 , 𝜆★) is feasible
in the dual robust optimization problem because 𝜆★ is a supergradient of ℓ at 𝜇,
and its objective function value amounts to ℓ(𝜇), too. Hence, by weak duality as
established in Theorem 4.5, P★ is primal optimal, and (𝜆★0 , 𝜆★) is dual optimal. 
Proposition 6.1 implies Jensen’s inequality E P [ℓ(𝑍)] ≤ E P★ [ℓ(𝑍)] = ℓ(E P [𝑍]),
which holds for all distributions P feasible in (6.1a) (Jensen 1906). Proposition 6.1
further shows that (6.1b) is solved by any affine function tangent to ℓ at 𝜇.
If the loss function ℓ(𝑥, 𝑧) in the DRO problem (1.2) is concave in 𝑧 for any
fixed 𝑥 ∈ X , then Proposition 6.1 implies that the same distribution P★ solves
the inner maximization problem in (1.2) for every 𝑥 ∈ X . Hence, the DRO
problem (1.2) reduces to the (non-robust) stochastic program inf 𝑥 ∈X E P★ [ℓ(𝑥, 𝑍)].
Jensen’s inequality is traditionally used to approximate hard stochastic optimization problems of the form inf 𝑥 ∈X E P [ℓ(𝑥, 𝑍)], where P is a known continuous
distribution of 𝑍. Proposition 6.1 implies that if ℓ(𝑥, 𝑧) is concave in 𝑧 for any 𝑥 ∈ X ,
then replacing P with P★ = 𝛿E P [ 𝑍 ] leads to a conservative approximation of this
stochastic program. As P★ is discrete (in fact, a Dirac distribution), the resulting
approximate problem is much easier to solve. Its approximation quality can be im-
proved by partitioning Z into finitely many convex cells and constructing separate
Jensen bounds for all cells (Birge and Louveaux 2011, Section 10.1).
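Proposition 6.1 is easy to probe numerically: under a concave loss, no distribution with mean 𝜇 can beat the Dirac distribution 𝛿𝜇. A minimal sketch (toy concave loss and randomly sampled feasible distributions, all our own; Python with NumPy assumed):

```python
import numpy as np

# Check the Jensen bound of Proposition 6.1: for a concave loss ell, every
# distribution with mean mu satisfies E_P[ell(Z)] <= ell(mu) = E_{P*}[ell(Z)].
rng = np.random.default_rng(0)
ell = lambda z: -np.sum(z ** 2, axis=-1) + z[..., 0]   # a concave loss on R^2
mu = np.array([0.3, -0.2])

bound = ell(mu)   # value of the Jensen bound, attained by P* = delta_mu
for _ in range(100):
    # random discrete distribution, re-centred so its mean is exactly mu
    atoms = rng.normal(size=(50, 2))
    probs = rng.dirichlet(np.ones(50))
    atoms = atoms + (mu - probs @ atoms)
    assert abs(probs @ atoms[:, 0] - mu[0]) < 1e-9     # feasibility check
    assert probs @ ell(atoms) <= bound + 1e-9          # Jensen's inequality
print(round(float(bound), 4))
```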

6.2. Edmundson-Madansky Bound
The worst-case expectation problem (6.1a) over a Markov ambiguity set and its
dual (6.1b) can also be solved in closed form if ℓ is convex and Z is a simplex.
Proposition 6.2 (Edmundson-Madansky Bound). Suppose that Z is the probability simplex in R𝑑 with vertices 𝑒𝑖 , 𝑖 ∈ [𝑑], 𝜇 ∈ rint(Z), and ℓ is convex and real-valued. Then, the primal problem (6.1a) is solved by P★ = Σ_{𝑖=1}^𝑑 𝜇𝑖 𝛿𝑒𝑖 , and the dual problem (6.1b) is solved by (𝜆★0 , 𝜆★), where 𝜆★0 = 0 and 𝜆★𝑖 = ℓ(𝑒𝑖 ) for all 𝑖 ∈ [𝑑]. In addition, the optimal values of (6.1a) and (6.1b) both equal Σ_{𝑖=1}^𝑑 𝜇𝑖 ℓ(𝑒𝑖 ).
Proof. As 𝜇 belongs to the probability simplex, P★ is feasible in the primal worst-case expectation problem with objective function value Σ_{𝑖=1}^𝑑 𝜇𝑖 ℓ(𝑒𝑖 ). Also, as ℓ is convex, Jensen’s inequality implies that
𝜆★0 + 𝑧⊤ 𝜆★ = Σ_{𝑖=1}^𝑑 𝑧𝑖 ℓ(𝑒𝑖 ) ≥ ℓ(Σ_{𝑖=1}^𝑑 𝑧𝑖 𝑒𝑖 ) = ℓ(𝑧) ∀𝑧 ∈ Z.

We conclude that (𝜆★0 , 𝜆★) is feasible in the dual robust optimization problem, and its objective function value amounts to Σ_{𝑖=1}^𝑑 𝜇𝑖 ℓ(𝑒𝑖 ), too. Hence, by weak duality as established in Theorem 4.5, P★ is primal optimal, and (𝜆★0 , 𝜆★) is dual optimal. □
Proposition 6.2 implies the Edmundson-Madansky inequality, which states that E P [ℓ(𝑍)] ≤ E P★ [ℓ(𝑍)] = Σ_{𝑖=1}^𝑑 E P [𝑍𝑖 ] ℓ(𝑒𝑖 ) for all distributions P feasible in (6.1a)
(Edmundson 1956, Madansky 1959), and it shows that (6.1b) is solved by an affine
function that touches ℓ at the vertices 𝑒𝑖 , 𝑖 ∈ [𝑑], of Z. We emphasize, however,
that Proposition 6.2 remains valid with minor modifications if Z is an arbitrary
regular simplex in R𝑑 , that is, the convex hull of 𝑑 + 1 affinely independent vectors
𝑣 𝑖 ∈ R𝑑 , 𝑖 ∈ [𝑑 + 1]; see (Birge and Wets 1986, Gassmann and Ziemba 1986).
If the loss function ℓ(𝑥, 𝑧) in (1.2) is convex in 𝑧 for any fixed 𝑥 ∈ X , then
Proposition 6.2 implies that the DRO problem (1.2) is equivalent to the stochastic
program inf 𝑥 ∈X E P★ [ℓ(𝑥, 𝑍)], where P★ is independent of 𝑥. As P★ is a discrete
distribution with 𝑑 atoms, this stochastic program is usually easy to solve.
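The Edmundson-Madansky bound can be stress-tested in the same spirit. The sketch below (our own convex loss and randomly generated feasible distributions) checks that no distribution on the probability simplex with mean 𝜇 exceeds Σ_{𝑖} 𝜇𝑖 ℓ(𝑒𝑖 ):

```python
import numpy as np

# Check the Edmundson-Madansky bound of Proposition 6.2 on the probability
# simplex in R^4 with a convex loss (instance data are our own illustration).
rng = np.random.default_rng(1)
d = 4
c = np.array([1.0, 2.0, 3.0, 4.0])
ell = lambda z: np.max(c * z, axis=-1) + np.sum(z ** 2, axis=-1)  # convex in z
mu = np.array([0.1, 0.2, 0.3, 0.4])

bound = mu @ ell(np.eye(d))   # the bound: sum_i mu_i * ell(e_i)
alpha = 0.05
for _ in range(100):
    atoms = rng.dirichlet(np.ones(d), size=20)   # random points in the simplex
    probs = rng.dirichlet(np.ones(20))
    # mix with one corrective atom so that the overall mean is exactly mu
    x_star = (mu - alpha * probs @ atoms) / (1 - alpha)
    assert np.all(x_star >= 0) and abs(x_star.sum() - 1) < 1e-9
    value = alpha * probs @ ell(atoms) + (1 - alpha) * ell(x_star)
    assert value <= bound + 1e-9
print(round(float(bound), 4))
```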

6.3. Barycentric Approximation
Consider the worst-case expectation problem
sup_{P∈P(V×W)} {E P [ℓ(𝑉, 𝑊)] : E P [𝑉] = 𝑣̄, E P [𝑊] = 𝑤̄, E P [𝑉𝑊 ⊤ ] = 𝐶},   (6.2a)

which maximizes the expected value of ℓ(𝑉, 𝑊) across all distributions of 𝑍 = (𝑉, 𝑊) on V × W under which 𝑉 and 𝑊 have mean vectors 𝑣̄ and 𝑤̄, respectively, and cross moment matrix 𝐶. Note that if 𝑉 and 𝑊 are uncorrelated, then 𝐶 = 𝑣̄𝑤̄ ⊤ . Problem (6.2a) optimizes over a moment ambiguity set of the form (2.1) with 𝑓 (𝑣, 𝑤) = (𝑣, 𝑤, 𝑣𝑤 ⊤ ) and F = {𝑣̄} × {𝑤̄} × {𝐶}. By Theorem 4.5 and as the
support function of F is linear, the problem dual to (6.2a) is given by
inf 𝜆0 + 𝜆⊤𝑣 𝑣̄ + 𝜆⊤𝑤 𝑤̄ + ⟨Λ, 𝐶⟩
s.t. 𝜆0 ∈ R, 𝜆 𝑣 ∈ R𝑑𝑣 , 𝜆 𝑤 ∈ R𝑑𝑤 , Λ ∈ R𝑑𝑣 ×𝑑𝑤                    (6.2b)
     𝜆0 + 𝜆⊤𝑣 𝑣 + 𝜆⊤𝑤 𝑤 + 𝑣 ⊤ Λ𝑤 ≥ ℓ(𝑣, 𝑤) ∀𝑣 ∈ V, ∀𝑤 ∈ W.
This dual problem seeks a bi-affine function 𝑏(𝑣, 𝑤) = 𝜆 0 + 𝜆⊤𝑣 𝑣 + 𝜆⊤𝑤 𝑤 + 𝑣 ⊤ Λ𝑤
that majorizes the loss function ℓ(𝑣, 𝑤) on V × W and minimizes E P [𝑏(𝑉, 𝑊)]
under any distribution P feasible in (6.2a). The following proposition shows that
problems (6.2a) and (6.2b) can be solved in closed form if ℓ is a concave-convex
saddle function and W is a simplex. Below, we use 𝑒𝑖 to denote the 𝑖-th standard
basis vector in R𝑑𝑤 , 𝑖 ∈ [𝑑 𝑤 ], and 𝑒 to denote the vector of ones in R𝑑𝑤 .
Proposition 6.3 (Barycentric Approximation). Suppose that V ⊆ R𝑑𝑣 is convex
and W ⊆ R𝑑𝑤 is the probability simplex with vertices 𝑒𝑖 , 𝑖 ∈ [𝑑 𝑤 ]. Suppose also
that the loss function ℓ(𝑣, 𝑤) is concave and superdifferentiable in 𝑣 for any fixed 𝑤
and convex in 𝑤 for any fixed 𝑣. In addition, suppose that 𝑣¯ ∈ V, 𝑤¯ ∈ rint(W) and
𝐶𝑒 = 𝑣̄ and that problem (6.2a) is feasible. Then (6.2a) is solved by
P★ = Σ_{𝑖=1}^{𝑑𝑤} 𝑤̄𝑖 𝛿(𝐶𝑒𝑖 /𝑤̄𝑖 , 𝑒𝑖 ) .

If Λ★𝑖 is any supergradient in 𝜕𝑣 ℓ(𝐶𝑒𝑖 /𝑤̄𝑖 , 𝑒𝑖 ) for all 𝑖 ∈ [𝑑 𝑤 ] and
𝜆★𝑤,𝑖 = ℓ(𝐶𝑒𝑖 /𝑤̄𝑖 , 𝑒𝑖 ) − (Λ★𝑖)⊤ 𝐶𝑒𝑖 /𝑤̄𝑖 ∀𝑖 ∈ [𝑑 𝑤 ],
then the dual problem (6.2b) is solved by (𝜆★0 , 𝜆★𝑣, 𝜆★𝑤 , Λ★), where 𝜆★0 = 0 and
𝜆★𝑣 = 0, while 𝜆★𝑤 has elements 𝜆★𝑤,𝑖 and Λ★ has columns Λ★𝑖, 𝑖 ∈ [𝑑 𝑤 ]. The
optimal values of (6.2a) and (6.2b) coincide and are both equal to
Σ_{𝑖=1}^{𝑑𝑤} 𝑤̄𝑖 ℓ(𝐶𝑒𝑖 /𝑤̄𝑖 , 𝑒𝑖 ).
The condition 𝐶𝑒 = 𝑣̄ is necessary for (6.2a) to be feasible. Indeed, if P is feasible in (6.2a), then we have 𝐶𝑒 = E P [𝑉𝑊 ⊤ 𝑒] = E P [𝑉] = 𝑣̄. Here, the second
equality holds because P(𝑊 ∈ W) = 1 and W is the probability simplex in R𝑑𝑤 .
However, the condition 𝐶𝑒 = 𝑣¯ is not sufficient for (6.2a) to be feasible. Indeed,
if the support set V = {𝑣¯ } is a singleton, then 𝐶 = E P [𝑉𝑊 ⊤ ] = 𝑣¯ 𝑤¯ ⊤ . That is, 𝑉
and 𝑊 must be uncorrelated. Hence, V and 𝐶 cannot be selected independently.
To circumvent this problem, Proposition 6.3 requires (6.2a) to be feasible.
Proof of Proposition 6.3. Note that 𝑤̄ > 0 and 𝑒⊤ 𝑤̄ = 1 because 𝑤̄ belongs to the relative interior of the probability simplex W. Thus, P★ is indeed a well-defined probability distribution, that is, the atoms of P★ have positive probabilities that sum to 1. In addition, P★ is supported on V × W because
𝐶𝑒𝑖 /𝑤̄𝑖 = E P [𝑉𝑊𝑖 /E P [𝑊𝑖 ]] = E P [𝑉 E P [𝑊𝑖 |𝑉]/E P [𝑊𝑖 ]] ∈ V and 𝑒𝑖 ∈ W ∀𝑖 ∈ [𝑑 𝑤 ],
where P is any distribution feasible in (6.2a). Note also that if 𝑉 and 𝑊 are uncorrelated, in which case 𝐶 = 𝑣̄𝑤̄ ⊤ , then the 𝑖-th generalized barycenter 𝐶𝑒𝑖 /𝑤̄𝑖 of V simplifies to 𝑣̄ for every 𝑖 ∈ [𝑑 𝑤 ]. Recalling that 𝐶𝑒 = 𝑣̄, we further have
E P★ [𝑉] = Σ_{𝑖=1}^{𝑑𝑤} 𝑤̄𝑖 𝐶𝑒𝑖 /𝑤̄𝑖 = 𝑣̄,    E P★ [𝑊] = Σ_{𝑖=1}^{𝑑𝑤} 𝑤̄𝑖 𝑒𝑖 = 𝑤̄

and
E P★ [𝑉𝑊 ⊤ ] = Σ_{𝑖=1}^{𝑑𝑤} 𝑤̄𝑖 𝐶𝑒𝑖 𝑒⊤𝑖 /𝑤̄𝑖 = 𝐶.

In summary, we have shown that P★ is feasible in (6.2a). A similar calculation reveals that the objective function value of P★ in (6.2a) is given by the formula in the proposition statement. Details are omitted for brevity.
To show that (𝜆★0 , 𝜆★𝑣, 𝜆★𝑤 , Λ★) is feasible in (6.2b), note first that
𝜆★0 + (𝜆★𝑣)⊤ 𝑣 + (𝜆★𝑤 )⊤ 𝑤 + 𝑣 ⊤ Λ★ 𝑤 = Σ_{𝑖=1}^{𝑑𝑤} 𝑤 𝑖 [ℓ(𝐶𝑒𝑖 /𝑤̄𝑖 , 𝑒𝑖 ) + (Λ★𝑖)⊤ (𝑣 − 𝐶𝑒𝑖 /𝑤̄𝑖 )] ≥ Σ_{𝑖=1}^{𝑑𝑤} 𝑤 𝑖 ℓ(𝑣, 𝑒𝑖 ) ≥ ℓ(𝑣, 𝑤)
for all 𝑣 ∈ V and 𝑤 ∈ W. The first inequality follows from the concavity of ℓ(𝑣, 𝑤)
in 𝑣 and the definition of Λ★𝑖 as a supergradient, while the second inequality follows
from the convexity of ℓ(𝑣, 𝑤) in 𝑤 and Jensen’s inequality. Hence, (𝜆★0 , 𝜆★𝑣, 𝜆★𝑤 , Λ★)
is indeed feasible in (6.2b). A similar calculation reveals that the objective function
value of (𝜆★0 , 𝜆★𝑣, 𝜆★𝑤 , Λ★) in (6.2b) is given by the formula in the proposition
statement. Consequently, by weak duality as established in Theorem 4.5, we have
shown that P★ is primal optimal and (𝜆★0 , 𝜆★𝑣, 𝜆★𝑤 , Λ★) is dual optimal. 

Proposition 6.3 remains valid with obvious minor modifications if W is defined as an arbitrary regular simplex in R𝑑𝑤 (Frauendorfer 1992). If 𝑧 = (𝑣, 𝑤) and the
loss function ℓ(𝑥, 𝑧) = ℓ(𝑥, 𝑣, 𝑤) in (1.2) is concave in 𝑣 and convex in 𝑤 for any
fixed 𝑥 ∈ X , then Proposition 6.3 implies that the DRO problem (1.2) is equi-
valent to the stochastic program inf 𝑥 ∈X E P★ [ℓ(𝑥, 𝑉, 𝑊)], where P★ is independent
of 𝑥. As P★ is a discrete distribution with 𝑑 𝑤 atoms, this stochastic program is
usually easy to solve. Traditionally, the distribution P★ is used to approximate hard
stochastic optimization problems of the form inf 𝑥 ∈X E P [ℓ(𝑥, 𝑉, 𝑊)], where P is a
known continuous distribution of (𝑉, 𝑊). Proposition 6.3 implies that if ℓ(𝑥, 𝑣, 𝑤)
is concave in 𝑣 and convex in 𝑤 for any 𝑥 ∈ X , then replacing P with P★ leads to a
conservative approximation, which is termed the upper barycentric approximation
of the original stochastic program (Frauendorfer 1992). Barycentric approxima-
tions for more general stochastic programs involving loss functions that may fail to
be convex and/or concave are derived by Kuhn (2005).
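The statement of Proposition 6.3 can be verified mechanically on a small instance. In the sketch below (all data our own, with 𝑑𝑣 = 1, 𝑑𝑤 = 2 and the saddle loss ℓ(𝑣, 𝑤) = −𝑣² + max{𝑤1 , 2𝑤2 }, which is concave in 𝑣 and convex in 𝑤), the primal value of P★ and the dual objective coincide, and the bi-affine dual function is checked to majorize ℓ on a grid:

```python
import numpy as np

# Sanity check of Proposition 6.3 on a toy saddle loss (all data ours).
ell = lambda v, w: -v ** 2 + max(w[0], 2.0 * w[1])
w_bar = np.array([0.5, 0.5])
C = np.array([0.2, 0.8])            # cross moments, satisfies C @ e = v_bar
v_bar = C.sum()

b = C / w_bar                        # generalized barycenters C e_i / w_bar_i
E = np.eye(2)
primal = sum(w_bar[i] * ell(b[i], E[i]) for i in range(2))

# Dual solution from the proposition: Lambda_i is a supergradient of ell in v.
Lam = -2.0 * b                       # derivative of -v^2 at the barycenters
lam_w = np.array([ell(b[i], E[i]) - Lam[i] * b[i] for i in range(2)])
dual = lam_w @ w_bar + Lam @ C       # lambda_0 = 0 and lambda_v = 0
assert abs(primal - dual) < 1e-9

# Dual feasibility: the bi-affine majorant dominates ell on a grid of (v, w).
for v in np.linspace(-3.0, 3.0, 61):
    for t in np.linspace(0.0, 1.0, 101):
        w = np.array([t, 1.0 - t])
        assert lam_w @ w + v * (Lam @ w) >= ell(v, w) - 1e-9
print(round(float(primal), 4))
```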

6.4. Ben-Tal and Hochman Bound
Consider the worst-case expectation problem
sup_{P∈P(Z)} {E P [ℓ(𝑍)] : E P [𝑍] = 𝜇, E P [|𝑍 − 𝜇|] = 𝜎} ,   (6.3a)

which maximizes the expected value of ℓ(𝑍) over the family of all univariate
distributions supported on Z with mean 𝜇 and mean absolute deviation 𝜎. Note
that problem (6.3a) optimizes over a moment ambiguity set of the form (2.1)
with 𝑓 (𝑧) = (𝑧, |𝑧 − 𝜇|) and F = {𝜇} × {𝜎}. By Theorem 4.5 and as the support
function of F is linear, the problem dual to (6.3a) is given by
inf_{𝜆0 ,𝜆1 ,𝜆2 ∈R} {𝜆0 + 𝜆1 𝜇 + 𝜆2 𝜎 : 𝜆0 + 𝜆1 𝑧 + 𝜆2 |𝑧 − 𝜇| ≥ ℓ(𝑧) ∀𝑧 ∈ Z} .   (6.3b)

Intuitively, this dual problem aims to approximate the loss function from above with
a piecewise linear continuous function that has a kink at 𝜇. The problems (6.3a)
and (6.3b) can be solved in closed form if ℓ is convex.
Proposition 6.4 (Ben-Tal and Hochman Bound). Assume that Z = [0, 1], 𝜇 ∈ (0, 1)
and 𝜎 ∈ [0, 2𝜇(1− 𝜇)]. Suppose also that ℓ is a real-valued convex function. Then,
the primal problem (6.3a) is solved by
P★ = (𝜎/(2𝜇)) 𝛿0 + (1 − 𝜎/(2𝜇) − 𝜎/(2(1 − 𝜇))) 𝛿𝜇 + (𝜎/(2(1 − 𝜇))) 𝛿1 ,
and the dual problem (6.3b) is solved by
𝜆★0 = ((1 − 𝜇)ℓ(0) + ℓ(𝜇) − 𝜇ℓ(1)) / (2(1 − 𝜇)),
𝜆★1 = ((𝜇 − 1)ℓ(0) + (1 − 2𝜇)ℓ(𝜇) + 𝜇ℓ(1)) / (2𝜇(1 − 𝜇)),
𝜆★2 = ((1 − 𝜇)ℓ(0) − ℓ(𝜇) + 𝜇ℓ(1)) / (2𝜇(1 − 𝜇)).
In addition, the optimal values of (6.3a) and (6.3b) coincide and are both equal to
(𝜎/(2𝜇)) ℓ(0) + (1 − 𝜎/(2𝜇) − 𝜎/(2(1 − 𝜇))) ℓ(𝜇) + (𝜎/(2(1 − 𝜇))) ℓ(1).
Proof. The assumptions about 𝜇 and 𝜎 imply that P★ is supported on Z and that
the probabilities of the three atoms are non-negative and sum to 1. Also, we have
E P★ [𝑍] = 𝜇 − 𝜎/2 − 𝜎𝜇/(2(1 − 𝜇)) + 𝜎/(2(1 − 𝜇)) = 𝜇 and E P★ [|𝑍 − 𝜇|] = 𝜎.
Thus, P★ is feasible in (6.3a). In addition, one readily verifies that the objective
function value of P★ in (6.3a) is given by the formula in the proposition statement.
Next, note that the piecewise linear function 𝜆★0 + 𝜆★1 𝑧 + 𝜆★2 |𝑧 − 𝜇| coincides with
the loss function ℓ(𝑧) for every 𝑧 ∈ {0, 𝜇, 1}. As the loss function is convex, we
may thus conclude that 𝜆★0 + 𝜆★1 𝑧 + 𝜆★2 |𝑧 − 𝜇| majorizes ℓ(𝑧) for every 𝑧 ∈ [0, 1] = Z.
This shows that (𝜆★0 , 𝜆★1 , 𝜆★2 ) is feasible in (6.3b). An elementary calculation further
reveals that the objective function value of (𝜆★0 , 𝜆★1 , 𝜆★2 ) in (6.3b) is given by the
formula in the proposition statement. Weak duality as established in Theorem 4.5
thus implies that P★ is primal optimal and that (𝜆★0 , 𝜆★1 , 𝜆★2 ) is dual optimal. 

Proposition 6.4 readily extends to support sets of the form Z = [𝑎, 𝑏] for
any 𝑎, 𝑏 ∈ R with 𝑎 < 𝜇 < 𝑏 by applying a linear coordinate transformation.
If ℓ(𝑥, 𝑧) in (1.2) is convex in 𝑧 for any fixed 𝑥 ∈ X , then Proposition 6.4 implies that
the DRO problem (1.2) is equivalent to the stochastic program inf 𝑥 ∈X E P★ [ℓ(𝑥, 𝑍)],
where the three-point distribution P★ is independent of 𝑥. Traditionally, this
stochastic program is used as a conservative approximation for a stochastic program
of the form inf 𝑥 ∈X E P [ℓ(𝑥, 𝑍)], where P is a known continuous distribution (Ben-
Tal and Hochman 1972). Unlike the Jensen and Edmundson-Madansky bounds,
which only use information about the location of P, and unlike the barycentric
approximation, which only uses information about the location and certain cross-
moments of P, the Ben-Tal and Hochman bound uses information about the location
as well as the dispersion of P. Thus, it provides a tighter approximation.
If 𝑍 is a 𝑑-dimensional random vector with independent components 𝑍𝑖 , 𝑖 ∈ [𝑑],
each of which has a known mean and mean absolute deviation, then one can show
that the worst-case expected value of a convex loss function is attained by P★ = ⊗_{𝑖=1}^𝑑 P★𝑖 , where each P★𝑖 is a three-point distribution constructed as in Proposition 6.4
(Ben-Tal and Hochman 1972). In this case, P★ is a discrete distribution with 3^𝑑 atoms. Hence, evaluating expected values with respect to P★ is generically hard
but becomes tractable for a class of exponential loss functions that offer safe
approximations for chance constraints (Postek et al. 2018).
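Proposition 6.4 can likewise be validated numerically. The sketch below (our own choice of 𝜇, 𝜎 and a convex exponential loss on Z = [0, 1]) confirms the feasibility of the three-point distribution, the dual feasibility of the piecewise affine majorant, and the equality of the primal and dual objective values:

```python
import numpy as np

# Sanity check of Proposition 6.4 for an example convex loss on Z = [0, 1]
# (the instance data mu, sigma and the loss are our own choices).
ell = lambda z: np.exp(2.0 * z)           # convex and real-valued
mu, sigma = 0.4, 0.3                      # sigma <= 2*mu*(1 - mu) = 0.48

p0, p1 = sigma / (2 * mu), sigma / (2 * (1 - mu))
atoms = np.array([0.0, mu, 1.0])
probs = np.array([p0, 1 - p0 - p1, p1])
assert abs(probs @ atoms - mu) < 1e-12                  # mean constraint
assert abs(probs @ np.abs(atoms - mu) - sigma) < 1e-12  # mean absolute deviation

primal = probs @ ell(atoms)

l0, lmu, l1 = ell(0.0), ell(mu), ell(1.0)
lam0 = ((1 - mu) * l0 + lmu - mu * l1) / (2 * (1 - mu))
lam1 = ((mu - 1) * l0 + (1 - 2 * mu) * lmu + mu * l1) / (2 * mu * (1 - mu))
lam2 = ((1 - mu) * l0 - lmu + mu * l1) / (2 * mu * (1 - mu))
dual = lam0 + lam1 * mu + lam2 * sigma
assert abs(primal - dual) < 1e-9

# dual feasibility: the piecewise affine function majorizes ell on [0, 1]
z = np.linspace(0.0, 1.0, 1001)
assert np.all(lam0 + lam1 * z + lam2 * np.abs(z - mu) >= ell(z) - 1e-9)
print(round(float(primal), 4))
```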

6.5. Scarf’s Bound
Consider the worst-case expectation problem
sup_{P∈P(Z)} {E P [ℓ(𝑍)] : E P [𝑍] = 0, E P [𝑍 2 ] = 𝜎 2 },   (6.4a)

which maximizes the expected value of ℓ(𝑍) over the Chebyshev ambiguity set of
all univariate distributions supported on Z with mean 0 and variance 𝜎 2 . This
Chebyshev ambiguity set is a moment ambiguity set of the form (2.1) with 𝑓 (𝑧) =
(𝑧, 𝑧2 ) and F = {0} × {𝜎 2 }. By Theorem 4.5 and as the support function of F is
linear, the problem dual to (6.4a) is given by
inf_{𝜆0 ,𝜆1 ,𝜆2 ∈R} {𝜆0 + 𝜆2 𝜎 2 : 𝜆0 + 𝜆1 𝑧 + 𝜆2 𝑧 2 ≥ ℓ(𝑧) ∀𝑧 ∈ Z} .   (6.4b)

This dual problem seeks a quadratic function 𝑞(𝑧) = 𝜆 0 + 𝜆 1 𝑧 + 𝜆 2 𝑧2 that majorizes


the loss function ℓ(𝑧) throughout Z and has minimal expectation E P [𝑞(𝑍)] under
any distribution P with mean 0 and variance 𝜎 2 . The problems (6.4a) and (6.4b)
can be solved in closed form if ℓ is a ramp function.
Proposition 6.5 (Scarf’s Bound). If Z = R, 𝜎 2 ∈ R+ and ℓ(𝑧) = max{𝑧 − 𝑎, 0} is
a ramp function with a kink at 𝑎 ∈ R, then the primal problem (6.4a) is solved by
P★ = (1/2)(1 + 𝑎/√(𝑎2 + 𝜎 2 )) 𝛿_{𝑎−√(𝑎2 +𝜎 2 )} + (1/2)(1 − 𝑎/√(𝑎2 + 𝜎 2 )) 𝛿_{𝑎+√(𝑎2 +𝜎 2 )} ,
and the dual problem (6.4b) is solved by
𝜆★0 = (𝑎 − √(𝑎2 + 𝜎 2 ))2 / (4√(𝑎2 + 𝜎 2 )),  𝜆★1 = −(𝑎 − √(𝑎2 + 𝜎 2 )) / (2√(𝑎2 + 𝜎 2 ))  and  𝜆★2 = 1 / (4√(𝑎2 + 𝜎 2 )).

The optimal values of (6.4a) and (6.4b) are both equal to (1/2)(√(𝑎2 + 𝜎 2 ) − 𝑎).
Proof. Note that the two-point distribution P★ is well-defined, that is, its atoms
have non-negative probabilities that sum to 1. By the definition of P★, we also have
E P★ [𝑍] = (1/2)(1 + 𝑎/√(𝑎2 + 𝜎 2 )) (𝑎 − √(𝑎2 + 𝜎 2 )) + (1/2)(1 − 𝑎/√(𝑎2 + 𝜎 2 )) (𝑎 + √(𝑎2 + 𝜎 2 )) = 0.
Similarly, it is easy to verify that E P★ [𝑍 2 ] = 𝜎 2 . This shows that P★ is feasible
in (6.4a). The objective function value of P★ is
E P★ [ℓ(𝑍)] = E P★ [max{𝑍 − 𝑎, 0}] = (1/2)(√(𝑎2 + 𝜎 2 ) − 𝑎).
Next, observe that the dual variables (𝜆★0 , 𝜆★1 , 𝜆★2 ) defined in the proposition statement give rise to the quadratic function
𝑞★ (𝑧) = 𝜆★0 + 𝜆★1 𝑧 + 𝜆★2 𝑧 2 = (𝑧 − 𝑎 + √(𝑎2 + 𝜎 2 ))2 / (4√(𝑎2 + 𝜎 2 )).
We will now show that 𝑞★ (𝑧) ≥ max{𝑧 − 𝑎, 0} = ℓ(𝑧) for all 𝑧 ∈ Z. Clearly, 𝑞★ is non-negative and evaluates to 0 at 𝑎 − √(𝑎2 + 𝜎 2 ). In addition, 𝑞★ touches the affine function 𝑧 − 𝑎 at 𝑎 + √(𝑎2 + 𝜎 2 ). To see this, note that
𝑞★ (𝑎 + √(𝑎2 + 𝜎 2 )) = √(𝑎2 + 𝜎 2 )  and  (d𝑞★/d𝑧)(𝑎 + √(𝑎2 + 𝜎 2 )) = 1.
Hence, 𝑞★ majorizes the ramp function ℓ(𝑧), implying that (𝜆★0 , 𝜆★1 , 𝜆★2 ) is dual
feasible. Also, the objective function value of (𝜆★0 , 𝜆★1 , 𝜆★2 ) is given by
𝜆★0 + 𝜆★2 𝜎 2 = (𝜎 2 + (𝑎 − √(𝑎2 + 𝜎 2 ))2 ) / (4√(𝑎2 + 𝜎 2 )) = (1/2)(√(𝑎2 + 𝜎 2 ) − 𝑎).
As the objective function values of P★ and (𝜆★0 , 𝜆★1 , 𝜆★2 ) match, weak duality as established in Theorem 4.5 thus implies that P★ is primal optimal and that (𝜆★0 , 𝜆★1 , 𝜆★2 ) is dual optimal. This observation completes the proof. □
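The closed-form expressions of Proposition 6.5 lend themselves to a direct numerical check (the values of 𝑎 and 𝜎 below are our own):

```python
import numpy as np

# Sanity check of Proposition 6.5 (Scarf's bound): verify feasibility of the
# two-point distribution, dual feasibility of the quadratic majorant, and the
# closed-form optimal value, for example parameters of our own choosing.
a, sigma = 0.5, 1.0
s = np.sqrt(a ** 2 + sigma ** 2)

atoms = np.array([a - s, a + s])
probs = np.array([(1 + a / s) / 2, (1 - a / s) / 2])
assert abs(probs @ atoms) < 1e-12                    # zero mean
assert abs(probs @ atoms ** 2 - sigma ** 2) < 1e-12  # variance sigma^2

ramp = lambda z: np.maximum(z - a, 0.0)
primal = probs @ ramp(atoms)
closed_form = 0.5 * (s - a)
assert abs(primal - closed_form) < 1e-12

# the quadratic q*(z) = (z - a + s)^2 / (4 s) majorizes the ramp function
z = np.linspace(a - 10.0, a + 10.0, 4001)
q = (z - a + s) ** 2 / (4 * s)
assert np.all(q >= ramp(z) - 1e-9)
print(round(float(closed_form), 4))
```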

Proposition 6.5 was first derived by Scarf (1958) in his pioneering treatise
on the distributionally robust newsvendor problem; see also (Jagannathan 1977,
Theorem 1). Note that if the mean of 𝑍 is known to equal 𝜇 ≠ 0 instead of 0, then
Scarf’s bound remains valid if we replace 𝑎 with 𝑎 − 𝜇. Gallego and Moon (1993)
extend Scarf’s bound to more general loss functions such as wedge functions or
ramp functions with a discontinuity, whereas Natarajan et al. (2018) extend Scarf’s
bound to more general ambiguity sets that not only contain information about the
mean and variance of 𝑍 but also about its semivariance. In addition, Das, Dhara
and Natarajan (2021) discuss variants of Scarf’s bound that rely on information
about the mean and the 𝛼-th moment of 𝑍 for any 𝛼 > 1.
Proposition 6.5 is often used to reformulate DRO problems of the form (1.2)
whose objective function is given by the expected value of a ramp function.
Examples include distributionally robust newsvendor, support vector machine or
mean-CVaR portfolio selection problems. In most of these applications, the location 𝑎 of the kink of the ramp function is a decision variable or a function of
the decision variables. Thus, the worst-case distribution P★ is decision-dependent,
which means that Proposition 6.5 does not enable us to reduce the DRO prob-
lem (1.2) to a stochastic program with a single fixed worst-case distribution.
6.6. Marshall and Olkin Bound
Consider the worst-case probability problem
sup_{P∈P(Z)} {P(𝑍 ∈ C) : E P [𝑍] = 0, E P [𝑍 𝑍 ⊤ ] = 𝐼𝑑 },   (6.5a)

which maximizes the probability of the event 𝑍 ∈ C over the Chebyshev ambiguity
set of all distributions on Z = R𝑑 with mean 0 and covariance matrix 𝐼 𝑑 . This
Chebyshev ambiguity set is a moment ambiguity set of the form (2.1) with 𝑓 (𝑧) =
(𝑧, 𝑧𝑧⊤ ) and F = {0} × {𝐼 𝑑 }. If we set ℓ to the characteristic function of C defined
through ℓ(𝑧) = 1{𝑧 ∈ C} for all 𝑧 ∈ Z, then the worst-case probability problem (6.5a)
can be recast as a worst-case expectation problem. By Theorem 4.5 and as the
support function of F is linear, the corresponding dual problem is thus given by
inf_{𝜆0 ∈R, 𝜆∈R𝑑 , Λ∈S𝑑 } {𝜆0 + ⟨Λ, 𝐼𝑑 ⟩ : 𝜆0 + 𝜆⊤ 𝑧 + 𝑧⊤ Λ𝑧 ≥ ℓ(𝑧) ∀𝑧 ∈ Z} .   (6.5b)

The problems (6.5a) and (6.5b) can be solved analytically if C is convex and closed.
Proposition 6.6 (Marshall and Olkin Bound). Suppose that Z = R𝑑 , C ⊆ R𝑑 is
convex and closed, and ℓ is the characteristic function of C. Set Δ = min𝑧∈C ‖𝑧‖2 , and let 𝑧0 ∈ R𝑑 be the unique minimizer of this problem. Then, the optimal values of (6.5a) and (6.5b) are both equal to (1 + Δ2 )−1 . If Δ = 0, then the supremum
of (6.5a) may not be attained. However, if Δ > 0, then (6.5a) is solved by
P★ = (1/(1 + Δ2 )) 𝛿𝑧0 + (Δ2 /(1 + Δ2 )) Q,
where Q ∈ P(Z) is an arbitrary distribution with mean −𝑧0 /Δ2 and covariance matrix ((1 + Δ2 )/Δ2 )(𝐼𝑑 − 𝑧0 𝑧⊤0 /Δ2 ). For any Δ ≥ 0, problem (6.5b) is solved by
𝜆★0 = 1/(1 + Δ2 )2 ,  𝜆★ = 2𝑧0 /(1 + Δ2 )2  and  Λ★ = 𝑧0 𝑧⊤0 /(1 + Δ2 )2 .
Proof. Assume first that Δ = 0, that is, 0 ∈ C. For every 𝑗 ∈ N, let Q 𝑗 ∈ P(Z) be
any distribution with mean 0 and covariance matrix 𝑗 𝐼 𝑑 , and set
P 𝑗 = (1 − 1/ 𝑗) 𝛿0 + (1/ 𝑗) Q 𝑗 .
We thus have E P 𝑗 [𝑍] = 0 and E P 𝑗 [𝑍 𝑍 ⊤ ] = 𝐼𝑑 , which implies that P 𝑗 is feasible
in (6.5a). In addition, the objective function value of P 𝑗 in (6.5a) satisfies
P 𝑗 (𝑍 ∈ C) = 1 − 1/ 𝑗 + Q 𝑗 (𝑍 ∈ C)/ 𝑗 ≥ 1 − 1/ 𝑗.
Driving 𝑗 to infinity reveals that problem (6.5a) is trivial for Δ = 0 and that its
supremum equals 1. Assume now that Δ > 0, and let Q ∈ P(Z) be an arbitrary distribution with mean −𝑧0 /Δ2 and covariance matrix ((1 + Δ2 )/Δ2 )(𝐼𝑑 − 𝑧0 𝑧⊤0 /Δ2 ). Such a distribution is guaranteed to exist because 𝐼𝑑 ⪰ 𝑧0 𝑧⊤0 /Δ2 . In addition, define P★ as
in the proposition statement. By construction, we have E P★ [𝑍] = 0 and
E P★ [𝑍 𝑍 ⊤ ] = 𝑧0 𝑧⊤0 /(1 + Δ2 ) + (Δ2 /(1 + Δ2 )) E Q [𝑍 𝑍 ⊤ ] = 𝑧0 𝑧⊤0 /(1 + Δ2 ) + 𝐼𝑑 − 𝑧0 𝑧⊤0 /Δ2 + 𝑧0 𝑧⊤0 /(Δ2 (1 + Δ2 )) = 𝐼𝑑 .
Also, the objective function value of P★ in (6.5a) is given by P★(𝑍 ∈ C) = (1+Δ2 ) −1 .
Next, use (𝜆★0 , 𝜆★, Λ★) defined in the proposition to construct the quadratic function
𝑞★ (𝑧) = 𝜆★0 + (𝜆★)⊤ 𝑧 + 𝑧⊤ Λ★ 𝑧 = (𝑧⊤0 𝑧 + 1)2 /(1 + Δ2 )2 .
Note that 𝑞★ is non-negative and constant on any hyperplane perpendicular to 𝑧0 .
If Δ > 0, we have 𝑞★ (𝑧0 ) = 1 as well as 𝑞★ (−𝑧0 /Δ2 ) = 0. Thus, at every 𝑧 ∈ Z with 𝑧⊤0 𝑧 ≥ −1, the quadratic function 𝑞★ (𝑧) is non-decreasing in the direction of 𝑧0 . As 𝑧0 minimizes the differentiable convex function ‖𝑧‖22 over the convex closed set C, we have 𝑧⊤0 (𝑧 − 𝑧0 ) ≥ 0 for all 𝑧 ∈ C. By the monotonicity properties of 𝑞★, this implies that 𝑞★ (𝑧) ≥ 1 for every 𝑧 ∈ C. Hence, the quadratic function 𝑞★ majorizes the indicator function ℓ on Z, which implies that (𝜆★0 , 𝜆★, Λ★) is dual feasible. If Δ = 0, then 𝑞★ (𝑧) = 1 for all 𝑧 ∈ Z, and (𝜆★0 , 𝜆★, Λ★) is also dual feasible. In any
case, one readily verifies that its objective function value is given by
 −1
𝜆★0 + hΛ★, 𝐼 𝑑 i = 1 + Δ2 .
As the objective function values of P★ and (𝜆★₀, 𝜆★, Λ★) for Δ > 0 match, weak duality as established in Theorem 4.5 implies that P★ is primal optimal and that (𝜆★₀, 𝜆★, Λ★) is dual optimal. If Δ = 0, then the optimal value 1 of the primal problem also matches the objective function value of (𝜆★₀, 𝜆★, Λ★) in (6.5b). Hence, (𝜆★₀, 𝜆★, Λ★) remains dual optimal even though the supremum of the primal problem may not be attained. This observation completes the proof. □
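To see the construction at work, the following sketch checks it numerically on a concrete instance (the choice C = {𝑧 : 𝑧₁ ≥ 1} in dimension 𝑑 = 2, so that 𝑧₀ = (1, 0) and Δ = 1, is ours for illustration): the mixture P★ has mean zero and identity second moment, and it places probability 1/(1 + Δ²) = 1/2 on C.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 200_000
z0 = np.array([1.0, 0.0])          # minimum-norm point of C = {z : z_1 >= 1}
Delta2 = z0 @ z0                   # Delta^2 = 1

# P* mixes a point mass at z0 (weight 1/(1+Delta^2)) with Q (weight Delta^2/(1+Delta^2)),
# where Q has mean -z0/Delta^2 and covariance (1+Delta^2)/Delta^2 (I - z0 z0'/Delta^2).
w = 1.0 / (1.0 + Delta2)
mu_Q = -z0 / Delta2
cov_Q = (1.0 + Delta2) / Delta2 * (np.eye(d) - np.outer(z0, z0) / Delta2)
on_atom = rng.random(n) < w
Z = np.where(on_atom[:, None], z0, rng.multivariate_normal(mu_Q, cov_Q, size=n))

mean_Z = Z.mean(axis=0)            # should be close to 0
second_moment = Z.T @ Z / n        # should be close to I_d
prob_C = np.mean(Z[:, 0] >= 1.0)   # should be close to 1/(1+Delta^2) = 0.5
print(mean_Z.round(3), prob_C)
```

Any distribution Q with the prescribed mean and covariance would do; a (degenerate) Gaussian is merely convenient to sample from.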
6.7. Chebyshev Risk
Analytical solutions of worst-case expectation problems sometimes enable us to
evaluate the worst-case risk of a random variable if the underlying risk measure
is law-invariant, translation-invariant as well as scale-invariant; see Definition 5.3.
For example, it is elementary to verify that the 𝛽-VaR and 𝛽-CVaR constitute law-
invariant, translation-invariant as well as scale-invariant risk measures for every
fixed 𝛽 ∈ (0, 1). If the distribution of 𝑍 is unknown except for its mean 𝜇 ∈ R𝑑 and
covariance matrix Σ ∈ S+𝑑 , then it is natural to quantify the riskiness of an uncertain
loss ℓ(𝑍) under a law-invariant risk measure 𝜚 by the corresponding Chebyshev
risk. Specifically, the Chebyshev risk of ℓ(𝑍) is defined as the worst-case risk
    sup_{P∈P(𝜇,Σ)} 𝜚_P[ℓ(𝑍)],

where P(𝜇, Σ) denotes the Chebyshev ambiguity set that contains all probability distributions on R^𝑑 with mean 𝜇 ∈ R^𝑑 and covariance matrix Σ ∈ S^𝑑_+.
We now describe a powerful tool for analyzing the Chebyshev risk with respect to any law-, translation- and scale-invariant risk measure. To this end, recall that if 𝑍 follows some distribution P on R^𝑑, then 𝐿 = ℓ(𝑍) follows the pushforward distribution P ∘ ℓ⁻¹ on R. If P is uncertain and only known to belong to some ambiguity set P, then the distribution of 𝐿 = ℓ(𝑍) is also uncertain and only known to belong to the pushforward ambiguity set P ∘ ℓ⁻¹ = {P ∘ ℓ⁻¹ : P ∈ P}. The following proposition due to Popescu (2007) shows that linear pushforwards of Chebyshev ambiguity sets are again Chebyshev ambiguity sets.
Proposition 6.7 (Pushforwards of Chebyshev Ambiguity Sets). If 𝜇 ∈ R^𝑑, Σ ∈ S^𝑑_+, 𝜃 ∈ R^𝑑, and ℓ : R^𝑑 → R is the linear transformation defined through ℓ(𝑧) = 𝜃⊤𝑧, then the pushforward of the Chebyshev ambiguity set P(𝜇, Σ) is the Chebyshev ambiguity set of all distributions on R with mean 𝜃⊤𝜇 and variance 𝜃⊤Σ𝜃, that is, P(𝜇, Σ) ∘ ℓ⁻¹ = P(𝜃⊤𝜇, 𝜃⊤Σ𝜃).
Proof. Select first any distribution P ∈ P(𝜇, Σ). If the random vector 𝑍 follows P, then the random variable 𝐿 = ℓ(𝑍) follows P ∘ ℓ⁻¹. Thus, we have

    E_{P∘ℓ⁻¹}[𝐿] = E_P[ℓ(𝑍)] = E_P[𝜃⊤𝑍] = 𝜃⊤𝜇,

where the first equality follows from the measure-theoretic change of variables formula. Similarly, one can show that E_{P∘ℓ⁻¹}[(𝐿 − 𝜃⊤𝜇)²] = 𝜃⊤Σ𝜃. Thus, we find

    P(𝜇, Σ) ∘ ℓ⁻¹ ⊆ P(𝜃⊤𝜇, 𝜃⊤Σ𝜃).
Next, select any Q_𝐿 ∈ P(𝜃⊤𝜇, 𝜃⊤Σ𝜃). If 𝜃⊤Σ𝜃 = 0, then Q_𝐿 = 𝛿_{𝜃⊤𝜇}, which coincides with the pushforward distribution P ∘ ℓ⁻¹ for any P ∈ P(𝜇, Σ). In the remainder of the proof we may thus assume that 𝜃⊤Σ𝜃 ≠ 0. Let now 𝐿 be a random variable governed by Q_𝐿, and let 𝑀 be a 𝑑-dimensional random vector governed by an arbitrary distribution Q_𝑀 ∈ P(R^𝑑) with mean 𝜇 and covariance matrix Σ. For example, we can set Q_𝑀 to the normal distribution N(𝜇, Σ). Assume 𝐿 and 𝑀 are independent. Then, the distribution P of the 𝑑-dimensional random vector

    𝑍 = 1/(𝜃⊤Σ𝜃) · Σ𝜃𝐿 + (𝐼_𝑑 − 1/(𝜃⊤Σ𝜃) · Σ𝜃𝜃⊤) 𝑀

belongs to P(𝜇, Σ). By the construction of 𝐿 and 𝑀, we have indeed

    E_P[𝑍] = 1/(𝜃⊤Σ𝜃) · Σ𝜃 E_{Q_𝐿}[𝐿] + (𝐼_𝑑 − 1/(𝜃⊤Σ𝜃) · Σ𝜃𝜃⊤) E_{Q_𝑀}[𝑀] = 𝜇

and

    E_P[(𝑍 − 𝜇)(𝑍 − 𝜇)⊤] = 1/(𝜃⊤Σ𝜃) · Σ𝜃𝜃⊤Σ + (𝐼_𝑑 − 1/(𝜃⊤Σ𝜃) · Σ𝜃𝜃⊤) Σ (𝐼_𝑑 − 1/(𝜃⊤Σ𝜃) · 𝜃𝜃⊤Σ) = Σ.

The first equality in the above expression holds because 𝐿 and 𝑀 are independent, 𝐿 has variance 𝜃⊤Σ𝜃 and 𝑀 has covariance matrix Σ. By construction, we further have ℓ(𝑍) = 𝜃⊤𝑍 = 𝐿, which implies that P ∘ ℓ⁻¹ = Q_𝐿. We have thus shown that for every Q_𝐿 ∈ P(𝜃⊤𝜇, 𝜃⊤Σ𝜃) there exists P ∈ P(𝜇, Σ) with P ∘ ℓ⁻¹ = Q_𝐿, that is,

    P(𝜇, Σ) ∘ ℓ⁻¹ ⊇ P(𝜃⊤𝜇, 𝜃⊤Σ𝜃).

This observation completes the proof. □
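As a quick sanity check of the converse inclusion, the sketch below instantiates the construction from the proof with a concrete choice of Q_𝐿 (a two-point distribution with the required mean and variance — our choice for illustration) and verifies that 𝜃⊤𝑍 = 𝐿 holds pathwise and that 𝑍 has the prescribed first two moments.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200_000, 3
mu = np.array([1.0, -2.0, 0.5])
A = rng.standard_normal((d, d))
Sigma = A @ A.T + np.eye(d)        # a generic positive definite covariance
theta = np.array([0.5, 1.0, -1.5])
s2 = theta @ Sigma @ theta         # variance prescribed for L

# Q_L: two-point distribution with mean theta'mu and variance theta'Sigma theta
L = theta @ mu + np.sqrt(s2) * rng.choice([-1.0, 1.0], size=n)
M = rng.multivariate_normal(mu, Sigma, size=n)   # Q_M = N(mu, Sigma), independent of L

proj = np.outer(Sigma @ theta, theta) / s2       # (1/(theta'Sigma theta)) Sigma theta theta'
Z = L[:, None] * (Sigma @ theta / s2) + M @ (np.eye(d) - proj).T

max_err = np.max(np.abs(Z @ theta - L))          # theta'Z = L holds for every sample path
mean_err = np.max(np.abs(Z.mean(axis=0) - mu))
cov_err = np.max(np.abs(np.cov(Z.T) - Sigma))
print(max_err < 1e-8, mean_err, cov_err)
```

The identity 𝜃⊤𝑍 = 𝐿 is exact (not merely in distribution) because 𝜃 is annihilated by the projection 𝐼_𝑑 − Σ𝜃𝜃⊤/(𝜃⊤Σ𝜃) applied to 𝑀.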
Generalizations of Proposition 6.7 to multi-dimensional affine transformations and to subfamilies of the Chebyshev ambiguity set that contain only distributions with certain structural properties (such as symmetry, linear unimodality or log-concavity) are presented in (Yu, Li, Schuurmans and Szepesvári 2009); see also (Chen et al. 2011).
We now show that if the risk measure 𝜚 is law-, translation- and scale-invariant
and the loss function ℓ is linear, then the Chebyshev risk reduces to a mean-standard
deviation risk measure, which involves the standard risk coefficient of 𝜚.
Definition 6.8 (Standard Risk Coefficient). The standard risk coefficient of a law-invariant risk measure 𝜚 is given by 𝛼 = sup_{Q∈P(0,1)} 𝜚_Q[𝐿].
Thus, the standard risk coefficient of 𝜚 is defined as the worst-case risk of an
uncertain loss 𝐿 whose distribution Q is only known to have mean 0 and variance 1.
Proposition 6.9 (Chebyshev Risk). If 𝜚 is a law-, translation- and scale-invariant risk measure with standard risk coefficient 𝛼, there is 𝜃 ∈ R^𝑑 with ℓ(𝑧) = 𝜃⊤𝑧 for all 𝑧 ∈ R^𝑑, and P(𝜇, Σ) is the Chebyshev ambiguity set of all distributions on R^𝑑 with mean 𝜇 ∈ R^𝑑 and covariance matrix Σ ∈ S^𝑑_+, then the Chebyshev risk satisfies

    sup_{P∈P(𝜇,Σ)} 𝜚_P[ℓ(𝑍)] = 𝜃⊤𝜇 + 𝛼√(𝜃⊤Σ𝜃).
Proof. If 𝜃⊤Σ𝜃 = 0, then

    sup_{P∈P(𝜇,Σ)} 𝜚_P[𝜃⊤𝑍] = 𝜃⊤𝜇 + sup_{P∈P(𝜇,Σ)} 𝜚_P[𝜃⊤(𝑍 − 𝜇)]
                            = 𝜃⊤𝜇 + sup_{P∈P(𝜇,Σ)} 𝜚_P[0] = 𝜃⊤𝜇,

where the first equality holds because 𝜚 is translation-invariant, whereas the second equality holds because 𝜃⊤(𝑍 − 𝜇) equals 0 in law under any P ∈ P(𝜇, Σ) and because 𝜚 is law-invariant. Finally, the third equality follows from the scale-invariance of 𝜚. If 𝜃⊤Σ𝜃 > 0, on the other hand, then we have

    sup_{P∈P(𝜇,Σ)} 𝜚_P[𝜃⊤𝑍] = 𝜃⊤𝜇 + sup_{P∈P(𝜇,Σ)} 𝜚_P[𝜃⊤(𝑍 − 𝜇)]
                            = 𝜃⊤𝜇 + sup_{P∈P(𝜇,Σ)} 𝜚_P[𝜃⊤(𝑍 − 𝜇)/√(𝜃⊤Σ𝜃)] · √(𝜃⊤Σ𝜃)
                            = 𝜃⊤𝜇 + 𝛼√(𝜃⊤Σ𝜃),

where the first two equalities follow from the translation- and scale-invariance of 𝜚, respectively. The third equality follows from Proposition 6.7, the law-invariance of 𝜚 and the definition of 𝛼. Indeed, the pushforward of the multivariate Chebyshev ambiguity set P(𝜇, Σ) under the transformation ℓ(𝑧) = 𝜃⊤(𝑧 − 𝜇)/√(𝜃⊤Σ𝜃) coincides with the univariate standard Chebyshev ambiguity set P(0, 1). □
The standard risk coefficient of a generic law-invariant risk measure may be difficult to compute. We now show, however, that the standard risk coefficients of the VaR and the CVaR match and are available in closed form.
Proposition 6.10 (Standard Risk Coefficients of VaR and CVaR). For any 𝛽 ∈ (0, 1), the standard risk coefficients of the 𝛽-VaR and the 𝛽-CVaR coincide, that is,

    sup_{Q∈P(0,1)} 𝛽-CVaR_Q[𝐿] = sup_{Q∈P(0,1)} 𝛽-VaR_Q[𝐿] = √((1 − 𝛽)/𝛽).

Proof. As 𝛽-CVaR_Q[𝐿] upper bounds 𝛽-VaR_Q[𝐿] for every Q ∈ P(0, 1), we have

    sup_{Q∈P(0,1)} 𝛽-CVaR_Q[𝐿] ≥ sup_{Q∈P(0,1)} 𝛽-VaR_Q[𝐿]. (6.6)
The rest of the proof proceeds as follows. We first derive an analytical formula for the worst-case 𝛽-VaR on the right hand side (Step 1). Next, we prove that the same analytical formula provides an upper bound on the worst-case 𝛽-CVaR on the left hand side (Step 2). The claim then follows from the above inequality.
Step 1. We first express the worst-case 𝛽-VaR as its smallest upper bound to find

    sup_{Q∈P(0,1)} 𝛽-VaR_Q[𝐿] = inf_{𝜏∈R} {𝜏 : 𝛽-VaR_Q[𝐿] ≤ 𝜏 ∀Q ∈ P(0, 1)}
                              = inf_{𝜏∈R} {𝜏 : Q(𝐿 ≥ 𝜏) ≤ 𝛽 ∀Q ∈ P(0, 1)}
                              = inf_{𝜏∈R} {𝜏 : 1/(1 + 𝜏²) ≤ 𝛽} = √((1 − 𝛽)/𝛽).

The second equality in the above derivation follows from (5.2), and the third equality follows from the Marshall and Olkin bound of Proposition 6.6. The final formula is obtained by analytically solving the minimization problem over 𝜏.
Step 2. The max-min inequality² and the definition of the 𝛽-CVaR imply that

    sup_{Q∈P(0,1)} 𝛽-CVaR_Q[𝐿] ≤ inf_{𝜏∈R} sup_{Q∈P(0,1)} 𝜏 + (1/𝛽) E_Q[max{𝐿 − 𝜏, 0}]
                               = inf_{𝜏∈R} 𝜏 + (1/(2𝛽)) (√(1 + 𝜏²) − 𝜏) = √((1 − 𝛽)/𝛽),

where the first equality follows from Scarf's bound derived in Proposition 6.5, and the last equality is obtained by analytically solving the convex minimization problem over 𝜏. The unique minimizer is given by

    𝜏★ = (1 − 2𝛽)/(2√(𝛽(1 − 𝛽))).

This completes Step 2. The claim then follows by combining the analytical formula for the worst-case 𝛽-VaR found in Step 1 and the analytical upper bound on the worst-case 𝛽-CVaR found in Step 2 with the elementary inequality (6.6). □

² The Chebyshev ambiguity set P(0, 1) is not weakly compact (see Example 3.9). Therefore, Sion's minimax theorem does not allow us to interchange the infimum over 𝜏 and the supremum over Q. While we could instead invoke Theorem 5.18, this is actually not needed to prove Proposition 6.10.
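The two analytical solves in Steps 1 and 2 are easy to confirm numerically. The sketch below minimizes the Scarf-bound objective 𝜏 + (√(1 + 𝜏²) − 𝜏)/(2𝛽) over 𝜏 and compares the minimizer and the optimal value with the closed forms above, for a handful of illustrative values of 𝛽.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def worst_case_beta_cvar(beta):
    # tau + (1/beta) * sup_Q E_Q[max(L - tau, 0)], with Scarf's bound for the inner sup
    obj = lambda tau: tau + (np.sqrt(1.0 + tau**2) - tau) / (2.0 * beta)
    res = minimize_scalar(obj)     # the objective is convex and coercive in tau
    return res.x, res.fun

checks = []
for beta in (0.05, 0.25, 0.5, 0.9):
    tau_num, val_num = worst_case_beta_cvar(beta)
    tau_cl = (1.0 - 2.0 * beta) / (2.0 * np.sqrt(beta * (1.0 - beta)))
    val_cl = np.sqrt((1.0 - beta) / beta)
    checks.append(abs(tau_num - tau_cl) < 1e-5 and abs(val_num - val_cl) < 1e-8)
print(all(checks))
```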
Propositions 6.9 and 6.10 provide an analytical formula for the Chebyshev risk
of a linear loss function provided that the underlying risk measure is the VaR or
the CVaR. The formula for the worst-case VaR was first derived in (Lanckriet et al.
2001, 2002, El Ghaoui et al. 2003); see also (Calafiore and El Ghaoui 2006). The
equality of the worst-case VaR and the worst-case CVaR was discovered in (Zymler
et al. 2013a). It not only holds for linear but also for arbitrary concave and arbitrary
quadratic (not necessarily concave) loss functions. Proposition 6.9 follows from
(Nguyen et al. 2021). The standard risk coefficient can be characterized in closed
form for a wealth of law-, translation- and scale-invariant risk measures other than
the VaR and the CVaR. It is available, for instance, for all spectral risk measures
and all risk measures that admit a Kusuoka representation (Li 2018) as well as all
distortion risk measures (Cai et al. 2023); see also (Nguyen et al. 2021).
6.8. Gelbrich Risk
Denote by G_𝑟(𝜇̂, Σ̂) the Gelbrich ambiguity set of all distributions P ∈ P(R^𝑑) whose mean-covariance pairs (𝜇, Σ) ∈ R^𝑑 × S^𝑑_+ reside in a ball of radius 𝑟 ≥ 0 around (𝜇̂, Σ̂) ∈ R^𝑑 × S^𝑑_+ with respect to the Gelbrich distance; see Definition 2.1. Recall from Section 2.1.4 that the Gelbrich ambiguity set accounts for moment ambiguity and thus often provides a more realistic account of uncertainty than a naïve Chebyshev ambiguity set. If the distribution of 𝑍 is only known to have a mean-covariance pair close to (𝜇̂, Σ̂), then it is natural to quantify the riskiness of an uncertain loss ℓ(𝑍) under a law-invariant risk measure 𝜚 by the Gelbrich risk

    sup_{P∈G_𝑟(𝜇̂,Σ̂)} 𝜚_P[ℓ(𝑍)].
By construction, G_𝑟(𝜇̂, Σ̂) is the union of all Chebyshev ambiguity sets P(𝜇, Σ) corresponding to a mean-covariance pair (𝜇, Σ) with G((𝜇, Σ), (𝜇̂, Σ̂)) ≤ 𝑟. This decomposition of the Gelbrich ambiguity set into Chebyshev ambiguity sets allows us via Proposition 6.9 to derive an analytical formula for the Gelbrich risk.
Proposition 6.11 (Gelbrich Risk). Assume that 𝜚 is a law-, translation- and scale-invariant risk measure with standard risk coefficient 𝛼 ∈ R₊, there is 𝜃 ∈ R^𝑑 with ℓ(𝑧) = 𝜃⊤𝑧 for all 𝑧 ∈ R^𝑑, and G_𝑟(𝜇̂, Σ̂) is the Gelbrich ambiguity set of all distributions on R^𝑑 whose mean-covariance pairs have a Gelbrich distance of at most 𝑟 ≥ 0 from (𝜇̂, Σ̂) ∈ R^𝑑 × S^𝑑_+. Then, the Gelbrich risk satisfies

    sup_{P∈G_𝑟(𝜇̂,Σ̂)} 𝜚_P[𝜃⊤𝑍] = 𝜇̂⊤𝜃 + 𝛼√(𝜃⊤Σ̂𝜃) + 𝑟√(1 + 𝛼²) ‖𝜃‖₂. (6.7)
Proof. Assume first that Σ̂ ≻ 0. If 𝜃 = 0, then the claim holds trivially because 𝜚 is law- and scale-invariant. If 𝑟 = 0, then the claim follows immediately from Proposition 6.9. We may thus assume that 𝜃 ≠ 0 and 𝑟 > 0. In this case, we have

    sup_{P∈G_𝑟(𝜇̂,Σ̂)} 𝜚_P[𝜃⊤𝑍]
      = sup { sup_{P∈P(𝜇,Σ)} 𝜚_P[𝜃⊤𝑍] : 𝜇 ∈ R^𝑑, Σ ∈ S^𝑑_+, G((𝜇, Σ), (𝜇̂, Σ̂)) ≤ 𝑟 }
      = sup { 𝜇⊤𝜃 + 𝛼√(𝜃⊤Σ𝜃) : 𝜇 ∈ R^𝑑, Σ ∈ S^𝑑_+,
              ‖𝜇 − 𝜇̂‖₂² + Tr(Σ + Σ̂ − 2(Σ̂^{1/2} Σ Σ̂^{1/2})^{1/2}) ≤ 𝑟² },

where the first equality exploits the decomposition of the Gelbrich ambiguity set into Chebyshev ambiguity sets. The second equality follows from Proposition 6.9 and Definition 2.1. By dualizing the resulting convex optimization problem, we find

    sup_{P∈G_𝑟(𝜇̂,Σ̂)} 𝜚_P[𝜃⊤𝑍] = inf_{𝛾∈R₊} { 𝛾(𝑟² − Tr(Σ̂)) + sup_{𝜇∈R^𝑑} { 𝜇⊤𝜃 − 𝛾‖𝜇 − 𝜇̂‖₂² }
                                + sup_{Σ∈S^𝑑_+} { 𝛼√(𝜃⊤Σ𝜃) − 𝛾 Tr(Σ − 2(Σ̂^{1/2} Σ Σ̂^{1/2})^{1/2}) } }. (6.8)

Strong duality holds because 𝑟 > 0, which implies that ( 𝜇, ˆ Σ̂) constitutes a Slater
point for the primal maximization problem. If 𝛾 = 0, then the maximization
problems over 𝜇 and Σ in (6.8) are unbounded. We may thus restrict 𝛾 to be strictly
positive. For any fixed 𝛾 > 0, the maximization problem over 𝜇 can be solved in
closed form. Its optimal value is given by 𝜇ˆ ⊤ 𝜃 + k𝜃 k 2 /(4𝛾). By introducing an
auxiliary variable 𝑡, the maximization problem over Σ can be reformulated as
1 1
 1

sup 𝛼𝑡 − 𝛾 Tr Σ − 2 Σ̂ 2 Σ Σ̂ 2 2
(6.9)
s.t. 𝑡 ∈ R+ , Σ ∈ S+𝑑 , 𝑡 2 − 𝜃 ⊤ Σ𝜃 ≤ 0.
Note that 𝑡 = 0 and Σ = 𝜃𝜃⊤ form a Slater point for (6.9) because 𝜃 ≠ 0. Thus, problem (6.9) admits a strong dual. The variable substitution 𝐵 ← (Σ̂^{1/2} Σ Σ̂^{1/2})^{1/2} allows us to reformulate this dual problem more concisely as

    inf_{𝜆∈R₊} sup_{𝑡∈R₊} { 𝛼𝑡 − 𝜆𝑡² } + sup_{𝐵∈S^𝑑_+} { Tr(𝐵²Δ_𝜆) + 2𝛾 Tr(𝐵) }, (6.10)

where Δ_𝜆 = Σ̂^{−1/2}(𝜆𝜃𝜃⊤ − 𝛾𝐼_𝑑)Σ̂^{−1/2} for any 𝜆 ≥ 0. Note that Δ_𝜆 is well-defined because Σ̂ ≻ 0. Recall now that the standard risk coefficient 𝛼 was assumed to be non-negative. If 𝜆 > 0, then the supremum over 𝑡 in (6.10) evaluates to 𝛼²/(4𝜆). Otherwise, if 𝜆 = 0, then this supremum evaluates to +∞. From now on, we may thus restrict the outer minimization problem in (6.10) to strictly positive 𝜆. Similarly, if Δ_𝜆 ⊀ 0, then the supremum over 𝐵 in (6.10) evaluates to +∞. From now on, we may thus restrict the outer minimization problem in (6.10) to 𝜆 that satisfy 𝛾𝐼_𝑑 − 𝜆𝜃𝜃⊤ ≻ 0. This constraint is equivalent to 𝜆 < 𝛾‖𝜃‖₂⁻² and guarantees that Δ_𝜆 ≺ 0. As 𝜆 > 0, this in turn implies that 𝐵★ = −𝛾Δ_𝜆⁻¹ is positive definite and satisfies the first-order optimality condition 𝐵Δ_𝜆 + Δ_𝜆𝐵 + 2𝛾𝐼_𝑑 = 0. Note that this optimality condition can be interpreted as a continuous Lyapunov equation, and therefore its solution 𝐵★ is in fact unique; see, e.g., (Hespanha 2019, Theorem 12.5). By making the implicit constraints on 𝜆 explicit and by evaluating the two suprema over 𝑡 and 𝐵 analytically, problem (6.10) can finally be reformulated as
    inf_{0<𝜆<𝛾‖𝜃‖₂⁻²} 𝛼²/(4𝜆) + 𝛾² Tr(Σ̂^{1/2}(𝛾𝐼_𝑑 − 𝜆𝜃𝜃⊤)⁻¹Σ̂^{1/2})
      = inf_{0<𝜆<𝛾‖𝜃‖₂⁻²} 𝛼²/(4𝜆) + 𝛾 Tr(Σ̂) + 𝜃⊤Σ̂𝜃/(𝜆⁻¹ − ‖𝜃‖₂²/𝛾)
      = 𝛾 Tr(Σ̂) + 𝛼²‖𝜃‖₂²/(4𝛾) + 𝛼√(𝜃⊤Σ̂𝜃).
Here, the first equality exploits the Sherman-Morrison formula (Bernstein 2009, Corollary 2.8.8) to rewrite the inverse matrix, and the second equality is obtained by solving the minimization problem over 𝜆 analytically. Indeed, the infimum is attained at the unique solution 𝜆★ of the first-order condition

    1/𝜆 = ‖𝜃‖₂²/𝛾 + (2/𝛼)√(𝜃⊤Σ̂𝜃)

in the interior of the feasible set. In summary, we have solved both embedded subproblems in (6.8) analytically. Substituting their optimal values into (6.8) yields
    sup_{P∈G_𝑟(𝜇̂,Σ̂)} 𝜚_P[𝜃⊤𝑍] = inf_{𝛾≥0} 𝜇̂⊤𝜃 + 𝛼√(𝜃⊤Σ̂𝜃) + 𝛾𝑟² + (1 + 𝛼²)‖𝜃‖₂²/(4𝛾)
                              = 𝜇̂⊤𝜃 + 𝛼√(𝜃⊤Σ̂𝜃) + 𝑟√(1 + 𝛼²)‖𝜃‖₂.
Here, the second equality is obtained by solving the minimization problem over 𝛾 in closed form. We have thus established the desired formula (6.7) for Σ̂ ≻ 0.
It remains to be shown that (6.7) remains valid even if Σ̂ is singular. To this end, use 𝐽(Σ̂) as a shorthand for the Gelbrich risk as a function of Σ̂. By leveraging Berge's maximum theorem (Berge 1963, pp. 115–116) and the continuity of the Gelbrich distance (see the discussion after Proposition 2.2), it is easy to show that 𝐽(Σ̂) is continuous on S^𝑑_+. The claim thus follows by noting that (6.7) holds for all Σ̂ ≻ 0, that both sides of (6.7) are continuous in Σ̂ and that every Σ̂ ∈ S^𝑑_+ can be expressed as a limit of positive definite matrices. □
Proposition 6.11 is due to Nguyen et al. (2021). It shows that, for a broad class of risk measures, the worst-case risk over a Gelbrich ambiguity set reduces to a Markowitz-type mean-variance risk functional with a 2-norm regularization term. We emphasize that the risk measure 𝜚 enters the resulting optimization model only indirectly through the standard risk coefficient 𝛼.
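As a numerical plausibility check of (6.7), one can sample mean-covariance pairs inside the Gelbrich ball and verify that their Chebyshev risks from Proposition 6.9 never exceed the closed-form Gelbrich risk. The instance below (dimension, radius and 𝛼) is an arbitrary choice for illustration.

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(2)
d, r, alpha = 3, 0.7, 1.5                    # alpha stands in for the standard risk coefficient
mu_hat = rng.standard_normal(d)
A = rng.standard_normal((d, d))
Sigma_hat = A @ A.T + 0.5 * np.eye(d)
theta = rng.standard_normal(d)
S_half = np.real(sqrtm(Sigma_hat))

def gelbrich(mu, Sigma):
    # G((mu, Sigma), (mu_hat, Sigma_hat)) per Definition 2.1
    root = np.real(sqrtm(S_half @ Sigma @ S_half))
    val = np.sum((mu - mu_hat) ** 2) + np.trace(Sigma + Sigma_hat - 2.0 * root)
    return np.sqrt(max(float(val), 0.0))     # clip tiny negative round-off

def chebyshev_risk(mu, Sigma):               # Proposition 6.9
    return theta @ mu + alpha * np.sqrt(theta @ Sigma @ theta)

gelbrich_risk = (theta @ mu_hat + alpha * np.sqrt(theta @ Sigma_hat @ theta)
                 + r * np.sqrt(1.0 + alpha**2) * np.linalg.norm(theta))

best = -np.inf
for _ in range(500):
    eps = rng.random()                       # shrink perturbations toward (mu_hat, Sigma_hat)
    mu = mu_hat + 0.4 * eps * rng.standard_normal(d)
    B = A + 0.2 * eps * rng.standard_normal((d, d))
    Sigma = B @ B.T + 0.5 * np.eye(d)
    if gelbrich(mu, Sigma) <= r:
        best = max(best, chebyshev_risk(mu, Sigma))
print(best > -np.inf and best <= gelbrich_risk + 1e-9)
```

Random sampling only certifies the upper-bound direction of (6.7); attaining the supremum requires the specific maximizers constructed in the proof.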
6.9. Worst-Case Expectations over Kullback-Leibler Ambiguity Sets
Consider the worst-case expectation problem

    sup_{P∈P(Z)} { E_P[ℓ(𝑍)] : KL(P, P̂) ≤ 𝑟 }, (6.11a)

which maximizes the expected value of ℓ(𝑍) over the Kullback-Leibler ambiguity set of all distributions supported on Z whose Kullback-Leibler divergence with respect to P̂ ∈ P(Z) is at most 𝑟 ≥ 0. The Kullback-Leibler ambiguity set is a 𝜙-divergence ambiguity set of the form (2.10), where 𝜙 satisfies 𝜙(𝑠) = 𝑠 log(𝑠) − 𝑠 + 1 for all 𝑠 ≥ 0. As 𝜙^∞(1) = +∞, we have KL(P, P̂) = ∞ unless P ≪ P̂. Hence, problem (6.11a) maximizes only over distributions P that are absolutely continuous with respect to P̂. Note that 𝜙*(𝑡) = 𝑒^𝑡 − 1 for all 𝑡 ∈ R. By Theorem 4.14 and the definition of the perspective function, the problem dual to (6.11a) is thus given by

    inf_{𝜆₀∈R, 𝜆∈R₊} 𝜆₀ + 𝜆(𝑟 − 1) + E_P̂[𝜆 exp((ℓ(𝑍) − 𝜆₀)/𝜆)]. (6.11b)

The problems (6.11a) and (6.11b) can be solved in closed form if the loss function ℓ is linear and the nominal distribution P̂ is Gaussian.
Proposition 6.12 (Worst-Case Expectations over KL Ambiguity Sets). Suppose that Z = R^𝑑, P̂ ∈ P(Z) is a normal distribution with mean 𝜇̂ ∈ R^𝑑 and covariance matrix Σ̂ ∈ S^𝑑_{++}, and 𝑟 > 0. Suppose also that ℓ is linear, that is, there exists 𝜃 ∈ R^𝑑 with ℓ(𝑧) = 𝜃⊤𝑧 for all 𝑧 ∈ Z. Then, the primal problem (6.11a) is solved by the normal distribution P★ with mean 𝜇̂ + √(2𝑟) Σ̂𝜃/√(𝜃⊤Σ̂𝜃) and covariance matrix Σ̂. The dual problem (6.11b) is solved by (𝜆★₀, 𝜆★), where 𝜆★ = √(𝜃⊤Σ̂𝜃)/√(2𝑟) and

    𝜆★₀ = 𝜆★ log E_P̂[exp(ℓ(𝑍)/𝜆★)].

The optimal values of (6.11a) and (6.11b) are both equal to 𝜇̂⊤𝜃 + √(2𝑟) √(𝜃⊤Σ̂𝜃).
Proof. Focus first on the dual problem (6.11b), and fix any 𝜆 ≥ 0. Then, the partial minimization problem over 𝜆₀ is solved by

    𝜆★₀(𝜆) = 𝜆 log E_P̂[exp(ℓ(𝑍)/𝜆)].

Substituting this parametric minimizer back into (6.11b) shows that the optimal value of the dual problem (6.11b) is given by

    inf_{𝜆∈R₊} 𝜆𝑟 + 𝜆 log E_P̂[exp(ℓ(𝑍)/𝜆)] = inf_{𝜆∈R₊} 𝜆𝑟 + 𝜇̂⊤𝜃 + 𝜃⊤Σ̂𝜃/(2𝜆)
                                          = 𝜇̂⊤𝜃 + √(2𝑟) √(𝜃⊤Σ̂𝜃),
where the first equality exploits the linearity of ℓ, the normality of P̂ and the formula for the expected value of a log-normal distribution. The second equality holds because the minimization problem over 𝜆 ≥ 0 is solved by 𝜆★ = √(𝜃⊤Σ̂𝜃)/√(2𝑟). Next, define P★ ∈ P(Z) as the normal distribution with mean 𝜇★ = 𝜇̂ + Σ̂𝜃/𝜆★ and covariance matrix Σ★ = Σ̂. Comparing the density functions of P̂ and P★ shows that

    dP★/dP̂(𝑧) = exp(𝜃⊤(𝑧 − 𝜇̂)/𝜆★ − 𝜃⊤Σ̂𝜃/(2(𝜆★)²)) ∀𝑧 ∈ Z.

By Definition 2.8, we thus obtain

    KL(P★, P̂) = ∫_Z log(dP★/dP̂(𝑧)) dP★(𝑧) = 𝜃⊤Σ̂𝜃/(2(𝜆★)²) = 𝑟,

where the second and the third equalities follow readily from our formula for the Radon-Nikodym derivative dP★/dP̂ and from basic algebra, respectively. Hence, P★ is feasible in (6.11a). In addition, its objective function value is given by

    E_{P★}[ℓ(𝑍)] = 𝜃⊤𝜇★ = 𝜇̂⊤𝜃 + √(2𝑟) √(𝜃⊤Σ̂𝜃).

As the objective function values of P★ and (𝜆★₀, 𝜆★) with 𝜆★₀ = 𝜆★₀(𝜆★) match, weak duality as established in Theorem 4.14 implies that P★ is primal optimal and that (𝜆★₀, 𝜆★) is dual optimal. This observation completes the proof. □
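The optimality claim is easy to verify directly: for two Gaussians with the same covariance, the KL divergence reduces to ½(𝜇★ − 𝜇̂)⊤Σ̂⁻¹(𝜇★ − 𝜇̂), so the shifted mean above exhausts the budget 𝑟 exactly. The sketch below checks this on a random instance (our choice for illustration).

```python
import numpy as np

rng = np.random.default_rng(3)
d, r = 3, 0.25
mu_hat = rng.standard_normal(d)
A = rng.standard_normal((d, d))
Sigma_hat = A @ A.T + np.eye(d)
theta = rng.standard_normal(d)

s = np.sqrt(theta @ Sigma_hat @ theta)
lam_star = s / np.sqrt(2.0 * r)                   # optimal dual variable
mu_star = mu_hat + Sigma_hat @ theta / lam_star   # mean of the worst-case Gaussian P*

# KL between Gaussians with identical covariance is a pure Mahalanobis term
diff = mu_star - mu_hat
kl = 0.5 * diff @ np.linalg.solve(Sigma_hat, diff)

worst = theta @ mu_star                           # E_{P*}[theta'Z]
closed_form = theta @ mu_hat + np.sqrt(2.0 * r) * s
print(round(kl, 10), np.isclose(worst, closed_form))
```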
Proposition 6.12 is due to Hu and Hong (2013). It is also reminiscent of risk-sensitive control theory (Hansen and Sargent 2008). In this stream of literature, a fictitious adversary may perturb the distribution of the exogenous noise terms of an optimal control problem arbitrarily but incurs a penalty equal to the Kullback-Leibler divergence with respect to a Gaussian baseline model.
6.10. Worst-Case Expectations over Total Variation Balls
Consider the worst-case expectation problem

    sup_{P∈P(Z)} { E_P[ℓ(𝑍)] : TV(P, P̂) ≤ 𝑟 }, (6.12a)

which maximizes the expected value of ℓ(𝑍) over a total variation ball of radius 𝑟 ∈ [0, 1] around P̂ ∈ P(Z). Recall from Section 2.2.3 that the total variation distance is a 𝜙-divergence and that the underlying entropy function satisfies 𝜙(𝑠) = |𝑠 − 1|/2 for all 𝑠 ≥ 0 and 𝜙(𝑠) = ∞ for all 𝑠 < 0. Recall also that the total variation distance between two distributions is bounded above by 1 and that this bound is attained if the two distributions are mutually singular. An elementary calculation reveals that the conjugate entropy function satisfies 𝜙*(𝑡) = max{𝑡 + 1/2, 0} − 1/2 if 𝑡 ≤ 1/2 and 𝜙*(𝑡) = +∞ if 𝑡 > 1/2. By Theorem 4.14, the problem dual to (6.12a) is thus given by

    inf_{𝜆₀∈R, 𝜆∈R₊} 𝜆₀ + 𝜆(𝑟 − 1/2) + E_P̂[max{ℓ(𝑍) − 𝜆₀ + 𝜆/2, 0}]
    s.t. 𝜆₀ + 𝜆/2 ≥ sup_{𝑧∈Z} ℓ(𝑧). (6.12b)

The problems (6.12a) and (6.12b) can be solved in closed form if Z is compact.
Proposition 6.13 (Worst-Case Expectations over Total Variation Balls). Suppose that Z ⊆ R^𝑑 is compact, P̂ ∈ P(Z) and 𝑟 ∈ (0, 1), and define 𝛽𝑟 = 1 − 𝑟. In addition, assume that E_P̂[ℓ(𝑍)] > −∞ and ℓ is upper semicontinuous. Then, the optimal values of (6.12a) and (6.12b) are both equal to

    (1 − 𝛽𝑟) · sup_{𝑧∈Z} ℓ(𝑧) + 𝛽𝑟 · 𝛽𝑟-CVaR_P̂[ℓ(𝑍)]. (6.13)
The proof of Proposition 6.13 will reveal that (6.12a) and (6.12b) are both solvable. Indeed, we will construct optimal solutions P★ and (𝜆★₀, 𝜆★) for (6.12a) and (6.12b), respectively. A precise description of these optimizers is cumbersome and thus omitted from the proposition statement. If the loss ℓ(𝑍) has a continuous distribution under P̂, however, then P★ admits a simpler and more intuitive description. Indeed, in this case, P★ is obtained from P̂ by shifting the probability mass of all outcomes 𝑧 ∈ Z associated with a high loss ℓ(𝑧) ≥ 𝛽𝑟-VaR_P̂[ℓ(𝑍)] to some outcome 𝑧 ∈ Z associated with the highest possible loss ℓ(𝑧) = max_{𝑧′∈Z} ℓ(𝑧′).
Proof of Proposition 6.13. For ease of notation, set ℓ̄ = sup_{𝑧∈Z} ℓ(𝑧). Focus first on the dual problem (6.12b), and fix any 𝜆 ≥ 0. Note that the dual objective function is non-decreasing in 𝜆₀. The partial minimization problem over 𝜆₀ is therefore solved by 𝜆★₀(𝜆) = ℓ̄ − 𝜆/2. Substituting this parametric minimizer back into (6.12b) shows that the optimal value of the dual problem is given by

    ℓ̄ + inf_{𝜆∈R₊} 𝜆(𝑟 − 1) + E_P̂[max{ℓ(𝑍) − ℓ̄ + 𝜆, 0}]
      = 𝑟ℓ̄ + (1 − 𝑟) inf_{𝜏≤ℓ̄} { 𝜏 + (1 − 𝑟)⁻¹ E_P̂[max{ℓ(𝑍) − 𝜏, 0}] },

where the equality follows from the substitution 𝜏 ← ℓ̄ − 𝜆. By Definition 5.10, the infimum over 𝜏 evaluates to 𝛽𝑟-CVaR_P̂[ℓ(𝑍)] with 𝛽𝑟 = 1 − 𝑟. Recall that this infimum is attained by 𝜏★ = 𝛽𝑟-VaR_P̂[ℓ(𝑍)], which is bounded above by ℓ̄. In summary, we have thus shown that the optimal value of problem (6.12b) equals

    (1 − 𝛽𝑟) · ℓ̄ + 𝛽𝑟 · 𝛽𝑟-CVaR_P̂[ℓ(𝑍)].

To construct a primal maximizer, assume first that P̂(ℓ(𝑍) < ℓ̄) ≤ 𝑟, which implies that 𝛽𝑟-CVaR_P̂[ℓ(𝑍)] = ℓ̄. Thus, the optimal value of the dual problem (6.12b) simplifies to ℓ̄, which is attained by any distribution P★ that is obtained from P̂ by moving all probability mass from {𝑧 ∈ Z : ℓ(𝑧) < ℓ̄} to {𝑧 ∈ Z : ℓ(𝑧) = ℓ̄}.
Next, assume that P̂(ℓ(𝑍) < ℓ̄) > 𝑟, which implies that 𝛽𝑟-VaR_P̂[ℓ(𝑍)] < ℓ̄. In this case, we partition Z into the following four subsets.

    Z₁ = {𝑧 ∈ Z : 𝛽𝑟-VaR_P̂[ℓ(𝑍)] > ℓ(𝑧)}
    Z₂ = {𝑧 ∈ Z : ℓ̄ > ℓ(𝑧) = 𝛽𝑟-VaR_P̂[ℓ(𝑍)]}
    Z₃ = {𝑧 ∈ Z : ℓ̄ > ℓ(𝑧) > 𝛽𝑟-VaR_P̂[ℓ(𝑍)]}
    Z₄ = {𝑧 ∈ Z : ℓ̄ = ℓ(𝑧)}

Note that Z₁ and Z₃ can be empty, whereas Z₂ and Z₄ must be non-empty. We also define P̂_𝑖 as the nominal distribution P̂ conditioned on the event 𝑍 ∈ Z_𝑖 for all 𝑖 ∈ [4], and we define U_{Z₄} as the uniform distribution on Z₄. Next, we set

    P★ = (𝛽𝑟 − P̂(𝑍 ∈ Z₃) − P̂(𝑍 ∈ Z₄)) · P̂₂ + P̂(𝑍 ∈ Z₃) · P̂₃ + P̂(𝑍 ∈ Z₄) · P̂₄ + (1 − 𝛽𝑟) · U_{Z₄}.
Thus, P★ is a mixture of four probability distributions. As the non-negative mixture probabilities sum to 1, P★ is a probability distribution. Using 𝜌 = P̂ + U_{Z₄} as a dominating measure for P̂ and P★ and recalling that 𝜙(𝑠) = |𝑠 − 1|/2 if 𝑠 ≥ 0, we find

    TV(P★, P̂) = D_𝜙(P★, P̂) = (1/2) Σ_{𝑖=1}^{4} ∫_{Z_𝑖} |dP★/d𝜌(𝑧) − dP̂/d𝜌(𝑧)| d𝜌(𝑧)
               = (1/2) (P̂(𝑍 ∈ Z₁) + P̂(𝑍 ∈ Z₂) + P̂(𝑍 ∈ Z₃) + P̂(𝑍 ∈ Z₄) − 𝛽𝑟 + 0 + (1 − 𝛽𝑟)) = 𝑟,

where the third equality follows from the definition of P★ and the relation

    P̂(𝑍 ∈ Z₂) + P̂(𝑍 ∈ Z₃) + P̂(𝑍 ∈ Z₄) = P̂(ℓ(𝑍) ≥ 𝛽𝑟-VaR_P̂[ℓ(𝑍)]) ≥ 𝛽𝑟,

and the last equality follows from the definition of 𝛽𝑟. Thus, P★ is feasible in (6.12a).
In addition, the objective function value of P★ in (6.12a) amounts to

    E_{P★}[ℓ(𝑍)] = (1 − 𝛽𝑟) · ℓ̄
                 + E_P̂[ℓ(𝑍) | ℓ(𝑍) > 𝛽𝑟-VaR_P̂[ℓ(𝑍)]] · P̂(ℓ(𝑍) > 𝛽𝑟-VaR_P̂[ℓ(𝑍)])
                 + E_P̂[ℓ(𝑍) | ℓ(𝑍) = 𝛽𝑟-VaR_P̂[ℓ(𝑍)]] · (𝛽𝑟 − P̂(ℓ(𝑍) > 𝛽𝑟-VaR_P̂[ℓ(𝑍)]))
                 = (1 − 𝛽𝑟) · ℓ̄ + 𝛽𝑟 · 𝛽𝑟-CVaR_P̂[ℓ(𝑍)].

Here, the second equality follows from (Föllmer and Schied 2008, Theorem 4.47 & Remark 4.48). Note that if the marginal distribution of ℓ(𝑍) is continuous under P̂, then the above derivation simplifies. Indeed, in this case we have

    P̂(ℓ(𝑍) > 𝛽𝑟-VaR_P̂[ℓ(𝑍)]) = 𝛽𝑟   and   E_P̂[ℓ(𝑍) | ℓ(𝑍) > 𝛽𝑟-VaR_P̂[ℓ(𝑍)]] = 𝛽𝑟-CVaR_P̂[ℓ(𝑍)].
Irrespective of P̂, the objective function value of P★ in (6.12a) matches the optimal value of (6.12b). Weak duality as established in Theorem 4.14 thus implies that P★ solves the primal problem (6.12a). This observation completes the proof. □

Jiang and Guan (2018) and Shapiro (2017) study a variant of problem (6.12a) that maximizes over a restricted total variation ball. Thus, they additionally impose P ≪ P̂ in (6.12a). The supremum of the resulting restricted problem amounts to

    (1 − 𝛽𝑟) · ess sup_P̂[ℓ(𝑍)] + 𝛽𝑟 · 𝛽𝑟-CVaR_P̂[ℓ(𝑍)],

which may be strictly smaller than (6.13). If additionally ℓ(𝑍) has a continuous marginal distribution under P̂, then the supremum is no longer attained.
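For a discrete nominal distribution, the primal problem (6.12a) is a small linear program, which makes (6.13) easy to cross-check numerically. The sketch below (the instance is arbitrary, and Z is taken to be the finite support of P̂) compares the LP value with the closed form.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)
m, r = 12, 0.15
beta = 1.0 - r
losses = rng.standard_normal(m)
p_hat = rng.dirichlet(np.ones(m))

# LP over (p, t): max losses'p  s.t.  p >= 0, 1'p = 1, t >= |p - p_hat|, (1/2) 1't <= r
I = np.eye(m)
A_ub = np.block([[I, -I], [-I, -I], [np.zeros((1, m)), np.ones((1, m))]])
b_ub = np.concatenate([p_hat, -p_hat, [2.0 * r]])
A_eq = np.concatenate([np.ones(m), np.zeros(m)])[None, :]
res = linprog(np.concatenate([-losses, np.zeros(m)]),
              A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], bounds=(0, None))
lp_val = -res.fun

# Closed form (6.13); the CVaR minimizer tau lies in the finite support of the losses
cvar = min(tau + (p_hat @ np.maximum(losses - tau, 0.0)) / beta for tau in losses)
closed = (1.0 - beta) * losses.max() + beta * cvar
print(np.isclose(lp_val, closed))
```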
6.11. Worst-Case Expectations over Lévy-Prokhorov Balls
Consider the worst-case expectation problem

    sup_{P∈P(Z)} { E_P[ℓ(𝑍)] : LP(P, P̂) ≤ 𝑟 }, (6.14a)

which maximizes the expected value of ℓ(𝑍) over a Lévy-Prokhorov ball of radius 𝑟 ∈ [0, 1] around P̂ ∈ P(Z). We assume here that the Lévy-Prokhorov distance is induced by a norm ‖·‖ on R^𝑑. By Proposition 2.22, the Lévy-Prokhorov ball of radius 𝑟 ∈ (0, 1) coincides with the optimal transport ambiguity set

    P = {P ∈ P(Z) : OT_{𝑐_𝑟}(P, P̂) ≤ 𝑟},

where the transportation cost function 𝑐_𝑟 is defined through 𝑐_𝑟(𝑧, 𝑧̂) = 1_{‖𝑧−𝑧̂‖>𝑟}. Theorem 4.18 thus implies that the problem dual to (6.14a) is given by

    inf_{𝜆∈R₊} 𝜆𝑟 + E_P̂[sup_{𝑧∈Z} ℓ(𝑧) − 𝜆𝑐_𝑟(𝑧, 𝑍̂)] (6.14b)

whenever ℓ is upper semicontinuous. If Z is compact, then we can leverage Proposition 6.13 to solve the problems (6.14a) and (6.14b) in closed form.
Proposition 6.14 (Worst-Case Expectations over Lévy-Prokhorov Balls). Suppose that Z ⊆ R^𝑑 is compact, P̂ ∈ P(Z) and 𝑟 ∈ (0, 1), and define 𝛽𝑟 = 1 − 𝑟. In addition, assume that E_P̂[ℓ(𝑍)] > −∞ and ℓ is upper semicontinuous. Then, the optimal values of (6.14a) and (6.14b) are both equal to

    (1 − 𝛽𝑟) · sup_{𝑧∈Z} ℓ(𝑧) + 𝛽𝑟 · 𝛽𝑟-CVaR_P̂[ℓ_𝑟(𝑍̂)], (6.15)

where ℓ_𝑟(𝑧̂) = sup_{𝑧∈Z} {ℓ(𝑧) : ‖𝑧 − 𝑧̂‖ ≤ 𝑟} is an adversarial loss function that assigns each 𝑧̂ ∈ Z the worst-case loss in the 𝑟-neighborhood of 𝑧̂.
The proof of Proposition 6.14 will reveal that (6.14a) and (6.14b) are both solvable. However, a precise description of the respective optimizers is cumbersome and thus omitted from the proposition statement. Note that the adversarial loss function ℓ_𝑟 inherits upper semicontinuity from ℓ thanks to (Berge 1963, Theorem 2, p. 116). The following lemma is needed in the proof of Proposition 6.14.
Lemma 6.15. Assume that Z ⊆ R^𝑑 is compact, ℓ is upper semicontinuous, 𝑧̂ ∈ Z and 𝑟, 𝜆 ≥ 0. Then, the following identity holds.

    sup_{𝑧∈Z} { ℓ(𝑧) − 𝜆 · 1_{‖𝑧−𝑧̂‖>𝑟} } = sup_{𝑧∈Z} { ℓ_𝑟(𝑧) − 𝜆 · 1_{𝑧≠𝑧̂} }.

Proof. For ease of notation we introduce two auxiliary functions 𝑓 and 𝑔 from Z to R, which are defined through 𝑓(𝑧) = ℓ(𝑧) − 𝜆 · 1_{‖𝑧−𝑧̂‖>𝑟} and 𝑔(𝑧) = ℓ_𝑟(𝑧) − 𝜆 · 1_{𝑧≠𝑧̂} for all 𝑧 ∈ Z. Note that both 𝑓 and 𝑔 are upper semicontinuous.
First, select 𝑧★ ∈ arg max_{𝑧∈Z} 𝑓(𝑧), which exists because Z is compact and 𝑓 is upper semicontinuous. If ‖𝑧★ − 𝑧̂‖ > 𝑟, then the definition of ℓ_𝑟 implies that

    sup_{𝑧∈Z} 𝑓(𝑧) = 𝑓(𝑧★) = ℓ(𝑧★) − 𝜆 ≤ ℓ_𝑟(𝑧★) − 𝜆 = 𝑔(𝑧★) ≤ sup_{𝑧∈Z} 𝑔(𝑧).

On the other hand, if ‖𝑧★ − 𝑧̂‖ ≤ 𝑟, then

    sup_{𝑧∈Z} 𝑓(𝑧) = 𝑓(𝑧★) = ℓ(𝑧★) ≤ ℓ_𝑟(𝑧̂) = 𝑔(𝑧̂) ≤ sup_{𝑧∈Z} 𝑔(𝑧).

Next, select 𝑧̃ ∈ arg max_{𝑧∈Z} 𝑔(𝑧). If 𝑧̃ ≠ 𝑧̂, then with 𝑧★ ∈ arg max_{𝑧∈Z} ℓ(𝑧) we have

    sup_{𝑧∈Z} 𝑔(𝑧) = 𝑔(𝑧̃) = ℓ_𝑟(𝑧̃) − 𝜆 ≤ ℓ(𝑧★) − 𝜆 ≤ 𝑓(𝑧★) ≤ sup_{𝑧∈Z} 𝑓(𝑧),

where the inequalities follow from the definition of 𝑧★ and the non-negativity of 𝜆. Conversely, if 𝑧̃ = 𝑧̂, then with 𝑧★_𝑟 ∈ arg max_{𝑧′∈Z} {ℓ(𝑧′) : ‖𝑧′ − 𝑧̂‖ ≤ 𝑟} we have

    sup_{𝑧∈Z} 𝑔(𝑧) = 𝑔(𝑧̃) = ℓ_𝑟(𝑧̂) = ℓ(𝑧★_𝑟) = 𝑓(𝑧★_𝑟) ≤ sup_{𝑧∈Z} 𝑓(𝑧).

Thus, the claim follows. □
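The identity of Lemma 6.15 can be checked mechanically on a discretized instance, since it holds for any compact Z and thus in particular for a finite grid (the 1-D grid and the loss function below are arbitrary choices for illustration):

```python
import numpy as np

z = np.linspace(0.0, 1.0, 501)                    # finite Z standing in for [0, 1]
ell = np.sin(7.0 * z) + 0.5 * np.cos(13.0 * z)    # an arbitrary loss on the grid
r, lam = 0.12, 0.8

# adversarial loss: best value of ell within distance r of each grid point
ell_r = np.array([ell[np.abs(z - zi) <= r].max() for zi in z])

checks = []
for j in (0, 150, 385, 500):                      # a few anchor points z_hat
    zhat = z[j]
    lhs = np.max(ell - lam * (np.abs(z - zhat) > r))
    rhs = np.max(ell_r - lam * (z != zhat))
    checks.append(np.isclose(lhs, rhs))
print(all(checks))
```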
Proof of Proposition 6.14. Lemma 6.15 allows us to reformulate the dual problem (6.14b) in terms of the adversarial loss function ℓ_𝑟 as

    inf_{𝜆∈R₊} 𝜆𝑟 + E_P̂[sup_{𝑧∈Z} ℓ_𝑟(𝑧) − 𝜆 · 1_{𝑧≠𝑍̂}]. (6.16)

As 𝑟 > 0, Z is compact and ℓ_𝑟 is upper semicontinuous, Theorem 4.18 implies that (6.16) is the strong dual of a problem that maximizes the expected value of the adversarial loss function ℓ_𝑟 over an optimal transport ambiguity set corresponding to the transportation cost function 𝑐₀(𝑧, 𝑧̂) = 1_{𝑧≠𝑧̂}. Its optimal value thus matches

    sup_{P∈P(Z)} { E_P[ℓ_𝑟(𝑍)] : OT_{𝑐₀}(P, P̂) ≤ 𝑟 } = sup_{P∈P(Z)} { E_P[ℓ_𝑟(𝑍)] : TV(P, P̂) ≤ 𝑟 },

where the equality holds because TV = OT_{𝑐₀} as shown in Proposition 2.24. Since sup_{𝑧∈Z} ℓ_𝑟(𝑧) = sup_{𝑧∈Z} ℓ(𝑧) = ℓ̄, Proposition 6.13 readily implies that the supremum of the resulting maximization problem over a total variation ball is given by

    (1 − 𝛽𝑟) · ℓ̄ + 𝛽𝑟 · 𝛽𝑟-CVaR_P̂[ℓ_𝑟(𝑍̂)].
Assume now that 𝜓 : Z → Z is a Borel measurable function satisfying

    𝜓(𝑧̂) ∈ arg max_{𝑧∈Z} {ℓ(𝑧) : ‖𝑧 − 𝑧̂‖ ≤ 𝑟} ∀𝑧̂ ∈ Z,

which exists thanks to (Rockafellar and Wets 2009, Corollary 14.6 and Theorem 14.37), and define P̂_𝜓 = P̂ ∘ 𝜓⁻¹ as the pushforward distribution of P̂ under 𝜓. Next, we construct a primal maximizer under the assumption that P̂_𝜓(ℓ(𝑍) < ℓ̄) > 𝑟. To this end, we partition Z into the following four subsets.

    Z₁ = {𝑧 ∈ Z : 𝛽𝑟-VaR_{P̂_𝜓}[ℓ(𝑍̂)] > ℓ(𝑧)}
    Z₂ = {𝑧 ∈ Z : ℓ̄ > ℓ(𝑧) = 𝛽𝑟-VaR_{P̂_𝜓}[ℓ(𝑍̂)]}
    Z₃ = {𝑧 ∈ Z : ℓ̄ > ℓ(𝑧) > 𝛽𝑟-VaR_{P̂_𝜓}[ℓ(𝑍̂)]}
    Z₄ = {𝑧 ∈ Z : ℓ̄ = ℓ(𝑧)}

We also define P̂_𝑖 as the distribution P̂_𝜓 conditioned on the event 𝑍̂ ∈ Z_𝑖 for all 𝑖 ∈ [4], and we define U_{Z₄} as the uniform distribution on Z₄. Next, we set

    P★ = (𝛽𝑟 − P̂_𝜓(𝑍̂ ∈ Z₃) − P̂_𝜓(𝑍̂ ∈ Z₄)) · P̂₂ + P̂_𝜓(𝑍̂ ∈ Z₃) · P̂₃ + P̂_𝜓(𝑍̂ ∈ Z₄) · P̂₄ + (1 − 𝛽𝑟) · U_{Z₄}.

Note that P★ is constructed as in the proof of Proposition 6.13, the only difference
being that P̂ is now replaced with its pushforward distribution P̂ 𝜓 . We then find

LP(P★, P̂) ≤ max OT𝑐𝑟 (P★, P̂), 𝑟

≤ max OT𝑐𝑟 (P★, P̂ 𝜓 ) + OT𝑐𝑟 (P̂ 𝜓 , P̂), 𝑟

≤ max TV(P★, P̂ 𝜓 ), 𝑟 = 𝑟,
where the first inequality follows from Proposition 2.22, and the second inequality
holds because 𝑐𝑟 is a pseudo-metric on Z, which implies that OT𝑐𝑟 is a pseudo-
metric on P(Z) and thus satisfies the triangle inequality. The third inequality holds
because OT𝑐𝑟 (P̂ 𝜓 , P̂) = 0 and because 𝑐0 (𝑧, 𝑧ˆ) ≥ 𝑐𝑟 (𝑧, 𝑧ˆ) for all 𝑧, 𝑧ˆ ∈ Z, which
implies that OT𝑐𝑟 (P★, P̂ 𝜓 ) ≤ TV(P★, P̂ 𝜓 ). Finally, the equality follows from the
proof of Proposition 6.13, which ensures that TV(P★, P̂ 𝜓 ) = 𝑟. We also have
 
E P★ [ℓ(𝑍)] = (1 − 𝛽𝑟 ) · ℓ + 𝛽𝑟 · 𝛽𝑟 -CVaRP̂ 𝜓 ℓ( 𝑍) ˆ
 
= (1 − 𝛽𝑟 ) · ℓ + 𝛽𝑟 · 𝛽𝑟 -CVaRP̂ ℓ(𝜓( 𝑍))ˆ .

where the two equalities follow again from the proof of Proposition 6.13 and from
the measure-theoretic change of variables formula, respectively. As ℓ(𝜓(ˆ𝑧)) = ℓ𝑟 (ˆ𝑧)
for every 𝑧ˆ ∈ Z, the objective function value of P★ in (6.14a) matches the optimal
value of the dual problem (6.14b). Weak duality as established in Theorem 4.18
thus implies that P★ solves the primal problem (6.14a). If P̂ 𝜓 (ℓ(𝑍) < ℓ) ≤ 𝑟, the
construction of a primal maximizer is simpler and thus omitted for brevity. 
110 D. Kuhn, S. Shafiee, and W. Wiesemann

The results of this section were first obtained by Bennouna and Van Parys (2023)
under the assumption that the nominal distribution P̂ is discrete.

6.12. Worst-Case Expectations over ∞-Wasserstein Balls


Consider the worst-case expectation problem

sup P∈P(Z) {E P [ℓ(𝑍)] : W∞(P, P̂) ≤ 𝑟}, (6.17a)

which maximizes the expected value of ℓ(𝑍) over an ∞-Wasserstein ball of radius 𝑟 ∈ R+ around P̂ ∈ P(Z). We assume here that the ∞-Wasserstein distance is induced by a given norm ‖ · ‖ on R𝑑. Recall from Proposition 2.27 that the ∞-Wasserstein ambiguity set coincides with the optimal transport ambiguity set

P = {P ∈ P(Z) : OT𝑐𝑟 (P, P̂) ≤ 0},

where the transportation cost function 𝑐𝑟 is defined through 𝑐𝑟 (𝑧, 𝑧ˆ) = 1{‖𝑧−𝑧ˆ‖>𝑟}.
We emphasize that, while the radius of the ∞-Wasserstein ball under considera-
tion is 𝑟, the radius of the corresponding optimal transport ambiguity set P is 0.
Theorem 4.18 thus implies that the problem dual to (6.17a) is given by
 
inf 𝜆∈R+ E P̂ [sup 𝑧∈Z ℓ(𝑧) − 𝜆𝑐𝑟 (𝑧, 𝑍ˆ)] (6.17b)

whenever ℓ is upper semicontinuous. If Z is compact, then the problems (6.17a) and (6.17b) can be solved in closed form.
Proposition 6.16 (Worst-Case Expectations over ∞-Wasserstein Balls). Suppose that Z ⊆ R𝑑 is compact, P̂ ∈ P(Z), 𝑟 ∈ R+, E P̂ [ℓ(𝑍ˆ)] > −∞ and ℓ is upper semicontinuous. Define the adversarial loss function ℓ𝑟 (𝑧ˆ) = sup 𝑧∈Z {ℓ(𝑧) : ‖𝑧 − 𝑧ˆ‖ ≤ 𝑟} as in Proposition 6.14, and let 𝜓 : Z → Z be a Borel function that satisfies

𝜓(𝑧ˆ) ∈ arg max 𝑧∈Z {ℓ(𝑧) : ‖𝑧 − 𝑧ˆ‖ ≤ 𝑟} ∀𝑧ˆ ∈ Z.

Then, the primal problem (6.17a) is solved by P★ = P̂ ◦ 𝜓⁻¹. In addition, the optimal values of (6.17a) and (6.17b) are both equal to E P̂ [ℓ𝑟 (𝑍ˆ)].
Proof. Note that the Borel function 𝜓 exists thanks to (Rockafellar and Wets 2009,
Corollary 14.6 and Theorem 14.37). This ensures that the pushforward distribution
P★ = P̂ ◦ 𝜓⁻¹ is well-defined. Note also that P★ is feasible in (6.17a) because

W∞(P★, P̂) = inf {𝑟′ ≥ 0 : OT𝑐𝑟′ (P★, P̂) ≤ 0} ≤ 𝑟,

where the equality follows from Proposition 2.27 with 𝑑(𝑧, 𝑧ˆ) = ‖𝑧 − 𝑧ˆ‖, and the inequality holds because OT𝑐𝑟 (P★, P̂) = 0. We also have

E P★ [ℓ(𝑍)] = E P̂ [ℓ(𝜓(𝑍))] = E P̂ [ℓ𝑟 (𝑍)].
Next, note that sup 𝑧 ∈Z ℓ(𝑧) − 𝜆𝑐𝑟 (𝑧, 𝑧ˆ) is non-increasing in 𝜆 for any fixed 𝑧ˆ ∈ Z.

Also, it is uniformly bounded above by sup 𝑧 ∈Z ℓ(𝑧), which is a finite constant thanks
to the compactness of Z and the upper semicontinuity of ℓ. By the monotone
convergence theorem, the optimal value of the dual problem (6.17b) thus satisfies

inf 𝜆∈R+ E P̂ [sup 𝑧∈Z ℓ(𝑧) − 𝜆𝑐𝑟 (𝑧, 𝑍ˆ)] = E P̂ [inf 𝜆∈R+ sup 𝑧∈Z ℓ(𝑧) − 𝜆𝑐𝑟 (𝑧, 𝑍ˆ)] = E P̂ [ℓ𝑟 (𝑍ˆ)],
where the second equality holds because Z is compact. Weak duality as established
in Theorem 4.18 thus implies that P★ solves the primal problem (6.17a). 

Proposition 6.16 shows that the worst-case expectation of the original loss ℓ(𝑍) with respect to an ∞-Wasserstein ball coincides with the crisp expectation of the adversarial loss ℓ𝑟 (𝑍ˆ) with respect to the nominal distribution P̂. This result was first discovered by Gao et al. (2017) for discrete nominal distributions and later extended by Gao et al. (2024) to general nominal distributions. The loss function ℓ𝑟
is routinely used in machine learning for the adversarial training of neural net-
works (Szegedy, Zaremba, Sutskever, Bruna, Erhan, Goodfellow and Fergus 2014,
Goodfellow et al. 2015). Proposition 6.16 thus reveals an intimate connection
between adversarial training and distributionally robust optimization with respect
to an ∞-Wasserstein ambiguity set. This connection has been further explored in
the context of adversarial classification by García Trillos and García Trillos (2022),
García Trillos and Murray (2022), García Trillos and Jacobs (2023), Bungert et al.
(2023, 2024), Pydi and Jog (2024), Frank and Niles-Weed (2024a) and Frank and
Niles-Weed (2024b).
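The identity of Proposition 6.16 is easy to probe numerically. The following sketch is an illustration only; it uses the toy loss ℓ(𝑧) = |𝑧| on the real line (with Z taken large enough to contain the shifted points), for which the adversarial loss is ℓ𝑟 (𝑧ˆ) = |𝑧ˆ| + 𝑟 and a maximizing Borel map is 𝜓(𝑧ˆ) = 𝑧ˆ + 𝑟 sign(𝑧ˆ).

```python
import numpy as np

rng = np.random.default_rng(0)
z_hat = rng.uniform(-1.0, 1.0, size=10_000)  # samples from the nominal P̂
r = 0.3                                      # radius of the ∞-Wasserstein ball

# ℓ(z) = |z|, so ℓ_r(ẑ) = sup{|z| : |z - ẑ| ≤ r} = |ẑ| + r, attained by
# the pushforward map ψ(ẑ) = ẑ + r·sign(ẑ).
worst_case = np.abs(z_hat + r * np.sign(z_hat)).mean()  # E_{P★}[ℓ(Z)], P★ = P̂ ∘ ψ⁻¹
nominal_adv = (np.abs(z_hat) + r).mean()                # E_{P̂}[ℓ_r(Ẑ)]

print(worst_case, nominal_adv)  # the two expectations coincide
```

The worst-case expectation under the pushforward distribution matches the nominal expectation of the adversarial loss exactly, as the proposition predicts.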

6.13. Worst-Case Expectations over 1-Wasserstein Balls


Consider the worst-case expectation problem

sup P∈P(Z) {E P [ℓ(𝑍)] : W1(P, P̂) ≤ 𝑟}, (6.18a)

which maximizes the expected value of ℓ(𝑍) over a 1-Wasserstein ball of radius 𝑟 ∈
R+ around P̂ ∈ P(Z). We assume here that the 1-Wasserstein distance is induced by
a given norm ‖ · ‖ on R𝑑. Thus, the 1-Wasserstein ambiguity set coincides with the optimal transport ambiguity set P = {P ∈ P(Z) : OT𝑐 (P, P̂) ≤ 𝑟} corresponding to the transportation cost function 𝑐 defined through 𝑐(𝑧, 𝑧ˆ) = ‖𝑧 − 𝑧ˆ‖. Theorem 4.18
thus implies that the problem dual to (6.18a) is given by
 
inf 𝜆≥0 𝜆𝑟 + E P̂ [sup 𝑧∈Z ℓ(𝑧) − 𝜆‖𝑧 − 𝑍ˆ‖] (6.18b)

whenever ℓ is upper semicontinuous. If Z = R𝑑 and ℓ is convex and Lipschitz


continuous, then the problems (6.18a) and (6.18b) can be solved in closed form.
Proposition 6.17 (Worst-Case Expectations over 1-Wasserstein Balls). Suppose that Z = R𝑑, P̂ ∈ P(Z) and 𝑟 ∈ R+. If E P̂ [ℓ(𝑍ˆ)] > −∞ and ℓ is convex and Lipschitz continuous, then the optimal values of (6.18a) and (6.18b) are equal to E P̂ [ℓ(𝑍ˆ)] + 𝑟 lip(ℓ).
Under the conditions of Proposition 6.17, the supremum of the primal prob-
lem (6.18a) is usually not attained. The proof constructs a sequence of distributions
that attain the supremum asymptotically. These distributions move an increasingly
small portion of P̂ increasingly far along the direction of steepest increase of ℓ.
Intuitively, the amount of probability mass transported over a distance Δ must decay
as O(𝑟/Δ) as Δ grows. The dual problem (6.18b) is solved by 𝜆★ = lip(ℓ).
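The asymptotic construction can be traced numerically. The sketch below is an illustration only; it takes ℓ(𝑧) = |𝑧| on R, so that lip(ℓ) = 1 and 𝑧★ = 1, and evaluates the objective of the distribution that moves a 𝑞-fraction of P̂ by a distance 𝑟/𝑞 along 𝑧★.

```python
import numpy as np

rng = np.random.default_rng(1)
z_hat = rng.uniform(-1.0, 1.0, size=50_000)  # samples from the nominal P̂
r = 0.5                                      # radius of the 1-Wasserstein ball
nominal = np.abs(z_hat).mean()               # E_P̂[ℓ(Z)] for ℓ(z) = |z|, lip(ℓ) = 1

def objective(q):
    """E_{P_q}[ℓ(Z)] where P_q moves a q-fraction of P̂ by r/q along z★ = 1."""
    return (1 - q) * np.abs(z_hat).mean() + q * np.abs(z_hat + r / q).mean()

vals = [objective(q) for q in (0.5, 0.1, 0.01)]
print(vals, nominal + r)  # the objective increases towards E_P̂[ℓ(Z)] + r·lip(ℓ)
```

As 𝑞 decreases the objective approaches, without attaining, the limit E P̂ [ℓ(𝑍)] + 𝑟 lip(ℓ), mirroring the non-attainment of the supremum discussed above.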
Proof of Proposition 6.17. As the convex function ℓ is Lipschitz continuous, it is
in particular proper and closed. By the Fenchel-Moreau theorem (Lemma 4.2) ℓ
thus admits the dual representation
ℓ(𝑧) = sup 𝑦∈dom(ℓ∗) 𝑧⊤𝑦 − ℓ∗(𝑦),

where ℓ∗ denotes the convex conjugate of ℓ. Put differently, ℓ coincides with the pointwise supremum of the affine functions 𝑓𝑦 (𝑧) = 𝑦⊤𝑧 − ℓ∗(𝑦) parametrized by 𝑦 ∈ dom(ℓ∗). Hölder's inequality then implies that

𝑓𝑦 (𝑧) − 𝑓𝑦 (𝑧ˆ) = 𝑦⊤(𝑧 − 𝑧ˆ) ≤ ‖𝑦‖∗ ‖𝑧 − 𝑧ˆ‖,
where ‖ · ‖∗ denotes the norm dual to ‖ · ‖. As Hölder's inequality is tight, 𝑓𝑦 is Lipschitz continuous with Lipschitz modulus lip(𝑓𝑦) = ‖𝑦‖∗. In addition, as the Lipschitz modulus of a supremum of affine functions coincides with the supremum of the corresponding Lipschitz moduli, the Lipschitz modulus of ℓ is given by

lip(ℓ) = sup 𝑦∈dom(ℓ∗) ‖𝑦‖∗ = max 𝑦∈cl(dom(ℓ∗)) ‖𝑦‖∗ .

The maximum in the last expression is attained by some 𝑦★ ∈ R𝑑 because lip(ℓ) < ∞ by assumption. Next, define 𝑧★ as any optimal solution of max‖𝑧‖≤1 (𝑦★)⊤𝑧. By construction, we thus have (𝑦★)⊤𝑧★ = ‖𝑦★‖∗. We also introduce a sequence {𝑦𝑖}𝑖∈N in dom(ℓ∗) that converges to 𝑦★, and we set 𝑞𝑖 = 𝑖⁻¹(1 + |ℓ∗(𝑦𝑖)|)⁻¹ for every 𝑖 ∈ N. In addition, we define 𝑓𝑖 : R𝑑 → R𝑑 through 𝑓𝑖 (𝑧) = 𝑧 + 𝑟𝑧★/𝑞𝑖 for any 𝑖 ∈ N. Thus, 𝑓𝑖 represents the translation that shifts each point in R𝑑 along the direction 𝑧★ by a distance equal to 𝑟/𝑞𝑖. We further define

P𝑖 = (1 − 𝑞𝑖) P̂ + 𝑞𝑖 P̂ ◦ 𝑓𝑖⁻¹,

where P̂ ◦ 𝑓𝑖⁻¹ stands for the pushforward distribution of P̂ under 𝑓𝑖. Intuitively, P𝑖 is obtained by decomposing P̂ into two parts (1 − 𝑞𝑖)P̂ and 𝑞𝑖P̂ and then translating the second part by 𝑟𝑧★/𝑞𝑖. By construction, we thus have OT𝑐 (P𝑖, P̂) ≤ 𝑟 and

E P𝑖 [ℓ(𝑍)] = (1 − 𝑞𝑖) E P̂ [ℓ(𝑍)] + 𝑞𝑖 E P̂ [ℓ(𝑍 + 𝑟𝑧★/𝑞𝑖)]
≥ (1 − 𝑞𝑖) E P̂ [ℓ(𝑍)] + 𝑞𝑖 E P̂ [(𝑦𝑖)⊤(𝑍 + 𝑟𝑧★/𝑞𝑖) − ℓ∗(𝑦𝑖)].
Here, the inequality follows from the representation of ℓ in terms of its conjugate ℓ ∗ .

As 𝑖 tends to infinity, 𝑞𝑖 as well as 𝑞𝑖 ℓ∗(𝑦𝑖) converge to 0, and 𝑦𝑖 converges to 𝑦★. Recall also that (𝑦★)⊤𝑧★ = ‖𝑦★‖∗ = lip(ℓ). This shows that the supremum of the worst-case expectation problem (6.18a) is bounded below by E P̂ [ℓ(𝑍)] + 𝑟 lip(ℓ).
Next, define 𝜆★ = lip(ℓ), and note that

ℓ(𝑧ˆ) ≤ sup 𝑧∈Z ℓ(𝑧) − 𝜆★‖𝑧 − 𝑧ˆ‖ ≤ sup 𝑧∈Z ℓ(𝑧ˆ) + lip(ℓ)‖𝑧 − 𝑧ˆ‖ − 𝜆★‖𝑧 − 𝑧ˆ‖ = ℓ(𝑧ˆ)

for all 𝑧ˆ ∈ Z, where the second inequality follows from the Lipschitz continuity
of ℓ, and the equality holds thanks to the definition of 𝜆★. Thus, the objective
function value of 𝜆★ in the dual problem (6.18b) is given by
 
 
𝜆★𝑟 + E P̂ [sup 𝑧∈Z ℓ(𝑧) − 𝜆★‖𝑧 − 𝑍ˆ‖] = E P̂ [ℓ(𝑍ˆ)] + 𝑟 lip(ℓ).

In summary, we have shown that—asymptotically for large 𝑖—the objective function value of P𝑖 in (6.18a) matches that of 𝜆★ in (6.18b). By weak duality as established in Theorem 4.18, the supremum of the primal problem (6.18a) thus coincides with the Lipschitz-regularized nominal loss E P̂ [ℓ(𝑍)] + 𝑟 lip(ℓ) and is asymptotically attained by the distribution P𝑖, which moves a fraction 𝑞𝑖 of the total probability mass by a distance 𝑟/𝑞𝑖 along the direction 𝑧★. □

The connection between robustification and Lipschitz regularization was discovered by Mohajerin Esfahani and Kuhn (2018). It offers a probabilistic interpretation for regularization techniques commonly used in statistics and machine learning (Shafieezadeh-Abadeh et al. 2015, 2019). Further extensions to nonconvex loss functions have been established in (Blanchet et al. 2019a, Ho-Nguyen and Wright 2023, Shafiee, Aolaritei, Dörfler and Kuhn 2023, Gao et al. 2024, Zhang et al. 2024a).

6.14. 1-Wasserstein Risk


Consider a law-invariant risk measure 𝜚 that can be expressed as a superposition
of CVaRs with different risk levels 𝛽 ∈ [0, 1]. Specifically, assume that
𝜚P [ℓ(𝑍)] = ∫_0^1 𝛽-CVaRP [ℓ(𝑍)] d𝜎(𝛽) (6.19)

for all P ∈ P(Z), where 𝜎 is a probability distribution on [0, 1] with ∫_0^1 𝛽⁻¹ d𝜎(𝛽) < ∞. Any 𝜚 with these properties is called a spectral risk measure (Acerbi 2002), and (6.19) is termed a Kusuoka representation of 𝜚 (Kusuoka 2001, Shapiro 2013).
If the distribution of 𝑍 is only known to be close to P̂ ∈ P(Z), then it is natural
to quantify the riskiness of an uncertain loss ℓ(𝑍) under a spectral risk measure 𝜚
by the 1-Wasserstein risk, that is, the supremum of 𝜚P [ℓ(𝑍)] over all distributions P
in a 1-Wasserstein ball around P̂. The 1-Wasserstein risk is available in closed form whenever Z = R𝑑 and ℓ is convex and Lipschitz continuous.

Proposition 6.18 (1-Wasserstein Risk). Let 𝜚 be a spectral risk measure satisfying (6.19) with ∫_0^1 𝛽⁻¹ d𝜎(𝛽) < ∞. Assume that P̂ ∈ P(R𝑑) with E P̂ [‖𝑍‖] < ∞ for some norm ‖ · ‖ on R𝑑. Define P = {P ∈ P(R𝑑) : W1(P, P̂) ≤ 𝑟}, where 𝑟 ≥ 0 and W1 is the 1-Wasserstein distance with transportation cost function 𝑐(𝑧, 𝑧ˆ) = ‖𝑧 − 𝑧ˆ‖. If ℓ is convex and Lipschitz continuous with lip(ℓ) < ∞, then we have

sup P∈P 𝜚P [ℓ(𝑍)] = 𝜚P̂ [ℓ(𝑍)] + 𝑟 lip(ℓ) ∫_0^1 𝛽⁻¹ d𝜎(𝛽).
Proof. The assumption ∫_0^1 𝛽⁻¹ d𝜎(𝛽) < ∞ ensures that 𝜎({0}) = 0, and the assumption E P̂ [‖𝑍‖] < ∞ ensures via the Lipschitz continuity of ℓ that E P̂ [ℓ(𝑍)] is finite. We first bound the worst-case risk from above. To this end, note that

sup P∈P 𝜚P [ℓ(𝑍)] ≤ ∫_0^1 sup P∈P 𝛽-CVaRP [ℓ(𝑍)] d𝜎(𝛽)
≤ ∫_0^1 inf 𝜏∈R 𝜏 + (1/𝛽) sup P∈P E P [max {ℓ(𝑍) − 𝜏, 0}] d𝜎(𝛽)
= ∫_0^1 inf 𝜏∈R 𝜏 + (1/𝛽) (E P̂ [max {ℓ(𝑍) − 𝜏, 0}] + 𝑟 lip(ℓ)) d𝜎(𝛽)
= 𝜚P̂ [ℓ(𝑍)] + 𝑟 lip(ℓ) ∫_0^1 𝛽⁻¹ d𝜎(𝛽) < +∞,

where the first inequality holds because P may adapt to 𝛽 when the supremum is
evaluated inside the integral, and the second inequality follows from the standard
max-min inequality. The first equality follows from the results on worst-case
expectations over 1-Wasserstein balls in Section 6.13.
To derive the converse inequality, we assume first that 𝜎({1}) = 0. The general case will be addressed later. Note that 𝜇 = inf P∈P E P [ℓ(𝑍)] is finite because ℓ is Lipschitz continuous and because E P̂ [‖𝑍‖] < ∞, which implies via the proof of Theorem 3.19 that all distributions in P have uniformly bounded first moment. We may assume without loss of generality that 𝜇 ≥ 0. Otherwise, we may replace ℓ(𝑧) with ℓ(𝑧) − 𝜇, which simply increases the worst-case risk by −𝜇 because any spectral risk measure is translation invariant. The assumption that 𝜇 ≥ 0 then implies that

𝛽-CVaRP [ℓ(𝑍)] ≥ E P [ℓ(𝑍)] ≥ 0 ∀𝛽 ∈ [0, 1], ∀P ∈ P.
Thus, we have

sup P∈P 𝜚P [ℓ(𝑍)] = sup P∈P sup 𝛿>0 ∫_𝛿^{1−𝛿} 𝛽-CVaRP [ℓ(𝑍)] d𝜎(𝛽)
= sup 𝛿>0 sup P∈P ∫_𝛿^{1−𝛿} 𝛽-CVaRP [ℓ(𝑍)] d𝜎(𝛽),
where the first equality follows from the monotone convergence theorem and the

assumption that 𝜎({0}) = 𝜎({1}) = 0. Hence, for any 𝜀 > 0 there is 𝛿 > 0 with
sup P∈P 𝜚P [ℓ(𝑍)] − sup P∈P ∫_𝛿^{1−𝛿} 𝛽-CVaRP [ℓ(𝑍)] d𝜎(𝛽) ≤ 𝜀 (6.20a)

and

∫_0^1 𝛽⁻¹ d𝜎(𝛽) − ∫_𝛿^{1−𝛿} 𝛽⁻¹ d𝜎(𝛽) ≤ 𝜀. (6.20b)

Recall now from Theorem 3.19 that P is weakly compact and thus tight. Hence, there exists a compact set C ⊆ R𝑑 with P(𝑍 ∉ C) ≤ 𝛿/2 for every P ∈ P. As C is compact, 𝜏̲ = min𝑧∈C ℓ(𝑧) and 𝜏̄ = max𝑧∈C ℓ(𝑧) are both finite. Using the trivial bounds P(ℓ(𝑍) ≥ 𝜏̲) ≥ P(𝑍 ∈ C) and P(ℓ(𝑍) ≤ 𝜏̄) ≥ P(𝑍 ∈ C) and noting that P(𝑍 ∈ C) ≥ 1 − 𝛿/2 for every P ∈ P, one can then readily show that

𝜏̲ ≤ (1 − 𝛿)-VaRP [ℓ(𝑍)] ≤ 𝛽-VaRP [ℓ(𝑍)] ≤ 𝛿-VaRP [ℓ(𝑍)] ≤ 𝜏̄

for all 𝛽 ∈ [𝛿, 1 − 𝛿] and for all P ∈ P. Next, define 𝑦 𝑖 ∈ dom(ℓ ∗ ), 𝑞 𝑖 ∈ [0, 1], the
function 𝑓𝑖 : R𝑑 → R𝑑 and the distribution P𝑖 = (1 − 𝑞 𝑖 ) P̂ + 𝑞 𝑖 P̂ ◦ 𝑓𝑖−1 for 𝑖 ∈ N
as in Section 6.13. We then obtain
sup P∈P 𝜚P [ℓ(𝑍)] ≥ ∫_𝛿^{1−𝛿} inf 𝜏∈R 𝜏 + (1/𝛽) E P𝑖 [max {ℓ(𝑍) − 𝜏, 0}] d𝜎(𝛽)
= ∫_𝛿^{1−𝛿} inf 𝜏∈[𝜏̲,𝜏̄] 𝜏 + ((1 − 𝑞𝑖)/𝛽) E P̂ [max {ℓ(𝑍) − 𝜏, 0}]
+ (𝑞𝑖/𝛽) E P̂ [max {ℓ(𝑍 + 𝑟𝑧★/𝑞𝑖) − 𝜏, 0}] d𝜎(𝛽). (6.21)

The inequality in (6.21) holds because 𝛽-CVaRP [ℓ(𝑍)] ≥ 0 for all 𝛽 ∈ [0, 1] by assumption and because P𝑖 ∈ P as shown in Section 6.13. The equality follows from the definition of P𝑖 and from (Rockafellar and Uryasev 2002, Theorem 10), which ensures that the minimization problem over 𝜏 is solved by 𝛽-VaRP𝑖 [ℓ(𝑍)] ∈ [𝜏̲, 𝜏̄].
As ℓ is proper, convex and lower semicontinuous, and as 𝑦𝑖 belongs to the domain of ℓ∗, the Fenchel-Moreau theorem further implies that

ℓ(𝑧 + 𝑟𝑧★/𝑞𝑖) = sup 𝑦∈dom(ℓ∗) (𝑧 + 𝑟𝑧★/𝑞𝑖)⊤𝑦 − ℓ∗(𝑦) ≥ (𝑧 + 𝑟𝑧★/𝑞𝑖)⊤𝑦𝑖 − ℓ∗(𝑦𝑖).

The last expectation in (6.21) thus admits the lower bound

E P̂ [max {ℓ(𝑍 + 𝑟𝑧★/𝑞𝑖) − 𝜏, 0}] ≥ E P̂ [ℓ(𝑍 + 𝑟𝑧★/𝑞𝑖) − 𝜏]
≥ E P̂ [𝑦𝑖⊤𝑍 + 𝑟𝑦𝑖⊤𝑧★/𝑞𝑖 − ℓ∗(𝑦𝑖)] − 𝜏.

Substituting this estimate into (6.21) and letting 𝑖 tend to infinity yields

sup P∈P 𝜚P [ℓ(𝑍)] ≥ lim 𝑖→∞ ∫_𝛿^{1−𝛿} inf 𝜏∈[𝜏̲,𝜏̄] 𝜏 + ((1 − 𝑞𝑖)/𝛽) E P̂ [max {ℓ(𝑍) − 𝜏, 0}] d𝜎(𝛽)
+ 𝑟 lip(ℓ) ∫_𝛿^{1−𝛿} 𝛽⁻¹ d𝜎(𝛽)
= ∫_𝛿^{1−𝛿} 𝛽-CVaRP̂ [ℓ(𝑍)] d𝜎(𝛽) + 𝑟 lip(ℓ) ∫_𝛿^{1−𝛿} 𝛽⁻¹ d𝜎(𝛽),

where we have used that 𝑞𝑖 as well as 𝑞𝑖 ℓ∗(𝑦𝑖) converge to 0 and that 𝑦𝑖⊤𝑧★ converges to (𝑦★)⊤𝑧★ = lip(ℓ) as 𝑖 tends to infinity; see also Section 6.13. The equality follows from the monotone convergence theorem, which applies because 𝑞𝑖 is monotonically decreasing with 𝑖. Letting 𝜀 tend to 0 thus implies via (6.20) that

sup P∈P 𝜚P [ℓ(𝑍)] ≥ ∫_0^1 𝛽-CVaRP̂ [ℓ(𝑍)] d𝜎(𝛽) + 𝑟 lip(ℓ) ∫_0^1 𝛽⁻¹ d𝜎(𝛽).

This lower bound matches the upper bound derived in the first part of the proof, and
thus the claim follows, provided that 𝜎({1}) = 0. If the probability distribution 𝜎
has an atom at 1, then it can be decomposed as 𝜎 = 𝜎̂ + 𝜎({1}) · 𝛿1, where 𝜎̂ is a non-negative measure on (0, 1). We can thus decompose the risk under P as

𝜚P [ℓ(𝑍)] = ∫_0^1 𝛽-CVaRP [ℓ(𝑍)] d𝜎̂(𝛽) + 𝜎({1}) · E P [ℓ(𝑍)].
The first term in this decomposition can then be handled as above, and the second
term can be handled as in Section 6.13. Details are omitted for brevity. 

Proposition 6.18 shows that the 1-Wasserstein risk of a Lipschitz continuous convex loss function coincides with the sum of the nominal risk and a Lipschitz regularization term. It is asymptotically attained by the distribution P𝑖, which moves a fraction 𝑞𝑖 of the total probability mass by a distance 𝑟/𝑞𝑖 along the direction 𝑧★. Proposition 6.17 emerges as a special case of Proposition 6.18 when
𝜎 = 𝛿1 . The worst-case risk over 𝑝-Wasserstein balls for 𝑝 ≥ 1 was first studied
by Pflug et al. (2012), and a result akin to Proposition 6.18 was obtained for linear
loss functions. Extensions to more general risk measures were studied by Pichler
(2013) and Wozabal (2014). The extension to convex loss functions is new.

6.15. 𝑝-Wasserstein Risk


We now show that if the loss function ℓ(𝑧) is linear, then the worst-case risk over a
𝑝-Wasserstein ball may be available in closed form even if 𝑝 ∈ (1, ∞). The results
of this section depend on the following lemma, which characterizes the conjugates
of powers of norms; see also (Zhen et al. 2023, Lemma C.9).
Lemma 6.19 (Conjugates of Powers of Norms). Assume that ‖ · ‖ and ‖ · ‖∗ are mutually dual norms on R𝑑 and that 𝑝, 𝑞 ∈ (1, ∞) are conjugate exponents with 1/𝑝 + 1/𝑞 = 1. Define 𝜑(𝑞) = (𝑞 − 1)^{𝑞−1}/𝑞^𝑞. Then, the following statements hold.

(i) If 𝑓(𝑧) = (1/𝑝)‖𝑧‖^𝑝, then 𝑓∗(𝑦) = (1/𝑞)‖𝑦‖∗^𝑞.
(ii) If 𝑔(𝑧) = ‖𝑧 − 𝑧ˆ‖^𝑝, then 𝑔∗(𝑦) = 𝑦⊤𝑧ˆ + 𝜑(𝑞)‖𝑦‖∗^𝑞.
Proof. As for assertion (i), fix any 𝑧, 𝑦 ∈ R𝑑. We then have

𝑧⊤𝑦 − (1/𝑝)‖𝑧‖^𝑝 ≤ ‖𝑧‖ ‖𝑦‖∗ − (1/𝑝)‖𝑧‖^𝑝 ≤ max 𝑡≥0 𝑡‖𝑦‖∗ − (1/𝑝)𝑡^𝑝 = (1/𝑞)‖𝑦‖∗^𝑞,

where the first inequality follows from the construction of the dual norm, and the second inequality is obtained by maximizing over 𝑡 = ‖𝑧‖. The equality holds because the maximization problem is solved by 𝜏 = ‖𝑦‖∗^{1/(𝑝−1)}. Both inequalities collapse to equalities if 𝑧 ∈ arg max‖𝑧‖=𝜏 𝑧⊤𝑦. This allows us to conclude that

𝑓∗(𝑦) = sup 𝑧∈R𝑑 𝑧⊤𝑦 − (1/𝑝)‖𝑧‖^𝑝 = (1/𝑞)‖𝑦‖∗^𝑞.

As for assertion (ii), note that

𝑔∗(𝑦) = sup 𝑧∈R𝑑 𝑦⊤𝑧 − ‖𝑧 − 𝑧ˆ‖^𝑝 = 𝑦⊤𝑧ˆ + 𝑝 · sup 𝑧∈R𝑑 (𝑦/𝑝)⊤𝑧 − (1/𝑝)‖𝑧‖^𝑝
= 𝑦⊤𝑧ˆ + (𝑝/𝑞)‖𝑦/𝑝‖∗^𝑞 = 𝑦⊤𝑧ˆ + 𝜑(𝑞)‖𝑦‖∗^𝑞,

where the last two equalities exploit assertion (i) and the definition of 𝜑(𝑞). □
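Lemma 6.19 is easy to sanity-check numerically. The sketch below is an illustration only; it takes 𝑝 = 𝑞 = 2 and the absolute value norm on R (so 𝜑(2) = 1/4 and ‖ · ‖∗ = | · |) and compares the closed-form conjugate of assertion (ii) with a brute-force grid maximization.

```python
import numpy as np

z_hat, p, q = 0.7, 2.0, 2.0
phi = (q - 1) ** (q - 1) / q ** q          # φ(q) = (q−1)^{q−1}/q^q, so φ(2) = 1/4
grid = np.linspace(-50.0, 50.0, 400_001)   # fine grid for the brute-force supremum

def g_conj(y):
    """Brute-force conjugate g*(y) = sup_z y·z − |z − ẑ|^p."""
    return np.max(y * grid - np.abs(grid - z_hat) ** p)

for y in (-2.0, 0.5, 3.0):
    closed_form = y * z_hat + phi * abs(y) ** q
    print(y, g_conj(y), closed_form)       # the last two columns agree closely
```

For 𝑝 = 2 the supremum is attained at 𝑧 = 𝑧ˆ + 𝑦/2, which is why the agreement is essentially exact up to the grid resolution.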

We now show that the worst-case CVaR of a linear loss function ℓ(𝑧) = 𝜃⊤𝑧 over a 𝑝-Wasserstein ball of radius 𝑟 around P̂ equals the sum of the nominal CVaR under P̂ and a regularization term that scales with the norm of 𝜃 and with 𝑟.

Proposition 6.20 (𝑝-Wasserstein Risk). Assume that P̂ ∈ P(R𝑑) with E P̂ [‖𝑍‖^𝑝] < ∞ for some 𝑝 ∈ (1, ∞) and for some norm ‖ · ‖ on R𝑑. Define P = {P ∈ P(R𝑑) : W𝑝(P, P̂) ≤ 𝑟}, where 𝑟 ≥ 0 and W𝑝 is the 𝑝-Wasserstein distance with transportation cost function 𝑐(𝑧, 𝑧ˆ) = ‖𝑧 − 𝑧ˆ‖^𝑝. If 𝜃 ∈ R𝑑 and 𝛽 ∈ (0, 1), then

sup P∈P 𝛽-CVaRP [𝜃⊤𝑍] = 𝛽-CVaRP̂ [𝜃⊤𝑍] + 𝑟𝛽^{−1/𝑝}‖𝜃‖∗ .

Proof. By the definition of the CVaR by Rockafellar and Uryasev (2000), we have

sup P∈P 𝛽-CVaRP [𝜃⊤𝑍] ≤ inf 𝜏∈R 𝜏 + (1/𝛽) sup P∈P E P [max {𝜃⊤𝑍 − 𝜏, 0}], (6.22)

where the inequality is obtained by interchanging the supremum over P and the infimum over 𝜏. The underlying worst-case expectation problem satisfies

sup P∈P E P [max {𝜃⊤𝑍 − 𝜏, 0}]
≤ inf 𝜆≥0 𝜆𝑟^𝑝 + E P̂ [sup 𝑧∈R𝑑 max {𝜃⊤𝑧 − 𝜏, 0} − 𝜆‖𝑧 − 𝑍ˆ‖^𝑝]
= inf 𝜆≥0 𝜆𝑟^𝑝 + E P̂ [max {sup 𝑧∈R𝑑 𝜃⊤𝑧 − 𝜏 − 𝜆‖𝑧 − 𝑍ˆ‖^𝑝, sup 𝑧∈R𝑑 −𝜆‖𝑧 − 𝑍ˆ‖^𝑝}]
= inf 𝜆≥0 𝜆𝑟^𝑝 + E P̂ [max {𝜃⊤𝑍ˆ − 𝜏 + 𝜑(𝑞)𝜆 ‖𝜃/𝜆‖∗^𝑞, 0}],

where the inequality exploits weak duality, and the first equality is obtained by
interchanging the order of the two maximization operations. The second equality
follows from Lemma 6.19(ii). Substituting the resulting formula into (6.22) and
interchanging the infimum over 𝜏 with the infimum over 𝜆 then yields
sup P∈P 𝛽-CVaRP [𝜃⊤𝑍]
≤ inf 𝜆≥0 𝜆𝑟^𝑝/𝛽 + inf 𝜏∈R 𝜏 + (1/𝛽) E P̂ [max {𝜃⊤𝑍ˆ − 𝜏 + 𝜑(𝑞)𝜆 ‖𝜃/𝜆‖∗^𝑞, 0}]
= inf 𝜆≥0 𝜆𝑟^𝑝/𝛽 + 𝛽-CVaRP̂ [𝜃⊤𝑍ˆ + 𝜑(𝑞)𝜆 ‖𝜃/𝜆‖∗^𝑞]
= 𝛽-CVaRP̂ [𝜃⊤𝑍ˆ] + inf 𝜆≥0 𝜆𝑟^𝑝/𝛽 + 𝜑(𝑞)𝜆 ‖𝜃/𝜆‖∗^𝑞,

where the equalities follow from the definition and the translation invariance of the CVaR, respectively. Solving the minimization problem over 𝜆 analytically yields

sup P∈P 𝛽-CVaRP [𝜃⊤𝑍] ≤ 𝛽-CVaRP̂ [𝜃⊤𝑍ˆ] + 𝑟𝛽^{−1/𝑝}‖𝜃‖∗ .

To derive the converse inequality, we use 𝜏𝛽 as a shorthand for 𝛽-VaRP̂ [𝜃⊤𝑍ˆ], which is finite because 𝛽 ∈ (0, 1), and we select any 𝑧★ ∈ arg max‖𝑧‖=1 𝜃⊤𝑧. In addition, we decompose the nominal distribution as P̂ = 𝛽 P̂+ + (1 − 𝛽) P̂−, where P̂+ and P̂− are probability distributions supported on Z+ = {𝑧 ∈ R𝑑 : 𝜃⊤𝑧 ≥ 𝜏𝛽} and Z− = {𝑧 ∈ R𝑑 : 𝜃⊤𝑧 ≤ 𝜏𝛽}, respectively. Such a decomposition always exists thanks to the definition of 𝜏𝛽. For example, if P̂(𝜃⊤𝑍 = 𝜏𝛽) = 0, as would be the case if P̂ was absolutely continuous with respect to the Lebesgue measure, then P̂− and P̂+ can simply be obtained by conditioning P̂ on Z− and Z+, respectively. We also define 𝑓 : R𝑑 → R𝑑 through 𝑓(𝑧) = 𝑧 + 𝑟𝑧★/𝛽^{1/𝑝}. Thus, 𝑓 shifts all points in R𝑑 along the direction 𝑧★ by a distance equal to 𝑟/𝛽^{1/𝑝}. Finally, we set P★ = 𝛽 P̂+ ◦ 𝑓⁻¹ + (1 − 𝛽) P̂−. Hence, P★ is obtained by decomposing P̂ into two parts 𝛽 P̂+ and (1 − 𝛽) P̂− and then translating the first part by 𝑟𝑧★/𝛽^{1/𝑝}. We thus have W𝑝(P★, P̂) ≤ 𝑟 and 𝛽-VaRP★ [𝜃⊤𝑍] = 𝜏𝛽. This in turn implies that
sup P∈P 𝛽-CVaRP [𝜃⊤𝑍] ≥ 𝛽-CVaRP★ [𝜃⊤𝑍]
= 𝜏𝛽 + (1/𝛽) E P★ [max {𝜃⊤𝑍 − 𝜏𝛽, 0}]
= 𝜏𝛽 + E P̂+ [max {𝜃⊤𝑓(𝑍) − 𝜏𝛽, 0}] + ((1 − 𝛽)/𝛽) E P̂− [max {𝜃⊤𝑍 − 𝜏𝛽, 0}]
= E P̂+ [𝜃⊤𝑍] + 𝑟𝛽^{−1/𝑝}‖𝜃‖∗ = 𝛽-CVaRP̂ [𝜃⊤𝑍ˆ] + 𝑟𝛽^{−1/𝑝}‖𝜃‖∗ .

Here, the first equality follows from the definition of the CVaR and from (Rockafellar and Uryasev 2002, Theorem 10), which ensures that the infimum over 𝜏 is attained at 𝛽-VaRP★ [𝜃⊤𝑍] = 𝜏𝛽. The second equality exploits the definition of P★, and the third equality holds because 𝜃⊤𝑧★ = ‖𝜃‖∗ and because 𝜃⊤𝑧 ≥ 𝜏𝛽 for all 𝑧 ∈ Z+ and 𝜃⊤𝑧 ≤ 𝜏𝛽 for all 𝑧 ∈ Z−. Finally, the fourth equality follows from the construction of P̂+ and from (Rockafellar and Uryasev 2002, Proposition 5). This completes the proof. □

7. Finite Convex Reformulations of Nature’s Subproblem


Although nature’s subproblem admits analytical solutions in important special
cases (cf. Section 6), it can usually only be solved numerically. Sometimes, nature’s
subproblem can be reformulated as an equivalent convex optimization problem. In
these cases, it can be addressed with off-the-shelf solvers. In other cases, however,
it may be necessary or preferable to develop customized solution algorithms.
This section focuses on finite convex reductions. That is, we will describe con-
ditions under which the dual worst-case expectation problems derived in Section 4
can be reformulated as finite convex minimization problems. These finite reformu-
lations are significant because they can be combined with the outer minimization
problem over 𝑥 ∈ X to construct a reformulation of the overall DRO problem (1.2)
as a classical minimization problem amenable to standard optimization software.
We subsequently dualize the finite convex reformulations of nature’s subproblem
to obtain equivalent finite convex maximization problems. These finite bi-dual
maximization problems are significant because their optimal solutions allow us
to construct worst-case distributions that (asymptotically) attain the supremum of
nature’s subproblem (4.1). Even though we only address worst-case expectations,
all results of this section readily extend to worst-case optimized certainty equival-
ents thanks to Theorem 5.18. For the sake of brevity, however, we will not elaborate
on these extensions. To simplify notation, we will always suppress the dependence
of the loss function ℓ on the decision variables 𝑥.
The remainder of this section develops as follows. In Section 7.1, we first outline
a general strategy for deriving finite convex dual and bi-dual reformulations of
nature’s subproblem (4.1). We subsequently exemplify this strategy for worst-case
expectation problems over Chebyshev ambiguity sets (Section 7.2), 𝜙-divergence
ambiguity sets (Section 7.3) and optimal transport ambiguity sets (Section 7.4).

7.1. General Proof Strategy


The worst-case expectation problem (4.1) constitutes a semi-infinite program that
involves infinitely many decision variables (because it optimizes over a subset of
an infinite-dimensional measure space) but only finitely many constraints (e.g.,

moment conditions and/or bounds on the divergence or discrepancy to a reference distribution). The duality results of Section 4 enable us to recast this semi-infinite
maximization problem as a semi-infinite minimization problem with finitely many
variables and infinitely many constraints. We then leverage reformulation tech-
niques from robust optimization to recast the dual semi-infinite program as a
finite-dimensional convex minimization problem. These techniques exploit stand-
ard results from convex analysis as well as the S-Lemma, which we review next.
Throughout this discussion we adopt the convention that 0 · ∞ = ∞.
We first show that scaling and perspectivication constitute dual operations.
Lemma 7.1 (Duality of Scaling and Perspectivication). If 𝑓 : R𝑑 → R is a proper,
closed and convex function and 𝛼 ∈ R+ a fixed constant, then the following hold.
(i) If 𝑔(𝑧) = 𝛼 𝑓 (𝑧), then 𝑔∗ (𝑦) = ( 𝑓 ∗ ) 𝜋 (𝑦, 𝛼) for all 𝑦 ∈ R𝑑 .
(ii) If 𝑔(𝑧) = 𝑓 𝜋 (𝑧, 𝛼), then 𝑔∗ (𝑦) = cl(𝛼 𝑓 ∗ )(𝑦) for all 𝑦 ∈ R𝑑 .
Proof. We prove assertion (i) by case distinction. First, if 𝛼 > 0, then we have

𝑔∗(𝑦) = sup 𝑧∈R𝑑 𝑦⊤𝑧 − 𝛼𝑓(𝑧) = 𝛼 sup 𝑧∈R𝑑 (𝑦/𝛼)⊤𝑧 − 𝑓(𝑧) = 𝛼𝑓∗(𝑦/𝛼) = (𝑓∗)𝜋(𝑦, 𝛼).

If 𝛼 = 0, on the other hand, then a similar reasoning shows that

𝑔∗(𝑦) = sup 𝑧∈R𝑑 𝑦⊤𝑧 − 𝛿dom(𝑓)(𝑧) = 𝛿∗dom(𝑓)(𝑦) = 𝛿∗dom(𝑓∗∗)(𝑦) = (𝑓∗)∞(𝑦) = (𝑓∗)𝜋(𝑦, 𝛼),

where the first equality follows from our convention that 0 · ∞ = ∞, which implies
that 0 𝑓 (𝑧) = 𝛿dom( 𝑓 ) (𝑧). The second equality follows from the definition of the
support function, and the third equality holds because 𝑓 is convex and closed,
which implies via Lemma 4.2 that 𝑓 = 𝑓 ∗∗ . Finally, the fourth equality follows
from (Rockafellar 1970, Theorem 13.3), and the last equality exploits the definition
of the perspective function for 𝛼 = 0. This completes the proof of assertion (i).
As for assertion (ii), assume first that 𝛼 > 0, and note that

𝑔∗(𝑦) = sup 𝑧∈R𝑑 𝑦⊤𝑧 − 𝑓𝜋(𝑧, 𝛼) = 𝛼 sup 𝑧∈R𝑑 𝑦⊤(𝑧/𝛼) − 𝑓(𝑧/𝛼) = 𝛼𝑓∗(𝑦) = cl(𝛼𝑓∗)(𝑦),

where the last equality holds because 𝑓∗ is closed. If 𝛼 = 0, then we have

𝑔∗(𝑦) = sup 𝑧∈R𝑑 𝑦⊤𝑧 − 𝑓∞(𝑧) = sup 𝑧∈R𝑑 𝑦⊤𝑧 − 𝛿∗dom(𝑓∗)(𝑧) = 𝛿cl(dom(𝑓∗))(𝑦) = cl(𝛼𝑓∗)(𝑦).

Here, the first equality exploits the definition of the perspective. The second and
the third equalities follow from (Rockafellar 1970, Theorem 13.3) and (Rockafellar
1970, Theorem 13.2), respectively. The last equality, finally, holds because 0 𝑓 ∗ =
𝛿dom( 𝑓 ∗ ) by our conventions of extended arithmetic. This proves assertion (ii). 
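Lemma 7.1(i) can be illustrated numerically with 𝑓(𝑧) = 𝑧², whose conjugate is 𝑓∗(𝑦) = 𝑦²/4, so that (𝛼𝑓)∗(𝑦) = (𝑓∗)𝜋(𝑦, 𝛼) = 𝛼𝑓∗(𝑦/𝛼) = 𝑦²/(4𝛼) for 𝛼 > 0. A hedged brute-force sketch:

```python
import numpy as np

grid = np.linspace(-100.0, 100.0, 800_001)  # grid for the brute-force conjugate

def conj(h, y):
    """Brute-force conjugate h*(y) = sup_z y·z − h(z) over the grid."""
    return np.max(y * grid - h(grid))

for alpha in (0.5, 2.0):
    for y in (-1.0, 3.0):
        brute = conj(lambda z: alpha * z ** 2, y)   # (α f)*(y) with f(z) = z²
        closed = y ** 2 / (4 * alpha)               # (f*)^π(y, α) = α f*(y/α)
        print(alpha, y, brute, closed)              # agreement up to grid error
```

The supremum is attained at 𝑧 = 𝑦/(2𝛼), which lies well inside the grid for the parameters above.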
The following lemma derives a formula for the conjugate of a sum of functions.

Lemma 7.2 (Conjugates of Sums). If 𝑓 𝑘 : R𝑑 → Í R, 𝑘 ∈ [𝐾], are proper, convex


and closed functions, then the conjugate of 𝑓 = 𝑘 ∈ [𝐾 ] 𝑓 𝑘 satisfies
 Õ Õ 
∗ ∗
𝑓 (𝑦) ≤ inf 𝑓 (𝑦 𝑘 ) : 𝑦𝑘 = 𝑦 ∀𝑦 ∈ R𝑑 . (7.1)
𝑦1 ,...,𝑦𝐾 ∈R𝑑
𝑘 ∈ [𝐾 ] 𝑘 ∈ [𝐾 ]

If there exists 𝑧¯ ∈ ∩𝑘 ∈ [𝐾 ] rint(dom( 𝑓 𝑘 )), then the inequality in the above expression
reduces to an equality, and the minimum is attained for every 𝑦 ∈ R𝑑 .
The infimum on the right hand side of (7.1) defines a function of 𝑦. This function
is called the infimal convolution of the functions 𝑓 𝑘∗ , 𝑘 ∈ [𝐾]. Thus, Lemma 7.2
asserts that, under a mild Slater-type condition, the conjugate of a sum of functions
coincides with the infimal convolution of the conjugates of these functions.
Proof of Lemma 7.2. By using a standard variable splitting trick and the max-min inequality, one can show that the conjugate of 𝑓 admits the following upper bound.

𝑓∗(𝑦) = sup 𝑧,𝑧1,...,𝑧𝐾∈R𝑑 {𝑦⊤𝑧 − Σ𝑘∈[𝐾] 𝑓𝑘(𝑧𝑘) : 𝑧𝑘 = 𝑧 ∀𝑘 ∈ [𝐾]}
= sup 𝑧,𝑧1,...,𝑧𝐾∈R𝑑 inf 𝑦1,...,𝑦𝐾∈R𝑑 𝑦⊤𝑧 − Σ𝑘∈[𝐾] (𝑓𝑘(𝑧𝑘) + 𝑦𝑘⊤(𝑧 − 𝑧𝑘))
≤ inf 𝑦1,...,𝑦𝐾∈R𝑑 sup 𝑧,𝑧1,...,𝑧𝐾∈R𝑑 𝑦⊤𝑧 − Σ𝑘∈[𝐾] (𝑓𝑘(𝑧𝑘) + 𝑦𝑘⊤(𝑧 − 𝑧𝑘))
= inf 𝑦1,...,𝑦𝐾∈R𝑑 sup 𝑧∈R𝑑 (𝑦 − Σ𝑘∈[𝐾] 𝑦𝑘)⊤𝑧 + Σ𝑘∈[𝐾] sup 𝑧𝑘∈R𝑑 𝑦𝑘⊤𝑧𝑘 − 𝑓𝑘(𝑧𝑘)

The supremum over 𝑧 in the resulting expression evaluates to 0 if Σ𝑘∈[𝐾] 𝑦𝑘 = 𝑦 and to ∞ otherwise. In addition, the supremum over 𝑧𝑘 evaluates to 𝑓𝑘∗(𝑦𝑘) for every 𝑘 ∈ [𝐾]. Substituting these analytical formulas into the last expression yields

𝑓∗(𝑦) ≤ inf 𝑦1,...,𝑦𝐾∈R𝑑 {Σ𝑘∈[𝐾] 𝑓𝑘∗(𝑦𝑘) : Σ𝑘∈[𝐾] 𝑦𝑘 = 𝑦}.

If ∩𝑘 ∈ [𝐾 ] rint(dom( 𝑓 𝑘 )) is non-empty, then the above inequality becomes an equal-


ity, and the infimum is attained thanks to (Rockafellar 1970, Theorem 16.4). 
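The infimal convolution in Lemma 7.2 can be checked on a small example. The following sketch (an illustration only) takes 𝑓₁(𝑧) = 𝑧² and 𝑓₂(𝑧) = |𝑧| on R, so that 𝑓₁∗(𝑦) = 𝑦²/4 and 𝑓₂∗ is the indicator of [−1, 1]; the Slater-type condition holds trivially since both domains are all of R, so (7.1) holds with equality.

```python
import numpy as np

grid = np.linspace(-50.0, 50.0, 400_001)

def conj_sum(y):
    """Brute-force conjugate of f = f₁ + f₂ with f₁(z) = z², f₂(z) = |z|."""
    return np.max(y * grid - (grid ** 2 + np.abs(grid)))

def inf_conv(y):
    # Infimal convolution of f₁*(y₁) = y₁²/4 and f₂* = indicator of [-1, 1]:
    # split y = y₁ + y₂ with y₂ = clip(y, -1, 1), which is clearly optimal.
    y2 = np.clip(y, -1.0, 1.0)
    return (y - y2) ** 2 / 4

for y in (-3.0, 0.4, 2.5):
    print(y, conj_sum(y), inf_conv(y))  # equality, as predicted by Lemma 7.2
```

For instance, at 𝑦 = 2.5 both expressions evaluate to (2.5 − 1)²/4 = 0.5625.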
Consider now a classical optimization problem

inf 𝑧∈R𝑑 {𝑓(𝑧) : 𝑔𝑘(𝑧) ≤ 0 ∀𝑘 ∈ [𝐾]} (P)

with objective function 𝑓 : R𝑑 → R and constraint functions 𝑔𝑘 : R𝑑 → R, 𝑘 ∈ [𝐾]. Below we will show that the problem dual to (P) is given by

sup 𝛼1,...,𝛼𝐾∈R+, 𝛽0,...,𝛽𝐾∈R𝑑 {−𝑓∗(𝛽0) − Σ𝑘=1^𝐾 (𝑔𝑘∗)𝜋(𝛽𝑘, 𝛼𝑘) : Σ𝑘=0^𝐾 𝛽𝑘 = 0}. (D)

To this end, we adopt the following definition of a Slater point.


Definition 7.3 (Slater Point). A Slater point of the set Z = {𝑧 ∈ R𝑑 : 𝑔 𝑘 (𝑧) ≤
0 ∀𝑘 ∈ [𝐾]} is any vector 𝑧¯ ∈ Z with 𝑧¯ ∈ rint(dom(𝑔 𝑘 )) for all 𝑘 ∈ [𝐾] and
𝑔 𝑘 (¯𝑧 ) < 0 for all 𝑘 ∈ [𝐾] such that 𝑔 𝑘 is nonlinear. A Slater point 𝑧¯ of the set Z is
a Slater point of the minimization problem inf{ 𝑓 (𝑧) : 𝑧 ∈ Z } if 𝑧¯ ∈ rint(dom( 𝑓 )).
Slater points of maximization problems are defined in the obvious way. We
simply replace the requirement 𝑧¯ ∈ rint(dom( 𝑓 )) with 𝑧¯ ∈ rint(dom(− 𝑓 )). Using
Lemmas 7.1 and 7.2, we can now prove that (P) and (D) are indeed duals.
Theorem 7.4 (Convex Duality). Assume that the functions 𝑓 and 𝑔 𝑘 , 𝑘 ∈ [𝐾],
are proper, closed and convex. Then, the infimum of (P) is larger or equal to the
supremum of (D). In addition, the following strong duality relations hold.
(i) If (P) or (D) admits a Slater point, then the infimum of (P) matches the
supremum of (D), and (D) or (P) is solvable, respectively.
(ii) If the feasible set of (P) or (D) is non-empty and bounded, then the infimum
of (P) matches the supremum of (D), and (P) or (D) is solvable, respectively.
Proof. The max-min inequality readily implies that the infimum of (P) is bounded below by the optimal value of its Lagrangian dual, that is, we have

inf (P) = inf 𝑧∈R𝑑 sup 𝛼∈R+^𝐾 𝑓(𝑧) + Σ𝑘∈[𝐾] 𝛼𝑘𝑔𝑘(𝑧)
≥ sup 𝛼∈R+^𝐾 inf 𝑧∈R𝑑 𝑓(𝑧) + Σ𝑘∈[𝐾] 𝛼𝑘𝑔𝑘(𝑧)
= sup 𝛼∈R+^𝐾 − sup 𝑧∈R𝑑 0⊤𝑧 − 𝑓(𝑧) − Σ𝑘∈[𝐾] 𝛼𝑘𝑔𝑘(𝑧)
= sup 𝛼∈R+^𝐾 −(𝑓 + Σ𝑘∈[𝐾] 𝛼𝑘𝑔𝑘)∗(0).

The resulting lower bound involves the conjugate of a sum of several functions. By
Lemma 7.2, the conjugate of this sum is bounded below by the infimal convolution
of the conjugates of all functions in the sum. Consequently, we obtain
( 𝐾 𝐾
)
Õ Õ
inf (P) ≥ sup − 𝑓 ∗ (𝛽0 ) − (𝛼 𝑘 𝑔 𝑘 )∗ (𝛽 𝑘 ) : 𝛽𝑘 = 0 . (7.2)
𝛼1 ,..., 𝛼𝐾 ∈R+ 𝑘=1 𝑘=0
𝛽0 ,...,𝛽𝐾 ∈R𝑑

By Lemma 7.1(i), we further have (𝛼𝑘𝑔𝑘)∗(𝛽𝑘) = (𝑔𝑘∗)𝜋(𝛽𝑘, 𝛼𝑘) for all 𝛽𝑘 ∈ R𝑑 and 𝛼𝑘 ∈ R+. Thus, the lower bound in (7.2) matches the supremum of (D). This proves weak duality. For a proof of strong duality and solvability under the conditions (i) and (ii), we refer to (Zhen et al. 2023, Theorem 2). □
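To make the (P)/(D) pair concrete, consider the toy instance 𝑓(𝑧) = 𝑧² and 𝑔(𝑧) = 1 − 𝑧 on R. Then 𝑓∗(𝛽0) = 𝛽0²/4, while 𝑔∗ equals −1 at 𝑦 = −1 and +∞ elsewhere, so (𝑔∗)𝜋(𝛽1, 𝛼) = −𝛼 on {𝛽1 = −𝛼}; the coupling constraint 𝛽0 + 𝛽1 = 0 forces 𝛽0 = 𝛼, and (D) collapses to sup 𝛼≥0 𝛼 − 𝛼²/4. A numerical sketch (illustration only) confirming strong duality, with both values equal to 1, attained at 𝑧 = 1 and 𝛼 = 2:

```python
import numpy as np

# Primal (P): minimize f(z) = z² subject to g(z) = 1 − z ≤ 0, i.e. z ≥ 1.
z = np.linspace(-5.0, 5.0, 100_001)
primal = np.min(np.where(1.0 - z <= 0.0, z ** 2, np.inf))

# Dual (D): with f*(β₀) = β₀²/4 and (g*)^π(β₁, α) = −α on {β₁ = −α}, the
# constraint β₀ + β₁ = 0 reduces (D) to sup_{α ≥ 0} α − α²/4.
alpha = np.linspace(0.0, 10.0, 100_001)
dual = np.max(alpha - alpha ** 2 / 4)

print(primal, dual)  # strong duality: both values are 1 (up to grid error)
```

This instance admits a Slater point (e.g. 𝑧̄ = 2), so case (i) of Theorem 7.4 applies and the duality gap vanishes.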
Armed with Theorem 7.4, we can now show that the semi-infinite constraints appearing in the dual worst-case expectation problems derived in Section 4 can systematically be reformulated in terms of finitely many convex constraints.
Proposition 7.5 (Semi-Infinite Constraints I). Assume that the functions 𝑓 : R𝑑 →
R and 𝑔 𝑘 : R𝑑 → R, 𝑘 ∈ [𝐾], are proper, closed and convex, and that there is
𝑧¯ ∈ R𝑑 with 𝑧¯ ∈ rint(dom(𝑔 𝑘 )), 𝑘 ∈ [𝐾], 𝑧¯ ∈ rint(dom( 𝑓 )) and 𝑔 𝑘 (¯𝑧 ) < 0 for all
𝑘 ∈ [𝐾] such that 𝑔 𝑘 is nonlinear. Then the semi-infinite constraint
𝑓 (𝑧) ≥ 0 ∀𝑧 ∈ R𝑑 : 𝑔 𝑘 (𝑧) ≤ 0 ∀𝑘 ∈ [𝐾]
holds if and only if there exist 𝛼1 , . . . , 𝛼𝐾 ∈ R+ and 𝛽0 , . . . , 𝛽 𝐾 ∈ R𝑑 with
$$
f^*(\beta_0) + \sum_{k=1}^{K} (g_k^*)^\pi(\beta_k, \alpha_k) \le 0 \qquad \text{and} \qquad \sum_{k=0}^{K} \beta_k = 0.
$$
Proof. The semi-infinite constraint in the statement of the proposition is satisfied
if and only if the infimum of (P) is non-negative. Under the stated assumptions,
Theorem 7.4 implies that this is the case precisely when the supremum of (D) is non-
negative. Since (P) admits a Slater point, the supremum of (D) is attained. Thus,
the supremum of (D) is non-negative if and only if there are 𝛼1 , . . . , 𝛼𝐾 ∈ R+ and
𝛽0 , . . . , 𝛽 𝐾 ∈ R𝑑 satisfying the constraints in the statement of the proposition. □
Proposition 7.5 enables us to derive finite convex reformulations of the semi-
infinite constraints that appear in the dual of the worst-case expectation prob-
lem (4.1) whenever the relevant objective and constraint functions are convex in 𝑧.
Another similar reformulation technique relies on the S-Lemma (see, e.g., Pólik
and Terlaky 2007), which we present without a proof.
Lemma 7.6 (S-Lemma (Yakubovich 1971)). Assume that 𝑓 : R𝑑 → R and
𝑔 : R𝑑 → R are quadratic functions. If there exists a Slater point 𝑧¯ ∈ R𝑑
such that 𝑔(¯𝑧) < 0, then the following two statements are equivalent.
(i) There is no 𝑧 ∈ R𝑑 such that 𝑓 (𝑧) < 0 and 𝑔(𝑧) ≤ 0.
(ii) There exists 𝛼 ∈ R+ such that 𝑓 (𝑧) + 𝛼𝑔(𝑧) ≥ 0 for all 𝑧 ∈ R𝑑 .
The S-Lemma allows us to derive finite convex reformulations of semi-infinite
constraints that require a (possibly indefinite) quadratic function to be non-negative
over the feasible set of a single quadratic constraint. Note in particular that the
involved functions 𝑓 and 𝑔 are not required to be convex in 𝑧.
Proposition 7.7 (Semi-Infinite Constraints II). Assume that $Q_0, Q_1 \in \mathbb S^d$, $q_0, q_1 \in \mathbb R^d$, and $r_0, r_1 \in \mathbb R$. In addition, assume that there exists a Slater point $\bar z \in \mathbb R^d$ such that $\bar z^\top Q_0 \bar z + 2 q_0^\top \bar z + r_0 < 0$. Then, the semi-infinite constraint
$$
z^\top Q_1 z + 2 q_1^\top z + r_1 \ge 0 \quad \forall z \in \mathbb R^d : z^\top Q_0 z + 2 q_0^\top z + r_0 \le 0
$$
holds if and only if there exists $\alpha \in \mathbb R_+$ with
$$
\begin{bmatrix} Q_1 + \alpha Q_0 & q_1 + \alpha q_0 \\ q_1^\top + \alpha q_0^\top & r_1 + \alpha r_0 \end{bmatrix} \succeq 0.
$$
Proof. We observe that
$$
\begin{aligned}
& z^\top Q_1 z + 2 q_1^\top z + r_1 \ge 0 \quad \forall z \in \mathbb R^d : z^\top Q_0 z + 2 q_0^\top z + r_0 \le 0 \\
\iff\;& \exists \alpha \in \mathbb R_+ \text{ with } z^\top (Q_1 + \alpha Q_0) z + 2 (q_1 + \alpha q_0)^\top z + r_1 + \alpha r_0 \ge 0 \quad \forall z \in \mathbb R^d \\
\iff\;& \exists \alpha \in \mathbb R_+ \text{ with } \begin{bmatrix} z \\ 1 \end{bmatrix}^{\!\top} \begin{bmatrix} Q_1 + \alpha Q_0 & q_1 + \alpha q_0 \\ q_1^\top + \alpha q_0^\top & r_1 + \alpha r_0 \end{bmatrix} \begin{bmatrix} z \\ 1 \end{bmatrix} \ge 0 \quad \forall z \in \mathbb R^d,
\end{aligned}
$$
where the first equivalence applies Lemma 7.6 to $f(z) = z^\top Q_1 z + 2 q_1^\top z + r_1$ and $g(z) = z^\top Q_0 z + 2 q_0^\top z + r_0$. As quadratic forms are homogeneous of degree 2 as well as continuous, the last statement is equivalent to the desired positive semidefiniteness condition. This observation concludes the proof. □
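As a sanity check, the following Python snippet (a schematic illustration with hypothetical data, not part of the original text) verifies the matrix certificate of Proposition 7.7 on a one-dimensional instance in which the target function is nonconvex:

```python
import numpy as np

# Hypothetical 1-d instance of the S-Lemma: g(z) = z^2 - 1, i.e. Q0 = 1,
# q0 = 0, r0 = -1 with Slater point z = 0, and f(z) = -z^2 + 2, i.e.
# Q1 = -1, q1 = 0, r1 = 2. Although f is nonconvex, f(z) = 2 - z^2 >= 1 > 0
# whenever g(z) <= 0, so a certificate alpha >= 0 must exist.
Q0, q0, r0 = 1.0, 0.0, -1.0
Q1, q1, r1 = -1.0, 0.0, 2.0

def certificate(alpha):
    # Block matrix from Proposition 7.7; it is positive semidefinite for some
    # alpha >= 0 if and only if the semi-infinite constraint holds.
    return np.array([[Q1 + alpha * Q0, q1 + alpha * q0],
                     [q1 + alpha * q0, r1 + alpha * r0]])

# alpha = 1.5 yields diag(0.5, 0.5), which is positive semidefinite.
eigs = np.linalg.eigvalsh(certificate(1.5))

# Brute-force check of the semi-infinite constraint on the feasible set.
z = np.linspace(-1.0, 1.0, 10001)
feasible_ok = np.all(Q1 * z**2 + 2 * q1 * z + r1 >= 0)
print(bool(np.all(eigs >= -1e-9)), bool(feasible_ok))
```

Any $\alpha \in [1, 2]$ works here; outside this interval one of the diagonal entries of the certificate matrix becomes negative.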
Proposition 7.7 is particularly useful for deriving finite convex reformulations of the dual worst-case expectation problems over Chebyshev or Gelbrich ambiguity sets; see (4.5) and (4.6). As we will see, the corresponding semi-infinite constraints fail to be convex in $z$, which implies that Proposition 7.5 is not applicable.
Finite convex reformulations of the dual worst-case expectation problem (4.1)
are key to solving the DRO problem (1.2). They allow us to combine the outer
minimization over 𝑥 ∈ X with the inner minimization over the auxiliary decision
variables of the dual worst-case expectation problem to obtain a finite convex
reformulation of (1.2). However, the finite dual reformulations of (4.1) do not
allow us to readily identify worst-case distributions that (asymptotically) attain
the supremum of (4.1). Such worst-case distributions enable decision-makers
to evaluate how a given candidate decision performs under the most challenging
conditions, which is the essence of stress testing and contamination experiments;
see, e.g., (Dupačová 2006). They also play a pivotal role in optimal uncertainty
quantification, where they are used to determine the sharpest possible probabilistic
bounds on quantities of interest, given limited information about the underlying
probability distributions. We direct the readers to (Owhadi, Scovel, Sullivan,
McKerns and Ortiz 2013, Ghanem, Higdon and Owhadi 2017) for more details.
To identify a worst-case distribution that attains the supremum of (4.1), or to
identify a sequence of distributions that attain this supremum asymptotically, we
consider the bi-dual reformulation of the worst-case expectation problem (4.1) that
results from dualizing the finite convex dual of (4.1). The bi-dual can often be
interpreted as a restriction of the worst-case expectation problem (4.1) to a subset
of distributions P ∈ P that are parametrized by finitely many decision variables.
Strong duality between problem (4.1), its dual and its bi-dual then allows us to
conclude that any optimal solution to this bi-dual problem represents a (sequence
of) distribution(s) that attains the supremum of (4.1) (asymptotically).
The idea of extracting worst-case distributions from the finite bi-dual of prob-
lem (4.1) was formalized by Delage and Ye (2010, § 4.2) for Chebyshev ambiguity
sets and later extended to optimal transport ambiguity sets by Mohajerin Esfahani
and Kuhn (2018). In Section 7.2 we will see that, for the Chebyshev ambiguity
set (2.4) with uncertain moments, the worst-case distributions constitute mixtures
of distributions with first and second moments that are determined by the optimal
solution of the finite bi-dual problem. For 𝜙-divergence ambiguity sets centered at
a discrete distribution P̂, Section 7.3 will show that the worst-case distributions are
supported on the atoms of P̂ and (if 𝜙 grows at most linearly) on arg max 𝑧 ∈Z ℓ(𝑧)
with probability weights determined by the optimal solution to the finite bi-dual
problem. Similarly, for the optimal transport ambiguity set (2.27) centered at a
discrete distribution P̂, Section 7.4 will show that the worst-case distributions con-
stitute mixtures of discrete distributions, with the locations and probability weights
of their atoms determined by the optimal solution to the finite bi-dual problem.
7.2. Chebyshev Ambiguity Sets with Uncertain Moments
Recall that the Chebyshev ambiguity set (2.4) with uncertain moments is defined as
$$
\mathcal P = \big\{ P \in \mathcal P_2(\mathcal Z) : \mathbb E_P[Z] = \mu,\ \mathbb E_P[Z Z^\top] = M \text{ for some } (\mu, M) \in \mathcal F \big\},
$$
where F ⊆ R𝑑 × S+𝑑 represents a closed moment uncertainty set and P2 (Z) stands
for the family of all probability distributions on Z with finite second moments. This
section combines the duality result for Chebyshev ambiguity sets (cf. Theorem 4.6)
with the finite dual reformulation of the ensuing semi-infinite program (cf. Propos-
ition 7.7) to derive an equivalent reformulation of nature’s subproblem (4.1) as a
finite-dimensional minimization problem. We also show how the corresponding
bi-dual allows us to extract worst-case distributions P★ ∈ P that attain the optimal
value of (4.1). Since the support-only ambiguity sets (cf. Section 2.1.1), the Markov
ambiguity sets (cf. Section 2.1.2), the Chebyshev ambiguity sets with known moments (cf. Section 2.1.3) and the mean-dispersion ambiguity sets (cf. Section 2.1.5)
can all be viewed as special instances of the Chebyshev ambiguity set with uncertain
moments, our results immediately extend to those ambiguity sets as well, and we do
not re-derive the corresponding statements for the sake of brevity. Due to its recent
applications in statistics (Nguyen, Kuhn and Mohajerin Esfahani 2022b), signal
processing (Nguyen et al. 2023) and control (Taşkesen et al. 2024), however, we
report the finite dual and bi-dual reformulations of the Gelbrich ambiguity set with
moment uncertainty set (2.8). All reformulations derived in this section leverage
Lemma 7.6. Thus, they require quadratic representations of the loss function ℓ and
the support set Z as detailed in the following assumption.
Assumption 7.8 (Regularity Conditions for Chebyshev Ambiguity Sets).
(i) The loss function ℓ is a point-wise maximum of quadratic functions,
$$
\ell(z) = \max_{j \in [J]} \ell_j(z) \quad \text{with} \quad \ell_j(z) = z^\top Q_j z + 2 q_j^\top z + q_j^0, \tag{7.3}
$$
where $J \in \mathbb N$, $Q_j \in \mathbb S^d$, $q_j \in \mathbb R^d$, and $q_j^0 \in \mathbb R$ for all $j \in [J]$.
(ii) The support set Z is an ellipsoid of the form
Z = {𝑧 ∈ R𝑑 : (𝑧 − 𝑧0 )⊤ 𝑄 0 (𝑧 − 𝑧0 ) ≤ 1}, (7.4)
where 𝑄 0 ∈ S+𝑑 and 𝑧0 ∈ R𝑑 .
Note that Assumption 7.8 does not impose any convexity conditions on the
quadratic component functions 𝑧⊤ 𝑄 𝑗 𝑧 + 2𝑞 ⊤𝑗 𝑧 + 𝑞 0𝑗 that make up the loss function ℓ.
Theorem 7.9 (Finite Dual Reformulation for Chebyshev Ambiguity Sets). If P is the Chebyshev ambiguity set (2.4) with any F ⊆ R𝑑 × S+𝑑 and Assumption 7.8 holds, then the worst-case expectation problem (4.1) satisfies the weak duality relation
$$
\sup_{P \in \mathcal P} \mathbb E_P[\ell(Z)] \;\le\;
\left\{
\begin{array}{cl}
\inf & \lambda_0 + \delta_{\mathcal F}^*(\lambda, \Lambda) \\[0.5ex]
\text{s.t.} & \lambda_0 \in \mathbb R,\ \lambda \in \mathbb R^d,\ \Lambda \in \mathbb S^d,\ \alpha \in \mathbb R_+^J \\[0.5ex]
& \begin{bmatrix} \Lambda - Q_j + \alpha_j Q_0 & \frac12 \lambda - q_j - \alpha_j Q_0 z_0 \\[0.3ex] \big(\frac12 \lambda - q_j - \alpha_j Q_0 z_0\big)^{\!\top} & \lambda_0 - q_j^0 + \alpha_j (z_0^\top Q_0 z_0 - 1) \end{bmatrix} \succeq 0 \quad \forall j \in [J].
\end{array}
\right.
$$
If F is a convex and compact set with $M \succ \mu\mu^\top$ for all $(\mu, M) \in \operatorname{rint}(\mathcal F)$, then strong duality holds, that is, the above inequality becomes an equality.
Proof. Weak duality follows from Theorem 4.6 and from the following equivalent reformulation of the semi-infinite constraint in the dual problem (4.5).
$$
\begin{aligned}
& \lambda_0 + \lambda^\top z + z^\top \Lambda z \ge \ell(z) \quad \forall z \in \mathcal Z \\
\iff\;& \lambda_0 + \lambda^\top z + z^\top \Lambda z \ge z^\top Q_j z + 2 q_j^\top z + q_j^0 \quad \forall z \in \mathcal Z,\ \forall j \in [J] \\
\iff\;& \exists \alpha \in \mathbb R_+^J \text{ with } \begin{bmatrix} \Lambda - Q_j + \alpha_j Q_0 & \frac12 \lambda - q_j - \alpha_j Q_0 z_0 \\[0.3ex] \big(\frac12 \lambda - q_j - \alpha_j Q_0 z_0\big)^{\!\top} & \lambda_0 - q_j^0 + \alpha_j (z_0^\top Q_0 z_0 - 1) \end{bmatrix} \succeq 0 \quad \forall j \in [J]
\end{aligned}
$$
Here, the first equivalence holds thanks to Assumption 7.8 (i), and the second equivalence follows from Proposition 7.7, which applies because $z_0 \in \operatorname{rint}(\mathcal Z)$ constitutes a Slater point thanks to Assumption 7.8 (ii). In addition, as the loss function is quadratic, strong duality follows readily from Theorem 4.6. □
Recall next that the Gelbrich ambiguity set (2.8) is defined as an instance of the Chebyshev ambiguity set (2.4) with moment uncertainty set
$$
\mathcal F = \Big\{ (\mu, M) \in \mathbb R^d \times \mathbb S_+^d : \exists \Sigma \in \mathbb S_+^d \text{ with } M = \Sigma + \mu\mu^\top,\ \mathrm G\big((\mu, \Sigma), (\hat\mu, \hat\Sigma)\big) \le r \Big\}.
$$
Here, $(\hat\mu, \hat\Sigma)$ is a nominal mean-covariance pair, and $r \ge 0$ is a size parameter. The next result follows directly from Theorems 4.9 and 7.9. We thus omit its proof.
Theorem 7.10 (Finite Dual Reformulation for Gelbrich Ambiguity Sets). If P is the Chebyshev ambiguity set (2.4) with F given by (2.8) and Assumption 7.8 holds, then the worst-case expectation problem (4.1) satisfies the weak duality relation
$$
\sup_{P \in \mathcal P} \mathbb E_P[\ell(Z)] \;\le\;
\left\{
\begin{array}{cl}
\inf & \lambda_0 + \gamma\big(r^2 - \|\hat\mu\|_2^2 - \operatorname{Tr}(\hat\Sigma)\big) + \operatorname{Tr}(A_0) + \alpha_0 \\[0.5ex]
\text{s.t.} & \lambda_0 \in \mathbb R,\ \alpha_0, \gamma \in \mathbb R_+,\ \alpha \in \mathbb R_+^J,\ \lambda \in \mathbb R^d,\ \Lambda \in \mathbb S^d,\ A_0 \in \mathbb S_+^d \\[0.5ex]
& \begin{bmatrix} \Lambda - Q_j + \alpha_j Q_0 & \frac12 \lambda - q_j - \alpha_j Q_0 z_0 \\[0.3ex] \big(\frac12 \lambda - q_j - \alpha_j Q_0 z_0\big)^{\!\top} & \lambda_0 - q_j^0 + \alpha_j (z_0^\top Q_0 z_0 - 1) \end{bmatrix} \succeq 0 \quad \forall j \in [J] \\[1.2ex]
& \begin{bmatrix} \gamma I_d - \Lambda & \gamma \hat\Sigma^{\frac12} \\[0.3ex] \gamma \hat\Sigma^{\frac12} & A_0 \end{bmatrix} \succeq 0, \qquad \begin{bmatrix} \gamma I_d - \Lambda & \gamma \hat\mu + \frac{\lambda}{2} \\[0.3ex] \big(\gamma \hat\mu + \frac{\lambda}{2}\big)^{\!\top} & \alpha_0 \end{bmatrix} \succeq 0.
\end{array}
\right.
$$
If $r > 0$, then strong duality holds, that is, the above inequality becomes an equality.
In order to characterize the extremal distributions that attain the supremum in the worst-case expectation problem (4.1) over Chebyshev and Gelbrich ambiguity sets, we first derive the corresponding bi-duals of (4.1).
Theorem 7.11 (Finite Bi-Dual Reformulation for Chebyshev Ambiguity Sets). If P is the Chebyshev ambiguity set (2.4) with F ⊆ R𝑑 × S+𝑑 and Assumption 7.8 holds, then the worst-case expectation problem (4.1) satisfies the weak duality relation
$$
\sup_{P \in \mathcal P} \mathbb E_P[\ell(Z)] \;\le\;
\left\{
\begin{array}{cl}
\sup & \displaystyle\sum_{j \in [J]} \operatorname{Tr}(Q_j \Theta_j) + 2 q_j^\top \theta_j + q_j^0 p_j \\[0.5ex]
\text{s.t.} & \mu \in \mathbb R^d,\ M \in \mathbb S_+^d,\ p_j \in \mathbb R_+,\ \theta_j \in \mathbb R^d,\ \Theta_j \in \mathbb S_+^d \quad \forall j \in [J] \\[0.5ex]
& \begin{bmatrix} \Theta_j & \theta_j \\ \theta_j^\top & p_j \end{bmatrix} \succeq 0, \qquad \operatorname{Tr}(Q_0 \Theta_j) - 2 z_0^\top Q_0 \theta_j + z_0^\top Q_0 z_0\, p_j \le p_j \quad \forall j \in [J] \\[0.5ex]
& \displaystyle\sum_{j \in [J]} p_j = 1, \quad \mu = \sum_{j \in [J]} \theta_j, \quad M = \sum_{j \in [J]} \Theta_j, \quad (\mu, M) \in \mathcal F
\end{array}
\right. \tag{7.5}
$$
If F is a convex and compact set with $M \succ \mu\mu^\top$ for all $(\mu, M) \in \operatorname{rint}(\mathcal F)$, then strong duality holds, that is, the inequality (7.5) becomes an equality.
Proof. By decomposing the Chebyshev ambiguity set into ambiguity sets with known moments of the form $\mathcal P(\mu, M) = \{P \in \mathcal P(\mathcal Z) : \mathbb E_P[Z] = \mu,\ \mathbb E_P[Z Z^\top] = M\}$, we obtain
$$
\sup_{P \in \mathcal P} \mathbb E_P[\ell(Z)] = \sup_{(\mu, M) \in \mathcal F}\ \sup_{P \in \mathcal P(\mu, M)} \mathbb E_P[\ell(Z)]. \tag{7.6}
$$
The inner maximization problem on the right hand side of (7.6) represents a worst-case expectation problem over an instance of the ambiguity set (2.4) with the moment uncertainty set being the singleton $\{(\mu, M)\}$. The support function of this singleton is given by $\delta^*_{\{(\mu, M)\}}(\lambda, \Lambda) = \lambda^\top \mu + \operatorname{Tr}(\Lambda M)$. Thus, Theorem 7.9 implies that the inner supremum on the right hand side of (7.6) is bounded above by
$$
\begin{array}{cl}
\inf & \lambda_0 + \lambda^\top \mu + \operatorname{Tr}(\Lambda M) \\[0.5ex]
\text{s.t.} & \lambda_0 \in \mathbb R,\ \lambda \in \mathbb R^d,\ \Lambda \in \mathbb S^d,\ \alpha \in \mathbb R_+^J \\[0.5ex]
& \begin{bmatrix} \Lambda - Q_j + \alpha_j Q_0 & \frac12 \lambda - q_j - \alpha_j Q_0 z_0 \\[0.3ex] \big(\frac12 \lambda - q_j - \alpha_j Q_0 z_0\big)^{\!\top} & \lambda_0 - q_j^0 + \alpha_j (z_0^\top Q_0 z_0 - 1) \end{bmatrix} \succeq 0 \quad \forall j \in [J].
\end{array}
$$
The dual of this semidefinite program can be represented as
$$
\begin{array}{cl}
\sup & \displaystyle\sum_{j \in [J]} \operatorname{Tr}(Q_j \Theta_j) + 2 q_j^\top \theta_j + q_j^0 p_j \\[0.5ex]
\text{s.t.} & p_j \in \mathbb R_+,\ \theta_j \in \mathbb R^d,\ \Theta_j \in \mathbb S_+^d \quad \forall j \in [J] \\[0.5ex]
& \begin{bmatrix} \Theta_j & \theta_j \\ \theta_j^\top & p_j \end{bmatrix} \succeq 0, \qquad \operatorname{Tr}(Q_0 \Theta_j) - 2 z_0^\top Q_0 \theta_j + z_0^\top Q_0 z_0\, p_j \le p_j \quad \forall j \in [J] \\[0.5ex]
& \displaystyle\sum_{j \in [J]} p_j = 1, \quad \sum_{j \in [J]} \theta_j = \mu, \quad \sum_{j \in [J]} \Theta_j = M.
\end{array}
$$
Strong duality holds because the primal minimization problem admits a Slater point. Indeed, by defining $\Lambda = \lambda_0 I_d$ and setting $\lambda_0$ to a large value, one can ensure that the linear matrix inequality in the primal problem holds strictly. Replacing the inner supremum on the right hand side of (7.6) with the above dual semidefinite program yields the upper bound in (7.5). If F is convex and compact with $M \succ \mu\mu^\top$ for all $(\mu, M) \in \operatorname{rint}(\mathcal F)$, then (7.5) becomes an equality thanks to Theorem 7.9. □
Note that the bi-dual reformulation in (7.5) is solvable whenever F is compact.
Indeed, its objective function is ostensibly continuous. In addition, it is easy to
verify that its feasible region is compact provided that F is compact.
Theorem 7.12 (Finite Bi-Dual Reformulation for Gelbrich Ambiguity Sets). If P is the Chebyshev ambiguity set (2.4) with F given by (2.8) and Assumption 7.8 holds, then the worst-case expectation problem (4.1) satisfies the weak duality relation
$$
\sup_{P \in \mathcal P} \mathbb E_P[\ell(Z)] \;\le\;
\left\{
\begin{array}{cl}
\max & \displaystyle\sum_{j \in [J]} \operatorname{Tr}(Q_j \Theta_j) + 2 q_j^\top \theta_j + q_j^0 p_j \\[0.5ex]
\text{s.t.} & \mu \in \mathbb R^d,\ M, U \in \mathbb S_+^d,\ C \in \mathbb R^{d \times d} \\[0.5ex]
& p_j \in \mathbb R_+,\ \theta_j \in \mathbb R^d,\ \Theta_j \in \mathbb S_+^d \quad \forall j \in [J] \\[0.5ex]
& \begin{bmatrix} M - U & C \\ C^\top & \hat\Sigma \end{bmatrix} \succeq 0, \qquad \begin{bmatrix} U & \mu \\ \mu^\top & 1 \end{bmatrix} \succeq 0, \qquad \begin{bmatrix} \Theta_j & \theta_j \\ \theta_j^\top & p_j \end{bmatrix} \succeq 0 \quad \forall j \in [J] \\[0.5ex]
& \operatorname{Tr}(Q_0 \Theta_j) - 2 z_0^\top Q_0 \theta_j + z_0^\top Q_0 z_0\, p_j \le p_j \quad \forall j \in [J] \\[0.5ex]
& \displaystyle\sum_{j \in [J]} p_j = 1, \quad \mu = \sum_{j \in [J]} \theta_j, \quad M = \sum_{j \in [J]} \Theta_j \\[0.5ex]
& \|\hat\mu\|_2^2 - 2 \mu^\top \hat\mu + \operatorname{Tr}(M + \hat\Sigma - 2C) \le r^2
\end{array}
\right. \tag{7.7}
$$
If $r > 0$, then strong duality holds, that is, the above inequality becomes an equality.
The proof of Theorem 7.12 follows from Proposition 2.3 and Theorem 7.11 and is thus omitted. We are now ready to construct extremal distributions P★ ∈ P(Z) that attain the supremum of the worst-case expectation problem (4.1) over the Chebyshev ambiguity set (2.4). To this end, fix any maximizer $(\mu^\star, M^\star, p^\star, \theta^\star, \Theta^\star)$ of the bi-dual problem (7.5), which exists if F is compact. Next, define the index sets
$$
\mathcal J^\infty = \big\{ j \in [J] : p_j^\star = 0,\ \Theta_j^\star \ne 0 \big\} \qquad \text{and} \qquad \mathcal J^+ = \big\{ j \in [J] : p_j^\star > 0 \big\},
$$
and define $\mathcal J = \mathcal J^+ \cup \mathcal J^\infty$. The extremal distributions P★ will be constructed as mixtures of constituent distributions $P_j$, $j \in \mathcal J$, corresponding to different pieces of the loss function ℓ. In the following, we use $P \sim (\mu, M)$ to indicate
that the distribution P has mean 𝜇 and second-order moment matrix 𝑀. Note that
if Z = {𝑧 ∈ R𝑑 : (𝑧 − 𝑧0 )⊤ 𝑄 0 (𝑧 − 𝑧0 ) ≤ 1} is the ellipsoid from Assumption 7.8 (ii)
and P ∈ P(Z) is a distribution supported on Z with P ∼ (𝜇, 𝑀), then we have
$$
1 \ge \mathbb E_P\big[(Z - z_0)^\top Q_0 (Z - z_0)\big] = \operatorname{Tr}(Q_0 M) - 2 z_0^\top Q_0 \mu + z_0^\top Q_0 z_0.
$$
The inequality in the above expression holds because P ∈ P(Z), and the equality
holds because P ∼ (𝜇, 𝑀). The following lemma by Hanasusanto et al. (2015a,
Proposition 6.1) shows the reverse implication. That is, if 𝜇 and 𝑀 satisfy the
above inequality, then there is a (discrete) distribution P ∼ (𝜇, 𝑀) supported on Z.
Lemma 7.13 (Distributions on Ellipsoids with Given Moments). If Z is the ellipsoid from Assumption 7.8 (ii), and if $\operatorname{Tr}(Q_0 M) - 2 z_0^\top Q_0 \mu + z_0^\top Q_0 z_0 \le 1$ for some $M \in \mathbb S_+^d$ and $\mu \in \mathbb R^d$ with $M \succeq \mu\mu^\top$, then there exists a discrete distribution $P \in \mathcal P(\mathcal Z)$ with at most $2d$ atoms that satisfies $P \sim (\mu, M)$.
The proof of Lemma 7.13 is simple but tedious and thus omitted.
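Although the proof is omitted, a closely related moment-matching construction is easy to state. The following sketch (our own illustration, not the construction of Hanasusanto et al. (2015a)) builds a sigma-point-style discrete distribution with $2d$ atoms matching a prescribed mean and second-moment matrix on $\mathcal Z = \mathbb R^d$; it deliberately ignores the additional ellipsoidal support restriction that the lemma enforces:

```python
import numpy as np

def two_d_point_distribution(mu, M):
    """Return 2d atoms and weights of a discrete P with mean mu and
    second-moment matrix M (assumes M - mu mu' is positive semidefinite).
    This sigma-point-style construction lives on Z = R^d; the lemma
    additionally places the atoms inside an ellipsoid, which this
    simplified sketch does not replicate."""
    d = mu.size
    Sigma = M - np.outer(mu, mu)
    # Symmetric square root of the covariance matrix.
    w, V = np.linalg.eigh(Sigma)
    C = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
    atoms = np.vstack([mu + np.sqrt(d) * C[:, i] for i in range(d)]
                      + [mu - np.sqrt(d) * C[:, i] for i in range(d)])
    probs = np.full(2 * d, 1.0 / (2 * d))
    return atoms, probs

rng = np.random.default_rng(1)
d = 4
mu = rng.standard_normal(d)
A = rng.standard_normal((d, d))
M = A @ A.T + np.outer(mu, mu)           # ensures M - mu mu' = A A' >= 0

atoms, probs = two_d_point_distribution(mu, M)
mean = probs @ atoms
second = sum(p * np.outer(zz, zz) for p, zz in zip(probs, atoms))
print(np.allclose(mean, mu), np.allclose(second, M))
```

The construction works because the $2d$ symmetric perturbations $\mu \pm \sqrt d\, c_i$ cancel in the mean while the columns $c_i$ of the covariance square root satisfy $\sum_i c_i c_i^\top = M - \mu\mu^\top$.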
Theorem 7.14 (Extremal Distributions of Chebyshev Ambiguity Sets). If all conditions of Theorem 7.11 for weak as well as strong duality are satisfied and $(\mu^\star, M^\star, p^\star, \theta^\star, \Theta^\star)$ solves (7.5), then the following hold.
(i) If $\mathcal J^\infty = \emptyset$, then there exist discrete distributions $P_j^\star \sim (\theta_j^\star / p_j^\star,\, \Theta_j^\star / p_j^\star)$ supported on Z for all $j \in \mathcal J^+$, and (4.1) is solved by $P^\star = \sum_{j \in \mathcal J^+} p_j^\star P_j^\star$. In addition, we have $P^\star \sim (\mu^\star, M^\star)$, and $P^\star$ is supported on Z.
(ii) If $\mathcal J^\infty \ne \emptyset$, then there exist discrete distributions $P_j^m \sim (\theta_j^\star / p_j^m,\, \Theta_j^\star / p_j^m)$ supported on Z for all $j \in \mathcal J$, where $p_j^m = (1 - |\mathcal J^\infty| / m)\, p_j^\star$ for $j \in \mathcal J^+$ and $p_j^m = 1/m$ for $j \in \mathcal J^\infty$, and where $m$ is any integer with $m \ge |\mathcal J^\infty|$. In addition, (4.1) is asymptotically solved by $P^m = \sum_{j \in \mathcal J} p_j^m P_j^m$ as $m$ grows.
Proof. As for assertion (i), the constraints of problem (7.5) imply that
$$
\operatorname{Tr}(Q_0 \Theta_j^\star / p_j^\star) - 2 z_0^\top Q_0\, \theta_j^\star / p_j^\star + z_0^\top Q_0 z_0 \le 1
$$
and
$$
\begin{bmatrix} \Theta_j^\star & \theta_j^\star \\ (\theta_j^\star)^\top & p_j^\star \end{bmatrix} \succeq 0 \iff \Theta_j^\star / p_j^\star \succeq (\theta_j^\star / p_j^\star)(\theta_j^\star / p_j^\star)^\top
$$
for all $j \in \mathcal J^+$. Lemma 7.13 thus guarantees that there exist discrete distributions $P_j^\star \sim (\theta_j^\star / p_j^\star,\, \Theta_j^\star / p_j^\star)$, $j \in \mathcal J^+$, all of which are supported on Z. Consequently, $P^\star = \sum_{j \in \mathcal J^+} p_j^\star P_j^\star$ is also supported on Z. In addition, we have
$$
\mathbb E_{P^\star}[Z] = \sum_{j \in \mathcal J^+} p_j^\star \cdot \mathbb E_{P_j^\star}[Z] = \sum_{j \in \mathcal J^+} p_j^\star \cdot \theta_j^\star / p_j^\star = \sum_{j \in \mathcal J^+} \theta_j^\star = \mu^\star
$$
and
$$
\mathbb E_{P^\star}[Z Z^\top] = \sum_{j \in \mathcal J^+} p_j^\star \cdot \mathbb E_{P_j^\star}[Z Z^\top] = \sum_{j \in \mathcal J^+} p_j^\star \cdot \Theta_j^\star / p_j^\star = \sum_{j \in \mathcal J^+} \Theta_j^\star = M^\star,
$$
that is, $P^\star \sim (\mu^\star, M^\star)$. As $(\mu^\star, M^\star) \in \mathcal F$, it is now clear that $P^\star \in \mathcal P$ and that
$$
\mathbb E_{P^\star}[\ell(Z)] \le \sup_{P \in \mathcal P} \mathbb E_P[\ell(Z)] = \sum_{j \in [J]} \operatorname{Tr}(Q_j \Theta_j^\star) + 2 q_j^\top \theta_j^\star + q_j^0 p_j^\star,
$$
where the equality follows from strong duality as established in Theorem 7.11. At the same time, the definition of P★ as a mixture distribution and the definition of ℓ in (7.3) as a pointwise maximum of quadratic component functions implies that
$$
\mathbb E_{P^\star}[\ell(Z)] \ge \sum_{j \in \mathcal J^+} p_j^\star \cdot \mathbb E_{P_j^\star}[\ell_j(Z)] = \sum_{j \in [J]} \operatorname{Tr}(Q_j \Theta_j^\star) + 2 q_j^\top \theta_j^\star + q_j^0 p_j^\star.
$$
Specifically, the inequality holds because ℓ ≥ ℓ 𝑗 for every 𝑗 ∈ [𝐽], and the equality
holds because 𝜃★𝑗 = 0 and Θ★𝑗 = 0 whenever 𝑝★𝑗 = 0. Indeed, if 𝑝★𝑗 = 0, then Θ★𝑗 = 0
because the index set J ∞ is empty, and the linear matrix inequality in (7.5) implies
that 𝜃★𝑗 = 0 whenever Θ★𝑗 = 0. The above inequalities thus ensure that P★ solves the
worst-case expectation problem (4.1). This completes the proof of assertion (i).
Next, we address assertion (ii). Similar arguments as in the proof of assertion (i) can be used to show that $P^m \in \mathcal P$ for every $m \ge |\mathcal J^\infty|$. This implies that $\mathbb E_{P^m}[\ell(Z)] \le \sup_{P \in \mathcal P} \mathbb E_P[\ell(Z)]$ whenever $m \ge |\mathcal J^\infty|$. In addition, we observe that
$$
\lim_{m \to \infty} \mathbb E_{P^m}[\ell(Z)] \ge \lim_{m \to \infty} \sum_{j \in \mathcal J} p_j^m \cdot \mathbb E_{P_j^m}[\ell_j(Z)] = \lim_{m \to \infty} \sum_{j \in \mathcal J} \operatorname{Tr}(Q_j \Theta_j^\star) + 2 q_j^\top \theta_j^\star + q_j^0 p_j^m = \sum_{j \in [J]} \operatorname{Tr}(Q_j \Theta_j^\star) + 2 q_j^\top \theta_j^\star + q_j^0 p_j^\star = \sup_{P \in \mathcal P} \mathbb E_P[\ell(Z)],
$$
where the first equality exploits the definition of the distributions $P_j^m$, the second equality follows from taking the limit as $m$ grows, and the last equality follows from strong duality as established in Theorem 7.11. This completes the proof. □
Theorem 7.14 also applies to the Gelbrich ambiguity set, which constitutes a
Chebyshev ambiguity set of the form (2.4) with F given by (2.8). The extremal
distribution P★ identified in Theorem 7.14 (i) constitutes a mixture of different
distributions P★𝑗 , each of which corresponds to a component ℓ 𝑗 of the loss function ℓ;
see Assumption 7.8 (i). The mixture components P★𝑗 may be set to any distributions
on Z that satisfy the prescribed moment conditions. Note that discrete distributions
consistent with these requirements are guaranteed to exist thanks to Lemma 7.13.
However, if Z = R𝑑 , say, then one could also set P★𝑗 to the Gaussian distribution with
the given first and second moments. From the proof of Theorem 7.14 it becomes
clear that P★𝑗 must be supported on {𝑧 ∈ Z : ℓ 𝑗 (𝑧) ≥ ℓ 𝑗 ′ (𝑧) ∀ 𝑗 ′ ≠ 𝑗 }, which is
generically nonconvex. Therefore, Kuhn et al. (2019, § 2.2) conjectured that the
construction of P★𝑗 is NP-hard. From the proof of Lemma 7.13 in (Hanasusanto et al.
2015a, § 6) it becomes clear, however, that P★𝑗 can be constructed efficiently. Similar
comments are in order for the distributions P𝑚 𝑗 appearing in Theorem 7.14 (ii).
If J ∞ ≠ ∅, then the extremal distributions constructed in Theorem 7.14 contain
diverging mixture components whose covariance matrices explode along certain
recession directions of the support set Z (that is, along the eigenvectors of Θ★𝑗, 𝑗 ∈
J ∞ , corresponding to non-zero eigenvalues). However, these diverging mixture
components are assigned weights that decay with their variances such that the
covariance matrix of the entire mixture distribution remains bounded.
The following lemma establishes a sufficient condition for J ∞ to be empty,
which ensures via Theorem 7.14 (i) that problem (4.1) is solvable.
Lemma 7.15. If all conditions of Theorem 7.14 are satisfied and the support set Z
defined in (7.4) is compact, then J ∞ = ∅, and thus problem (4.1) is solvable.
Proof. If 𝑝★𝑗 = 0 for some 𝑗 ∈ [𝐽], then the linear matrix inequality in (7.5) implies
that 𝜃★𝑗 = 0. Consequently, the 𝑗-th trace inequality simplifies to Tr(𝑄 0 Θ★𝑗) ≤ 0.
As 𝑄 0 ≻ 0 because Z is compact, we thus find that Θ★𝑗 = 0. In summary, we have
shown that 𝑝★𝑗 = 0 implies Θ★𝑗 = 0, and therefore J ∞ is empty as desired. □
We conclude this section with some remarks on worst-case expectation problems
with more generic moment ambiguity sets. Translated into our terminology, Richter
(1957) and Rogosinski (1958) show that if P = {P ∈ P(Z) : E P [ 𝑓 (𝑍)] = 𝜇} for
some 𝑓 : Z → R𝑚 and 𝜇 ∈ R𝑚 , and if (4.1) is solvable, then the supremum in (4.1)
is attained by a discrete distribution with at most 𝑚 + 2 atoms. See (Shapiro et al. 2009, Theorem 7.32) for a modern proof of this result. Note also that, under the given assumptions, the worst-case expectation problem (4.1) can be recast as
$$
\sup_{\rho \in \mathcal M_+(\mathcal Z)} \bigg\{ \int_{\mathcal Z} \ell(z) \,\mathrm d\rho(z) \;:\; \int_{\mathcal Z} \mathrm d\rho(z) = 1,\ \int_{\mathcal Z} f(z) \,\mathrm d\rho(z) = \mu \bigg\}. \tag{7.8}
$$

Problem (7.8) constitutes an infinite-dimensional linear program over the non-


negative Borel measures on Z with 𝑚 + 1 linear equality constraints. Every
finite-dimensional linear program with non-negative variables and 𝑚 + 1 equality
constraints is known to admit an optimal basic feasible solution with at most 𝑚 + 1
non-zero entries. The infinite-dimensional analog of a basic feasible solution is
a discrete measure with at most 𝑚 + 1 atoms. Accordingly, one can prove that
if (7.8) is solvable, then its supremum is attained by a measure with at most 𝑚 + 1
atoms (Pinelis 2016, Corollary 5 and Proposition 6(v)). This result strengthens the
Richter-Rogosinski theorem. However, the minimum number of atoms required for
an optimal measure cannot be reduced beyond 𝑚+1 without additional assumptions.
The above reasoning implies that the worst-case expectation problem (4.1) and
its reformulation (7.8) as a semi-infinite linear program can be reduced to a finite-
dimensional optimization problem over the locations and probabilities of the 𝑚 + 1
atoms of a discrete measure. Finite reductions of this type are routinely studied in
optimal uncertainty quantification (Owhadi et al. 2013). However, they generically
represent nonconvex optimization problems. Indeed, even the integral of a linear
function with respect to a discrete measure involves products of the probabilities and
the coordinates of the measure’s atoms. If (7.8) is solvable and ℓ is representable as
a pointwise maximum of 𝐽 concave functions, then the 𝑚 + 1 atoms of an extremal
measure can be further condensed. That is, using an induction argument and an
iterative application of Jensen’s inequality, one can show that (7.8) is solved by
a discrete measure with at most 𝐽 atoms (Han, Tao, Topcu, Owhadi and Murray
2015, Lemma 3.1). This result is significant even though 𝐽 is not necessarily
smaller than 𝑚 + 1. It implies that (7.8) admits a finite reduction that optimizes
over discrete measures with 𝐽 atoms. And this (nonconvex) finite reduction is
intimately related to the dual problem (4.4) derived in Theorem 4.5 through a
‘primal-worst-equals-dual-best’ duality scheme for robust optimization problems
(Beck and Ben-Tal 2009). Specifically, (4.4) can be viewed as a ‘primal-worst’
robust optimization problem, and the finite reduction corresponding to discrete
measures with 𝐽 atoms can be viewed as the corresponding ‘dual-best’ optimization
problem (Zhen et al. 2023). These problems share the same optimal value under
mild regularity conditions. In addition, the (dual best) finite reduction can be
convexified by applying a variable transformation and a perspectification trick (Han
et al. 2015, Theorem 1.1). The same convex reformulation can also be obtained
by dualizing the finite dual reformulation of the (primal worst) problem (4.4) as
outlined in Section 7.1. For further details we refer to (Zhen et al. 2023).
7.3. 𝜙-Divergence Ambiguity Sets
Recall that the 𝜙-divergence ambiguity set (2.10) is defined as

P = P ∈ P(Z) : D 𝜙 (P, P̂) ≤ 𝑟 ,
where 𝑟 is a size parameter, 𝜙 is an entropy function in the sense of Definition 2.4,
D 𝜙 is the corresponding 𝜙-divergence in the sense of Definition 2.5, and P̂ ∈ P(Z)
is a reference distribution. In the following, we first demonstrate that the worst-case
expectation problem (4.1) over a 𝜙-divergence ambiguity set can be reformulated
as a finite convex program whenever P̂ is discrete and ℓ is real-valued.
Assumption 7.16 (Discrete Reference Distribution). We have $\hat P = \sum_{i \in [N]} \hat p_i \delta_{\hat z_i}$
for some 𝑁 ∈ N, where the probabilities 𝑝ˆ𝑖 , 𝑖 ∈ [𝑁], are strictly positive and sum
to 1, and where 𝑧ˆ𝑖 ∈ Z for every 𝑖 ∈ [𝑁]. In addition, ℓ(𝑧) ∈ R for all 𝑧 ∈ Z.
The requirement that 𝑝ˆ𝑖 be positive for every 𝑖 ∈ [𝑁] is non-restrictive because
atoms with zero probability can simply be eliminated without changing P̂.
Theorem 7.17 (Finite Dual Reformulation for 𝜙-Divergence Ambiguity Sets). If P
is the 𝜙-divergence ambiguity set (2.10) and Assumption 7.16 holds, then the worst-
case expectation problem (4.1) satisfies the weak duality relation
$$
\sup_{P \in \mathcal P} \mathbb E_P[\ell(Z)] \;\le\;
\left\{
\begin{array}{cl}
\displaystyle\inf_{\lambda_0 \in \mathbb R,\, \lambda \in \mathbb R_+} & \lambda_0 + \lambda r + \displaystyle\sum_{i \in [N]} \hat p_i \cdot (\phi^*)^\pi\big(\ell(\hat z_i) - \lambda_0, \lambda\big) \\[0.5ex]
\text{s.t.} & \lambda_0 + \lambda \phi^\infty(1) \ge \displaystyle\sup_{z \in \mathcal Z} \ell(z),
\end{array}
\right. \tag{7.9}
$$
where the product $\lambda \phi^\infty(1)$ is assumed to evaluate to $\infty$ if $\lambda = 0$ and $\phi^\infty(1) = \infty$. If $r > 0$ and $\phi$ is continuous at 1, then strong duality holds, that is, the above inequality becomes an equality.
Theorem 7.17 is an immediate corollary of Theorem 4.11. Indeed, problem (7.9)
is obtained from (4.11) by re-expressing the integral with respect to the discrete
reference distribution P̂ as a weighted sum. Thus, no proof is required. Recall
now that the restricted 𝜙-divergence ambiguity set is defined as the set of all
distributions P ∈ P with P ≪ P̂. It is straightforward to verify that if P is discrete,
then the corresponding worst-case expectation problem (4.1) admits a finite convex
reformulation that is given by a relaxation of (7.9) without constraints. Details
are omitted for brevity. Next, we derive a finite convex program dual to (7.9) that
allows us to construct an extremal distribution.
Theorem 7.18 (Finite Bi-Dual Reformulations for 𝜙-Divergence Ambiguity Sets). If P is the 𝜙-divergence ambiguity set (2.10), Assumption 7.16 holds, 𝑟 > 0 and 𝜙 is continuous at 1, then problem (4.1) satisfies the strong duality relation
$$
\sup_{P \in \mathcal P} \mathbb E_P[\ell(Z)] \;=\;
\left\{
\begin{array}{cl}
\displaystyle\max_{p_0, \dots, p_N \in \mathbb R_+} & p_0 \bar\ell + \displaystyle\sum_{i \in [N]} p_i\, \ell(\hat z_i) \\[0.5ex]
\text{s.t.} & p_0 + \displaystyle\sum_{i \in [N]} p_i = 1 \\[0.5ex]
& p_0 \phi^\infty(1) + \displaystyle\sum_{i \in [N]} \hat p_i\, \phi\Big(\frac{p_i}{\hat p_i}\Big) \le r,
\end{array}
\right. \tag{7.10}
$$
where $\bar\ell$ is a shorthand for $\sup_{z \in \mathcal Z} \ell(z)$. The product $p_0 \phi^\infty(1)$ is assumed to equal 0 if $p_0 = 0$ and $\phi^\infty(1) = \infty$. Similarly, $p_0 \bar\ell$ is assumed to equal 0 if $p_0 = 0$ and $\bar\ell = \infty$.
The finite bi-dual reformulation (7.10) can readily be derived from the primal
worst-case expectation problem (4.1) or from its finite dual reformulation (7.9).
We find it insightful to derive (7.10) from (7.9). This is also more consistent with
the general proof strategy outlined in Section 7.1. We will briefly touch on the
derivation of (7.10) from the primal problem (4.1) after the proof.
Proof of Theorem 7.18. Assume first that $\phi^\infty(1) < \infty$. Under the assumptions stated in the theorem, the worst-case expectation problem (4.1) and its dual (7.9) share the same optimal value thanks to Theorem 7.17. By dualizing the single explicit constraint in (4.11) and using Lemma 7.1 (i), we thus find
$$
\sup_{P \in \mathcal P} \mathbb E_P[\ell(Z)] = \inf_{\lambda_0 \in \mathbb R,\, \lambda \in \mathbb R_+}\ \lambda_0 + \lambda r + \sum_{i \in [N]} \hat p_i \sup_{y_i \in \mathbb R_+} \big( y_i (\ell(\hat z_i) - \lambda_0) - \lambda \phi(y_i) \big) + \sup_{p_0 \in \mathbb R_+} \big( \bar\ell - \lambda_0 - \lambda \phi^\infty(1) \big)\, p_0.
$$
Interchanging the infima and suprema and rearranging terms further yields
$$
\begin{aligned}
\sup_{P \in \mathcal P} \mathbb E_P[\ell(Z)]
&= \sup_{p_0, y_1, \dots, y_N \in \mathbb R_+}\ p_0 \bar\ell + \sum_{i \in [N]} \hat p_i y_i\, \ell(\hat z_i) + \inf_{\lambda_0 \in \mathbb R} \Big( 1 - p_0 - \sum_{i \in [N]} \hat p_i y_i \Big) \lambda_0 + \inf_{\lambda \in \mathbb R_+} \Big( r - p_0 \phi^\infty(1) - \sum_{i \in [N]} \hat p_i \phi(y_i) \Big) \lambda \\
&= \left\{
\begin{array}{cl}
\displaystyle\sup_{p_0, y_1, \dots, y_N \in \mathbb R_+} & p_0 \bar\ell + \displaystyle\sum_{i \in [N]} \hat p_i y_i\, \ell(\hat z_i) \\[0.5ex]
\text{s.t.} & p_0 + \displaystyle\sum_{i \in [N]} \hat p_i y_i = 1, \qquad p_0 \phi^\infty(1) + \displaystyle\sum_{i \in [N]} \hat p_i \phi(y_i) \le r.
\end{array}
\right.
\end{aligned}
$$
The first equality in the above expression follows from strong duality, which holds because $r > 0$ and $\phi$ is continuous at 1. Indeed, these conditions ensure that the resulting maximization problem admits a Slater point with $p_0 = 0$ and $y_i = 1$ for all $i \in [N]$. The substitution $p_i \leftarrow \hat p_i y_i$, $i \in [N]$, finally shows that the obtained problem is equivalent to (7.10). This proves the claim for $\phi^\infty(1) < \infty$.

Suppose next that $\phi^\infty(1) = \infty$, in which case the product $0 \cdot \phi^\infty(1)$ evaluates to $\infty$. Hence, the constraint in (4.11) is satisfied for any $(\lambda_0, \lambda) \in \mathbb R \times \mathbb R_+$ and is thus redundant. Repeating the steps from the first part of the proof with obvious minor modifications shows that (7.10) still holds if we assume that $p_0 \phi^\infty(1)$ and $p_0 \bar\ell$ evaluate to 0 when $p_0 = 0$. Indeed, this means that $p_0 = 0$ is the only feasible solution in (7.10), and problem (7.10) can be simplified by eliminating $p_0$ altogether. □
The finite bi-dual reformulation on the right hand side of (7.10) has a linear
objective function and a compact convex feasible region. Therefore, it is solvable
thanks to Weierstrass’ maximum theorem. In particular, note that the feasible region
is a subset of the probability simplex in $\mathbb R^{N+1}$. If there exists a worst-case scenario $\hat z_0 \in \arg\max_{z \in \mathcal Z} \ell(z)$ (which must satisfy $\ell(\hat z_0) = \bar\ell$), then any maximizer $p^\star$ of the bi-dual can be used to construct an extremal distribution $P^\star = \sum_{i=0}^{N} p_i^\star \delta_{\hat z_i}$ for the
worst-case expectation problem (4.1). Indeed, the constraints of problem (7.10) ensure that $p_0^\star, \dots, p_N^\star$ are non-negative probabilities that sum to 1. Thus, $P^\star$ is a valid distribution supported on Z. Setting $\rho = \sum_{i=0}^{N} \delta_{\hat z_i}$, we also find
$$
\mathrm D_\phi(P^\star, \hat P) = \int_{\mathcal Z} \phi^\pi\Big( \frac{\mathrm d P^\star}{\mathrm d\rho}(z),\, \frac{\mathrm d \hat P}{\mathrm d\rho}(z) \Big) \,\mathrm d\rho(z) = \phi^\pi(p_0^\star, 0) + \sum_{i \in [N]} \phi^\pi(p_i^\star, \hat p_i) \le r,
$$

where the first equality exploits the definition of D 𝜙 , and the second equality
exploits our choice of the reference distribution 𝜌. In addition, the inequality
follows from the constraints of problem (7.10) and the observation that
𝜙 𝜋 (𝑝★0 , 0) = 𝜙∞ (𝑝★0 ) = 𝑝★0 𝜙∞ (1).
This confirms that P★ is feasible in (4.1). Also, its objective function value equals
$$
\mathbb E_{P^\star}[\ell(Z)] = \sum_{i=0}^{N} p_i^\star\, \ell(\hat z_i).
$$
As $\ell(\hat z_0) = \bar\ell$, we may conclude that $\mathbb E_{P^\star}[\ell(Z)]$ coincides with the maximum of the bi-dual reformulation in (7.10), which in turn matches the supremum of (4.1) by virtue of Theorem 7.18. Hence, P★ is indeed a maximizer of problem (4.1).
Recall that if 𝜙∞ (1) = ∞, then D 𝜙 (P, P̂) = ∞ unless P ≪ P̂. Therefore,
every distribution P in a 𝜙-divergence ambiguity set around P̂ must be absolutely
continuous with respect to P̂. If 𝜙∞ (1) < ∞, on the other hand, then P can assign
a positive probability to points in Z that have zero probability under P̂. Note that
D 𝜙 (P, P̂) only depends on how much probability mass P removes from the support
of P̂, but it does not depend on where that probability mass is moved. As nature
aims to maximize the expected loss, it will move all of this probability mass to a
point with maximal loss within Z (i.e., to some point 𝑧ˆ0 ∈ arg max_{𝑧∈Z} ℓ(𝑧)).
If P is the restricted 𝜙-divergence ambiguity set (2.11), Assumption 7.16 holds,
𝑟 > 0 and 𝜙 is continuous at 1, then Theorem 7.18 remains valid with a minor
modification. That is, one must append the constraint 𝑝 0 = 0 to the finite bi-dual
reformulation on the right hand side of (7.10). Details are omitted for brevity.
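To make the preceding construction concrete, the following sketch solves a small instance of the finite bi-dual for the total variation divergence 𝜙(𝑡) = |𝑡 − 1|, for which 𝜙∞(1) = 1 is finite, so that mass may escape the support of P̂ toward the worst-case scenario 𝑧ˆ0. All data (atoms, probabilities, loss values, radius) are hypothetical, and the encoding of the divergence constraint 𝑝0 + ∑𝑖 |𝑝𝑖 − 𝑝ˆ𝑖| ≤ 𝑟 as a linear program via auxiliary variables 𝑡𝑖 ≥ |𝑝𝑖 − 𝑝ˆ𝑖| is ours:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical data: reference atoms z_1..z_3, loss l(z) = z on Z = [0, 3],
# worst-case scenario z_0 = 3 (a maximizer of the loss over Z).
z_hat = np.array([3.0, 0.0, 1.0, 2.0])   # z_0, z_1, z_2, z_3
p_hat = np.array([1/3, 1/3, 1/3])        # reference probabilities of z_1..z_3
loss = z_hat.copy()                      # l(z) = z
r = 0.2                                  # divergence budget

# Variables x = (p_0, p_1, p_2, p_3, t_1, t_2, t_3) with t_i >= |p_i - p_hat_i|.
# Total variation: phi^pi(p_i, p_hat_i) = |p_i - p_hat_i| and
# phi^pi(p_0, 0) = p_0 * phi^inf(1) = p_0.
c = np.concatenate([-loss, np.zeros(3)])           # maximize sum_i p_i * l(z_i)
A_eq = np.array([[1, 1, 1, 1, 0, 0, 0]]); b_eq = [1.0]
A_ub = [[1, 0, 0, 0, 1, 1, 1]]; b_ub = [r]         # p_0 + sum_i t_i <= r
for i in range(3):                                 # linearize t_i >= |p_i - p_hat_i|
    row_plus = [0.0] * 7; row_plus[1 + i] = 1.0; row_plus[4 + i] = -1.0
    row_minus = [0.0] * 7; row_minus[1 + i] = -1.0; row_minus[4 + i] = -1.0
    A_ub += [row_plus, row_minus]; b_ub += [p_hat[i], -p_hat[i]]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
p_star = res.x[:4]
print("worst-case expectation:", -res.fun)
```

In line with the discussion above, the optimizer removes mass from the atom with the smallest loss and places it on the worst-case scenario 𝑧ˆ0, spending the budget 𝑟 half on the removal and half on the escaping mass 𝑝★0.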

7.4. Optimal Transport Ambiguity Sets


Recall that the optimal transport ambiguity set (2.27) is defined as

P = {P ∈ P(Z) : OT𝑐 (P, P̂) ≤ 𝑟},
where 𝑟 ≥ 0 is a size parameter, 𝑐 is a transportation cost function in the sense of
Definition 2.14, OT𝑐 is the corresponding optimal transport discrepancy in the sense
of Definition 2.15, and P̂ ∈ P(Z) is a reference distribution. We will first show that
the worst-case expectation problem (4.1) over an optimal transport ambiguity set
can often be reformulated as a finite convex minimization problem. To this end, we


restrict attention to discrete reference distributions as in Assumption 7.16, and we
impose convexity conditions on the transportation cost function, the loss function,
and the support set Z. In addition, we impose a mild technical condition on the
support points of the discrete reference distribution P̂.
Assumption 7.19 (Regularity Conditions for Optimal Transport Ambiguity Sets).
(i) The loss function ℓ is a point-wise maximum of 𝐽 ∈ N concave functions, that
is, ℓ(𝑧) = max 𝑗 ∈ [ 𝐽 ] ℓ 𝑗 (𝑧), where −ℓ 𝑗 : Z → R is proper, convex and closed.
(ii) The support set is representable as Z = {𝑧 ∈ R𝑑 : 𝑔 𝑘 (𝑧) ≤ 0 ∀𝑘 ∈ [𝐾]} for
some 𝐾 ∈ N, where 𝑔 𝑘 : Z → R is proper, convex and closed.
(iii) The transportation cost function 𝑐(𝑧, 𝑧ˆ) is convex in 𝑧 for every fixed 𝑧ˆ ∈ Z.
(iv) The support point 𝑧ˆ𝑖 belongs to rint(dom(𝑐(·, 𝑧ˆ𝑖 ))) and constitutes a Slater
point for Z in the sense of Definition 7.3 for every 𝑖 ∈ [𝑁].
Assumption 7.19 (i) is non-restrictive because any continuous function ℓ on a
compact set Z can be uniformly approximated by a pointwise maximum of finitely
many concave functions ℓ 𝑗 , 𝑗 ∈ [𝐽], albeit maybe at the expense of requiring
large numbers 𝐽 of pieces. Assumptions 7.19 (ii) and (iii) are restrictive but
satisfied by support sets and transportation cost functions commonly encountered
in applications. Finally, Assumption 7.19 (iv) is of a purely technical nature and
can always be enforced by slightly perturbing the problem data.
Theorem 7.20 (Finite Dual Reformulation for Optimal Transport Ambiguity Sets).
If P is the optimal transport ambiguity set (2.27) and Assumptions 7.16 and 7.19
hold, then the worst-case expectation problem (4.1) obeys the weak duality relation
sup_{P∈P} E P [ℓ(𝑍)]

 ≤  inf  𝜆𝑟 + ∑_{𝑖∈[𝑁]} 𝑝ˆ𝑖 𝑠𝑖
    s.t. 𝜆 ∈ R+ , 𝛼𝑖𝑗𝑘 ∈ R+ , 𝑠𝑖 ∈ R ∀𝑖 ∈ [𝑁], 𝑗 ∈ [𝐽], 𝑘 ∈ [𝐾]
         𝜁ℓ𝑖𝑗 , 𝜁𝑐𝑖𝑗 , 𝜁𝑔𝑖𝑗𝑘 ∈ R𝑑 ∀𝑖 ∈ [𝑁], 𝑗 ∈ [𝐽], 𝑘 ∈ [𝐾]                        (7.11)
         (−ℓ𝑗 )∗ (𝜁ℓ𝑖𝑗 ) + (𝑐∗𝑖 )𝜋 (𝜁𝑐𝑖𝑗 , 𝜆) + ∑_{𝑘∈[𝐾]} (𝑔∗𝑘 )𝜋 (𝜁𝑔𝑖𝑗𝑘 , 𝛼𝑖𝑗𝑘 ) ≤ 𝑠𝑖 ∀𝑖 ∈ [𝑁], 𝑗 ∈ [𝐽]
         𝜁ℓ𝑖𝑗 + 𝜁𝑐𝑖𝑗 + ∑_{𝑘∈[𝐾]} 𝜁𝑔𝑖𝑗𝑘 = 0 ∀𝑖 ∈ [𝑁], 𝑗 ∈ [𝐽],

where 𝑐𝑖 : Z → R is defined through 𝑐𝑖 (𝑧) = 𝑐(𝑧, 𝑧ˆ𝑖 ) for every 𝑖 ∈ [𝑁]. If 𝑟 > 0,
then strong duality holds, that is, the above inequality becomes an equality.
The dual minimization problem of Theorem 7.20 constitutes a finite convex
program because the conjugates (−ℓ 𝑗 )∗ , 𝑐∗𝑖 and 𝑔∗𝑘 and their perspectives are convex
functions. It accommodates O(𝑁 𝐽𝐾) decision variables and O(𝑁 𝐽) constraints.
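The duality at work here can be checked numerically on a toy instance. The sketch below uses the generic dual from which the proof below starts (cf. Theorem 4.18) rather than the full conjugate-based reformulation: it discretizes Z = [−1, 1], takes ℓ(𝑧) = |𝑧| = max(𝑧, −𝑧) (a maximum of 𝐽 = 2 linear, hence concave, pieces), 𝑐(𝑧, 𝑧ˆ) = |𝑧 − 𝑧ˆ| and a single reference atom at 𝑧ˆ = 0, and compares the primal worst-case expectation (an LP over transport plans on the grid) with the dual value minimized over a grid of 𝜆. The instance, grids and tolerances are all hypothetical:

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance: Z = [-1, 1] discretized, l(z) = |z|, c(z, zhat) = |z - zhat|,
# one reference atom zhat = 0, radius r = 0.3.
grid = np.linspace(-1.0, 1.0, 201)
loss = np.abs(grid)
z_hat, r = 0.0, 0.3
cost = np.abs(grid - z_hat)

# Primal: LP over the transport plan pi_k (mass moved from zhat to grid point k),
# maximizing the expected loss subject to the transport budget.
res = linprog(-loss, A_ub=[cost], b_ub=[r],
              A_eq=[np.ones_like(grid)], b_eq=[1.0], bounds=(0, None))
primal = -res.fun

# Dual: inf over lam >= 0 of lam * r + sup_z { l(z) - lam * c(z, zhat) },
# with the inner sup evaluated on the same grid and lam on a coarse grid.
lams = np.linspace(0.0, 5.0, 501)
dual = min(lam * r + np.max(loss - lam * cost) for lam in lams)
print(primal, dual)
```

On this instance the worst case moves mass 𝑟 to the boundary of Z, so both values equal 𝑟 = 0.3, and the dual never falls below the primal, as weak duality requires.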
Proof of Theorem 7.20. By Theorem 4.18, we have


sup_{P∈P} E P [ℓ(𝑍)] ≤  inf  𝜆𝑟 + ∑_{𝑖∈[𝑁]} 𝑝ˆ𝑖 𝑠𝑖
                        s.t. 𝜆 ∈ R+ , 𝑠𝑖 ∈ R ∀𝑖 ∈ [𝑁]
                             sup_{𝑧∈Z} ℓ(𝑧) − 𝜆𝑐(𝑧, 𝑧ˆ𝑖 ) ≤ 𝑠𝑖 ∀𝑖 ∈ [𝑁],

where 𝑠𝑖 represents an auxiliary epigraphical decision variable for any 𝑖 ∈ [𝑁].


By Assumption 7.19 (i) and the definition of the functions 𝑐𝑖 , 𝑖 ∈ [𝑁], the above
minimization problem is equivalent to the following robust convex program.
inf  𝜆𝑟 + ∑_{𝑖∈[𝑁]} 𝑝ˆ𝑖 𝑠𝑖
s.t. 𝜆 ∈ R+ , 𝑠𝑖 ∈ R                                                        (7.12)
     sup_{𝑧∈Z} ℓ𝑗 (𝑧) − 𝜆𝑐𝑖 (𝑧) ≤ 𝑠𝑖 ∀𝑖 ∈ [𝑁], 𝑗 ∈ [𝐽]

For any fixed 𝑖 ∈ [𝑁] and 𝑗 ∈ [𝐽], Assumptions 7.19 (i) and 7.19 (ii) imply that the
embedded maximization problem over 𝑧 constitutes a convex program. In addition,
this problem admits a Slater point 𝑧ˆ𝑖 thanks to Assumptions 7.19 (i) and 7.19 (iv).
In order to dualize this convex program, we first recall from Lemma 7.2 that the
conjugate of 𝑓 (𝑧) = −ℓ 𝑗 (𝑧) + 𝜆𝑐𝑖 (𝑧) at 𝜁 ∈ R𝑑 can be represented as
𝑓 ∗ (𝜁) = min_{𝜁ℓ𝑖𝑗 , 𝜁𝑐𝑖𝑗 ∈ R𝑑} { (−ℓ𝑗 )∗ (𝜁ℓ𝑖𝑗 ) + (𝑐∗𝑖 )𝜋 (𝜁𝑐𝑖𝑗 , 𝜆) : 𝜁ℓ𝑖𝑗 + 𝜁𝑐𝑖𝑗 = 𝜁 }.

By Theorem 7.4, we thus obtain


sup_{𝑧∈Z} ℓ𝑗 (𝑧) − 𝜆𝑐𝑖 (𝑧) =  min  (−ℓ𝑗 )∗ (𝜁ℓ𝑖𝑗 ) + (𝑐∗𝑖 )𝜋 (𝜁𝑐𝑖𝑗 , 𝜆) + ∑_{𝑘∈[𝐾]} (𝑔∗𝑘 )𝜋 (𝜁𝑔𝑖𝑗𝑘 , 𝛼𝑖𝑗𝑘 )
                             s.t. 𝛼𝑖𝑗𝑘 ∈ R+ , 𝜁ℓ𝑖𝑗 , 𝜁𝑐𝑖𝑗 , 𝜁𝑔𝑖𝑗𝑘 ∈ R𝑑 ∀𝑘 ∈ [𝐾]
                                  𝜁ℓ𝑖𝑗 + 𝜁𝑐𝑖𝑗 + ∑_{𝑘∈[𝐾]} 𝜁𝑔𝑖𝑗𝑘 = 0.

Next, we replace each embedded maximization problem in (7.12) with its equivalent
dual minimization problem, and we eliminate the corresponding minimization
operators, which is allowed because all minima are attained. This yields the
desired finite convex reformulation of the problem dual to (4.1), and it establishes
weak duality. If 𝑟 > 0, then strong duality follows from Theorem 4.18. 
The finite convex reformulation of Theorem 7.20 was first derived under the more
restrictive assumption that 𝑐(𝑧, 𝑧ˆ) = ‖𝑧 − 𝑧ˆ‖ by Mohajerin Esfahani and Kuhn (2018,
Theorem 4.2) and later generalized to arbitrary convex transportation cost functions
by Zhen et al. (2023, § 6). We next derive a finite convex bi-dual for the worst-case
expectation problem (4.1) over the optimal transport ambiguity set (2.27), which
forms the basis for identifying extremal distributions that (asymptotically) attain
the supremum in (4.1). Our derivation will rely on the following two lemmas.
First, we derive a formula for the conjugate of a scaled perspective function.


Lemma 7.21 (Conjugates of Scaled Perspectives I). If 𝑓 : R𝑑 → R is proper,
convex and closed, and if 𝛼 ∈ R+ , then, for all 𝑦 ∈ R𝑑 and 𝑦 0 ∈ R, we have
(𝛼𝑓𝜋 )∗ (𝑦, 𝑦0 ) = 0 if (𝑓 ∗ )𝜋 (𝑦, 𝛼) ≤ −𝑦0 , and (𝛼𝑓𝜋 )∗ (𝑦, 𝑦0 ) = ∞ otherwise.
Proof. Assume first that 𝛼 > 0. If 𝜆 > 0, then we have
𝛼 𝑓 𝜋 (𝑧, 𝜆) = 𝛼𝜆 𝑓 (𝑧/𝜆) = (𝛼 𝑓 ) 𝜋 (𝑧, 𝜆) ∀𝑧 ∈ R𝑑 .
Similarly, if 𝜆 = 0, then 𝛼 𝑓 𝜋 (𝑧, 𝜆) = 𝛼 𝑓 ∞ (𝑧) = (𝛼 𝑓 )∞ (𝑧) = (𝛼 𝑓 ) 𝜋 (𝑧, 𝜆) for all
𝑧 ∈ R𝑑 . We thus have shown that 𝛼 𝑓 𝜋 = (𝛼 𝑓 ) 𝜋 . Next, define the set

C = {(𝑦, 𝑦0 ) ∈ R𝑑 × R : (𝛼𝑓 )∗ (𝑦) ≤ −𝑦0 } = {(𝑦, 𝑦0 ) ∈ R𝑑 × R : (𝑓 ∗ )𝜋 (𝑦, 𝛼) ≤ −𝑦0 },
where the second equality follows from the definition of the perspective function.
By (Rockafellar 1970, Corollary 13.5.1), we have (𝛼𝑓 )𝜋 = 𝛿∗C . As C is closed, this
implies that ((𝛼𝑓 )𝜋 )∗ = 𝛿∗∗C = 𝛿C , and thus the claim follows for 𝛼 > 0.
Assume next that 𝛼 = 0. In this case we have 𝛼 𝑓 𝜋 = 𝛿dom( 𝑓 𝜋 ) thanks to our rules
of extended arithmetic. This observation implies that
(𝛼𝑓𝜋 )∗ (𝑦, 𝑦0 ) = 𝛿∗dom(𝑓𝜋) (𝑦, 𝑦0 ) = sup_{𝜆∈R++} sup_{𝑧∈R𝑑} { 𝑦⊤𝑧 + 𝑦0 𝜆 : (𝑧, 𝜆) ∈ dom(𝑓𝜋 ) }
                = sup_{𝜆∈R++} 𝜆 sup_{𝑧∈R𝑑} { 𝑦⊤ (𝑧/𝜆) : 𝑧/𝜆 ∈ dom(𝑓 ) } + 𝑦0 𝜆
                = sup_{𝜆∈R++} 𝜆𝛿∗dom(𝑓) (𝑦) + 𝜆𝑦0 = 0 if 𝛿∗dom(𝑓) (𝑦) + 𝑦0 ≤ 0, and = ∞ otherwise.

Note that it is sufficient to optimize only over 𝜆 > 0 because dom( 𝑓 𝜋 ) ⊆ R𝑑 × R+ .


As 𝑓 is convex and closed, we have 𝑓 = 𝑓 ∗∗ thanks to Lemma 4.2, and thus we find
𝛿∗dom(𝑓) (𝑦) = 𝛿∗dom(𝑓∗∗) (𝑦) = (𝑓 ∗ )∞ (𝑦) = (𝑓 ∗ )𝜋 (𝑦, 0),

where the second and the third equalities follow from (Rockafellar 1970, The-
orem 13.3) and from the definition of the perspective, respectively. Combining the
above observations proves the claim for 𝛼 = 0. 
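Lemma 7.21 lends itself to a brute-force numerical check for a simple function. The sketch below takes 𝑓(𝑧) = 𝑧²/2, for which 𝑓∗(𝑦) = 𝑦²/2 and hence (𝑓∗)𝜋(𝑦, 𝛼) = 𝑦²/(2𝛼), and approximates the conjugate (𝛼𝑓𝜋)∗ by maximizing over a finite grid; the grid ranges and thresholds are ad hoc choices of ours:

```python
import numpy as np

alpha = 2.0

def persp(z, lam):                 # perspective of f(z) = z**2 / 2 for lam > 0
    return z**2 / (2.0 * lam)

def conj_brute(y, y0):             # grid approximation of (alpha * f^pi)*(y, y0)
    lam = np.linspace(1e-3, 50.0, 400)
    z = np.linspace(-60.0, 60.0, 1201)
    Z, L = np.meshgrid(z, lam)
    return np.max(y * Z + y0 * L - alpha * persp(Z, L))

# Lemma 7.21 predicts (alpha*f^pi)*(y, y0) = 0 iff y**2/(2*alpha) <= -y0.
print(conj_brute(1.0, -0.5))   # 0.25 <= 0.5: conjugate is 0 (grid value near 0)
print(conj_brute(1.0, +0.5))   # 0.25 > -0.5: conjugate is +inf (grid value blows up)
```

The finite grid can of course only approach the supremum from below, which is why the first value is slightly negative rather than exactly zero.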
The next lemma derives a formula for the conjugate of a sum of scaled per-
spectives. It thus generalizes Lemma 7.21, which addresses only one single scaled
perspective, and it is also related to Lemma 7.2, which characterizes the conjugate
of a sum of arbitrary convex functions—not necessarily scaled perspectives.
Lemma 7.22 (Conjugates of Perspective Functions II). Suppose that 𝑓𝑖 : R𝑑 → R,
𝑖 ∈ [𝑚], are proper, convex and closed and that there is 𝑧¯ ∈ ∩𝑖∈[𝑚] rint(dom(𝑓𝑖 )).
Let 𝑓 (𝑧1 , . . . , 𝑧𝑚 , 𝜆) = ∑_{𝑖∈[𝑚]} 𝛼𝑖 𝑓𝑖𝜋 (𝑧𝑖 , 𝜆) be a weighted sum of the corresponding
perspective functions with weight vector 𝛼 ∈ R𝑚+ . Then, the conjugate of 𝑓 satisfies

𝑓 ∗ (𝑦1 , . . . , 𝑦𝑚 , 𝑦0 ) = 0 if ∃𝛽 ∈ R𝑚 with ∑_{𝑖∈[𝑚]} 𝛽𝑖 = 𝑦0 and (𝑓𝑖∗ )𝜋 (𝑦𝑖 , 𝛼𝑖 ) ≤ −𝛽𝑖 ∀𝑖 ∈ [𝑚],
and 𝑓 ∗ (𝑦1 , . . . , 𝑦𝑚 , 𝑦0 ) = ∞ otherwise.

Proof. By using a variable splitting trick as in the proof of Lemma 7.2, we find
𝑓 ∗ (𝑦1 , . . . , 𝑦𝑚 , 𝑦0 ) = sup_{𝑧1 ,...,𝑧𝑚 ∈R𝑑} sup_{𝜆∈R+} 𝑦0 𝜆 + ∑_{𝑖∈[𝑚]} 𝑦⊤𝑖 𝑧𝑖 − ∑_{𝑖∈[𝑚]} 𝛼𝑖 𝑓𝑖𝜋 (𝑧𝑖 , 𝜆)
                      =  sup  𝑦0 𝜆 + ∑_{𝑖∈[𝑚]} 𝑦⊤𝑖 𝑧𝑖 − 𝛼𝑖 𝑓𝑖𝜋 (𝑧𝑖 , 𝜆𝑖 )
                         s.t. 𝑧1 , . . . , 𝑧𝑚 ∈ R𝑑 , 𝜆 ∈ R, 𝜆1 , . . . , 𝜆𝑚 ∈ R+
                              𝜆𝑖 = 𝜆 ∀𝑖 ∈ [𝑚].
The resulting convex maximization problem admits a Slater point. To see this,
recall that there exists 𝑧¯ ∈ ∩𝑖∈ [𝑚] rint(dom( 𝑓𝑖 )). As dom( 𝑓 𝜋 ) is contained in the
cone generated by dom( 𝑓 )× {1}, we may thus conclude that the solution with 𝜆 = 1,
𝜆 𝑖 = 1 and 𝑧𝑖 = 𝑧¯ for all 𝑖 ∈ [𝑚] constitutes a Slater point. Therefore, the above
maximization problem admits a strong Lagrangian dual, that is, we have
𝑓 ∗ (𝑦1 , . . . , 𝑦𝑚 , 𝑦0 )
  = min_{𝛽1 ,...,𝛽𝑚 ∈R} sup_{𝑧1 ,...,𝑧𝑚 ∈R𝑑 , 𝜆∈R, 𝜆1 ,...,𝜆𝑚 ∈R+} 𝑦0 𝜆 + ∑_{𝑖∈[𝑚]} 𝑦⊤𝑖 𝑧𝑖 − 𝛼𝑖 𝑓𝑖𝜋 (𝑧𝑖 , 𝜆𝑖 ) + 𝛽𝑖 (𝜆𝑖 − 𝜆)
  = min_{𝛽1 ,...,𝛽𝑚 ∈R} { ∑_{𝑖∈[𝑚]} (𝛼𝑖 𝑓𝑖𝜋 )∗ (𝑦𝑖 , 𝛽𝑖 ) : ∑_{𝑖∈[𝑚]} 𝛽𝑖 = 𝑦0 },
see also Theorem 7.4. By Lemma 7.21, we further have (𝛼𝑖 𝑓𝑖𝜋 )∗ = 𝛿C𝑖 , where

C𝑖 = {(𝑦, 𝑦0 ) ∈ R𝑑 × R : (𝑓𝑖∗ )𝜋 (𝑦, 𝛼𝑖 ) ≤ −𝑦0 }
for all 𝑖 ∈ [𝑚]. Substituting this alternative expression for (𝛼𝑖 𝑓𝑖𝜋 )∗ into the above
dual problem yields the desired formula. Thus, the claim follows. 

We emphasize that Lemmas 7.21 and 7.22 are complementary to Lemma 4.11.
Indeed, while Lemma 4.11 evaluates the conjugate only with respect to the first
argument of a perspective function, Lemmas 7.21 and 7.22 do so with respect to
both arguments. We are now ready to derive a finite bi-dual reformulation of the
worst-case expectation problem over an optimal transport ambiguity set.
Theorem 7.23 (Finite Bi-Dual Reformulation for Optimal Transport Ambiguity
Sets). If P is the optimal transport ambiguity set (2.27) and Assumptions 7.16
and 7.19 hold, then the worst-case expectation problem (4.1) satisfies the weak
duality relation
sup_{P∈P} E P [ℓ(𝑍)]

 ≤  sup  ∑_{𝑖∈[𝑁]} ∑_{𝑗∈[𝐽]} −(−ℓ𝑗 )𝜋 (𝑝𝑖𝑗 𝑧ˆ𝑖 + 𝑧𝑖𝑗 , 𝑝𝑖𝑗 )
    s.t. 𝑝𝑖𝑗 ∈ R+ , 𝑧𝑖𝑗 ∈ R𝑑 ∀𝑖 ∈ [𝑁], 𝑗 ∈ [𝐽]
         𝑔𝑘𝜋 (𝑝𝑖𝑗 𝑧ˆ𝑖 + 𝑧𝑖𝑗 , 𝑝𝑖𝑗 ) ≤ 0 ∀𝑖 ∈ [𝑁], 𝑗 ∈ [𝐽], 𝑘 ∈ [𝐾]                  (7.13)
         ∑_{𝑗∈[𝐽]} 𝑝𝑖𝑗 = 𝑝ˆ𝑖 ∀𝑖 ∈ [𝑁]
         ∑_{𝑖∈[𝑁]} ∑_{𝑗∈[𝐽]} 𝑐𝑖𝜋 (𝑝𝑖𝑗 𝑧ˆ𝑖 + 𝑧𝑖𝑗 , 𝑝𝑖𝑗 ) ≤ 𝑟,

where 𝑐𝑖 : Z → R is defined through 𝑐𝑖 (𝑧) = 𝑐(𝑧, 𝑧ˆ𝑖 ) for every 𝑖 ∈ [𝑁]. If 𝑟 > 0,
then strong duality holds, that is, the above inequality becomes an equality.
Proof. We will show that (7.13) is obtained by dualizing the finite dual reformula-
tion (7.11) of problem (4.1). To see this, we assign Lagrange multipliers 𝑝 𝑖 𝑗 ∈ R+
and 𝑧𝑖 𝑗 ∈ R𝑑 , 𝑖 ∈ [𝑁], 𝑗 ∈ [𝐽], to the first and second constraint groups in (7.11),
respectively. The Lagrangian dual of (7.11) can then be represented compactly as
sup_{𝑝≥0, 𝑧} inf_{𝜆≥0, 𝛼≥0, 𝑠, 𝜁ℓ , 𝜁𝑐 , 𝜁𝑔} 𝐿1 (𝑠; 𝑝, 𝑧) + 𝐿2 (𝜁ℓ ; 𝑝, 𝑧) + 𝐿3 (𝜆, 𝜁𝑐 ; 𝑝, 𝑧) + 𝐿4 (𝛼, 𝜁𝑔 ; 𝑝, 𝑧),

where the Lagrangian is additively separable with respect to four disjoint groups
of primal decision variables, namely, 𝑠, 𝜁 ℓ , (𝜆, 𝜁 𝑐 ) and (𝛼, 𝜁 𝑔 ). The corresponding
partial Lagrangians are defined as follows.
𝐿1 (𝑠; 𝑝, 𝑧) = ∑_{𝑖∈[𝑁]} 𝑝ˆ𝑖 𝑠𝑖 − ∑_{𝑖∈[𝑁]} ∑_{𝑗∈[𝐽]} 𝑝𝑖𝑗 𝑠𝑖
𝐿2 (𝜁ℓ ; 𝑝, 𝑧) = ∑_{𝑖∈[𝑁]} ∑_{𝑗∈[𝐽]} 𝑝𝑖𝑗 · (−ℓ𝑗 )∗ (𝜁ℓ𝑖𝑗 ) − 𝑧⊤𝑖𝑗 𝜁ℓ𝑖𝑗
𝐿3 (𝜆, 𝜁𝑐 ; 𝑝, 𝑧) = 𝜆𝑟 + ∑_{𝑖∈[𝑁]} ∑_{𝑗∈[𝐽]} 𝑝𝑖𝑗 · (𝑐∗𝑖 )𝜋 (𝜁𝑐𝑖𝑗 , 𝜆) − 𝑧⊤𝑖𝑗 𝜁𝑐𝑖𝑗
𝐿4 (𝛼, 𝜁𝑔 ; 𝑝, 𝑧) = ∑_{𝑖∈[𝑁]} ∑_{𝑗∈[𝐽]} ∑_{𝑘∈[𝐾]} 𝑝𝑖𝑗 · (𝑔∗𝑘 )𝜋 (𝜁𝑔𝑖𝑗𝑘 , 𝛼𝑖𝑗𝑘 ) − 𝑧⊤𝑖𝑗 𝜁𝑔𝑖𝑗𝑘

These partial Lagrangians can be minimized separately with respect to the primal
decision variables. For example, an elementary calculation shows that

inf_𝑠 𝐿1 (𝑠; 𝑝, 𝑧) = 0 if ∑_{𝑗∈[𝐽]} 𝑝𝑖𝑗 = 𝑝ˆ𝑖 ∀𝑖 ∈ [𝑁], and inf_𝑠 𝐿1 (𝑠; 𝑝, 𝑧) = −∞ otherwise.

Recall now that −ℓ 𝑗 is proper, convex and closed, which implies via Lemma 4.2 that
(−ℓ 𝑗 )∗∗ = −ℓ 𝑗 . Note also that minimizing 𝐿 2 (𝜁 ℓ ; 𝑝, 𝑧) with respect to 𝜁 ℓ amounts to


evaluating the conjugate of a sum of conjugates with mutually different arguments.
By using Lemma 7.1 (i) and applying a few elementary manipulations we thus find
inf_{𝜁ℓ} 𝐿2 (𝜁ℓ ; 𝑝, 𝑧) = ∑_{𝑖∈[𝑁]} ∑_{𝑗∈[𝐽]} −(−ℓ𝑗 )𝜋 (𝑧𝑖𝑗 , 𝑝𝑖𝑗 ).

Similarly, recall that 𝑐𝑖 is proper, convex and closed such that 𝑐∗∗𝑖 = 𝑐𝑖 . Note also
that minimizing 𝐿3 (𝜆, 𝜁𝑐 ; 𝑝, 𝑧) with respect to 𝜆 and 𝜁𝑐 amounts to evaluating the
conjugate of a sum of perspective functions with one common argument. By using
Lemma 7.22 and applying a few elementary manipulations we thus find
inf_{𝜆≥0, 𝜁𝑐} 𝐿3 (𝜆, 𝜁𝑐 ; 𝑝, 𝑧) = 0 if ∃𝛽𝑖𝑗 ∈ R with ∑_{𝑖∈[𝑁]} ∑_{𝑗∈[𝐽]} 𝛽𝑖𝑗 = 𝑟 and
𝑐𝑖𝜋 (𝑧𝑖𝑗 , 𝑝𝑖𝑗 ) ≤ 𝛽𝑖𝑗 ∀𝑖 ∈ [𝑁], 𝑗 ∈ [𝐽], and inf_{𝜆≥0, 𝜁𝑐} 𝐿3 (𝜆, 𝜁𝑐 ; 𝑝, 𝑧) = −∞ otherwise.

Finally, recall that 𝑔 𝑘 is proper, convex and closed such that 𝑔∗∗ 𝑘 = 𝑔 𝑘 . Note also
that minimizing 𝐿 4 (𝛼, 𝜁 𝑔 ; 𝑝, 𝑧) with respect to 𝛼 and 𝜁 𝑔 amounts to evaluating the
conjugate of a sum of perspective functions with mutually different arguments. By
using Lemma 7.21 and applying a few elementary manipulations we thus find
inf_{𝛼≥0, 𝜁𝑔} 𝐿4 (𝛼, 𝜁𝑔 ; 𝑝, 𝑧) = 0 if 𝑔𝑘𝜋 (𝑧𝑖𝑗 , 𝑝𝑖𝑗 ) ≤ 0 ∀𝑖 ∈ [𝑁], 𝑗 ∈ [𝐽], 𝑘 ∈ [𝐾],
and inf_{𝛼≥0, 𝜁𝑔} 𝐿4 (𝛼, 𝜁𝑔 ; 𝑝, 𝑧) = −∞ otherwise.
Substituting the infima of the partial Lagrangians into the dual objective yields the
following equivalent reformulation for the problem dual to (7.11).
sup  ∑_{𝑖∈[𝑁]} ∑_{𝑗∈[𝐽]} −(−ℓ𝑗 )𝜋 (𝑧𝑖𝑗 , 𝑝𝑖𝑗 )
s.t. 𝑝𝑖𝑗 ∈ R+ , 𝛽𝑖𝑗 ∈ R, 𝑧𝑖𝑗 ∈ R𝑑 ∀𝑖 ∈ [𝑁], 𝑗 ∈ [𝐽]
     𝑔𝑘𝜋 (𝑧𝑖𝑗 , 𝑝𝑖𝑗 ) ≤ 0 ∀𝑖 ∈ [𝑁], 𝑗 ∈ [𝐽], 𝑘 ∈ [𝐾]
     ∑_{𝑗∈[𝐽]} 𝑝𝑖𝑗 = 𝑝ˆ𝑖 ∀𝑖 ∈ [𝑁]                                              (7.14)
     𝑐𝑖𝜋 (𝑧𝑖𝑗 , 𝑝𝑖𝑗 ) ≤ 𝛽𝑖𝑗 ∀𝑖 ∈ [𝑁], 𝑗 ∈ [𝐽]
     ∑_{𝑖∈[𝑁]} ∑_{𝑗∈[𝐽]} 𝛽𝑖𝑗 = 𝑟

Note that if the finite dual reformulation (7.11) of the worst-case expectation prob-
lem is viewed as an instance of the primal convex program (P), then problem (7.14)
represents the corresponding instance of the dual convex program (D). By As-
sumptions 7.16 and 7.19, problem (7.14) admits a Slater point with 𝑝 𝑖 𝑗 = 𝑝ˆ𝑖 /𝐽 and
𝑧𝑖 𝑗 = 𝑧ˆ𝑖 for all 𝑖 ∈ [𝑁] and 𝑗 ∈ [𝐽]. Thus, strong duality holds thanks to The-
orem 7.4 (i). It remains to be shown that (7.14) is equivalent to (7.13). To this end,
note first that the last constraint in (7.14) can be relaxed to a less-than-or-equal-to
inequality without increasing the problem’s supremum such that 𝛽𝑖 𝑗 = 𝑐𝑖𝜋 (𝑧𝑖 𝑗 , 𝑝 𝑖 𝑗 )
at optimality. This allows us to eliminate the 𝛽𝑖 𝑗 variables from (7.14). Prob-


lem (7.13) is then obtained by applying the substitution 𝑧𝑖 𝑗 ← 𝑧𝑖 𝑗 − 𝑝 𝑖 𝑗 𝑧ˆ𝑖 . 
The finite bi-dual reformulation (7.13) is guaranteed to be solvable provided that
the transportation cost function satisfies the following additional assumption.
Assumption 7.24 (Identity of Indiscernibles). The transportation cost function is
real-valued and satisfies 𝑐(𝑧, 𝑧ˆ) = 0 if and only if 𝑧 = 𝑧ˆ.
Lemma 7.25 (Solvability of the Finite Bi-Dual Reformulation). Suppose that As-
sumptions 7.16, 7.19 and 7.24 hold. Then, problem (7.13) is solvable.
Proof. Under the stated assumptions, problem (7.13) maximizes an upper semi-
continuous function over a compact feasible region, and thus the claim follows
from Weierstrass’ maximum theorem. To see that the objective function of (7.13)
is upper semicontinuous, note that the functions −ℓ 𝑗 are proper, convex and closed
for all 𝑗 ∈ [𝐽] thanks to Assumption 7.19 (i). By (Rockafellar 1970, pages 35
and 67), their perspectives are proper, convex and closed, too; see also (Zhen et al.
2023, Proposition C.2). Thus, the negative perspective functions appearing in the
objective function of problem (7.13) are indeed upper semicontinuous. Similarly,
one can show that the feasible region of problem (7.13) is closed. Indeed, 𝑔 𝑘 and 𝑐𝑖
are proper, convex and closed for all 𝑘 ∈ [𝐾] and 𝑖 ∈ [𝑁] thanks to Assump-
tion 7.19 and Definition 2.14. This readily implies that their perspectives are lower
semicontinuous, and thus the feasible region of (7.13) is indeed closed. To see
that the feasible region is also bounded, note first that 𝑝 𝑖 𝑗 ∈ [0, 1] for all 𝑖 ∈ [𝑁]
and 𝑗 ∈ [𝐽]. Indeed, these variables must be non-negative and compatible with
the probabilities 𝑝ˆ𝑖 , 𝑖 ∈ [𝑁], of the discrete reference distribution. Next, we show
that the variables 𝑧𝑖 𝑗 for 𝑖 ∈ [𝑁] and 𝑗 ∈ [𝐽] are restricted to a bounded set, as
well. Indeed, by (Zhen et al. 2023, Lemma C.10), which applies thanks to Assump-
tion 7.24 and Definition 2.14, there exists 𝛿 > 0 such that 𝑐𝑖 (𝑧ˆ𝑖 + 𝑧) ≥ 𝛿‖𝑧‖2 − 1 for
all 𝑧 ∈ R𝑑 and 𝑖 ∈ [𝑁]. The last constraint of problem (7.13) therefore implies that
∑_{𝑖∈[𝑁]} ∑_{𝑗∈[𝐽]} 𝑐𝑖𝜋 (𝑝𝑖𝑗 𝑧ˆ𝑖 + 𝑧𝑖𝑗 , 𝑝𝑖𝑗 ) ≤ 𝑟 =⇒ ∑_{𝑖∈[𝑁]} ∑_{𝑗∈[𝐽]} ‖𝑧𝑖𝑗 ‖2 ≤ (1 + 𝑟)/𝛿,
where we used the identity ∑_{𝑖∈[𝑁]} ∑_{𝑗∈[𝐽]} 𝑝𝑖𝑗 = ∑_{𝑖∈[𝑁]} 𝑝ˆ𝑖 = 1 and the definition of
the perspective function. Thus, the feasible region of (7.13) is indeed bounded. 
We are now ready to construct extremal distributions P★ ∈ P(Z) that attain the
supremum of the worst-case expectation problem (4.1) over the optimal transport
ambiguity set (2.27). To this end, fix any maximizer (𝑝★, 𝑧★) of the bi-dual
problem (7.13), which exists thanks to Lemma 7.25. Next, define the index sets
J𝑖∞ = {𝑗 ∈ [𝐽] : 𝑝★𝑖𝑗 = 0, 𝑧★𝑖𝑗 ≠ 0} and J𝑖+ = {𝑗 ∈ [𝐽] : 𝑝★𝑖𝑗 > 0},
and define J𝑖 = J𝑖+ ∪ J𝑖∞ for any 𝑖 ∈ [𝑁]. The following theorem uses the
maximizer (𝑝★, 𝑧★) and the corresponding index sets to construct P★.
Theorem 7.26 (Extremal Distributions of Optimal Transport Ambiguity Sets).


Suppose that all conditions of Theorem 7.23 for weak and strong duality are satis-
fied, Assumption 7.24 holds, and (𝑝★, 𝑧★) solves (7.13). Then, the following hold.

(i) If J𝑖∞ = ∅ for all 𝑖 ∈ [𝑁], then problem (4.1) is solved by

    P★ = ∑_{𝑖∈[𝑁]} ∑_{𝑗∈J𝑖+} 𝑝★𝑖𝑗 𝛿𝑧ˆ𝑖+𝑧★𝑖𝑗/𝑝★𝑖𝑗 .

(ii) If J𝑖∞ ≠ ∅ for some 𝑖 ∈ [𝑁], then problem (4.1) is asymptotically solved by
    P𝑚 = ∑_{𝑖∈[𝑁]} ∑_{𝑗∈J𝑖} 𝑝𝑚𝑖𝑗 𝛿𝑧𝑚𝑖𝑗 as 𝑚 ∈ N, 𝑚 ≥ max𝑖∈[𝑁] |J𝑖∞ |, grows, where

    𝑝𝑚𝑖𝑗 = (1 − |J𝑖∞ |/𝑚) 𝑝★𝑖𝑗 if 𝑗 ∈ J𝑖+ ,   𝑝𝑚𝑖𝑗 = 𝑝ˆ𝑖 /𝑚 if 𝑗 ∈ J𝑖∞ ,
    𝑧𝑚𝑖𝑗 = 𝑧ˆ𝑖 + 𝑧★𝑖𝑗 /𝑝★𝑖𝑗 if 𝑗 ∈ J𝑖+ ,   and   𝑧𝑚𝑖𝑗 = 𝑧ˆ𝑖 + 𝑧★𝑖𝑗 /𝑝𝑚𝑖𝑗 if 𝑗 ∈ J𝑖∞ .

Proof. In view of assertion (i), we first show that P★ defined in the statement of
the theorem is feasible in the worst-case expectation problem (4.1). To this end,
observe first that feasibility of (𝑝★, 𝑧★) in (7.13) implies that 𝑝★𝑖𝑗 ≥ 0 for all 𝑖 ∈ [𝑁]
and 𝑗 ∈ [𝐽], and that ∑_{𝑖∈[𝑁]} ∑_{𝑗∈J𝑖+} 𝑝★𝑖𝑗 = 1. Note also that 𝑧ˆ𝑖 + 𝑧★𝑖𝑗 /𝑝★𝑖𝑗 ∈ Z for
all 𝑖 ∈ [𝑁] and 𝑗 ∈ J𝑖+ due to the second constraint in (7.13). This confirms that
P★ ∈ P(Z). The penultimate constraint group of problem (7.13) also implies that
∑_{𝑖∈[𝑁]} ∑_{𝑗∈J𝑖+} 𝑝★𝑖𝑗 𝛿(𝑧ˆ𝑖+𝑧★𝑖𝑗/𝑝★𝑖𝑗 , 𝑧ˆ𝑖) ∈ Γ(P★, P̂)

constitutes a valid transportation plan for morphing P̂ into P★. Thus, we find
OT𝑐 (P★, P̂) ≤ ∑_{𝑖∈[𝑁]} ∑_{𝑗∈J𝑖+} 𝑝★𝑖𝑗 · 𝑐(𝑧ˆ𝑖 + 𝑧★𝑖𝑗 /𝑝★𝑖𝑗 , 𝑧ˆ𝑖 )
            = ∑_{𝑖∈[𝑁]} ∑_{𝑗∈[𝐽]} 𝑐𝑖𝜋 (𝑝★𝑖𝑗 𝑧ˆ𝑖 + 𝑧★𝑖𝑗 , 𝑝★𝑖𝑗 ) ≤ 𝑟.

Here, the equality holds because all terms corresponding to 𝑖 ∈ [𝑁] and 𝑗 ∉ J𝑖+
vanish. Indeed, if 𝑗 ∉ J𝑖+ , then 𝑝★𝑖𝑗 = 0. As J𝑖∞ = ∅, this implies that 𝑧★𝑖𝑗 = 0.
Thus, we have 𝑐𝑖𝜋 (𝑝★𝑖𝑗 𝑧ˆ𝑖 + 𝑧★𝑖𝑗 , 𝑝★𝑖𝑗 ) = 𝑐𝑖𝜋 (0, 0) = 𝑐∞𝑖 (0) = 0 by the definitions of
the perspective and the recession function. The second inequality in the above
expression follows from the last constraint in (7.13). In summary, we have shown
that P★ is feasible in (4.1). As for the objective function value of P★, note that
E P★ [ℓ(𝑍)] ≤ sup_{P∈P} E P [ℓ(𝑍)] ≤ ∑_{𝑖∈[𝑁]} ∑_{𝑗∈[𝐽]} −(−ℓ𝑗 )𝜋 (𝑝★𝑖𝑗 𝑧ˆ𝑖 + 𝑧★𝑖𝑗 , 𝑝★𝑖𝑗 ),

where the second inequality follows from the weak duality relation established in
Theorem 7.23. At the same time, however, the expected loss under P★ satisfies
E P★ [ℓ(𝑍)] = ∑_{𝑖∈[𝑁]} ∑_{𝑗∈J𝑖+} 𝑝★𝑖𝑗 max_{𝑗′∈[𝐽]} ℓ𝑗′ (𝑧ˆ𝑖 + 𝑧★𝑖𝑗 /𝑝★𝑖𝑗 )
           ≥ ∑_{𝑖∈[𝑁]} ∑_{𝑗∈J𝑖+} −(−ℓ𝑗 )𝜋 (𝑝★𝑖𝑗 𝑧ˆ𝑖 + 𝑧★𝑖𝑗 , 𝑝★𝑖𝑗 )
           = ∑_{𝑖∈[𝑁]} ∑_{𝑗∈[𝐽]} −(−ℓ𝑗 )𝜋 (𝑝★𝑖𝑗 𝑧ˆ𝑖 + 𝑧★𝑖𝑗 , 𝑝★𝑖𝑗 ),

where the inequality uses the definition of the perspective function and the trivial
observation that 𝑗 ∈ J𝑖+ is a feasible choice for 𝑗′ ∈ [𝐽]. The last equality holds
once more because 𝑝★𝑖𝑗 = 0 implies 𝑧★𝑖𝑗 = 0 and (−ℓ 𝑗 ) 𝜋 (0, 0) = (−ℓ 𝑗 )∞ (0) = 0 by
the definition of the perspective and the recession function. In summary, the above
inequalities imply that P★ is optimal in (4.1). Hence, assertion (i) follows.
As for assertion (ii), we first show that P𝑚 ∈ P for any fixed 𝑚 ≥ max𝑖∈ [ 𝑁 ] |J𝑖∞ |.
The constraints of problem (7.13) imply that 𝑝𝑚𝑖𝑗 ≥ 0 for all 𝑗 ∈ J𝑖 and 𝑖 ∈ [𝑁] and
that ∑_{𝑖∈[𝑁]} ∑_{𝑗∈J𝑖} 𝑝𝑚𝑖𝑗 = 1. They also imply that 𝑧𝑚𝑖𝑗 ∈ Z for every 𝑗 ∈ J𝑖 and 𝑖 ∈
[𝑁]. This is easy to see if 𝑗 ∈ J𝑖+ . If 𝑗 ∈ J𝑖∞ , on the other hand, then 𝑝★𝑖𝑗 = 0,
𝑧★𝑖𝑗 ≠ 0 and 𝑔 𝑘𝜋 (𝑧★𝑖𝑗 , 0) ≤ 0 for all 𝑘 ∈ [𝐾], which implies via (Rockafellar 1970,
Theorem 8.6) that 𝑧★𝑖𝑗 is a recession direction of Z. Geometrically, this means that
the ray emanating from any point in Z along the direction 𝑧★𝑖𝑗 never leaves Z. Thus,
𝑧𝑖𝑚𝑗 = 𝑧ˆ𝑖 + 𝑚 𝑧★𝑖𝑗 / 𝑝ˆ𝑖 ∈ Z for all 𝑖 ∈ [𝑁] and 𝑗 ∈ J𝑖∞ . In addition, one verifies that
∑_{𝑖∈[𝑁]} ∑_{𝑗∈J𝑖} 𝑝𝑚𝑖𝑗 𝛿(𝑧𝑚𝑖𝑗 , 𝑧ˆ𝑖) ∈ Γ(P𝑚 , P̂)

constitutes a valid transportation plan for morphing P̂ into P𝑚 . Thus, we find


OT𝑐 (P𝑚 , P̂) ≤ ∑_{𝑖∈[𝑁]} ∑_{𝑗∈J𝑖} 𝑝𝑚𝑖𝑗 𝑐(𝑧𝑚𝑖𝑗 , 𝑧ˆ𝑖 )
  = ∑_{𝑖∈[𝑁]} ∑_{𝑗∈J𝑖+} 𝑝★𝑖𝑗 (1 − |J𝑖∞ |/𝑚) 𝑐(𝑧ˆ𝑖 + 𝑧★𝑖𝑗 /𝑝★𝑖𝑗 , 𝑧ˆ𝑖 ) + ∑_{𝑖∈[𝑁]} ∑_{𝑗∈J𝑖∞} (𝑝ˆ𝑖 /𝑚) 𝑐(𝑧ˆ𝑖 + 𝑚𝑧★𝑖𝑗 /𝑝ˆ𝑖 , 𝑧ˆ𝑖 )
  ≤ ∑_{𝑖∈[𝑁]} ∑_{𝑗∈J𝑖+} 𝑝★𝑖𝑗 𝑐(𝑧ˆ𝑖 + 𝑧★𝑖𝑗 /𝑝★𝑖𝑗 , 𝑧ˆ𝑖 ) + ∑_{𝑖∈[𝑁]} ∑_{𝑗∈J𝑖∞} lim_{𝑚→∞} (𝑝ˆ𝑖 /𝑚) 𝑐(𝑧ˆ𝑖 + 𝑚𝑧★𝑖𝑗 /𝑝ˆ𝑖 , 𝑧ˆ𝑖 )
  = ∑_{𝑖∈[𝑁]} ∑_{𝑗∈J𝑖+} 𝑝★𝑖𝑗 𝑐(𝑧ˆ𝑖 + 𝑧★𝑖𝑗 /𝑝★𝑖𝑗 , 𝑧ˆ𝑖 ) + ∑_{𝑖∈[𝑁]} ∑_{𝑗∈J𝑖∞} lim_{𝑚→∞} (𝑝ˆ𝑖 /𝑚) 𝑐(𝑚𝑧★𝑖𝑗 /𝑝ˆ𝑖 , 𝑧ˆ𝑖 )
  = ∑_{𝑖∈[𝑁]} ∑_{𝑗∈[𝐽]} 𝑐𝑖𝜋 (𝑝★𝑖𝑗 𝑧ˆ𝑖 + 𝑧★𝑖𝑗 , 𝑝★𝑖𝑗 ) ≤ 𝑟,

where the first equality follows from the definitions of 𝑝 𝑖𝑚𝑗 and 𝑧𝑖𝑚𝑗 . The second
inequality holds because the transportation cost function 𝑐(𝑧, 𝑧ˆ) is non-negative and
convex in 𝑧, which implies that both terms in the third line are non-decreasing in 𝑚.
The second equality follows from Assumption 7.24, which ensures that 𝑐(𝑧, 𝑧ˆ) is
real-valued such that the reference point in the definition of the recession function
of 𝑐(·, 𝑧ˆ𝑖 ) can be chosen freely. The third equality exploits the definition of the
perspective function 𝑐𝑖𝜋 and the observation that 𝑐𝑖𝜋 (0, 0) = 𝑐∞ 𝑖 (0) = 0. Finally,
the last inequality follows from the last constraint of problem (7.13). We have
thus shown that P𝑚 is feasible in (4.1). In analogy to the analysis for P★, one can
show that the asymptotic expected loss lim_{𝑚→∞} E P𝑚 [ℓ(𝑍)] is at least as large
as the optimal value ∑_{𝑖∈[𝑁]} ∑_{𝑗∈[𝐽]} −(−ℓ𝑗 )𝜋 (𝑝★𝑖𝑗 𝑧ˆ𝑖 + 𝑧★𝑖𝑗 , 𝑝★𝑖𝑗 ) of the finite bi-dual
reformulation (7.13). However, as the suprema of (4.1) and (7.13) match, it is clear
that the distributions P𝑚 , 𝑚 ∈ N, must be asymptotically optimal in (4.1). 

If J𝑖∞ ≠ ∅ for some 𝑖 ∈ [𝑁], then the extremal distributions constructed in


Theorem 7.26 send atoms with decaying probabilities to infinity along specific
recession directions 𝑧★𝑖𝑗 , 𝑗 ∈ J𝑖∞ , of the support set Z. Moving atoms to infinity
is possible even when only a finite transportation budget 𝑟 is available provided
that the probability mass transported scales inversely with the transportation cost.
The following lemma establishes sufficient conditions for J𝑖∞ to be empty for
every 𝑖 ∈ [𝑁], which ensures via Theorem 7.26 (i) that problem (4.1) is solvable.
Lemma 7.27. If all assumptions of Theorem 7.26 are satisfied and either of the
following conditions holds, then J𝑖∞ = ∅ for every 𝑖 ∈ [𝑁], and (4.1) is solvable.
(i) The transportation cost function grows superlinearly in its first argument. By
this we mean that 𝑐∞𝑖 (𝑧) = ∞ for any 𝑧 ≠ 0 and for any 𝑖 ∈ [𝑁].
(ii) The support set Z is bounded.
Proof. As usual, let (𝑝★, 𝑧★) be a maximizer of problem (7.13), which exists thanks
to Lemma 7.25. As for assertion (i), assume that the transportation cost function
grows superlinearly. For the sake of argument, assume also that there exists 𝑖 ∈ [𝑁]
with J𝑖∞ ≠ ∅. For every 𝑗 ∈ J𝑖∞ we thus have 𝑝★𝑖𝑗 = 0 and 𝑧★𝑖𝑗 ≠ 0. Hence, we find

𝑐𝑖𝜋 (𝑝★𝑖𝑗 𝑧ˆ𝑖 + 𝑧★𝑖𝑗 , 𝑝★𝑖𝑗 ) = 𝑐∞𝑖 (𝑧★𝑖𝑗 ) = ∞,

where the first equality uses the definition of the perspective function, and the
second equality holds because the transportation cost function grows superlinearly.
Thus, (𝑝★, 𝑧★) violates the last constraint of problem (7.13), which contradicts its
assumed feasibility. We may thus conclude that J𝑖∞ = ∅ and that (4.1) is solvable.
As for assertion (ii), assume now that Z is bounded. Without loss of generality,
we may also assume that 𝑝★𝑖𝑗 = 0 for some 𝑖 ∈ [𝑁] and 𝑗 ∈ [𝐽] for otherwise J𝑖∞ is
trivially empty. The constraints of problem (7.13) then ensure that 𝑔 𝑘𝜋 (𝑧★𝑖𝑗 , 0) ≤ 0
for all 𝑘 ∈ [𝐾], which implies via (Rockafellar 1970, Theorem 8.6) that 𝑧★𝑖𝑗 is a
recession direction of Z. As Z is compact, however, this implies that 𝑧★𝑖𝑗 = 0. We
may thus again conclude that J𝑖∞ = ∅ and that (4.1) is solvable. 
Condition (i) of Lemma 7.27 is satisfied whenever P is a 𝑝-Wasserstein ball and


the transportation cost function is of the form 𝑐(𝑧, 𝑧ˆ) = ‖𝑧 − 𝑧ˆ‖𝑝 for some 𝑝 > 1.
The structural properties of the distributions that solve the worst-case expectation
problem (4.1) over an optimal transport ambiguity set, as well as necessary and
sufficient conditions for their existence, were studied by Wozabal (2012), Owhadi
and Scovel (2017), Yue et al. (2022) and Gao and Kleywegt (2023). In particular,
significant efforts were spent on characterizing the extremal distributions of a
Wasserstein ball centered at a discrete reference distribution with 𝑁 atoms. The
earliest result in this domain is due to Wozabal (2012, Theorem 3.3) who showed
that the worst-case expectation of a continuous bounded loss function is attained by
a discrete distribution with at most 𝑁 + 3 atoms. Later, Owhadi and Scovel (2017,
Theorem 2.3) and Gao and Kleywegt (2023, Corollary 1) managed to sharpen this
result by showing that the worst-case expectation is in fact attained by a discrete
distribution with at most 𝑁 + 2 or even only 𝑁 + 1 atoms, respectively; see also
(Yue et al. 2022, Theorem 4). Theorem 7.26 (i) and Lemma 7.27 reveal that if Z
is bounded and the loss function ℓ is concave, thus satisfying Assumption 7.19 (i)
with 𝐽 = 1, then the worst-case expected loss is attained by an 𝑁-point distribution.
For more general loss functions, however, every 𝑁-point distribution can be strictly
suboptimal even if problem (4.1) is solvable; see (Kuhn et al. 2019, Example 5).
The results in this section are based on (Zhen et al. 2023, § 6).
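For a concave loss (𝐽 = 1 in Assumption 7.19 (i)), the bi-dual (7.13) reduces to choosing a new location 𝑦𝑖 = 𝑧ˆ𝑖 + 𝑧𝑖/𝑝ˆ𝑖 for each reference atom subject to the transport budget, and Theorem 7.26 (i) then yields an 𝑁-point extremal distribution, consistent with the discussion above. The sketch below solves such an instance with ℓ(𝑦) = −(𝑦 − 1)² and quadratic transportation costs; the data and the use of a generic NLP solver are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

z_hat = np.array([0.0, 2.0])       # reference atoms (hypothetical)
p_hat = np.array([0.5, 0.5])       # reference probabilities
r = 0.25                           # budget for c(z, zhat) = (z - zhat)**2

def loss(y):                       # concave loss, J = 1 in Assumption 7.19 (i)
    return -(y - 1.0) ** 2

# Move each atom zhat_i to a location y_i, maximizing the expected loss
# subject to the optimal transport budget (J = 1 specialization of (7.13)).
obj = lambda y: -np.dot(p_hat, loss(y))
budget = {"type": "ineq", "fun": lambda y: r - np.dot(p_hat, (y - z_hat) ** 2)}
res = minimize(obj, x0=z_hat, constraints=[budget], method="SLSQP")
y_star, val = res.x, -res.fun
print(y_star, val)   # both atoms move toward the peak of the loss at z = 1
```

Since the budget does not suffice to move both atoms all the way to the unconstrained maximizer 𝑧 = 1, each atom travels part of the distance, and the extremal distribution places the reference probabilities 𝑝ˆ𝑖 on the displaced points 𝑦𝑖.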

7.5. Nash Equilibria and Adversarial Attacks


The DRO problem (1.2) can be viewed as a zero-sum game in which the decision-
maker first chooses a decision 𝑥 ∈ X , and nature subsequently responds with a
distribution P ∈ P that adapts to 𝑥. Throughout this section we will refer to (1.2)
as the primal DRO problem. In addition, one can study the dual DRO problem
sup_{P∈P} inf_{𝑥∈X} E P [ℓ(𝑥, 𝑍)] ,                                      (7.15)

where nature first selects a distribution P ∈ P, and the decision-maker subsequently


responds with a decision 𝑥 ∈ X that adapts to P. In contrast to the primal DRO
problem (1.2), whose objective function is linear in P, the objective function of the
dual DRO problem (7.15) is concave in P. This difference makes the dual DRO
problem more challenging to solve. It is now natural to seek conditions that imply
strong duality and thus ensure that the infimum of the primal DRO problem (1.2)
coincides with the supremum of the dual DRO problem (7.15). One readily verifies
that strong duality is implied, for example, by the existence of a Nash equilibrium
(𝑥★, P★) ∈ X × P satisfying the saddle point condition
E P [ℓ(𝑥★, 𝑍)] ≤ E P★ [ℓ(𝑥★, 𝑍)] ≤ E P★ [ℓ(𝑥, 𝑍)] ∀𝑥 ∈ X , P ∈ P.        (7.16)
We emphasize that the reverse implication is false, that is, strong duality does not
necessarily imply the existence of a Nash equilibrium. The primal DRO problem
naturally arises in many applications. The practical usefulness of the dual DRO
problem, on the other hand, is less evident because this problem assumes somewhat
unrealistically that the decision-maker observes the distribution that governs 𝑍.
Nevertheless, the dual DRO problem has deep connections to robust statistics,
machine learning as well as several other disciplines as we explain below.
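The saddle point condition (7.16) is easiest to visualize in a finite bilinear toy model, where both players mix over finitely many alternatives and the game reduces to a zero-sum matrix game. The sketch below computes both equilibrium strategies by linear programming and verifies the saddle inequalities numerically; the payoff matrix is hypothetical, and the reduction glosses over the fact that in (1.2) the decision-maker typically optimizes over a general convex set X rather than a probability simplex:

```python
import numpy as np
from scipy.optimize import linprog

A = np.array([[2.0, 0.0],   # A[i, j] = loss of action i under scenario j
              [0.0, 1.0]])
m, n = A.shape

# Decision-maker: min_x max_j (A.T @ x)_j over the simplex -> LP in (x, v).
res_x = linprog(c=[0.0] * m + [1.0],
                A_ub=np.hstack([A.T, -np.ones((n, 1))]), b_ub=np.zeros(n),
                A_eq=[[1.0] * m + [0.0]], b_eq=[1.0],
                bounds=[(0, None)] * m + [(None, None)])
x_star, v = res_x.x[:m], res_x.x[m]

# Nature: max_p min_i (A @ p)_i over the simplex -> LP in (p, w).
res_p = linprog(c=[0.0] * n + [-1.0],
                A_ub=np.hstack([-A, np.ones((m, 1))]), b_ub=np.zeros(m),
                A_eq=[[1.0] * n + [0.0]], b_eq=[1.0],
                bounds=[(0, None)] * n + [(None, None)])
p_star, w = res_p.x[:n], res_p.x[n]

# Saddle condition (7.16): no player can improve by a unilateral deviation.
val = x_star @ A @ p_star
assert np.all(A.T @ x_star <= val + 1e-8)   # nature cannot gain by deviating
assert np.all(A @ p_star >= val - 1e-8)     # decision-maker cannot gain either
print(x_star, p_star, val)
```

The two LPs are dual to each other, so the minimax value 𝑣, the maximin value 𝑤 and the equilibrium payoff coincide, which is exactly the strong duality discussed above.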
From the perspective of robust statistics, a minimizer 𝑥★ of the primal DRO
problem (1.2) can be interpreted as a robust estimator for the minimizer of the
stochastic program min 𝑥 ∈X E P0 [ℓ(𝑥, 𝑍)] corresponding to an unknown distribu-
tion P0 . When 𝑥★ and P★ satisfy the saddle point condition (7.16), then the robust
estimator 𝑥★ constitutes a best response to P★. Hence, it solves the stochastic pro-
gram corresponding to P★; see also (Lehmann and Casella 2006, Chapter 5). For
this reason, P★ is often referred to as the least favorable distribution. The existence
of P★ makes 𝑥★ a plausible estimator because it ensures that 𝑥★ is the minimizer of
a stochastic program corresponding to some distribution in the ambiguity set.
Algorithms for computing Nash equilibria of DRO problems are also relevant
for applications in machine learning. To see this, recall that adversarial training
aims to immunize machine learning models against adversarial perturbations of
the input data (Szegedy et al. 2014, Goodfellow et al. 2015, Mądry, Makelov,
Schmidt, Tsipras and Vladu 2018, Wang, Ma, Bailey, Yi, Zhou and Gu 2019,
Kurakin, Goodfellow and Bengio 2022). In this context, it is of interest to generate
adversarial examples, that is, maliciously designed inputs that mislead prediction
models encoded by parameters 𝑥 ∈ X . As a naïve approach to construct adversarial
examples, one could simply solve the worst-case expectation problem

sup_{P∈P} E P [ℓ(𝑥ˆ, 𝑍)],                                               (7.17)

which seeks a test distribution that maximizes the expected prediction loss of one
particular model encoded by 𝑥. ˆ Thus, any solution P★ of (7.17) can be viewed
as an adversarial attack, and samples drawn from P★ are naturally interpreted as
adversarial examples. In order to develop efficient strategies for attacking as well
as defending prediction models, however, it is desirable to construct adversarial
attacks that fool a broad spectrum of different models. Such attacks are called
transferable in the machine learning literature (Tramèr, Papernot, Goodfellow,
Boneh and McDaniel 2017, Demontis, Melis, Pintor, Jagielski, Biggio, Oprea,
Nita-Rotaru and Roli 2019, Kurakin et al. 2022). The dual DRO problem (7.15)
can be used to construct transferable attacks in a systematic manner. Indeed, the
solutions of (7.15) are not tailored to a particular model 𝑥ˆ ∈ X . Instead, they
aim to attack all models 𝑥 ∈ X simultaneously. If the primal DRO problem (1.2)
has a unique minimizer 𝑥★, then this minimizer can be recovered by solving the
stochastic program corresponding to the adversary’s Nash strategy P★.
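For a linear model, the worst-case perturbation of each input within an ∞-norm ball can be written down in closed form, which recovers the fast gradient sign method from the adversarial training literature cited above. The sketch below attacks a fixed logistic model 𝑥ˆ in this way; the model, data and attack radius are hypothetical, and the per-sample perturbation set is a simple stand-in for the ambiguity set P in (7.17):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # hypothetical features
y = np.sign(X @ np.array([1.0, -2.0, 0.5]))  # labels in {-1, +1}
x_hat = np.array([0.8, -1.5, 0.3])           # fixed model parameters

def logistic_loss(X, y, w):
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

# The gradient of the per-sample loss w.r.t. the input z is
# -y * sigma(-y * w @ z) * w, so for a linear model the worst perturbation in
# an inf-norm ball of radius eps is eps times its sign (FGSM).
eps = 0.1
grad_sign = -y[:, None] * np.sign(x_hat)[None, :]
X_adv = X + eps * grad_sign

clean = logistic_loss(X, y, x_hat)
adv = logistic_loss(X_adv, y, x_hat)
print(clean, adv)   # the adversarial loss is never smaller than the clean loss
```

Each perturbation shrinks the sample's margin by exactly eps times the ℓ₁-norm of 𝑥ˆ, so the attack provably increases the loss of this particular model; a transferable attack in the sense of (7.15) would instead have to degrade a whole family of models simultaneously.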
To date, dual DRO problems have only been investigated in the context of specific
applications. For example, it is known that the least favorable distributions in dis-
tributionally robust estimation and Kalman filtering problems with a 2-Wasserstein
ambiguity set centered at a Gaussian reference distribution are themselves Gaus-
sian and can be computed efficiently via semidefinite programming (Shafieezadeh-


Abadeh et al. 2018, Nguyen et al. 2023). Several recent studies describe similar
results for distributionally robust optimal control problems with a 2-Wasserstein
ambiguity set (Al Taha et al. 2023, Hajar et al. 2023, Kargin et al. 2024a,b,c,d,
Taşkesen et al. 2024). When the Wasserstein ambiguity set is replaced with a
Kullback-Leibler ambiguity set around a Gaussian reference distribution, then the
least favorable distributions remain Gaussian and can be determined in quasi-closed
form (Levy and Nikoukhah 2004, 2012). In fact, these results even extend to gen-
eralized 𝜏-divergence ambiguity sets (Zorzi 2016, 2017b). Gaussian distributions
also solve several other minimax games reminiscent of DRO problems, which
are relevant for applications in statistics, control and information theory (Başar
and Mintz 1972, 1973, Başar and Max 1973, Başar 1977, Başar and Başar 1982,
Başar 1983, Başar and Başar 1984, Başar and Wu 1985, 1986). Furthermore, it is
possible to characterize the Nash equilibria of distributionally robust pricing and
auction design problems with support-only and Markov ambiguity sets in closed
form (Bergemann and Schlag 2008, Koçyiğit et al. 2020, 2022, Anunrojwong et al.
2024, Chen et al. 2024a). Minimax theorems establishing strong duality between
primal and dual DRO problems involving more general optimal transport ambi-
guity sets are reported in (Blanchet et al. 2022b, Shafiee et al. 2023, Frank and
Niles-Weed 2024b, Pydi and Jog 2024).

8. Regularization by Robustification
Classical stochastic optimization seeks decisions that perform well under a prob-
ability distribution P̂ estimated from training data. By ignoring any information
about estimation errors in P̂, however, stochastic optimization tends to output over-
fitted decisions that incur a low expected loss under P̂ but may perform poorly
under the unknown population distribution P. This problem becomes more acute if
training data is scarce. A key advantage of DRO vis-à-vis stochastic optimization
is that it has access to information about estimation errors. DRO uses this informa-
tion to prevent overfitting. Robustifying a stochastic optimization problem against
distributional uncertainty can thus be viewed as a form of implicit regularization.
We now show that there is often a deep connection between implicit regulariz-
ation (achieved by robustifying a problem against distributional uncertainty) and
explicit regularization (achieved by adding a penalty term to the problem’s objective
function). This discussion complements and extends several results from Section 6.
For example, in Section 6.9 we have seen that the worst-case expected value of a
linear loss function with respect to a Kullback-Leibler ambiguity set centered at
a Gaussian distribution coincides with the nominal expected loss and a variance
regularization term. Similarly, in Section 6.13 we have seen that the worst-case
expected value of a convex loss function with respect to a 1-Wasserstein ambiguity
set coincides with the nominal expected loss and a Lipschitz regularization term.
See also Sections 6.14 and 6.15 for some variants and generalizations of this result.
In Section 8.1 we will show—in broad generality—that the worst-case expected
loss over a 𝜙-divergence ambiguity set is closely related to the nominal expec-
ted loss with a variance regularization term. Similarly, in Section 8.2 we will
show that the worst-case expected loss over a Wasserstein ambiguity set is closely
related to the nominal expected loss with variation and Lipschitz regularization
terms. In Section 8.3 we will further show that many popular risk measures are
Lipschitz continuous in the distribution of the relevant risk factors with respect to
a Wasserstein distance. This implies that the worst-case risk over a Wasserstein
ambiguity set is closely related to the nominal risk and a Lipschitz regularization
term. We remark that the connections between robustification and regularization
are less well understood for moment ambiguity sets. From Section 6.8 we know
that the worst-case risk of a linear loss function over a Gelbrich ambiguity set often
coincides with the nominal risk and a 2-norm regularization term. However, it
is unclear whether similar results can be obtained for nonlinear loss functions or
other moment ambiguity sets. Therefore, we will not touch on moment ambiguity
sets in this section. We emphasize that the connections between robustification and
regularization often enable statistical analyses of DRO problems; see Section 10.

8.1. 𝜙-Divergence Ambiguity Sets


As a motivating example, we show that robustification with respect to a Pearson $\chi^2$-divergence ambiguity set is closely related to variance regularization. To see this, recall first that the Pearson $\chi^2$-divergence ambiguity set (2.17) is defined as
\[
\mathcal{P} = \{\mathbb{P} \in \mathcal{P}(\mathcal{Z}) : \chi^2(\mathbb{P}, \hat{\mathbb{P}}) \le r\}.
\]
If $\ell$ is a bounded Borel function, Proposition 2.13 readily implies that
\[
\mathcal{P} \subseteq \Big\{ \mathbb{P} \in \mathcal{P}(\mathcal{Z}) : \mathbb{E}_{\mathbb{P}}[\ell(Z)] \le \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + r^{\frac{1}{2}}\, \mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{\frac{1}{2}} \Big\},
\]
and thus we may conclude that
\[
\sup_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_{\mathbb{P}}[\ell(Z)] \le \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + r^{\frac{1}{2}}\, \mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{\frac{1}{2}}.
\]
Hence, the worst-case expected loss with respect to a Pearson $\chi^2$-divergence ambiguity set of radius $r$ around $\hat{\mathbb{P}}$ is bounded above by the mean-standard deviation risk measure with risk-aversion coefficient $r^{1/2}$ evaluated under $\hat{\mathbb{P}}$. By slight abuse of terminology, the scaled standard deviation $r^{1/2}\, \mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{1/2}$ is commonly referred to as a variance regularizer. By leveraging Theorem 4.15, the above bound can be extended to arbitrary (possibly unbounded) Borel loss functions. This extension critically relies on the following lemma.

Lemma 8.1 (Variance Formula). For any reference distribution $\hat{\mathbb{P}} \in \mathcal{P}(\mathcal{Z})$, size parameter $r > 0$ and Borel function $\ell \in \mathcal{L}(\mathbb{R}^d)$ with $\mathbb{E}_{\hat{\mathbb{P}}}[|\ell(Z)|] < \infty$, we have
\[
\inf_{\lambda_0 \in \mathbb{R},\, \lambda \in \mathbb{R}_+} \lambda r + \frac{\mathbb{E}_{\hat{\mathbb{P}}}[(\ell(Z) - \lambda_0)^2]}{4\lambda} = r^{\frac{1}{2}}\, \mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{\frac{1}{2}}. \tag{8.1}
\]
Proof. If $\mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)^2] = \infty$, then both sides of (8.1) evaluate to $\infty$, and thus the claim follows. In the remainder of the proof, we may thus assume that $\mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)^2] < \infty$. In this case, one readily verifies that the partial minimization problem over $\lambda_0$ is solved by $\lambda_0^\star = \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)]$. Substituting $\lambda_0^\star$ back into the objective function reveals that the infimum on the left-hand side of (8.1) equals $\inf_{\lambda \in \mathbb{R}_+} \lambda r + \mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]/(4\lambda)$. In order to prove (8.1), it suffices to realize that this minimization problem over $\lambda$ is solved by $\lambda^\star = \sqrt{\mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]/(4r)}$. This observation completes the proof. □
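The identity (8.1) is easy to sanity-check numerically. The sketch below is purely illustrative (the discrete distribution, loss values and radius are arbitrary choices): it evaluates the objective at the minimizers $\lambda_0^\star$ and $\lambda^\star$ derived in the proof and verifies that no randomly sampled feasible point attains a smaller value.

```python
import numpy as np

rng = np.random.default_rng(0)

probs = np.array([0.2, 0.5, 0.3])    # discrete reference distribution P-hat
losses = np.array([1.0, -2.0, 4.0])  # loss values ell(z_i)
r = 0.7                              # size parameter

mean = probs @ losses
var = probs @ (losses - mean) ** 2

def objective(lam0, lam):
    return lam * r + probs @ (losses - lam0) ** 2 / (4 * lam)

# Analytic minimizers from the proof of Lemma 8.1
lam0_star = mean
lam_star = np.sqrt(var / (4 * r))

lhs = objective(lam0_star, lam_star)
rhs = np.sqrt(r * var)
print(abs(lhs - rhs) < 1e-12)        # True: infimum equals r^(1/2) Var^(1/2)

# A crude random search confirms that no feasible point does better
lam0_s = rng.normal(scale=5.0, size=10000)
lam_s = rng.exponential(scale=2.0, size=10000)
vals = lam_s * r + ((losses[None, :] - lam0_s[:, None]) ** 2 @ probs) / (4 * lam_s)
print((vals >= lhs - 1e-9).all())    # True
```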
Theorem 8.2 (Variance Regularization). If $\mathcal{P}$ is the Pearson $\chi^2$-divergence ambiguity set (2.17) and $\mathbb{E}_{\hat{\mathbb{P}}}[|\ell(Z)|] < \infty$, then we have
\[
\sup_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_{\mathbb{P}}[\ell(Z)] \le \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + r^{\frac{1}{2}}\, \mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{\frac{1}{2}}.
\]
Proof. The claim trivially holds if $r = 0$. We may thus assume that $r > 0$. Recall now that the entropy function $\phi$ inducing the Pearson $\chi^2$-divergence satisfies $\phi(s) = (s-1)^2$ if $s \ge 0$ and $\phi(s) = \infty$ if $s < 0$. Hence, the conjugate entropy function $\phi^*$ satisfies $\phi^*(t) = \frac{1}{4}t^2 + t$ if $t \ge -2$ and $\phi^*(t) = -1$ if $t < -2$, and its domain is given by $\operatorname{dom}(\phi^*) = \mathbb{R}$. As $\phi^\infty(1) = \infty$, all distributions $\mathbb{P} \in \mathcal{P}$ are absolutely continuous with respect to $\hat{\mathbb{P}}$. Thus, Theorem 4.15 applies, and we find
\[
\begin{aligned}
\sup_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_{\mathbb{P}}[\ell(Z)] &= \inf_{\lambda_0 \in \mathbb{R},\, \lambda \in \mathbb{R}_+} \lambda_0 + \lambda r + \mathbb{E}_{\hat{\mathbb{P}}}\big[(\phi^*)^\pi(\ell(Z) - \lambda_0, \lambda)\big] \\
&\le \inf_{\lambda_0 \in \mathbb{R},\, \lambda \in \mathbb{R}_+} \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + \lambda r + \frac{\mathbb{E}_{\hat{\mathbb{P}}}[(\ell(Z) - \lambda_0)^2]}{4\lambda} \\
&= \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + r^{\frac{1}{2}}\, \mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{\frac{1}{2}},
\end{aligned}
\]
where the inequality holds because $\phi^*(t) \le \frac{1}{4}t^2 + t$, and the second equality follows from Lemma 8.1. Thus, the claim follows. □
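For a discrete reference distribution one can also see when the bound of Theorem 8.2 is attained. The sketch below is illustrative (support, probabilities and radius are arbitrary choices): it builds the candidate density $f = 1 + r^{1/2}(\ell - \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)])/\mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{1/2}$, which is non-negative whenever $r$ is small enough, and checks that the induced distribution lies on the boundary of the ambiguity set and achieves the bound with equality.

```python
import numpy as np

probs = np.array([0.2, 0.5, 0.3])    # reference distribution P-hat
losses = np.array([1.0, -2.0, 4.0])  # loss values ell(z_i)
r = 0.3                              # radius, small enough to keep f >= 0

mean = probs @ losses
std = np.sqrt(probs @ (losses - mean) ** 2)

# Candidate worst-case density with respect to P-hat
f = 1.0 + np.sqrt(r) * (losses - mean) / std
print((f >= 0).all())                # True: f is a valid density for this radius

worst = probs * f                    # worst-case distribution
chi2 = ((worst - probs) ** 2 / probs).sum()
print(abs(chi2 - r) < 1e-12)         # True: the distribution sits on the boundary
print(abs(worst @ losses - (mean + np.sqrt(r) * std)) < 1e-12)  # True: bound attained
```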

Most $\phi$-divergences are smooth and non-negative and thus resemble the Pearson $\chi^2$-divergence locally around 1 (Polyanskiy and Wu 2024, § 7.10). Accordingly, one can use a Taylor expansion to show that robustification over a $\phi$-divergence ambiguity set of sufficiently small size $r$ is often equivalent to variance regularization. To formalize this result, we assume from now on that $\phi$ is differentiable.

Assumption 8.3 (Differentiability). The entropy function $\phi$ is twice continuously differentiable on a neighborhood of 1 with $\phi(1) = \phi'(1) = 0$ and $\phi''(1) = 2$.

The assumption that $\phi'(1) = 0$ incurs no loss of generality. Indeed, any entropy function $\phi$ is equivalent to a transformed entropy function $\tilde{\phi}$ defined through $\tilde{\phi}(t) = \phi(t) - \phi'(1) \cdot t + \phi'(1)$, which satisfies $\tilde{\phi}'(1) = 0$. That is, both $\phi$ and $\tilde{\phi}$ induce the same divergence. Note that all entropy functions listed in Table 2.1—except for the one associated with the total variation—satisfy $\phi'(1) = 0$. The assumption that $\phi''(1) = 2$ serves as an arbitrary normalization but will simplify calculations.
Recall now that the restricted $\phi$-divergence ambiguity set (2.11) is defined as
\[
\mathcal{P} = \big\{\mathbb{P} \in \mathcal{P}(\mathcal{Z}) : \mathbb{P} \ll \hat{\mathbb{P}},\ D_\phi(\mathbb{P}, \hat{\mathbb{P}}) \le r\big\}.
\]
Here, $\mathcal{Z}$ is a closed support set, $r \in \mathbb{R}_+$ is a size parameter, $\phi$ is an entropy function in the sense of Definition 2.4, $D_\phi$ is the corresponding $\phi$-divergence in the sense of Definition 2.5, and $\hat{\mathbb{P}} \in \mathcal{P}(\mathcal{Z})$ is a reference distribution. The following theorem provides a leading-order Taylor expansion of the worst-case expectation over $\mathcal{P}$.

Theorem 8.4 (Taylor Expansion of Worst-Case Expectation). If $\mathcal{P}$ is the restricted $\phi$-divergence ambiguity set (2.11), the entropy function $\phi$ satisfies Assumption 8.3 and the loss $\ell(Z)$ is $\hat{\mathbb{P}}$-almost surely bounded, then we have
\[
\sup_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_{\mathbb{P}}[\ell(Z)] = \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + r^{\frac{1}{2}}\, \mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{\frac{1}{2}} + o(r^{\frac{1}{2}}). \tag{8.2}
\]
Proof. Note that (8.2) trivially holds if $r = 0$. Similarly, if $\mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)] = 0$, then $\ell(Z)$ coincides $\hat{\mathbb{P}}$-almost surely with $\mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)]$. As $\mathcal{P}$ is a restricted $\phi$-divergence ambiguity set, this readily implies that $\mathbb{E}_{\mathbb{P}}[\ell(Z)] = \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)]$ for all $\mathbb{P} \in \mathcal{P}$. Indeed, any $\mathbb{P} \in \mathcal{P}$ satisfies $\mathbb{P} \ll \hat{\mathbb{P}}$. Hence, (8.2) is again trivially satisfied. In the remainder of the proof we may therefore assume that $r > 0$ and that $\mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)] > 0$.

Assumption 8.3 implies that $\phi(s) = (s-1)^2 + o((s-1)^2)$. By Taylor's theorem with Peano remainder, $\phi$ can thus be bounded from below (or above) locally around 1 by a quadratic function whose second derivative is slightly smaller (or larger) than $\phi''(1) = 2$. Thus, there exists a function $\kappa : \mathbb{R}_+ \to \mathbb{R}_+$ with $\lim_{\varepsilon \downarrow 0} \kappa(\varepsilon) = 0$ and
\[
\frac{1}{1 + \kappa(\varepsilon)} \cdot s^2 \le \phi(1 + s) \le (1 + \kappa(\varepsilon)) \cdot s^2 \quad \forall s \in [-\varepsilon, +\varepsilon] \tag{8.3}
\]
for all sufficiently small $\varepsilon$. The rest of the proof proceeds in two steps, both of which exploit (8.3). First, we show that the right-hand side of (8.2) provides a lower bound on the worst-case expected loss over $\mathcal{P}$ (Step 1). Next, we show that the right-hand side of (8.2) also provides an upper bound on the worst-case expected loss over $\mathcal{P}$ (Step 2). Taken together, Steps 1 and 2 will imply the claim.

Step 1. Every distribution $\mathbb{P}$ in the restricted $\phi$-divergence ambiguity set $\mathcal{P}$ satisfies $\mathbb{P} \ll \hat{\mathbb{P}}$ and thus has a density function $f \in \mathcal{L}^1(\hat{\mathbb{P}})$ with respect to $\hat{\mathbb{P}}$. Here, $\mathcal{L}^1(\hat{\mathbb{P}})$ denotes as usual the family of all Borel functions from $\mathcal{Z}$ to $\mathbb{R}$ that are integrable with respect to $\hat{\mathbb{P}}$. As $\mathbb{P} \ll \hat{\mathbb{P}}$, we have $D_\phi(\mathbb{P}, \hat{\mathbb{P}}) = \mathbb{E}_{\hat{\mathbb{P}}}[\phi(f(Z))]$ (see also Section 2.2). Thus, the worst-case expectation problem over $\mathcal{P}$ can be recast as
\[
\sup_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_{\mathbb{P}}[\ell(Z)] = \left\{
\begin{array}{cl}
\sup\limits_{f \in \mathcal{L}^1(\hat{\mathbb{P}})} & \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z) f(Z)] \\[1mm]
\text{s.t.} & \hat{\mathbb{P}}(f(Z) \ge 0) = 1 \\
& \mathbb{E}_{\hat{\mathbb{P}}}[f(Z)] = 1 \\
& \mathbb{E}_{\hat{\mathbb{P}}}[\phi(f(Z))] \le r.
\end{array}\right.
\]
Renaming $f(z) + 1$ as $f(z)$ further yields
\[
\sup_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_{\mathbb{P}}[\ell(Z)] = \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + \left\{
\begin{array}{cl}
\sup\limits_{f \in \mathcal{L}^1(\hat{\mathbb{P}})} & \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z) f(Z)] \\[1mm]
\text{s.t.} & \hat{\mathbb{P}}(f(Z) \ge -1) = 1 \\
& \mathbb{E}_{\hat{\mathbb{P}}}[f(Z)] = 0 \\
& \mathbb{E}_{\hat{\mathbb{P}}}[\phi(1 + f(Z))] \le r.
\end{array}\right. \tag{8.4}
\]
Next, introduce an auxiliary function $\varepsilon : \mathbb{R}_+ \to \mathbb{R}_+$ satisfying
\[
\varepsilon(r) = 2 r^{\frac{1}{2}} \cdot \frac{\operatorname{ess\,sup}_{\hat{\mathbb{P}}}\big[|\ell(Z) - \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)]|\big]}{\mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{\frac{1}{2}}}.
\]
In addition, for every $r \in \mathbb{R}_+$, define the function $f_r^\star \in \mathcal{L}^1(\hat{\mathbb{P}})$ through
\[
f_r^\star(z) = \frac{r^{\frac{1}{2}}}{(1 + \kappa(\varepsilon(r)))^{\frac{1}{2}}} \cdot \frac{\ell(z) - \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)]}{\mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{\frac{1}{2}}}.
\]
By construction, we may thus conclude that
\[
|f_r^\star(Z)| \le r^{\frac{1}{2}} \cdot \frac{|\ell(Z) - \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)]|}{\mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{\frac{1}{2}}} \le \varepsilon(r) \quad \hat{\mathbb{P}}\text{-a.s.} \tag{8.5}
\]
for every $r \in \mathbb{R}_+$, where the two inequalities follow from the definitions of $f_r^\star(z)$ and $\varepsilon(r)$, respectively. In addition, we have $\mathbb{E}_{\hat{\mathbb{P}}}[f_r^\star(Z)] = 0$ and
\[
\mathbb{E}_{\hat{\mathbb{P}}}\big[\phi(1 + f_r^\star(Z))\big] \le (1 + \kappa(\varepsilon(r))) \cdot \mathbb{E}_{\hat{\mathbb{P}}}\big[f_r^\star(Z)^2\big] = r
\]
for all sufficiently small $r$. The inequality in the above expression follows from (8.5) and from the upper bound on $\phi$ in (8.3), which holds for all sufficiently small $\varepsilon$. The equality exploits the definition of $f_r^\star$. This shows that $f_r^\star$ constitutes a feasible solution for the maximization problem in (8.4) if $r$ is sufficiently small. Substituting $f_r^\star$ into (8.4) then yields the desired lower bound. Indeed, we have
\[
\begin{aligned}
\sup_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_{\mathbb{P}}[\ell(Z)] &\ge \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + \mathbb{E}_{\hat{\mathbb{P}}}\big[\ell(Z) f_r^\star(Z)\big] \\
&= \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + \frac{r^{\frac{1}{2}}}{(1 + \kappa(\varepsilon(r)))^{\frac{1}{2}}} \cdot \frac{\mathbb{E}_{\hat{\mathbb{P}}}\big[\ell(Z)(\ell(Z) - \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)])\big]}{\mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{\frac{1}{2}}} \\
&= \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + r^{\frac{1}{2}}\, \mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{\frac{1}{2}} + o(r^{\frac{1}{2}})
\end{aligned}
\]
for all sufficiently small $r$, where the first equality follows from the definition of $f_r^\star$. The second equality exploits the Taylor expansion of the inverse square root function around 1 and the elementary observation that $\lim_{r \downarrow 0} \kappa(\varepsilon(r)) = 0$.

Step 2. The Huber loss $h_\varepsilon : \mathbb{R} \to \mathbb{R}$ with tuning parameter $\varepsilon > 0$ is defined through
\[
h_\varepsilon(s) = \begin{cases} \frac{1}{2} s^2 & \text{if } |s| \le \varepsilon, \\ \varepsilon |s| - \frac{1}{2}\varepsilon^2 & \text{otherwise.} \end{cases}
\]
By construction, $h_\varepsilon$ is continuously differentiable, depends quadratically on $s$ if $|s| \le \varepsilon$ and depends linearly on $s$ if $|s| > \varepsilon$. Its conjugate $h_\varepsilon^* : \mathbb{R} \to \mathbb{R} \cup \{\infty\}$ satisfies
\[
h_\varepsilon^*(t) = \begin{cases} \frac{1}{2} t^2 & \text{if } |t| \le \varepsilon, \\ \infty & \text{otherwise.} \end{cases}
\]
The lower bound on $\phi$ in (8.3) and the convexity of $\phi$ imply that
\[
\phi(s) \ge \frac{2}{1 + \kappa(\varepsilon)}\, h_\varepsilon(s - 1) \quad \forall s \in \mathbb{R}
\]
whenever $\varepsilon$ is sufficiently small. This uniform lower bound on $\phi$ in terms of $h_\varepsilon$ gives rise to a uniform upper bound on $\phi^*$ in terms of $h_\varepsilon^*$. Indeed, we have
\[
\phi^*(t) \le \sup_{s \in \mathbb{R}}\ st - \frac{2}{1 + \kappa(\varepsilon)}\, h_\varepsilon(s - 1)
= t + \frac{2}{1 + \kappa(\varepsilon)}\, h_\varepsilon^*\bigg(\frac{t (1 + \kappa(\varepsilon))}{2}\bigg)
= t + \begin{cases} \frac{(1 + \kappa(\varepsilon))\, t^2}{4} & \text{if } |t| \le \frac{2\varepsilon}{1 + \kappa(\varepsilon)}, \\ \infty & \text{otherwise,} \end{cases} \tag{8.6}
\]
for all sufficiently small $\varepsilon$. The first equality in (8.6) is obtained by applying the variable transformation $s \leftarrow s - 1$ and by extracting the constant $2/(1 + \kappa(\varepsilon))$ from the supremum. The second equality follows from the definition of $h_\varepsilon^*$. By weak duality as established in Theorem 4.15, we then find
\[
\begin{aligned}
\sup_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_{\mathbb{P}}[\ell(Z)] &\le \inf_{\lambda_0 \in \mathbb{R},\, \lambda \in \mathbb{R}_+} \lambda_0 + \lambda r + \mathbb{E}_{\hat{\mathbb{P}}}\big[(\phi^*)^\pi(\ell(Z) - \lambda_0, \lambda)\big] \\
&\le \left\{
\begin{array}{cl}
\inf\limits_{\lambda_0 \in \mathbb{R},\, \lambda \in \mathbb{R}_+} & \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + \lambda r + \frac{1 + \kappa(\varepsilon(r))}{4\lambda}\, \mathbb{E}_{\hat{\mathbb{P}}}\big[(\ell(Z) - \lambda_0)^2\big] \\[1mm]
\text{s.t.} & \hat{\mathbb{P}}\Big(|\ell(Z) - \lambda_0| \le \frac{2 \varepsilon(r) \lambda}{1 + \kappa(\varepsilon(r))}\Big) = 1,
\end{array}\right.
\end{aligned} \tag{8.7}
\]
where the second inequality follows from the definition of the perspective function and from (8.6), which holds for all sufficiently small $\varepsilon$. Here, we have re-used the function $\varepsilon(r)$ introduced in Step 1. Next, we set $\lambda_0^\star = \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)]$ and define
\[
\lambda_r^\star = \frac{(1 + \kappa(\varepsilon(r)))^{\frac{1}{2}}}{2 r^{\frac{1}{2}}} \cdot \mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{\frac{1}{2}}
\]
for any $r > 0$. Note that $(\lambda_0^\star, \lambda_r^\star)$ is feasible in (8.7) provided that $r$ is sufficiently small; in particular, $r$ must be small enough to satisfy $\kappa(\varepsilon(r)) \le 3$. Indeed, we have
\[
\begin{aligned}
\hat{\mathbb{P}}\bigg(|\ell(Z) - \lambda_0^\star| \le \frac{2 \varepsilon(r) \lambda_r^\star}{1 + \kappa(\varepsilon(r))}\bigg)
&= \hat{\mathbb{P}}\Bigg(|\ell(Z) - \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)]| \le \frac{\varepsilon(r)\, \mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{\frac{1}{2}}}{r^{\frac{1}{2}} (1 + \kappa(\varepsilon(r)))^{\frac{1}{2}}}\Bigg) \\
&= \hat{\mathbb{P}}\Bigg(|\ell(Z) - \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)]| \le \frac{2\, \operatorname{ess\,sup}_{\hat{\mathbb{P}}}\big[|\ell(Z) - \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)]|\big]}{(1 + \kappa(\varepsilon(r)))^{\frac{1}{2}}}\Bigg) = 1,
\end{aligned}
\]
where the first equality follows from the definitions of $\lambda_0^\star$ and $\lambda_r^\star$, the second equality follows from the definition of $\varepsilon(r)$, and the last equality holds because $\kappa(\varepsilon(r)) \le 3$. Substituting $(\lambda_0^\star, \lambda_r^\star)$ into (8.7) then yields the desired upper bound.
\[
\begin{aligned}
\sup_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_{\mathbb{P}}[\ell(Z)] &\le \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + \lambda_r^\star r + \frac{1 + \kappa(\varepsilon(r))}{4 \lambda_r^\star}\, \mathbb{E}_{\hat{\mathbb{P}}}\big[(\ell(Z) - \lambda_0^\star)^2\big] \\
&= \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + \frac{(1 + \kappa(\varepsilon(r)))^{\frac{1}{2}}}{2} \cdot r^{\frac{1}{2}}\, \mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{\frac{1}{2}}
+ \frac{(1 + \kappa(\varepsilon(r)))^{\frac{1}{2}}}{2\, \mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{\frac{1}{2}}}\, r^{\frac{1}{2}}\, \mathbb{E}_{\hat{\mathbb{P}}}\Big[\big(\ell(Z) - \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)]\big)^2\Big] \\
&= \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + r^{\frac{1}{2}}\, \mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{\frac{1}{2}} + o(r^{\frac{1}{2}}).
\end{aligned}
\]
Here, the first equality follows from the definitions of $\lambda_0^\star$ and $\lambda_r^\star$, and the second equality holds because $\lim_{r \downarrow 0} \kappa(\varepsilon(r)) = 0$. Hence, the claim follows. □

Theorem 8.4 reveals that, up to leading order in $r$, robustification with respect to a restricted divergence ambiguity set is equivalent to variance regularization. The requirement that the loss must be almost surely bounded is restrictive and cannot be dropped in general. However, it can be relaxed if the entropy function $\phi$ grows superlinearly. As an example, recall from Proposition 6.12 that the worst-case expectation of a linear loss function with respect to a Kullback-Leibler ambiguity set centered at a Gaussian distribution equals precisely $\mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + (2r)^{1/2}\, \mathbb{V}_{\hat{\mathbb{P}}}[\ell(Z)]^{1/2}$ without any higher-order error terms. This formula is consistent with Theorem 8.4 because the entropy function of the Kullback-Leibler divergence satisfies $\phi''(1) = 1$. Thus, it must be scaled by 2 to satisfy Assumption 8.3. Note that any (non-constant) linear loss function fails to be $\hat{\mathbb{P}}$-almost surely bounded with respect to any (non-degenerate) Gaussian distribution $\hat{\mathbb{P}}$. However, the conclusions of Theorem 8.4 hold nevertheless because the underlying entropy function grows faster than linearly.
A Taylor expansion akin to (8.2) for empirical reference distributions and for the Kullback-Leibler divergence ambiguity set (2.13) is due to Lam (2019). Duchi et al. (2021) generalize this result to other $\phi$-divergences. Similar results for empirical reference distributions are also reported by Lam (2016, 2018), Duchi and Namkoong (2019) and Blanchet and Shapiro (2023) in different contexts. In a parallel line of research, Gotoh, Kim and Lim (2018, 2021) derive a Taylor expansion of the penalty-based worst-case expected loss $\sup_{\mathbb{P} \in \mathcal{P}(\mathcal{Z})} \mathbb{E}_{\mathbb{P}}[\ell(Z)] - \frac{1}{r} D_\phi(\mathbb{P}, \hat{\mathbb{P}})$. They focus again on discrete empirical reference distributions and provide both first- as well as higher-order terms of the corresponding Taylor expansion.
Maurer and Pontil (2009) show that variance-regularized empirical risk min-
imization may provide faster rates of convergence to the expected loss under the
population distribution compared to standard empirical risk minimization. This
improved convergence highlights the potential benefits of incorporating variance
regularization in the learning process. Unfortunately, simple stochastic optimiza-
tion problems with a mean-variance objective are NP-hard even if the underlying
loss function is convex in the decision variables (Ahmed 2006). In contrast, the
worst-case expectation with respect to any ambiguity set preserves the convexity of
the underlying loss function. Theorem 8.4 thus suggests that the worst-case expec-
ted loss over a restricted 𝜙-divergence ambiguity set provides a convex surrogate for
the nonconvex—but statistically attractive—variance-regularized empirical loss.

8.2. Wasserstein Ambiguity Sets


As a motivating example, we show that robustification with respect to a 1-Wasserstein ambiguity set is closely related to Lipschitz regularization. To see this, recall first that the $p$-Wasserstein ambiguity set (2.28) for $p \in [1, \infty)$ is defined as
\[
\mathcal{P} = \big\{\mathbb{P} \in \mathcal{P}(\mathcal{Z}) : W_p(\mathbb{P}, \hat{\mathbb{P}}) \le r\big\}.
\]
Here, $\mathcal{Z}$ is a closed support set, $r \in \mathbb{R}_+$ is a size parameter, $W_p$ is the $p$-Wasserstein distance induced by a norm $\|\cdot\|$ on $\mathbb{R}^d$ (see Definition 2.18), and $\hat{\mathbb{P}} \in \mathcal{P}(\mathcal{Z})$ is a reference distribution. If the loss function $\ell$ is piecewise concave, then the worst-case expectation problem (4.1) over $\mathcal{P}$ can be reformulated as a finite convex program (see Theorem 7.20). For more general loss functions, however, exact reformulations of (4.1) are unavailable. We now show that if $p = 1$ and $\ell$ is Lipschitz continuous as well as $\hat{\mathbb{P}}$-integrable, then the worst-case expectation problem (4.1) admits a simple upper bound that involves the Lipschitz modulus of $\ell$.

Proposition 8.5 (Lipschitz Regularization). Suppose that $\mathcal{P}$ is the 1-Wasserstein ambiguity set of radius $r \in \mathbb{R}_+$ around $\hat{\mathbb{P}} \in \mathcal{P}(\mathcal{Z})$, and $W_1$ is induced by a norm $\|\cdot\|$ on $\mathbb{R}^d$. In addition, suppose that $\ell$ is Lipschitz continuous on $\mathcal{Z}$ with respect to the same norm $\|\cdot\|$ and that $\mathbb{E}_{\hat{\mathbb{P}}}[|\ell(Z)|] < \infty$. Then, we have
\[
\sup_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_{\mathbb{P}}[\ell(Z)] \le \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + r \cdot \operatorname{lip}(\ell). \tag{8.8}
\]

We emphasize that evaluating the Lipschitz modulus of a generic loss function


is computationally challenging. For example, one can show that computing lip(ℓ)
is NP-hard even if k · k is the ∞-norm and even if ℓ is a (convex) conic quadratic
loss function; see, e.g., (Kuhn et al. 2019, Remark 3) for a simple proof.
Proof of Proposition 8.5. The Kantorovich-Rubinstein duality implies that
\[
\begin{aligned}
\sup_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_{\mathbb{P}}[\ell(Z)] &= \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + \operatorname{lip}(\ell) \cdot \sup_{\mathbb{P} \in \mathcal{P}}\ \mathbb{E}_{\mathbb{P}}\bigg[\frac{\ell(Z)}{\operatorname{lip}(\ell)}\bigg] - \mathbb{E}_{\hat{\mathbb{P}}}\bigg[\frac{\ell(Z)}{\operatorname{lip}(\ell)}\bigg] \\
&\le \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + r \cdot \operatorname{lip}(\ell).
\end{aligned}
\]
Indeed, the normalized function $\ell/\operatorname{lip}(\ell)$ is Lipschitz continuous and has Lipschitz modulus at most 1. By Corollary 2.19, we thus have for every $\mathbb{P} \in \mathcal{P}$ that
\[
\mathbb{E}_{\mathbb{P}}\bigg[\frac{\ell(Z)}{\operatorname{lip}(\ell)}\bigg] - \mathbb{E}_{\hat{\mathbb{P}}}\bigg[\frac{\ell(Z)}{\operatorname{lip}(\ell)}\bigg] \le W_1(\mathbb{P}, \hat{\mathbb{P}}) \le r.
\]
Therefore, the claim follows. □
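The bound (8.8) can be probed numerically for an empirical reference distribution, since any map that moves the atoms by an average distance of at most $r$ produces a feasible competitor. The sketch below is a heuristic illustration (the loss, sample, radius and random search are arbitrary choices and do not solve the worst-case problem exactly): it confirms that no randomly perturbed empirical distribution exceeds the Lipschitz upper bound.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(z):
    return np.abs(np.sin(z)) + 0.5 * z   # Lipschitz with lip(loss) <= 1.5

lip = 1.5
atoms = rng.normal(size=200)             # empirical reference distribution
r = 0.4                                  # 1-Wasserstein radius

nominal = loss(atoms).mean()
upper = nominal + r * lip                # Lipschitz upper bound (8.8)

best = nominal
for _ in range(2000):
    # Random rearrangement with average transport cost exactly r
    direction = rng.choice([-1.0, 1.0], size=atoms.size)
    move = rng.exponential(size=atoms.size)
    move *= r / move.mean()              # enforce average displacement r
    best = max(best, loss(atoms + direction * move).mean())

print(best <= upper + 1e-9)              # True: no competitor beats the bound
```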
Close connections between Wasserstein distributionally robust optimization and Lipschitz regularization have been discovered in different contexts (Mohajerin Esfahani and Kuhn 2018, Shafieezadeh-Abadeh et al. 2015, 2019, Gao et al. 2024). Recall that the upper bound in (8.8) is tight. Indeed, Proposition 6.17 implies that (8.8) collapses to an equality if $\ell$ is convex and $\mathcal{Z} = \mathbb{R}^d$. The Lipschitz modulus of the loss function encodes its variability. Thus, the Lipschitz regularization term in (8.8) penalizes loss functions that display a high degree of variability. In the following we will derive generalized variation regularization bounds akin to (8.8) for worst-case expectation problems over $p$-Wasserstein ambiguity sets for $p \in \mathbb{N}$.
Toward this goal, for any $k \in \mathbb{Z}_+$ we use $D^k \ell(\hat{z})$ to denote the totally symmetric tensor of all $k$-th order partial derivatives of $\ell(z)$ at $z = \hat{z}$. Accordingly, $D^k \ell(\hat{z})[z_1, \ldots, z_k]$ stands for the directional derivative of $\ell(z)$ along the directions $z_i \in \mathbb{R}^d$ for $i \in [k]$. If $z_i = z$ for all $i \in [k]$, then we use $D^k \ell(\hat{z})[z]^k$ as a shorthand for $D^k \ell(\hat{z})[z, \ldots, z]$. Any norm $\|\cdot\|$ on $\mathbb{R}^d$ induces a norm on the space of totally symmetric $k$-th order tensors through
\[
\|D^k \ell(\hat{z})\| = \sup_{z_1, \ldots, z_k \in \mathbb{R}^d} \big\{ D^k \ell(\hat{z})[z_1, \ldots, z_k] : \|z_i\| \le 1\ \forall i \in [k] \big\}
= \sup_{z \in \mathbb{R}^d} \big\{ D^k \ell(\hat{z})[z]^k : \|z\| \le 1 \big\},
\]
where the second equality exploits the symmetry of $D^k \ell(\hat{z})$ (Banach 1938, Satz 1). By slight abuse of notation, we use the same symbol $\|\cdot\|$ for the tensor norm as for the underlying vector norm. The following theorem generalizes Proposition 8.5 to any $p \in \mathbb{N}$. This result is due to Shafiee et al. (2023, Theorem 3.2).
Theorem 8.6 (Variation and Lipschitz Regularization). If $\mathcal{P}$ is the $p$-Wasserstein ambiguity set (2.28) for some $p \in \mathbb{N}$, where $W_p$ is induced by a norm $\|\cdot\|$ on $\mathbb{R}^d$, $\mathcal{Z}$ is convex and $\ell$ is $p - 1$ times continuously differentiable, then we have
\[
\sup_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_{\mathbb{P}}[\ell(Z)] \le \mathbb{E}_{\hat{\mathbb{P}}}\big[\ell(\hat{Z})\big] + \sum_{k=1}^{p-1} \frac{r^k}{k!}\, \mathbb{E}_{\hat{\mathbb{P}}}\big[\|D^k \ell(\hat{Z})\|^{q_k}\big]^{\frac{1}{q_k}} + \frac{r^p}{p!}\, \operatorname{lip}(D^{p-1} \ell),
\]
where $p_k = p/k$ and $q_k = p/(p - k)$ for all $k \in [p - 1]$.
Proof. Select any $\mathbb{P} \in \mathcal{P}$ and any optimal coupling $\gamma^\star \in \Gamma(\mathbb{P}, \hat{\mathbb{P}})$ with $W_p(\mathbb{P}, \hat{\mathbb{P}}) = \mathbb{E}_{\gamma^\star}[\|Z - \hat{Z}\|^p]^{1/p}$, which exists by Lemma 3.17. As $\gamma^\star \in \Gamma(\mathbb{P}, \hat{\mathbb{P}})$, we have
\[
\mathbb{E}_{\mathbb{P}}[\ell(Z)] - \mathbb{E}_{\hat{\mathbb{P}}}\big[\ell(\hat{Z})\big] = \mathbb{E}_{\gamma^\star}\big[\ell(Z) - \ell(\hat{Z})\big].
\]
By (Krantz and Parks 2002, Theorem 2.2.5), we can expand $\ell(z) - \ell(\hat{z})$ as a Taylor series with Lagrange remainder. Thus, there exists a Borel function $f : \mathcal{Z} \times \mathcal{Z} \to \mathcal{Z}$ that maps any pair $(z, \hat{z})$ to a point on the line segment between $z$ and $\hat{z}$ such that
\[
\begin{aligned}
\ell(z) - \ell(\hat{z}) &= \sum_{k=1}^{p-1} \frac{1}{k!}\, D^k \ell(\hat{z})[z - \hat{z}]^k + \frac{1}{p!}\, D^p \ell(f(z, \hat{z}))[z - \hat{z}]^p \\
&\le \sum_{k=1}^{p-1} \frac{1}{k!}\, \|D^k \ell(\hat{z})\|\, \|z - \hat{z}\|^k + \frac{1}{p!}\, \|D^p \ell(f(z, \hat{z}))\|\, \|z - \hat{z}\|^p.
\end{aligned} \tag{8.9}
\]
The inequality in (8.9) follows from the definition of the tensor norm. By Hölder's inequality, the expected value of the $k$-th term in (8.9) with respect to $\gamma^\star$ satisfies
\[
\mathbb{E}_{\gamma^\star}\big[\|D^k \ell(\hat{Z})\|\, \|Z - \hat{Z}\|^k\big] \le \mathbb{E}_{\gamma^\star}\big[\|Z - \hat{Z}\|^{k p_k}\big]^{\frac{1}{p_k}}\, \mathbb{E}_{\gamma^\star}\big[\|D^k \ell(\hat{Z})\|^{q_k}\big]^{\frac{1}{q_k}}
\le r^k\, \mathbb{E}_{\hat{\mathbb{P}}}\big[\|D^k \ell(\hat{Z})\|^{q_k}\big]^{\frac{1}{q_k}},
\]
where $p_k = p/k$ and $q_k = p/(p - k)$ represent conjugate exponents. The second inequality in the above expression holds because $\gamma^\star \in \Gamma(\mathbb{P}, \hat{\mathbb{P}})$, which implies that
\[
\mathbb{E}_{\gamma^\star}\big[\|Z - \hat{Z}\|^{k p_k}\big]^{\frac{1}{p_k}} = \mathbb{E}_{\gamma^\star}\big[\|Z - \hat{Z}\|^p\big]^{\frac{k}{p}} = W_p(\mathbb{P}, \hat{\mathbb{P}})^k \le r^k.
\]
As $\mathcal{Z}$ is convex, we may conclude that $f(z, \hat{z}) \in \mathcal{Z}$ for all $z, \hat{z} \in \mathcal{Z}$. Thus, the expected value of the Lagrange remainder in (8.9) with respect to $\gamma^\star$ satisfies
\[
\mathbb{E}_{\gamma^\star}\big[\|D^p \ell(f(Z, \hat{Z}))\|\, \|Z - \hat{Z}\|^p\big]
\le \sup_{\hat{z} \in \mathcal{Z}} \|D^p \ell(\hat{z})\|\ \mathbb{E}_{\gamma^\star}\big[\|Z - \hat{Z}\|^p\big] \le r^p \sup_{\hat{z} \in \mathcal{Z}} \|D^p \ell(\hat{z})\| \le r^p \operatorname{lip}(D^{p-1} \ell),
\]
where the second inequality exploits again Hölder's inequality and the properties of the optimal coupling $\gamma^\star$. The third inequality follows from the mean value theorem. The desired inequality is finally obtained by combining the upper bounds on the expected values of all terms in (8.9) with respect to $\gamma^\star$. □
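As a quick numerical illustration of Theorem 8.6 (the loss, sample and radius below are arbitrary choices, not taken from the text), consider $p = 2$, $d = 1$ and $\ell(z) = z^2$, for which the bound reads $\mathbb{E}_{\hat{\mathbb{P}}}[\hat{Z}^2] + r\, \mathbb{E}_{\hat{\mathbb{P}}}[(2\hat{Z})^2]^{1/2} + r^2$ because $\operatorname{lip}(D^1 \ell) = 2$. Scaling every atom of an empirical distribution by $1 + r/\mathbb{E}_{\hat{\mathbb{P}}}[\hat{Z}^2]^{1/2}$ defines a feasible coupling with 2-Wasserstein cost exactly $r$ that attains the bound, showing that it is tight for this loss.

```python
import numpy as np

rng = np.random.default_rng(1)
atoms = rng.normal(size=1000)        # empirical reference distribution, d = 1
r = 0.25                             # 2-Wasserstein radius

m2 = np.mean(atoms ** 2)             # nominal expected loss E[Z^2]
grad_term = np.sqrt(np.mean((2 * atoms) ** 2))   # E[||D^1 ell||^2]^(1/2)
bound = m2 + r * grad_term + r ** 2  # Theorem 8.6 with p = 2 and lip(D^1 ell) = 2

# Scale every atom so that the induced coupling has cost exactly r
scale = 1.0 + r / np.sqrt(m2)
shifted = scale * atoms
cost = np.sqrt(np.mean((shifted - atoms) ** 2))
worst = np.mean(shifted ** 2)

print(abs(cost - r) < 1e-9)          # True: the transport budget is exhausted
print(abs(worst - bound) < 1e-9)     # True: the bound is attained for this loss
```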
Theorem 8.6 shows that the worst-case expected loss over a $p$-Wasserstein ball is bounded above by the sum of the expected loss under the reference distribution, $p - 1$ variation regularization terms, and a Lipschitz regularization term. Note that $p_1 = p$ and $q = q_1 = p/(p-1)$ are Hölder conjugates and that $D^1 \ell = \nabla \ell$. Thus, the term corresponding to $k = 1$ in the upper bound of Theorem 8.6 can be expressed more explicitly as $\mathbb{E}_{\hat{\mathbb{P}}}[\|\nabla \ell(\hat{Z})\|^q]^{1/q}$. The next theorem, which is adapted from (Bartl, Drapeau, Oblój and Wiesel 2021, Gao et al. 2024), reveals that this variation regularizer matches the leading term of a Taylor expansion of the worst-case expected loss in the radius $r$ of the $p$-Wasserstein ball for any $p > 1$.
Theorem 8.7 (Taylor Expansion of Worst-Case Expectation). Suppose that $\mathcal{P}$ is the $p$-Wasserstein ambiguity set (2.28) for some $p > 1$, where $W_p$ is induced by a norm $\|\cdot\|$ on $\mathbb{R}^d$, and $\mathcal{Z}$ is convex. Suppose also that the following hold.
(i) Growth Condition. There exist $g, \delta_0 > 0$ such that $\ell(z) - \ell(\hat{z}) \le g \|z - \hat{z}\|^p$ for all $z, \hat{z} \in \mathcal{Z}$ with $\|z - \hat{z}\| > \delta_0$.
(ii) Smoothness Condition. There exists $L > 0$ such that $\|\nabla \ell(z) - \nabla \ell(\hat{z})\|_* \le L \|z - \hat{z}\|$ for all $z, \hat{z} \in \mathcal{Z}$, where $\|\cdot\|_*$ is the norm dual to $\|\cdot\|$.
(iii) Integrability Condition. Both $\mathbb{E}_{\hat{\mathbb{P}}}[\|\nabla \ell(\hat{Z})\|_*^q]$ and $\mathbb{E}_{\hat{\mathbb{P}}}[\|\nabla \ell(\hat{Z})\|_*^{2q-2}]$ are finite, where $q = p/(p-1)$ is the Hölder conjugate of $p$.
Then, we have
\[
\sup_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_{\mathbb{P}}[\ell(Z)] = \mathbb{E}_{\hat{\mathbb{P}}}[\ell(Z)] + r \cdot \mathbb{E}_{\hat{\mathbb{P}}}\big[\|\nabla \ell(Z)\|_*^q\big]^{\frac{1}{q}} + o(r). \tag{8.10}
\]
Recall that all norms on $\mathbb{R}^d$ are topologically equivalent. Thus, in the smoothness condition we could equivalently use the primal norm instead of the dual norm to measure differences between gradients. However, working with the dual norm is more convenient and will simplify the proof of Theorem 8.7.
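The expansion (8.10) suggests a cheap first-order surrogate for small radii: the nominal loss plus $r$ times the gradient-norm regularizer $\mathbb{E}_{\hat{\mathbb{P}}}[\|\nabla \ell(\hat{Z})\|_*^q]^{1/q}$. The sketch below is illustrative only (loss, sample and radius are arbitrary choices, with $p = q = 2$ and the Euclidean norm); it also applies the perturbation $\delta^\star$ appearing in the proof of the lower bound and checks that the resulting expected loss matches the surrogate up to $O(r^2)$.

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(size=5000)            # empirical reference sample, d = 1

def loss(z):
    return np.log1p(np.exp(z))       # smooth loss; gradient is the sigmoid

def grad(z):
    return 1.0 / (1.0 + np.exp(-z))

r = 0.05                             # small 2-Wasserstein radius (p = q = 2)

nominal = loss(z).mean()
reg = np.sqrt(np.mean(grad(z) ** 2))      # gradient-norm regularizer E[|grad|^2]^(1/2)
surrogate = nominal + r * reg             # first-order surrogate from (8.10)

# Perturbation delta* from the proof of the lower bound: move each point
# along its gradient with step proportional to |grad|, so that E[delta^2] = r^2.
delta = r * grad(z) / reg
perturbed = loss(z + delta).mean()

print(abs(perturbed - surrogate) < 2 * r ** 2)   # True: match up to O(r^2)
```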
Proof of Theorem 8.7. For any fixed $\delta \in \mathbb{R}_+$ and $\hat{z} \in \mathcal{Z}$, we define the variation of the loss function $\ell$ over a norm ball of radius $\delta$ around $\hat{z}$ as
\[
V_\delta(\hat{z}) = \sup_{z \in \mathcal{Z}} \big\{ \ell(z) - \ell(\hat{z}) : \|z - \hat{z}\| \le \delta \big\}.
\]
Note that $V_\delta(\hat{z})$ is finite because $\ell$ is continuous thanks to the smoothness condition. As a preparation to prove the theorem, we first establish simple upper and lower bounds on $V_\delta(\hat{z})$. As $\mathcal{Z}$ is convex, the line segment from $\hat{z}$ to any $z \in \mathcal{Z}$ is contained in $\mathcal{Z}$. The mean value theorem then implies that there exists a point $\bar{z} \in \mathcal{Z}$ on this line segment that satisfies $\ell(z) - \ell(\hat{z}) = \nabla \ell(\bar{z})^\top (z - \hat{z})$. Thus, we have
\[
\ell(z) - \ell(\hat{z}) - \nabla \ell(\hat{z})^\top (z - \hat{z}) = \nabla \ell(\bar{z})^\top (z - \hat{z}) - \nabla \ell(\hat{z})^\top (z - \hat{z})
\le \|\nabla \ell(\bar{z}) - \nabla \ell(\hat{z})\|_*\, \|z - \hat{z}\| \le L \|z - \hat{z}\|^2,
\]
where the two inequalities follow from the definition of the dual norm and from the smoothness condition, respectively. This implies that
\[
\nabla \ell(\hat{z})^\top (z - \hat{z}) - L \|z - \hat{z}\|^2 \le \ell(z) - \ell(\hat{z}) \le \nabla \ell(\hat{z})^\top (z - \hat{z}) + L \|z - \hat{z}\|^2. \tag{8.11}
\]
The first inequality in (8.11) gives rise to a lower bound on $V_\delta(\hat{z})$. Indeed, we find
\[
V_\delta(\hat{z}) \ge \sup_{z \in \mathcal{Z}} \big\{ \nabla \ell(\hat{z})^\top (z - \hat{z}) - L \|z - \hat{z}\|^2 : \|z - \hat{z}\| \le \delta \big\}
\ge \sup_{z \in \mathcal{Z}} \big\{ \nabla \ell(\hat{z})^\top (z - \hat{z}) : \|z - \hat{z}\| \le \delta \big\} - L \delta^2 = \|\nabla \ell(\hat{z})\|_*\, \delta - L \delta^2, \tag{8.12}
\]
where the equality follows from the definition of the dual norm. Similarly, the second inequality in (8.11) gives rise to the following upper bound on $V_\delta(\hat{z})$.
\[
V_\delta(\hat{z}) \le \|\nabla \ell(\hat{z})\|_*\, \delta + L \delta^2 \quad \forall \delta \in \mathbb{R}_+ \tag{8.13}
\]
This upper bound grows quadratically with $\delta$ and is therefore too loose for our purposes if $p < 2$. In this case, we must establish an alternative upper bound that grows only as $\delta^p$. This is possible thanks to the growth condition on $\ell$. To see this, define the worst-case variation of $\ell$ over any ball of radius $\delta_0$ as
\[
\bar{V} = \sup_{z, \hat{z} \in \mathcal{Z}} \big\{ \ell(z) - \ell(\hat{z}) : \|z - \hat{z}\| \le \delta_0 \big\}.
\]
One can show that $\bar{V}$ is finite. If $\mathcal{Z}$ is compact, then this is a consequence of Weierstrass' maximum theorem, which applies because $\ell$ is continuous. If $\mathcal{Z}$ is unbounded, on the other hand, then this is a consequence of the convexity of $\mathcal{Z}$ and the growth condition on $\ell$. In this case, there exists a recession direction $d$ of $\mathcal{Z}$ with $\|d\| = 2\delta_0$. Thus, for all $z, \hat{z} \in \mathcal{Z}$ with $\|z - \hat{z}\| \le \delta_0$ we have
\[
\ell(z) - \ell(\hat{z}) \le |\ell(z) - \ell(z + d)| + |\ell(z + d) - \ell(\hat{z})|
\le g \|d\|^p + g \|z + d - \hat{z}\|^p \le g \big( (2\delta_0)^p + (3\delta_0)^p \big).
\]
The second inequality follows from the growth condition on $\ell$ and the estimates $\|z - (z + d)\| = 2\delta_0 > \delta_0$ and $\|(z + d) - \hat{z}\| \ge \|d\| - \|z - \hat{z}\| \ge \delta_0$. Thus, $\ell(z) - \ell(\hat{z})$ admits a finite upper bound independent of $z$ and $\hat{z}$, which confirms that $\bar{V}$ is finite.

The growth condition on $\ell$ ensures that $V_\delta(\hat{z}) \le \max\{\bar{V}, g \delta^p\}$. Combining this estimate with (8.13) and defining $u(\delta) = \min\{\max\{\bar{V}, g \delta^p\}, L \delta^2\}$ yields
\[
V_\delta(\hat{z}) \le \min\big\{ \max\{\bar{V}, g \delta^p\},\ \|\nabla \ell(\hat{z})\|_*\, \delta + L \delta^2 \big\} \le \|\nabla \ell(\hat{z})\|_*\, \delta + u(\delta).
\]
Note that $u(\delta) = g \delta^p$ for all sufficiently large $\delta$ and $u(\delta) = L \delta^2$ for all sufficiently small $\delta$. In between there is a (possibly empty) interval on which $u(\delta) = \bar{V}$ is constant. Since $p \le 2$, in all three regimes $u(\delta)$ can be bounded above by $g' \delta^p$ for some growth parameter $g' \in \mathbb{R}_+$. Setting $G$ to the largest of these three growth parameters, we may thus conclude that
\[
V_\delta(\hat{z}) \le \|\nabla \ell(\hat{z})\|_*\, \delta + G \delta^p \quad \forall \delta \in \mathbb{R}_+. \tag{8.14}
\]
Thus, if $p \le 2$, then $V_\delta(\hat{z})$ admits an upper bound that grows only as $\delta^p$.

The remainder of the proof proceeds in two steps. First, we show that the right-hand side of (8.10) provides a lower bound on the worst-case expected loss over $\mathcal{P}$ (Step 1). Next, we show that the right-hand side of (8.10) also provides an upper bound on the worst-case expected loss over $\mathcal{P}$ (Step 2). This will prove the claim.

Step 1. Define F as the family of all Borel functions 𝑓 : Z → Z. Any 𝑓 ∈ F


induces a pushforward distribution P = P̂ ◦ 𝑓 −1 supported on Z. By restricting the
Wasserstein ball around P̂ to contain only such pushforward distributions, we find
    
ˆ
sup E P [ℓ(𝑍)] ≥ sup E P̂ ℓ( 𝑓 ( 𝑍)) ˆ − 𝑍ˆ k 𝑝 ≤ 𝑟 𝑝
: E P̂ k 𝑓 ( 𝑍) (8.15a)
P∈P 𝑓 ∈F
  n h i   o
ˆ + sup E 𝑉 𝛿( 𝑍)
≥ E P̂ ℓ( 𝑍) ˆ : E 𝛿( 𝑍)
ˆ ( 𝑍) ˆ 𝑝 ≤ 𝑟 𝑝 , (8.15b)
P̂ P̂
𝛿 ∈Δ
160 D. Kuhn, S. Shafiee, and W. Wiesemann

where the set Δ in (8.15b) represents the family of all Borel functions 𝛿 : Z → R+ .
The second inequality in the above expression can be justified as follows. Select
any 𝛿 ∈ Δ feasible in (8.15b), and define 𝑓 ∈ F as any Borel function satisfying

𝑓 (ˆ𝑧 ) ∈ arg max ℓ(𝑧) : k𝑧 − 𝑧ˆ k ≤ 𝛿(ˆ𝑧) ∀ˆ𝑧 ∈ Z.
𝑧 ∈Z

Such a Borel function exists thanks to (Rockafellar and Wets 2009, Corollary 14.6
and Theorem 14.37). As 𝛿 is feasible in (8.15b), this function 𝑓 satisfies
   
ˆ − 𝑍ˆ k 𝑝 ≤ E 𝛿( 𝑍)
E P̂ k 𝑓 ( 𝑍) ˆ 𝑝 ≤ 𝑟𝑝

and is thus feasible in (8.15a). Its objective function value in (8.15a) satisfies
    h i
ˆ
E P̂ ℓ( 𝑓 ( 𝑍)) ˆ + E P̂ 𝑉 𝛿( 𝑍)
= E P̂ ℓ( 𝑍) ˆ ( ˆ
𝑍) .

Hence, any feasible solution in (8.15b) gives rise to a feasible solution in (8.15a)
with the same objective function value. This proves the inequality in (8.15a).
Substituting the lower bound (8.12) on 𝑉 𝛿 (ˆ𝑧) into (8.15b) then yields the estimate
(  
  sup E P̂ k∇ℓ( 𝑍)k ˆ ∗ 𝛿( 𝑍)
ˆ − 𝐿𝛿( 𝑍)
ˆ 2
ˆ + 𝛿 ∈Δ
sup E P [ℓ(𝑍)] ≥ E P̂ ℓ( 𝑍)   (8.16)
P∈P s.t. E P̂ 𝛿( 𝑍)ˆ 𝑝 ≤ 𝑟 𝑝.
ˆ ∗ = 0 P̂-almost surely, then we have established the desired lower bound.
If k∇ℓ( 𝑍)k
From now on we may thus assume that E P̂ [k∇ℓ( 𝑍)k ˆ ∗ ] > 0. Next, we construct a

function 𝛿 ∈ Δ feasible in the maximization problem in (8.16) and use its objective
function value as a lower bound on the problem’s supremum. Specifically, we set
k∇ℓ(ˆ𝑧)k ∗𝑞−1 𝑟
𝛿★(ˆ𝑧 ) = ∀ˆ𝑧 ∈ Z,
ˆ ∗𝑞 ] 1/ 𝑝
E P̂ [k∇ℓ( 𝑍)k
which is well-defined by the integrability condition. As 𝑞 − 1 = 𝑞/𝑝, we find
     1
ˆ 𝑝 = 𝑟 𝑝 and E P̂ k∇ℓ( 𝑍)k
E P̂ 𝛿★( 𝑍) ˆ ∗ 𝛿★( 𝑍)
ˆ 𝑝 = 𝑟 · E P̂ k∇ℓ( 𝑍)k
ˆ ∗𝑞 𝑞 .

Hence, 𝛿★ is feasible in (8.16), and its objective function value amounts to


   1 E [k∇ℓ( 𝑍)kˆ ∗2𝑞−2 ]
ˆ ∗ 𝛿★( 𝑍)
E P̂ k∇ℓ( 𝑍)k ˆ − 𝐿𝛿★( 𝑍)
ˆ 2 = 𝑟 · E P̂ k∇ℓ( 𝑍)k
ˆ ∗𝑞 𝑞 − 𝐿𝑟 2 · P̂ .
ˆ ∗𝑞 ] 2/ 𝑝
E P̂ [k∇ℓ( 𝑍)k
Note that the last term is again finite thanks to the integrability condition. Substi-
tuting this expression back into (8.16) yields the desired lower bound
sup_{P∈P} E_P[ℓ(𝑍)] ≥ E_P̂[ℓ(Ẑ)] + 𝑟 · E_P̂[‖∇ℓ(Ẑ)‖∗^𝑞]^{1/𝑞} + O(𝑟²).

Step 2. By strong duality as established in Theorem 4.18, we have


 
sup_{P∈P} E_P[ℓ(𝑍)] = inf_{𝜆∈R+} 𝜆𝑟^𝑝 + E_P̂[ sup_{𝑧∈Z} ℓ(𝑧) − 𝜆‖𝑧 − Ẑ‖^𝑝 ]
  = inf_{𝜆∈R+} 𝜆𝑟^𝑝 + E_P̂[ ℓ(Ẑ) + sup_{𝛿∈R+} 𝑉_𝛿(Ẑ) − 𝜆𝛿^𝑝 ],   (8.17)

where the second equality follows from the observation that



sup_{𝑧∈Z} ℓ(𝑧) − 𝜆‖𝑧 − ẑ‖^𝑝 = sup_{𝑧∈Z} sup_{𝛿∈R+} { ℓ(𝑧) − 𝜆𝛿^𝑝 : ‖𝑧 − ẑ‖ ≤ 𝛿 } = ℓ(ẑ) + sup_{𝛿∈R+} 𝑉_𝛿(ẑ) − 𝜆𝛿^𝑝.

Next, we construct an upper bound on (8.17). In fact, we need separate constructions for 𝑝 > 2 and 𝑝 ≤ 2. Assume first that 𝑝 > 2. In this case, we have

sup_{P∈P} E_P[ℓ(𝑍)] − E_P̂[ℓ(Ẑ)]
  ≤ inf_{𝜆₁,𝜆₂∈R+} (𝜆₁ + 𝜆₂)𝑟^𝑝 + E_P̂[ sup_{𝛿∈R+} ‖∇ℓ(Ẑ)‖∗ 𝛿 + 𝐿𝛿² − (𝜆₁ + 𝜆₂)𝛿^𝑝 ]
  ≤ inf_{𝜆₁∈R+} 𝜆₁𝑟^𝑝 + E_P̂[ sup_{𝛿∈R+} ‖∇ℓ(Ẑ)‖∗ 𝛿 − 𝜆₁𝛿^𝑝 ]   (8.18a)
    + inf_{𝜆₂∈R+} 𝜆₂𝑟^𝑝 + sup_{𝛿∈R+} 𝐿𝛿² − 𝜆₂𝛿^𝑝,   (8.18b)

where the first inequality follows from the estimate (8.13), and the second inequality
holds because the supremum over 𝛿 is duplicated. The resulting upper bound on
the worst-case expected loss thus coincides with the sum of two infima. One
readily verifies that the maximization problem over 𝛿 in (8.18a) is solved by 𝛿★ = (𝑝𝜆₁)^{−𝑞/𝑝} ‖∇ℓ(Ẑ)‖∗^{𝑞/𝑝}. Thus, the infimum in (8.18a) equals
inf_{𝜆₁∈R+} 𝜆₁𝑟^𝑝 + (1/𝑞)(𝜆₁𝑝)^{−𝑞/𝑝} E_P̂[‖∇ℓ(Ẑ)‖∗^𝑞] = 𝑟 · E_P̂[‖∇ℓ(Ẑ)‖∗^𝑞]^{1/𝑞},   (8.19a)
where the equality holds because the resulting minimization problem over 𝜆₁ is solved by 𝜆★₁ = 𝑝^{−1}𝑟^{−𝑝/𝑞} E_P̂[‖∇ℓ(Ẑ)‖∗^𝑞]^{1/𝑞}. Similarly, the maximization problem
over 𝛿 in (8.18b) is solved by 𝛿★ = 𝐶₁𝜆₂^{−1/(𝑝−2)}, where 𝐶₁ represents a positive constant that only depends on 𝑝 and 𝐿. Thus, the infimum in (8.18b) equals

inf_{𝜆₂∈R+} 𝜆₂𝑟^𝑝 + 𝐶₂𝜆₂^{−2/(𝑝−2)} = 𝐶₃𝑟²,   (8.19b)

where 𝐶2 and 𝐶3 are other positive constants depending on 𝑝 and 𝐿. The equality
in (8.19b) is obtained by solving the minimization problem over 𝜆 2 in closed form.
Replacing (8.18a) with (8.19a) and (8.18b) with (8.19b) finally yields
sup_{P∈P} E_P[ℓ(𝑍)] ≤ E_P̂[ℓ(Ẑ)] + 𝑟 · E_P̂[‖∇ℓ(Ẑ)‖∗^𝑞]^{1/𝑞} + O(𝑟²).
Assume next that 𝑝 ≤ 2. In this case, we have

sup_{P∈P} E_P[ℓ(𝑍)] − E_P̂[ℓ(Ẑ)]
  ≤ inf_{𝜆₁,𝜆₂∈R+} (𝜆₁ + 𝜆₂)𝑟^𝑝 + E_P̂[ sup_{𝛿∈R+} ‖∇ℓ(Ẑ)‖∗ 𝛿 + 𝐺𝛿^𝑝 − (𝜆₁ + 𝜆₂)𝛿^𝑝 ]
  ≤ inf_{𝜆₁∈R+} 𝜆₁𝑟^𝑝 + E_P̂[ sup_{𝛿∈R+} ‖∇ℓ(Ẑ)‖∗ 𝛿 − 𝜆₁𝛿^𝑝 ]   (8.20a)
    + inf_{𝜆₂∈R+} 𝜆₂𝑟^𝑝 + sup_{𝛿∈R+} 𝐺𝛿^𝑝 − 𝜆₂𝛿^𝑝,   (8.20b)

where the first inequality follows from the estimate (8.14). Note that the infimum
in (8.20a) is identical to that in (8.18a) and thus simplifies to (8.19a). Next, note
that the maximization problem over 𝛿 in (8.20b) is unbounded unless 𝜆 2 ≥ 𝐺.
This condition thus constitutes an implicit constraint for the minimization problem
over 𝜆₂. Whenever 𝜆₂ satisfies this constraint, however, the supremum over 𝛿 evaluates to 0, and therefore the infimum over 𝜆₂ evaluates to 𝐺𝑟^𝑝. Replacing (8.20a) with (8.19a) and (8.20b) with 𝐺𝑟^𝑝 finally yields
sup_{P∈P} E_P[ℓ(𝑍)] ≤ E_P̂[ℓ(Ẑ)] + 𝑟 · E_P̂[‖∇ℓ(Ẑ)‖∗^𝑞]^{1/𝑞} + O(𝑟^𝑝).

As both O(𝑟²) and O(𝑟^𝑝) for 1 < 𝑝 ≤ 2 are of the order 𝑜(𝑟), the claim follows. □

The proof of Theorem 8.7 reveals that the variation 𝑉_𝛿(ẑ) equals ‖∇ℓ(ẑ)‖∗ 𝛿 to first order in 𝛿. Hence, it is natural to refer to the regularization term E_P̂[‖∇ℓ(Ẑ)‖∗^𝑞]^{1/𝑞} appearing in (8.10) as the total variation.
Regularizers penalizing the Lipschitz moduli, gradients, Hessians or tensors of
higher-order partial derivatives are successfully used in the adversarial training of
neural networks (Lyu, Huang and Liang 2015, Jakubovitz and Giryes 2018, Finlay
and Oberman 2021, Bai, He, Jiang and Obloj 2017) and in the stabilizing training
of generative adversarial networks (Roth, Lucchi, Nowozin and Hofmann 2017,
Nagarajan and Kolter 2017, Gulrajani, Ahmed, Arjovsky, Dumoulin and Courville
2017). However, these regularizers introduce nonconvexity into an otherwise
convex optimization problem. Theorems 8.6 and 8.7 thus suggest that the worst-
case expected loss with respect to a Wasserstein ambiguity set provides a convex
surrogate for the empirical loss with Lipschitz and/or variation regularizers.
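This regularization interpretation can be sanity-checked numerically. The following sketch (illustrative code, not taken from the survey; it assumes a hypothetical one-dimensional linear loss ℓ(𝑧) = 𝜃𝑧, for which E_P̂[‖∇ℓ(Ẑ)‖∗^𝑞]^{1/𝑞} = |𝜃| for every 𝑞) verifies that the regularized value, empirical loss plus 𝑟|𝜃|, is attained by shifting every atom of P̂ by 𝑟·sign(𝜃) and is never exceeded by random perturbations that respect the 1-Wasserstein budget:

```python
import random

random.seed(0)
theta, r = 1.7, 0.25          # hypothetical linear loss l(z) = theta * z and radius r
z_hat = [random.gauss(0.0, 1.0) for _ in range(50)]  # atoms of the empirical distribution
N = len(z_hat)

empirical = sum(theta * z for z in z_hat) / N
# For a linear loss the dual gradient norm is |theta|, so the regularized
# value is the empirical loss plus r * |theta|.
bound = empirical + r * abs(theta)

# Shifting every atom by r * sign(theta) costs exactly r in every
# p-Wasserstein distance and attains the bound.
shift = r if theta > 0 else -r
attained = sum(theta * (z + shift) for z in z_hat) / N

# Random perturbations whose average absolute shift equals r (feasible for
# the 1-Wasserstein ball) never exceed the bound.
worst_random = -float("inf")
for _ in range(1000):
    d = [random.uniform(-1.0, 1.0) for _ in range(N)]
    scale = r / (sum(abs(x) for x in d) / N)
    val = sum(theta * (z + scale * x) for z, x in zip(z_hat, d)) / N
    worst_random = max(worst_random, val)

assert abs(attained - bound) < 1e-9
assert worst_random <= bound + 1e-9
```

For linear losses the first-order regularizer is exact for every radius, consistent with the exactness results discussed after Corollary 8.15 below.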

8.3. Lipschitz Continuity of Law-Invariant Convex Risk Measures


Let 𝜚 be a law-invariant convex risk measure as introduced in Section 5. Recall that
all convex risk measures are translation invariant, monotone and convex. Assume
also that 𝜚 is an L 𝑝 -risk measure for some 𝑝 ≥ 1. By this we mean that 𝜚P [ℓ(𝑍)]
is finite whenever ℓ ∈ L 𝑝 (P) and P ∈ P(R𝑑 ), that is, whenever E P [|ℓ(𝑍)| 𝑝 ] < +∞.
The aim of this section is to derive interpretable and easily computable upper
bounds on the worst case of 𝜚P [ℓ(𝑍)] with respect to all distributions P of 𝑍 in a
𝑝-Wasserstein ball. To this end, we first recall the definition of a subgradient.
Definition 8.8 (Subgradient). If 𝜚 is a law-invariant convex L 𝑝 -risk measure for
some 𝑝 ≥ 1, then ℎ ∈ L𝑞(P) is a subgradient of 𝜚P at ℓ0 ∈ L𝑝(P) if 1/𝑝 + 1/𝑞 = 1 and
𝜚P [ℓ(𝑍)] ≥ 𝜚P [ℓ0 (𝑍)] + E P [ℎ(𝑍) · (ℓ(𝑍) − ℓ0 (𝑍))] ∀ℓ ∈ L 𝑝 (P).

We say that 𝜚P is subdifferentiable at ℓ0 if it has at least one subgradient at ℓ0 .


Definition 8.9 (Lipschitz Continuity). Let 𝜚 be a law-invariant convex L 𝑝 -risk
measure for some 𝑝 ≥ 1. Then, 𝜚 is Lipschitz continuous if there exists 𝐿 ≥ 0 with
|𝜚P[ℓ(𝑍)] − 𝜚P[ℓ0(𝑍)]| ≤ 𝐿 · E_P[|ℓ(𝑍) − ℓ0(𝑍)|^𝑝]^{1/𝑝}  ∀ℓ, ℓ0 ∈ L𝑝(P), ∀P ∈ P(R𝑑).
We use lip(𝜚) to denote the Lipschitz modulus, i.e., the smallest 𝐿 with this property.
Lemma 8.10 (Subgradient Bounds). Let 𝜚 be a law-invariant convex L 𝑝 -risk
measure and ℎ ∈ L𝑞 (P) a subgradient of 𝜚P at ℓ0 ∈ L 𝑝 (P) for some P ∈ P(R𝑑 ),
where 1/𝑝 + 1/𝑞 = 1. If 𝜚 is Lipschitz continuous, then E_P[|ℎ(𝑍)|^𝑞]^{1/𝑞} ≤ lip(𝜚).
Proof. By the Lipschitz continuity of 𝜚 and the definition of subgradients, we have
𝜚P[ℓ0(𝑍)] + lip(𝜚) · E_P[|ℓ(𝑍) − ℓ0(𝑍)|^𝑝]^{1/𝑝} ≥ 𝜚P[ℓ(𝑍)] ≥ 𝜚P[ℓ0(𝑍)] + E_P[ℎ(𝑍) · (ℓ(𝑍) − ℓ0(𝑍))]
for every ℓ ∈ L 𝑝 (P). This inequality is equivalent to
" #
ℓ(𝑍) − ℓ0 (𝑍)
lip(𝜚) ≥ sup E P ℎ(𝑍) · = E P [|ℎ(𝑍)| 𝑞 ] 1/𝑞 ,
𝑝 𝑝1
ℓ ∈L 𝑝 (P) E P [|ℓ(𝑍) − ℓ0 (𝑍)| ]
ℓ≠ℓ0

where the equality holds because the L𝑞 -norm is dual to the L 𝑝 -norm. 
The results of this section also rely on the fundamentals of comonotonicity
theory, which we review next. For any Borel measurable function 𝑓 : R𝑑 → R the
distribution function 𝐹 : R → [0, 1] of the random variable 𝑓 (𝑍) under P is defined
through 𝐹(𝜏) = P( 𝑓 (𝑍) ≤ 𝜏) for every 𝜏 ∈ R, and the corresponding (left) quantile
function 𝐹← : [0, 1] → R is defined through 𝐹←(𝑞) = inf{𝜏 ∈ R : 𝐹(𝜏) ≥ 𝑞}
for every 𝑞 ∈ [0, 1]. Note that if 𝐹 is invertible, then 𝐹 ← = 𝐹 −1 . Note also
that 𝐹 is generally right-continuous, whereas 𝐹 ← is generally left-continuous. The
definition of the quantile function 𝐹 ← also readily implies the equivalence
𝐹(𝜏) ≥ 𝑞 ⇐⇒ 𝜏 ≥ 𝐹 ← (𝑞) ∀𝜏 ∈ R, ∀𝑞 ∈ [0, 1]. (8.21)
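For an empirical distribution, the left quantile function and the equivalence (8.21) are easy to implement and test. The sketch below (illustrative code, not from the survey) builds 𝐹 and 𝐹← from samples; 𝐹← is only queried for 𝑞 ∈ (0, 1], since 𝐹←(0) = −∞:

```python
import bisect
import random

def make_cdf(samples):
    xs = sorted(samples)
    n = len(xs)
    # Right-continuous empirical distribution function F(t) = P(X <= t).
    return lambda t: bisect.bisect_right(xs, t) / n

def make_left_quantile(samples):
    xs = sorted(samples)
    n = len(xs)
    def Q(q):
        # F<-(q) = inf{t : F(t) >= q} for q in (0, 1]; for an empirical F
        # the infimum is attained at an order statistic.
        for k, x in enumerate(xs, start=1):
            if k / n >= q:
                return x
        return xs[-1]
    return Q

random.seed(1)
data = [random.choice([0.0, 1.0, 1.0, 2.5, 4.0]) for _ in range(40)]
F, Q = make_cdf(data), make_left_quantile(data)

# The equivalence (8.21): F(tau) >= q  <=>  tau >= F<-(q)  for q > 0.
for q in [i / 20 for i in range(1, 21)]:
    for tau in sorted(set(data)) + [-1.0, 0.5, 3.0, 5.0]:
        assert (F(tau) >= q) == (tau >= Q(q))
```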
Definition 8.11 (Comonotonicity). Two random variables 𝑓 (𝑍) and 𝑔(𝑍) induced
by Borel measurable functions 𝑓 , 𝑔 : R𝑑 → R are comonotonic under P if
P ( 𝑓 (𝑍) ≤ 𝜏1 ∧ 𝑔(𝑍) ≤ 𝜏2 ) = min {𝐹(𝜏1 ), 𝐺(𝜏2 )} ∀𝜏1 , 𝜏2 ∈ R,
where 𝐹 and 𝐺 denote the distribution functions of 𝑓 (𝑍) and 𝑔(𝑍) under P.
The following proposition sheds more light on Definition 8.11. It shows that
comonotonic random variables can essentially always be expressed as functions of
each other (McNeil, Frey and Embrechts 2015, Corollary 5.17).
Proposition 8.12 (Comonotonicity). Let 𝑓 (𝑍) and 𝑔(𝑍) be two random variables
with respective distribution functions 𝐹 and 𝐺 under P as in Definition 8.11. If 𝐹

is continuous, then 𝑓 (𝑍) and 𝑔(𝑍) are comonotonic under P if and only if
𝑔(𝑍) = 𝐺 ← (𝐹( 𝑓 (𝑍))) P-a.s.
Proof. Note first that 𝐹( 𝑓 (𝑍)) follows the standard uniform distribution on [0, 1]
under P. To see this, note that for any 𝑞 ∈ [0, 1] we have
P(𝐹(𝑓(𝑍)) ≤ 𝑞) = P(𝑓(𝑍) ≤ 𝐹←(𝑞)) = 𝐹(𝐹←(𝑞)) = 𝑞,

where the first two equalities follow from the definitions of 𝐹 ← and 𝐹, respectively,
while the last equality holds because 𝐹 is continuous.
Assume now that 𝑓 (𝑍) and 𝑔(𝑍) are comonotonic under P. Hence, we have
P(𝑓(𝑍) ≤ 𝜏₁ ∧ 𝑔(𝑍) ≤ 𝜏₂) = min{𝐹(𝜏₁), 𝐺(𝜏₂)}
  = P(𝐹(𝑓(𝑍)) ≤ min{𝐹(𝜏₁), 𝐺(𝜏₂)})
  = P(𝐹(𝑓(𝑍)) ≤ 𝐹(𝜏₁) ∧ 𝐹(𝑓(𝑍)) ≤ 𝐺(𝜏₂))
  = P(𝐹←(𝐹(𝑓(𝑍))) ≤ 𝜏₁ ∧ 𝐺←(𝐹(𝑓(𝑍))) ≤ 𝜏₂)


for all 𝜏1 , 𝜏2 ∈ R. Here, the second equality holds because 𝐹( 𝑓 (𝑍)) follows the
standard uniform distribution under P. The last equality holds thanks to (8.21). As
𝐹 ← (𝐹( 𝑓 (𝑍))) is P-almost surely equal to 𝑓 (𝑍), we thus have
P(𝑓(𝑍) ≤ 𝜏₁ ∧ 𝑔(𝑍) ≤ 𝜏₂) = P(𝑓(𝑍) ≤ 𝜏₁ ∧ 𝐺←(𝐹(𝑓(𝑍))) ≤ 𝜏₂)
for all 𝜏1 , 𝜏2 ∈ R. Hence, ( 𝑓 (𝑍), 𝑔(𝑍)) and ( 𝑓 (𝑍), 𝐺 ← (𝐹( 𝑓 (𝑍)))) are equal in law
under P. This implies in particular that the distribution of 𝑔(𝑍) conditional on 𝑓 (𝑍)
coincides with the distribution of 𝐺 ← (𝐹( 𝑓 (𝑍))) conditional on 𝑓 (𝑍) under P. As
the latter distribution is given by the Dirac point mass at 𝐺 ← (𝐹( 𝑓 (𝑍))), we may
conclude that 𝑔(𝑍) is P-almost surely equal to 𝐺 ← (𝐹( 𝑓 (𝑍))).
Assume now that 𝑔(𝑍) = 𝐺 ← (𝐹( 𝑓 (𝑍))) P-almost surely. Thus, we have
P(𝑓(𝑍) ≤ 𝜏₁ ∧ 𝑔(𝑍) ≤ 𝜏₂) = P(𝑓(𝑍) ≤ 𝜏₁ ∧ 𝐺←(𝐹(𝑓(𝑍))) ≤ 𝜏₂) = min{𝐹(𝜏₁), 𝐺(𝜏₂)},
where the second equality follows from the first part of the proof. 
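Proposition 8.12 suggests a direct recipe for building a comonotonic pair with prescribed marginals: sort. In the discrete sketch below (illustrative, not from the survey; the 𝑓-values are chosen distinct so that ties cannot arise), 𝑔 = 𝐺←(𝐹(𝑓(·))) amounts to assigning the 𝑘-th smallest prescribed 𝑔-value to the sample with the 𝑘-th smallest 𝑓-value, and the resulting empirical joint distribution function attains the Fréchet upper bound of Definition 8.11:

```python
import random

random.seed(2)
n = 30
f_vals = random.sample(range(1000), n)       # distinct f(Z)-values, so no ties
g_marginal = [random.gauss(0.0, 1.0) for _ in range(n)]  # prescribed marginal of g(Z)

# g = G<-(F(f(.))): give the sample with the k-th smallest f-value the
# k-th smallest prescribed g-value.
rank = {v: k for k, v in enumerate(sorted(f_vals))}
g_sorted = sorted(g_marginal)
g_vals = [g_sorted[rank[v]] for v in f_vals]

def F(t):
    return sum(1 for a in f_vals if a <= t) / n

def G(t):
    return sum(1 for b in g_vals if b <= t) / n

def joint(t1, t2):
    return sum(1 for a, b in zip(f_vals, g_vals) if a <= t1 and b <= t2) / n

# Definition 8.11: the joint cdf equals the Frechet upper bound min{F, G}.
for t1 in f_vals:
    for t2 in g_vals:
        assert joint(t1, t2) == min(F(t1), G(t2))
```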

Next, we show that the correlation of two random variables with fixed marginals
is maximal if they are comonotonic (McNeil et al. 2015, Theorem 5.25).
Theorem 8.13 (Attainable Correlations). Let 𝑓 , 𝑓 ★, 𝑔 and 𝑔★ be real-valued Borel
measurable functions on R𝑑 . Assume that, if 𝑍 is governed by P, then 𝑓 (𝑍)
and 𝑓 ★(𝑍) have the same distribution function 𝐹, whereas 𝑔(𝑍) and 𝑔★(𝑍) have
the same distribution function 𝐺. If 𝑓 ★(𝑍) and 𝑔★(𝑍) are comonotonic, then
 
E_P[𝑓(𝑍) · 𝑔(𝑍)] ≤ E_P[𝑓★(𝑍) · 𝑔★(𝑍)].
Proof. Define the joint distribution function 𝐻 : R² → [0, 1] of 𝑓(𝑍) and 𝑔(𝑍) under P via 𝐻(𝜏₁, 𝜏₂) = P(𝑓(𝑍) ≤ 𝜏₁ ∧ 𝑔(𝑍) ≤ 𝜏₂) for all 𝜏₁, 𝜏₂ ∈ R. By (McNeil et al. 2015, Lemma 5.24), the covariance of 𝑓(𝑍) and 𝑔(𝑍) under P satisfies

cov_P(𝑓(𝑍), 𝑔(𝑍)) = ∫₋∞^{+∞} ∫₋∞^{+∞} (𝐻(𝜏₁, 𝜏₂) − 𝐹(𝜏₁)𝐺(𝜏₂)) d𝜏₁ d𝜏₂.   (8.22)
In addition, by the classical Fréchet bounds for copulas (McNeil et al. 2015,
Remark 5.8), we know that 𝐻(𝜏1 , 𝜏2 ) ≤ min{𝐹(𝜏1 ), 𝐺(𝜏2 )} for all 𝜏1 , 𝜏2 ∈ R. As
the marginal distribution functions 𝐹 and 𝐺 are fixed, it is evident from (8.22)
that the covariance of the random variables 𝑓 (𝑍) and 𝑔(𝑍) is maximized if their
joint distribution function 𝐻(𝜏1 , 𝜏2 ) coincides with its Fréchet upper bound. This,
however, happens if and only if 𝑓 (𝑍) and 𝑔(𝑍) are comonotonic under P. We have
thus shown that covP ( 𝑓 (𝑍), 𝑔(𝑍)) ≤ covP ( 𝑓 ★(𝑍), 𝑔★(𝑍)), which in turn implies that
E_P[𝑓(𝑍) · 𝑔(𝑍)] = cov_P(𝑓(𝑍), 𝑔(𝑍)) + E_P[𝑓(𝑍)] · E_P[𝑔(𝑍)]
  ≤ cov_P(𝑓★(𝑍), 𝑔★(𝑍)) + E_P[𝑓★(𝑍)] · E_P[𝑔★(𝑍)]
  = E_P[𝑓★(𝑍) · 𝑔★(𝑍)].
Here, the inequality exploits the assumption that 𝑓 (𝑍) equals 𝑓 ★(𝑍) in law and
that 𝑔(𝑍) equals 𝑔★(𝑍) in law under P. Hence, the claim follows. 
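In the discrete case, Theorem 8.13 reduces to the classical rearrangement inequality: among all couplings of two given sample lists, pairing values in the same sorted order maximizes the mean product. The following sketch (illustrative, not from the survey) checks this against random permutations, each of which is a feasible coupling with the same marginals:

```python
import random

random.seed(3)
n = 200
f_vals = [random.gauss(0.0, 1.0) for _ in range(n)]
g_vals = [random.expovariate(1.0) for _ in range(n)]

# Comonotonic coupling: pair the k-th smallest f-value with the k-th
# smallest g-value; both marginals are preserved.
comono = sum(a * b for a, b in zip(sorted(f_vals), sorted(g_vals))) / n

# Every other coupling with the same marginals pairs f_vals with a
# permutation of g_vals and can only decrease the mean product.
for _ in range(500):
    perm = g_vals[:]
    random.shuffle(perm)
    assert sum(a * b for a, b in zip(f_vals, perm)) / n <= comono + 1e-12
```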
We are now ready to show that if 𝜚 is a Lipschitz continuous L 𝑝 -risk measure
and ℓ is a Lipschitz continuous loss function, then the risk 𝜚P [ℓ(𝑍)] is Lipschitz
continuous in the distribution P with respect to the 𝑝-Wasserstein distance.
Theorem 8.14 (Lipschitz Continuity of Risk Measures). If ℓ : R𝑑 → R is a
Lipschitz continuous loss function with respect to some norm k · k on R𝑑 , 𝑝 ≥ 1
and 𝜚 a Lipschitz continuous and law-invariant convex L 𝑝 -risk measure, then
|𝜚P[ℓ(𝑍)] − 𝜚P̂[ℓ(Ẑ)]| ≤ lip(𝜚) · lip(ℓ) · W𝑝(P, P̂)
for all P, P̂ ∈ P(R𝑑). Here, W𝑝 is defined with respect to ‖·‖, and 1/𝑝 + 1/𝑞 = 1.

Proof. Consider an arbitrary P ∈ P(R𝑑). By (Ruszczyński and Shapiro 2006, Corollary 3.1), 𝜚P is continuous and subdifferentiable on the whole Banach space L𝑝(P)
equipped with its norm topology. The Fenchel-Moreau theorem thus implies that
𝜚P[ℓ′(𝑍)] = sup_{ℎ′∈L𝑞(P)} E_P[ℎ′(𝑍) · ℓ′(𝑍)] − 𝜚∗P[ℎ′(𝑍)],   (8.23a)

for all ℓ ′ ∈ L 𝑝 (P), where


𝜚∗P[ℎ′(𝑍)] = sup_{ℓ′∈L𝑝(P)} E_P[ℎ′(𝑍) · ℓ′(𝑍)] − 𝜚P[ℓ′(𝑍)]   (8.23b)

for all ℎ′ ∈ L𝑞 (P) (Rockafellar 1974, Theorem 5). The relation (8.23b) defines
a law-invariant convex risk measure 𝜚∗ . Indeed, 𝜚∗ is convex because pointwise
suprema of affine functions are convex. In addition, 𝜚∗ inherits law-invariance
from 𝜚. Note that ℎ ∈ L𝑞 (P) attains the supremum in (8.23a) at ℓ ′ = ℓ if and only if
𝜚P [ℓ(𝑍)] = E P [ℎ(𝑍) · ℓ(𝑍)] − 𝜚∗P [ℎ(𝑍)]

⇐⇒ 𝜚∗P [ℎ(𝑍)] = E P [ℎ(𝑍) · ℓ(𝑍)] − 𝜚P [ℓ(𝑍)]


⇐⇒ E P [ℎ(𝑍) · ℓ ′ (𝑍)] − 𝜚P [ℓ ′ (𝑍)]
≤ E P [ℎ(𝑍) · ℓ(𝑍)] − 𝜚P [ℓ(𝑍)] ∀ℓ ′ ∈ L 𝑝 (P),
where the last equivalence follows from the definition of 𝜚∗P [ℎ(𝑍)] in (8.23b). By
rearranging terms, we then find that the last inequality is equivalent to
𝜚P [ℓ(𝑍)] + E P [ℎ(𝑍) · (ℓ ′ (𝑍) − ℓ(𝑍))] ≤ 𝜚P [ℓ ′ (𝑍)] ∀ℓ ′ ∈ L 𝑝 (P).
Thus, ℎ attains the supremum in (8.23a) at ℓ if and only if it represents a subgradient
of 𝜚P at ℓ. As 𝜚P is subdifferentiable throughout L 𝑝 (P), the above reasoning implies
that the supremum in (8.23a) is always attained.
Select now any P, P̂ ∈ P(R𝑑 ) with W 𝑝 (P, P̂) < +∞. We assume temporarily
that P and P̂ are non-atomic, that is, P(𝑍 = 𝑧) = P̂(𝑍 = 𝑧) = 0 for all 𝑧 ∈ R𝑑 .
Thus, for any admissible distribution function 𝐹 there exists a Borel measurable
function 𝑓 : R𝑑 → R such that P( 𝑓 (𝑍) ≤ 𝜏) = 𝐹(𝜏) for all 𝜏 ∈ R; see, e.g.,
(Delage, Kuhn and Wiesemann 2019, Lemma 1). Note that non-atomicity will later
be relaxed. Select now also any ℎ ∈ L𝑞 (P) that attains the supremum in (8.23a)
at ℓ ′ = ℓ, which is guaranteed to exist. The representation (8.23a) then implies that
𝜚P[ℓ(𝑍)] − 𝜚P̂[ℓ(Ẑ)] = E_P[ℎ(𝑍) · ℓ(𝑍)] − 𝜚∗P[ℎ(𝑍)] − sup_{ℎ̂∈L𝑞(P̂)} { E_P̂[ℎ̂(Ẑ) · ℓ(Ẑ)] − 𝜚∗P̂[ℎ̂(Ẑ)] }.

In the following, we use 𝐹 to denote the distribution function of ℎ(𝑍) under P and 𝐹̂ to denote the distribution function of ℓ(Ẑ) under P̂. In addition, we restrict the above maximization problem to functions ℎ̂ for which the distribution function of the random variable ℎ̂(Ẑ) coincides with 𝐹. As restricting the feasible set of a maximization problem leads to a lower bound on its optimal value, we find
𝜚P[ℓ(𝑍)] − 𝜚P̂[ℓ(Ẑ)] ≤ E_P[ℎ(𝑍) · ℓ(𝑍)] − sup_{ℎ̂∈L𝑞(P̂)} { E_P̂[ℎ̂(Ẑ) · ℓ(Ẑ)] : P̂(ℎ̂(Ẑ) ≤ 𝜏) = 𝐹(𝜏) ∀𝜏 ∈ R }   (8.24)
Here, we have exploited the law-invariance of the risk measure 𝜚∗, which implies that 𝜚∗P[ℎ(𝑍)] and 𝜚∗P̂[ℎ̂(Ẑ)] match. Next, define the function ℎ̂★ : R𝑑 → R through

ℎ̂★(ẑ) = 𝐹←(𝐹̂(ℓ(ẑ)))  ∀ẑ ∈ R𝑑.
Note that 𝐹̂ is continuous because P̂ is non-atomic and ℓ is (Lipschitz) continuous. By Proposition 8.12, the random variables ℎ̂★(Ẑ) and ℓ(Ẑ) are thus comonotonic and have distribution functions 𝐹 and 𝐹̂ under P̂, respectively. Hence, ℎ̂★ is feasible in the maximization problem in (8.24). In addition, by Theorem 8.13, ℎ̂★ is optimal.
Next, select any transportation plan 𝛾 ∈ Γ(P, P̂). As the marginal distributions

of 𝑍 and Ẑ under 𝛾 are given by P and P̂, respectively, the above implies that

𝜚P[ℓ(𝑍)] − 𝜚P̂[ℓ(Ẑ)] ≤ E_𝛾[ℎ(𝑍) · ℓ(𝑍)] − sup_{ℎ̂∈L𝑞(𝛾)} { E_𝛾[ℎ̂(𝑍, Ẑ) · ℓ(Ẑ)] : 𝛾(ℎ̂(𝑍, Ẑ) ≤ 𝜏) = 𝐹(𝜏) ∀𝜏 ∈ R }   (8.25)
Note that we have relaxed the maximization problem in (8.25) by allowing the function ℎ̂ to depend both on 𝑍 and Ẑ. However, this extra flexibility does not result in a higher optimal value. Indeed, Theorem 8.13 ensures that the supremum is attained by any function ℎ̂ for which the random variables ℎ̂(𝑍, Ẑ) and ℓ(Ẑ) are comonotonic and for which ℎ̂(𝑍, Ẑ) has distribution function 𝐹. As we have seen before, there exists a function with these properties that does not depend on 𝑍. Hence, the right to select a function ℎ̂ that depends on 𝑍 is worthless.
Observe now that the function ℎ̂(𝑍, Ẑ) = ℎ(𝑍) is feasible in (8.25). Thus, we find

𝜚P[ℓ(𝑍)] − 𝜚P̂[ℓ(Ẑ)] ≤ E_𝛾[ℎ(𝑍) · ℓ(𝑍)] − E_𝛾[ℎ(𝑍) · ℓ(Ẑ)]
  ≤ E_𝛾[ℎ(𝑍) · |ℓ(𝑍) − ℓ(Ẑ)|]
  ≤ E_𝛾[ℎ(𝑍) · lip(ℓ) · ‖𝑍 − Ẑ‖]
  ≤ lip(ℓ) · E_𝛾[‖𝑍 − Ẑ‖^𝑝]^{1/𝑝} · E_P[ℎ(𝑍)^𝑞]^{1/𝑞},
where the second inequality holds because all convex risk measures are monotonic,
which implies that the subgradient ℎ(𝑍) is P-almost surely non-negative. The third
inequality exploits the Lipschitz continuity of the loss function, and the fourth
inequality follows from Hölder’s inequality. As the resulting inequality holds for
all couplings 𝛾 ∈ Γ(P, P̂), the definition of the 𝑝-Wasserstein distance implies that
𝜚P[ℓ(𝑍)] − 𝜚P̂[ℓ(Ẑ)] ≤ lip(ℓ) · W𝑝(P, P̂) · E_P[ℎ(𝑍)^𝑞]^{1/𝑞} ≤ lip(𝜚) · lip(ℓ) · W𝑝(P, P̂),
where the second inequality follows from Lemma 8.10. The claim then follows by
interchanging the roles of P and P̂.
Recall now that we assumed P and P̂ are non-atomic. This assumption was needed
to show that the supremum in (8.24) is attained. In general, one can extend P to
a distribution P′ on R𝑑+1 under which (𝑍1 , . . . , 𝑍 𝑑 ) and 𝑍 𝑑+1 are independent and
have marginal distributions equal to P and to the uniform distribution on [0, 1],
respectively. In the same way, P̂ can be extended to a distribution P̂′ on R𝑑+1 . By
construction, P′ and P̂′ are non-atomic. As 𝜚 is law-invariant, we further have
𝜚P[ℓ(𝑍)] − 𝜚P̂[ℓ(Ẑ)] = 𝜚P′[ℓ(𝑍)] − 𝜚P̂′[ℓ(Ẑ)].

The right hand side of this equation can now be bounded as above. 

Theorem 8.14 immediately implies the following worst-case risk bound.



Corollary 8.15. If all assumptions of Theorem 8.14 hold and P = {P ∈ P(R𝑑) : W𝑝(P, P̂) ≤ 𝑟} is a 𝑝-Wasserstein ball of radius 𝑟 ≥ 0 for any 𝑝 ≥ 1, then

sup_{P∈P} 𝜚P[ℓ(𝑍)] ≤ 𝜚P̂[ℓ(𝑍)] + 𝑟 · lip(𝜚) · lip(ℓ).

Theorem 8.14 and Corollary 8.15 are due to Pichler (2013). Corollary 8.15
shows that the worst-case risk over all distributions in a 𝑝-Wasserstein ball is upper
bounded by the sum of the nominal risk and a Lipschitz regularization term for
a broad spectrum of law-invariant convex risk measures. If the loss function ℓ is
linear, that is, if ℓ(𝑧) = 𝜃 ⊤ 𝑧 for some 𝜃 ∈ R𝑑 , then this upper bound is often tight
(Pflug et al. 2012, Wozabal 2014). In this case the Lipschitz modulus of ℓ simplifies
to ‖𝜃‖∗. For example, the CVaR at level 𝛽 ∈ (0, 1] is a law-invariant convex L𝑝-risk measure, and it is Lipschitz continuous with Lipschitz modulus 𝛽^{−1/𝑝}. Thus,
Corollary 8.15 applies. From Proposition 6.20 we know, however, that the upper
bound is exact in this case. If additionally 𝑝 = 1, then Proposition 6.18 implies that
the upper bound remains exact whenever ℓ is convex and Lipschitz continuous.
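Corollary 8.15 can be probed numerically for the CVaR. The sketch below (illustrative code, not from the survey; it takes 𝑝 = 1, a 1-Lipschitz loss and a shift coupling, whose transportation cost upper-bounds W₁) confirms that the change in empirical CVaR at level 𝛽 never exceeds 𝛽⁻¹ · lip(ℓ) · W₁:

```python
import random

def cvar(losses, beta):
    # CVaR at level beta of an empirical distribution: the average of the
    # beta-fraction of largest losses (beta * len(losses) is integral here).
    k = round(beta * len(losses))
    return sum(sorted(losses)[-k:]) / k

random.seed(4)
beta, n = 0.2, 50
z = [random.gauss(0.0, 1.0) for _ in range(n)]
loss = abs                    # a 1-Lipschitz loss, lip(loss) = 1
base = cvar([loss(a) for a in z], beta)

for _ in range(200):
    d = [random.uniform(-0.3, 0.3) for _ in range(n)]
    w1 = sum(map(abs, d)) / n  # cost of the shift coupling, an upper bound on W1
    shifted = cvar([loss(a + b) for a, b in zip(z, d)], beta)
    # Theorem 8.14 with p = 1: lip(CVaR_beta) = 1 / beta.
    assert abs(shifted - base) <= w1 / beta + 1e-12
```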

9. Numerical Solution Methods for DRO Problems


The finite convex reformulations of the worst-case expectation problem (4.1) presented in Section 7 are often amenable to standard optimization software, that is, they obviate the need for tailored algorithms. However, these reformulations can have two significant drawbacks. First, the corresponding monolithic optimization problems can become large and hence challenging to solve. Second,
depending on the chosen ambiguity set, the emerging reformulations may belong
to a class of optimization problems that are more difficult to solve than a deterministic version of the original problem. For instance, even if the loss function ℓ
in the worst-case expectation (4.1) is piecewise affine and the support set Z is an
ellipsoid, the finite dual reformulation over Chebyshev ambiguity sets, as provided
by Theorem 7.9, results in a semidefinite program, as opposed to a numerically
favorable quadratically constrained quadratic program. Both disadvantages can be
alleviated by resorting to tailored algorithms, which we discuss in this section.
Most numerical methods for solving the DRO problem (1.2) address an equivalent reformulation of (1.2) obtained by dualizing the inner worst-case expectation problem. This reformulation is usually constructed by leveraging one of the strong
duality theorems from Section 4. The resulting reformulation of (1.2) is thus
representable as a semi-infinite program of the form

inf { 𝑓(𝑦) : 𝑦 ∈ Y, 𝑔𝑗(𝑦, 𝑧𝑗) ≤ 0 ∀𝑧𝑗 ∈ Z, 𝑗 ∈ [𝑚] }.   (9.1)

Note that (9.1) is naturally interpreted as a classical robust optimization problem.


As an example, assume that P is the generic moment ambiguity set (2.1) and

that some mild regularity conditions hold. In this case, Theorem 4.5 implies that

inf_{𝑥∈X} sup_{P∈P} E_P[ℓ(𝑥, 𝑍)] = inf { 𝜆₀ + 𝛿∗F(𝜆) : 𝑥 ∈ X, 𝜆₀ ∈ R, 𝜆 ∈ R𝑚, 𝜆₀ + 𝑓(𝑧)ᵀ𝜆 ≥ ℓ(𝑥, 𝑧) ∀𝑧 ∈ Z }.

If the support function 𝛿∗F(𝜆) is known in closed form, then the resulting minimization problem becomes an instance of (9.1) with 𝑦 = (𝑥, 𝜆₀, 𝜆), Y = X × R × R𝑚, 𝑓(𝑦) = 𝜆₀ + 𝛿∗F(𝜆), 𝑚 = 1 and 𝑔₁(𝑦, 𝑧₁) = ℓ(𝑥, 𝑧₁) − 𝜆₀ − 𝑓(𝑧₁)ᵀ𝜆. Alternatively, 𝛿∗F(𝜆) can be recast as the optimal value of a dual minimization problem, and the underlying decision variables can be appended to 𝑦. As another example, if P is the 𝜙-divergence ambiguity set (2.10) centered at a discrete distribution P̂ = Σ_{𝑖∈[𝑁]} 𝑝̂𝑖 𝛿𝑧̂𝑖 and if mild regularity conditions hold, then Theorem 4.14 implies that
inf_{𝑥∈X} sup_{P∈P} E_P[ℓ(𝑥, 𝑍)] = inf { 𝜆₀ + 𝜆𝑟 + Σ_{𝑖∈[𝑁]} 𝑝̂𝑖 · (𝜙∗)𝜋(ℓ(𝑥, 𝑧̂𝑖) − 𝜆₀, 𝜆) : 𝑥 ∈ X, 𝜆₀ ∈ R, 𝜆 ∈ R+, 𝜆₀ + 𝜆𝜙∞(1) ≥ ℓ(𝑥, 𝑧) ∀𝑧 ∈ Z }.

This minimization problem is readily recognized as an instance of (9.1). Note also
that if P is the restricted 𝜙-divergence ambiguity set (2.10) and P̂ is discrete, then,
under mild regularity conditions, Theorem 4.15 implies that the above reformulation remains valid provided that Z is replaced with {𝑧̂𝑖 : 𝑖 ∈ [𝑁]}. Finally, when
P is the optimal transport ambiguity set (2.27) centered at a discrete reference
distribution and if mild regularity conditions hold, then Theorem 4.18 implies that
inf_{𝑥∈X} sup_{P∈P} E_P[ℓ(𝑥, 𝑍)] = inf { 𝜆𝑟 + Σ_{𝑖∈[𝑁]} 𝑝̂𝑖 𝑠𝑖 : 𝑥 ∈ X, 𝜆 ∈ R+, 𝑠 ∈ R𝑁, ℓ(𝑥, 𝑧𝑖) − 𝜆𝑐(𝑧𝑖, 𝑧̂𝑖) ≤ 𝑠𝑖 ∀𝑧𝑖 ∈ Z, 𝑖 ∈ [𝑁] }.
This minimization problem is again an instance of (9.1).
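To illustrate how such a dual reformulation is evaluated in practice, the following brute-force sketch (a hypothetical one-dimensional instance, not from the survey: the decision is fixed, the loss is linear, ℓ(𝑧) = 𝜃𝑧, the cost is 𝑐(𝑧, ẑ) = |𝑧 − ẑ|, and crude grids discretize Z and the dual variable) minimizes 𝜆𝑟 + Σᵢ 𝑝̂ᵢ sup_z [ℓ(𝑧) − 𝜆𝑐(𝑧, 𝑧̂ᵢ)] over 𝜆 and recovers the known closed-form value E_P̂[𝜃Ẑ] + 𝑟|𝜃|:

```python
import random

random.seed(5)
theta, r = 2.0, 0.3
z_hat = [random.uniform(-1.0, 1.0) for _ in range(20)]
p_hat = [1 / len(z_hat)] * len(z_hat)

loss = lambda z: theta * z
z_grid = [-10.0 + 0.04 * k for k in range(501)]      # crude discretization of Z

def dual_obj(lam):
    # lam * r + sum_i p_i * sup_z [ loss(z) - lam * |z - z_hat_i| ]
    total = lam * r
    for p, zi in zip(p_hat, z_hat):
        total += p * max(loss(z) - lam * abs(z - zi) for z in z_grid)
    return total

best = min(dual_obj(k / 10) for k in range(1, 61))   # grid over the dual variable
closed_form = sum(p * loss(zi) for p, zi in zip(p_hat, z_hat)) + r * abs(theta)
assert abs(best - closed_form) < 1e-6
```

The inner supremum is finite only for 𝜆 ≥ |𝜃|, which is why the minimum over the 𝜆-grid is attained at 𝜆 = |𝜃| in this toy instance.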
In the remainder of this section we discuss various numerical methods for solving
the semi-infinite program (9.1). Some of these methods solve one or several
relaxations of (9.1) that enforce the uncertainty-affected constraint only for a finite
subset Z̃ of Z. Hence, these methods assume access to a scenario oracle.
Definition 9.1 (Scenario Oracle). Given any finite scenario set Z̃ ⊆ Z, a scenario
oracle outputs a solution to the scenario problem

inf { 𝑓(𝑦) : 𝑦 ∈ Y, 𝑔𝑗(𝑦, 𝑧𝑗) ≤ 0 ∀𝑧𝑗 ∈ Z̃, ∀𝑗 ∈ [𝑚] }.   (9.2)
As we will see below, cutting plane algorithms refine scenario relaxations of the
semi-infinite program (9.1) by iteratively adding those parameter realizations 𝑧 ∈
Z \ Z̃ for which the constraint violation is maximal. Identifying such realizations
requires a noise oracle as per the following definition.

Definition 9.2 (Noise Oracle). Given any fixed decision 𝑦˜ ∈ Y, a noise oracle
outputs a solution to the noise problem
sup_{𝑧∈Z} max_{𝑗∈[𝑚]} 𝑔𝑗(𝑦̃, 𝑧).   (9.3)
In the following, we first survey the scenario approach, which replaces the
semi-infinite program (9.1) with a finite scenario problem that offers stochastic
approximation guarantees. This approach calls the scenario oracle only once. We
then review cutting plane techniques that iteratively call scenario and noise oracles
to generate a solution sequence that attains the optimal value of problem (9.1),
either within finitely many iterations or asymptotically. Next, we study online
convex optimization algorithms, which do not require expensive scenario and/or
noise oracles and instead solve only deterministic versions of problem (9.1) and
use cheap first-order updates of the candidate decisions and/or incumbent worst-
case parameter realizations. We close with an overview of specialized numerical
solution methods that are tailored to specific ambiguity sets.

9.1. The Scenario Approach


The scenario approach was pioneered by De Farias and Van Roy (2004) in the
context of robust Markov decision processes and by Calafiore and Campi (2005,
2006) and Campi and Garatti (2008, 2011) in the context of generic robust optimiz-
ation problems of the form (9.1). The scenario approach replaces the semi-infinite
constraint in (9.1) with a collection of finitely many constraints corresponding to
uncertainty realizations sampled from some fixed distribution Q ∈ P(Z).
Algorithm 1: Scenario Approach
1. Initialization. Fix a distribution Q ∈ P(Z).
2. Sampling. Draw 𝑁 independent samples 𝑍1 , . . . , 𝑍 𝑁 from Q and construct
the scenario set Z̃ = {𝑍1 , . . . , 𝑍 𝑁 }.
3. Termination. Return the output 𝑌̃ of the scenario oracle (9.2) with input Z̃.
Note that, as the input to the scenario oracle (9.2) is a random scenario set
governed by the 𝑁-fold product distribution Q 𝑁 , its output 𝑌˜ is also random. Fix
now a constraint violation probability 𝛿 ∈ (0, 1), a significance level 𝜂 ∈ (0, 1), and
ensure that the sample size 𝑁 in Step 2 of Algorithm 1 satisfies 𝑁 ≥ 𝑁(𝑑 𝑦 , 𝛿, 𝜂),
where 𝑑 𝑦 is the dimension of the decision vector 𝑦 and

𝑁(𝑑𝑦, 𝛿, 𝜂) = min { 𝑁 ∈ N : Σ_{𝑖=0}^{𝑑𝑦−1} (𝑁 choose 𝑖) 𝛿^𝑖 (1 − 𝛿)^{𝑁−𝑖} ≤ 𝜂 }.
Assuming that the objective and constraint functions of problem (9.1) are convex in 𝑦 for any fixed 𝑧𝑗, 𝑗 ∈ [𝑚], and that the optimal solution to (9.2) exists and is unique for any fixed scenario set Z̃, Algorithm 1 then guarantees that

Q𝑁( Q(𝑔𝑗(𝑌̃, 𝑍) ≤ 0 ∀𝑗 ∈ [𝑚]) ≥ 1 − 𝛿 ) ≥ 1 − 𝜂,
 

where 𝑍 follows Q and 𝑌̃ is governed by Q𝑁; see (Campi and Garatti 2008, Theorem 1). In other words, the output 𝑌̃ of the scenario oracle (9.2) constitutes a feasible solution of the chance constrained program

inf { 𝑓(𝑦) : 𝑦 ∈ Y, Q(𝑔𝑗(𝑦, 𝑍) ≤ 0 ∀𝑗 ∈ [𝑚]) ≥ 1 − 𝛿 }
with probability at least 1 − 𝜂, where 𝜂 can be interpreted as the (small) chance of
poorly approximating Q in Step 2 of Algorithm 1. We emphasize that the convexity
of (9.1) plays a crucial role in the derivation of this probabilistic guarantee.
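The sample size 𝑁(𝑑𝑦, 𝛿, 𝜂) is straightforward to compute by scanning 𝑁 upwards until the binomial tail drops below 𝜂. A minimal sketch (illustrative; the parameter values are arbitrary):

```python
import math

def sample_size(d_y, delta, eta):
    # Smallest N with sum_{i=0}^{d_y-1} C(N, i) delta^i (1-delta)^(N-i) <= eta.
    N = d_y
    while True:
        tail = sum(math.comb(N, i) * delta**i * (1 - delta) ** (N - i)
                   for i in range(d_y))
        if tail <= eta:
            return N
        N += 1

# Hypothetical parameter values: d_y = 5 decision variables, constraint
# violation probability delta = 5%, significance level eta = 0.1%.
N = sample_size(5, 0.05, 1e-3)
```

The returned 𝑁 scales roughly like (𝑑𝑦 + log(1/𝜂))/𝛿, in line with the order estimate quoted below.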
Two remarks on the scenario approach are in order. First, its performance
guarantee is stochastic as it relates to a chance constrained program that relaxes
the semi-infinite program (9.1). Second, the sample size 𝑁(𝑑 𝑦 , 𝛿, 𝜂) needed for
a probabilistic guarantee is of the order Õ((𝑑 𝑦 + log(1/𝜂))/𝛿), that is, it grows
linearly with the dimension 𝑑 𝑦 of the decision vector 𝑦. This dependence limits
the problem dimensions that can be handled in practice. Robust performance
guarantees for the scenario approach have been studied by Mohajerin Esfahani,
Sutter and Lygeros (2015). The dependence of the probabilistic performance
guarantees on the dimension of the decision vector 𝑦 can be improved by using
regularization (Campi and Caré 2013), one-off calibration schemes (Caré, Garatti
and Campi 2014) and sequential validation (Calafiore, Dabbene and Tempo 2011)
or by exploiting limited support ranks (Schildbach, Fagiano and Morari 2013) and
solution-dependent numbers of support constraints (Campi and Garatti 2018). In
general, however, the number of sampled constraints may remain large, which can
be an impediment to the adoption of the scenario approach in large-scale problems.

9.2. Cutting Plane Algorithms


Mutapcic and Boyd (2009) propose an iterative method for solving the semi-infinite
program (9.1), which is based on Kelley’s cutting-plane algorithm (Kelley 1960).
Their method can be described as follows.
Algorithm 2: Cutting-Plane Algorithm
1. Initialization. Select a non-empty finite scenario set Z̃ ⊆ Z, and set the
feasibility threshold parameter 𝜀 to a small value.
2. Master Problem. Solve the scenario oracle problem (9.2) to find 𝑦˜ .
3. Sub-Problem. Solve the noise oracle problem (9.3) to find 𝑧˜.
4. Termination. If 𝑔𝑗(𝑦̃, 𝑧̃) ≤ 𝜀 for all 𝑗 ∈ [𝑚], terminate with 𝑦̃ as an 𝜀-feasible solution to (9.1). Otherwise, update Z̃ ← Z̃ ∪ {𝑧̃} and return to Step 2.
Algorithm 2 alternates between two steps. Step 2 solves the scenario oracle
problem (9.2) for a finite scenario set Z̃ and outputs a candidate solution 𝑦˜ . Step 3
then solves the noise oracle problem (9.3) for the given candidate solution 𝑦˜ and
outputs a most violated scenario 𝑧˜. If the optimal value of (9.3) exceeds 𝜀, then the
scenario 𝑧˜ is added to the scenario set Z̃ and the process is repeated. Otherwise, 𝑦˜

from Step 2 is returned as an 𝜀-feasible solution to the semi-infinite program (9.1),


that is, a solution 𝑦˜ ∈ Y that satisfies 𝑔 𝑗 ( 𝑦˜ , 𝑧 𝑗 ) ≤ 𝜀 for all 𝑧 𝑗 ∈ Z and 𝑗 ∈ [𝑚].
Cutting plane algorithms replace the sampling procedure of the scenario ap-
proach with a noise oracle, but they still require access to a scenario oracle that
solves the master problems. In contrast to the scenario approach, however, the
number of constraints in the master problem increases with each iteration, which
can limit scalability. If the constraint functions 𝑔𝑗(𝑦, 𝑧𝑗) are Lipschitz continuous in 𝑦, then Algorithm 2 terminates after O(𝜀^{−𝑑𝑦}) iterations, which however is exponential in the dimension of 𝑦 (Mutapcic and Boyd 2009, § 5.2). Despite this, in practice, cutting plane algorithms often converge to near-optimal solutions in very few iterations, which has contributed to their widespread use in robust optimization.
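The following sketch instantiates Algorithm 2 on a hypothetical toy problem, minimize (𝑦 − 2)² subject to 𝑧𝑦 − 1 ≤ 0 for all 𝑧 ∈ [0, 1], whose robust optimum is 𝑦★ = 1. Both oracles are available in closed form here, which is of course not the case in general:

```python
# Toy instance (hypothetical): minimize f(y) = (y - 2)**2 subject to
# g(y, z) = z * y - 1 <= 0 for all z in Z = [0, 1]; the binding scenario
# is z = 1, so the robust optimum is y* = 1.

def scenario_oracle(scenarios):
    # Master problem in closed form: the unconstrained minimizer y = 2,
    # clipped so that y <= 1 / z for every sampled scenario z > 0.
    ub = min((1.0 / z for z in scenarios if z > 0), default=float("inf"))
    return min(2.0, ub)

def noise_oracle(y):
    # Sub-problem: z * y - 1 is linear in z, so the maximum over [0, 1]
    # is attained at an endpoint.
    return max((0.0, 1.0), key=lambda z: z * y - 1.0)

eps, scenarios, iterations = 1e-6, [0.0], 0
while True:
    y = scenario_oracle(scenarios)
    z = noise_oracle(y)
    iterations += 1
    if z * y - 1.0 <= eps:
        break
    scenarios.append(z)

assert abs(y - 1.0) <= 1e-6
```

On this instance the loop converges in two iterations, reflecting the fast convergence typically observed in practice.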

9.3. Online Convex Optimization Algorithms


Cutting plane algorithms can become computationally expensive due to their reliance on scenario and noise oracles for the solution of the master and sub-problems,
respectively. In the following, we review how ideas from online convex optimiza-
tion can help to alleviate these shortcomings (Shalev-Shwartz 2012, Hazan 2022).
In their seminal work on this topic, Ben-Tal, Hazan, Koren and Mannor (2015b)
employ a bisection search to reduce the solution of problem (9.1) to the solution of
a sequence of robust feasibility problems of the form

inf { 0 : 𝑦 ∈ Y, 𝑓(𝑦) ≤ 𝑐, 𝑔𝑗(𝑦, 𝑧𝑗) ≤ 0 ∀𝑧𝑗 ∈ Z, ∀𝑗 ∈ [𝑚] }.   (9.4)
More precisely, the following bisection algorithm can be used to solve (9.1).
Algorithm 3: Bisection Algorithm
1. Initialization. Find an interval [𝑎, 𝑏] that contains the optimal value of (9.1),
and select an arbitrary feasible solution 𝑦˜ .
2. Decision Problem. Set 𝑐 = (𝑎 + 𝑏)/2, and check if (9.4) is feasible or not.
3. Update. If (9.4) is feasible, update 𝑦˜ to a solution of the feasibility prob-
lem (9.4), and set 𝑏 ← 𝑐; otherwise, set 𝑎 ← 𝑐.
4. Termination. If 𝑏 − 𝑎 ≤ 𝛿, terminate with 𝑦˜ as an approximately optimal
solution to (9.1). Otherwise, return to Step 2.
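A minimal sketch of Algorithm 3 on a hypothetical toy instance (minimize 𝑓(𝑦) = 𝑦 subject to 𝑦 ≥ 𝑧 for all 𝑧 ∈ [0, 1], so the robust optimum is 𝑦★ = 1, and the feasibility oracle is available in closed form):

```python
# Toy instance (hypothetical): minimize f(y) = y subject to y >= z for all
# z in Z = [0, 1]. The robust optimum is y* = 1, so the feasibility
# problem (9.4) at level c admits a solution iff c >= 1.

def robust_feasible(c):
    # Closed-form oracle: a feasible y with f(y) <= c must satisfy
    # 1 = max Z <= y <= c, so y = 1 works whenever c >= 1.
    return (1.0, True) if c >= 1.0 else (None, False)

a, b, y_best = 0.0, 2.0, 2.0          # [a, b] brackets the optimal value
while b - a > 1e-9:
    c = (a + b) / 2.0
    y, feasible = robust_feasible(c)
    if feasible:
        y_best, b = y, c
    else:
        a = c

assert abs(b - 1.0) <= 1e-9 and y_best == 1.0
```

Each halving of [𝑎, 𝑏] costs one feasibility check, giving the log(1/𝛿) factor in the combined complexity bound quoted below.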
Ben-Tal et al. (2015b) use techniques from online convex optimization to solve
the robust feasibility problem (9.4). In particular, they develop a method similar
to Algorithm 2 that approximately solves a nominal feasibility problem instead of
calling the scenario oracle and that uses a first-order update rule instead of calling
the noise oracle. Accordingly, they require the constraint functions 𝑔 𝑗 , 𝑗 ∈ [𝑚], to
be differentiable. Their algorithm can be summarized as follows.
Algorithm 4: Dual-Subgradient Meta Algorithm
1. Initialization: Choose initial uncertainty realizations 𝑧 𝑗 ∈ Z, 𝑗 ∈ [𝑚].

2. Nominal Problem: Find 𝑦˜ that solves the approximate feasibility problem



inf { 0 : 𝑓(𝑦) ≤ 𝑐, 𝑔𝑗(𝑦, 𝑧𝑗) ≤ 𝜀 ∀𝑗 ∈ [𝑚] }
corresponding to the current uncertainty realizations and corresponding to
some 𝜀 > 0. If no such 𝑦˜ exists, terminate and report that (9.4) is infeasible.
3. Update Parameters: Update 𝑧 𝑗 , 𝑗 ∈ [𝑚], using the gradient rule
𝑧 𝑗 ← ProjZ [𝑧 𝑗 + 𝜂∇𝑧 𝑔 𝑗 ( 𝑦˜ , 𝑧 𝑗 )] ∀ 𝑗 ∈ [𝑚],
where 𝜂 is a given stepsize and ProjZ denotes the Euclidean projection onto Z.
4. Termination: Once a termination condition is met, return the average of all
candidate solutions 𝑦˜ found in Step 2.
Algorithms 3 and 4 can be combined to a single algorithm that finds a 𝛿-optimal
and 𝜀-feasible solution to the semi-infinite program (9.1) in O(𝜀 −2 log(1/𝛿)) iter-
ations. This convergence guarantee holds under the following assumptions. The
feasible sets Y and Z are closed and convex, the objective function 𝑓 : Y → R is
convex and Lipschitz continuous, and the constraint functions 𝑔 𝑗 : Y × Z → R,
𝑗 ∈ [𝑚], constitute saddle functions. Specifically, 𝑔(𝑦, 𝑧 𝑗 ) is convex and Lipschitz
continuous in 𝑦 as well as concave and upper semicontinuous in 𝑧 𝑗 for every
𝑗 ∈ [𝑚]. We refer to (Ben-Tal et al. 2015b) for further implementation details.
Algorithm 4 still solves multiple nominal feasibility problems in Step 2, which
can be expensive. As a remedy, Ho-Nguyen and Kılınç-Karzan (2018) reduce the
solution of the feasibility problem (9.4) to the verification of the inequality
    min_{𝑦 ∈ Y𝑐} max_{𝑧 ∈ Z} max_{𝑗 ∈ [𝑚]} 𝑔 𝑗 (𝑦, 𝑧) ≤ 𝜀 (9.5)

for a given tolerance 𝜀 > 0, where Y𝑐 = {𝑦 ∈ Y : 𝑓 (𝑦) ≤ 𝑐}. Checking (9.5) re-
quires the solution of a saddle point problem. Note that the objective function of this
saddle point problem is convex in 𝑦 but fails to be concave in 𝑧 when 𝑚 > 1.
Therefore, standard primal-dual algorithms from online convex optimization do
not apply. Nevertheless, Ho-Nguyen and Kılınç-Karzan (2018) construct an online
algorithm that outputs a trajectory of candidate solutions 𝑦˜ and uncertainty realiza-
tions 𝑧˜ that converge to a saddle point. This method uses a first-order algorithm A 𝑦
for solving the (parametric) minimization problem min 𝑦 ∈Y𝑐 max 𝑗 ∈ [𝑚] 𝑔 𝑗 (𝑦, 𝑧) as
well as a first-order algorithm A 𝑗 for solving the (parametric) maximization prob-
lem max𝑧 ∈Z 𝑔 𝑗 (𝑦, 𝑧) for each 𝑗 ∈ [𝑚] as subroutines. Specifically, A 𝑦 is assumed
to map any history of candidate solutions 𝑦˜ 1 , . . . , 𝑦˜ 𝑡 and uncertainty realizations
𝑧˜ 1𝑗 , . . . , 𝑧˜ 𝑡𝑗 ∈ Z for 𝑗 ∈ [𝑚] and 𝑡 ∈ N to a new candidate solution 𝑦˜ 𝑡+1 such that

    Σ_{𝑡 ∈ [𝑇]} max_{𝑗 ∈ [𝑚]} 𝑔 𝑗 ( 𝑦˜ 𝑡 , 𝑧˜ 𝑡𝑗 ) − min_{𝑦 ∈ Y𝑐} Σ_{𝑡 ∈ [𝑇]} max_{𝑗 ∈ [𝑚]} 𝑔 𝑗 (𝑦, 𝑧˜ 𝑡𝑗 ) ≤ R 𝑦 (𝑇) ∀𝑇 ∈ N,

where R 𝑦 (𝑇) is a sublinear regret bound. Similarly, it is assumed that A 𝑗 maps any
history of candidate solutions and uncertainty realizations of length 𝑡 ∈ N to a new
uncertainty realization 𝑧˜ 𝑡+1𝑗 such that

    max_{𝑧 ∈ Z} Σ_{𝑡 ∈ [𝑇]} 𝑔 𝑗 ( 𝑦˜ 𝑡 , 𝑧) − Σ_{𝑡 ∈ [𝑇]} 𝑔 𝑗 ( 𝑦˜ 𝑡 , 𝑧˜ 𝑡𝑗 ) ≤ R 𝑗 (𝑇) ∀𝑇 ∈ N,

where R 𝑗 (𝑇 ) is a sublinear regret bound for every 𝑗 ∈ [𝑚]. The algorithm by
Ho-Nguyen and Kılınç-Karzan (2018) can now be summarized as follows.
Algorithm 5: Online Learning Framework
1. Initialization: Initialize the solution history to H ← ∅.
2. Find Candidate Solution: Use algorithm A 𝑦 with input H to construct a
new candidate solution, that is, set 𝑦˜ ← A 𝑦 (H).
3. Find Uncertainty Realizations: Use algorithm A 𝑗 with input H to construct
a new uncertainty realization, that is, set 𝑧˜ 𝑗 ← A 𝑗 (H) for all 𝑗 ∈ [𝑚].
4. Update History: Update the solution history to H ← H ∪ {( 𝑦˜ , (˜𝑧 𝑗 ) 𝑗 )}.
5. Termination: Once a termination condition is met, return the average of all
candidate solutions 𝑦˜ found in Step 2.
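A minimal instance of the online learning framework for 𝑚 = 1 can be run on the toy bilinear saddle problem min_{𝑦 ∈ [−1,1]} max_{𝑧 ∈ [−1,1]} 𝑔(𝑦, 𝑧) = 𝑦𝑧, whose saddle point is the origin. Here A 𝑦 is online projected subgradient descent and A 1 is projected gradient ascent, both with O(√𝑇) regret; the instance and step-size rule are illustrative, not those of Ho-Nguyen and Kılınç-Karzan (2018).

```python
import math

def online_saddle(T=5000):
    clip = lambda v: max(-1.0, min(1.0, v))
    y, z = 1.0, 1.0
    sum_y = sum_z = 0.0
    for t in range(1, T + 1):
        sum_y += y                           # Step 4: record the history
        sum_z += z
        eta = 1.0 / math.sqrt(t)
        # Steps 2-3: A_y descends in y on g(., z), A_1 ascends in z on g(y, .)
        y, z = clip(y - eta * z), clip(z + eta * y)
    y_bar, z_bar = sum_y / T, sum_z / T      # Step 5: return averaged iterates
    # For bilinear g, the saddle-point gap of the averages is |y_bar| + |z_bar|.
    return y_bar, z_bar, abs(y_bar) + abs(z_bar)

y_bar, z_bar, gap = online_saddle()
```

The individual iterates rotate around the saddle point, but the averaged iterates converge, with a gap of order (R 𝑦(𝑇) + R 1(𝑇))/𝑇.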
Ho-Nguyen and Kılınç-Karzan (2018) show that Algorithm 5 solves the saddle
point problem on the left hand side of (9.5) approximately with regret guarantee
    Σ_{𝑡 ∈ [𝑇]} ( max_{𝑧 ∈ Z} max_{𝑗 ∈ [𝑚]} 𝑔 𝑗 ( 𝑦˜ 𝑡 , 𝑧) − min_{𝑦 ∈ Y𝑐} max_{𝑗 ∈ [𝑚]} 𝑔 𝑗 (𝑦, 𝑧˜ 𝑡𝑗 ) ) ≤ R 𝑦 (𝑇) + max_{𝑗 ∈ [𝑚]} R 𝑗 (𝑇) ∀𝑇 ∈ N.

The total regret bound in the above expression grows sublinearly with 𝑇 . Un-
der the usual convexity assumptions, Algorithms 3 and 5 can be combined to a
joint algorithm that finds a 𝛿-optimal and 𝜀-feasible solution to the semi-infinite
program (9.1) in O(𝜀 −2 log(1/𝛿)) iterations. Thus, the iteration complexity did
not improve vis-à-vis the algorithm by Ben-Tal et al. (2015b). However, the
computational effort per iteration is significantly lower for Algorithm 5 than for
Algorithm 4. Indeed, Algorithm 4 solves a feasibility problem with 𝑚 uncertainty
realizations in each iteration, whereas Algorithm 5 only calls the algorithms A 𝑦
and A 𝑗 , 𝑗 ∈ [𝑚], which compute cheap first-order updates. For further details,
we refer to (Ho-Nguyen and Kılınç-Karzan 2018). In addition, Ho-Nguyen and
Kılınç-Karzan (2019) exploit structural properties of the objective and constraint
functions to reduce the overall iteration complexity to O(𝜀 −1 log(1/𝛿)).
Up until now, all the algorithms discussed in this section relied on the bisection
method to reduce the semi-infinite program (9.1) to a sequence of robust feasib-
ility problems (9.4). This introduces unnecessary computational overhead. As a
remedy, Postek and Shtern (2024) use primal-dual saddle point algorithms that ad-
dress the following perspective reformulation of problem (9.1), which was initially
introduced in (Ho-Nguyen and Kılınç-Karzan 2018, Appendix A).
    min_{𝑦 ∈ Y} max_{𝑧 ∈ Z, 𝜆 ∈ Δ𝑚} 𝑓 (𝑦) + Σ_{𝑗 ∈ [𝑚]} 𝜆 𝑗 𝑔 𝑗 (𝑦, 𝑧/𝜆 𝑗 )

Here, Δ𝑚 = {𝜆 ∈ R𝑚+ : Σ_{𝑗 ∈ [𝑚]} 𝜆 𝑗 = 1}, and 0 · 𝑔 𝑗 (𝑦, 𝑧/0) is interpreted as the
negative recession function of the convex function −𝑔 𝑗 (𝑦, ·). By construction, the
objective function of this reformulation is convex in 𝑦 and jointly concave in 𝑧 and 𝜆.
While the primal-dual saddle point algorithm of Postek and Shtern (2024) typically
enjoys an iteration complexity of O(𝜀 −2 ), where 𝜀 now represents the primal-dual
gap in the saddle point formulation, it requires more sophisticated oracles than
those used by Ho-Nguyen and Kılınç-Karzan (2018, 2019). This is primarily
because the perspective transformation eliminates favorable properties such as
strong convexity and smoothness, and it also significantly degrades the Lipschitz
constants. To address this challenge while still relying on standard oracles as
in (Ho-Nguyen and Kılınç-Karzan 2018, 2019), Tu, Chen and Yue (2024) propose
a two-layer algorithm based on the following Lagrangian formulation of (9.1).
    max_{𝜆 ∈ R𝑚+} min_{𝑦 ∈ Y} max_{𝑧 ∈ Z} 𝑓 (𝑦) + Σ_{𝑗 ∈ [𝑚]} 𝜆 𝑗 𝑔 𝑗 (𝑦, 𝑧)

Tu et al. (2024) show that their algorithm has an iteration complexity of O(𝜀 −3 ) or
O(𝜀 −2 ), depending on the smoothness of the objective and constraint functions.

9.4. Tailored Numerical Solution Methods for Specific Ambiguity Sets


With the exception of some of the online optimization algorithms, the numerical
solution methods discussed thus far still rely on general-purpose solvers to solve
auxiliary nominal, scenario, master and/or sub-problems. General-purpose solvers
are typically based on second-order interior-point methods that may fail to offer
scalability to large-scale problem instances. To alleviate this concern, several
first-order methods have been developed for specific classes of ambiguity sets.

9.4.1. Gelbrich Ambiguity Sets


Gelbrich ambiguity sets naturally emerge in signal processing and control applica-
tions. The standard reformulations of DRO problems over Gelbrich ambiguity sets,
however, constitute semidefinite programs (cf. Theorem 7.10), which significantly
limits their scalability. To circumvent this shortcoming, Shafieezadeh-Abadeh
et al. (2018) develop a Frank-Wolfe algorithm whose direction-finding subproblem
admits a quasi-closed form solution. This algorithm enjoys a sublinear conver-
gence rate. Leveraging the strong convexity of the Gelbrich ambiguity set, Nguyen
et al. (2023) improve this Frank-Wolfe algorithm to achieve a linear convergence
rate whenever the loss function satisfies the Levitin–Polyak condition (Levitin and
Polyak 1966). Using frequency-domain techniques, Kargin et al. (2024b,c) in-
troduce a Frank-Wolfe algorithm for infinite-horizon robust control problems that
involve infinite-dimensional moment matrices. Finally, McAllister and Mohajerin
Esfahani (2023) propose a Newton method for solving a class of DRO problems over
Gelbrich ambiguity sets that converges superlinearly in numerical experiments.
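These tailored methods all instantiate the classical Frank-Wolfe template, whose direction-finding subproblem is a linear minimization oracle over the feasible set. The sketch below runs the template on a toy quadratic over the probability simplex, not on the Gelbrich semidefinite reformulation itself; the instance and open-loop step size are standard but illustrative.

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, iters=2000):
    x = np.array(x0, dtype=float)
    for k in range(iters):
        s = lmo(grad(x))            # direction-finding subproblem (linear oracle)
        gamma = 2.0 / (k + 2.0)     # standard step size, sublinear O(1/k) rate
        x = (1.0 - gamma) * x + gamma * s
    return x

# Toy instance: minimize f(x) = ||x - p||^2 over the probability simplex,
# whose linear minimization oracle returns a vertex (a unit coordinate vector).
p = np.array([0.2, 0.5, 0.3])
x_opt = frank_wolfe(grad=lambda x: 2.0 * (x - p),
                    lmo=lambda g: np.eye(len(g))[np.argmin(g)],
                    x0=np.ones(3) / 3.0)
```

Since every iterate is a convex combination of oracle outputs, feasibility is maintained without projections, which is precisely what makes the template attractive when the feasible set (such as a Gelbrich ball) has a cheap linear oracle but an expensive projection.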
176 D. Kuhn, S. Shafiee, and W. Wiesemann

9.4.2. 𝜙-Divergence Ambiguity Sets


The existing literature largely focuses on DRO problems over the restricted 𝜙-
divergence ambiguity set (2.11), including the group DRO formulation intro-
duced by Sagawa, Koh, Hashimoto and Liang (2020) as a special case. Unfor-
tunately, stochastic gradient methods applied directly to the dual minimization
problem (4.12) are known to be unstable. This challenge motivated Namkoong and
Duchi (2016) to adopt a direct saddle point formulation of the DRO problem with
a discrete reference distribution P̂, which they solve iteratively with a bandit mirror
descent algorithm. Several other algorithms address the saddle point formulation,
including customized multi-level Monte Carlo methods (Levy, Carmon, Duchi and
Sidford 2020, Hu, Chen and He 2021, Hu, Wang, Chen and He 2024), acceler-
ated methods that query ball optimization oracles (Carmon and Hausler 2022),
and biased stochastic gradient methods (Ghosh, Squillante and Wollega 2021,
Wang, Gao and Xie 2024a, Azizian, Iutzeler and Malick 2023b). Gürbüzbal-
aban, Ruszczyński and Zhu (2022) and Zhu, Gürbüzbalaban and Ruszczyński
(2023) solve nonconvex DRO problems over classes of 𝜙-divergence ambiguity
sets. Specifically, Gürbüzbalaban et al. (2022) introduce a subgradient algorithm
for non-smooth and nonconvex loss functions, while Zhu et al. (2023) establish
convergence rates and finite-sample guarantees for a subgradient method targeted
at weakly convex loss functions. Both works build on the foundational results of
Ruszczyński (2021), which laid the groundwork for efficient first-order methods
for multilevel optimization problems.
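For a discrete reference distribution, the saddle-point formulations above take the generic form min_𝑥 max_𝑞 Σ_𝑖 𝑞_𝑖 ℓ_𝑖(𝑥) with 𝑞 ranging over (a subset of) the simplex. The sketch below pairs gradient descent in 𝑥 with exponentiated-gradient (mirror) ascent in 𝑞 over the entire simplex — a simplified stand-in for the cited algorithms, which additionally restrict 𝑞 to a 𝜙-divergence ball; the instance and step sizes are hypothetical.

```python
import numpy as np

def saddle_dro(z, steps=4000, eta_x=0.05, eta_q=0.05):
    # min_x max_{q in simplex} sum_i q_i * (x - z_i)^2
    z = np.asarray(z, dtype=float)
    x, q = 0.0, np.ones(len(z)) / len(z)
    xs = []
    for _ in range(steps):
        losses = (x - z) ** 2
        q = q * np.exp(eta_q * losses)        # mirror (multiplicative) ascent step
        q /= q.sum()                          # normalization = Bregman projection
        x -= eta_x * (q @ (2.0 * (x - z)))    # gradient step on the decision x
        xs.append(x)
    return float(np.mean(xs)), q

# With support {0, 1, 4} the worst case over the whole simplex is the maximum
# loss, so the minimax decision is the midpoint of the extreme points, x = 2.
x_bar, q = saddle_dro([0.0, 1.0, 4.0])
```

The adversarial weights concentrate on the extreme support points, and the averaged decision iterates settle near the minimax solution.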

9.4.3. Optimal Transport Ambiguity Sets


Li, Huang and So (2019c) develop a first-order iterative method for distributionally
robust logistic regression problems over 1-Wasserstein balls. This method is based
on a variant of the proximal alternating direction method of multipliers (ADMM).
Numerical experiments demonstrate that the proposed algorithm is several orders
of magnitude faster than general-purpose solvers. A similar conclusion is drawn by
Li, Chen and So (2020), who introduce epigraphical projection-based algorithms
to solve distributionally robust support vector machine problems. When the loss
function ℓ(𝑥, 𝑧) is either convex-concave or convex-convex in 𝑥 and 𝑧, respectively,
the reformulation of the DRO problem (1.2) reveals a structure that is conducive to
distributed implementation. Consequently, Cherukuri and Cortés (2019) use saddle
point algorithms related to the augmented Lagrangian method to solve the refor-
mulated problem over a network of agents. For convex-concave loss functions,
Li and Martínez (2020) propose a hybrid algorithm that combines Frank-Wolfe
and subgradient methods. For any fixed 𝑥 ∈ X , their approach solves the in-
ner maximization problem in (1.2) with a variant of the Frank-Wolfe algorithm.
The resulting maximizer is then used to construct an approximate subgradient for
the outer minimization problem. All of these algorithms crucially rely on the
reference distribution P̂ being discrete. Blanchet and Kang (2020) and Blanchet
et al. (2022c) propose a stochastic gradient descent algorithm to solve DRO prob-
lems over optimal transport ambiguity sets with generic reference distributions.
Other stochastic optimization schemes leverage variance reduction techniques (Yu,
Lin, Mazumdar and Jordan 2022) and zeroth-order random reshuffling algorithms
(Maheshwari, Chiu, Mazumdar, Sastry and Ratliff 2022). These works typic-
ally rely on the duality results introduced in Section 4 and subsequently apply
stochastic subgradient descent, using subgradients of the regularized loss function
ℓ𝑐 (𝑥, 𝑧ˆ) = sup_{𝑧 ∈ Z} ℓ(𝑥, 𝑧) − 𝜆𝑐(𝑧, 𝑧ˆ) with respect to 𝑥 and 𝜆. Ho-Nguyen and Wright
(2023) extend this approach to nonconvex robust binary classification problems.
Sinha et al. (2018) examine relaxed distributionally robust neural network training
problems, assuming that the required level of robustness against adversarial perturb-
ations is sufficiently small. This is tantamount to forcing 𝜆 to exceed a sufficiently
large threshold. If 𝑐(𝑧, 𝑧ˆ) = ‖𝑧 − 𝑧ˆ‖₂², this in turn ensures that the maximization
problem over 𝑧 that defines ℓ𝑐 (𝑥, 𝑧ˆ) has a strongly concave objective function and
is thus efficiently solvable. Stochastic subgradients of ℓ𝑐 (𝑥, 𝑧ˆ) are therefore readily
available thanks to Danskin’s theorem. Shafiee et al. (2023) leverage nonconvex
duality theorems, such as Toland’s duality principle, to solve distributionally robust
portfolio selection problems. Algorithms that minimize the variation-regularized
nominal loss, which is known to approximate the worst-case expected loss thanks
to Theorem 8.7, are explored by Li et al. (2022) and Bai et al. (2017). Finally, Wang
et al. (2021, 2024a) and Azizian et al. (2023b) introduce entropy and 𝜙-divergence
regularizers to improve the efficiency of algorithms for Wasserstein DRO prob-
lems, and Vincent, Azizian, Malick and Iutzeler (2024) provide a Python library
for training related distributionally robust machine learning models.
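The stochastic subgradient scheme described above can be sketched for a toy scalar loss ℓ(𝑥, 𝑧) = (𝑥 − 𝑧)²/2 with cost 𝑐(𝑧, 𝑧ˆ) = (𝑧 − 𝑧ˆ)²: for 𝜆 > 1/2 the inner problem is strongly concave, so plain gradient ascent finds its maximizer, and Danskin's theorem yields the outer subgradient. The data, step sizes, and value of 𝜆 below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(1.0, 1.0, size=200)      # training samples z_hat from P0
lam = 2.0                                  # above the strong-concavity threshold 1/2

def inner_max(x, z_hat, steps=50, eta=0.1):
    # Solve sup_z (x - z)**2 / 2 - lam * (z - z_hat)**2 by gradient ascent;
    # the objective is strongly concave in z because lam > 1/2.
    z = z_hat
    for _ in range(steps):
        z += eta * ((z - x) - 2.0 * lam * (z - z_hat))
    return z

# Outer loop: stochastic subgradient descent on x; by Danskin's theorem the
# subgradient of ell_c(x, z_hat) with respect to x is (x - z_star).
x = 0.0
for t in range(2000):
    z_hat = data[t % len(data)]
    z_star = inner_max(x, z_hat)
    x -= 0.01 * (x - z_star)
```

For this instance the inner maximizer is available in closed form, 𝑧* = (2𝜆𝑧ˆ − 𝑥)/(2𝜆 − 1), which makes the convergence of the inner ascent easy to verify.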

10. Statistical Guarantees


Despite ample empirical evidence that distributionally robust decisions can out-
perform those provided by alternative methodologies for decision-making under
uncertainty, the statistical properties of DRO remain underexplored. This section
aims to survey some of the key techniques and methods employed in the literature
to analyze the statistical aspects of DRO, while at the same time acknowledging
that numerous questions remain open in this domain.
The statistical guarantees of moment-based ambiguity sets are relatively weak
in the sense that the optimal value of problem (1.2) under a moment-based am-
biguity set P does not match the optimal value of the corresponding stochastic
program (1.1) under the unknown true distribution P0 even if P0 were known
exactly when the ambiguity set P is constructed. The reason for this is that exact knowledge of lower-
order moments of P0 is not sufficient to uniquely characterize P0 itself. For this
reason, our review focuses on 𝜙-divergence and optimal transport ambiguity sets,
which offer asymptotic consistency guarantees as the number of samples available
from P0 grows, and we refer to Delage and Ye (2010) and Nguyen et al. (2021) for
statistical analyses of Chebyshev and Gelbrich ambiguity sets, respectively.
Section 10.1 introduces the data-driven optimization framework that we will
be interested in, as well as the two key performance criteria of excess risk and
out-of-sample disappointment. Subsequently, Section 10.2 surveys asymptotic
analyses, which are based on the laws of large numbers, the central limit theorem,
the empirical likelihood approach as well as the large and moderate deviations
principles. Finally, Section 10.3 reviews non-asymptotic analyses, which rely on
measure concentration bounds as well as generalization bounds.
Our review of the statistical properties of DRO omits several important topics.
For example, we do not cover domain adaptation guarantees (Farnia and Tse 2016,
Volpi, Namkoong, Sener, Duchi, Murino and Savarese 2018, Lee and Raginsky
2018, Lee, Park and Shin 2020, Sutter, Krause and Kuhn 2021, Taşkesen, Yue,
Blanchet, Kuhn and Nguyen 2021, Rychener et al. 2024), which ensure that a DRO
model trained on data from some source distribution generalizes to a different target
distribution. We also omit discussions of adversarial generalization bounds (Sinha
et al. 2018, Wang et al. 2019, Tu, Zhang and Tao 2019, Kwon, Kim, Won and Paik
2020, An and Gao 2021), which use DRO to analyze model robustness against
adversarial perturbations, as well as applications in high-dimensional statistical
learning (Aolaritei, Shafiee and Dörfler 2022b). Finally, we do not cover Bayesian
guarantees (Gupta 2019, Shapiro, Zhou and Lin 2023, Liu, Su and Xu 2024b),
which focus on average-case rather than worst-case performance guarantees.

10.1. Excess Risk and Out-of-Sample Disappointment


Consider the idealized scenario in which the uncertainty underlying a decision
problem follows a known probability distribution P0 ∈ P(Z). In this case, we aim
to determine a decision 𝑥0 that minimizes the expected value of a loss function
ℓ : X × Z → R with respect to P0 . That is, we seek an element 𝑥0 of
    X0 = arg min_{𝑥 ∈ X} E_{P0} [ℓ(𝑥, 𝑍)]. (10.1)

Note that problem (10.1) constitutes a classical stochastic program. While (10.1) is
theoretically sound, it faces two significant practical limitations. First, the distribu-
tion P0 underlying a decision problem is rarely known in practice. Second, even if
P0 was known, evaluating the objective function of (10.1) requires the computation
of an integral, which is intractable in high dimensions even for simple nonlinear
loss functions (Dyer and Stougie 2006, Hanasusanto, Kuhn and Wiesemann 2016).
In practice, we often observe the true probability distribution P0 indirectly
through historical data. From now on we thus assume to have access to 𝑁 inde-
pendent training samples from P0 , denoted as 𝑍1 , . . . , 𝑍 𝑁 . The goal of data-driven
optimization is to construct a decision from the training samples. This decision
should perform well not just on the training data, but also on unseen test samples
from P0 . The performance of a data-driven decision on test data is also called its
out-of-sample performance. Formally, data-driven optimization aims to learn a de-
cision rule T 𝑁 : Z 𝑁 ⇒ X that maps training samples from the product space Z 𝑁
to a set of candidate decisions in the decision space X . Note that T 𝑁 constitutes a
set-valued mapping because it is usually constructed as the set of minimizers of an
optimization problem depending on the training samples. A data-driven decision
is then any (Borel measurable) function 𝑋ˆ 𝑁 of the training samples that satisfies
𝑋ˆ 𝑁 ∈ T 𝑁 (𝑍1 , . . . , 𝑍 𝑁 ).
Note that 𝑋ˆ 𝑁 inherits the randomness of the training samples and is therefore a
random vector. However, we notationally suppress its dependence on the training
samples in order to avoid clutter. Instead, we use the superscript ‘ˆ’ together with
the subscript ‘𝑁’ to designate any random objects that are defined as functions of
𝑍1 , . . . , 𝑍 𝑁 and are thus governed by the product distribution P0𝑁 .
Arguably the simplest approach to data-driven optimization is the sample average
approximation (SAA), which is also known as empirical risk minimization in
statistics. The idea of SAA is to replace the unobservable true distribution P0
in (10.1) with the observable empirical distribution
    P̂ 𝑁 = (1/𝑁) Σ_{𝑖 ∈ [𝑁]} 𝛿 𝑍𝑖 (10.2)

formed from the training samples 𝑍1 , . . . , 𝑍 𝑁 and to construct the decision rule
    T 𝑁 (𝑍1 , . . . , 𝑍 𝑁 ) = arg min_{𝑥 ∈ X} E_{P̂ 𝑁} [ℓ(𝑥, 𝑍)]. (10.3)
As the empirical distribution is discrete, the SAA approach obviates the need to
evaluate high-dimensional integrals and is thus computationally attractive. Never-
theless, the performance of its optimal solutions on test data can be disappointing
even when the test data are independently sampled from the true distribution P0 .
This phenomenon has been observed across various application domains and has
been given different names depending on the context. In finance, Michaud (1989)
identifies this issue as the error maximization effect of portfolio optimization.
Statistics and machine learning recognize it as overfitting, a well-known challenge
where models perform well on training data but fail to generalize to new, unseen
test data. In the stochastic programming literature, Shapiro (2003) refers to this
phenomenon as the optimization bias, and in decision analysis the effect has been
described as the optimizer’s curse (Smith and Winkler 2006).
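A small Monte Carlo experiment (with illustrative numbers) makes the optimizer's curse visible: among 𝐾 decisions with identical true expected loss, the empirically best one looks strictly better in sample than it performs out of sample.

```python
import numpy as np

rng = np.random.default_rng(1)
K, N, trials = 20, 30, 2000
in_sample, out_sample = [], []
for _ in range(trials):
    # N i.i.d. loss observations for each of K decisions, all with true mean 0
    losses = rng.normal(0.0, 1.0, size=(N, K))
    emp = losses.mean(axis=0)
    k = int(emp.argmin())            # SAA picks the empirically best decision
    in_sample.append(emp[k])         # optimistically biased in-sample estimate
    out_sample.append(0.0)           # the true expected loss of any decision
gap = float(np.mean(out_sample) - np.mean(in_sample))   # optimization bias
```

The gap is strictly positive: selecting the minimizer of noisy estimates systematically understates the true loss, which is exactly the bias that regularization and DRO aim to counteract.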
The disappointing out-of-sample performance of the SAA decisions prompted
statisticians and machine learners to add a regularization term to the objective
function in (10.3). The regularization term serves two purposes. It not only
combats overfitting to the training data, but it also encourages simpler decisions.
Such simplicity aligns with the principle of parsimony and reflects nature’s inherent
tendency towards simplicity. As Jeffreys and Wrinch (1921) aptly noted,
“The existence of simple laws is, then, apparently, to be regarded as a quality
of nature; and accordingly we may infer that it is justifiable to prefer a simple
law to a more complex one that fits our observations slightly better.”
Formally, the regularized SAA approach provides the decision rule


    T 𝑁 (𝑍1 , . . . , 𝑍 𝑁 ) = arg min_{𝑥 ∈ X} E_{P̂ 𝑁} [ℓ(𝑥, 𝑍)] + 𝑅(𝑥),
where the regularization function 𝑅 : X → R+ penalizes the complexity of de-
cision 𝑥. In the classical statistics literature, the regularization function is mostly
data independent, that is, it only depends on the decision 𝑥 and not on the observed
training data 𝑍1 , . . . , 𝑍 𝑁 . The most prominent examples include norm regulariza-
tion, where 𝑅(𝑥) = k𝑥 k, and Tikhonov regularization, where 𝑅(𝑥) = k𝑥 k 2 . These
regularization techniques balance the conflicting goals of computing decisions that
are optimal for the observed training data and maintaining model simplicity, thereby
improving the generalization capability of the derived decisions to unseen data.
Recall from Sections 6 and 8 that regularization and distributional robustness are
closely intertwined. Assume that we use the empirical distribution P̂ 𝑁 as the center
of a 𝜙-divergence ambiguity set (2.10) or optimal transport ambiguity set (2.27).
Then, the DRO approach provides the decision rule
    T 𝑁 (𝑍1 , . . . , 𝑍 𝑁 ) = arg min_{𝑥 ∈ X} sup_{P ∈ P̂ 𝑁} E_P [ℓ(𝑥, 𝑍)],

which can be viewed as a variant of the regularized SAA decision rule. The cor-
responding data-dependent regularization function is called the DRO regularizer
and is given by
    𝑅ˆ 𝑁 (𝑥) = sup_{P ∈ P̂ 𝑁} E_P [ℓ(𝑥, 𝑍)] − E_{P̂ 𝑁} [ℓ(𝑥, 𝑍)]. (10.4)

Thus, it depends on both the decision 𝑥 and the observed training data 𝑍1 , . . . , 𝑍 𝑁 .
The regularizer (10.4) quantifies how much the worst-case expected loss across all
distributions P ∈ P̂ 𝑁 can exceed the in-sample expected loss E P̂ 𝑁 [ℓ(𝑥, 𝑍)].
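The DRO regularizer (10.4) admits a closed form in simple cases. For the scalar linear loss ℓ(𝑥, 𝑧) = 𝑥𝑧 over a 1-Wasserstein ball with absolute-value transport cost, the worst-case expectation equals the empirical expectation plus 𝑟|𝑥| (the Lipschitz-regularization phenomenon of Sections 6 and 8), so 𝑅ˆ 𝑁(𝑥) = 𝑟|𝑥|. The sketch below checks this numerically; the data and the values of 𝑟 and 𝑥 are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
z_hat = rng.normal(size=50)          # empirical support points (illustrative)
r, x = 0.5, 1.7                      # Wasserstein radius and a fixed decision

# Closed form for l(x, z) = x * z over the 1-Wasserstein ball on R:
#     sup_P E_P[x Z] = x * mean(z_hat) + r * |x|,
# so the DRO regularizer (10.4) equals r * |x|.
worst_case = x * z_hat.mean() + r * abs(x)
dro_regularizer = worst_case - x * z_hat.mean()

# The bound is attained by shifting every support point by r * sign(x) ...
attained = x * (z_hat + r * np.sign(x)).mean()

# ... and no point shift with average movement <= r can exceed it.
def perturbed_value(delta):
    return x * (z_hat + delta).mean()
```

The sanity check on perturbations is one-sided (it only considers shifts of the support points), but for linear losses those shifts are exactly the worst-case transport plans.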
The performance of decision rules in data-driven optimization is primarily meas-
ured by two criteria, each of which is aligned with a different field of study and
addresses a different set of practical concerns. The first criterion, excess risk, is pre-
dominantly used in statistics. It quantifies the distance of a data-driven decision 𝑋ˆ 𝑁
to an optimal decision 𝑥0 . The second criterion, out-of-sample disappointment, is
more commonly employed in operations research. It provides a measure of how
much the out-of-sample risk of a data-driven decision 𝑋ˆ 𝑁 exceeds the in-sample
risk of 𝑋ˆ 𝑁 . In the following, we formally define both criteria.
Excess Risk. Let 𝜂 ∈ (0, 1) be a significance level, T 𝑁 be a decision rule, and
Δ : X × X0 → R+ be a performance function. Suppose that 𝑋ˆ 𝑁 ∈ T 𝑁 (𝑍1 , . . . , 𝑍 𝑁 ).
The excess risk criterion offers the guarantee that for any size 𝑁 ≥ 𝑁(X , Z, 𝜂) of
the training set, we have
    P0𝑁 [Δ( 𝑋ˆ 𝑁 , 𝑥0 ) ≤ 𝛿ˆ 𝑁 ] ≥ 1 − 𝜂 (10.5)
for some (possibly data-dependent) error certificate 𝛿ˆ 𝑁 . In statistical learning
theory, performance functions often measure the regret in terms of the loss function
ℓ under the true distribution P0 . Specifically, for any feasible candidate decisions
𝑥 ∈ X and any optimal decision 𝑥0 ∈ X0 , the regret takes the form
    Δ(𝑥, 𝑥0 ) = E_{P0} [ℓ(𝑥, 𝑍)] − E_{P0} [ℓ(𝑥0 , 𝑍)] = E_{P0} [ℓ(𝑥, 𝑍)] − min_{𝑥′ ∈ X} E_{P0} [ℓ(𝑥′, 𝑍)] ≥ 0.
In compressed sensing and M-estimation problems with linear models, performance
is often defined as the estimation error in the decision space, and it takes the form
    Δ(𝑥, 𝑥0 ) = ‖𝑥 − 𝑥0 ‖₂².
Here, we assume for simplicity that the minimizer 𝑥0 is unique. We refer to (Mendel-
son 2003, Bousquet, Boucheron and Lugosi 2004) for an introduction to statistical
learning theory. For more advanced treatments, we refer to (Anthony and Bart-
lett 1999, Koltchinskii 2011, Vapnik 2013, Shalev-Shwartz and Ben-David 2014,
Vershynin 2018, Wainwright 2019).
Out-of-Sample Disappointment. Let 𝜂 ∈ (0, 1) be a significance level and T 𝑁
be a decision rule. Suppose that 𝑋ˆ 𝑁 ∈ T 𝑁 (𝑍1 , . . . , 𝑍 𝑁 ). The out-of-sample
disappointment criterion offers the guarantee that for any size 𝑁 ≥ 𝑁(X , Z, 𝜂) of
the training set, we have
 
    P0𝑁 [ E_{P0} [ℓ( 𝑋ˆ 𝑁 , 𝑍)] ≤ 𝐿ˆ 𝑁 ] ≥ 1 − 𝜂 (10.6)
for some (possibly data-dependent) loss certificate 𝐿ˆ 𝑁 . Alternatively, one can
express (10.6) as a probabilistic bound on the difference between the out-of-sample
performance and the in-sample performance,
 
    P0𝑁 [ E_{P0} [ℓ( 𝑋ˆ 𝑁 , 𝑍)] − E_{P̂ 𝑁} [ℓ( 𝑋ˆ 𝑁 , 𝑍)] ≤ 𝛿ˆ 𝑁 ] ≥ 1 − 𝜂,
for some error certificate 𝛿ˆ 𝑁 . Both criteria become equivalent when we set
𝐿ˆ 𝑁 = E_{P̂ 𝑁} [ℓ( 𝑋ˆ 𝑁 , 𝑍)] + 𝛿ˆ 𝑁 . Unlike the excess risk bound (10.5), the out-of-sample
disappointment bound (10.6) does not require explicit knowledge of an optimal
decision 𝑥0 and solely leverages the statistical properties of P0 . As we will see in
the following sections, 𝐿ˆ 𝑁 and 𝛿ˆ 𝑁 typically correspond to the optimal value of the
DRO problem (1.2) and the DRO regularizer (10.4), respectively.
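A Monte Carlo sketch illustrates the bound (10.6) in its simplest instance: for the 1-Lipschitz loss ℓ(𝑧) = 𝑧 under P0 = N(0, 1), the DRO value over a 1-Wasserstein ball is the empirical mean plus the radius, which serves as the certificate 𝐿ˆ 𝑁. The radius choice 𝑟_𝑁 = 𝑁^{−1/2} and all numbers below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
N, trials = 50, 5000
r_N = 1.0 / np.sqrt(N)               # radius at the critical N**(-1/2) scaling
disappoint_dro = disappoint_saa = 0
for _ in range(trials):
    sample = rng.normal(0.0, 1.0, size=N)   # training data from P0 = N(0, 1)
    # For l(z) = z (1-Lipschitz), the 1-Wasserstein DRO value is mean + r_N.
    L_dro = sample.mean() + r_N
    L_saa = sample.mean()                   # SAA certificate for comparison
    disappoint_dro += (0.0 > L_dro)         # true loss E_P0[Z] = 0 exceeds cert.
    disappoint_saa += (0.0 > L_saa)
rate_dro = disappoint_dro / trials
rate_saa = disappoint_saa / trials
```

The SAA value disappoints roughly half the time, while inflating the certificate by 𝑟_𝑁 pushes the disappointment probability down to roughly the Gaussian tail Φ(−1) ≈ 16%; larger radii would push it lower still, at the price of conservatism.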
The next sections focus on ambiguity sets that are centered at the empirical dis-
tribution P̂ 𝑁 defined in (10.2). Specifically, we consider ambiguity sets constructed
using a discrepancy measure D : P(Z) × P(Z) → [0, ∞]:
P̂ 𝑁 = {P ∈ P(Z) : D(P, P̂ 𝑁 ) ≤ 𝑟 𝑁 }. (10.7)
The discrepancy measure D could be a 𝜙-divergence or a Wasserstein distance.
We will explain how the radius 𝑟 𝑁 should scale with the training sample size 𝑁 to
obtain the least conservative statistical guarantees.

10.2. Asymptotic Analyses


The laws of large numbers and the central limit theorem provide foundational
insights into the statistical properties of the SAA approach. Under appropriate
regularity conditions, the laws of large numbers guarantee that the empirical loss
E P̂ 𝑁 [ℓ(𝑥, 𝑍)] converges P0 -almost surely to the true expected loss E P0 [ℓ(𝑥, 𝑍)],
uniformly on X (see, e.g., Shapiro et al. 2009, § 7.2.5). This implies that the optimal
value and the set of optimal solutions of the SAA problem exhibit asymptotic
consistency, that is, they both converge to their counterparts in the stochastic
program under P0 as the sample size 𝑁 approaches infinity. The central limit
theorem, on the other hand, implies that the scaled difference between the
empirical loss (under P̂ 𝑁 ) and true expected loss (under P0 ) converges weakly to a
normal distribution with mean zero and variance equal to the true variance of the
loss under P0 (see, e.g., Shapiro et al. 2009, § 5.1.2). Thus, the optimal value of
the SAA problem also exhibits asymptotic normality. The asymptotic properties
of the SAA decision rule have been studied extensively, see, e.g., (Cramér 1946,
Huber 1967, Dupacová and Wets 1988, Shapiro 1989, 1990, 1991, 1993, King and
Wets 1991, King and Rockafellar 1993, Van der Vaart 1998, Lam 2021).
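The asymptotic normality of the SAA optimal value can be visualized for the toy loss ℓ(𝑥, 𝑧) = (𝑥 − 𝑧)² under P0 = N(0, 1): the SAA optimal value is the empirical variance, the true optimal value is 1, and the central limit theorem predicts that √𝑁 times the error converges to N(0, 2). The sample sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
N, trials = 2000, 3000
scaled = []
for _ in range(trials):
    z = rng.normal(size=N)                 # samples from P0 = N(0, 1)
    # SAA of min_x E[(x - Z)^2]: minimizer x = z.mean(), value = z.var(),
    # while the true optimal value is Var(Z) = 1.
    scaled.append(np.sqrt(N) * (z.var() - 1.0))
scaled = np.array(scaled)
# CLT prediction: scaled errors are asymptotically N(0, Var(Z^2)) = N(0, 2).
```

The empirical mean of the scaled errors is near zero (up to an O(𝑁^{−1/2}) bias from estimating the mean) and their standard deviation is near √2, matching the limit distribution.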
Building on these foundations, we will next review the asymptotic consistency
and normality of DRO decision rules. While studying these asymptotic behavi-
ors, different theoretical frameworks provide distinct insights. The central limit
theorem and empirical likelihood approaches characterize the typical fluctuations
around the mean under an appropriate scaling. The central limit theorem establishes
Gaussian convergence, whereas the empirical likelihood theory provides a nonpara-
metric framework for constructing likelihood ratio tests with asymptotic 𝜒2 -limits,
enabling hypothesis testing without specific parametric assumptions. In contrast,
large deviations theory examines the tail behavior of distribution sequences. Rather
than focusing on typical fluctuations, it characterizes the exponential decay rate of
probabilities associated with rare events far from the mean. Moderate deviations
theory bridges the gap between the typical and rare event analyses provided by the
aforementioned frameworks. It studies the asymptotic behavior of distribution se-
quences at intermediate scales, thus investigating larger deviations than the central
limit theorem but smaller deviations than large deviations theory.

10.2.1. Asymptotic Consistency and Normality


Lam (2019, Theorem 6) establishes the asymptotic (uniform strong) consistency of
the optimal value of DRO decision rules over likelihood ambiguity sets. The proof
relies on the preservation theorem of Glivenko-Cantelli classes (Van Der Vaart and
Wellner 2000, Theorem 3), which intuitively says that function classes maintain
their uniform convergence properties when combined through continuous opera-
tions, assuming that the original classes are well-behaved. Duchi et al. (2021,
Theorem 6) extend the analysis to more general 𝜙-divergence ambiguity sets.
Mohajerin Esfahani and Kuhn (2018, Theorem 3.6) establish the asymptotic con-
sistency of the optimal value and the optimal solutions of DRO decision rules over
1-Wasserstein balls of the form (10.7). Their proof combines the Borel–Cantelli
Lemma (Kallenberg 1997, Theorem 2.18) with measure concentration results by
Fournier and Guillin (2015, Theorem 2). Intuitively, the Borel–Cantelli lemma
asserts that if probabilities of an infinite sequence of events (E 𝑁 ) 𝑁 ∈N have a finite
sum, then the probability of infinitely many occurrences of these events is zero.
Leveraging its contraposition, Mohajerin Esfahani and Kuhn (2018) consider the
events E 𝑁 = {W1 (P0 , P̂ 𝑁 ) ≤ 𝑟 𝑁 }, where P̂ 𝑁 is the empirical distribution over 𝑁
independent samples from P0 ; see (10.2). By selecting a converging sequence of
radii (𝑟 𝑁 ) 𝑁 ∈N that decay according to a scaling law informed by (Fournier and
Guillin 2015, Theorem 2), they prove that P0∞ (lim 𝑁 →∞ W1 (P0 , P̂ 𝑁 ) = 0) = 1. This
enables them to show that the optimal value of the DRO problem (1.2) over the
1-Wasserstein ball (10.7) converges asymptotically from above to the optimal value
of the stochastic program (10.1). They also establish asymptotic convergence of
the optimal solutions under an additional continuity assumption. This result can be
extended to general 𝑝-Wasserstein ambiguity sets (Kuhn et al. 2019, Theorem 20).
Similar asymptotic convergence results have been established by Gao et al. (2024,
Proposition 1), albeit through a different approach. Their proof does not rely
on measure concentration results or an explicit characterization of the radius 𝑟 𝑁 .
Instead, it leverages Theorem 4.18 together with the reverse Fatou lemma and
the monotone convergence theorem. This approach, however, does not explicitly
determine whether the asymptotic convergence occurs from above or below.
Lam (2019, Theorem 4) establishes the asymptotic normality of the optimal
values of DRO problems over likelihood ambiguity sets. In a similar fashion,
Duchi and Namkoong (2019, Theorem 10) establish the asymptotic normality of
the optimal solutions of DRO problems over Pearson 𝜒2 -divergence ambiguity
sets. Duchi and Namkoong (2021, Theorem 11) extend this result to Cressie-
Read ambiguity sets. The asymptotic normality of DRO decision rules over 𝑝-
Wasserstein balls, finally, is established by Blanchet et al. (2019b, 2022a,b).
More recently, Blanchet and Shapiro (2023) have developed a comprehensive
framework for analyzing statistical limit theorems for DRO decision rules over both
𝜙-divergences and Wasserstein ambiguity sets of the form (10.7). By connecting
data-driven DRO formulations to their regularized counterparts (cf. Section 8),
their framework provides insights into how DRO decision rules behave depending
on the rate at which the radius 𝑟 𝑁 decreases with the sample size 𝑁. Specifically,
Blanchet and Shapiro (2023, § 2.2) show that, under suitable regularity conditions,
DRO formulations typically exhibit three distinct asymptotic behaviors.
(i) When 𝑟 𝑁 decreases faster than the critical statistical rate of 𝑁 −1/2 , the DRO
effect becomes negligible compared to the sampling error, and the asymptotic
behavior of DRO mirrors that of standard empirical risk minimization.
(ii) When 𝑟 𝑁 decreases at precisely the critical rate 𝑁 −1/2 , the DRO effect mani-
fests itself as a quantifiable asymptotic bias term that acts as a regularizer,
and its interaction with the statistical noise results in a shifted normal limiting
distribution.
(iii) When 𝑟 𝑁 decreases slower than 𝑁 −1/2 , the DRO effect dominates the statist-
ical noise, leading to a limiting behavior governed primarily by the geometry
of the ambiguity set.
184 D. Kuhn, S. Shafiee, and W. Wiesemann
The analysis employs the functional central limit theorem alongside careful Taylor
expansions of the worst-case expectation akin to those presented in Section 8. In
particular, the authors establish that, under appropriate regularity conditions, the
limiting distributions are normal with explicitly characterized means and variances.
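These three regimes can be visualized with a small simulation (ours, purely illustrative, not taken from the cited works). For the 1-Lipschitz loss ℓ(x, z) = z on the real line, the worst-case expectation over a 1-Wasserstein ball of radius r_N around the empirical distribution equals the empirical mean plus r_N, so the scaled error √N (DRO value − true mean) splits into an asymptotically normal noise term and the deterministic shift √N r_N, which vanishes, stabilizes, or diverges in regimes (i)–(iii), respectively:

```python
import math
import random

random.seed(0)

def dro_value(sample, r):
    # Worst-case expectation of ell(z) = z over a 1-Wasserstein ball of
    # radius r around the empirical distribution of `sample`; for this
    # 1-Lipschitz loss on the real line it equals the empirical mean + r.
    return sum(sample) / len(sample) + r

mu = 0.0  # true mean of P0 = N(0, 1)
regimes = {"(i)   r_N = N^-1  ": lambda N: N ** -1.0,
           "(ii)  r_N = N^-1/2": lambda N: N ** -0.5,
           "(iii) r_N = N^-1/4": lambda N: N ** -0.25}

for label, radius in regimes.items():
    for N in (100, 10_000):
        sample = [random.gauss(mu, 1.0) for _ in range(N)]
        shift = math.sqrt(N) * radius(N)  # deterministic DRO bias
        error = math.sqrt(N) * (dro_value(sample, radius(N)) - mu)
        print(f"{label}  N={N:6d}  sqrt(N)*r_N={shift:8.3f}  scaled error={error:8.3f}")
```

The noise term fluctuates like a standard normal in all three cases; only the deterministic shift distinguishes the regimes.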
10.2.2. Empirical Likelihood Approach
The (generalized) empirical likelihood theory introduced by Owen (1988, 1990,
1991, 2001) provides a powerful nonparametric analogue to parametric maximum
likelihood theory. At its core, the empirical distribution P̂ 𝑁 serves as a nonparametric maximum likelihood estimator for the unknown true distribution P0 , and statistical inference relies on empirical likelihood ratios. Under suitable conditions, the empirical likelihood ratio statistic converges to a 𝜒2-distribution. Unlike the central
limit theorem, which yields normal approximations and thus symmetric confidence
intervals, the empirical likelihood theory typically produces asymmetric confidence
regions. A key advantage of this approach is that the resulting data-driven con-
fidence regions automatically adapt to the geometry of the underlying distribution
and naturally respect constraints such as boundedness or non-negativity, without
requiring explicit transformations or variance estimation. However, this theoretical elegance and flexibility comes at the computational cost of having to compute the lower and upper bounds of the confidence interval separately.
In the following, we briefly review the empirical likelihood approach and its ap-
plication to DRO decision rules. Let 𝑍1 , . . . , 𝑍 𝑁 be independent samples from P0 ,
and let 𝜃 : P(Z) → R be a statistical quantity of interest (e.g., the expected value
of 𝑍𝑖 ). Empirical likelihood confidence regions for 𝜃(P0 ) can be constructed as
$$\hat{\mathcal{C}}_N = \big\{ \theta(P) : D_\phi(P, \hat{P}_N) \le r/N \big\} \tag{10.8}$$
for some 𝑟 ∈ R+ . Thus, the set Ĉ 𝑁 is the image of a 𝜙-divergence neighborhood around the empirical distribution P̂ 𝑁 under 𝜃. The key tool for establishing probabilistic bounds is the so-called profile divergence 𝜋 𝑁 : R → R+ , which is defined through
$$\pi_N(\tau) = \inf_{P \in \mathcal{P}(\mathcal{Z})} \big\{ D_\phi(P, \hat{P}_N) : \theta(P) = \tau \big\}. \tag{10.9}$$
For a functional 𝜃 satisfying suitable smoothness conditions, the empirical likelihood method provides asymptotically exact coverage guarantees of the form
$$\lim_{N\to\infty} P_0^N\big(\theta(P_0) \in \hat{\mathcal{C}}_N\big) = \lim_{N\to\infty} P_0^N\big(\pi_N(\theta(P_0)) \le r/N\big) = 1 - \eta,$$
where 𝜂 represents a significance level determined by 𝑟 and 𝜃.
The classical empirical likelihood approach (Owen 1988, 2001) relies on the
empirical likelihood divergence with entropy function 𝜙(𝑠) = − log(𝑠) + 𝑠 − 1 if
𝑠 ≥ 0 and 𝜙(𝑠) = ∞ if 𝑠 < 0 (see Table 2.1). In this case, 𝜋 𝑁 is called profile
likelihood. Assume that 𝑍 is a 𝑑-dimensional random vector that is governed by
the distribution P0 and whose covariance matrix has rank 𝑑0 ≤ 𝑑. For the expected
value 𝜃(P0 ) := E P0 [𝑍], Owen (1990) proves that, as 𝑁 → ∞, we have
$$N \, \pi_N\big(\mathbb{E}_{P_0}[Z]\big) \xrightarrow{d} \chi^2_{d_0},$$
where $\chi^2_{d_0}$ denotes the 𝜒2-distribution with 𝑑0 degrees of freedom. Thus, Ĉ 𝑁 constitutes an asymptotically exact (1 − 𝜂)-confidence region for 𝜃(P0 ) if we set 𝑟 in (10.8) to the (1 − 𝜂)-quantile of a 𝜒2-distribution with 𝑑0 degrees of freedom.
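To make the construction concrete, the following self-contained snippet (ours, purely illustrative; the helper name `el_stat` is hypothetical) evaluates Owen's empirical likelihood ratio statistic −2 log R(τ) for a scalar mean by solving the one-dimensional dual equation for the Lagrange multiplier via bisection. A 95% confidence interval collects all τ for which the statistic does not exceed the 0.95-quantile of the χ²-distribution with one degree of freedom (≈ 3.84):

```python
import math
import random

def el_stat(z, tau):
    # Empirical likelihood ratio statistic -2 log R(tau) for the mean:
    # solves sum_i (z_i - tau) / (1 + lam * (z_i - tau)) = 0 for the
    # Lagrange multiplier lam by bisection (the sum is strictly
    # decreasing in lam on the feasible bracket).
    a = [zi - tau for zi in z]
    if min(a) >= 0 or max(a) <= 0:
        return float("inf")  # tau lies outside the convex hull of the data
    eps = 1e-10
    lo = -1.0 / max(a) + eps  # keep every weight 1 + lam * a_i positive
    hi = -1.0 / min(a) - eps

    def g(lam):
        return sum(ai / (1.0 + lam * ai) for ai in a)

    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return 2.0 * sum(math.log1p(lam * ai) for ai in a)

random.seed(1)
z = [random.gauss(0.0, 1.0) for _ in range(200)]
zbar = sum(z) / len(z)
print(el_stat(z, zbar))        # ~0: the sample mean itself is never rejected
print(el_stat(z, zbar + 0.2))  # compare with the chi2_1 quantile 3.84
```

Note how the statistic becomes infinite outside the convex hull of the data, reflecting the fact that empirical likelihood regions automatically respect the support of the sample.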
In the context of stochastic programming problems, the statistical quantity of
interest is typically the optimal value of the stochastic program, that is, $\theta(P) = \inf_{x \in \mathcal{X}} \mathbb{E}_P[\ell(x, Z)]$. In this case, the set $\hat{\mathcal{C}}_N$ becomes the interval
$$\hat{\mathcal{C}}_N = \Big[ \inf_{P \in \hat{\mathcal{P}}_N} \inf_{x \in \mathcal{X}} \mathbb{E}_P[\ell(x, Z)], \; \sup_{P \in \hat{\mathcal{P}}_N} \inf_{x \in \mathcal{X}} \mathbb{E}_P[\ell(x, Z)] \Big],$$
where $\hat{\mathcal{P}}_N$ is the 𝜙-divergence ambiguity set of the form (10.7) around the empirical distribution P̂ 𝑁 . If $\hat{\mathcal{P}}_N$ is a likelihood ambiguity set, Lam (2019) investigates the asymptotic coverage probability of this interval by leveraging asymptotic guarantees for the SAA
decision rule by Lam and Zhou (2017). In particular, he shows that if suitable
regularity conditions hold and 𝑟 𝑁 = 𝑟/𝑁, where 𝑟 is the (1 − 𝜂)-quantile of a 𝜒2 -
distribution with a single degree of freedom, then Cˆ𝑁 becomes an asymptotically
exact (1 − 𝜂)-confidence interval. One can thus show that the resulting confidence bounds achieve asymptotically exact coverage at the parametric rate $O(N^{-1/2})$.
Duchi et al. (2021) further generalize these results to DRO decision rules over
broader classes of 𝜙-divergence ambiguity sets. Additionally, He and Lam (2021)
examine higher-order coverage errors and introduce a correction term similar to
the Bartlett correction. The authors derive higher-order correction terms for gen-
eral von Mises differentiable functionals and thus move beyond the approximately
smooth functions previously studied in the empirical likelihood literature.
In a parallel line of research, Blanchet et al. (2019b, 2022a,b), Blanchet and
Kang (2021) and Lin, Blanchet, Glynn and Nguyen (2024) introduce the Wasser-
stein profile function as a Wasserstein analogue to the profile divergence (10.9).
This approach replaces the 𝜙-divergence with the 2-Wasserstein distance, and it
offers a geometric perspective on uncertainty quantification. This approach yields confidence bounds that achieve the parametric rate $O(N^{-1/2})$ asymptotically. For more
details, we direct the readers to the recent survey by Blanchet et al. (2021).
10.2.3. Large and Moderate Deviations Principles
Unlike the central limit theorem and the empirical likelihood approach, which characterize limits of distribution sequences, large and moderate deviations theory studies the asymptotic tail behavior of such sequences. Specifically, it establishes exponential decay rates for the probabilities of rare events along sequences of random variables. The foundations of large deviations theory trace back to two seminal
developments in physics and mathematics. The first is Boltzmann’s groundbreak-
ing works on statistical mechanics and entropy. The second is Cramér’s pioneering
paper on the asymptotic behavior of sums of random variables (Cramér 1938).
Despite these early advances, the field lacked a unified mathematical framework
until Varadhan’s seminal paper (Varadhan 1966), which introduces a formal defin-
ition of a large deviation principle. We refer to the textbooks by Ellis (2007) and
Dembo and Zeitouni (2009) for a modern treatment of the topic.
Assume now that the unknown true distribution P0 is known to belong to a para-
metric distribution family {P 𝜃 : 𝜃 ∈ Θ} ⊆ P(Z), where 𝜃 ranges over a prescribed
parameter space Θ. In this case, estimating P0 is tantamount to estimating the
unknown true parameter vector 𝜃 0 ∈ Θ that satisfies P0 = P 𝜃0 . A statistic 𝜃ˆ 𝑁 is a random variable that is valued in Θ, is constructed from $(Z_1, \ldots, Z_N) \sim P_\theta^N$, and converges in probability to 𝜃 as 𝑁 grows, for every 𝜃 ∈ Θ. Formally, we say that
the statistic 𝜃ˆ 𝑁 satisfies a large deviations principle with speed 𝑏 𝑁 and with lower
semicontinuous rate function 𝐼 : Θ × Θ → [0, ∞] if
$$-\inf_{\theta' \in \mathrm{int}(\mathcal{B})} I(\theta', \theta) \le \liminf_{N \to \infty} \frac{1}{b_N} \log P_\theta\big(\hat{\theta}_N \in \mathcal{B}\big) \le \limsup_{N \to \infty} \frac{1}{b_N} \log P_\theta\big(\hat{\theta}_N \in \mathcal{B}\big) \le -\inf_{\theta' \in \mathrm{cl}(\mathcal{B})} I(\theta', \theta) \tag{10.10}$$
for all 𝜃 ∈ Θ and for all Borel sets B ⊆ Θ. Here, we assume that the sequence 𝑏 𝑁 ,
𝑁 ∈ N, tends monotonically towards infinity. If (10.10) holds, one can show under
mild conditions that 𝐼(𝜃, 𝜃) = 0 because 𝜃ˆ 𝑁 converges to 𝜃 in probability under P 𝜃 .
It is therefore natural to interpret 𝐼(𝜃 ′ , 𝜃) as a discrepancy function that quantifies the
dissimilarity between the estimator realization 𝜃 ′ and the probabilistic model 𝜃. As 𝐼
is lower semicontinuous, the minimization problems on the left- and right-hand sides of (10.10) share the same infimum $r = \inf_{\theta' \in \mathrm{int}(\mathcal{B})} I(\theta', \theta) = \inf_{\theta' \in \mathrm{cl}(\mathcal{B})} I(\theta', \theta)$
for most Borel sets B of interest. In these cases, the inequalities in (10.10) collapse
to equalities, and (10.10) simplifies to the more intuitive statement
$$P_\theta\big(\hat{\theta}_N \in \mathcal{B}\big) = \exp\big(-r b_N + o(b_N)\big).$$
That is, the probability of the estimator 𝜃ˆ𝑁 falling into the set B decays exponentially
at rate 𝑟 with speed 𝑏 𝑁 , where 𝑟 can be interpreted as the 𝐼-distance from 𝜃 to B.
Several statistics of practical interest satisfy large deviations principles. For
example, if Z is finite and {P 𝜃 : 𝜃 ∈ Θ} is the family of all distributions on Z
encoded by the corresponding probability vectors 𝜃 ∈ Θ, where Θ is the probability
simplex of appropriate dimension, then the empirical distribution P̂ 𝑁 correspond-
ing to the empirical probability vector 𝜃ˆ 𝑁 is an estimator for the data-generating
distribution P 𝜃 . In this case, Sanov’s theorem (Cover and Thomas 2006, The-
orem 11.4.1) asserts that 𝜃ˆ 𝑁 satisfies a large deviations principle with rate function
𝐼(𝜃 ′ , 𝜃) = KL(P 𝜃 ′ , P 𝜃 ) and speed 𝑏 𝑁 = 𝑁. Similarly, if {P 𝜃 : 𝜃 ∈ Θ} is any
distribution family parametrized by its unknown mean vector 𝜃 = E P 𝜃 [𝑍] and
if the log-moment generating function $\Lambda_\theta(t) = \log(\mathbb{E}_{P_\theta}[\exp(t^\top Z)])$ is finite for all $t, \theta \in \mathbb{R}^d$, then the sample mean $\hat{\theta}_N = \frac{1}{N} \sum_{i \in [N]} Z_i$ is an estimator for 𝜃. In this
case, Cramér’s theorem (Cramér 1938) asserts that 𝜃ˆ 𝑁 satisfies a large deviations
principle with rate function 𝐼(𝜃 ′ , 𝜃) = Λ∗𝜃 (𝜃 ′ ) and speed 𝑏 𝑁 = 𝑁. Note that the log-
moment generating function Λ 𝜃 as well as its conjugate Λ∗𝜃 are both convex. We
remark that a large deviations principle with sublinear speed (lim 𝑁 →∞ 𝑏 𝑁 /𝑁 = 0)
is sometimes referred to as a moderate deviations principle. For an example of a
moderate deviations principle we refer to (Jongeneel, Sutter and Kuhn 2022).
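Cramér's theorem can be checked numerically in the Bernoulli case (our illustration, not taken from the cited references): for P 𝜃 = Bernoulli(𝜃), the conjugate Λ*_𝜃(𝜃′) coincides with the relative entropy between Bernoulli(𝜃′) and Bernoulli(𝜃), and −(1/N) log P 𝜃(𝜃ˆ 𝑁 ≥ 𝜃′) approaches this rate from above, since the Chernoff bound provides the one-sided inequality for every finite N:

```python
import math

def log_tail(N, k0, p=0.5):
    # log P(S_N >= k0) for S_N ~ Binomial(N, p), via a log-sum-exp
    # over exact log binomial probabilities (lgamma-based).
    logs = [math.lgamma(N + 1) - math.lgamma(k + 1) - math.lgamma(N - k + 1)
            + k * math.log(p) + (N - k) * math.log(1 - p)
            for k in range(k0, N + 1)]
    m = max(logs)
    return m + math.log(sum(math.exp(l - m) for l in logs))

def rate(a, p=0.5):
    # Cramer rate function: KL divergence of Bernoulli(a) from Bernoulli(p).
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))

a = 0.6
for N in (100, 1000, 4000):
    est = -log_tail(N, math.ceil(a * N)) / N
    print(N, round(est, 5), round(rate(a), 5))  # est decreases towards the rate
```

The sub-exponential correction term of order log(N)/N explains why the empirical decay rate approaches the rate function only slowly from above.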
Van Parys et al. (2021) leverage Sanov’s theorem to show that the optimal
value of the DRO problem with a likelihood ambiguity set of radius 𝑟 around the
empirical distribution P̂ 𝑁 yields the least conservative confidence bound on the
optimal value of the true stochastic program, asymptotically as the sample size 𝑁
grows large, with significance level 𝜂 decaying exponentially as $e^{-rN}$. More
generally, Sutter, Van Parys and Kuhn (2024) assume that P0 is known to belong
to a parametric distribution family {P 𝜃 : 𝜃 ∈ Θ} and that 𝜃 admits an estimator 𝜃ˆ 𝑁
that satisfies a large deviations principle with rate function 𝐼 and speed 𝑏 𝑁 = 𝑁.
Under some regularity conditions, they then show that the optimal value of the
DRO problem with ambiguity set P̂ 𝑁 = {P 𝜃 : 𝜃 ∈ Θ, 𝐼(𝜃ˆ 𝑁 , 𝜃) ≤ 𝑟} yields again
the least conservative confidence bound on the optimal value of the true stochastic
program with significance level $\eta \propto e^{-rN}$. Similar statistical optimality results can
sometimes be obtained even when the training samples are serially dependent, e.g.,
when they are generated by a Markov process with unknown transition probability
matrix or certain autoregressive processes (Sutter et al. 2024).
The DRO estimators by Van Parys et al. (2021) and Sutter et al. (2024) lack
asymptotic consistency because they exploit large deviations principles with linear
speed 𝑏 𝑁 = 𝑁. Bennouna and Van Parys (2021) show that asymptotic consistency
can be recovered by relying on moderate deviations principles with sublinear speed.
This line of research has seen significant recent developments. The use of large
and moderate deviations principles has also been extended to various learning
and control settings such as distributionally robust Markov decision processes (Li,
Sutter and Kuhn 2021), bandit problems (Van Parys and Golrezaei 2024), bootstrap-
based methods (Bertsimas and Van Parys 2022), optimal learning (Ganguly and
Sutter 2023, Liu et al. 2023), control (Jongeneel, Sutter and Kuhn 2021, Jongeneel
et al. 2022), contextual learning (Srivastava, Wang, Hanasusanto and Ho 2021),
and robust statistics (Chan, Van Parys and Bennouna 2024).
10.3. Non-Asymptotic Analyses
Non-asymptotic statistics seeks finite-sample guarantees that hold regardless of the
sample size. This is in contrast to the asymptotic methods described in Section 10.2,
which rely on properties that emerge as the sample size tends to infinity. Non-asymptotic
methods allow for a rigorous control over error rates, which makes them robust
in situations where asymptotic approximations might produce misleading results.
In the following, we review two major classes of non-asymptotic analyses, that is,
measure concentration bounds and generalization bounds.
10.3.1. Measure Concentration Bounds
The most elementary approach to obtain finite sample guarantees is to design the
ambiguity set P̂ 𝑁 such that it contains the unknown true probability distribution
P0 with high probability. This requires an analysis of the convergence rate of P̂ 𝑁
towards P0 , and it leads to out-of-sample disappointment bounds that depend only
on P̂ 𝑁 and not on the complexity of the loss function ℓ or the decision space X .
Theorem 10.1 (Out-of-Sample Disappointment). Suppose that the ambiguity set $\hat{\mathcal{P}}_N$ defined in (10.7) satisfies
$$P_0^N\big(P_0 \in \hat{\mathcal{P}}_N\big) \ge 1 - \eta. \tag{10.11}$$
We then have
$$P_0^N\Big(\mathbb{E}_{P_0}[\ell(x, Z)] \le \sup_{P \in \hat{\mathcal{P}}_N} \mathbb{E}_P[\ell(x, Z)] \;\; \forall x \in \mathcal{X}\Big) \ge 1 - \eta. \tag{10.12a}$$
Moreover, if $\hat{X}_N$ is an optimizer of the distributionally robust decision problem with respect to the ambiguity set $\hat{\mathcal{P}}_N$, then we have
$$P_0^N\Big(\mathbb{E}_{P_0}[\ell(\hat{X}_N, Z)] \le \min_{x \in \mathcal{X}} \sup_{P \in \hat{\mathcal{P}}_N} \mathbb{E}_P[\ell(x, Z)]\Big) \ge 1 - \eta. \tag{10.12b}$$
The proof of (10.12a) and (10.12b) readily follows from the measure concentra-
tion bound (10.11) and is therefore omitted. Theorem 10.1 asserts that the worst-
case expected loss provides an upper confidence bound on the true expected loss
under the unknown data-generating distribution uniformly across all loss functions.
Moreover, it also asserts that the optimal value of the DRO problem (1.2) provides
an upper confidence bound on the out-of-sample performance of its optimizers.
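The mechanics of Theorem 10.1 can be illustrated with a toy Monte Carlo experiment (our own construction, not from the survey). For Bernoulli(p) data, the 1-Wasserstein distance between P0 and the empirical distribution is |p − p̂ 𝑁 |, and for the loss ℓ(x, z) = z the worst-case expectation over a 1-Wasserstein ball of radius r around the empirical distribution equals p̂ 𝑁 + r. A radius chosen via Hoeffding's inequality enforces (10.11), and every run in which the ball contains P0 is automatically a run in which the DRO value covers the true expected loss:

```python
import math
import random

random.seed(42)
p, N, eta, runs = 0.3, 100, 0.1, 2000
# Hoeffding radius: P(|phat - p| > r) <= 2 exp(-2 N r^2) = eta
r = math.sqrt(math.log(2 / eta) / (2 * N))

contain = cover = 0
for _ in range(runs):
    phat = sum(random.random() < p for _ in range(N)) / N
    contain += abs(phat - p) <= r  # W_1(P_0, empirical) <= r: ball contains P_0
    cover += p <= phat + r         # true expected loss <= DRO optimal value

print(contain / runs, cover / runs)  # both frequencies should exceed 1 - eta
```

Since containment implies coverage run by run, the coverage frequency can never fall below the containment frequency, which is exactly the logic behind the (omitted) proof.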
When using 𝜙-divergences to construct $\hat{\mathcal{P}}_N$ as in (10.7), the probabilistic requirement (10.11) can only be satisfied for underlying distributions P0 that are discrete (Polyanskiy and Wu 2024, § 7). In contrast, the Wasserstein distance applies to generic distributions P0 . This area of study has a rich history, with seminal contributions from
Dudley (1969), Ajtai, Komlós and Tusnády (1984), and Dobrić and Yukich (1995).
More recent advancements have been made by Bolley, Guillin and Villani (2007),
Boissard and Le Gouic (2014), Dereich, Scheutzow and Schottstedt (2013), and
Fournier and Guillin (2015). Of particular importance to our discussion is the
following measure concentration result, which serves as the foundation for finite
sample guarantees in DRO over 𝑝-Wasserstein ambiguity sets.
Theorem 10.2 (Measure Concentration (Fournier and Guillin 2015, Theorem 2)).
Suppose that P̂ 𝑁 is the empirical distribution constructed from 𝑁 independent
samples from P0 , that 𝑝 ≠ 𝑑/2, and that P0 is light-tailed in the sense that there exist 𝛼 > 𝑝 and 𝐴 > 0 such that $\mathbb{E}_{P_0}[\exp(\|Z\|^\alpha)] \le A$. Then, there are constants 𝑐1 , 𝑐2 > 0 that depend on P0 only through 𝛼, 𝐴, and 𝑑 such that for any 𝜂 ∈ (0, 1],
the concentration inequality $P_0^N\big(W_p(P_0, \hat{P}_N) \le r_N\big) \ge 1 - \eta$ holds whenever $r_N$ exceeds
$$r(d, N, \eta) = \begin{cases} \left(\dfrac{\log(c_1/\eta)}{c_2 N}\right)^{\min\{1/d,\, 1/2\}} & \text{if } N \ge \dfrac{\log(c_1/\eta)}{c_2}, \\[1ex] \left(\dfrac{\log(c_1/\eta)}{c_2 N}\right)^{1/\alpha} & \text{if } N < \dfrac{\log(c_1/\eta)}{c_2}. \end{cases} \tag{10.13}$$
The result remains valid for 𝑝 = 𝑑/2 but with a more complicated formula
for 𝑟(𝑑, 𝑁, 𝜂) (Fournier and Guillin 2015, Theorem 2). Intuitively, Theorem 10.2
asserts that any 𝑝-Wasserstein ball $\hat{\mathcal{P}}_N$ of radius $r_N \ge r(d, N, \eta)$ around the empirical distribution P̂ 𝑁 represents a (1 − 𝜂)-confidence set for the unknown data-generating distribution P0 . For
uncertainty dimensions 𝑑 > 2, the critical radius 𝑟(𝑑, 𝑁, 𝜂) of this confidence set decays as $O(N^{-1/d})$. In other words, to reduce the critical radius by 50%, the sample size must increase by a factor of $2^d$. Unfortunately, this curse of dimensionality is
fundamental, and the decay rate of 𝑟(𝑑, 𝑁, 𝜂) is essentially optimal (Fournier and
Guillin 2015, § 1.3). Explicit constants 𝑐1 and 𝑐2 are provided by Fournier (2022).
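To make the curse of dimensionality tangible, the formula (10.13) can be evaluated directly. In the sketch below (ours, for illustration), the constants c1, c2, and the tail exponent α are placeholders, since the theorem only guarantees that suitable values exist:

```python
import math

def critical_radius(d, N, eta, c1=math.e, c2=1.0, alpha=2.0):
    # Critical radius r(d, N, eta) from (10.13); the large-sample branch
    # uses the exponent min{1/d, 1/2}, the small-sample branch uses 1/alpha.
    t = math.log(c1 / eta)
    if N >= t / c2:
        return (t / (c2 * N)) ** min(1.0 / d, 0.5)
    return (t / (c2 * N)) ** (1.0 / alpha)

d, N, eta = 4, 1000, 0.05
r1 = critical_radius(d, N, eta)
r2 = critical_radius(d, (2 ** d) * N, eta)
print(r1, r2, r2 / r1)  # multiplying N by 2^d halves the radius when d > 2
```

Because the large-sample exponent is 1/d for d > 2, the ratio of the two radii is exactly (2^d)^(-1/d) = 1/2, matching the sample-size discussion above.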
Generic measure concentration bounds suffer from a curse of dimensionality.
Shafieezadeh-Abadeh et al. (2019) and Wu et al. (2022) show that this curse can
be overcome in the context of linear prediction models by projecting 𝑍 to a one-dimensional random variable, yielding the parametric convergence rate $O(N^{-1/2})$.
Nietert et al. (2024a) develop a similar approach for rank-𝑘 linear models, where 2 < 𝑘 < 𝑑, and achieve an improved rate of $O(N^{-1/k})$ based on 𝑘-sliced Wasserstein distances. The 1-sliced Wasserstein distance is also used by Olea et al. (2022) to obtain the parametric rate $O(N^{-1/2})$ for a class of regression problems.
We conclude this section by highlighting that the DRO approach admits instance-dependent regret bounds, which are essentially free of complexity measures of the decision space and the loss function. Instead, they only depend on the complexity
of the optimal solution 𝑥0 through the DRO regularizer 𝑅ˆ 𝑁 (𝑥0 ). Zeng and Lam
(2022, Theorem 4.1) and Nietert et al. (2024a, Theorem 1) establish such bounds
for DRO problems over the ambiguity set (10.7) when D is the maximum mean
discrepancy and the (outlier-robust) Wasserstein distance, respectively. Similar
instance-dependent guarantees for DRO problems with Wasserstein ambiguity sets
are developed by Hou, Kassraie, Kratsios, Krause and Rothfuss (2023).
10.3.2. Generalization Bounds
An alternative approach to obtain statistical guarantees leverages the union bound
from probability theory and covering numbers or complexity measures from stat-
istical learning theory. The first step consists in deriving an inequality of the form
$$P_0^N\big(\mathbb{E}_{P_0}[\ell(x, Z)] \le \hat{L}_N(x)\big) \ge 1 - \eta \quad \forall x \in \mathcal{X}, \tag{10.14}$$
where the loss certificate 𝐿ˆ 𝑁 (𝑥) depends on the decision 𝑥 ∈ X . For example, a
guarantee of the form (10.14) can be obtained by combining empirical Bernstein
inequalities (Maurer and Pontil 2009) and a DRO model with a 𝜒2 -divergence
ambiguity set (Duchi and Namkoong 2019, Theorem 2). In this case, the certi-
ficate 𝐿ˆ 𝑁 (𝑥) reduces to the sum of the expected loss under P̂ 𝑁 and a variance
regularizer under P0 . Alternatively, a guarantee of the form (10.14) can also be
obtained by combining transport inequalities (Marton 1986, Talagrand 1996) and a
DRO model with a Wasserstein ambiguity set (Gao 2023, Theorem 1). In this case,
𝐿ˆ 𝑁 (𝑥) reduces to the sum of the expected loss under P̂ 𝑁 and a variation regularizer
under P0 . The second step consists in converting the individual guarantee (10.14)
to a uniform guarantee. For example, if X is finite, this can easily be achieved by
using the union bound. If X is uncountable, one may use one of several standard
techniques. If the loss function is Lipschitz continuous in 𝑥 ∈ X uniformly across
all 𝑧 ∈ Z and X is compact, then one can discretize X by uniform gridding. In
this case, the loss at an arbitrary point is uniformly approximated by the loss at
the nearest grid point, and a uniform guarantee can again be obtained by using the
union bound. However, the number of grid points needed for an 𝜀-approximation
is of the order $O((1/\varepsilon)^d)$, which is impractical in high dimensions 𝑑. A more
sophisticated approach to discretize X exploits structural knowledge of the loss
function at multiple scales. However, obtaining tight approximations in high dimensions remains challenging. In order to mitigate the computational burden related
to discretization, one may exploit several complexity measures that quantify the
expressiveness of the functions ℓ(𝑥, ·) for all 𝑥 ∈ X such as the VC dimension or
the Rademacher complexity as well as its local version. Nonetheless, Rademacher
complexities can themselves be difficult to compute. For full details we
refer to (Boucheron, Lugosi and Massart 2013, Vershynin 2018, Wainwright 2019).
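A back-of-the-envelope version of the gridding argument reads as follows (our illustration; the helper name `uniform_bound` is hypothetical). For a loss with values in [0, B] that is L-Lipschitz in x ∈ [0, 1]^d, a uniform grid of spacing ε contains ⌈1/ε⌉^d points, and combining Hoeffding's inequality, a union bound over the grid, and the Lipschitz discretization error Lε yields a uniform certificate whose confidence term grows only logarithmically in the grid size, while the grid size itself explodes as (1/ε)^d:

```python
import math

def uniform_bound(d, N, eps, eta, L=1.0, B=1.0):
    # Grid size of a uniform eps-grid of [0, 1]^d.
    M = math.ceil(1 / eps) ** d
    # Hoeffding deviation term inflated by a union bound over the M grid
    # points, plus the L*eps discretization error of an L-Lipschitz loss.
    deviation = B * math.sqrt(math.log(2 * M / eta) / (2 * N))
    return M, deviation + L * eps

for d in (2, 10, 20):
    M, bound = uniform_bound(d, N=10_000, eps=0.1, eta=0.05)
    print(d, M, round(bound, 4))  # grid size explodes as (1/eps)^d
```

The confidence term only degrades like √d, but storing or enumerating the grid is hopeless already for moderate d, which is why covering numbers and Rademacher complexities are preferred in practice.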
The last step consists in approximating the certificate 𝐿ˆ 𝑁 (𝑥) by the worst-case
expected loss over a data-driven ambiguity set P̂ 𝑁 based on the 𝜒2 -divergence
or a Wasserstein distance. The corresponding approximation error can be con-
trolled by leveraging Taylor approximations as in Theorems 8.4 and 8.7 together
with appropriate concentration inequalities. In summary, this procedure shows
that the optimal value of a data-driven DRO problem over a 𝜒2 -divergence or a
Wasserstein ambiguity set provides a finite-sample upper confidence bound on the
corresponding stochastic program under the unknown true distribution P0 .
Duchi and Namkoong (2019) and Gao (2023) derive generalization bounds of this
kind for 𝜒2 -divergence and Wasserstein ambiguity sets, respectively, while Azizian,
Iutzeler and Malick (2023a) extend their analysis to entropically regularized optimal transport ambiguity sets. All these bounds exhibit the parametric rate $O(N^{-1/2})$. In
addition, Duchi and Namkoong (2019) demonstrate that, under certain curvature
conditions, 𝜒2 -divergence decision rules can achieve the fast rate O(𝑁 −1 ).
Acknowledgments. This research was supported by the Swiss National Science
Foundation under the NCCR Automation (grant agreement 51NF40_180545). The
authors thank Nicolas Lanzetti, Mengmeng Li, Karthik Natarajan, Yves Rychener,
Philipp Schneider, Buse Sen, Bradley Sturt and Man-Chung Yue for their valuable
feedback on the paper. We are responsible for all remaining errors.
References
C. Acerbi (2002), Spectral measures of risk: A coherent representation of subjective risk
aversion, Journal of Banking & Finance 26(7), 1505–1518.
A. Ahmadi-Javid (2012), Entropic value-at-risk: A new coherent risk measure, Journal of
Optimization Theory and Applications 155(3), 1105–1123.
S. Ahmed (2006), Convexity and decomposition of mean-risk stochastic programs, Math-
ematical Programming 106(3), 433–446.
M. Ajtai, J. Komlós and G. Tusnády (1984), On optimal matchings, Combinatorica 4(4),
259–264.
F. Al Taha, S. Yan and E. Bitar (2023), A distributionally robust approach to regret optimal
control using the Wasserstein distance, in IEEE Conference on Decision and Control,
pp. 2768–2775.
S. M. Ali and S. D. Silvey (1966), A general class of coefficients of divergence of one
distribution from another, Journal of the Royal Statistical Society: Series B 28(1),
131–142.
J. M. Altschuler and E. Boix-Adsera (2023), Polynomial-time algorithms for multimarginal
optimal transport problems with structure, Mathematical Programming 199(1), 1107–
1178.
L. Ambrosio, N. Gigli and G. Savaré (2008), Gradient Flows: In Metric Spaces and in the
Space of Probability Measures, Springer.
Y. An and R. Gao (2021), Generalization bounds for (Wasserstein) robust optimization, in
Advances in Neural Information Processing Systems, pp. 10382–10392.
B. Analui and G. C. Pflug (2014), On distributionally robust multiperiod stochastic optim-
ization, Computational Management Science 11, 197–220.
M. Anthony and P. L. Bartlett (1999), Neural Network Learning: Theoretical Foundations,
Cambridge University Press.
J. Anunrojwong, S. R. Balseiro and O. Besbes (2024), On the robustness of second-price
auctions in prior-independent mechanism design, Operations Research (Forthcoming).
L. Aolaritei, N. Lanzetti, H. Chen and F. Dörfler (2022a), Uncertainty propagation via
optimal transport ambiguity sets, arXiv:2205.00343.
L. Aolaritei, S. Shafiee and F. Dörfler (2022b), Wasserstein distributionally robust estim-
ation in high dimensions: Performance analysis and optimal hyperparameter tuning,
arXiv:2206.13269.
R. Arora and R. Gao (2022), Data-driven multistage distributionally robust linear optimiz-
ation with nested distance, Available from Optimization Online.
P. Artzner, F. Delbaen, J.-M. Eber and D. Heath (1999), Coherent measures of risk,
Mathematical Finance 9(3), 203–228.
C. Atkinson and A. F. Mitchell (1981), Rao’s distance measure, Sankhyā: The Indian
Journal of Statistics, Series A 43(3), 345–365.
W. Azizian, F. Iutzeler and J. Malick (2023a), Exact generalization guarantees for (regu-
larized) Wasserstein distributionally robust models, in Advances in Neural Information
Processing Systems, pp. 14584–14596.
W. Azizian, F. Iutzeler and J. Malick (2023b), Regularization for Wasserstein distributionally robust optimization, ESAIM: Control, Optimisation and Calculus of Variations
29(31), 1–33.
F. Bach (2013), Learning with submodular functions: A convex optimization perspective,
Foundations and Trends in Machine Learning 6(2-3), 145–373.
F. Bach (2019), Submodular functions: From discrete to continuous domains, Mathemat-
ical Programming 175(1-2), 419–459.
X. Bai, G. He, Y. Jiang and J. Obloj (2017), Wasserstein distributional robustness of neural
networks, in Advances in Neural Information Processing Systems, pp. 26322–26347.
Y. Bai, H. Lam and X. Zhang (2023), A distributionally robust optimization framework for
extreme event estimation, arXiv:2301.01360.
R. Baire (1905), Leçons sur les Fonctions Discontinues, Gauthier-Villars.
S. Banach (1938), Über homogene Polynome in (𝐿 2 ), Studia Mathematica 7(1), 36–44.
C. Bandi and D. Bertsimas (2014), Optimal design for multi-item auctions: A robust
optimization approach, Mathematics of Operations Research 39(4), 1012–1038.
D. Bartl, S. Drapeau, J. Oblój and J. Wiesel (2021), Sensitivity analysis of Wasser-
stein distributionally robust optimization problems, Proceedings of the Royal Society A
477(2256), 20210176.
T. Başar (1977), Optimum Fisherian information for multivariate distributions, The Annals
of Statistics 5(6), 1240–1244.
T. Başar (1983), The Gaussian test channel with an intelligent jammer, IEEE Transactions
on Information Theory 29(1), 152–157.
T. Başar and T. Ü. Başar (1984), A bandwidth expanding scheme for communication
channels with noiseless feedback in the presence of unknown jamming noise, Journal
of the Franklin Institute 317(2), 73–88.
T. Başar and P. Bernhard (1995), H∞ -optimal Control and Related Minimax Design Prob-
lems: A Dynamic Game Approach, Springer.
T. Başar and M. Max (1973), A multistage pursuit-evasion game that admits a Gaussian
random process as a maximin control policy, Stochastics 1(1-4), 25–69.
T. Başar and M. Mintz (1972), Minimax terminal state estimation for linear plants with
unknown forcing functions, International Journal of Control 16(1), 49–69.
T. Başar and M. Mintz (1973), On a minimax estimate for the mean of a normal random
vector under a generalized quadratic loss function, The Annals of Statistics 1(1), 127–
134.
T. Başar and Y. W. Wu (1985), A complete characterization of minimax and maximin
encoder-decoder policies for communication channels with incomplete statistical de-
scription, IEEE Transactions on Information Theory 31(4), 482–489.
T. Başar and Y. W. Wu (1986), Solutions to a class of minimax decision problems arising
in communication systems, Journal of Optimization Theory and Applications 51(3),
375–404.
T. Ü. Başar and T. Başar (1982), Optimum coding and decoding schemes for the trans-
mission of a stochastic process over a continuous-time stochastic channel with partially
unknown statistics, Stochastics 8(3), 213–237.
H. I. Bayrak, Ç. Koçyiğit, D. Kuhn and M. C. Pınar (2022), Distributionally robust optimal
allocation with costly verification, arXiv:2211.15122.
G. Bayraksan and D. K. Love (2015), Data-driven stochastic programming using phi-
divergences, INFORMS Tutorials in Operations Research pp. 1–19.
E. M. L. Beale (1955), On minimizing a convex function subject to linear inequalities,
Journal of the Royal Statistical Society: Series B 17(2), 173–184.
A. Beck and A. Ben-Tal (2009), Duality in robust optimization: Primal worst equals dual
best, Operations Research Letters 37(1), 1–6.
R. Belbasi, A. Selvi and W. Wiesemann (2023), It’s all in the mix: Wasserstein machine
learning with mixed features, arXiv:2312.12230.
A. Ben-Tal and E. Hochman (1972), More bounds on the expectation of a convex function
of a random variable, Journal of Applied Probability 9(4), 803–812.
A. Ben-Tal and A. Nemirovski (1998), Robust convex optimization, Mathematics of Oper-
ations Research 23(4), 769–805.
A. Ben-Tal and A. Nemirovski (1999a), Robust solutions of uncertain linear programs,
Operations Research Letters 25(1), 1–13.
A. Ben-Tal and A. Nemirovski (1999b), Robust truss topology design via semidefinite
programming, SIAM Journal on Optimization 7(4), 991–1016.
A. Ben-Tal and A. Nemirovski (2000), Robust solutions of linear programming problems
contaminated with uncertain data, Mathematical Programming 88(4), 411–424.
A. Ben-Tal and A. Nemirovski (2001), Lectures on Modern Convex Optimization: Analysis,
Algorithms, and Engineering Applications, SIAM.
A. Ben-Tal and A. Nemirovski (2002), Robust optimization–methodology and applications,
Mathematical Programming 92(3), 453–480.
A. Ben-Tal and M. Teboulle (1986), Expected utility, penalty functions, and duality in
stochastic nonlinear programming, Management Science 32(11), 1445–1466.
A. Ben-Tal and M. Teboulle (2007), An old-new concept of convex risk measures: The
optimized certainty equivalent, Mathematical Finance 17(3), 449–476.
A. Ben-Tal, A. Ben-Israel and M. Teboulle (1991), Certainty equivalents and informa-
tion measures: duality and extremal principles, Journal of Mathematical Analysis and
Applications 157(1), 211–236.
A. Ben-Tal, D. den Hertog and J.-P. Vial (2015a), Deriving robust counterparts of nonlinear
uncertain inequalities, Mathematical Programming 149(1), 265–299.
A. Ben-Tal, D. den Hertog, A. De Waegenaere, B. Melenberg and G. Rennen (2013), Robust
solutions of optimization problems affected by uncertain probabilities, Management
Science 59(2), 341–357.
A. Ben-Tal, L. El Ghaoui and A. Nemirovski (2009), Robust Optimization, Princeton
University Press.
A. Ben-Tal, E. Hazan, T. Koren and S. Mannor (2015b), Oracle-based robust optimization
via online learning, Operations Research 63(3), 628–638.
A. Bennouna and B. P. Van Parys (2021), Learning and decision-making with data: Optimal
formulations and phase transitions, arXiv:2109.06911.
A. Bennouna and B. P. Van Parys (2023), Holistic robust data-driven decisions,
arXiv:2207.09560.
A. Bennouna, R. Lucas and B. P. Van Parys (2023), Certified robust neural networks: Gen-
eralization and corruption resistance, in International Conference on Machine Learning,
pp. 2092–2112.
C. Berge (1963), Topological Spaces: Including a Treatment of Multi-Valued Functions,
Vector Spaces, and Convexity, Courier Corporation.
D. Bergemann and K. H. Schlag (2008), Pricing without priors, Journal of the European
Economic Association 6(2-3), 560–569.
D. S. Bernstein (2009), Matrix Mathematics: Theory, Facts, and Formulas, Princeton
University Press.
D. Bertsimas and D. den Hertog (2022), Robust and Adaptive Optimization, Dynamic
Ideas.
D. Bertsimas and I. Popescu (2002), On the relation between option and stock prices: A
convex optimization approach, Operations Research 50(2), 358–374.
D. Bertsimas and I. Popescu (2005), Optimal inequalities in probability theory: A convex
optimization approach, SIAM Journal on Optimization 15(3), 780–804.
D. Bertsimas and J. Sethuraman (2000), Moment problems and semidefinite optimization,
in Handbook of Semidefinite Programming: Theory, Algorithms, and Applications
(H. Wolkowicz, R. Saigal and L. Vandenberghe, eds), Springer, pp. 469–509.
D. Bertsimas and M. Sim (2004), The price of robustness, Operations Research 52(1),
35–53.
D. Bertsimas and B. P. Van Parys (2022), Bootstrap robust prescriptive analytics, Mathem-
atical Programming 195(1), 39–78.
D. Bertsimas, D. B. Brown and C. Caramanis (2011), Theory and applications of robust
optimization, SIAM Review 53(3), 464–501.
D. Bertsimas, D. den Hertog and J. Pauphilet (2021), Probabilistic guarantees in robust
optimization, SIAM Journal on Optimization 31(4), 2893–2920.
D. Bertsimas, X. V. Doan, K. Natarajan and C.-P. Teo (2010), Models for minimax stochastic
linear optimization problems with risk aversion, Mathematics of Operations Research
35(3), 580–602.
D. Bertsimas, V. Gupta and N. Kallus (2018a), Data-driven robust optimization, Mathem-
atical Programming 167(2), 235–292.
D. Bertsimas, V. Gupta and N. Kallus (2018b), Robust sample average approximation,
Mathematical Programming 171(1-2), 217–282.
D. Bertsimas, K. Natarajan and C.-P. Teo (2004), Probabilistic combinatorial optimiza-
tion: Moments, semidefinite programming, and asymptotic bounds, SIAM Journal on
Optimization 15(1), 185–209.
D. Bertsimas, K. Natarajan and C.-P. Teo (2006a), Persistence in discrete optimization
under data uncertainty, Mathematical Programming 108(2-3), 251–274.
D. Bertsimas, K. Natarajan and C.-P. Teo (2006b), Tight bounds on expected order statistics,
Probability in the Engineering and Informational Sciences 20(4), 667–686.
D. Bertsimas, S. Shtern and B. Sturt (2022), Two-stage sample robust optimization, Oper-
ations Research 70(1), 624–640.
D. Bertsimas, S. Shtern and B. Sturt (2023), A data-driven approach to multistage stochastic
linear optimization, Management Science 69(1), 51–74.
R. Bhatia, T. Jain and Y. Lim (2018), Strong convexity of sandwiched entropies and related
optimization problems, Reviews in Mathematical Physics 30(9), 1850014.
R. Bhatia, T. Jain and Y. Lim (2019), On the Bures–Wasserstein distance between positive
definite matrices, Expositiones Mathematicae 37(2), 165–191.
C. Bhattacharyya (2004), Second order cone programming formulations for feature selec-
tion, Journal of Machine Learning Research 5, 1417–1433.
P. Billingsley (2013), Convergence of Probability Measures, Wiley.
J. Birge and R.-B. Wets (1986), Designing approximation schemes for stochastic optim-
ization problems, in particular for stochastic programs with recourse, Mathematical
Programming Study 27, 54–102.
J. R. Birge and F. Louveaux (2011), Introduction to Stochastic Programming, Springer.
C. M. Bishop (2006), Pattern Recognition and Machine Learning, Springer.
J. Blanchet and Y. Kang (2020), Semi-supervised learning based on distributionally robust
optimization, in Data Analysis and Applications 3 (A. Makrides, A. Karagrigoriou and
C. H. Skiadas, eds), Wiley, pp. 1–33.
J. Blanchet and Y. Kang (2021), Sample out-of-sample inference based on Wasserstein
distance, Operations Research 69(3), 985–1013.
J. Blanchet and K. Murthy (2019), Quantifying distributional model risk via optimal
transport, Mathematics of Operations Research 44(2), 565–600.
J. Blanchet and A. Shapiro (2023), Statistical limit theorems in distributionally robust
optimization, in Winter Simulation Conference, pp. 31–45.
J. Blanchet, L. Chen and X. Y. Zhou (2022a), Distributionally robust mean-variance
portfolio selection with Wasserstein distances, Management Science 68(9), 6382–6410.
J. Blanchet, P. W. Glynn, J. Yan and Z. Zhou (2019a), Multivariate distributionally ro-
bust convex regression under absolute error loss, in Advances in Neural Information
Processing Systems, pp. 11817–11826.
J. Blanchet, F. He and K. Murthy (2020), On distributionally robust extreme value analysis,
Extremes 23(2), 317–347.
J. Blanchet, Y. Kang and K. Murthy (2019b), Robust Wasserstein profile inference and
applications to machine learning, Journal of Applied Probability 56(3), 830–857.
J. Blanchet, D. Kuhn, J. Li and B. Taşkesen (2023), Unifying distributionally robust
optimization via optimal transport theory, arXiv:2308.05414.
J. Blanchet, H. Lam, Y. Liu and R. Wang (2024a), Convolution bounds on quantile
aggregation, Operations Research (Forthcoming).
J. Blanchet, J. Li, S. Lin and X. Zhang (2024b), Distributionally robust optimization and
robust statistics, arXiv:2401.14655.
J. Blanchet, K. Murthy and V. A. Nguyen (2021), Statistical analysis of Wasserstein
distributionally robust estimators, INFORMS Tutorials in Operations Research pp. 227–
254.
J. Blanchet, K. Murthy and N. Si (2022b), Confidence regions in Wasserstein distribution-
ally robust estimation, Biometrika 109(2), 295–315.
J. Blanchet, K. Murthy and F. Zhang (2022c), Optimal transport-based distributionally
robust optimization: Structural properties and iterative schemes, Mathematics of Oper-
ations Research 47(2), 1500–1529.
N. E. Blankenstein, E. A. Crone, W. van den Bos and A. C. K. van Duijvenvoorde (2016),
Adolescents display distinctive tolerance to ambiguity and to uncertainty during risky
decision making, Developmental Neuropsychology 41(1–2), 77–92.
E. Boissard and T. Le Gouic (2014), On the mean speed of convergence of empirical
and occupation measures in Wasserstein distance, Annales de l’IHP Probabilités et
Statistiques 50(2), 539–563.
F. Bolley, A. Guillin and C. Villani (2007), Quantitative concentration inequalities for
empirical measures on non-compact spaces, Probability Theory and Related Fields
137(3-4), 541–593.
G. Boole (1854), An Investigation of the Laws of Thought, Walton and Maberly.
S. Bose and A. Daripa (2009), A dynamic mechanism and surplus extraction under ambi-
guity, Journal of Economic Theory 144(5), 2084–2114.
D. Boskos, J. Cortés and S. Martínez (2020), Data-driven ambiguity sets with probabilistic
guarantees for dynamic processes, IEEE Transactions on Automatic Control 66(7),
2991–3006.
P. Bossaerts, P. Ghirardato, S. Guarnaschelli and W. R. Zame (2010), Ambiguity in asset
markets: Theory and experiment, The Review of Financial Studies 23(4), 1325–1359.
S. Boucheron, G. Lugosi and P. Massart (2013), Concentration Inequalities: A Nonasymp-
totic Theory of Independence, Oxford University Press.
O. Bousquet, S. Boucheron and G. Lugosi (2004), Introduction to statistical learning
theory, in Advanced Lectures on Machine Learning (O. Bousquet, U. von Luxburg and
G. Rätsch, eds), Springer, pp. 169–207.
G. E. Box (1953), Non-normality and tests on variances, Biometrika 40(3-4), 318–335.
G. E. Box (1979), Robustness in the strategy of scientific model building, in Robustness in
Statistics (R. L. Launer and G. N. Wilkinson, eds), Academic Press, pp. 201–236.
Y. Brenier (1991), Polar factorization and monotone rearrangement of vector-valued func-
tions, Communications on Pure and Applied Mathematics 44(4), 375–417.
H. Brezis (2011), Functional Analysis, Sobolev Spaces and Partial Differential Equations,
Springer.
J. Brugman, J. S. Van Leeuwaarden and C. Stegehuis (2022), Sharpest possible clustering
bounds using robust random graph analysis, Physical Review E 106(6), 064311.
M. Buckert, C. Schwieren, B. M. Kudielka and C. J. Fiebach (2014), Acute stress affects
risk taking but not ambiguity aversion, Frontiers in Neuroscience 8, 82.
N. Bui, D. Nguyen and V. A. Nguyen (2022), Counterfactual plans under distributional
ambiguity, in International Conference on Learning Representations.
L. Bungert, N. García Trillos and R. Murray (2023), The geometry of adversarial training in
binary classification, Information and Inference: A Journal of the IMA 12(2), 921–968.
L. Bungert, T. Laux and K. Stinson (2024), A mean curvature flow arising in adversarial
training, arXiv:2404.14402.
L. Cabantous (2007), Ambiguity aversion in the field of insurance: Insurers’ attitude to
imprecise and conflicting probability estimates, Theory and Decision 62(3), 219–240.
J. Cai, J. Y.-M. Li and T. Mao (2023), Distributionally robust optimization under distorted
expectations, Operations Research (Forthcoming).
G. C. Calafiore (2007), Ambiguous risk measures and optimal robust portfolios, SIAM
Journal on Optimization 18(3), 853–877.
G. C. Calafiore and M. C. Campi (2005), Uncertain convex programs: Randomized solu-
tions and confidence levels, Mathematical Programming 102(1), 25–46.
G. C. Calafiore and M. C. Campi (2006), The scenario approach to robust control design,
IEEE Transactions on Automatic Control 51(5), 742–753.
G. C. Calafiore and L. El Ghaoui (2006), On distributionally robust chance-constrained
linear programs, Journal of Optimization Theory and Applications 130(1), 1–22.
G. C. Calafiore, F. Dabbene and R. Tempo (2011), Research on probabilistic methods for
control system design, Automatica 47(7), 1279–1293.
M. C. Campi and A. Caré (2013), Random convex programs with L1-regularization:
Sparsity and generalization, SIAM Journal on Control and Optimization 51(5), 3532–
3557.
M. C. Campi and S. Garatti (2008), The exact feasibility of randomized solutions of
uncertain convex programs, SIAM Journal on Optimization 19(3), 1211–1230.
M. C. Campi and S. Garatti (2011), A sampling-and-discarding approach to chance-
constrained optimization: Feasibility and optimality, Journal of Optimization Theory
and Applications 148(2), 257–280.
M. C. Campi and S. Garatti (2018), Wait-and-judge scenario optimization, Mathematical
Programming 167(1), 155–189.
A. Caré, S. Garatti and M. C. Campi (2014), FAST—fast algorithm for the scenario
technique, Operations Research 62(3), 662–671.
Y. Carmon and D. Hausler (2022), Distributionally robust optimization via ball oracle
acceleration, in Advances in Neural Information Processing Systems, pp. 35866–35879.
G. Carroll (2017), Robustness and separation in multidimensional screening, Econometrica
85(2), 453–488.
T. Champion, L. De Pascale and P. Juutinen (2008), The ∞-Wasserstein distance: Local
solutions and existence of optimal transport maps, SIAM Journal on Mathematical
Analysis 40(1), 1–20.
G. Chan, B. Van Parys and A. Bennouna (2024), From distributional robustness to robust
statistics: A confidence sets perspective, arXiv:2410.14008.
P. Chebyshev (1874), Sur les valeurs limites des intégrales, Journal de Mathématiques
Pures et Appliquées 19, 157–160.
L. Chen and M. Sim (2024), Robust CARA optimization, Operations Research (Forthcom-
ing).
L. Chen, C. Fu, F. Si, M. Sim and P. Xiong (2023), Robust optimization with moment-
dispersion ambiguity, SSRN preprint 4525224.
L. Chen, S. He and S. Zhang (2011), Tight bounds for some risk measures, with applications
to robust portfolio selection, Operations Research 59(4), 847–865.
L. Chen, W. Ma, K. Natarajan, D. Simchi-Levi and Z. Yan (2022), Distributionally robust
linear and discrete optimization with marginals, Operations Research 70(3), 1822–1834.
L. Chen, D. Padmanabhan, C. C. Lim and K. Natarajan (2020), Correlation robust influence
maximization, in Advances in Neural Information Processing Systems, pp. 7078–7089.
R. Chen and I. C. Paschalidis (2018), A robust learning approach for regression models
based on distributionally robust optimization, Journal of Machine Learning Research
19(1), 517–564.
R. Chen and I. C. Paschalidis (2019), Selecting optimal decisions via distributionally robust
nearest-neighbor regression, in Advances in Neural Information Processing Systems,
pp. 749–759.
W. Chen, M. Sim, J. Sun and C.-P. Teo (2010), From CVaR to uncertainty set: Implications
in joint chance-constrained optimization, Operations Research 58(2), 470–485.
Z. Chen, Z. Hu and R. Wang (2024a), Screening with limited information: A dual per-
spective, Operations Research 72(4), 1487–1504.
Z. Chen, D. Kuhn and W. Wiesemann (2024b), Data-driven chance constrained programs
over Wasserstein balls, Operations Research 72(1), 410–424.
Z. Chen, M. Sim and H. Xu (2019), Distributionally robust optimization with infinitely
constrained ambiguity sets, Operations Research 67(5), 1328–1344.
J. Cheng, E. Delage and A. Lisser (2014), Distributionally robust stochastic knapsack
problem, SIAM Journal on Optimization 24(3), 1485–1506.
A. Cherukuri and J. Cortés (2019), Cooperative data-driven distributionally robust optim-
ization, IEEE Transactions on Automatic Control 65(10), 4400–4407.
L. Chizat (2022), Sparse optimization on measures with over-parameterized gradient des-
cent, Mathematical Programming 194(1), 487–532.
L. Chizat and F. Bach (2018), On the global convergence of gradient descent for over-
parameterized models using optimal transport, in Advances in Neural Information Pro-
cessing Systems.
P. Clément and W. Desch (2008), Wasserstein metric and subordination, Studia Mathem-
atica 189(1), 35–52.
J. Coulson, J. Lygeros and F. Dörfler (2021), Distributionally robust chance constrained
data-enabled predictive control, IEEE Transactions on Automatic Control 67(7), 3289–
3304.
T. Cover and J. Thomas (2006), Elements of Information Theory, Wiley.
H. Cramér (1938), Sur un nouveau théoreme-limite de la théorie des probabilités, Actualités
Scientifiques et Industrielles 736, 5–23.
H. Cramér (1946), Mathematical Methods of Statistics, Princeton University Press.
I. Csiszár (1963), Eine informationstheoretische Ungleichung und ihre Anwendung auf
den Beweis der Ergodizität von Markoffschen Ketten, Publications of the Mathematical
Institute of the Hungarian Academy of Sciences 8, 85–108.
I. Csiszár (1967), Information-type measures of difference of probability distributions and
indirect observation, Studia Scientiarum Mathematicarum Hungarica 2, 229–318.
G. B. Dantzig (1955), Linear programming under uncertainty, Management Science 1(3-4),
197–206.
G. B. Dantzig (1956), The Simplex Method, RAND Corporation.
B. Das, A. Dhara and K. Natarajan (2021), On the heavy-tail behavior of the distributionally
robust newsvendor, Operations Research 69(4), 1077–1099.
D. P. De Farias and B. Van Roy (2004), On constraint sampling in the linear programming
approach to approximate dynamic programming, Mathematics of Operations Research
29(3), 462–478.
E. Delage and D. A. Iancu (2015), Robust multistage decision making, INFORMS Tutorials
in Operations Research pp. 20–46.
E. Delage and Y. Ye (2010), Distributionally robust optimization under moment uncertainty
with application to data-driven problems, Operations Research 58(3), 595–612.
E. Delage, D. Kuhn and W. Wiesemann (2019), “Dice”-sion–making under uncertainty:
When can a random decision reduce risk?, Management Science 65(7), 3282–3301.
F. Delbaen (2002), Coherent risk measures on general probability spaces, in Advances in
Finance and Stochastics: Essays in Honour of Dieter Sondermann (K. Sandmann and
P. J. Schönbucher, eds), Springer, pp. 1–37.
A. Dembo and O. Zeitouni (2009), Large Deviations Techniques and Applications,
Springer.
V. DeMiguel and F. J. Nogales (2009), Portfolio selection with robust estimation, Opera-
tions Research 57(3), 560–577.
V. DeMiguel, L. Garlappi and R. Uppal (2009), Optimal versus naive diversification:
How inefficient is the 1/𝑛 portfolio strategy?, The Review of Financial Studies 22(5),
1915–1953.
A. Demontis, M. Melis, M. Pintor, M. Jagielski, B. Biggio, A. Oprea, C. Nita-Rotaru
and F. Roli (2019), Why do adversarial attacks transfer? Explaining transferability of
evasion and poisoning attacks, in USENIX Security Symposium, pp. 321–338.
S. Dereich, M. Scheutzow and R. Schottstedt (2013), Constructive quantization: Approx-
imation by empirical measures, Annales de l’IHP Probabilités et Statistiques 49(4),
1183–1203.
S. Dharmadhikari and K. Joag-Dev (1988), Unimodality, Convexity, and Applications,
Elsevier.
I. Diakonikolas and D. M. Kane (2023), Algorithmic High-Dimensional Robust Statistics,
Cambridge University Press.
I. Diakonikolas, G. Kamath, D. Kane, J. Li, A. Moitra and A. Stewart (2019), Robust
estimators in high-dimensions without the computational intractability, SIAM Journal
on Computing 48(2), 742–864.
M. Z. Diao, K. Balasubramanian, S. Chewi and A. Salim (2023), Forward-backward
Gaussian variational inference via JKO in the Bures-Wasserstein space, in International
Conference on Machine Learning, pp. 7960–7991.
S. G. Dimmock, R. Kouwenberg and P. P. Wakker (2016), Ambiguity attitudes in a large
representative sample, Management Science 62(5), 1363–1380.
X. V. Doan and K. Natarajan (2012), On the complexity of nonoverlapping multivariate
marginal bounds for probabilistic combinatorial optimization problems, Operations
Research 60(1), 138–149.
X. V. Doan, X. Li and K. Natarajan (2015), Robustness to dependency in portfolio optim-
ization using overlapping marginals, Operations Research 63(6), 1468–1488.
V. Dobrić and J. E. Yukich (1995), Asymptotics for transportation cost in high dimensions,
Journal of Theoretical Probability 8(1), 97–118.
S. P. Dokov and D. P. Morton (2005), Second-order lower bounds on the expectation of a
convex function, Mathematics of Operations Research 30(3), 662–677.
D. L. Donoho and R. C. Liu (1988), The "automatic" robustness of minimum distance
functionals, The Annals of Statistics 16(2), 552–586.
M. D. Donsker and S. S. Varadhan (1983), Asymptotic evaluation of certain Markov process
expectations for large time. IV, Communications on Pure and Applied Mathematics
36(2), 183–212.
D. Dowson and B. Landau (1982), The Fréchet distance between multivariate normal
distributions, Journal of Multivariate Analysis 12(3), 450–455.
J. C. Doyle, K. Glover, P. Khargonekar and B. Francis (1989), State-space solutions to
standard H2 and H∞ control problems, IEEE Transactions on Automatic Control 34(8),
831–847.
J. C. Duchi and H. Namkoong (2019), Variance-based regularization with convex object-
ives, Journal of Machine Learning Research 20(68), 1–55.
J. C. Duchi and H. Namkoong (2021), Learning models with uniform performance via
distributionally robust optimization, The Annals of Statistics 49(3), 1378–1406.
J. C. Duchi, P. W. Glynn and H. Namkoong (2021), Statistics of robust optimization: A
generalized empirical likelihood approach, Mathematics of Operations Research 46(3),
946–969.
J. Duchi, T. Hashimoto and H. Namkoong (2023), Distributionally robust losses for latent
covariate mixtures, Operations Research 71(2), 649–664.
R. M. Dudley (1969), The speed of mean Glivenko-Cantelli convergence, The Annals of
Mathematical Statistics 40(1), 40–50.
J. H. Dulá and R. V. Murthy (1992), A Tchebysheff-type bound on the expectation of
sublinear polyhedral functions, Operations Research 40(5), 914–922.
G. E. Dullerud and F. Paganini (2001), A Course in Robust Control Theory: A Convex
Approach, Springer.
J. Dupačová (2006), Stress testing via contamination, in Coping with Uncertainty: Model-
ing and Policy Issues (K. Marti, Y. Ermoliev, M. Makowski and G. Pflug, eds), Springer,
pp. 29–46.
J. Dupačová and R. Wets (1988), Asymptotic behavior of statistical estimators and of
optimal solutions of stochastic optimization problems, The Annals of Statistics 16(4),
1517–1549.
J. Dupačová (1966), On minimax solutions of stochastic linear programming problems,
Časopis pro pěstování matematiky 91(4), 423–430.
J. Dupačová (1987), The minimax approach to stochastic programming and an illustrative
application, Stochastics 20(1), 73–88.
J. Dupačová (1994), Applications of stochastic programming under incomplete informa-
tion, Journal of Computational and Applied Mathematics 56(1–2), 113–125.
P. Dupuis and Y. Mao (2022), Formulation and properties of a divergence used to compare
probability measures without absolute continuity, ESAIM: Control, Optimisation and
Calculus of Variations 28, Article 10.
D. Duque and D. P. Morton (2020), Distributionally robust stochastic dual dynamic pro-
gramming, SIAM Journal on Optimization 30(4), 2841–2865.
M. Dyer and L. Stougie (2006), Computational complexity of stochastic programming
problems, Mathematical Programming 106(3), 423–432.
H. Edmundson (1956), Bounds on the expectation of a convex function of a random
variable, Technical report, The Rand Corporation Paper 982, Santa Monica, California.
L. El Ghaoui and H. Lebret (1998a), Robust optimization of control systems: A convex
approach, IEEE Transactions on Automatic Control 43(3), 309–319.
L. El Ghaoui and H. Lebret (1998b), Robust solutions to least-squares problems with
uncertain data, SIAM Journal on Matrix Analysis and Applications 18(4), 1035–1064.
L. El Ghaoui, M. Oks and F. Oustry (2003), Worst-case value-at-risk and robust portfolio
optimization: A conic programming approach, Operations Research 51(4), 543–556.
L. El Ghaoui, F. Oustry and H. Lebret (1998), Robust solutions to uncertain semidefinite
programs, SIAM Journal on Optimization 9(1), 33–52.
R. S. Ellis (2007), Entropy, Large Deviations, and Statistical Mechanics, Springer.
D. Ellsberg (1961), Risk, ambiguity, and the Savage axioms, Quarterly Journal of Eco-
nomics 75(4), 643–669.
P. Embrechts and G. Puccetti (2006), Bounds for functions of multivariate risks, Journal
of Multivariate Analysis 97(2), 526–547.
L. G. Epstein and J. Miao (2003), A two-person dynamic equilibrium under ambiguity,
Journal of Economic Dynamics and Control 27(7), 1253–1288.
E. Erdoğan and G. Iyengar (2006), Ambiguous chance constrained problems and robust
optimization, Mathematical Programming 107(1-2), 37–61.
Y. Ermoliev, A. A. Gaivoronski and C. Nedeva (1985), Stochastic optimization problems
with incomplete information on distribution functions, SIAM Journal on Control and
Optimization 23(5), 697–716.
A. Esteban-Pérez and J. M. Morales (2022), Distributionally robust stochastic programs
with side information based on trimmings, Mathematical Programming 195(1), 1069–
1105.
F. Farnia and D. Tse (2016), A minimax approach to supervised learning, in Advances in
Neural Information Processing Systems, pp. 4240–4248.
W. Fenchel (1953), Convex Cones, Sets, and Functions, Princeton University Press.
C. Finlay and A. M. Oberman (2021), Scaleable input gradient regularization for adversarial
robustness, Machine Learning with Applications 3, Article 100017.
G. B. Folland (1999), Real Analysis: Modern Techniques and Their Applications, John
Wiley & Sons.
H. Föllmer and A. Schied (2008), Stochastic Finance. An Introduction in Discrete Time,
de Gruyter.
N. Fournier (2022), Convergence of the empirical measure in expected Wasserstein dis-
tance: Non asymptotic explicit bounds in R^d, arXiv:2209.00923.
N. Fournier and A. Guillin (2015), On the rate of convergence in Wasserstein distance of
the empirical measure, Probability Theory and Related Fields 162(3), 707–738.
N. Frank and J. Niles-Weed (2024a), The adversarial consistency of surrogate risks for
binary classification, in Advances in Neural Information Processing Systems, pp. 41343–
41354.
N. S. Frank and J. Niles-Weed (2024b), Existence and minimax theorems for adversarial
surrogate risks in binary classification, Journal of Machine Learning Research 25(58),
1–41.
K. Frauendorfer (1992), Stochastic Two-Stage Programming, Springer.
M. Fréchet (1935), Généralisation du théoreme des probabilités totales, Fundamenta Math-
ematicae 25(1), 379–387.
A. A. Gaivoronski (1991), A numerical method for solving stochastic programming prob-
lems with moment constraints on a distribution function, Annals of Operations Research
31(1), 347–370.
G. Gallego and I. Moon (1993), The distribution free newsboy problem: Review and
extensions, The Journal of the Operational Research Society 44(8), 825–834.
A. Ganguly and T. Sutter (2023), Optimal learning via moderate deviations theory,
arXiv:2305.14496.
R. Gao (2023), Finite-sample guarantees for Wasserstein distributionally robust optimiza-
tion: Breaking the curse of dimensionality, Operations Research 71(6), 2291–2306.
R. Gao and A. J. Kleywegt (2023), Distributionally robust stochastic optimization with
Wasserstein distance, Mathematics of Operations Research 48(2), 603–655.
R. Gao, X. Chen and A. J. Kleywegt (2017), Wasserstein distributional robustness and
regularization in statistical learning, arXiv:1712.06050.
R. Gao, X. Chen and A. J. Kleywegt (2024), Wasserstein distributionally robust optimiza-
tion and variation regularization, Operations Research 72(3), 1177–1191.
R. Gao, L. Xie, Y. Xie and H. Xu (2018), Robust hypothesis testing using Wasserstein
uncertainty sets, in Advances in Neural Information Processing Systems, pp. 7902–7912.
C. A. García Trillos and N. García Trillos (2022), On the regularized risk of distributionally
robust learning over deep neural networks, Research in the Mathematical Sciences 9(3),
54.
N. García Trillos and M. Jacobs (2023), An analytical and geometric perspective on
adversarial robustness, Notices of the American Mathematical Society 70(8), 1193–
1204.
N. García Trillos and R. Murray (2022), Adversarial classification: Necessary conditions
and geometric flows, Journal of Machine Learning Research 23(187), 1–38.
N. García Trillos, M. Jacobs and J. Kim (2023), The multimarginal optimal transport for-
mulation of adversarial multiclass classification, Journal of Machine Learning Research
24(45), 1–56.
H. Gassmann and W. Ziemba (1986), A tight upper bound for the expectation of a con-
vex function of a multivariate random variable, in Stochastic Programming 84 Part I
(A. Prékopa and R. J.-B. Wets, eds), Vol. 27, Springer, pp. 39–53.
M. Gelbrich (1990), On a formula for the 𝐿 2 Wasserstein metric between measures on
Euclidean and Hilbert spaces, Mathematische Nachrichten 147(1), 185–203.
G. Georgakopoulos, D. Kavvadias and C. H. Papadimitriou (1988), Probabilistic satisfiab-
ility, Journal of Complexity 4(1), 1–11.
R. Ghanem, D. Higdon and H. Owhadi (2017), Handbook of Uncertainty Quantification,
Springer.
S. Ghosh, M. Squillante and E. Wollega (2021), Efficient stochastic gradient descent for
learning with distributionally robust optimization, in Advances in Neural Information
Processing Systems, pp. 28310–28322.
I. Gilboa and D. Schmeidler (1989), Maxmin expected utility with a non-unique prior,
Journal of Mathematical Economics 18(2), 141–153.
C. Givens and R. Shortt (1984), A class of Wasserstein metrics for probability distributions,
The Michigan Mathematical Journal 31(2), 231–240.
M. Goerigk and J. Kurtz (2023), Data-driven robust optimization using deep neural net-
works, Computers & Operations Research 151, Article 106087.
I. J. Goodfellow, J. Shlens and C. Szegedy (2015), Explaining and harnessing adversarial
examples, in International Conference on Learning Representations.
J.-y. Gotoh, M. J. Kim and A. E. Lim (2018), Robust empirical optimization is almost the
same as mean–variance optimization, Operations Research Letters 46(4), 448–452.
J.-y. Gotoh, M. J. Kim and A. E. Lim (2021), Calibration of distributionally robust empirical
optimization models, Operations Research 69(5), 1630–1650.
N. Gravin and P. Lu (2018), Separation in correlation-robust monopolist problem with
budget, in SIAM Symposium on Discrete Algorithms, pp. 2069–2080.
M. Green and D. J. N. Limebeer (1995), Linear Robust Control, Prentice Hall.
G. Gül and A. M. Zoubir (2017), Minimax robust hypothesis testing, IEEE Transactions
on Information Theory 63(9), 5572–5587.
I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin and A. Courville (2017), Improved
training of Wasserstein GANs, in Advances in Neural Information Processing Systems,
pp. 5769–5779.
V. Gupta (2019), Near-optimal Bayesian ambiguity sets for distributionally robust optim-
ization, Management Science 65(9), 4242–4260.
M. Gürbüzbalaban, A. Ruszczyński and L. Zhu (2022), A stochastic subgradient method for
distributionally robust non-convex and non-smooth learning, Journal of Optimization
Theory and Applications 194(3), 1014–1041.
J. Hajar, T. Kargin and B. Hassibi (2023), Wasserstein distributionally robust regret-optimal
control under partial observability, in Allerton Conference on Communication, Control,
and Computing, pp. 1–6.
A. Hakobyan and I. Yang (2024), Wasserstein distributionally robust control of partially
observable linear stochastic systems, IEEE Transactions on Automatic Control 69(9),
6121–6136.
H. Hamburger (1920), Über eine Erweiterung des Stieltjesschen Momentenproblems,
Mathematische Annalen 81(2), 235–319.
F. R. Hampel (1968), Contributions to the theory of robust estimation, Technical report,
University of California, Berkeley.
F. R. Hampel (1971), A general qualitative definition of robustness, The Annals of Math-
ematical Statistics 42(6), 1887–1896.
B. Han, C. Shang and D. Huang (2021), Multiple kernel learning-aided robust optimiza-
tion: Learning algorithm, computational tractability, and usage in multi-stage decision-
making, European Journal of Operational Research 292(3), 1004–1018.
S. Han, M. Tao, U. Topcu, H. Owhadi and R. M. Murray (2015), Convex optimal uncertainty
quantification, SIAM Journal on Optimization 25(3), 1368–1387.
G. A. Hanasusanto and D. Kuhn (2013), Robust data-driven dynamic programming, in
Advances in Neural Information Processing Systems, pp. 827–835.
G. A. Hanasusanto and D. Kuhn (2018), Conic programming reformulations of two-stage
distributionally robust linear programs over Wasserstein balls, Operations Research
66(3), 849–869.
G. A. Hanasusanto, D. Kuhn and W. Wiesemann (2016), A comment on “Computational
complexity of stochastic programming problems”, Mathematical Programming 159(1-
2), 557–569.
G. A. Hanasusanto, D. Kuhn, S. W. Wallace and S. Zymler (2015a), Distributionally robust
multi-item newsvendor problems with multimodal demand distributions, Mathematical
Programming 152(1), 1–32.
G. A. Hanasusanto, V. Roitch, D. Kuhn and W. Wiesemann (2015b), A distributionally
robust perspective on uncertainty quantification and chance constrained programming,
Mathematical Programming 151(1), 35–62.
L. P. Hansen and T. J. Sargent (2008), Robustness, Princeton University Press.
L. P. Hansen and T. J. Sargent (2010), Wanting robustness in macroeconomics, in Handbook
of Monetary Economics (B. M. Friedman and M. Woodford, eds), Vol. 3, Elsevier,
chapter 20, pp. 1097–1157.
C. A. Hartley and L. H. Somerville (2015), The neuroscience of adolescent decision-
making, Current Opinion in Behavioral Sciences 5, 108–115.
J. Hartung (1982), An extension of Sion’s minimax theorem with an application to a method
for constrained games, Pacific Journal of Mathematics 103(2), 401–408.
T. Hastie, R. Tibshirani and J. Friedman (2009), The Elements of Statistical Learning:
Data Mining, Inference, and Prediction, Springer.
F. Hausdorff (1923), Momentprobleme für ein endliches Intervall, Mathematische Zeits-
chrift 16(1), 220–248.
B. Hayden, S. Heilbronner and M. Platt (2010), Ambiguity aversion in rhesus macaques,
Frontiers in Neuroscience 4.
E. Hazan (2022), Introduction to Online Convex Optimization, MIT Press.
Q. He, G. Xue, C. Chen, Z. Lu, Q. Dong, X. Lei, N. Ding, J. Li, H. Li, C. Chen, J. Li, R. K.
Moyzis and A. Bechara (2010), Serotonin transporter gene-linked polymorphic region
(5-HTTLPR) influences decision making under ambiguity and risk in a large Chinese
sample, Neuropharmacology 59(6), 518–526.
S. He and H. Lam (2021), Higher-order expansion and Bartlett correctability of distribu-
tionally robust optimization, arXiv:2108.05908.
J. P. Hespanha (2019), Linear Systems Theory, Princeton University Press.
N. Ho-Nguyen and F. Kılınç-Karzan (2018), Online first-order framework for robust convex
optimization, Operations Research 66(6), 1670–1692.
N. Ho-Nguyen and F. Kılınç-Karzan (2019), Exploiting problem structure in optimization
under uncertainty via online convex optimization, Mathematical Programming 177(1),
113–147.
N. Ho-Nguyen and S. J. Wright (2023), Adversarial classification via distributional robust-
ness with Wasserstein ambiguity, Mathematical Programming 198(2), 1411–1447.
N. Ho-Nguyen, F. Kılınç-Karzan, S. Küçükyavuz and D. Lee (2022), Distributionally
robust chance-constrained programs with right-hand side uncertainty under Wasserstein
ambiguity, Mathematical Programming 196(1–2), 641–672.
P. Honeyman, R. E. Ladner and M. Yannakakis (1980), Testing the universal instance
assumption, Information Processing Letters 10(1), 14–19.
L. J. Hong, Z. Huang and H. Lam (2021), Learning-based robust optimization: Procedures
and statistical guarantees, Management Science 67(6), 3447–3467.
R. A. Horn and C. R. Johnson (1985), Matrix Analysis, Cambridge University Press.
S. Hou, P. Kassraie, A. Kratsios, A. Krause and J. Rothfuss (2023), Instance-dependent
generalization bounds via optimal transport, Journal of Machine Learning Research
24(1), 16815–16865.
M. Hsu, M. Bhatt, R. Adolphs, D. Tranel and C. F. Camerer (2005), Neural systems
responding to degrees of uncertainty in human decision-making, Science 310(5754),
1680–1683.
Y. Hu, X. Chen and N. He (2021), On the bias-variance-cost tradeoff of stochastic optim-
ization, in Advances in Neural Information Processing Systems, pp. 22119–22131.
Y. Hu, J. Wang, X. Chen and N. He (2024), Multi-level Monte-Carlo gradient methods for
stochastic optimization with biased oracles, arXiv:2408.11084.
Z. Hu and L. J. Hong (2013), Kullback-Leibler divergence constrained distributionally
robust optimization, Available from Optimization Online.
Z. Hu, L. J. Hong and A. M.-C. So (2013), Ambiguous probabilistic programs, Available
from Optimization Online.
K. Huang, H. Yang, I. King, M. R. Lyu and L. Chan (2004), The minimum error minimax
probability machine, Journal of Machine Learning Research 5, 1253–1286.
P. J. Huber (1981), Robust Statistics, Wiley.
P. J. Huber (1964), Robust estimation of a location parameter, The Annals of Mathematical
Statistics 35(1), 73–101.
P. J. Huber (1967), The behavior of maximum likelihood estimates under nonstandard
conditions, in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics
and Probability, pp. 221–233.
P. J. Huber (1968), Robust confidence limits, Zeitschrift für Wahrscheinlichkeitstheorie
und verwandte Gebiete 10(4), 269–278.
H. Husain (2020), Distributional robustness with IPMs and links to regularization and
GANs, in Advances in Neural Information Processing Systems, pp. 11816–11827.
K. Isii (1960), The extrema of probability determined by generalized moments (I) Bounded
random variables, Annals of the Institute of Statistical Mathematics 12(2), 119–134.
K. Isii (1962), On sharpness of Tchebycheff-type inequalities, Annals of the Institute of
Statistical Mathematics 14(1), 185–197.
G. Iyengar, H. Lam and T. Wang (2022), Hedging complexity in generalization via a
parametric distributionally robust optimization framework, arXiv:2212.01518.
R. Jagannathan (1977), Minimax procedure for a class of linear programs under uncertainty,
Operations Research 25(1), 173–177.
D. Jakubovitz and R. Giryes (2018), Improving DNN robustness to adversarial attacks using
Jacobian regularization, in European Conference on Computer Vision, pp. 514–529.
S. L. Janak, X. Lin and C. A. Floudas (2007), A new robust optimization approach
for scheduling under uncertainty: II. Uncertainty with known probability distribution,
Computers & Chemical Engineering 31(3), 171–195.
H. Jeffreys and D. Wrinch (1921), On certain fundamental principles of scientific enquiry,
Philosophical Magazine 42, 269–298.
J. L. W. V. Jensen (1906), Sur les fonctions convexes et les inégalités entre les valeurs
moyennes, Acta Mathematica 30(1), 175–193.
N. Jiang and W. Xie (2024), Distributionally favorable optimization: A framework for
data-driven decision-making with endogenous outliers, SIAM Journal on Optimization
34(1), 419–458.
R. Jiang and Y. Guan (2016), Data-driven chance constrained stochastic program, Math-
ematical Programming 158(1), 291–327.
R. Jiang and Y. Guan (2018), Risk-averse two-stage stochastic program with distributional
ambiguity, Operations Research 66(5), 1390–1405.
Y. Jiang and J. Obloj (2024), Sensitivity of causal distributionally robust optimization,
arXiv:2408.17109.
Y. Jiang, S. Chewi and A.-A. Pooladian (2024), Algorithms for mean-field variational in-
ference via polyhedral optimization in the Wasserstein space, in Conference on Learning
Theory, pp. 2720–2721.
W. Jongeneel, T. Sutter and D. Kuhn (2021), Topological linear system identification via
moderate deviations theory, IEEE Control Systems Letters 6, 307–312.
W. Jongeneel, T. Sutter and D. Kuhn (2022), Efficient learning of a linear dynamical system
with stability guarantees, IEEE Transactions on Automatic Control 68(5), 2790–2804.
H. Jylhä (2015), The L∞ optimal transport: Infinite cyclical monotonicity and the existence
of optimal transport maps, Calculus of Variations and Partial Differential Equations 52,
303–326.
O. Kallenberg (1997), Foundations of Modern Probability, Springer.
T. Kargin, J. Hajar, V. Malik and B. Hassibi (2024a), The distributionally robust infinite-
horizon LQR, arXiv:2408.06230.
T. Kargin, J. Hajar, V. Malik and B. Hassibi (2024b), Distributionally robust Kalman
filtering over finite and infinite horizon, arXiv:2407.18837.
T. Kargin, J. Hajar, V. Malik and B. Hassibi (2024c), Infinite-horizon distributionally robust
regret-optimal control, in International Conference on Machine Learning, pp. 23187–
23214.
T. Kargin, J. Hajar, V. Malik and B. Hassibi (2024d), Wasserstein distributionally ro-
bust regret-optimal control over infinite-horizon, in Learning for Dynamics & Control
Conference, pp. 1688–1701.
S. Karlin and W. J. Studden (1966), Tchebycheff Systems: With Applications in Analysis
and Statistics, Interscience Publishers.
N. Karmarkar (1984), A new polynomial-time algorithm for linear programming, Combin-
atorica 4(4), 373–395.
J. E. Kelley, Jr (1960), The cutting-plane method for solving convex programs, Journal of
the Society for Industrial and Applied Mathematics 8(4), 703–712.
C. Kent, J. Li, J. Blanchet and P. W. Glynn (2021), Modified Frank Wolfe in probability
space, in Advances in Neural Information Processing Systems, pp. 14448–14462.
J. M. Keynes (1921), A Treatise on Probability, Macmillan.
L. G. Khachiyan (1979), A polynomial algorithm in linear programming, Doklady Akademii
Nauk 244(5), 1093–1096.
H. K. Khalil (1996), Control System Analysis and Design with Advanced Design Tools,
Prentice Hall.
A. J. King and R. T. Rockafellar (1993), Asymptotic theory for solutions in statistical
estimation and stochastic programming, Mathematics of Operations Research 18(1),
148–162.
A. J. King and R. J.-B. Wets (1991), Epi-consistency of convex stochastic programs,
Stochastics and Stochastic Reports 34(1-2), 83–92.
D. Klabjan, D. Simchi-Levi and M. Song (2013), Robust stochastic lot-sizing by means of
histograms, Production and Operations Management 22(3), 691–710.
F. H. Knight (1921), Risk, Uncertainty and Profit, Houghton Mifflin.
Ç. Koçyiğit, G. Iyengar, D. Kuhn and W. Wiesemann (2020), Distributionally robust
mechanism design, Management Science 66(1), 159–189.
Ç. Koçyiğit, N. Rujeerapaiboon and D. Kuhn (2022), Robust multidimensional pricing:
Separation without regret, Mathematical Programming 196(1–2), 841–874.
V. Koltchinskii (2011), Oracle Inequalities in Empirical Risk Minimization and Sparse
Recovery Problems, Springer.
P. Kouvelis and G. Yu (1997), Robust Discrete Optimization and Its Applications, Springer.
A. L. Krain, A. M. Wilson, R. Arbuckle, X. F. Castellanos and M. P. Milham (2006),
Distinct neural mechanisms of risk and ambiguity: A meta-analysis of decision-making,
NeuroImage 32(1), 477–484.
S. G. Krantz and H. R. Parks (2002), A Primer of Real Analytic Functions, Springer.
D. Kuhn (2005), Generalized Bounds for Convex Multistage Stochastic Programs, Springer.
D. Kuhn, P. Mohajerin Esfahani, V. A. Nguyen and S. Shafieezadeh-Abadeh (2019),
Wasserstein distributionally robust optimization: Theory and applications in machine
learning, INFORMS Tutorials in Operations Research pp. 130–166.
S. Kullback (1959), Information Theory and Statistics, Wiley.
M. Kupper and W. Schachermayer (2009), Representation results for law invariant time
consistent functions, Mathematics and Financial Economics 2(3), 189–210.
A. Kurakin, I. J. Goodfellow and S. Bengio (2022), Adversarial machine learning at scale,
in International Conference on Learning Representations.
S. Kusuoka (2001), On law invariant coherent risk measures, in Advances in Mathematical
Economics (S. Kusuoka and T. Maruyama, eds), Springer, pp. 83–95.
Y. Kwon, W. Kim, J.-H. Won and M. C. Paik (2020), Principled learning method for
Wasserstein distributionally robust optimization with local perturbations, in Interna-
tional Conference on Machine Learning, pp. 5567–5576.
D. N. Lal (1955), A note on a form of Tchebycheff’s inequality for two or more variables,
Sankhyā: The Indian Journal of Statistics 15(3), 317–320.
H. Lam (2016), Robust sensitivity analysis for stochastic systems, Mathematics of Opera-
tions Research 41(4), 1248–1275.
H. Lam (2018), Sensitivity to serial dependency of input processes: A robust approach,
Management Science 64(3), 1311–1327.
H. Lam (2019), Recovering best statistical guarantees via the empirical divergence-based
distributionally robust optimization, Operations Research 67(4), 1090–1105.
H. Lam (2021), On the impossibility of statistically improving empirical optimization: A
second-order stochastic dominance perspective, arXiv:2105.13419.
H. Lam and C. Mottet (2017), Tail analysis without parametric models: A worst-case
perspective, Operations Research 65(6), 1696–1711.
H. Lam and E. Zhou (2017), The empirical likelihood approach to quantifying uncertainty
in sample average approximation, Operations Research Letters 45(4), 301–307.
H. Lam, Z. Liu and D. I. Singham (2024), Shape-constrained distributional optimization
via importance-weighted sample average approximation, arXiv:2406.07825.
H. Lam, Z. Liu and X. Zhang (2021), Orthounimodal distributionally robust optim-
ization: Representation, computation and multivariate extreme event applications,
arXiv:2111.07894.
M. Lambert, S. Chewi, F. Bach, S. Bonnabel and P. Rigollet (2022), Variational inference
via Wasserstein gradient flows, in Advances in Neural Information Processing Systems,
pp. 14434–14447.
G. R. Lanckriet, L. El Ghaoui, C. Bhattacharyya and M. I. Jordan (2001), Minimax
probability machine, in Advances in Neural Information Processing Systems, pp. 801–
807.
G. R. Lanckriet, L. El Ghaoui, C. Bhattacharyya and M. I. Jordan (2002), A robust minimax
approach to classification, Journal of Machine Learning Research 3, 555–582.
N. Lanzetti, S. Bolognani and F. Dörfler (2022), First-order conditions for optimization in
the Wasserstein space, arXiv:2209.12197.
N. Lanzetti, A. Terpin and F. Dörfler (2024), Variational analysis in the Wasserstein space,
arXiv:2406.10676.
J. B. Lasserre (2001), Global optimization with polynomials and the problem of moments,
SIAM Journal on Optimization 11(3), 796–817.
J. B. Lasserre (2002), Bounds on measures satisfying moment conditions, The Annals of
Applied Probability 12(3), 1114–1137.
J. B. Lasserre (2008), A semidefinite programming approach to the generalized problem
of moments, Mathematical Programming 112(1), 65–92.
J. B. Lasserre (2009), Moments, Positive Polynomials and Their Applications, World
Scientific.
J. B. Lasserre and T. Weisser (2021), Distributionally robust polynomial chance-constraints
under mixture ambiguity sets, Mathematical Programming 185(1-2), 409–453.
T. T.-K. Lau and H. Liu (2022), Wasserstein distributionally robust optimization with
Wasserstein barycenters, arXiv:2203.12136.
J. Lee and M. Raginsky (2018), Minimax statistical learning with Wasserstein distances,
in Advances in Neural Information Processing Systems, pp. 2687–2696.
J. Lee, S. Park and J. Shin (2020), Learning bounds for risk-sensitive learning, in Advances
in Neural Information Processing Systems, pp. 13867–13879.
E. L. Lehmann and G. Casella (2006), Theory of Point Estimation, Springer.
E. S. Levitin and B. T. Polyak (1966), Constrained minimization methods, USSR Compu-
tational Mathematics and Mathematical Physics 6(5), 1–50.
B. C. Levy (2008), Robust hypothesis testing with a relative entropy tolerance, IEEE
Transactions on Information Theory 55(1), 413–421.
B. C. Levy and R. Nikoukhah (2004), Robust least-squares estimation with a relative
entropy constraint, IEEE Transactions on Information Theory 50(1), 89–104.
B. C. Levy and R. Nikoukhah (2012), Robust state space filtering under incremental model
perturbations subject to a relative entropy tolerance, IEEE Transactions on Automatic
Control 58(3), 682–695.
D. Levy, Y. Carmon, J. C. Duchi and A. Sidford (2020), Large-scale methods for distri-
butionally robust optimization, in Advances in Neural Information Processing Systems,
pp. 8847–8860.
B. Li, R. Jiang and J. L. Mathieu (2016), Distributionally robust risk-constrained optimal
power flow using moment and unimodality information, in IEEE Conference on Decision
and Control, pp. 2425–2430.
B. Li, R. Jiang and J. L. Mathieu (2019a), Ambiguous risk constraints with moment and
unimodality information, Mathematical Programming 173(1-2), 151–192.
C. Li, U. Turmunkh and P. P. Wakker (2019b), Trust as a decision under ambiguity,
Experimental Economics 22(1), 51–75.
D. Li and S. Martínez (2020), Data assimilation and online optimization with performance
guarantees, IEEE Transactions on Automatic Control 66(5), 2115–2129.
J. Li, C. Chen and A. M.-C. So (2020), Fast epigraphical projection-based incremental
algorithms for Wasserstein distributionally robust support vector machine, in Advances
in Neural Information Processing Systems, pp. 4029–4039.
J. Li, S. Huang and A. M.-C. So (2019c), A first-order algorithmic framework for Wasser-
stein distributionally robust logistic regression, in Advances in Neural Information Pro-
cessing Systems, pp. 3937–3947.
J. Li, S. Lin, J. Blanchet and V. A. Nguyen (2022), Tikhonov regularization is optimal trans-
port robust under martingale constraints, in Advances in Neural Information Processing
Systems, pp. 17677–17689.
J. Y.-M. Li (2018), Closed-form solutions for worst-case law invariant risk measures with
application to robust portfolio optimization, Operations Research 66(6), 1533–1541.
J. Y.-M. Li and T. Mao (2022), A general Wasserstein framework for data-driven distribu-
tionally robust optimization: Tractability and applications, arXiv:2207.09403.
M. Li, T. Sutter and D. Kuhn (2021), Distributionally robust optimization with Markovian
data, in International Conference on Machine Learning, pp. 6493–6503.
Z. Li, R. Ding and C. A. Floudas (2011), A comparative theoretical and computational
study on robust counterpart optimization: I. Robust linear optimization and robust
mixed integer linear optimization, Industrial & Engineering Chemistry Research 50(18),
10567–10603.
F. Liese and I. Vajda (1987), Convex Statistical Distances, Teubner.
S. Lin, J. Blanchet, P. Glynn and V. A. Nguyen (2024), Small sample behavior of
Wasserstein projections, connections to empirical likelihood, and other applications,
arXiv:2408.11753.
F. Liu, Z. Chen, R. Wang and S. Wang (2024a), Newsvendor under mean-variance ambi-
guity and misspecification, arXiv:2405.07008.
J. Liu, Z. Su and H. Xu (2024b), Bayesian distributionally robust Nash equilibrium and its
application, arXiv:2410.20364.
Z. Liu and P.-L. Loh (2023), Robust W-GAN-based estimation under Wasserstein contam-
ination, Information and Inference: A Journal of the IMA 12(1), 312–362.
Z. Liu, B. P. Van Parys and H. Lam (2023), Smoothed f-divergence distributionally
robust optimization: Exponential rate efficiency and complexity-free calibration,
arXiv:2306.14041.
D. Z. Long, J. Qi and A. Zhang (2024), Supermodularity in two-stage distributionally
robust optimization, Management Science 70(3), 1394–1409.
C. Lyu, K. Huang and H.-N. Liang (2015), A unified gradient regularization family for
adversarial examples, in International Conference on Data Mining, pp. 301–309.
A. Madansky (1959), Bounds on the expectation of a convex function of a multivariate
random variable, The Annals of Mathematical Statistics 30(3), 743–746.
A. Mądry, A. Makelov, L. Schmidt, D. Tsipras and A. Vladu (2018), Towards deep learn-
ing models resistant to adversarial attacks, in International Conference on Learning
Representations.
C. Maheshwari, C.-Y. Chiu, E. Mazumdar, S. Sastry and L. Ratliff (2022), Zeroth-order
methods for convex-concave min-max problems: Applications to decision-dependent
risk minimization, in International Conference on Artificial Intelligence and Statistics,
pp. 6702–6734.
H.-Y. Mak, Y. Rong and J. Zhang (2015), Appointment scheduling with limited distribu-
tional information, Management Science 61(2), 316–334.
A. Markov (1884), On certain applications of algebraic continued fractions, PhD thesis, St
Petersburg (in Russian).
A. W. Marshall and I. Olkin (1960), A one-sided inequality of the Chebyshev type, The
Annals of Mathematical Statistics 31(2), 488–491.
K. Marton (1986), A simple proof of the blowing-up lemma, IEEE Transactions on In-
formation Theory 32(3), 445–446.
A. Maurer and M. Pontil (2009), Empirical Bernstein bounds and sample variance penal-
ization, in Conference on Learning Theory.
R. D. McAllister and P. Mohajerin Esfahani (2023), Distributionally robust model predictive
control: Closed-loop guarantees and scalable algorithms, arXiv:2309.12758.
A. McNeil, R. Frey and P. Embrechts (2015), Quantitative Risk Management: Concepts,
Techniques and Tools, Princeton University Press.
S. Mendelson (2003), A few notes on statistical learning theory, in Advanced Lectures on
Machine Learning (S. Mendelson and A. J. Smola, eds), Springer, pp. 1–40.
R. O. Michaud (1989), The Markowitz optimization enigma: Is ‘optimized’ optimal?,
Financial Analysts Journal 45(1), 31–42.
J. Milz and M. Ulbrich (2020), An approximation scheme for distributionally robust non-
linear optimization, SIAM Journal on Optimization 30(3), 1996–2025.
J. Milz and M. Ulbrich (2022), An approximation scheme for distributionally robust PDE-
constrained optimization, SIAM Journal on Control and Optimization 60(3),1410–1435.
V. K. Mishra, K. Natarajan, D. Padmanabhan, C.-P. Teo and X. Li (2014), On theoretical
and empirical aspects of marginal distribution choice models, Management Science
60(6), 1511–1531.
V. K. Mishra, K. Natarajan, H. Tao and C.-P. Teo (2012), Choice prediction with semidefin-
ite optimization when utilities are correlated, IEEE Transactions on Automatic Control
57(10), 2450–2463.
P. Mohajerin Esfahani and D. Kuhn (2018), Data-driven distributionally robust optimization
using the Wasserstein metric: Performance guarantees and tractable reformulations,
Mathematical Programming 171(1), 115–166.
P. Mohajerin Esfahani, S. Shafieezadeh-Abadeh, G. A. Hanasusanto and D. Kuhn (2018),
Data-driven inverse optimization with imperfect information, Mathematical Program-
ming 167(1), 191–234.
P. Mohajerin Esfahani, T. Sutter and J. Lygeros (2015), Performance bounds for the scenario
approach and an extension to a class of non-convex programs, IEEE Transactions on
Automatic Control 60(1), 46–58.
J. R. Munkres (2000), Topology, Prentice Hall.
A. Mutapcic and S. Boyd (2009), Cutting-set methods for robust convex optimization with
pessimizing oracles, Optimization Methods & Software 24(3), 381–406.
V. Nagarajan and J. Z. Kolter (2017), Gradient descent GAN optimization is locally stable,
in Advances in Neural Information Processing Systems, pp. 5591–5600.
H. Nakao, R. Jiang and S. Shen (2021), Distributionally robust partially observable Markov
decision process with moment-based ambiguity, SIAM Journal on Optimization 31(1),
461–488.
H. Namkoong and J. C. Duchi (2016), Stochastic gradient methods for distributionally
robust optimization with f-divergences, in Advances in Neural Information Processing
Systems, pp. 2216–2224.
K. Natarajan (2021), Optimization with Marginals and Moments, Dynamic Ideas.
K. Natarajan and Z. Linyi (2007), A mean–variance bound for a three-piece linear function,
Probability in the Engineering and Informational Sciences 21(4), 611–621.
K. Natarajan, D. Pachamanova and M. Sim (2009a), Constructing risk measures from
uncertainty sets, Operations Research 57(5), 1129–1141.
K. Natarajan, D. Padmanabhan and A. Ramachandra (2023), Distributionally robust op-
timization through the lens of submodularity, arXiv:2312.04890.
K. Natarajan, M. Sim and J. Uichanco (2010), Tractable robust expected utility and risk
models for portfolio optimization, Mathematical Finance 20(4), 695–731.
K. Natarajan, M. Sim and J. Uichanco (2018), Asymmetry and ambiguity in newsvendor
models, Management Science 64(7), 3146–3167.
K. Natarajan, M. Song and C.-P. Teo (2009b), Persistency model and its applications in
choice modeling, Management Science 55(3), 453–469.
K. Natarajan, C. P. Teo and Z. Zheng (2011), Mixed 0-1 linear programs under objective
uncertainty: A completely positive representation, Operations Research 59(3), 713–728.
A. Nemirovski and A. Shapiro (2007), Convex approximations of chance constrained
programs, SIAM Journal on Optimization 17(4), 969–996.
Y. Nesterov and A. Nemirovskii (1994), Interior-Point Polynomial Algorithms in Convex
Programming, SIAM.
D. Nguyen, N. Bui and V. A. Nguyen (2022a), Distributionally robust recourse action, in
International Conference on Learning Representations.
V. A. Nguyen, D. Kuhn and P. Mohajerin Esfahani (2022b), Distributionally robust inverse
covariance estimation: The Wasserstein shrinkage estimator,Operations Research 70(1),
490–515.
V. A. Nguyen, S. Shafiee, D. Filipović and D. Kuhn (2021), Mean-covariance robust risk
measurement, arXiv:2112.09959.
V. A. Nguyen, S. Shafieezadeh-Abadeh, D. Kuhn and P. Mohajerin Esfahani (2023),
Bridging Bayesian and minimax mean square error estimation via Wasserstein
distributionally robust optimization, Mathematics of Operations Research 48(1), 1–37.
V. A. Nguyen, S. Shafieezadeh-Abadeh, M.-C. Yue, D. Kuhn and W. Wiesemann (2019),
Optimistic distributionally robust optimization for nonparametric likelihood approxim-
ation, in Advances in Neural Information Processing Systems, pp. 15872–15882.
V. A. Nguyen, F. Zhang, J. Blanchet, E. Delage and Y. Ye (2020), Distributionally ro-
bust local non-parametric conditional estimation, in Advances in Neural Information
Processing Systems, pp. 15232–15242.
V. A. Nguyen, F. Zhang, S. Wang, J. Blanchet, E. Delage and Y. Ye (2024), Robustifying
conditional portfolio decisions via optimal transport, Operations Research (Forthcom-
ing).
S. Nietert, Z. Goldfeld and S. Shafiee (2024a), Outlier-robust Wasserstein DRO, in Ad-
vances in Neural Information Processing Systems, pp. 62792–62820.
S. Nietert, Z. Goldfeld and S. Shafiee (2024b), Robust distribution learning with local and
global adversarial corruptions, arXiv:2406.06509.
K. G. Nishimura and H. Ozaki (2004), Search and Knightian uncertainty, Journal of
Economic Theory 119(2), 299–333.
K. G. Nishimura and H. Ozaki (2006), An axiomatic approach to ε-contamination, Economic
Theory 27(2), 333–340.
J. L. M. Olea, C. Rush, A. Velez and J. Wiesel (2022), The out-of-sample prediction error
of the square-root-LASSO and related estimators, arXiv:2211.07608.
I. Olkin and F. Pukelsheim (1982), The distance between two random vectors with given
dispersion matrices, Linear Algebra and its Applications 48, 257–263.
C. Ordoudis, V. A. Nguyen, D. Kuhn and P. Pinson (2021), Energy and reserve dispatch
with distributionally robust joint chance constraints, Operations Research Letters 49(3),
291–299.
A. B. Owen (1988), Empirical likelihood ratio confidence intervals for a single functional,
Biometrika 75(2), 237–249.
A. B. Owen (1990), Empirical likelihood ratio confidence regions, The Annals of Statistics
18(1), 90–120.
A. B. Owen (1991), Empirical likelihood for linear models, The Annals of Statistics 19(4),
1725–1747.
A. B. Owen (2001), Empirical Likelihood, Chapman and Hall.
H. Owhadi and C. Scovel (2017), Extreme points of a ball about a measure with finite
support, Communications in Mathematical Sciences 15(1), 77–96.
H. Owhadi, C. Scovel, T. J. Sullivan, M. McKerns and M. Ortiz (2013), Optimal uncertainty
quantification, SIAM Review 55(2), 271–345.
V. M. Panaretos and Y. Zemel (2020), An Invitation to Statistics in Wasserstein Space,
Springer.
P. A. Parrilo (2000), Structured Semidefinite Programs and Semialgebraic Geometry Meth-
ods in Robustness and Optimization, PhD thesis, California Institute of Technology.
P. A. Parrilo (2003), Semidefinite programming relaxations for semialgebraic problems,
Mathematical Programming 96(2), 293–320.
B. Pass (2015), Multi-marginal optimal transport: Theory and applications, ESAIM: Math-
ematical Modelling and Numerical Analysis 49(6), 1771–1790.
S. Peng (1997), Backward SDE and related G-expectation, in Backward Stochastic Dif-
ferential Equations in Finance (N. El Karoui, S. Peng and M. C. Quenez, eds), Wiley,
pp. 141–160.
S. Peng (2007a), G-Brownian motion and dynamic risk measure under volatility uncer-
tainty, arXiv:0711.2834.
S. Peng (2007b), G-expectation, G-Brownian motion and related stochastic calculus of Itô
type, in Stochastic Analysis and Applications (F. E. Benth, G. Di Nunno, T. Lindstrom,
B. Oksendal and T. Zhang, eds), Springer, pp. 541–567.
S. Peng (2019), Nonlinear Expectations and Stochastic Calculus under Uncertainty: With
Robust CLT and G-Brownian Motion, Springer.
S. Peng (2023), G-Gaussian processes under sublinear expectations and q-Brownian motion
in quantum mechanics, Numerical Algebra, Control and Optimization 13(3-4), 583–603.
G. Perakis and G. Roels (2008), Regret in the newsvendor model with partial information,
Operations Research 56(1), 188–203.
S. Pesenti, Q. Wang and R. Wang (2024), Optimizing distortion riskmetrics with distribu-
tional uncertainty, Mathematical Programming (Forthcoming).
G. C. Pflug and A. Pichler (2014), Multistage Stochastic Optimization, Springer.
G. C. Pflug and D. Wozabal (2007), Ambiguity in portfolio selection, Quantitative Finance
7(4), 435–442.
G. C. Pflug, A. Pichler and D. Wozabal (2012), The 1/N investment strategy is optimal
under high model ambiguity, Journal of Banking & Finance 36(2), 410–417.
R. R. Phelps (1965), Lectures on Choquet’s Theorem, van Nostrand Mathematical Studies.
A. B. Philpott, V. L. de Matos and L. Kapelevich (2018), Distributionally robust SDDP,
Computational Management Science 15, 431–454.
A. Pichler (2013), Evaluations of risk measures for different probability measures, SIAM
Journal on Optimization 23(1), 530–551.
I. Pinelis (2016), On the extreme points of moments sets, Mathematical Methods of Oper-
ations Research 83(3), 325–349.
I. Pólik and T. Terlaky (2007), A survey of the S-lemma, SIAM Review 49(3), 371–418.
Y. Polyanskiy and Y. Wu (2024), Information Theory: From Coding to Learning, Cam-
bridge University Press.
I. Popescu (2005), A semidefinite programming approach to optimal-moment bounds for
convex classes of distributions, Mathematics of Operations Research 30(3), 632–657.
I. Popescu (2007), Robust mean-covariance solutions for stochastic optimization, Opera-
tions Research 55(1), 98–112.
K. Postek and S. Shtern (2024), First-order algorithms for robust optimization problems via
convex-concave saddle-point Lagrangian reformulation, INFORMS Journal on Comput-
ing (Forthcoming).
K. Postek, A. Ben-Tal, D. den Hertog and B. Melenberg (2018), Robust optimization with
ambiguous stochastic constraints under mean and dispersion information, Operations
Research 66(3), 814–833.
K. Postek, D. den Hertog and B. Melenberg (2016), Computationally tractable counterparts
of distributionally robust constraints on risk measures, SIAM Review 58(4), 603–650.
K. Postek, W. Romeijnders, D. den Hertog and M. H. van der Vlerk (2019), An approxima-
tion framework for two-stage ambiguous stochastic integer programs under mean-MAD
information, European Journal of Operational Research 274(2), 432–444.
G. Puccetti and L. Rüschendorf (2013), Sharp bounds for sums of dependent risks, Journal
of Applied Probability 50(1), 42–53.
M. S. Pydi and V. Jog (2021), Adversarial risk via optimal transport and optimal couplings,
IEEE Transactions on Information Theory 67(9), 6031–6052.
M. S. Pydi and V. Jog (2024), The many faces of adversarial risk: An expanded study,
IEEE Transactions on Information Theory 70(1), 550–570.
H. Rahimian and S. Mehrotra (2022), Frameworks and results in distributionally robust
optimization, Open Journal of Mathematical Optimization 3, 1–85.
H. Rahimian, G. Bayraksan and T. Homem-de-Mello (2019a), Controlling risk and demand
ambiguity in newsvendor models, European Journal of Operational Research 279(3),
854–868.
H. Rahimian, G. Bayraksan and T. Homem-de-Mello (2019b), Identifying effective scen-
arios in distributionally robust stochastic programs with total variation distance, Math-
ematical Programming 173(1), 393–430.
H. Rahimian, G. Bayraksan and T. Homem-de-Mello (2022), Effective scenarios in
multistage distributionally robust optimization with a focus on total variation distance,
SIAM Journal on Optimization 32(3), 1698–1727.
M. D. Reid and R. C. Williamson (2011), Information, divergence and risk for binary
experiments, Journal of Machine Learning Research 12(22), 731–817.
H. Richter (1957), Parameterfreie Abschätzung und Realisierung von Erwartungswerten,
Blätter der DGVFM 3(2), 147–162.
R. T. Rockafellar (1970), Convex Analysis, Princeton University Press.
R. T. Rockafellar (1974), Conjugate Duality and Optimization, SIAM.
R. T. Rockafellar and J. O. Royset (2013), Superquantiles and their applications to risk,
random variables, and regression, INFORMS Tutorials in Operations Research pp. 151–
167.
R. T. Rockafellar and J. O. Royset (2014), Random variables, monotone relations, and
convex analysis, Mathematical Programming 148(1-2), 297–331.
R. T. Rockafellar and J. O. Royset (2015), Measures of residual risk with connections to re-
gression, risk tracking, surrogate models, and ambiguity, SIAM Journal on Optimization
25(2), 1179–1208.
R. T. Rockafellar and S. Uryasev (2000), Optimization of conditional value-at-risk, Journal
of Risk 2(3), 21–41.
R. T. Rockafellar and S. Uryasev (2002), Conditional value-at-risk for general loss distri-
butions, Journal of Banking & Finance 26(7), 1443–1471.
R. T. Rockafellar and S. Uryasev (2013), The fundamental risk quadrangle in risk man-
agement, optimization and statistical estimation, Surveys in Operations Research and
Management Science 18(1-2), 33–53.
R. T. Rockafellar and R. J.-B. Wets (2009), Variational Analysis, Springer.
R. T. Rockafellar, S. Uryasev and M. Zabarankin (2006), Generalized deviations in risk
analysis, Finance and Stochastics 10(1), 51–74.
R. T. Rockafellar, S. Uryasev and M. Zabarankin (2008), Risk tuning with generalized
linear regression, Mathematics of Operations Research 33(3), 712–729.
W. W. Rogosinski (1958), Moments of non-negative mass, Proceedings of the Royal Society
of London. Series A. Mathematical and Physical Sciences 245(1240), 1–27.
N. Rontsis, M. A. Osborne and P. J. Goulart (2020), Distributionally ambiguous optimiz-
ation for batch Bayesian optimization, Journal of Machine Learning Research 21(149),
1–26.
K. Roth, A. Lucchi, S. Nowozin and T. Hofmann (2017), Stabilizing training of gener-
ative adversarial networks through regularization, in Advances in Neural Information
Processing Systems, pp. 2018–2028.
J. O. Royset (2022), Risk-adaptive approaches to learning and decision making: A survey,
arXiv:2212.00856.
Y. Ruan, X. Li, K. Murthy and K. Natarajan (2022), A nonparametric approach with
marginals for modeling consumer choice, arXiv:2208.06115.
N. Rujeerapaiboon, D. Kuhn and W. Wiesemann (2016), Robust growth-optimal portfolios,
Management Science 62(7), 2090–2109.
N. Rujeerapaiboon, D. Kuhn and W. Wiesemann (2018), Chebyshev inequalities for
products of random variables, Mathematics of Operations Research 43(3), 887–918.
L. Rüschendorf (1983), Solution of a statistical optimization problem by rearrangement
methods, Metrika 30(1), 55–61.
L. Rüschendorf (1991), Fréchet-bounds and their applications, in Advances in Probability
Distributions with Given Marginals: Beyond the Copulas (G. Dall’Aglio, S. Kotz and
G. Salinetti, eds), Springer, pp. 151–187.
L. Rüschendorf (2013), Mathematical Risk Analysis: Dependence, Risk Bounds, Optimal
Allocations and Portfolios, Springer.
B. Rustem and M. Howe (2009), Algorithms for Worst-Case Design and Applications to
Risk Management, Princeton University Press.
A. Ruszczyński (2021), A stochastic subgradient method for nonsmooth nonconvex mul-
tilevel composition optimization, SIAM Journal on Control and Optimization 59(3),
2301–2320.
A. Ruszczyński and A. Shapiro (2006), Optimization of convex risk functions, Mathematics
of Operations Research 31(3), 433–452.
Y. Rychener, A. Esteban-Pérez, J. M. Morales and D. Kuhn (2024), Wasserstein distribu-
tionally robust optimization with heterogeneous data sources, arXiv:2407.13582.
U. Sadana, E. Delage and A. Georghiou (2024), Data-driven decision-making under un-
certainty with entropic risk measure, arXiv:2409.19926.
S. Sagawa, P. W. Koh, T. B. Hashimoto and P. Liang (2020), Distributionally robust
neural networks for group shifts: On the importance of regularization for worst-case
generalization, in International Conference on Learning Representations.
A. A. Salo and M. Weber (1995), Ambiguity aversion in first-price sealed-bid auctions,
Journal of Risk and Uncertainty 11(2), 123–137.
N. Sauldubois and N. Touzi (2024), First order martingale model risk and semi-static
hedging, arXiv:2410.06906.
S. L. Savage (2012), The Flaw of Averages: Why We Underestimate Risk in the Face of
Uncertainty, Wiley.
S. L. Savage, S. Scholtes and D. Zweidler (2006), Probability management, OR/MS Today.
H. Scarf (1958), A min-max solution to an inventory problem, in Studies in Mathematical
Theory of Inventory and Production (K. Arrow, S. Karlin and H. Scarf, eds), Stanford
University Press, pp. 201–209.
G. Schildbach, L. Fagiano and M. Morari (2013), Randomized solutions to convex programs
with multiple chance constraints, SIAM Journal on Optimization 23(4), 2479–2501.
A. Selvi, M. R. Belbasi, M. Haugh and W. Wiesemann (2022), Wasserstein logistic re-
gression with mixed features, in Advances in Neural Information Processing Systems,
pp. 16691–16704.
S. Shafiee and D. Kuhn (2024), Minimax theorems and Nash equilibria in distributionally
robust optimization problems, Working Paper.
S. Shafiee, L. Aolaritei, F. Dörfler and D. Kuhn (2023), New perspectives on regulariz-
ation and computation in optimal transport-based distributionally robust optimization,
arXiv:2303.03900.
S. Shafieezadeh-Abadeh, D. Kuhn and P. Mohajerin Esfahani (2019), Regularization via
mass transportation, Journal of Machine Learning Research 20(103), 1–68.
S. Shafieezadeh-Abadeh, P. Mohajerin Esfahani and D. Kuhn (2015), Distributionally
robust logistic regression, in Advances in Neural Information Processing Systems,
pp. 1576–1584.
S. Shafieezadeh-Abadeh, V. A. Nguyen, D. Kuhn and P. Mohajerin Esfahani (2018),
Wasserstein distributionally robust Kalman filtering, in Advances in Neural Information
Processing Systems, pp. 8474–8483.
S. Shalev-Shwartz (2012), Online learning and online convex optimization, Foundations
and Trends in Machine Learning 4(2), 107–194.
S. Shalev-Shwartz and S. Ben-David (2014), Understanding Machine Learning: From
Theory to Algorithms, Cambridge University Press.
A. Shapiro (1989), Asymptotic properties of statistical estimators in stochastic program-
ming, The Annals of Statistics 17(2), 841–858.
A. Shapiro (1990), On differential stability in stochastic programming, Mathematical
Programming 47(1-3), 107–116.
A. Shapiro (1991), Asymptotic analysis of stochastic programs, Annals of Operations
Research 30(1), 169–186.
A. Shapiro (1993), Asymptotic behavior of optimal solutions in stochastic programming,
Mathematics of Operations Research 18(4), 829–845.
A. Shapiro (2001), On duality theory of conic linear problems, in Semi-Infinite Program-
ming (M. Á. Goberna and M. A. López, eds), Kluwer Academic Publishers, pp. 135–165.
A. Shapiro (2003), Monte Carlo sampling methods, in Stochastic Programming
(A. Ruszczyński and A. Shapiro, eds), Elsevier, pp. 353–425.
A. Shapiro (2013), On Kusuoka representation of law invariant risk measures, Mathematics
of Operations Research 38(1), 142–152.
A. Shapiro (2017), Distributionally robust stochastic programming, SIAM Journal on
Optimization 27(4), 2258–2275.
A. Shapiro and A. Kleywegt (2002), Minimax analysis of stochastic problems, Optimization
Methods and Software 17(3), 523–542.
A. Shapiro, D. Dentcheva and A. Ruszczyński (2009), Lectures on Stochastic Program-
ming: Modeling and Theory, SIAM.
A. Shapiro, E. Zhou and Y. Lin (2023), Bayesian distributionally robust optimization, SIAM
Journal on Optimization 33(2), 1279–1304.
K. S. Shehadeh (2023), Distributionally robust optimization approaches for a stochastic
mobile facility fleet sizing, routing, and scheduling problem, Transportation Science
57(1), 197–229.
K. S. Shehadeh, A. E. Cohn and R. Jiang (2020), A distributionally robust optimiza-
tion approach for outpatient colonoscopy scheduling, European Journal of Operational
Research 283(2), 549–561.
H. Shen and R. Jiang (2023), Chance-constrained set covering with Wasserstein ambiguity,
Mathematical Programming 198(1), 621–674.
M. R. Sheriff and P. Mohajerin Esfahani (2023), Nonlinear distributionally robust optimization, arXiv:2306.03202.
J. A. Shohat and J. D. Tamarkin (1950), The Problem of Moments, American Mathematical
Society.
A. Sinha, H. Namkoong and J. Duchi (2018), Certifying some distributional robustness
with principled adversarial training, in International Conference on Learning Repres-
entations.
M. Sion (1958), On general minimax theorems, Pacific Journal of Mathematics 8(1),
171–176.
J. E. Smith and R. L. Winkler (2006), The optimizer’s curse: Skepticism and postdecision
surprise in decision analysis, Management Science 52(3), 311–322.
A. L. Soyster (1973), Convex programming with set-inclusive constraints and applications
to inexact linear programming, Operations Research 21(5), 1154–1157.
P. R. Srivastava, Y. Wang, G. A. Hanasusanto and C. P. Ho (2021), On data-driven
prescriptive analytics with side information: A regularized Nadaraya-Watson approach,
arXiv:2110.04855.
M. Staib and S. Jegelka (2019), Distributionally robust optimization and generalization in
kernel methods, in Advances in Neural Information Processing Systems, pp. 9134–9144.
T.-J. Stieltjes (1894), Recherches sur les fractions continues, Annales de la Faculté des
sciences de Toulouse pour les sciences mathématiques et les sciences physiques 8(4),
1–122.
V. Strassen (1965), The existence of probability measures with given marginals, The Annals
of Mathematical Statistics 36(2), 423–439.
T. Strohmann and G. Z. Grudic (2002), A formulation for minimax probability machine
regression, in Advances in Neural Information Processing Systems, pp. 785–792.
K. R. Stromberg (2015), An Introduction to Classical Real Analysis, American Mathemat-
ical Society.
L. Sun, W. Xie and T. Witten (2023), Distributionally robust fair transit resource allocation
during a pandemic, Transportation Science 57(4), 954–978.
T. Sutter, A. Krause and D. Kuhn (2021), Robust generalization despite distribution shift
via minimum discriminating information, in Advances in Neural Information Processing
Systems, pp. 29754–29767.
T. Sutter, B. P. Van Parys and D. Kuhn (2024), A Pareto dominance principle for data-driven
optimization, Operations Research 72(5), 1976–1999.
C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow and R. Fer-
gus (2014), Intriguing properties of neural networks, in International Conference on
Learning Representations.
M. Talagrand (1996), Transportation cost for Gaussian and other product measures, Geo-
metric & Functional Analysis 6(3), 587–600.
B. Taşkesen, D. Iancu, Ç. Koçyiğit and D. Kuhn (2024), Distributionally robust linear
quadratic control, in Advances in Neural Information Processing Systems, pp. 18613–
18632.
B. Taşkesen, S. Shafieezadeh-Abadeh and D. Kuhn (2023a), Semi-discrete optimal trans-
port: Hardness, regularization and numerical solution, Mathematical Programming
199(1), 1033–1106.
B. Taşkesen, S. Shafieezadeh-Abadeh, D. Kuhn and K. Natarajan (2023b), Discrete optimal
transport with independent marginals is #P-hard, SIAM Journal on Optimization 33(2),
589–614.
B. Taşkesen, M.-C. Yue, J. Blanchet, D. Kuhn and V. A. Nguyen (2021), Sequential domain
adaptation by synthesizing distributionally robust experts, in International Conference
on Machine Learning, pp. 10162–10172.
A. H. Tchen (1980), Inequalities for distributions with given marginals, The Annals of
Probability 8(4), 814–827.
A. Terpin, N. Lanzetti and F. Dörfler (2024), Dynamic programming in probability spaces
via optimal transport, SIAM Journal on Control and Optimization 62(2), 1183–1206.
A. Terpin, N. Lanzetti, B. Yardim, F. Dörfler and G. Ramponi (2022), Trust region policy
optimization with optimal transport discrepancies: Duality and algorithm for continuous
actions, in Advances in Neural Information Processing Systems, pp. 19786–19797.
Y. L. Tong (1980), Probability Inequalities in Multivariate Distributions, Academic Press.
F. Tramèr, N. Papernot, I. Goodfellow, D. Boneh and P. McDaniel (2017), The space of
transferable adversarial examples, arXiv:1704.03453.
M. Y. Tsang and K. S. Shehadeh (2024), On the trade-off between distributional belief
and ambiguity: Conservatism, finite-sample guarantees, and asymptotic properties,
arXiv:2410.19234.
K. Tu, Z. Chen and M.-C. Yue (2024), A max-min-max algorithm for large-scale robust
optimization, arXiv:2404.05377.
Z. Tu, J. Zhang and D. Tao (2019), Theoretical analysis of adversarial learning: A minimax
approach, in Advances in Neural Information Processing Systems, pp. 12280–12290.
A. Van der Vaart and J. A. Wellner (2000), Preservation theorems for Glivenko-Cantelli
and uniform Glivenko-Cantelli classes, in High Dimensional Probability II (E. Giné,
D. M. Mason and J. A. Wellner, eds), Springer, pp. 115–133.
A. W. Van der Vaart (1998), Asymptotic Statistics, Cambridge University Press.
W. J. van Eekelen, D. den Hertog and J. S. van Leeuwaarden (2022), MAD dispersion
measure makes extremal queue analysis simple, INFORMS Journal on Computing 34(3),
1681–1692.
W. J. van Eekelen, G. A. Hanasusanto, J. J. Hasenbein and J. S. van Leeuwaarden (2023),
Second-order bounds for the M/M/s queue with random arrival rate, arXiv:2310.09995.
J. S. Van Leeuwaarden and C. Stegehuis (2021), Robust subgraph counting with
distribution-free random graph analysis, Physical Review E 104(4), 044313.
B. P. Van Parys (2024), Efficient data-driven optimization with noisy data, Operations
Research Letters 54, Article 107089.
B. P. Van Parys and N. Golrezaei (2024), Optimal learning for structured bandits, Manage-
ment Science 70(6), 3951–3998.
B. P. Van Parys, P. J. Goulart and P. Embrechts (2016a), Fréchet inequalities via convex
optimization, Available from Optimization Online.
B. P. Van Parys, P. J. Goulart and D. Kuhn (2016b), Generalized Gauss inequalities via
semidefinite programming, Mathematical Programming 156(1-2), 271–302.
B. P. Van Parys, P. J. Goulart and M. Morari (2019), Distributionally robust expectation
inequalities for structured distributions, Mathematical Programming 173(1-2), 251–280.
B. P. Van Parys, D. Kuhn, P. J. Goulart and M. Morari (2015), Distributionally robust
control of constrained stochastic systems, IEEE Transactions on Automatic Control
61(2), 430–442.
B. P. Van Parys, P. Mohajerin Esfahani and D. Kuhn (2021), From data to decisions:
Distributionally robust optimization is optimal, Management Science 67(6), 3387–3402.
V. Vapnik (2013), The Nature of Statistical Learning Theory, Springer.
S. R. S. Varadhan (1966), Asymptotic probabilities and differential equations, Communications on Pure and Applied Mathematics 19(3), 261–286.
R. Vershynin (2018), High-Dimensional Probability: An Introduction with Applications in
Data Science, Cambridge University Press.
C. Villani (2003), Topics in Optimal Transportation, American Mathematical Society.
C. Villani (2008), Optimal Transport: Old and New, Springer.
F. Vincent, W. Azizian, J. Malick and F. Iutzeler (2024), skwdro: A library for Wasserstein
distributionally robust machine learning, arXiv:2410.21231.
R. Volpi, H. Namkoong, O. Sener, J. Duchi, V. Murino and S. Savarese (2018), Generalizing
to unseen domains via adversarial data augmentation, in Advances in Neural Information
Processing Systems, pp. 5339–5349.
H. Vu, T. Tran, M.-C. Yue and V. A. Nguyen (2021), Distributionally robust fair principal
components via geodesic descents, in International Conference on Learning Represent-
ations.
M. J. Wainwright (2019), High-Dimensional Statistics: A Non-Asymptotic Viewpoint,
Cambridge University Press.
B. Wang and R. Wang (2011), The complete mixability and convex minimization problems
with monotone marginal densities, Journal of Multivariate Analysis 102(10), 1344–
1360.
C. Wang, R. Gao, W. Wei, M. Shafie-khah, T. Bi and J. P. Catalao (2018), Risk-based distri-
butionally robust optimal gas-power flow with Wasserstein distance, IEEE Transactions
on Power Systems 34(3), 2190–2204.
I. Wang, C. Becker, B. P. Van Parys and B. Stellato (2022), Mean robust optimization,
arXiv:2207.10820.
I. Wang, C. Becker, B. P. Van Parys and B. Stellato (2023), Learning decision-focused
uncertainty sets in robust optimization, arXiv:2305.19225.
J. Wang, R. Gao and Y. Xie (2021), Sinkhorn distributionally robust optimization,
arXiv:2109.11926.
J. Wang, R. Gao and Y. Xie (2024a), Regularization for adversarial robust learning,
arXiv:2408.09672.
R. Wang, L. Peng and J. Yang (2013), Bounds for the sum of dependent risks and worst
value-at-risk with monotone marginal densities, Finance and Stochastics 17(2), 395–
417.
S. Wang (2024), The power of simple menus in robust selling mechanisms, Management
Science (Forthcoming).
S. Wang, Z. Chen and T. Liu (2020), Distributionally robust hub location, Transportation
Science 54(5), 1189–1210.
S. Wang, S. Liu and J. Zhang (2024b), Minimax regret robust screening with moment
information, Manufacturing & Service Operations Management 26(3), 992–1012.
Y. Wang, X. Ma, J. Bailey, J. Yi, B. Zhou and Q. Gu (2019), On the convergence and
robustness of adversarial training, in International Conference on Machine Learning,
pp. 6586–6595.
Y. Wang, V. A. Nguyen and G. A. Hanasusanto (2024c), Wasserstein robust classification
with fairness constraints, Manufacturing & Service Operations Management 26(4),
1567–1585.
Y. Wang, M. N. Prasad, G. A. Hanasusanto and J. J. Hasenbein (2024d), Distributionally
robust observable strategic queues, Stochastic Systems 14(3), 229–361.
Z. Wang, P. W. Glynn and Y. Ye (2016), Likelihood robust optimization for data-driven problems, Computational Management Science 13, 241–261.
J. Weed and F. Bach (2019), Sharp asymptotic and finite-sample rates of convergence of
empirical measures in Wasserstein distance, Bernoulli 25(4A), 2620–2648.
P. Whittle (1990), Risk-Sensitive Optimal Control, Wiley.
W. Wiesemann, D. Kuhn and B. Rustem (2013), Robust Markov decision processes,
Mathematics of Operations Research 38(1), 153–183.
W. Wiesemann, D. Kuhn and M. Sim (2014), Distributionally robust convex optimization,
Operations Research 62(6), 1358–1376.
D. Wozabal (2012), A framework for optimization under ambiguity, Annals of Operations
Research 193(1), 21–47.
D. Wozabal (2014), Robustifying convex risk measures for linear portfolios: A nonpara-
metric approach, Operations Research 62(6), 1302–1315.
Q. Wu, J. Y.-M. Li and T. Mao (2022), On generalization and regularization via Wasserstein
distributionally robust optimization, arXiv:2212.05716.
S. Wu, S. Sun, J. A. Camilleri, S. B. Eickhoff and R. Yu (2021), Better the devil you know
than the devil you don’t: Neural processing of risk and ambiguity, NeuroImage 236,
118109.
W. Xie (2020), Tractable reformulations of distributionally robust two-stage stochastic
programs over the type-∞ Wasserstein ball, Operations Research Letters 48(4), 513–
523.
W. Xie (2021), On distributionally robust chance constrained programs with Wasserstein
distance, Mathematical Programming 186(1), 115–155.
W. Xie, S. Ahmed and R. Jiang (2022), Optimized Bonferroni approximations of distribu-
tionally robust joint chance constraints, Mathematical Programming 191(1), 79–112.
W. Xie and S. Ahmed (2017), Distributionally robust chance constrained optimal power
flow with renewables: A conic reformulation, IEEE Transactions on Power Systems
33(2), 1860–1867.
L. Xin and D. A. Goldberg (2021), Time (in)consistency of multistage distributionally
robust inventory models with moment constraints, European Journal of Operational
Research 289(3), 1127–1141.
L. Xin and D. A. Goldberg (2022), Distributionally robust inventory control when demand
is a martingale, Mathematics of Operations Research 47(3), 2387–2414.
C. Xu, J. Lee, X. Cheng and Y. Xie (2024), Flow-based distributionally robust optimization,
IEEE Journal on Selected Areas in Information Theory 5, 62–77.
H. Xu, C. Caramanis and S. Mannor (2009), Robustness and regularization of support
vector machines, Journal of Machine Learning Research 10(51), 1485–1510.
H. Xu, C. Caramanis and S. Mannor (2012a), A distributional interpretation of robust
optimization, Mathematics of Operations Research 37(1), 95–110.
H. Xu, C. Caramanis and S. Mannor (2012b), Optimization under probabilistic envelope
constraints, Operations Research 60(3), 682–699.
V. A. Yakubovich (1971), S-procedure in nonlinear control theory, Vestnik Leningradskogo Universiteta (in Russian) pp. 62–77.
I. Yang (2018), A dynamic game approach to distributionally robust safety specifications
for stochastic systems, Automatica 94, 94–101.
I. Yang (2020), Wasserstein distributionally robust stochastic control: A data-driven ap-
proach, IEEE Transactions on Automatic Control 66(8), 3863–3870.
J. Yang, L. Zhang, N. Chen, R. Gao and M. Hu (2022), Decision-making with side information: A causal transport robust approach, Available from Optimization Online.
P. Yang and B. Chen (2018), Robust Kullback-Leibler divergence and universal hypothesis
testing for continuous distributions, IEEE Transactions on Information Theory 65(4),
2360–2373.
W. Yang and H. Xu (2016), Distributionally robust chance constraints for non-linear un-
certainties, Mathematical Programming 155(1-2), 231–265.
I. Yanıkoğlu, B. L. Gorissen and D. den Hertog (2019), A survey of adjustable robust
optimization, European Journal of Operational Research 277(3), 799–813.
Y.-L. Yu, Y. Li, D. Schuurmans and C. Szepesvári (2009), A general projection property for
distribution families, in Advances in Neural Information Processing Systems, pp. 2232–
2240.
Y. Yu, T. Lin, E. V. Mazumdar and M. Jordan (2022), Fast distributionally robust learning
with variance-reduced min-max optimization, in International Conference on Artificial
Intelligence and Statistics, pp. 1219–1250.
J. Yue, B. Chen and M.-C. Wang (2006), Expected value of distribution information for
the newsvendor problem, Operations Research 54(6), 1128–1136.
M.-C. Yue, D. Kuhn and W. Wiesemann (2022), On linear optimization over Wasserstein
balls, Mathematical Programming 195(1-2), 1107–1122.
G. Zames (1966), Robust control theory, Proceedings of the IEEE 54(9), 1442–1451.
O. Zeitouni and M. Gutman (1991), On universal hypotheses testing via large deviations,
IEEE Transactions on Information Theory 37(2), 285–290.
Y. Zeng and H. Lam (2022), Generalization bounds with minimal dependency on hypo-
thesis class via distributionally robust optimization, in Advances in Neural Information
Processing Systems, pp. 27576–27590.
A. Y. Zhang and H. H. Zhou (2020), Theoretical and computational guarantees of mean
field variational inference for community detection, The Annals of Statistics 48(5),
2575–2598.
L. Zhang, J. Yang and R. Gao (2024a), Optimal robust policy for feature-based newsvendor,
Management Science 70(4), 2315–2329.
L. Zhang, J. Yang and R. Gao (2024b), A short and general duality proof for Wasserstein
distributionally robust optimization, Operations Research (Forthcoming).
Y. Zhang, R. Jiang and S. Shen (2018), Ambiguous chance-constrained binary programs
under mean-covariance information, SIAM Journal on Optimization 28(4), 2922–2944.
C. Zhao and Y. Guan (2018), Data-driven risk-averse stochastic optimization with Wasser-
stein metric, Operations Research Letters 46(2), 262–267.
C. Zhao and R. Jiang (2017), Distributionally robust contingency-constrained unit com-
mitment, IEEE Transactions on Power Systems 33(1), 94–102.
J. Zhen, D. Kuhn and W. Wiesemann (2023), A unified theory of robust and distribu-
tionally robust optimization via the primal-worst-equals-dual-best principle, Operations
Research (Forthcoming).
K. Zhou and J. C. Doyle (1999), Essentials of Robust Control, Prentice Hall.
K. Zhou, J. C. Doyle and K. Glover (1996), Robust and Optimal Control, Prentice Hall.
B. Zhu, J. Jiao and J. Steinhardt (2022a), Generalized resilience and robust statistics, The
Annals of Statistics 50(4), 2256–2283.
J.-J. Zhu, W. Jitkrittum, M. Diehl and B. Schölkopf (2020), Worst-case risk quantification
under distributional ambiguity using kernel mean embedding in moment problem, in
IEEE Conference on Decision and Control, pp. 3457–3463.
J.-J. Zhu, W. Jitkrittum, M. Diehl and B. Schölkopf (2021), Kernel distributionally robust
optimization: Generalized duality theorem and stochastic approximation, in Interna-
tional Conference on Artificial Intelligence and Statistics, pp. 280–288.
L. Zhu, M. Gürbüzbalaban and A. Ruszczyński (2023), Distributionally robust learn-
ing with weakly convex losses: Convergence rates and finite-sample guarantees,
arXiv:2301.06619.
S. Zhu, L. Xie, M. Zhang, R. Gao and Y. Xie (2022b), Distributionally robust weighted
𝑘-nearest neighbors, in Advances in Neural Information Processing Systems, pp. 29088–
29100.
M. Zorzi (2014), Multivariate spectral estimation based on the concept of optimal predic-
tion, IEEE Transactions on Automatic Control 60(6), 1647–1652.
M. Zorzi (2016), Robust Kalman filtering under model perturbations, IEEE Transactions
on Automatic Control 62(6), 2902–2907.
M. Zorzi (2017a), Convergence analysis of a family of robust Kalman filters based on the
contraction principle, SIAM Journal on Control and Optimization 55(5), 3116–3131.
M. Zorzi (2017b), On the robustness of the Bayes and Wiener estimators under model
uncertainty, Automatica 83, 133–140.
L. F. Zuluaga and J. F. Pena (2005), A conic programming approach to generalized
Tchebycheff inequalities, Mathematics of Operations Research 30(2), 369–388.
S. Zymler, D. Kuhn and B. Rustem (2013a), Distributionally robust joint chance constraints
with second-order moment information, Mathematical Programming 137(1-2), 167–
198.
S. Zymler, D. Kuhn and B. Rustem (2013b), Worst-case value at risk of nonlinear portfolios,
Management Science 59(1), 172–188.