0% found this document useful (0 votes)

7 views51 pages

On The Impossibility of Statistically Improving Empirical Optimization: A Second-Order Stochastic Dominance Perspective

The document discusses the theoretical impossibility of statistically improving solutions obtained from empirical optimization in stochastic optimization problems. It proves that the optimality gap of data-driven solutions cannot outperform that of empirical optimization under certain conditions, demonstrating this through various examples including regularized and distributionally robust optimization. The findings aim to guide practitioners against suboptimal strategies in large-sample scenarios, emphasizing the superiority of empirical optimization.

Uploaded by

陳徐行

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

7 views51 pages

On The Impossibility of Statistically Improving Empirical Optimization: A Second-Order Stochastic Dominance Perspective

Uploaded by

陳徐行

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 51

On the Impossibility of Statistically Improving

Empirical Optimization: A Second-Order Stochastic

arXiv:2105.13419v1 [math.OC] 27 May 2021

Dominance Perspective
Henry Lam
Department of Industrial Engineering and Operations Research, Columbia University, New York, NY 10027,
henry.lam@columbia.edu

When the underlying probability distribution in a stochastic optimization is observed only through data,
various data-driven formulations have been studied to obtain approximate optimal solutions. We show that
no such formulations can, in a sense, theoretically improve the statistical quality of the solution obtained
from empirical optimization. We argue this by proving that the first-order behavior of the optimality gap
against the oracle best solution, which includes both the bias and variance, for any data-driven solution
is second-order stochastically dominated by empirical optimization, as long as suitable smoothness holds
with respect to the underlying distribution. We demonstrate this impossibility of improvement in a range of
examples including regularized optimization, distributionally robust optimization, parametric optimization
and Bayesian generalizations. We also discuss the connections of our results to semiparametric statistical
inference and other perspectives in the data-driven optimization literature.

Key words : empirical optimization, second-order stochastic dominance, optimality gap, regularization,
distributionally robust optimization

1. Introduction
We consider a stochastic optimization problem in the form:

min{Z(x) := ψ(x, P )} (1)

x∈X

where x is the decision variable in a known feasible region X ⊂ Rd , and Z : Rd → R is the objective
function that depends on an underlying probability distribution P and we denote Z(x) = ψ(x, P ).
A primary example of ψ(x, P ) is the expected value objective function EP [h(x, ξ)] where EP [·]
denotes the expectation with respect to P that generates a random object ξ ∈ Ξ. This, however,
can be more general, including for instance the (conditional) value-at-risk of ξ.
We focus on data-driven optimization where P is not known but only observed via i.i.d. data. In
this situation, the decision maker obtains a data-driven solution, say x̂, typically by solving some

1
2 Lam: On the Impossibility of Statistically Improving Empirical Optimization

reformulation of (1) that utilizes the data. Let x∗ be an (unknown) optimal solution for (1). We
are interested in the statistical properties of the optimality gap or regret

G (x̂) := Z(x̂) − Z(x∗ ) (2)

This captures the suboptimality of x̂ relative to the oracle optimal solution x∗ . (2) is a natural
metric to evaluate the quality of an obtained solution and, in the language of statistical learning,
it measures the generalization performance relative to the oracle best in terms of the objective
value. Obviously, if a solution bears a smaller optimality gap than another solution, then its true
objective value or generalization performance is also better by the same magnitude.
Data-driven optimization as discussed above arises ubiquitously across operations research and
machine learning, where the objective ψ(x, P ) ranges from an expected business revenue to the loss
of a statistical model. To obtain x̂ from data, the arguably most straightforward approach is empir-
ical optimization (EO), namely by replacing the unknown true P with the empirical distribution P̂
in the objective ψ(x, P ). For example, in expected value optimization where ψ(x, P ) = EP [h(x, ξ)],
this corresponds to minimizing EP̂ [h(x, ξ)], also known as the sample average approximation (SAA)
(Shapiro et al. 2014). Other than EO, there are plenty of actively studied reformulations, including
regularization that adds penalty terms to the objective (Friedman et al. 2001), and data-driven dis-
tributionally robust optimization (DRO) (Delage and Ye 2010, Goh and Sim 2010, Ben-Tal et al.
2013, Wiesemann et al. 2014, Lim et al. 2006, Rahimian and Mehrotra 2019) where one turns the
objective ψ(x, P ) into maxQ∈U ψ(x, Q), with U being a so-called uncertainty set or ambiguity set
that is calibrated from data and, at least intuitively speaking, has a high likelihood of containing
the true distribution.
Our main question to address is as follows: Considering the EO solution as a natural base-
line, could we possibly improve its statistical performance in terms of the optimality gap (2) (or
equivalently the attained true objective value), by incorporating some regularization or robustness-
enhancing modification? Our main assertion is that, under standard conditions, it is theoretically
impossible to improve the statistical performance of EO in this regard.
Our assertion is qualified by the following setup and conditions. First, we denote x̂EO
n = x(P̂n ) as

the EO solution obtained from n data points, where the subscript n in x̂EO
n and P̂n highlights the
dependence on sample size for the solution and the empirical distribution, and here x(·) is viewed
as a function on P̂n . By a modification to EO, we mean to consider a wider choice of data-driven
solution x̂λn = x(P̂n , λ), where λ is a tuning parameter in an expanded class of procedures that cover
EO in particular. Without loss of generality, we set x(P̂n , 0) = x(P̂n ), i.e., λ = 0 corresponds to EO.
This λ appears virtually in all common regularization approaches, and also calibration proposals
Lam: On the Impossibility of Statistically Improving Empirical Optimization 3

on the set size in DRO with a “consistent” uncertainty set, i.e., a set with the property that it
reduces to the singleton on the empirical distribution when its size λ is tuned to be zero. Typically,
λ is chosen depending on the sample size n, and as n grows, λ eventually shrinks to 0. For example,
in regularization λ represents a bias-variance tradeoff parameter introduced to avoid overfitting.
When n is large, the variance diminishes and so is the need to trade off a decrease in variance
with an increase in bias, deeming a shrinkage of λ to 0. Similarly, in DRO that uses a consistent
confidence region as the uncertainty set, the set size converges to zero as n gets large. Hereafter,
when the dependence of λ on n is needed, we use the notation λn .
Under the above setup, we have two main conditions. First, we focus on solutions x(·, ·) that are
smooth with respect to the distribution P and the tuning parameter λ. This condition is implied
by the smoothness of the objective function ψ(x, P ) (with respect to both x and P ). Second, we
consider the decision dimension d to be fixed or, in other words, the setting where the sample size
n is large relative to the dimension. To summarize, we look at the most basic setting of smooth
stochastic optimization in a large-sample regime, where our goal is to provide a fundamental
argument to show the superiority of EO over all other possibilities.
We now explain the impossibility of statistical improvement. By this we mean that, in the large-
sample regime, the optimality gap evaluated at the EO solution x̂EO
n is always no worse than
that evaluated at x̂λnn , in terms of the risk profile measured by second-order stochastic dominance,
regardless of how λn depends on n. As an immediate implication, this means for any non-decreasing
convex function f : R → R, we have

E[f (G (x̂EO λn
n ))] ≤ E[f (G (x̂n ))]

up to a negligible error, where E[·] denotes the expectation with respect to the data used to obtain
x̂EO
n or x̂λnn . In particular, setting f (y) = y 2 , we conclude that the mean squared error of Z(x̂EO
n )

against Z(x∗ ) is always no larger than that of Z(x̂λnn ). The same conclusion holds for f (y) = y p for
any other p ≥ 1.
We will show how this impossibility result applies to all common examples of regularization
on EO, and common DROs with a consistent uncertainty set. Moreover, we also show how our
result applies to parametric settings, where the optimization involve unknown finite-dimensional
parameters that need to be estimated. In the latter settings, the most straightforward approach is
to plug in a consistent point estimate of the parameter into the optimization formulation, where this
point estimate can be obtained by any common techniques such as maximum likelihood estimation
(MLE) and the method of moments (MM) (Van der Vaart 2000, Chapters 4, 5). Here, we can again
consider adding regularization on the plugged-in optimization formulation. We may also choose
4 Lam: On the Impossibility of Statistically Improving Empirical Optimization

to use Bayesian approaches such as minimizing the expected posterior cost (Wu et al. 2018), and
consider regularizing the Bayesian formulation. In all the above cases, our result concludes that it is
impossible to improve the solution obtained from EO or a simple consistent parameter estimation,
in terms of the asymptotic risk profile of the optimality gap, by injecting regularization or DRO.
We will connect and contrast our results with other viewpoints in the data-driven optimization
as well as the statistics literature. A reader who is proficient in stochastic optimization may find
our claim on the superiority of EO very natural. Yet, as far as we know, our studied perspective
appears unknown in the literature. Our results are intended to guide optimizers against engaging in
suboptimal strategies in the considered basic large-sample situations, as we show in this situation
that there is no theoretical improvement in using any strategies over EO. On the other hand, they
are not intended to undermine the diverse strategies in data-driven optimization, as there are other
situations where these strategies are used for good reasons (we will discuss these in Section 4).
In the following, Section 2 presents our main result, and Section 3 discusses its applications on
the range of data-driven formulations mentioned above. Then, in Section 4, we compare our results
with established viewpoints in the DRO literature and classical statistics, and discuss scenarios
beyond our considered setting in which alternate approaches to EO offer advantages.

2. Main Results
Recall that x∗ is a minimizer of Z(x). Also recall that n is the sample size in an i.i.d. data set
Pn
{ξi , i = 1, . . . , n}, and P̂n denotes its empirical distribution, i.e., P̂ (·) = (1/n) i=1 δξi (·) where δξi (·)
is the Dirac measure at the i-th data point ξi . x̂EO λ
n = x(P̂n ) is the EO solution, and x̂n = x(P̂n , λ)

is the obtained solution from an expanded data-driven optimization procedure parametrized by λ

where x̂0n = x̂EO
n .

We set up some notation. In the following, we denote EQ [·], V arQ (·) and CovQ (·, ·) as the
expectation, variance and covariance under a probability distribution Q. We use “⇒” to denote
p d
weak convergence or convergence in distribution, “→” denote convergence in probability, “=”
denote equality in distribution, and “a.s.” denote almost surely. We denote k · k as the Euclidean
norm. For any deterministic sequences ak ∈ R and bk ∈ R, both indexed by a common index, say k
that goes to ∞, we say that ak = o(bk ) if ak /bk → 0, ak = O(bk ) if there exists a finite M > 0 such
that |ak /bk | < M for all sufficiently large k, ak = ω(bk ) if |ak /bk | → ∞, ak = Ω(bk ) if there exists a
finite M > 0 such that |ak /bk | > M for all sufficiently large k, and ak = Θ(bk ) if there exist finite
M , M > 0 such that M < |ak /bk | < M for all sufficiently large k. We use the notations op (·) and
Op (·) to denote a smaller and an at most equal stochastic order respectively. Namely, for a sequence
of random vector Ak ∈ Rd and deterministic sequence bk ∈ R, both indexed by a common index k
p
that goes to ∞, Ak = op (bk ) means Ak /bk → 0 as k → ∞. Correspondingly, Ak = Op (bk ) means for
Lam: On the Impossibility of Statistically Improving Empirical Optimization 5

any ǫ > 0, there exists a large enough N > 0 and M > 0 such that P (kAk /bk k ≤ M ) ≥ 1 − ǫ for any
p
k > N . Finally, for a sequence of random variable Ak ∈ R, we say that Ak → ∞ if for any ǫ > 0
and M > 0 there exists N > 0 large enough such that P (Ak > M ) ≥ 1 − ǫ for any k > N . Similarly,
p
Ak → −∞ if for any ǫ > 0 and M > 0 there exists N > 0 large enough such that P (Ak < −M ) ≥ 1 − ǫ
for any k > N . Finally, we denote ess supQ f as the essential supremum of a random function f
under distribution Q, and ⊤ as the transpose.

2.1. Conditions
We make three assumptions:

Assumption 1 (Optimality conditions). A true minimizer x∗ for Z(x) satisfies the second-
order optimality conditions, namely the gradient ∇Z(x∗ ) = 0 and the Hessian ∇2 Z(x∗ ) is positive
semidefinite.

Assumption 2 (Linearizability). The data-driven solution x̂λn satisfies

λ ∗ 1
x̂n − x = hIF (ξ), P̂n − P i + λK + op √ + λ (3)
n

as n → ∞ and λ → 0 (at any rate), for some function IF : Ξ → Rd and constant vector K ∈ Rd .
Moreover, CovP (IF (ξ)), the covariance matrix of IF (ξ) under P , is entry-wise finite. Furthermore,
when λ = 0, we have x̂0n = x̂EO
n .

Assumption 3 (Non-degeneracy). We have K ⊤ ∇2 Z(x∗ )K 6= 0 where ∇2 Z(x∗ ) and K are

defined in Assumptions 1 and 2 respectively.

Assumption 1 is the standard optimality condition for unconstrained problem or when x∗ is in

the interior of the feasible region. Note that we implicitly assume that Z(·) is twice differentiable
in the assumption. The unconstrained or interior-point nature of the solution is not necessary and
is used here to present a cleaner setting (we will extend our result later to the constrained case).
Assumption 2 is a smoothness condition on x̂λn = x(P̂n , λ) with respect to both P̂n and λ. In
particular, it states that x̂λn converges to x∗ linearly in P̂n − P as n → ∞ and λ as λ → 0, with
√
a negligible remainder term of higher order than 1/ n + λ. More specifically, the function IF (·)
in (3) is known as the influence function (Hampel 1974) at P , which is viewed as the functional
gradient of x(·, λ) with respect to the probability distribution P . This gradient is a function on the
image of the random object generated by P , i.e., the dual space of P , and h·, ·i is an inner product
that is written more explicitly as
n
1X
Z Z
hIF (ξ), P̂n − P i := IF (ξ)d(P̂n − P ) = IF (ξ)dP̂n = IF (ξi ) (4)
n i=1
6 Lam: On the Impossibility of Statistically Improving Empirical Optimization

Here in (4), the first equality is the definition of the inner product h·, ·i. The second equality follows
by noting that we can always assume IF (ξ) satisfies EP [IF (ξ)] = 0 (because EP̂n [1] = 1, we can
take IF (ξ) − EP [IF (ξ)] to be our new influence function if not the case). The third equality follows
from the definition of the empirical distribution P̂n . Note that when λ = 0, Assumption 2 implies
that the EO solution x̂EO 0
n = x̂n satisfies

1
x̂EO
n
∗
− x = hIF (ξ), P̂n − P i + op √ (5)
n

Such a linear relation can in fact arise more generally, for instance under Hadamard differentiability
(Van der Vaart 2000 Chapter 20), though this is not needed for our current purpose.
Note that Assumption 2 also stipulates that K is the (partial) derivative of x(P, λ) with respect
p
to λ. Moreover, (3) and (5) imply the consistency of solutions x̂λn and x̂n , in the sense that x̂λn → x∗
p
and x̂EO ∗
n → x as n → ∞ and λ → 0 (at any rate).

Finally, Assumption 3 is a technical non-degeneracy assumption to guarantee that the impact

due to the introduction of λ in x̂λn is not completely canceled out by the Hessian ∇Z 2 (x∗ ). Note
that by Assumption 1 we always have K ⊤ ∇2 Z(x∗ )K ≥ 0, and Assumption 3 guarantees that this
inequality is strict.

2.2. Second-Order Stochastic Dominance

To explain our results, we need the following concept (Rothschild and Stiglitz 1970):
Definition 1 (Second-order stochastic dominance for losses). For any two real-
valued random variables A and B, we say that A is second-order stochastically dominated by B if
E[u(A)] ≤ E[u(B)] for any non-decreasing convex function u : R → R.
Definition 1 is also known as a monotone convex order, in which A is said to be smaller than
B in increasing convex order (Shaked and Shanthikumar 2007 Section 4.A.1). The following is an
important equivalence relation:

Proposition 1 (Decomposition via mean-preserving spread). For any two real-valued

random variables A and B, we have that A is second-order stochastically dominated by B if and
only if
B̃ = Ã + η + ǫ (6)
d d
for some Ã = A, B̃ = B, an a.s. non-negative random variable η, and a random variable ǫ such
that B̃ is a mean-preserving spread of Ã + η, i.e.,

E[ǫ|Ã + η] = 0 a.s.
Lam: On the Impossibility of Statistically Improving Empirical Optimization 7

Note that the definition of second-order stochastic dominance is defined solely on the probability
distributions of A and B above. It is common to define this notion on the distributions directly,
though here we use random variable for the convenience of our subsequent developments. Moreover,
let us make clear that our Definition 1 applies to loss (i.e., smaller is desirable) rather than gain (i.e.,
bigger is desirable), the latter more customarily used in the economics literature (Hanoch and Levy
1969, Hadar and Russell 1969, Rothschild and Stiglitz 1970). This distinction, however, is imma-
terial, as we can simply view gain as negative loss. That is, if we call the gain −A to second-order
stochastically dominate the gain −B, then Definition 1 states that E[−u(−(−A))] ≥ E[−u(−(−B))]
for any non-decreasing convex function u, which is equivalent to E[v(−A)] ≥ E[v(−B)] for any
non-decreasing concave function v. This reduces back to notion that a gain is preferable if it has a
higher expected utility.
Proposition 1 is well-established (e.g., Shaked and Shanthikumar 2007 Theorem 4.A.5). Proposi-
tion 1 states that a second-order stochastically dominant variable B, when compared to A, contains
two additional terms η and ǫ. The first term is a non-negative random variable η, so that A + η
is less attractive than A in terms of first-order stochastic dominance, i.e., the distribution func-
tion of A + η is at least that of A. The second term is ǫ that does not change the expectation
but adds more variability to A + η. This means that B is a so-called mean-preserving spread of
A + η (Landsberger and Meilijson 1993). From a risk-averse perspective, additional uncertainty
caused by higher variability is always undesirable. Second-order stochastic dominance thus stipu-
lates that B is less desirable than A in terms of both the notions that “smaller is desirable” and
“less uncertainty is desirable”.
Finally, we introduce the following notion that facilitates the use of stochastic dominance for
weak limits, which occurs in our studied large-sample regime:
Definition 2 (Asymptotic second-order stochastic dominance). For any two real-
valued random sequences An and Bn , we say that An is asymptotically second-order stochastically
dominated by Bn as n → ∞ if An ⇒ W0 and Bn ⇒ W1 such that W0 is second-order stochastically
dominated by W1 or, more generally, for any subsequences nk0 , nk1 → ∞ such that Ank0 and Bnk1
have weak limits, say W0 and W1 respectively, W0 is second-order stochastically dominated by W1 .
“Asymptotic” here means that the second-order stochastic dominance holds in the weak limits
or, in the case that no unique weak limits exist, then we look at all possible subsequences that have
weak limits. We also remark that the weak convergence here is broadly defined as including the
case where the limits W0 and W1 can be ±∞, in which case it means the corresponding sequences
p
→ ±∞.
8 Lam: On the Impossibility of Statistically Improving Empirical Optimization

2.3. The Impossibility Theorem

With the above setting and notation, we are now ready to state our main result.

Theorem 1 (Impossibility of statistical improvement from EO). Suppose Assumptions

1, 2 and 3 hold, and λn → 0 as n → ∞. Then the optimality gap of x̂EO
n scaled by n, namely
nG (x̂EO λn λn
n ), is asymptotically second-order stochastically dominated by that of x̂n , namely nG (x̂n ).

Theorem 1 states that no matter how we choose λn in the expanded data-driven procedure, as
long as it goes to 0, then it is impossible to improve x̂EO
n asymptotically in terms of the risk profile
of the optimality gap measured by second-order stochastic dominance. Note that the scaling n in
front of the optimality gap in the theorem is natural as this is the scaling that gives a nontrivial
limit under the first-order optimality condition.
The key in showing Theorem 1 is to decompose the weak limit of nG (x̂λnn ) into that of nG (x̂EO
n ),

a positive term, and a conditionally unbiased noise term that leads to the mean-preserving spread
as in (6). Essentially, the two extra terms signify that introducing λn would simultaneously lead
to an extra “bias” and an extra “variability”. This behavior holds regardless of how we choose the
sequence λn . The detailed proof of Theorem 1 is in Appendix EC.2, and the next subsection gives
the roadmap and intuitive explanation.

2.4. Explaining the Impossibility Theorem

To explain Theorem 1 in more detail, we start with writing the mathematical form of the optimality
gap more explicitly as follows:

Lemma 1. Under Assumptions 1 and 2, as n → ∞ and λ → 0, the optimality gap of x̂λn satisfies

G (x̂λn ) := Z(x̂λn ) − Z(x∗ )

1 1
= hIF (ξ), P̂n − P i⊤ ∇2 Z(x∗ )hIF (ξ), P̂n − P i + λ2 K ⊤ ∇2 Z(x∗ )K + λK ⊤ ∇2 Z(x∗ )hIF (ξ), P̂n − P i
2 2
1
+ op + λ2 (7)
n

Lemma 1 can be obtained straightforwardly by Taylor-expanding Z(x̂λn ) around x∗ , giving

1
Z(x̂λn ) − Z(x∗ ) = (x̂λn − x∗ )⊤ ∇2 Z(x∗ )(x̂λn − x∗ ) + op (kx̂λn − x∗ k) (8)
2
and plugging in the linear approximation of x̂λn − x∗ in (3). In (7), the first term is approxi-
mately G (x̂EO
n ) (i.e., when setting λ = 0), while the second and third term appear as the additional

quadratic and cross terms arising from the “bias” λK in (3) for x̂λn . In other words, Lemma 1
deduces the following relation between the optimality gaps of x̂λn and x̂EO
n

d 1 2 ⊤ 2 1
G (x̂λn ) = G (x̂EO
n )+ λ K ∇ Z(x∗ )K + λK ⊤ ∇2 Z(x∗ )hIF (ξ), P̂n − P i + op + λ2
2 n
Lam: On the Impossibility of Statistically Improving Empirical Optimization 9

or equivalently the relation between the true objective values attained by x̂λn and x̂EO
n

λ d EO 1 2 ⊤ 2 ∗ ⊤ 2 ∗ 1 2
Z(x̂n ) = Z(x̂n ) + λ K ∇ Z(x )K + λK ∇ Z(x )hIF (ξ), P̂n − P i + op +λ (9)
2 n
The detailed proof of Lemma 1 is left to Appendix EC.2.
Now we highlight the main intuition in obtaining Theorem 1. By the definition of hIF (ξ), P̂n − P i
in (4), the central limit theorem (CLT) implies that
Y
hIF (ξ), P̂n − P i ≈ √
n
in distribution, where Y is a Gaussian vector with mean 0 and covariance matrix CovP (IF (ξ)).
Thus we can write the expression of G (x̂(λ)) in Lemma 1 as

1 ⊤ 2 1 1 1
G (x̂λn ) ≈ Y ∇ Z(x∗ )Y + λ2 K ⊤ ∇2 Z(x∗ )K + √ λK ⊤ ∇2 Z(x∗ )Y + op + λ2 (10)
2n 2 n n
in distribution. Moreover, setting λ = 0, the EO optimality gap becomes

EO 1 ⊤ 2 ∗ 1
G (x̂n ) ≈ Y ∇ Z(x )Y + op (11)
2n n
Now consider all possible choices of the sequence λ in relation to n, and for each case we see
√
how the first three terms in (10) behaves. Suppose λn = o(1/ n). Then the second and third terms
are of smaller order than the first term, so that (10) reduces to (1/(2n))Y ⊤ ∇2 Z(x∗ )Y + op (1/n).
In this case, G (x̂λnn ) and G (x̂EO
n ) behave the same asymptotically. In other words, the expanded

procedure does not offer any first-order benefit relative to EO. Now suppose, on the other hand,
√
that λn = ω(1/ n). Then the second and third terms are both of bigger order than the first term,
with the second term the most dominant, and so (10) becomes 12 λ2n K ⊤ ∇2 Z(x∗ )K + op (λ2n ). In this
case, G (x̂λnn ) is of bigger order than G (x̂EO
n ), so that the expanded procedure gives a solution that
√
is worse than EO. Thus, we are left with choosing λn = Θ(1/ n).
√
Suppose λn ≈ a/ n for some finite a 6= 0 as n → ∞. We have

λn 1 ⊤ 2 ∗ 1 2 ⊤ 2 ∗ 1 ⊤ 2 ∗ 1
G (x̂n ) ≈ Y ∇ Z(x )Y + a K ∇ Z(x )K + aK ∇ Z(x )Y + op
2n 2n n n
so that all three terms have the same order 1/n. The coefficient in this first-order term is
1 ⊤ 2 1
Y ∇ Z(x∗ )Y + a2 K ⊤ ∇2 Z(x∗ )K + aK ⊤ ∇2 Z(x∗ )Y (12)
2 2
Note that the second term in (12) is deterministic and always non-negative. On the other hand,
the third term has conditional mean zero given the first term, namely
" #
1
E aK ⊤ ∇2 Z(x∗ )Y Y ⊤ ∇2 Z(x∗ )Y = 0 (13)
2
10 Lam: On the Impossibility of Statistically Improving Empirical Optimization

To see this, note that since ∇2 Z(x∗ ) is positive semidefinite by Assumption 1, Y ⊤ ∇2 Z(x∗ )Y can be
written as a sum-of-squares of linear transformations of Y , i.e., Y ⊤ ∇2 Z(x∗ )Y = k(∇2 Z(x∗ ))1/2 Y k2
where (∇2 Z(x∗ ))1/2 denotes the square-root matrix of ∇2 Z(x∗ ). Moreover, note that Y , as a mean-
zero Gaussian vector, is symmetric (i.e., the densities at y and −y are the same). Thus, conditional
on the knowledge of 21 Y ⊤ ∇2 Z(x∗ )Y , it is equally likely for Yj to take yj and −yj for any yj , and
we have E[Yj |Y ⊤ ∇2 Z(x∗ )Y ] = 0 for all components Yj , j = 1, . . . , d of Y . This implies (13). In other
words, (12) is a mean-preserving spread of (1/2)Y ⊤ ∇2 Z(x∗ )Y + (1/2)a2K ⊤ ∇2 Z(x∗ )K.
√
Putting the above together, we see that, in the case λn ≈ a/ n for finite a 6= 0, the first-order
term of G (x̂λnn ), (12), is exactly decomposable into the form in (6) in Proposition 1. Compared
with the first-order coefficient of G (x̂EO ⊤ 2 ∗ EO
n ), namely (1/2)Y ∇ Z(x )Y , we thus have G (x̂n ) second-

order stochastically dominated by G (x̂λnn ) in terms of first-order behavior. Therefore, in all possible
choices of λn , the asymptotic statistical behavior of G (x̂λnn ) cannot be better than G (x̂EO
n ).

We summarize the above intuitive explanation with the following two propositions:

Proposition 2 (A trichotomy of optimality gap behaviors). Under Assumptions 1, 2

√
and 3, suppose λn → 0 with nλn → a for some a ∈ R ∪ {−∞, ∞} as n → ∞. We have the following
regarding the optimality gap of x̂λnn :
1. When a = 0, then
1
nG (x̂λnn ) ⇒ Y ⊤ ∇2 Z(x∗ )Y
2
2. When a = ∞ or −∞, then
1 1
G (x̂λnn ) ⇒ K ⊤ ∇2 Z(x∗ )K
λ2n 2
3. When 0 < |a| < ∞, then
1 1
nG (x̂λnn ) ⇒ Y ⊤ ∇2 Z(x∗ )Y + a2 K ⊤ ∇2 Z(x∗ )K + aK ⊤ ∇2 Z(x∗ )Y
2 2
On the other hand, the optimality gap of x̂n satisfies, as n → ∞,
1 ⊤ 2
nG (x̂EO ∗
n ) ⇒ Y ∇ Z(x )Y
2
In all the above, Y denotes a Gaussian random vector with mean 0 and covariance matrix
CovP (IF (ξ)).

Proposition 3 (Comparisons on the trichotomy). Under the same assumptions and nota-
tions in Proposition 2, consider n → ∞. When a = 0, nG (x̂λnn ) and nG (x̂EO
n ) have the same weak
p
limit. When a = ∞ or −∞, nG (x̂λnn ) → ∞ whereas nG (x̂EO
n ) converges weakly to a tight random

variable. When 0 < |a| < ∞, the weak limit of nG (x̂EO

n ) is second-order stochastically dominated by

that of nG (x̂λnn ).
Lam: On the Impossibility of Statistically Improving Empirical Optimization 11

Proposition 2 summarizes the asymptotic limits of G (x̂λnn ) derived from Lemma 1. Proposition 3
then compares them with that of G (x̂EO
n ), and concludes that asymptotic second-order stochastic

dominance holds in all cases, thus concluding Theorem 1.

We close this section by cautioning that when Assumption 3 is violated, so that K ⊤ ∇2 Z(x∗ )K =
0, then we would have (∇2 Z(x∗ ))1/2 K = 0 and hence K ⊤ ∇2 Z(x∗ ) = 0. Thus both the second and
√
third terms in (12) vanish. In this case, choosing a suitable λn = ω(1/ n) potentially retains the
first-order behavior of G (x̂λnn ) to be comparable to G (x̂EO
n ), i.e., of order 1/n, with an involved coef-

ficient that could be bigger or smaller. Thus in this degenerate case the comparison is inconclusive.

2.5. Generalization to Constrained Problems

We close this section by presenting the generalization of our results to constrained optimization.
Consider
min Z(x) subject to gj (x) ≤ 0 for j ∈ J1 , gj (x) = 0 for j ∈ J2 (14)
x

That is, we have |J1 | inequality constraints and |J2 | equality constraints, with constraint functions
denoted gj (x). We could also add non-negativity constraints on x, as long as the solution satisfies
the assumptions we make momentarily. We assume the following first-order optimality conditions:

Assumption 4 (Lagrangian optimality condition). The optimal solution x∗ to (14) solves

X
∇Z(x) + α∗j ∇gj (x) = 0
j∈B

where α∗j ’s are the Lagrange multipliers, and B indicates the binding set of constraints, i.e., B =
{j ∈ J1 ∪ J2 : gj (x∗ ) = 0}. Moreover, ∇Z 2 (x) + j∈B α∗j ∇2 gj (x) = 0 is positive semidefinite.
P

The first condition in Assumption 4 is the standard KKT condition. The second condition is
the second-order optimality condition imposed on the Lagrangian. Note that we have implicitly
assumed Z and gj ’s are twice differentiable in the assumption.
In addition, we assume the following behavior on x̂λn :

Assumption 5 (Preservance of binding constraints). The data-driven solution x̂λn satis-

fies gj (x̂λn ) = 0 for any j ∈ B, with probability converging to 1 as n → ∞ and λ → 0.

Assumption 5 means that, asymptotically, the data-driven solution retains the same set of binding
constraints as the true solution. This condition typically holds for any solution that is consistent in
converging to x∗ . Next, in parallel to Assumption 3, we have the following non-degeneracy condition
in the constrained case:

Assumption 6 (Non-degeneracy in constrained problems). We have K ⊤ (∇2 Z(x∗ ) +

∗ 2 ∗ 2 ∗ ∗ 2 ∗
P
j∈B αj ∇ gj (x ))K 6= 0 where ∇ Z(x ), αj , ∇ gj (x ) and K are defined in Assumptions 4 and 2.
12 Lam: On the Impossibility of Statistically Improving Empirical Optimization

With the above assumptions, we now argue that all the impossibility results we have discussed
hold in the constrained setting, as follows:

Theorem 2 (Impossibility of statistical improvement over EO in constrained problems).

Suppose Assumptions 4, 5 and 6 hold, and so does the expansion of solution x̂λn depicted in
Assumption 2. Then the conclusions of Theorem 1, Lemma 1, Proposition 2 and Proposition 3 all
hold, except that ∇2 Z(x∗ ) is replaced by ∇2 Z(x∗ ) + j∈B α∗j ∇2 gj (x∗ ) in all these results.
P

The proof of Theorem 2 follows a similar development as in Section 2.4, but instead of using the
unconstrained first-order optimality to remove the first-order term in the Taylor series expansion
of G (x̂λn ), we use the Lagrangian (Assumption 4) to express this term in terms of the constraints,
which are in turn analyzed via another Taylor expansion. The detailed proof is in Appendix EC.2.

3. Applications to Data-Driven Optimization Formulations

We apply Theorem 1 to several common data-driven optimization formulations, including regular-
ization (Section 3.1), DRO (Section 3.2), parametric optimization (Section 3.3) and its Bayesian
counterpart (Section 3.4). For each data-driven optimization approach, we will identify the expan-
sion (3) in Assumption 2, the main condition to achieve our impossibility result. This requires
certain regularity conditions which for succinctness are presented in the Appendix.

3.1. Regularization
Consider Z(x) := ψ(x, P ) as the expected value objective function EP [h(x, ξ)] where h : X × Ξ → R.
The EO objective function is then ψ(x, P̂n ) = EP̂n [h(x, ξ)]. We consider x̂Reg,λ
n as a regularized
solution obtained from

min{ψ(x, P̂n ) + λR(x) = EP̂n [h(x, ξ)] + λR(x)} (15)

x∈X

where R : X → R is a penalty function on x and λ ≥ 0 is the regularization parameter. For instance,

when h represents the least-square objective or some loss function and R(x) = kxk2 , we have a
ridge regression (e.g., Friedman et al. 2001).
We have the following result:

Theorem 3 (Impossibility of improving EO via regularization). Consider Z(x) =

EP [h(x, ξ)] with a minimizer x∗ . Suppose we collect i.i.d. observations ξ1 , . . . , ξn with empirical
distribution P̂n , and let x̂EO
n be a minimizer of EP̂n [h(x, ξ)] and x̂Reg,λ
n a minimizer of (15). Under
Assumptions EC.1–EC.4 in Appendix EC.3, we have

x̂Reg,λ
n
n
− x∗

2 ∗ −1 ∗ 2 ∗ −1 ∗ 1 ∗
= −h(EP [∇ h(x , ξ)]) ∇h(x , ξ), P̂n − P i − λn (EP [∇ h(x , ξ)]) ∇R(x ) + op √ + λn k∇R(x (16)
)k
n
Lam: On the Impossibility of Statistically Improving Empirical Optimization 13

as n → ∞ and λn → 0 (at any rate relative to n). Hence, if in addition Assumptions 1 and 3 hold
with K = −(EP [∇2 h(x∗ , ξ)])−1 ∇R(x∗ ), then nG (x̂EO
n ) is asymptotically second-order stochastically

dominated by nG (x̂Reg,λ
n
n
).

The proof of Theorem 3 is in Appendix EC.3, which requires the theory of M -estimation with
nusiance parameter. In addition to smoothness (Assumptions EC.1, EC.3) and first-order optimal-
ity conditions (Assumption EC.4), we also need to control the function complexity of the loss func-
tion (Van Der Vaart and Wellner 1996) via Donsker and Glivenko-Cantelli properties (Assumption
EC.2). Theorem 3 can also be translated to DRO based on the Wasserstein ball thanks to its
connection with regularization; see the next subsection.
At a first glance, Theorem 3 may look contradictory to the regularization literature. For instance,
even in the linear regression, it is known that a ridge regression with a properly chosen λ can
always improve the estimation quality under large sample (Li et al. 1986, 1987). The catch here
is the criterion to measure the quality of solution. Our considered criterion in this paper is the
risk profile of the entire distribution of the optimality gap, whereas the criterion looked at in the
ridge regression can be seen to correspond to the expected optimality gap. This latter criterion
appears reasonable for estimation problems, but it is not sufficient from an optimization viewpoint:
An improvement in the expected gap does not necessarily mean the obtained solution is better
statistically in terms of the attained objective value, as the variability of the optimality gap can
become worse – In fact this fundamental dilemma is precisely what we showed.
To distinguish clearly optimality gap versus expected optimality gap, consider the approximation
of G (x̂λn ) in (7). Under standard regularity conditions, its expected value, taken with respect to all
the data, is

E[G (x̂λn )] = E[Z(x̂λn ) − Z(x∗ )]

"
1 1
= E hIF (ξ), P̂n − P i⊤ ∇2 Z(x∗ )hIF (ξ), P̂n − P i + λ2 K ⊤ ∇2 Z(x∗ )K
2 2
#
⊤ 2 ∗ 1 2
+ λK ∇ Z(x )hIF (ξ), P̂n − P i + o +λ
n

EO 1 2 ⊤ 2 ∗ 1 2
= E[G (x̂n )] + λ K ∇ Z(x )K + o +λ (17)
2 n

where the last equality follows by comparing with the approximation for E[G (x̂EO
n )], and noting that
1 2
2
λ K ⊤ ∇2 Z(x∗ )K is deterministic and λK ⊤ ∇2 Z(x∗ )hIF (ξ), P̂n − P i has mean 0. The dominant
term in the remainder, namely o(1/n + λ2 ), is actually of order O(λ/n), which comes from the
cross-term of two “P̂n − P ” and one “λ” (the other higher-order terms either have expectation 0 or
are dominated by others). Thus, we can choose λ = k/n, for some constant k, such that the second
14 Lam: On the Impossibility of Statistically Improving Empirical Optimization

term in (17) and this O(λ/n) term possess the same 1/n2 order and are in opposite signs, thus
leading to a smaller expected optimality gap over x̂EO
n . However, if the variability of the optimality

gap is taken into account, then we need to consider the third term of (7) which is of stochastic
√
order λ/ n. Choosing λ = k/n then entails this term to order 1/n3/2 , which is larger than the
improvement of order 1/n2 in the expected gap and as a result washes away this gain.
We should also make clear that our impossibility result in Theorem 1 does apply to the com-
parison in terms of the expected optimality gap, because the second-order stochastic dominance
implies as a particular case that the expected values follow the corresponding ordering. This may
appear again as a contradiction to our result in saying that regularization can improve the expected
gap of EO. However, this latter gain is actually of higher-order than 1/n, whereas in the dominant
term of order 1/n in the optimality gap there is no gain, which is what our result implies. This
means that, at least in low-dimensional cases, the improvement using regularization, even focusing
only on the expected optimality gap, is negligible.
Finally, we justify one of our claims above that the criterion used in justifying the gain in ridge
regression corresponds to the expected optimality gap. Consider the simple least-square problem
minβ∈Rd E(Y − X ⊤ β)2 , where ξ = (X, Y ) ∈ Rd+1 follows a linear model Y = X ⊤ β + ǫ with E[ǫ|X] =
0, and E[XX ⊤ ] = I. Suppose from data we obtain a coefficient estimate β̂. Then the optimality
gap G (β̂), according to the least-square objective function, is

E(Y − X ⊤ β̂)2 − E(Y − X ⊤ β)2

= −2E[Y X ⊤ (β̂ − β)] + E[β̂ ⊤ XX ⊤ β̂ − β ⊤ XX ⊤ β]

= −2E[Y X ⊤ ](β̂ − β) + (β̂ ⊤ β̂ − β ⊤ β)

= −2β ⊤ (β̂ − β) + kβ̂ k2 − kβ k2

= kβ̂ − β k2

where we have used E[Y X ⊤ ] = β ⊤ . That is, kβ̂ − β k2 is the optimality gap. Thus, in the regression
context, Ekβ̂ − β k2 is the mean squared error (MSE) of the estimator β̂, while in optimization
this is the expected optimality gap. It is shown (Li et al. 1986, 1987) that using a regularizing
penalty kβ k2 can improve the MSE of β̂. However, Theorem 3 shows it would not improve kβ̂ − β k2
asymptotically.

3.2. Distance-Based DRO

Consider again that ψ(x, P ) = EP [h(x, ξ)]. The DRO approach entails that, under uncertainty on
P , we obtain solution x̂DRO,λ
n from solving

min max EQ [h(x, ξ)] (18)

x∈X Q∈Uλ
Lam: On the Impossibility of Statistically Improving Empirical Optimization 15

where Uλ is a so-called uncertainty set or ambiguity set on the space of probability distributions
which, at least intuitively, is believed to contain the ground-truth P with high likelihood. The
parameter λ signifies the size of the set. By properly choosing λ and accounting for the worst-
case scenario in the inner maximization, (18) is hoped to output a higher-quality solution. DRO
is gaining surging popularity in recent years. It can be viewed as a generalization of the classical
deterministic RO (Ben-Tal et al. 2009, Bertsimas et al. 2011) where the uncertain parameter in an
optimization problem is now the underlying probability distribution in a stochastic problem.
Before we present our impossibility result regarding DRO over EO, let us first frame the landscape
of DRO and reason the DRO types that have a legitimate possibility of beating EO statistically. In
the DRO literature, the choice of Uλ can be categorized roughly into two groups. The first group
is based on partial distributional information, such as moment and support (Ghaoui et al. 2003,
Delage and Ye 2010, Goh and Sim 2010, Wiesemann et al. 2014, Hanasusanto et al. 2015), shape
(Popescu 2005, Van Parys et al. 2016, Li et al. 2017, Lam and Mottet 2017, Chen et al. 2021) and
marginal distribution (Chen et al. 2018, Doan et al. 2015, Dhara et al. 2021). This approach has
proven useful in robustifying decisions when facing limited distributional information, or when data
is scarce, e.g., in the extremal region. In such cases, by using Uλ that captures the known partial
information, the DRO guarantees a worst-case performance bound on EP [h(x̂DRO,λ
n , ξ)] (by using
the outer objective value, namelyEP [h(x̂DRO,λ
n , ξ)] ≤ maxQ∈Uλ EQ [h(x̂DRO,λ
n , ξ)]). Additionally, if Uλ
is calibrated from data to be a high-confidence region in containing P , then such a worst-case
bound holds with at least the same statistical confidence level (e.g., Delage and Ye 2010). However,
from a large-sample standpoint, the set Uλ constructed in these approaches typically bears intrinsic
looseness due to the use of only partial distributional information, and consequently the obtained
solution x̂DRO,λ
n does not converge to x∗ as n grows (regardless of how we choose λ).
The second group of Uλ comprises neighborhood balls in the probability space, namely Uλ = {Q :
D(Q, P 0) ≤ λ} for some statistical distance D(·, ·) between two probability distributions, baseline
distribution P 0 , and neighborhood size λ > 0. Common choices of D include the φ-divergence class
(Ben-Tal et al. 2013, Bertsimas et al. 2018, Bayraksan and Love 2015, Jiang and Guan 2016, Lam
2016, 2018), and the Wasserstein distance (Gao and Kleywegt 2016, Chen and Paschalidis 2018,
Esfahani and Kuhn 2018, Blanchet and Murthy 2019). When the ball center P 0 and the ball size λ
are chosen properly in relation to the data size, it can be guaranteed that a worst-case bound holds
for EP [h(x̂DRO
n , ξ)] with high confidence and, moreover, x̂DRO
n converges to x∗ . In other words, such
DRO can provide statistically consistent solutions. For this reason, in the following we will study
this approach. In particular, we will present φ-divergence-based DRO in detail, and then connect
Wasserstein-based DRO to the result in Section 3.1.
16 Lam: On the Impossibility of Statistically Improving Empirical Optimization

We consider D represented by a φ-divergence and the ball center P 0 taken as the empirical
distribution P̂n . For any two distributions Q and Q′ on the same domain, let L = dQ/dQ′ be the
Radon-Nikodym derivative or the likelihood ratio between Q and Q′ . Then D is defined by

D(Q, Q′ ) = EQ′ [φ(L)] (19)

where φ : R+ → R+ is a function with certain properties (described momentarily). For example,

when φ(x) = x log x − x + 1, (19) gives the Kullback-Leibler (KL) divergence, and φ(x) = (x − 1)2
gives the χ2 -distance. Note that such a D is only well-defined when Q is absolutely continuous
with respect to Q′ . Thus when we set the ball center as P̂n , the use of D(Q, P̂n ) is valid only
either in the discrete-support case (Ben-Tal et al. 2013, Bayraksan and Love 2015) or functions
as an “empirical” φ-divergence that confines distribution Q to the support generated by data
(Lam and Zhou 2017, Lam 2019, Duchi et al. 2021). In the first case, the resulting set Uλ can be
made to contain P with high confidence when λ is chosen large enough. However, it is shown
that such a calibration approach can be conservative, in the sense that a smaller λ is sufficient for
guaranteeing an upper bound for EP [h(x̂DRO,λ
n , ξ)] (i.e., there is a looseness when translating the
confidence guarantee for the uncertainty set Uλ in covering P , to the confidence level for bounding
EP [h(x̂DRO,λ
n , ξ)]). The empirical φ-divergence reduces this looseness by looking at the asymptotic
expansion of maxQ∈Uλ EP [h(x, ξ)] in terms of λ (Lam 2016, Gotoh et al. 2018, Duchi and Namkoong
2019) and connecting to the empirical likelihood theory (Wang et al. 2016, Lam and Zhou 2017,
Lam 2019, Duchi et al. 2021). In this way, even though the uncertainty set may never contain
the true distribution (e.g., when the true distribution is continuous), the resulting DRO can still
provide a bounding guarantee on EP [h(x̂DRO,λ
n , ξ)] with high confidence. Finally, we note that
there are other proposals in the literature that construct P 0 via density estimation and calibrate
λ by estimating the divergence between data and the constructed P 0 (e.g., Wang et al. 2009,
Póczos et al. 2012). However, while being consistent, the convergence rate is slow and as a result
this approach gives a very conservative Uλ (Hong et al. 2020) and in turn solution x̂DRO,λ
n .
To proceed, we introduce conditions on the function φ : R+ → R+ .

Assumption 7 (Smoothness of φ-function). φ(t) is convex for t ≥ 0, φ(1) = 0, and φ(t) is

twice continuously differentiable at t = 1 with φ′′ (1) > 0.

Let φ∗ (s) = supt≥0 {st − φ(t)} be the convex conjugate of φ. Thanks to the properties of φ above,
standard convex analysis (Rockafellar 2015) gives the following:

Lemma 2 (Smoothness of conjugate φ-function). Under Assumption 7, φ∗ is twice contin-

uously differentiable with φ∗ (0) = 0, φ∗ ′ (0) = 1 and φ∗ ′′ (0) = 1/φ′′ (φ∗ ′ (0)) = 1/φ′′ (1) > 0.
Lam: On the Impossibility of Statistically Improving Empirical Optimization 17

Together with the regularity conditions on the optimization problem in Appendix EC.4, we have
the following result:

Theorem 4 (Impossibility of improving EO via divergence DRO). Consider Z(x) =

EP [h(x, ξ)] with a minimizer x∗ . Suppose we collect i.i.d. observations ξ1 , . . . , ξn with empirical
distribution P̂n , and let x̂EO
n be a minimizer of EP̂n [h(x, ξ)] and x̂nD−DRO,λ a minimizer of (18) with
Uλ = {Q : D(Q, P̂n ) ≤ λ} and D defined in (19). Under Assumption 7 and Assumptions EC.5–EC.8
in Appendix EC.4, we have

x̂nD−DRO,λn − x∗
CovP (h(x∗ , ξ), ∇h(x∗ , ξ))
q
= −(EP [∇ h(x , ξ)]) h∇h(x , ξ), P̂n − P i − λn φ∗ ′′ (0)(EP [∇2 h(x∗ , ξ)])−1
2 ∗ −1 ∗
p
V arP (h(x∗ , ξ))
√

1
+ op √ + λn (20)
n

as n → ∞ and λn → 0 (at any rate relative to n). Hence, if in addition Assumptions 1 and 3 hold
with K = − φ∗ ′′ (0)(EP [∇2 h(x∗ , ξ)])−1 CovP (h(x∗ , ξ), ∇h(x∗ , ξ))/ V arP (h(x∗ , ξ)), then nG (x̂EO
p p
n )

is asymptotically second-order stochastically dominated by nG (x̂nD−DRO,λn ).

√
In Theorem 4, note that λn plays the role of λn in previous sections because of the scaling of
divergence D. Moreover, it is known that maxQ∈Uλ EQ [h(x, ξ)] admits an asymptotic expansion in
the form
q
max EQ [h(x, ξ)] ≈ EP̂n [h(x, ξ)] + 2λφ∗ ′′ (0)V arP̂n (h(x, ξ)) + · · · (21)
Q∈Uλ

as λ → 0 (Lam 2016, Dupuis et al. 2016, Gotoh et al. 2018, Lam 2018, Duchi and Namkoong 2019,
Duchi et al. 2021). The relation (20) can thus be obtained formally by turning the minimization
of (21) into a root-finding (or M -estimation; Van der Vaart 2000 Chapter 5) problem via the
first-order optimality condition, and then applying the delta method. Nonetheless, the precise
development needs more technicality, with the detailed proof of Theorem 4 in Appendix EC.4. We
also point out that (20) can be viewed as a generalization of Duchi and Namkoong (2019) that
considers the special case of χ2 -distance for which there is no higher-order term in (21), and thus
our proof directly uses the Karush–Kuhn–Tucker (KKT) condition instead of starting with the
Taylor expansion as in Duchi and Namkoong (2019). Moreover, (21) also relates to Gotoh et al.
(2021) that considers the expectation and variance of the cost function of the obtained solution
(see Section 4.2).
Next we consider D as the Wasserstein distance. The p-th order Wasserstein or optimal transport
distance is defined by
D(Q, Q′ ) = inf (Eπ ||ξ − ξ ′ kp )1/p (22)
π∈Π(Q,Q′ )
18 Lam: On the Impossibility of Statistically Improving Empirical Optimization

where Π(Q, Q′ ) denotes the set of all distributions with marginals Q and Q′ , and (ξ, ξ ′ ) is distributed
according to π. The definition (22) has a dual representation in terms of integral probability metric
(Sriperumbudur et al. 2012), and the norm k·k there can be replaced with more general transporta-
tion cost functions (e.g., Blanchet and Murthy 2019). The rich structural properties of Wasserstein
DRO has facilitated its tight connection with machine learning and statistics (Kuhn et al. 2019,
Rahimian and Mehrotra 2019, Blanchet et al. 2019, Gao et al. 2017a, Shafieezadeh-Abadeh et al.
2019, Chen and Paschalidis 2018).
Here we focus on the 1-st order Wasserstein distance (p = 1), and set the ball center P 0 as P̂n ,
so that Uλ = {Q : D(Q, P̂n) ≤ λ} where D is defined in (22). In this case, it is known that, when
Ξ is in the real space and h(x, ξ) is convex in ξ, the worst-case expectation in (18) possesses a
Lipschitz-regularized reformulation as

max EQ [h(x, ξ)] = EP̂n [h(x, ξ)] + λLip(h(x, ·)) (23)

Q∈Uλ

for any x, where Lip(h(x, ·) is the Lipschitz modulus of h(x, ·) given by Lip(h(x, ·) =
supξ6=ξ′ |h(x, ξ) − h(x, ξ ′ )|/||ξ − ξ ′ k. From (23), we immediately obtain the following:

Corollary 1 (Impossibility of improving EO via Wasserstein DRO). Consider

Z(x) = EP [h(x, ξ)] with a minimizer x∗ , h(x, ·) is convex for any x ∈ X , and ξ ∈ Ξ in the real
space. Suppose we collect i.i.d. observations ξ1 , . . . , ξn with empirical distribution P̂n , and let x̂EO
n

be a minimizer of EP̂n [h(x, ξ)] and x̂nW −DRO,λ a minimizer of (18) with Uλ = {Q : D(Q, P̂n) ≤ λ}
and D defined in (22) with p = 1. Under Assumptions EC.1–EC.4 in Appendix EC.3 where we set
R(x) = Lip(h(x, ·)), we have

x̂nW −DRO,λn − x∗

= −h(EP [∇2 h(x∗ , ξ)])−1 ∇h(x∗ , ξ), P̂n − P i − λn (EP [∇2 h(x∗ , ξ)])−1 ∇Lip(h(x∗ , ·))

1 ∗
+ op √ + λn k∇Lip(h(x , ·))k
n

as n → ∞ and λn → 0 (at any rate relative to n). Hence, if in addition Assumptions 1 and 3
hold with K = −(EP [∇2 h(x∗ , ξ)])−1 ∇Lip(h(x∗ , ·)), then nG (x̂EO
n ) is asymptotically second-order

stochastically dominated by nG (x̂nW −DRO,λn ).

Corollary 1 is an immediate consequence of Theorem 3 by observing the Lipschitz-regularized

reformulation of Wasserstein DRO. Though we do not pursue further in obtaining sufficient con-
ditions for the required regularity properties of Lip(h(x, ·)), we point out that this can be written
explicitly in terms of the dual norm of k · k in (22) for interesting examples such as regression
and classification problems (Esfahani and Kuhn 2018, Blanchet et al. 2019, Kuhn et al. 2019).
Lam: On the Impossibility of Statistically Improving Empirical Optimization 19

Moreover, more generally (i.e., p 6= 1 and h(x, ·) not necessarily convex), the worst-case objective
maxQ∈Uλ EQ [h(x, ξ)] admits an expansion similar to (21) where the first-order term is λV (h(x, ·)),
with V (h(x, ·)) being some variability measure of h in which Lip(h(x, ·)) is a special case (Gao et al.
2017a).

3.3. Parametric Optimization

We have focused on nonparametric problems thus far in the previous subsections. Our framework
applies similarly if we confine the unknown distribution P to a parametric model, and we will
present an impossibility result in improving parametric optimization using regularization.
In this parametric case, we write Z(x) = ψ(x, θ ∗ ) where θ ∗ is an unknown parameter in the model,
e.g., ψ(x, θ) = Eθ [h(x, ξ)] where Eθ [·] denotes the expectation under a parametric distribution, say
Pθ , that generates ξ and θ ∗ is the true value of θ. Evidently, we can write θ ∗ = θ(P ) where θ(·) is
viewed as a function on the true distribution P (e.g., θ(·) can be the solution to the score function
equation), so we can define ψ̃(x, Q) = ψ(x, θ(Q)) and thus Z(x) = ψ̃(x, P ). Suppose, given data,
we use a consistent estimator θ̂n to plug into ψ(x, θ̂n ), then we can write θ̂n = θ(P̂n ). Thus we
have minx∈X ψ̃(x, P̂n ) as our EO procedure, which reduces to the same setting as the previous
subsections. Note that the above discussion implicitly assumes the model is correctly specified, so
that the estimator θ̂n converges to the true parameter value θ ∗ . Nonetheless, the framework still
works even if the model is misspecified, in which case θ ∗ denotes the minimizer of the Kullback-
Leibler divergence between the true model and the considered model class, and θ̂n is then consistent
in estimating θ ∗ under mild conditions.
To make our discussion precise, let x∗ be an optimal solution to minx∈X {Z(x) = ψ(x, θ ∗ )} in
this parametric setting, where θ ∗ ∈ Rm is the true parameter (either when the model is correctly
specified, or interpreted as the minimizer of the Kullback-Leibler divergence when it is incorrectly
specified). To avoid ambiguity, we write ∇x ∈ Rd and ∇2x ∈ Rd as the gradient and Hessian taken
with respect to x, and ∇xθ ∈ Rd×m as the cross-gradients with respect to θ and x. We write
x̂nP −EO ∈ argminx∈X ψ(x, θ̂n ) where θ̂n is obtained from an estimation procedure on the probability
distribution model that satisfies the asymptotic behavior in (24) depicted momentarily. We also
consider a penalty function R : Rd → R, and the regularized formulation

x̂nP −Reg,λ ∈ argminx∈X {ψ(x, θ̂n ) + λR(x)}

We have the following result:

Theorem 5 (Impossibility of improvement over EO in parametric optimization).

Suppose Assumptions EC.9 and EC.10 in Appendix EC.5 hold. Also suppose θ̂n is obtained from
data such that
∗ 1
θ̂n − θ = hIFθ , P̂n − P i + op √ (24)
n
20 Lam: On the Impossibility of Statistically Improving Empirical Optimization

where IFθ (ξ) : Ξ → Rd has CovP (IFθ (ξ)) that is finite. Then

1
x̂nP −Reg,λ − x∗ = −h(∇2x ψ(x∗ , θ ∗ ))−1 ∇xθ ψ(x∗ , θ ∗ )IFθ , P̂n − P i− λ(∇2xψ(x∗ , θ ∗ ))−1 ∇x R(x)+op √ +λ
n
as n → ∞ and λ → 0. Hence, if in addition Assumptions 1 and 3 hold with K =
−(∇2x ψ(x∗ , θ ∗ ))−1 ∇x R(x), then nG (x̂nP −EO ) is asymptotically second-order stochastically dominated
by nG (x̂Pn −Reg,λn ) for any λn → 0.

The asymptotic (24) can be ensured by standard conditions. For instance, in the case where
there is no model misspecification and we use maximum likelihood estimator for θ̂n , (24) holds
under the Lipschitzness of the log-likelihood function, with IFθ (·) = Iθ−1
∗ sθ ∗ (·) where sθ ∗ (·) is the

score function and Iθ∗ is the Fisher information matrix (see, e.g., Van der Vaart 2000 Theorem
5.39).

3.4. Bayesian Optimization

We generalize the discussion in Section 3.3 to Bayesian optimization, in the following sense. We
impose a prior distribution on the unknown parameter θ, and obtain x̂nP −Bay,λ from

x̂nP −Bay,λ ∈ argminx∈X {EΘ|Dn [ψ(x, Θ)] + λR(x)} (25)

where EΘ|Dn [·] is the posterior distribution of θ given the collection of data Dn = {ξ1 , . . . , ξn },
and we denote Θ as the random variable distributed under this posterior distribution. In other
words, x̂nP −Bay,0 optimizes the posterior expectation of the original objective function ψ(x, θ), while
x̂nP −Bay,λ imposes additionally a regularizing penalty R with the regularization parameter λ. We
have the following result:

Theorem 6 (Impossibility of improvement over EO via Bayesian optimization).

Suppose all assumptions in Theorem 5 and Assumptions EC.11, EC.12 and EC.13 in Appendix
EC.6 hold. In addition, assume that

J p
PΘ|Dn − N θ̂n , →0 (26)
n TV

for some θ̂n where k · kT V denotes the total variation distance, PΘ|Dn is the posterior distribution of
√
θ, and J is some covariance matrix. Moreover, assume that n(Θ − θ̂n )|Dn is uniformly integrable
a.s. as n → ∞. Then

1
x̂nP −Bay,λ − x∗ = −h(∇2x ψ(x∗ , θ ∗ ))−1 ∇xθ ψ(x∗ , θ ∗ )IFθ , P̂n − P i− λ(∇2xψ(x∗ , θ ∗ ))−1 ∇x R(x)+op √ +λ
n
as n → ∞ and λ → 0. Hence, if in addition Assumptions 1 and 3 hold with K =
−(∇2x ψ(x∗ , θ ∗ ))−1 ∇x R(x), then nG (x̂nP −EO ) is asymptotically second-order stochastically dominated
by nG (x̂Pn −Bay,λn ) for any λn → 0.
Lam: On the Impossibility of Statistically Improving Empirical Optimization 21

Note that the asymptotic expressions of x̂nP −Bay,λ − x∗ in Theorem 6 and x̂nP −Reg,λ − x∗ in Theorem
5 are the same. This in particular implies that the simple Bayesian solution, x̂nP −Bay,0 , performs
asymptotically equivalently to the EO solution and, moreover, x̂nP −Bay,λ and x̂nP −Reg,λ perform
asymptotically equivalently.
Finally, we mention that the convergence (26) is standard, as guaranteed by the Berstein-von
Mises Theorem (e.g., Theorem 10.1 and the discussion on P.144 in Van der Vaart 2000).

4. Comparisons with Existing Literature

As we have mentioned in the introduction, in the literature data-driven formulations other than EO
have been proposed for good reasons. A focal point in recent years has been on DRO, which have
been studied by a range of interesting works. Below we compare our framework with these DRO
works in Sections 4.1 and 4.2, but noting that our framework in Section 2 applies more generally
to all “expanded” procedures in addition to DRO. In Sections 4.3 and 4.4 we connect our result
to semiparametric inference in statistics, and discuss alternate settings in which other approaches
than EO could offer advantages.

4.1. Absolute Performance Bounds and Out-of-Sample Disappointment

A common guarantee studied in the DRO literature is that the worst-case or robust objective
value provides an upper bound on the true objective value evaluated at the obtained solution.
Consider the expected value minimization problem minx∈X {Z(x) = EP [h(x, ξ)]}. Suppose Uλ is an
uncertainty set that contains the true distribution P . Then the DRO solution x̂DRO obtained from
solving minx∈X maxQ∈Uλ EQ [h(x, ξ)] must satisfy

Z(x̂DRO ) ≤ max EQ [h(x̂DRO , ξ)]

Q∈Uλ

Moreover, when data are observed, suppose Uλ is constructed as a high-confidence region for P , i.e.,
P(P ∈ Uλ ) ≥ 1 − α for some confidence level 1 − α. Then this confidence level can be translated into
at least the same confidence on a bound of the performance of x̂DRO via the worst-case objective
value, given by
DRO DRO
P Z(x̂ ) ≤ max EQ [h(x̂ , ξ)] ≥ 1 − α (27)
Q∈Uλ

where P denotes the probability with respect to the data. The above argument can be finite-sample
or asymptotically argued (in the latter case, the asymptotic confidence guarantee of the uncer-
tainty set would translate to the asymptotic confidence guarantee of the performance bound).
Such type of guarantees has been studied in, e.g., Delage and Ye (2010), Goh and Sim (2010),
Hanasusanto et al. (2015), Lam and Mottet (2017), Ben-Tal et al. (2013), Bertsimas et al. (2018),
Jiang and Guan (2016), Esfahani and Kuhn (2018). Van Parys et al. (2020) and Sutter et al.
22 Lam: On the Impossibility of Statistically Improving Empirical Optimization

(2020) in particular call the complement of the probability in the left hand side of (27) the out-of-
sample disappointment.
Our results in Sections 2 and 3.2 differ from the bound (27) in two important aspects. First
is that we are measuring the quality of solution by a ranking of the true objective value. That
is, an obtained solution x̂ that has a smaller value of Z(x̂) is regarded as more desirable. This is
different from (27) that guarantees the validity of the estimated objective value in bounding the
true value of an obtained solution. Note that the latter validity does not necessarily imply the
obtained solution performs better in the true objective value. In fact, we show that DRO cannot
be superior to EO in the latter aspect, at least in the large-sample regime that we consider.
Regarding which criterion, (27) or ours, should an optimizer use, it may depend on the particular
situation of interest. In terms of comparing solution quality, we believe there should be little
argument against our criterion of ranking the attained true objective value, as this appears the most
direct measurement of solution performance. On the other hand, in some high-stake situations, an
optimizer may want to obtain a reliable upper estimate, or to ensure a low enough upper bound, of
the attained objective value, in which case the conventional DRO guarantee (27) would be useful.
Our second distinction from bound (27) is that we study the error relative to the oracle best
solution, i.e., our approximation is on the optimality gap Z(x̂) − Z(x∗ ) for an obtained solution
x̂. A claimed drawback of using bound (27) is that it could be loose, thus unable to detect the
over-conservativeness of DRO (e.g., one can simply take Uλ to be extremely large, so that (27)
trivially holds). This latter criticism is resolved to an extent by a series of work that shows that, by
choosing Uλ properly, the worst-case objective value maxQ∈Uλ E[h(x, ξ)] differs from the true value
EP [h(x, ξ)] only by a small amount. This includes the empirical likelihood (Lam and Zhou 2017,
Lam 2019, Duchi and Namkoong 2019, Duchi et al. 2021), Bayesian (Gupta 2019), and large devi-
ations perspective (Van Parys et al. 2020) in divergence DRO, and the profile likelihood and vari-
ability regularization for Wasserstein DRO (Blanchet et al. 2019, Gao et al. 2017b). Moreover, it
can be shown that certain divergence DRO gives rise to the best possible bound in the form of (27),
which is argued via a match of the statistical performance with the CLT (Lam 2019, Gupta 2019,
Duchi et al. 2021) or the Bernstein bound (Duchi and Namkoong 2019), or a “meta-optimization”
that gives the tightest such bound subject to an allowed large deviations rate (Van Parys et al.
2020). Nonetheless, all these works mainly focus on an upper bound on Z(x̂) instead of the opti-
mality gap Z(x̂) − Z(x∗ ), the latter being more challenging due to the unknown oracle true optimal
value Z(x∗ ).
Lam: On the Impossibility of Statistically Improving Empirical Optimization 23

4.2. Bias-Variance Tradeoff

Consider the expected value minimization problem minx∈X {Z(x) = EP [h(x, ξ)]} where h is some
loss function. Gotoh et al. (2018, 2021) studies a regularized formulation using φ-divergence as the
penalty, defined as (in the notation of the current paper):

1
min max EQ [h(x, ξ)] + D(Q, P̂n ) (28)
x∈X Q λ

where Q lies in the relevant space of probability distributions. This formulation is akin to the
divergence DRO we discussed in Section 3.2, but using the Lagrangian formulation directly instead
of starting with the notion of uncertainty set. The parameter λ in (28) plays a similar role as the
ball size of our uncertainty set.
The main insight from Gotoh et al. (2018, 2021) is a desirable improvement on the bias-variance
tradeoff using the solution obtained in (28), which we call a “Lagrangian (L)-DRO” solution
x̂nL−DRO,λ . More precisely, Gotoh et al. (2018) concludes (Theorem 5.1 therein) that

EP̂n [h(x̂nL−DRO,λ , ξ)]

λ2
= EP̂n [h(x̂EO
n , ξ)] + CovP̂n (h(x̂EO EO ⊤ 2 EO
n , ξ), ∇h(x̂n , ξ)) (EP̂n [∇ h(x̂n , ξ)])
−1
CovP̂n (h(x̂EO EO
n , ξ), ∇h(x̂n , ξ))
2φ∗ ′′ (0)2
+ o(λ2 ) (29)

and

V arP̂n [h(x̂nL−DRO,λ , ξ)]

2λ
= V arP̂n [h(x̂EO
n , ξ)] − ∗ ′′ CovP̂n (h(x̂EO EO ⊤ 2 EO
n , ξ), ∇h(x̂n , ξ)) (EP̂n [∇ h(x̂n , ξ)])
−1
CovP̂n (h(x̂EO EO
n , ξ), ∇h(x̂n , ξ))
φ (0)
+ o(λ) (30)

as λ → 0, and Gotoh et al. (2021) develops these expansions further to center at EP [h(x̂EO
n , ξ)] and

V arP (h(x̂EO
n , ξ)). From (29) and (30), we see that while L − DRO deteriorates the expected loss, it

reduces the variance of the loss by a large magnitude (λ versus λ2 ), giving an overall improvement
on the mean squared error. In this sense, injecting a small λ > 0 in L − DRO is desirable compared
to EO with λ = 0.
Putting aside the technical differences in using L-DRO versus the common DRO in (18) (the
latter requires an extra layer of analysis in the Lagrangian reformulation), we point out two con-
ceptual distinctions between Gotoh et al. (2018, 2021) and our results in Section 2 and 3.2. First is
the criterion in measuring the quality of an obtained solution x̂, in particular the role of the vari-
ance of the loss function h(x, ξ), V arP (h(x, ξ)). Our criterion is in terms of the achieved optimality
gap, or equivalently Z(x̂), where Z(·) is the true objective function of the original optimization. In
24 Lam: On the Impossibility of Statistically Improving Empirical Optimization

particular, any risk-aware consideration should be already incorporated into the construction of the
loss h. When an obtained solution x̂ is used in many future test cases, an estimate of Z(x̂), using
ntest test data points, has a variance given by V arP (h(x̂, ξ))/ntest (instead of V arP (h(x̂, ξ))), and
thus the variance of h plays a relatively negligible role. This is different from Gotoh et al. (2018,
2021) who takes an alternate view that puts more weight on the variability of the loss function.
Our second main distinction from Gotoh et al. (2018, 2021) is our consideration of the attained
true objective value or generalization performance Z(x̂) as a random variable, and we study its
risk profile by assessing its entire distribution that exhibits the second-order stochastic dominance
relation put forth in Section 2. Here, the randomness of Z(x̂) comes from the statistical noise
from data used in constructing the obtained solution x̂. In contrast, Gotoh et al. (2018) studies
the mean of the attained objective value (and variance in the sense described above). In this latter
setting, as we have seen in (17), any expanded procedure with parameter λ satisfying Assumption
2 (not only divergence DRO) would lead to a deterioration of order λ2 from EO. This does show
an inferiority of the expanded procedure, but it does not give a complete picture as the attained
objective value can be better in other distributional aspects. Our Theorem 1 stipulates that even
considering the entire distribution, an expanded procedure like DRO still cannot outperform EO.

4.3. Connections to Local Asymptotic Minimax Theorem and Cramer-Rao Bound

Our main result connects to the classical asymptotic minimax theorem in semiparametric infer-
ence and the well-known Cramer-Rao bound. Consider the optimal solution x∗ as an estima-
tion target, and we construct an estimator x̂n from data. It is known from the local asymp-
totic minimax theorem (Kosorok 2007 Theorem 18.4) that, for any so-called regular estima-
√
tor x̂n (Kosorok 2007 Section 18.1), n(x̂n − x∗ ) asymptotically dominates N (0, Σ) where Σ =
EP [∇2 h(x∗ , ξ)])−1 CovP (∇h(x∗ , ξ))EP [∇2 h(x∗ , ξ)])−1 . Note that the latter variable is precisely the
√
limit of n(x̂EO ∗
n − x ) (Σ/n is the the variance of the first term in (16)). This asymptotic domi-
√
nance is in the sense that E[ℓ( n(x̂n − x∗ ))] ≥ E[ℓ(N (0, Σ))] under a proper asymptotic limit, for
any so-called subconvex function ℓ, namely ℓ such that the sublevel sets {y : ℓ(y) ≤ c} for any c ∈ R
are convex, symmetric about the origin, and closed.
On the other hand, our second-order stochastic dominance in Theorem 1 shows that E[f (nG (x̂n ))]
for any considered data-driven solution x̂n is at least E[f (nG (x̂EO
n )]), for any non-decreasing con-

vex function f . Noting from (8) that G (x̂n ) ≈ 21 (x̂n − x∗ )⊤ ∇2 Z(x∗ )(x̂n − x∗ ), our result implies,
√ √
roughly speaking, that E[f (g( n(x̂n − x∗ ))] is at least E[f (g( n(x̂EO ∗
n − x ))] asymptotically, where

g(y) = k(∇2 Z(x∗ ))1/2 y k2 which is convex and thus f ◦ g is subconvex by the non-decreasing convex
property of f and the form of g. Our result thus, at least intuitively, coincides with the implica-
tion of the minimax theorem. However, the minimax theorem considers a perturbation of the true
Lam: On the Impossibility of Statistically Improving Empirical Optimization 25

parameter value within a shrinking neighborhood at a specific rate (which leads to the notion of
regular estimators and also defines the plausible parameter values to be considered in the minimax
regime), and it claims the dominance relation using Anderson’s lemma, a result in convex geometry
(Van der Vaart 2000 Theorem 8.5). On the other hand, our Theorem 1 is obtained by perturbing
the data-driven procedure parametrized by λ, and analyzes the optimality gap or objective value
via second-order stochastic dominance. Our main insight to claim the superiority of EO, which
uses the mean-preserving spread in risk-based ranking, appears novelly beyond the semiparametric
literature.
Moreover, it is perhaps revealing to cast our result in the context of the conventional Cramer-
Rao bound. The latter states that, under suitable smoothness conditions, maximum likelihood is
the best among all possible estimators in estimating (functions of) unknown model parameters in
terms of asymptotic variance. Suppose we use a particular objective function, namely the expected
log-likelihood, in our framework. The Cramer-Rao bound would intuitively conclude the expected
optimality gap is best attained via EO, and thus our result can be viewed as a local generalization in
that it applies to general objective functions and concludes the superiority of the entire distribution
of the optimality gap when using EO.
To explain in detail, first note that (7) stipulates that the EO solution, x̂EO
n , satisfies

1 EO
Z(x̂EO ∗ ∗ ⊤ 2 ∗ EO ∗
n ) − Z(x ) ≈ (x̂n − x ) ∇ Z(x )(x̂n − x )
2

Now consider, under suitable conditions like the ones for Section 3,

E[(x̂EO ∗ ⊤ 2 ∗ EO ∗ ⊤ 2 ∗
n − x ) ∇ Z(x )(x̂n − x )] = E[hIF, P̂n − P i ∇ Z(x )hIF, P̂n − P i]
" n n
#
1X 1 X
≈E ∇h(x∗ , ξi )⊤ ∇2 Z(x∗ )−1 ∇2 Z(x∗ )∇2 Z(x∗ )−1 ∇h(x∗ , ξi )
n i=1 n i=1
tr(∇2 Z(x∗ )−1 Cov(∇h(x∗ , ξ)))
= (31)
n

where in the approximate equality we apply Theorem 3 with λ = 0 to deduce x̂EO n − x∗ ≈

Pn
∇2 Z(x∗ )−1 n1 i=1 ∇h(x∗ , ξi ) and noting that ∇2 Z(x∗ ) = EP [∇2 h(x∗ , ξ)], and in the last equality
we use the first-order optimality condition that EP [∇h(x∗ , ξ)] = 0.
Now, consider the special case where Z(x) = EP [h(x, ξ)] and h(x, ξ) = − log fx (ξ) for a paramet-
ric class of density {fx (·)}x . Suppose fx∗ (ξ) is the true density of the random variable ξ, whose
distribution is P . We therefore have Z(x) = EP [− log fx (ξ)] and x∗ , which is the true value of the
parameter in the density function of ξ, satisfies x∗ = argminx Z(x) as it minimizes the Kullback-
Leibler divergence from the true distribution among all densities parametrized as fx (·).
26 Lam: On the Impossibility of Statistically Improving Empirical Optimization

Note that in this special case,

tr(∇2 Z(x∗ )−1 CovP (∇h(x∗ , ξ))) = tr(∇2 Z(x∗ )−1 EP [∇h(x∗ , ξ))2 ]

= tr(∇2 Z(x∗ )−1 EP [∇ log fx∗ (ξ))2 ]

= tr(∇2 Z(x∗ )−1 (−EP [∇2 log fx∗ (ξ)]))

= tr(∇2 Z(x∗ )−1 ∇2 Z(x∗ ))

where we use the first-order optimality condition EP [∇h(x∗ , ξ)] = 0 in the first equality and the
equivalent expression of the Fisher information matrix in the third equality. This shows that (31)
becomes d/n in this special case.
Now consider x̂EO
n as an estimator of x∗ , or that (∇2 Z(x∗ ))1/2 x̂EO
n as an estimator
of (∇2 Z(x∗ ))1/2 x∗ . The Cramer-Rao bound states that, for any nearly unbiased estimator
(∇2 Z(x∗ ))1/2 x̂n , we have

tr(CovP ((∇2 Z(x∗ ))1/2 x̂n )) ≥ tr(∇ψ(x∗ )′ (nI)−1 ∇ψ(x∗ ))

where ψ(x) = Ex [(∇2 Z(x∗ ))1/2 x̂n ] = (∇2 Z(x∗ ))1/2 Ex x̂n , with Ex [·] being the expectation taken
under fx (·), and I is the Fisher information matrix given by I = −EP [∇2 log fx∗ (ξ)] = ∇2 Z(x∗ ).
Now, consider any estimator x̂n that satisfies Ex [x̂n ] = x + O(1/n) (such as all the data-driven
solutions we discussed in Section 3 using small enough λn , i.e., λn = O(1/n)). We have ∇ψ(x∗ ) ≈
(∇2 Z(x∗ ))1/2 and

⊤ d
E k(∇2 Z(x∗ ))1/2 (x̂n − x∗ )k2 ≈ tr(CovP ((∇2 Z(x∗ ))1/2 x̂n )) ≥ tr((∇2Z(x∗ ))1/2 (nI)−1 (∇2 Z(x∗ ))1/2 ) =
n

Therefore, in the case where ξ ∼ fx∗ (·) and h(x, ξ) = − log fx (ξ), the EO solution x̂EO
n is approxi-
mately optimal in the sense that the optimality gap Z(x) − Z(x∗ ), which is approximately 21 (x −
x∗ )⊤ ∇2 Z(x∗ )(x − x∗) for any estimator x close to x∗ in expectation, has expectation E[Z(x) − Z(x∗ )]
minimized at x̂EO
n up to a negligible error. Our main result is a generalization of the above. It
suggests that x̂EO
n not only minimizes Z(x) − Z(x∗) in expectation, but also minimizes E[g(Z(x) −
Z(x∗ ))] for any convex non-decreasing loss function g(·), or equivalently that Z(x̂n ) − Z(x∗ ) for
any data-driven solution x̂n second-order stochastically dominates Z(x̂EO ∗
n ) − Z(x ).

4.4. Beyond Empirical Optimization

Finally, while this paper advocates the superiority of EO in standard large-sample settings, we
point out, as also mentioned in the introduction, that there are many other settings where alternate
data-driven strategies than EO offer benefits. We have already mentioned some of them in Sections
Lam: On the Impossibility of Statistically Improving Empirical Optimization 27

4.1 and 4.2 when comparing with some existing DRO studies. We close this paper by discussing
some other prominent examples and the types of procedures that are demonstrably powerful:
Finite-sample performance: We have focused and claimed EO is best in the large-sample regime,
but for small sample the situation could be different. While it may be difficult to argue there are
generally superior procedures than EO (or vice versa) in finite sample, there exists documented
situations where some approaches are consistently better than EO. For example, the so-called
operational statistic (Liyanage and Shanthikumar 2005, Chu et al. 2008) strengthens data-driven
solutions to perform better than simple choices such as EO, uniformly across all possible parameter
values (in a parametric problem) and sample size. It does so by expanding the class of data-driven
solutions in which a better solution is systematically searched via a second-stage optimization or
Bayesian analysis. In certain structural problems, for instance those arising in inventory control,
this approach provably and empirically performs better than EO.
High dimension: We have fixed the problem dimension throughout this paper. It is well-known,
however, that certain regularization schemes, e.g., L1 -penalty as in LASSO (Friedman et al. 2001),
are suited for high-dimensional problems that exhibit sparse structures. Through the equivalence
with regularization, this also implies that certain types of Wasserstein DRO enjoy similar statistical
benefits (Blanchet et al. 2019, Shafieezadeh-Abadeh et al. 2019).
Distortion of loss function properties: For problems that do not satisfy our imposed smoothness
√
conditions, the decay rate of G (x̂EO 2
n ) could be 1/ n instead of 1/n. In this situation, χ -divergence

DRO, which exhibits variance regularization, can be used to boost the rate to 1/n in certain
examples (Duchi and Namkoong 2019). Moreover, it is also shown that DRO using a distance
induced from the reproducing kernel Hilbert space (Staib and Jegelka 2019), which relates to the
so-called kernel mean matching (Gretton et al. 2009), could exhibit finite-sample optimality gap
bounds that are free of any complexity of the loss function class, which is drastically distinct from
EO’s behavior (Zeng and Lam 2021).
Data pooling and contextual optimization: When there are simultaneously many stochastic
optimization problems to solve, it is shown that introducing a shrinkage onto EO can enhance
the tradeoff between the optimality gap and instability (Gupta and Rusmevichientong 2021,
Gupta and Kallus 2021). This shrinkage is in a similar spirit as the so-called James-Stein estimator
(Cox and Hinkley 1979) in classical statistics, in which individual estimators, or the solutions of the
individual optimization problems, are adjusted by “weighting” with a pooled estimator. Relatedly,
when the considered optimization problem involves parameters or outcomes that depend on covari-
ates, using no such covariate information, or using the naive predict-then-optimize approach, can
be improved by integrating the prediction and EO steps that leads to a better ultimate objective
value performance (e.g., Ban and Rudin 2019, Elmachtoub and Grigas 2021).
28 Lam: On the Impossibility of Statistically Improving Empirical Optimization

Acknowledgments
I gratefully acknowledge support from the National Science Foundation under grants CAREER CMMI-
1834710 and IIS-1849280.

References
Asmussen S, Glynn PW (2007) Stochastic Simulation: Algorithms and Analysis, volume 57 (Springer Science
& Business Media).

Ban GY, Rudin C (2019) The big data newsvendor: Practical insights from machine learning. Operations
Research 67(1):90–108.

Bayraksan G, Love DK (2015) Data-driven stochastic programming using phi-divergences. Tutorials in Oper-
ations Research, 1–19 (INFORMS).

Ben-Tal A, Den Hertog D, De Waegenaere A, Melenberg B, Rennen G (2013) Robust solutions of optimization
problems affected by uncertain probabilities. Management Science 59(2):341–357.

Ben-Tal A, El Ghaoui L, Nemirovski A (2009) Robust Optimization (Princeton University Press).

Bertsimas D, Brown DB, Caramanis C (2011) Theory and applications of robust optimization. SIAM Review
53(3):464–501.

Bertsimas D, Gupta V, Kallus N (2018) Robust sample average approximation. Mathematical Programming
171(1-2):217–282.

Blanchet J, Kang Y, Murthy K (2019) Robust Wsserstein profile inference and applications to machine
learning. Journal of Applied Probability 56(3):830–857.

Blanchet J, Murthy K (2019) Quantifying distributional model risk via optimal transport. Mathematics of
Operations Research 44(2):565–600.

Chen L, Ma W, Natarajan K, Simchi-Levi D, Yan Z (2018) Distributionally robust linear and discrete
optimization with marginals. Available at SSRN 3159473 .

Chen R, Paschalidis IC (2018) A robust learning approach for regression models based on distributionally
robust optimization. Journal of Machine Learning Research 19(13).

Chen X, He S, Jiang B, Ryan CT, Zhang T (2021) The discrete moment problem with nonconvex shape
constraints. Operations Research 69(1):279–296.

Chu LY, Shanthikumar JG, Shen ZJM (2008) Solving operational statistics via a Bayesian analysis. Opera-
tions Research Letters 36(1):110–116.

Cox DR, Hinkley DV (1979) Theoretical Statistics (CRC Press).

Delage E, Ye Y (2010) Distributionally robust optimization under moment uncertainty with application to
data-driven problems. Operations Research 58(3):595–612.

Dhara A, Das B, Natarajan K (2021) Worst-case expected shortfall with univariate and bivariate marginals.
INFORMS Journal on Computing 33(1):370–389.
Lam: On the Impossibility of Statistically Improving Empirical Optimization 29

Doan XV, Li X, Natarajan K (2015) Robustness to dependency in portfolio optimization using overlapping
marginals. Operations Research 63(6):1468–1488.

Duchi JC, Glynn PW, Namkoong H (2021) Statistics of robust optimization: A generalized empirical likeli-
hood approach. Mathematics of Operations Research .

Duchi JC, Namkoong H (2019) Variance-based regularization with convex objectives. J. Mach. Learn. Res.
20:68–1.

Dupuis P, Katsoulakis MA, Pantazis Y, Plechác P (2016) Path-space information bounds for uncertainty
quantification and sensitivity analysis of stochastic dynamics. SIAM/ASA Journal on Uncertainty
Quantification 4(1):80–111.

Elmachtoub AN, Grigas P (2021) Smart “predict, then optimize”. Management Science .

Esfahani PM, Kuhn D (2018) Data-driven distributionally robust optimization using the Wasserstein metric:
Performance guarantees and tractable reformulations. Mathematical Programming 171(1-2):115–166.

Friedman J, Hastie T, Tibshirani R (2001) The Elements of Statistical Learning, volume 1 (Springer Series
in Statistics New York).

Gao R, Chen X, Kleywegt AJ (2017a) Wasserstein distributional robustness and regularization in statistical
learning. arXiv e-prints arXiv–1712.

Gao R, Chen X, Kleywegt AJ (2017b) Wasserstein distributionally robust optimization and variation regu-
larization. arXiv preprint arXiv:1712.06050 .

Gao R, Kleywegt AJ (2016) Distributionally robust stochastic optimization with Wasserstein distance. arXiv
preprint arXiv:1604.02199 .

Ghaoui LE, Oks M, Oustry F (2003) Worst-case value-at-risk and robust portfolio optimization: A conic
programming approach. Operations Research 51(4):543–556.

Goh J, Sim M (2010) Distributionally robust optimization and its tractable approximations. Operations
Research 58(4-Part-1):902–917.

Gotoh JY, Kim MJ, Lim AE (2018) Robust empirical optimization is almost the same as mean–variance
optimization. Operations Research Letters 46(4):448–452.

Gotoh JY, Kim MJ, Lim AE (2021) Calibration of distributionally robust empirical optimization models.
Operations Research .

Gretton A, Smola A, Huang J, Schmittfull M, Borgwardt K, Schölkopf B (2009) Covariate shift by kernel
mean matching. Dataset Shift in Machine Learning 3(4):5.

Gupta V (2019) Near-optimal Bayesian ambiguity sets for distributionally robust optimization. Management
Science 65(9):4242–4260.

Gupta V, Kallus N (2021) Data pooling in stochastic optimization. Management Science .

30 Lam: On the Impossibility of Statistically Improving Empirical Optimization

Gupta V, Rusmevichientong P (2021) Small-data, large-scale linear optimization with uncertain objectives.
Management Science 67(1):220–241.

Hadar J, Russell WR (1969) Rules for ordering uncertain prospects. The American Economic Review
59(1):25–34.

Hampel FR (1974) The influence curve and its role in robust estimation. Journal of the American Statistical
Association 69(346):383–393.

Hanasusanto GA, Roitch V, Kuhn D, Wiesemann W (2015) A distributionally robust perspective on uncer-
tainty quantification and chance constrained programming. Mathematical Programming 151(1):35–62.

Hanoch G, Levy H (1969) The efficiency analysis of choices involving risk. The Review of Economic Studies
36(3):335–346.

Hong LJ, Huang Z, Lam H (2020) Learning-based robust optimization: Procedures and statistical guarantees.
Management Science .

Jiang R, Guan Y (2016) Data-driven chance constrained stochastic program. Mathematical Programming
158(1):291–327.

Kosorok MR (2007) Introduction to Empirical Processes and Semiparametric Inference (Springer Science &
Business Media).

Kuhn D, Esfahani PM, Nguyen VA, Shafieezadeh-Abadeh S (2019) Wasserstein distributionally robust opti-
mization: Theory and applications in machine learning. Operations Research & Management Science
in the Age of Analytics, 130–166 (INFORMS).

Lam H (2016) Robust sensitivity analysis for stochastic systems. Mathematics of Operations Research
41(4):1248–1275.

Lam H (2018) Sensitivity to serial dependency of input processes: A robust approach. Management Science
64(3):1311–1327.

Lam H (2019) Recovering best statistical guarantees via the empirical divergence-based distributionally
robust optimization. Operations Research 67(4):1090–1105.

Lam H, Mottet C (2017) Tail analysis without parametric models: A worst-case perspective. Operations
Research 65(6):1696–1711.

Lam H, Zhou E (2017) The empirical likelihood approach to quantifying uncertainty in sample average
approximation. Operations Research Letters 45(4):301 – 307.

Landsberger M, Meilijson I (1993) Mean-preserving portfolio dominance. The Review of Economic Studies
60(2):479–485.

Li B, Jiang R, Mathieu JL (2017) Ambiguous risk constraints with moment and unimodality information.
Mathematical Programming 1–42.
Lam: On the Impossibility of Statistically Improving Empirical Optimization 31

Li KC, et al. (1986) Asymptotic optimality of CL and generalized cross-validation in ridge regression with
application to spline smoothing. The Annals of Statistics 14(3):1101–1112.

Li KC, et al. (1987) Asymptotic optimality for Cp , CL , cross-validation and generalized cross-validation:
Discrete index set. The Annals of Statistics 15(3):958–975.

Lim AE, Shanthikumar JG, Shen ZM (2006) Model uncertainty, robust optimization, and learning. Models,
Methods, and Applications for Innovative Decision Making, 66–94 (INFORMS).

Liyanage LH, Shanthikumar JG (2005) A practical inventory control policy using operational statistics.
Operations Research Letters 33(4):341–348.

Póczos B, Xiong L, Schneider J (2012) Nonparametric divergence estimation with applications to machine
learning on distributions. arXiv preprint arXiv:1202.3758 .

Popescu I (2005) A semidefinite programming approach to optimal-moment bounds for convex classes of
distributions. Mathematics of Operations Research 30(3):632–657.

Rahimian H, Mehrotra S (2019) Distributionally robust optimization: A review. arXiv preprint

arXiv:1908.05659 .

Rockafellar RT (2015) Convex Analysis (Princeton University Press).

Rothschild M, Stiglitz JE (1970) Increasing risk: I. a definition. Journal of Economic theory 2(3):225–243.

Shafieezadeh-Abadeh S, Kuhn D, Esfahani PM (2019) Regularization via mass transportation. Journal of

Machine Learning Research 20(103):1–68.

Shaked M, Shanthikumar JG (2007) Stochastic Orders (Springer Science & Business Media).

Shapiro A, Dentcheva D, Ruszczyński A (2014) Lectures on Stochastic Programming: Modeling and Theory
(SIAM).

Sriperumbudur BK, Fukumizu K, Gretton A, Schölkopf B, Lanckriet GR, et al. (2012) On the empirical
estimation of integral probability metrics. Electronic Journal of Statistics 6:1550–1599.

Staib M, Jegelka S (2019) Distributionally robust optimization and generalization in kernel methods. Wal-
lach HM, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox EB, Garnett R, eds., Advances in Neural
Information Processing Systems, 9131–9141.

Sutter T, Van Parys BP, Kuhn D (2020) A general framework for optimal data-driven optimization. arXiv
preprint arXiv:2010.06606 .

Van der Vaart AW (2000) Asymptotic Statistics, volume 3 (Cambridge university press).

Van Der Vaart AW, Wellner JA (1996) Weak Convergence and Empirical Processes (Springer).

Van Parys BP, Esfahani PM, Kuhn D (2020) From data to decisions: Distributionally robust optimization
is optimal. Management Science .

Van Parys BP, Goulart PJ, Kuhn D (2016) Generalized Gauss inequalities via semidefinite programming.
Mathematical Programming 156(1-2):271–302.
32 Lam: On the Impossibility of Statistically Improving Empirical Optimization

Wang Q, Kulkarni SR, Verdú S (2009) Divergence estimation for multidimensional densities via k-nearest-
neighbor distances. IEEE Transactions on Information Theory 55(5):2392–2405.

Wang Z, Glynn PW, Ye Y (2016) Likelihood robust optimization for data-driven problems. Computational
Management Science 13(2):241–261.

Wiesemann W, Kuhn D, Sim M (2014) Distributionally robust convex optimization. Operations Research
62(6):1358–1376.

Wu D, Zhu H, Zhou E (2018) A Bayesian risk approach to data-driven stochastic optimization: Formulations
and asymptotics. SIAM Journal on Optimization 28(2):1588–1612.

Zeng Y, Lam H (2021) Complexity-free generalization via distributionally robust optimization. preprint .
e-companion to Lam: On the Impossibility of Statistically Improving Empirical Optimization ec1

Proofs of Statements

EC.1. Review of Useful Results

The following is a classical theorem on M -estimation in the presence of extra parameters in the
stochastic equation.

Theorem EC.1 (a.k.a. Theorem 5.31 in Van der Vaart 2000). Consider, for all x ∈
X ⊂ Rd and λ ∈ Rm such that kλ − λ0 k ≤ δ for a given point λ0 for some δ > 0 , ϕx,λ (·) is a function
mapping Ξ to Rd . Also consider (possibly random) sequences xn ∈ X and λn for n = 1, 2, . . .. Let
P be a distribution generating ξ ∈ Ξ and EP [·] denotes its expectation. Let P̂n denotes the empir-
ical distribution of i.i.d. observations ξ1 , . . . , ξn and EP̂n [·] denotes its expectation. We assume the
following conditions:
1. {ϕx,λ (·) : kx − x∗ k ≤ δ, kλ − λ0 k ≤ δ } is Donsker.
2. EP kϕx,λ (ξ) − ϕx∗,λ0 (ξ)k2 → 0 as (x, λ) → (x∗ , λ0 ).
3. EP [ϕx∗ ,λ0 (ξ)] = 0
4. EP [ϕx,λ (ξ)] is differentiable at x = x∗ , uniformly in λ within a small neighborhood of λ0 with
non-singular derivative matrix Vx∗ ,λ = ∇EP [ϕx∗ ,λ (ξ)] such that Vx∗ ,λ → Vx∗ ,λ0 .
√
5. nEP̂n [ϕxn ,λn (ξ)] = op (1)
p
6. (xn , λn ) → (x∗ , λ0 ) as n → ∞
Then

∗ 1
xn − x = −Vx−1 −1
∗ ,λ EP [ϕx∗ ,λn (ξ)] − Vx∗ ,λ EP̂ [ϕx∗ ,λ0 (ξ)] + op √ + kEP [ϕx∗ ,λn (ξ)]k
0 0 n
n

We also need the following classical consistency theorem in M -estimation.

Theorem EC.2 (a.k.a. Theorem 5.9 in Van der Vaart 2000). Let Ψn (·) be a random
vector-valued functions and Ψ(·) be a deterministic vector-valued function of x ∈ X such that for
every ǫ > 0,
p
sup kΨn (x) − Ψ(x)k → 0
x∈X

inf kΨ(x)k > 0 = kΨ(x∗ )k

x∈X :kx−x∗ k≥ǫ

Then any sequence xn such that Ψn (xn ) = op (1) converges in probability to x∗ .

ec2 e-companion to Lam: On the Impossibility of Statistically Improving Empirical Optimization

EC.2. Proofs for Section 2

Proof of Lemma 1. Define
Z(x) − Z(x∗ ) − 12 (x − x∗ )⊤ ∇2 Z(x∗ )(x − x∗ )
r(x) = (EC.1)
kx − x∗ k2
for x 6= x∗ and r(x∗ ) = 0. By Assumption 1, we know that r(x) is continuous at x∗ . From (EC.1),
we can write
1
Z(x̂λn ) − Z(x∗ ) = r(x̂λn )kx̂λn − x∗ k2 + (x̂λn − x∗ )⊤ ∇2 Z(x∗ )(x̂λn − x∗ ) (EC.2)
2
From Assumption 2, we have (3) and (5) with finite CovP (IF (ξ)). By the law of large numbers and
p p
Slutsky’s theorem, we have x̂λn → x∗ as n → ∞ and λ → 0 (at any rate). Therefore r(x̂λn ) → r(x∗ ).
√ √ Pn
Moreover, nhIF (ξ), P̂n − P i = n n1 i=1 IF (ξi ) − EP [IF (ξ)] ⇒ Y where Y is a Gaussian vector

with mean 0 and covariance CovP (IF (ξ)) by the CLT. Thus

λ ∗ 1 1 1
x̂n − x = Op √ + λK + op √ + λ = Op √ + λ (EC.3)
n n n
and hence
1
kx̂λn ∗ 2
− x k = Op + λ2
n
which gives
1
r(x̂λn )kx̂λn − x∗ k2 = op + λ2
n
Thus, (EC.2) becomes

1 1
Z(x̂λn ) − Z(x∗ ) = (x̂λn − x∗ )⊤ ∇2 Z(x∗ )(x̂λn − x∗ ) + op + λ2 (EC.4)
2 n
Now putting in (3), we get

Z(x̂λn ) − Z(x∗ )
⊤
1 1 2 ∗ 1
= hIF (ξ), P̂n − P i + λK + op √ + λ ∇ Z(x ) hIF (ξ), P̂n − P i + λK + op √ + λ
2 n n

1
+ op + λ2 (EC.5)
n
which can be further written as

1 1
(hIF (ξ), P̂n − P i + λK)⊤∇2 Z(x∗ )(hIF (ξ), P̂n − P i + λK) + op + λ2
2 n
or

1 1 1
hIF (ξ), P̂n − P i⊤ ∇2 Z(x∗ )hIF (ξ), P̂n − P i + λ2 K ⊤ ∇2 Z(x∗ )K +λK ⊤∇2 Z(x∗ )hIF (ξ), P̂n − P i +op + λ2
2 2 n

e-companion to Lam: On the Impossibility of Statistically Improving Empirical Optimization ec3

Proof of Proposition 2. By Assumption 2 and the definition of hIF (ξ), P̂n − P i in (4), the CLT
implies that
√
nhIF (ξ), P̂n − P i ⇒ Y

where Y is a Gaussian vector with mean 0 and covariance CovP (IF (ξ)). From (7), and using the
√
continuity of the quadratic function and Slutsky’s theorem, we have, when a = 0 so that nλn → 0,
that
1 √ ⊤ √ 1
nG (x̂λnn ) = nhIF (ξ), P̂n − P i ∇2 Z(x∗ ) nhIF (ξ), P̂n − P i + nλ2n K ⊤ ∇2 Z(x∗ )K
2 2
√ ⊤ 2 ∗
√
2

+ nλn K ∇ Z(x ) nhIF (ξ), P̂n − P i + op 1 + nλn
1 ⊤ 2
⇒ Y ∇ Z(x∗ )Y
2
√
Likewise, when a = ∞ or −∞, which means nλn → ∞ or −∞, we have
1 λn 1 √ ⊤
2 ∗
√ 1
G (x̂n ) = n hIF (ξ), P̂ n − P i ∇ Z(x ) n hIF (ξ), P̂ n − P i + K ⊤ ∇2 Z(x∗ )K
λ2 2nλ2n 2
1 √
1

+√ K ⊤ ∇2 Z(x∗ ) nhIF (ξ), P̂n − P i + op +1
nλn nλ2n
1
⇒ K ⊤ ∇2 (x∗ )K
2
√
Moreover, when nλn → a for some 0 < |a| < ∞, we have
1 √ ⊤ √ 1
nG (x̂λnn ) = 2
nhIF (ξ), P̂n − P i ∇ Z(x ) ∗
nhIF (ξ), P̂n − P i + nλ2n K ⊤ ∇2 Z(x∗ )K
2 2
√ ⊤ 2 ∗
√
2

+ nλn K ∇ Z(x ) nhIF (ξ), P̂n − P i + op 1 + nλn
1 ⊤ 2 1
⇒ Y ∇ Z(x∗ )Y + a2 K ⊤ ∇2 Z(x∗ )K + aK ⊤ ∇2 Z(x∗ )Y
2 2
Finally, when λ = 0, we either use the same line of arguments as above or observe that we are
in the case a = 0, which reduces to
1 ′ 2
nG (x̂EO ∗
n ) ⇒ Y ∇ Z(x )Y
2
This concludes the proposition.
Proof of Proposition 3. When a = 0, from Proposition 2 it is trivial to see that the weak limits of
p
nG (x̂λnn ) and nG (x̂EO λn 2 2 λn
n ) coincide. When a = ∞ or −∞, we have nG (x̂n ) = (nλn )(1/λn )G (x̂n ) → ∞

since (1/2)K ⊤∇2 Z(x∗ )K > 0 by Assumption 3. Finally, when 0 < |a| < ∞, we compare the weak
1 2
limits of nG (x̂λnn ) and nG (x̂EO ⊤ 2 ∗
n ) in Proposition 2. Note that 2 a K ∇ (x )K is deterministic and

positive. Moreover, aK ⊤ ∇2 (x∗ )Y satisfies

" #
1
E aK ⊤ ∇2 Z(x∗ )Y Y ⊤ ∇2 Z(x∗ )Y = 0 (EC.6)
2
ec4 e-companion to Lam: On the Impossibility of Statistically Improving Empirical Optimization

d
This is because Y , as a mean-zero Gaussian vector, is symmetric, i.e., Y = −Y , and
thus for any z ∈ R, Y given (1/2)Y ⊤ ∇2 Z(x∗ )Y = z has the same distribution as
−Y given (1/2)(−Y )⊤ ∇2 Z(x∗ )(−Y ) = z or equivalently (1/2)Y ⊤ ∇2 Z(x∗ )Y = z. This
implies E[Y |(1/2)Y ⊤ ∇2 Z(x∗ )Y ] = E[−Y |(1/2)Y ⊤ ∇2 Z(x∗ )Y ], which in turn implies
E[Y |(1/2)Y ⊤ ∇2 Z(x∗ )Y ] = 0 and thus (EC.6). Hence, by Proposition 1, we get that
(1/2)Y ⊤ ∇2 Z(x∗ )Y is second-order stochastically dominated by (1/2)Y ⊤ ∇2 Z(x∗ )Y +
(1/2)a2 K ⊤ ∇2 Z(x∗ )K + aK ⊤ ∇2 (x∗ )Y .
Proof of Theorem 1. This follows immediately from Proposition 3, allowing for the weak limit
of ∞ and using the subsequence argument in Definition 2 when needed.
Proof of Theorem 2. Note that Assumption 4 gives

X
∇Z(x∗ ) = − α∗j ∇gj (x∗ ) (EC.7)
j∈B

p
Moreover, since Assumption 2 implies x̂λn − x∗ → 0 as n → ∞ and λ → 0, we have, from Assumption
5, with probability converging to 1,

1
0 = gj (x̂λn ) − gj (x∗ ) = ∇gj (x∗ )⊤ (x̂λn − x∗ ) + (x̂λn − x∗ )⊤ ∇2 gj (x∗ )(x̂λn − x∗ ) + op (kx̂λn − x∗ k2 )
2

or
1
− ∇gj (x∗ )⊤ (x̂λn − x∗ ) = (x̂λn − x∗ )⊤ ∇2 gj (x∗ )(x̂λn − x∗ ) + op (kx̂λn − x∗ k2 ) (EC.8)
2

So, using (EC.7) and (EC.8), we get

Z(x̂λn ) − Z(x∗ )
1
= ∇Z(x∗ )⊤ (x̂λn − x∗ ) + (x̂λn − x∗ )⊤ ∇2 Z(x∗ )(x̂λn − x∗ ) + op (kx̂λn − x∗ k2 )
2
X 1
=− αj ∇gj (x ) (x̂n − x∗ ) + (x̂λn − x∗ )⊤ ∇2 Z(x∗ )(x̂λn − x∗ ) + op (kx̂λn − x∗ k2 )
∗ ∗ ⊤ λ

j∈B
2
1X ∗ λ 1
= αj (x̂n − x∗ )⊤ ∇2 gj (x∗ )(x̂λn − x∗ ) + (x̂λn − x∗ )⊤ ∇2 Z(x∗ )(x̂λn − x∗ ) + op (kx̂λn − x∗ k2 )
2 j∈B 2
!
1 λ X
= (x̂n − x∗ )⊤ ∇2 Z(x∗ ) + α∗j ∇2 gj (x∗ ) (x̂λn − x∗ ) + op (kx̂λn − x∗ k2 )
2 j∈B

It is now clear that the optimality gap Z(x̂λn ) − Z(x∗ ) behaves the same as (8) except that ∇2 Z(x∗ )
is replaced by ∇2 Z(x∗ ) + j∈B α∗j ∇2 gj (x∗ ). Thus the proofs of all theorems go through in the same
P

way as previously except the said change.

e-companion to Lam: On the Impossibility of Statistically Improving Empirical Optimization ec5

EC.3. Conditions and Proof of Theorem 3

We first impose the following regularity conditions on h and R:

Assumption EC.1 (Regularity conditions on the loss function). For any ξ ∈ Ξ,

1. h(·, ξ) is a.s. twice continuously differentiable in x ∈ X .
2. h(·, ξ) is L1 -Lipschitz, i.e.,

|h(x1 , ξ) − h(x2 , ξ)| ≤ L1 (ξ)kx1 − x2 k a.s.

for some non-negative function L1 (·) where EP [L1 (ξ)] < ∞.

3. ∇h(·, ξ) is L2 -Lipschitz, i.e.,

k∇h(x1 , ξ) − ∇h(x2 , ξ)k ≤ L2 (ξ)kx1 − x2 k a.s.

for some non-negative function L2 (·) where EP [L2 (ξ)2 ] < ∞.

4. EP [∇2 h(x∗ , ξ)] is finite and positive definite.

Assumption EC.2 (Loss function complexity). {∇h(x, ·) : kx − x∗ k ≤ δ } is Donsker for

some small δ > 0, and {∇h(x, ·) : x ∈ X } is Glivenko-Cantelli.

Assumption EC.3 (Regularity conditions on the penalty function). We have:

1. R is twice continuously differentiable in x ∈ X .
2. ∇2 R(x) is uniformly bounded for any x ∈ X .

Assumption EC.1 stipulates that h is sufficiently smooth, with gradients satisfying Lipschitzness
and curvature properties. Assumption EC.2 is a condition on the “functional complexity” of the
class {h(x, ·)} and implies that h(x, ξ) satisfies a suitable uniform law of large numbers and CLT,
properties that are commonly used to ensure the validity of EO. Assumption EC.3 stipulates that
R is sufficiently smooth with a bounded Hessian matrix.
Next we state explicitly that the oracle solution x∗ and data-driven solution x̂Reg,λ
n both satisfy
corresponding first-order optimality conditions.

Assumption EC.4 (First-order optimality conditions). We have:

1. x∗ is a solution to ∇EP [h(x, ξ)] = 0, and inf x∈X :kx−x∗k≥ǫ k∇EP [h(x, ξ)]k > 0 for every ǫ > 0.
2. x̂Reg,λ
n is a solution to ∇EP̂n [h(x, ξ)] + λ∇R(x) = 0.

Assumption EC.4 part 1 signifies that the oracle solution x∗ satisfies the first-order optimality
condition, and moreover is unique in this satisfaction. Part 2 of the assumption signifies that the
regularized data-driven solution x̂Reg,λ
n satisfies the first-order opitmality condition of the regular-
ized EO problem.
We are now ready to prove Theorem 3.
ec6 e-companion to Lam: On the Impossibility of Statistically Improving Empirical Optimization

Proof of Theorem 3. We verify all the conditions in Theorem EC.1 to conclude the result,
where we set xn in Theorem EC.1 to be x̂Reg,λ
n
n
. First of all, by Assumptions EC.1.1 and EC.1.2,
we can exchange derivative and expectation so that ∇EP [h(x, ξ)] = EP [∇h(x, ξ)] (e.g., Chapter
VII, Proposition 2.3 in Asmussen and Glynn 2007). Now, we take ψx,λ (ξ) in Theorem EC.1 as
∇h(x, ξ) + λ∇R(x), and set λ0 = 0 therein. Then:

Condition 1: Assumption EC.2 together with that R(x) is deterministic and satisfies Assump-
tion EC.3.1 give that {ψx,λ (·) : kx − x∗ k ≤ δ, |λ − λ0 | ≤ δ } is Donsker (Example 2.10.7 in
Van Der Vaart and Wellner 1996). Thus Condition 1 in Theorem EC.1 is satisfied.

Condition 2: Consider

EP k∇h(x, ξ) + λ∇R(x) − ∇h(x∗, ξ)k2 ≤ ((EP k∇h(x, ξ) − ∇h(x∗ , ξ)k2 )1/2 + λEP k∇R(x)k)2

≤ ((EP [L2 (ξ)2 ])1/2 kx − x∗ k + λEP k∇R(x)k)2

→0

as (x, λ) → (x∗ , 0), where we have used Minkowski’s inequality in the first inequality and Assump-
tion EC.1.3 in the second inequality. Thus Condition 2 in Theorem EC.1 is satisfied.

Condition 3: Condition 3 in Theorem EC.1 is satisfied since we have argued for Condition 1 above
that ∇EP [h(x, ξ)] = EP [∇h(x, ξ)], and that x∗ satisfies the first-order condition in Assumption
EC.4.1.

Condition 4: Assumptions EC.1.1 and EC.1.3 give that

∇h(x∗ + δej , ξ) − ∇h(x∗ , ξ)

≤ L2 (ξ)
δ

in a neighborhood of 0, where EP [L2 (ξ)] < ∞ and ej denotes a vector of 0’s except the j-th
component being 1, and δ 6= 0. Thus the dominated convergence theorem gives ∇EP [∇h(x∗ , ξ)] =
EP [∇2 h(x∗ , ξ)]. This shows that EP [∇h(x, ξ)] is differentiable at x = x∗ . Moreover, together with
Assumption EC.3.1 we have

∇(EP [∇h(x∗ , ξ)] + λ∇R(x∗ )) = EP [∇2 h(x∗ , ξ)] + λ∇2 R(x∗ )

Furthermore, clearly
EP [∇2 h(x∗ , ξ)] + λ∇2 R(x∗ ) → EP [∇2 h(x∗ , ξ)]
e-companion to Lam: On the Impossibility of Statistically Improving Empirical Optimization ec7

as λ → 0, and with Assumption EC.1.4 we have EP [∇2 h(x∗ , ξ)] + λ∇R(x∗ ) being positive definite
for λ in a neighborhood of 0. Thus Condition 4 in Theorem EC.1 is satisfied.

Condition 5: Condition 5 in Theorem EC.1 is satisfied since ∇(EP̂n [h(x, ξ)] + λR(x)) =
EP̂n [∇h(x, ξ)] + λ∇R(x) by Assumptions EC.1.1 and EC.3.1, and by Assumption EC.4.2 we have
x̂Reg,λ
n being a solution to EP̂n [∇h(x, ξ) + λ∇R(x)] = 0.

Condition 6: We show that Condition 6 in Theorem EC.1 holds, by setting xn to be x̂Reg,λ

n
n
in
Theorem EC.2 and proving its convergence in probability to x∗ . To this end, consider Ψn (x) =
EP̂n [∇h(x, ξ)] + λn ∇R(x) and Ψ(x) = EP [∇h(x, ξ)], so that we have Ψn (x̂Reg,λ
n
n
) = 0 and Ψ(x∗ ) = 0
as argued before. Since {∇h(x, ·) : x ∈ X } is Glivenko-Cantelli by Assumption EC.2, we have

p
sup kEP̂n [∇h(x, ξ)]+λn ∇R(x) − EP [∇h(x, ξ)]k ≤ sup kEP̂n [∇h(x, ξ)] − EP [∇h(x, ξ)]k +λn sup k∇R(x)k → 0
x∈X x∈X x∈X

as n → ∞ by Minkowski’s equality and Assumption EC.3.2. This concludes supx∈X kΨn (x) −
p
Ψ(x)k → 0 in Theorem EC.2. Moreover, Assumption EC.4 concludes inf x∈X :kx−x∗k≥ǫ kΨ(x)k > 0 =
p
kΨ(x∗ )k. Therefore, by Theorem EC.2, we have x̂Reg,λ
n
n
→ x∗ .

With the above six conditions, we can invoke Theorem EC.1 to obtain

x̂Reg,λ
n
n
− x∗

= −(EP [∇2 h(x∗ , ξ)])−1 EP [∇h(x∗ , ξ) + λn ∇R(x)] − (EP [∇2 h(x∗ , ξ)])−1 EP̂n [∇h(x∗ , ξ)]

1 ∗ ∗
+ op √ + kE[∇h(x , ξ) + λn ∇R(x )]k
n

2 ∗ −1 ∗ 2 ∗ −1 ∗ 1 ∗
= −h(EP [∇ h(x , ξ)]) ∇h(x , ξ), P̂n − P i − λn (EP [∇ h(x , ξ)]) ∇R(x ) + op √ + λn k∇R(x )k
n

since EP [∇h(x∗ , ξ)] = 0 by the first-order optimality condition in Assumption EC.4.1.

EC.4. Conditions and Proofs of Theorem 4

We need the following technical assumption on the involved optimization problem.

Assumption EC.5 (Validity of KKT conditions). With probability converging to 1 as n →

∞, we have
n h i o
∗ h(x,ξ)−β
1. The optimization problem minx∈X ,α≥0,β∈R αEP̂n φ α
+ αλn + β has a unique
solution (x̂∗n , α̂∗n , β̂n∗ ) that satisfies the KKT conditions with x̂∗n ◦
∈ X , the interior of X .
2. For all x ∈ X , maxD(Q,P̂n )≤λn EQ [h(x, ξ)] 6= ess supP̂n h(x, ξ).
ec8 e-companion to Lam: On the Impossibility of Statistically Improving Empirical Optimization

The first condition in Assumption EC.5 guarantees that we can use the KKT con-
ditions to locate the optimal solution of (18) (we will soon see that the problem
n h i o
minx∈X ,α≥0,β∈R αEP̂n φ∗ h(X)−β
α
+ αλn + β is a dualization of (18)). The second condition
ensures the obtained optimal solution is nontrivial. Both conditions need to be satisfied with over-
whelming probability as n → ∞.
Analogous to Assumption EC.4, we also need the following first-order optimality condition for
the oracle solution.

Assumption EC.6 (First-order optimality conditions). x∗ is a solution to ∇EP [h(x, ξ)] =

0, and inf x∈X :kx−x∗k≥ǫ k∇EP [h(x, ξ)]k > 0 for every ǫ > 0.

Next we present assumptions on h.

Assumption EC.7 (Regularity conditions on the loss function). For any ξ ∈ Ξ,

1. h(·, ξ) is a.s. twice continuously differentiable in x ∈ X .
2. h(·, ·) and ∇h(·, ·) are uniformly bounded over x ∈ X and ξ ∈ Ξ.
3. ∇h(·, ξ) is L2 -Lipschitz, i.e.,

k∇h(x1 , ξ) − ∇h(x2 , ξ)k ≤ L2 (ξ)kx1 − x2 k a.s.

for some non-negative function L2 (·) where EP [L2 (ξ)2 ] < ∞.

4. EP [∇2 h(x∗ , ξ)] is finite and positive definite.

Compared to Assumption EC.1, here we strengthen part 2 of the assumption to the uniform
boundedness of h and ∇h. Note that the uniform boundedness of ∇h implies L1 -Lipschitzness of h,
which is precisely Assumption EC.1.2. The stronger Assumption EC.7.2 can be potentially relaxed
but it helps streamline our mathematical arguments.

Assumption EC.8 (Loss function complexity). We have:

1. {h(x, ·) : x ∈ X }, {h(x, ·)2 : x ∈ X } and {∇h(x, ·) : x ∈ X } are Glivenko-Cantelli.
2. {φ∗ ′ (αh(x, ·) − β)∇h(x, ·) : kx − x∗ k ≤ δ, 0 ≤ α ≤ δ, 0 ≤ β ≤ δ } is Donsker for some small δ > 0.
3. inf x∈X V arP (h(x, ξ)) > 0.

Compared to Assumption EC.2, Assumption EC.8 requires the Glivenko-Cantelli property for h
and h2 , and the Donsker property for a larger function class {φ∗ ′ (αh(x·) − β)∇h(x, ·)}. Moreover,
it requires a non-degeneracy on h uniformly across all x ∈ X .
We are now ready to prove Theorem 4.
Proof of Theorem 4. Consider the inner maximization in (18). By a change of measure from Q
to P̂n , we rewrite it as
maxL≥0 EP̂n [h(x, ξ)L]
(EC.9)
subject to EP̂n [φ(L)] ≤ λ
e-companion to Lam: On the Impossibility of Statistically Improving Empirical Optimization ec9

where the decision variable is now the likelihood ratio L = L(ξ). Next, since L ≡ 1 is in the interior
of the feasible region for any λ > 0, Slater’s condition holds and we rewrite (EC.9) as the Lagrangian
formulation

min max EP̂n [h(x, ξ)L] − α(EP̂n [φ(L)] − λ) − β(EP̂n [L] − 1)

α≥0,β L≥0

= min max EP̂n [(h(x, ξ) − β)L − αφ(L)] + αλ + β

α≥0,β L≥0

= min EP̂n max{(h(x, ξ) − β)L − αφ(L)} + αλ + β
α≥0,β L≥0

h(x, ξ) − β
= min αEP̂n max L − φ(L) + αλ + β
α≥0,β L≥0 α

∗ h(x, ξ) − β
= min αEP̂n φ + αλ + β (EC.10)
α≥0,β α

where in the above we define 0φ∗ (a/0) = ∞ if a > 0 and 0φ∗ (a/0) = 0 for a ≤ 0 (i.e., when α = 0
the resulting objective function is equal to ∞ if ess supP̂ h(x, ξ) > β, and β otherwise). Thus, (18)
is equivalent to
h(x, ξ) − β
min αEP̂n φ∗ + αλ + β (EC.11)
x∈X ,α≥0,β α
Now suppose the two events in Assumption EC.5 hold, which occur with probability reaching 1 as
n → ∞. In the first event, the KKT conditions of (EC.11) uniquely identifies the optimal solution.
Denote (x̂λn , α̂λn , β̂nλ ) as the optimal solution of (EC.11). If α̂λn = 0, then β̂nλ = ess supP̂n h(x̂λn , ξ) and
the objective value of (EC.11) becomes ess supP̂n h(x̂λn , ξ), which violates the second event. Thus,
with probability reaching 1, we must have α̂λn > 0. We will focus on this case for the rest of the
proof and use a standard convergence in probability argument to handle the other case (which
occurs with vanishing probability) when needed.
The KKT conditions in the case α̂λn > 0 are precisely the first-order optimality conditions

∗′ h(x, ξ) − β
EP̂n φ ∇h(x, ξ) = 0
α

∗ h(x, ξ) − β ∗′ h(x, ξ) − β h(x, ξ) − β
EP̂n φ − EP̂n φ +λ=0
α α α

h(x, ξ) − β
−EP̂n φ∗ ′ +1=0
α

or equivalently

h(x, ξ) − β
∗′
EP̂n φ ∇h(x, ξ) = 0
α

h(x, ξ) − β h(x, ξ) − β h(x, ξ) − β
EP̂n φ∗ ′ − EP̂n φ∗ =λ
α α α

∗′ h(x, ξ) − β
EP̂n φ =1
α
ec10 e-companion to Lam: On the Impossibility of Statistically Improving Empirical Optimization

For convenience, we do a change of variable and define α̃ = 1/α and β̃ = β/α. Then the above
can be written as
h i
∗′
EP̂n φ (α̃h(x, ξ) − β̃)∇h(x, ξ) = 0 (EC.12)
h i h i
EP̂n φ∗ ′ (α̃h(x, ξ) − β̃)(α̃h(x, ξ) − β̃) − EP̂n φ∗ (α̃h(x, ξ) − β̃) = λ (EC.13)
h i
EP̂n φ∗ ′ (α̃h(x, ξ) − β̃) = 1 (EC.14)

and we denote (x̂λn , α̃λn , β̃nλ ) as the root to (EC.12)-(EC.14).

We verify Conditions 1-6 in Theorem EC.1. The main idea is to view (α̃λnn , β̃nλn ) as the estimated
parameter in Theorem EC.1 (i.e., the λn in Theorem EC.1 is now regarded as (α̃λnn , β̃nλn )).
Then x̂λnn is viewed as the root to (EC.12), which is parametrized by (α̃λnn , β̃nλn ) that in turn
depends on λn . In the following, we first show Condition 6 in Theorem EC.1, in the form
p
(x̂λnn , α̃λnn , β̃nλn ) → (x∗ , 0, 0) as n → ∞ and λn → 0 (at any rate).

Condition 6: We derive the asymptotic behaviors of α̃λn and β̃nλ . First consider (EC.14). We show
that we can find a unique root β̃n = β̃n (α̃) for (EC.14) as a function of α̃, at a neighborhood of
α̃ = 0. Here and from now on, we note that the values of α̃ and β̃ could depend on x but we will
suppress this dependence unless needed. Note that α̃ = β̃ = 0 solves (EC.14). Moreover, by Lemma
2 EP̂n [φ∗ ′ (α̃h(x, ξ) − β̃)] − 1 is continuously differentiable in α̃ and β̃ both at 0 and

∂
EP̂n [φ∗ ′ (α̃h(x, ξ) − β̃)] − 1 = −EP̂n [φ∗ ′′ (α̃h(x, ξ) − β̃)] = −φ∗ ′′ (0) < 0
∂ β̃ α̃=β̃=0 α̃=β̃=0

So by the implicit function theorem, we have β̃n (α̃) continuously differentiable in a neighborhood
around α̃ = 0. Furthermore, since

∂
E [φ∗ ′ (α̃h(x, ξ) − β̃)] − 1 = EP̂n [φ∗ ′′ (α̃h(x, ξ) − β̃)h(x, ξ)] = φ∗ ′′ (0)Ê[h(x, ξ)]
∂ α̃ P̂n
α̃=β̃=0 α̃=β̃=0

we have β̃ ′ (0) = −(−φ∗ ′′ (0)−1 )φ∗ ′′ (0)EP̂n [h(x, ξ)] = Ê[h(x, ξ)]. Moreover, since h is uniformly
bounded by Assumption EC.7.2, we have α̃h(x, ξ) − β̃ → 0 uniformly over x, ξ when α̃, β̃ → 0. Thus,
together with the mean value theorem, we have

β̃n (α̃) = β̃n′ (ζ)α̃ (EC.15)

for some ζ between 0 and α̃, when α̃ is in a sufficiently small neighborhood of 0, where β̃n′ (ζ)
converges to EP̂n [h(x, ξ)] uniformly over P̂n and x ∈ X as α̃ → 0.
e-companion to Lam: On the Impossibility of Statistically Improving Empirical Optimization ec11

Now consider (EC.13). By the mean value theorem, we can write it as

EP̂n [φ∗ ′ (0)(α̃h(x, ξ) − β̃)] + EP̂n [φ∗ ′′ (ζ1 (x, ξ))(α̃h(x, ξ) − β̃)2 ] − EP̂n [φ∗ ′ (0)(α̃h(x, ξ) − β̃)]
1
− EP̂n [φ∗ ′′ (ζ2 (x, ξ))(α̃h(x, ξ) − β̃)2 ] = λ (EC.16)
2

which gives

1
EP̂n [φ∗ ′′ (ζ1 (x, ξ))(α̃h(x, ξ) − β̃)2 ] − EP̂n [φ∗ ′′ (ζ2 (x, ξ))(α̃h(x, ξ) − β̃)2 ] = λ (EC.17)
2

where ζ1 (x, ξ) and ζ2 (x, ξ) are some values between 0 and α̃h(x, ξ) − β̃. Let us take β̃ = β̃n (α̃) as
the root of (EC.14) described above, which is well-defined for α̃ in a neighborhood of 0. We can
write the left hand side of (EC.17) as

1
EP̂n [φ∗ ′′ (ζ1 (x, ξ))(α̃h(x, ξ) − β̃n (α̃))2 ] − EP̂n [φ∗ ′′ (ζ2 (x, ξ))(α̃h(x, ξ) − β̃n (α̃))2 ]
2

which can be written further as

∗ ′′ 1 ∗ ′′
EP̂n [φ (ζ1 (x, ξ))(h(x, ξ) − β̃n(ζ)) ] − EP̂n [φ (ζ2 (x, ξ))(h(x, ξ) − β̃n(ζ)) ] α̃2
′ 2 ′ 2
2

where ζ lies between 0 and α̃. For further convenience, we denote

1
un (α̃) = EP̂n [φ∗ ′′ (ζ1 (x, ξ))(h(x, ξ) − β̃n′ (ζ))2] − EP̂n [φ∗ ′′ (ζ2 (x, ξ))(h(x, ξ) − β̃n′ (ζ))2]
2

Note that un (α̃) is continuous at α̃ = 0, and un (0) = φ∗ ′′ (0)V arP̂n (h(x, ξ))/2. We argue that un (α̃)
converges to φ∗ ′′ (0)V arP (h(x, ξ))/2 > 0 uniformly over x ∈ X a.s. as n → ∞ and α̃ → 0 (at any rate
relative to n). More precisely, we consider

φ∗ ′′ (0)V arP (h(x, ξ)) φ∗ ′′ (0)V arP (h(x, ξ))

sup un (α̃) − ≤ sup |un (α̃) − un (0)| + sup un (0) −
x∈X 2 x∈X x∈X 2
(EC.18)
Now, by using the uniform boundedness of h (Assumption EC.7.2) and our conclusion above that
β̃n′ (ζ) → EP̂n [h(x, ξ)] uniformly over P̂n , x as α̃ → 0, we obtain supx∈X |un (α̃) − un (0)| → 0. On the
other hand, by the Glivenko-Cantelli properties of the classes of h and h2 (Assumption EC.8.1),
we have

φ∗ ′′ (0)V arP (h(x, ξ)) φ∗ ′′ (0)

sup un (0) − = sup V arP̂n (h(x, ξ)) − V arP (h(x, ξ)) → 0 a.s.
x∈X 2 2 x∈X

Thus (EC.18) goes to 0. Hence, given a sufficiently small λ and large n, we can find a root α̃λn to
(EC.17), or equivalently
un (α̃λn )(α̃λn )2 = λ
ec12 e-companion to Lam: On the Impossibility of Statistically Improving Empirical Optimization

in a small neighborhood of 0, uniformly over all x ∈ X . Moreover, note that the events represented
in Assumption EC.5 occurs with probability tending to 1, and thus, by the uniform non-degeneracy
of the variance (Assumption EC.8.3), we have α̃λn satisfy
s
λ p
α̃λn = λ
→0 (EC.19)
un (α̃n )

as n → ∞ and λ → 0. Correspondingly,

p
β̃nλ = β̃n (α̃λn ) = β̃n′ (ζnλ )α̃λn → 0

p
where ζnλ lies between 0 and α̃λn . In other words, (α̃λn , β̃nλ ) → 0 as n → ∞ and λ → 0.
p
Next we show that x̂λn → x∗ . To this end, we verify the conditions in Theorem EC.2 where Φ(x) is
set to be ∇EP [h(x, ξ)] = EP [∇h(x, ξ)] (by the L1 -Lipschitzness of h implied by Assumption EC.7.2)
h i
∗′ λn λn
and Φn (x) is set to be EP̂n φ α̃n h(x, ξ) − β̃n ∇h(x, ξ) in (EC.12). We have
h i
p
sup EP̂n φ∗ ′ α̃λnn h(x, ξ) − β̃nλn ∇h(x, ξ) − EP [∇h(x, ξ)] → 0
x∈X

by the continuity of φ∗ ′ at 0 (Lemma 2), the Glivenko-Cantelli property of ∇h (Assumption

EC.8.1), the uniform boundedness of h and ∇h (Assumption EC.7.2), and our earlier conclusion
p
that (α̃λn , β̃nλ ) → 0. Moreover, inf x∈X :kx−x∗k≥ǫ k∇Eh(x, ξ)k > 0 for every ǫ > 0 by Assumption EC.6.
Finally, from our earlier conclusion, we have x̂λn being the root to
h i
EP̂n φ∗ ′ α̃λn h(x, ξ) − β̃nλ ∇h(x, ξ) = 0

with probability tending to 1. Thus, all the conditions in Theorem EC.2 are satisfied and we have

p p
x̂λn → x∗ . Therefore, we obtain x̂λn , α̃λn , β̃nλ → 0 as n → ∞ and λ → 0.
Moreover, from (EC.15) and (EC.19), and using the uniform boundedness and continuity of h
(Assumptions EC.7.1 and EC.7.2), we invoke the dominated convergence theorem to get further
that s
2λ
α̃λn = (1 + op (1)) (EC.20)
φ∗ ′′ (0)V arP (h(x∗ , ξ))
and s
2λ
β̃nλ ∗
= EP [h(x , ξ)] (1 + op (1)) (EC.21)
φ∗ ′′ (0)V arP (h(x∗ , ξ))

We now continue to verify all other conditions in Theorem EC.1. As mentioned, we set the λ
in the theorem as (α̃, β̃). We then set ϕx,α̃,β̃ (ξ) = φ∗ ′ (α̃h(x, ξ) − β̃)∇h(x, ξ) which is the function
e-companion to Lam: On the Impossibility of Statistically Improving Empirical Optimization ec13

inside the expectation in (EC.12).

Condition 1: Direct use of Assumption EC.8.2.

Condition 2: We have

EP kφ∗ ′ (α̃h(x, ξ) − β̃)∇h(x, ξ) − ∇h(x∗, ξ)k2

≤ ((EP k(φ∗ ′ (α̃h(x, ξ) − β̃) − 1)∇h(x, ξ)k2)1/2 + (EP k∇h(x, ξ) − ∇h(x∗, ξ)k2 )1/2 )2 (EC.22)

by Minkowski’s inequality. For the first term in (EC.22), since h and ∇h are uniformly bounded
(Assumption EC.7.2) and φ∗ ′ is continuous at 0 (Lemma 2), as α̃, β̃ → 0, we have EP k(φ∗ ′ (α̃h(x, ξ) −
β̃) − 1)∇h(x, ξ)k2 → 0 uniformly in x. For the second term in (EC.22), we have

EP k∇h(x, ξ) − ∇h(x∗ , ξ)k2 ≤ EP [L2 (ξ)2 ]kx − x∗ k2 → 0

as x → x∗ since E[L2 (ξ)2 ] < ∞ (Assumption EC.7.3). Therefore, putting together we have (EC.22)
go to 0. This shows the satisfaction of Condition 2.

Condition 3: By the a.s. differentiability of h in Assumption EC.7.1 and the L1 -Lipschitzness of

h implied by Assumption EC.7.2, we have ∇EP [h(x, ξ)] = EP [∇h(x, ξ)]. Thus, Assumption EC.6
concludes the satisfaction of Condition 3.

Condition 4: Since h is uniformly bounded (Assumption EC.7.2) and also L1 -Lipschitz (implied
by Assumption EC.7.2), and φ∗ ′ is continuously differentiable at 0 (Lemma 2) and thus Lipschitz
continuous at a neighborhood of 0, we have φ∗ ′ (α̃h(x, ξ) − β̃) being L1 -Lipschitz and uniformly
bounded over x, ξ a.s. for a fixed sufficiently small (α̃, β̃). On the other hand, we have ∇h being L1 -
Lipschitz (implied by Assumption EC.7.3) and uniformly bounded (Assumption EC.7.2). Thus we
have φ∗ ′ (α̃h(x, ξ) − β̃)∇h(x, ξ) being L1 -Lipschitz. Moreover, by Assumption EC.7.1 and Lemma
2 again it is differentiable a.s. Thus, using the same argument used to verify Condition 4 in the
proof of Theorem 3, we can exchange derivative and expectation to obtain

∇EP [φ∗ ′ (α̃h(x, ξ) − β̃)∇h(x, ξ)] = EP [∇(φ∗ ′ (α̃h(x, ξ) − β̃)∇h(x, ξ))]

= EP [φ∗ ′′ (α̃h(x, ξ) − β̃)α̃∇h(x, ξ)∇h(x, ξ)′] + EP [φ∗ ′ (α̃h(x, ξ) − β̃)∇2 h(x, ξ)]

for (α̃, β̃) in a neighborhood of 0. In particular, ∇EP [∇h(x, ξ)] = EP [∇2 h(x, ξ)].
ec14 e-companion to Lam: On the Impossibility of Statistically Improving Empirical Optimization

Now, by the uniform boundedness of h and ∇h (Assumption EC.7.1), the continuity of φ∗ ′

and φ∗ ′′ at 0 (Lemma 2) and the finiteness of EP [∇2 h(x∗ , ξ)] (Assumption EC.7.4), we use the
dominated convergence theorem to obtain

EP [φ∗ ′′ (α̃h(x∗ , ξ) − β̃)α̃∇h(x∗ , ξ)∇h(x∗, ξ)′ ] + EP [φ∗ ′ (α̃h(x∗ , ξ) − β̃)∇2 h(x∗ , ξ)] → EP [∇2 h(x∗ , ξ)]

as (α̃, β̃) → (0, 0).

Finally, by the continuity of φ∗ ′ φ∗ ′′ at 0 (Lemma 2), the uniform boundedness of
h and ∇h (Assumption EC.7.1), and Assumption EC.7.4, we have EP [φ∗ ′′ (α̃h(x∗ , ξ) −
β̃)α̃∇h(x∗ , ξ)∇h(x∗ , ξ)′ ] + EP [φ∗ ′ (α̃h(x∗ , ξ) − β̃)∇2 h(x∗ , ξ)] being positive definite with eigenvalues
uniformly bounded and away from 0 for (α̃, β̃) in a neighborhood of (0, 0). This concludes the
satisfaction of Condition 4.

h i
∗′
Condition 5: Since x̂λn satisfies EP̂n φ α̃λn h(x, ξ) − β̃nλ ∇h(x, ξ) = 0 with probability tending to
1, Condition 5 is readily satisfied.

We have therefore verified all conditions in Theorem EC.1. Hence,

x̂λnn − x∗
h i
= −(EP [∇2 h(x, ξ)])−1 EP φ∗ ′ α̃λnn h(x∗ , ξ) − β̃nλn ∇h(x∗ , ξ) − (EP [∇2 h(x, ξ)])−1 EP̂n [∇h(x∗ , ξ)]
i
1 h
∗′ λn ∗ λn

∗
+ op √ + EP φ α̃n h(x , ξ) − β̃n ∇h(x , ξ)
n
h i
= −(EP [∇ h(x, ξ)])−1 EP [∇h(x∗ , ξ)] + EP φ∗ ′′ (ζn ) α̃λnn h(x∗ , ξ) − β̃nλn ∇h(x∗ , ξ)
2

i
2 −1 ∗ 1 h
∗ ′′

λ ∗ λn

∗
− (EP [∇ h(x, ξ)]) EP̂n [∇h(x , ξ)] + op √ + EP φ (ζn ) α̃n h(x , ξ) − β̃n ∇h(x , ξ)
n
h i
∗ ′′
= −(EP [∇ h(x, ξ)]) EP φ (ζn ) α̃n h(x , ξ) − β̃nλn ∇h(x∗ , ξ)
2 −1 λn ∗

i
2 −1 ∗ 1 h
∗ ′′

λn ∗ λn

∗
− (EP [∇ h(x, ξ)]) EP̂n [∇h(x , ξ)] + op √ + E φ (ζn ) α̃n h(x , ξ) − β̃n ∇h(x (EC.23) , ξ)
n
where ζn lies between 0 and α̃λnn h(x∗ , ξ) − β̃nλn , and we have used the first-order optimality condi-
tion EP [∇h(x∗ , ξ)] = 0 derived previously. Now, substituting (EC.20) and (EC.21), and using the
continuity of φ∗ ′′ at 0 (Lemma 2), uniform boundedness of h and ∇h (Assumption EC.7.1) and the
characterizations in (EC.15) and (EC.19), we invoke the bounded convergence theorem to obtain
h i
EP φ∗ ′′ (ζn ) α̃λnn h(x∗ , ξ) − β̃nλn ∇h(x∗ , ξ)
"s #
′′ λn
= φ∗ (0)EP (h(x∗ , ξ) − EP [h(x∗ , ξ)])∇h(x∗, ξ) (1 + op (1))
φ∗ ′′ (0)V arP (h(x∗ , ξ))
s
λn φ∗ ′′ (0)
= CovP ((h(x∗, ξ), ∇h(x∗ , ξ))(1 + op (1))
V arP (h(x∗ , ξ))
e-companion to Lam: On the Impossibility of Statistically Improving Empirical Optimization ec15

Putting this into (EC.23), we get

s
λn φ∗ ′′ (0)
−(EP [∇2 h(x, ξ)])−1 CovP ((h(x∗, ξ), ∇h(x∗ , ξ)) − (EP [∇2 h(x, ξ)])−1 EP̂n [∇h(x∗ , ξ)]
V arP (h(x∗ , ξ))
√

1
+ op √ + λn
n
CovP (h(x∗ , ξ), ∇h(x∗ , ξ))
q
= −(EP [∇2 h(x∗ , ξ)])−1 h∇h(x∗ , ξ), P̂n − P i − λn φ∗ ′′ (0)(EP [∇2 h(x∗ , ξ)])−1 p
V arP (h(x∗ , ξ))
√

1
+ op √ + λn
n

Proof of Corollary 1. The result is immediate by using, e.g., Kuhn et al. (2019) Theorem 10 to
obtain (23), on which we apply Theorem 3.

EC.5. Conditions and Proof of Theorem 5

We make the following assumptions:

Assumption EC.9 (Regularity conditions of parametric objective function). We

have:
1. The solution x∗ satisfies the first-order optimality condition, i.e., ∇x ψ(x∗ , θ ∗ ) = 0.
2. ∇x ψ(·, ·) : Rd+m → Rd is continuously differentiable at (x∗ , θ ∗ ) and the Hessian matrix of ψ(·, ·)
with respect to x at (x∗ , θ ∗ ), ∇2x ψ(x∗ , θ ∗ ), is invertible.

Assumption EC.10 (Regularity conditions on penalty). R is twice continuously differen-

tiable.

We are now ready to prove Theorem 5.

Proof of Theorem 5. The result follows from a standard application of the delta method. By
∗
√
(24), we have θ̂n − θ = Op (1/ n) by the CLT and Slutsky’s theorem. Now, by Assumption EC.9.1,
x∗ is the solution to ∇x ψ(x, θ ∗ ) + 0 · ∇x R(x) = 0. By Assumptions EC.9.2 and EC.10, the func-
tion ∇x ψ(x, θ) + λ∇x R(x) is continuously differentiable at (x, θ, λ) = (x∗ , θ ∗ , 0), and ∇2x ψ(x∗ , θ ∗ ) is
invertible. Thus, by the implicit function theorem, we have a unique solution x(θ, λ) to the equa-
tion ∇x ψ(x, θ) + λ∇x g(x) = 0 for (θ, λ) in a small enough neighborhood of (θ, 0), in which x(θ, λ)
is continuously differentiable. Moreover, the gradient of x(θ, λ) at (θ ∗ , 0) is given by

(∇2x ψ(x∗ , θ ∗ ))−1 [∇xθ ψ(x∗ , θ ∗ ), ∇x R(x∗ )]

Now let

r(h, λ) = x(θ ∗ + h, λ) − x(θ, 0) − (∇2xψ(x∗ , θ ∗ ))−1 ∇xθ ψ(x∗ , θ ∗ )h − (∇2xψ(x∗ , θ ∗ ))−1 ∇x R(x)λ
ec16 e-companion to Lam: On the Impossibility of Statistically Improving Empirical Optimization

By the continuous differentiability of x(·, ·) argued above, we have r(h, λ) = o(k(h, λ)k) as (h, λ) → 0.
Thus, by the definition of convergence in probability, we have

x(θ̂n , λ) − x(θ, 0) − (∇2xψ(x∗ , θ ∗ ))−1 ∇xθ ψ(x∗ , θ ∗ )(θ̂n − θ ∗ ) − (∇2xψ(x∗ , θ ∗ ))−1 ∇x R(x∗ )λ

∗ ∗ ∗ 1
= r(θ̂n − θ , λ) = op (k(θ̂n − θ , λ)k) = op (kθ̂n − θ k + |λ|) = op √ +λ
n

Noting that x̂nP −Reg,λ = x(θ̂n , λ), this concludes the theorem.

EC.6. Conditions and Proof of Theorem 6

We make the following assumptions in addition to those in Section EC.5:

Assumption EC.11 (Additional regularity conditions for the objective function).

∇x ψ(x, θ), ∇xθ ψ(x, θ) and ∇2x ψ(x, θ) are continuous and bounded uniformly over x ∈ X and
θ ∈ Rm . Moreover, ∇2x ψ(x, θ) is positive definite with eigenvalues bounded away from 0 and ∞
uniformly over x ∈ X and θ ∈ Rm .

Assumption EC.12 (Additional regularity conditions for the penalty). ∇2x R(x) is uni-
formly bounded over x ∈ X .

Assumption EC.13 (Additional first-order optimality condition). With probability con-

verging to 1, there exists a solution x̂nP −Bay,λ as defined in (25) that satisfies the first-order opti-
mality condition, i.e., it solves

∇x EΘ|Dn [ψ(x, Θ)] + λ∇x R(x)

for n → ∞ and λ → 0.

We are now ready to prove Theorem 6.

Proof of Theorem 6. By (26), we have

Wn
EΘ|Dn [∇x ψ(x, Θ)] = EΘ|Dn ∇x ψ x, θ̂n + √ (EC.24)
n
p
where kWn − N (0, J )kT V → 0. Now consider (EC.24) componentwise, i.e.,

j Wn
EΘ|Dn ∇x ψ x, θ̂n + √ (EC.25)
n

where ∇jx for j = 1, . . . , d denotes the j-th component of the gradient. By the mean value theorem
and Assumption EC.11, we can write (EC.25) as

j j j Wn
∇x ψ(x, θ̂n ) + EΘ|Dn ∇xθ ψ(x, θ̂n + ζn ) √ (EC.26)
n
e-companion to Lam: On the Impossibility of Statistically Improving Empirical Optimization ec17

√
where ∇jxθ ψ(x, θ) denotes the j-th row of ∇xθ ψ(x, θ), and ζnj is between 0 and Wn / n. By the
uniform boundedness of ∇xθ ψ(x, θ) in Assumption EC.11 and the almost sure uniform integrability
of Wn , we have ∇jxθ ψ(x, θ̂n +ζnj )Wn also a.s. uniformly integrable. Thus, together with the continuity
p p
of ∇xθ ψ(x, θ) in Assumption EC.11, θ̂n → θ ∗ as implied by (24), and kWn − N (0, J )kT V → 0 in (26),
we have
p
EΘ|Dn [∇jxθ ψ(x, θ̂n + ζnj )Wn ] → 0

Thus,
1
EΘ|Dn [∇x ψ(x, Θ)] = ∇x ψ(x, θ̂n ) + op √ (EC.27)
n
uniformly over x ∈ X .
Now, define x̂nP −Reg,λ = x(θ̂, λ) as in the proof of Theorem 5, and x̂P −Bay,λ as the root of

EΘ|Dn [∇x ψ(x, Θ)] + λ∇x R(x)

which exists with probability converging to 1 as n → ∞ and coincides with the definition in Assump-
tion EC.13, which is well-defined by using Assumption EC.11.
Now, by the first-order optimality condition,

∇x ψ(x̂nP −Reg,λ , θ̂n ) − ∇x ψ(x̂nP −EO , θ̂n ) = −λ∇x R(x̂nP −Reg,λ )

By the delta method, and together with the uniform boundedness of ∇2x ψ(x, θ) and the continuity
of ∇2x ψ(x, θ) in x uniformly over θ in Assumption EC.11, we can rewrite the above as

∇2x ψ(x̂nP −EO , θ̂n )(x̂nP −Reg,λ − x̂nP −EO ) + op (kx̂nP −Reg,λ − x̂nP −EO k) = −λ∇x R(x̂nP −Reg,λ )

Note that then we have

kx̂nP −Reg,λ − x̂nP −EO k ≤ k∇2x ψ(x̂nP −EO , θ̂n )−1 kk∇2x ψ(x̂nP −EO , θ̂n )(x̂nP −Reg,λ − x̂nP −EO )k

= O(λ) + op (kx̂nP −Reg,λ − x̂nP −EO k)

by using the uniform boundedness of the eigenvalues of ∇2x ψ(x, θ) away from 0 and ∞ in Assumption
p
EC.11, and the uniform boundedness of ∇x R(x) in Assumption EC.12. Thus x̂nP −Reg,λ − x̂nP −EO → 0
as λ → 0. We then have

x̂nP −Reg,λ − x̂nP −EO = −λ∇2x ψ(x̂nP −EO , θ̂n )−1 ∇x R(x̂nP −Reg,λ )(1 + op (1)) (EC.28)

Similarly, by the first-order optimality condition in Assumption EC.13, Assumption EC.11 and
(EC.27), we have

1
∇x ψ(x̂nP −Bay,λ , θ̂n ) − ∇x ψ(x̂nP −Bay,0 , θ̂) = −λ∇x R(x̂nP −Bay,λ ) + op √
n
ec18 e-companion to Lam: On the Impossibility of Statistically Improving Empirical Optimization

So, by the delta method, and together with the uniform boundedness of ∇2x ψ(x, θ) in Assumption
EC.11, we have

1
∇2x ψ(x̂nP −Bay,0 , θ̂n )(x̂nP −Bay,λ − x̂nP −Bay,0 )+op (kx̂nP −Bay,λ − x̂nP −Bay,0 k) = −λ∇x R(x̂nP −Bay,λ )+op √
n

Then we have

kx̂nP −Bay,λ − x̂nP −Bay,0 k ≤ k∇2x ψ(x̂nP −Bay,0 , θ̂n )−1 kk∇2x ψ(x̂nP −Bay,0 , θ̂n )(x̂nP −Bay,λ − x̂nP −Bay,0 )k

1
= O(λ) + op √ + op (kx̂nP −Bay,λ − x̂nP −Bay,0 k)
n

by using the uniform boundedness of the eigenvalues of ∇2x ψ(x, θ) away from 0 and ∞ in Assumption
p
EC.11, and the uniform boundedness of ∇x R(x) in Assumption EC.12. Thus x̂nP −Bay,λ − x̂nP −Bay,0 →
0 as λ → 0 and n → ∞. We then have

1
x̂nP −Bay,λ − x̂nP −Bay,0 = −λ∇2x ψ(x̂nP −Bay,0 , θ̂n )−1 ∇x R(x̂nP −Bay,λ) + op √ (EC.29)
n

Now, by the delta method like in the proof of Theorem 5, and together with the continuity and
uniform boundedness of ∇2x ψ(x, θ) in Assumption EC.11, we have

∇x ψ(x̂nP −Bay,0 , θ̂n ) −∇xψ(x̂nP −EO , θ̂n ) = ∇2x ψ(x̂nP −EO , θ̂n )(x̂nP −Bay,0 − x̂nP −EO )+op (kx̂nP −Bay,0 − x̂nP −EO k)

Note also that by (EC.27) and the first-order optimality conditions in Assumption EC.9.1 and
Assumption EC.11,

1
∇x ψ(x̂nP −Bay,0 , θ̂n ) + op √ = 0 = ∇x ψ(x̂nP −EO , θ̂n )
n

so that

1
∇x ψ(x̂nP −Bay,0 , θ̂n ) − ∇x ψ(x̂nP −EO , θ̂n ) = op √
n
Thus

kx̂nP −Bay,0 − x̂nP −EO k ≤ k∇2x ψ(x̂nP −EO , θ̂n )−1 kk∇2x ψ(x̂nP −EO , θ̂n )(x̂nP −Bay,0 − x̂nP −EO )k

1
= op √ + op (kx̂nP −Bay,0 − x̂nP −EO k)
n

by using the uniform boundedness of the eigenvalues of ∇2x ψ(x, θ) away from 0 and ∞ in Assumption
EC.11. This gives

1
x̂nP −Bay,0 − x̂nP −EO = op √ (EC.30)
n
e-companion to Lam: On the Impossibility of Statistically Improving Empirical Optimization ec19

Putting together (EC.28), (EC.29) and (EC.30), we have

x̂nP −Bay,λ − x̂nP −Reg,λ

= (x̂nP −Bay,λ − x̂nP −Bay,0 ) + (x̂nP −Bay,0 − x̂nP −EO ) + (x̂nP −EO − x̂nP −Reg,λ )

1
= −λ∇2x ψ(x̂nP −Bay,0 , θ̂n )−1 ∇x R(x̂nP −Bay,λ) + λ∇2x ψ(x̂nP −EO , θ̂n )−1 ∇x R(x̂nP −Reg,λ ) + op √
n

1
= op √ +λ
n
p
by the consistency of x̂nP −Bay,λ , x̂nP −Reg,λ , x̂nP −Bay,0 , x̂nP −EO → x∗ implied by these equations and the
continuity of ∇2x ψ(x, θ) in Assumption EC.11 and of ∇x R(x) in Assumption EC.12. Together with
the conclusions from Theorem 5, we prove the theorem.

Elements of Statistical Learning Solutions
100% (3)
Elements of Statistical Learning Solutions
112 pages
Distributionally Robust Optimization
No ratings yet
Distributionally Robust Optimization
221 pages
Distributionally Robust Instrumental Variables Estimation
No ratings yet
Distributionally Robust Instrumental Variables Estimation
90 pages
Distributionally Robust Optimization and Robust Statistics: Jose Blanchet, Jiajin Li, Sirui Lin, Xuhui Zhang
No ratings yet
Distributionally Robust Optimization and Robust Statistics: Jose Blanchet, Jiajin Li, Sirui Lin, Xuhui Zhang
47 pages
Mean Robust Optimization: Series A
No ratings yet
Mean Robust Optimization: Series A
43 pages
Optimal Stochastic Non-Smooth Non-Convex Optimization Through
No ratings yet
Optimal Stochastic Non-Smooth Non-Convex Optimization Through
39 pages
Stochastic Online Opitmization Using Kalman Recursion - Paper
No ratings yet
Stochastic Online Opitmization Using Kalman Recursion - Paper
55 pages
Learning Multivariate Gaussians With Imperfect Advice
No ratings yet
Learning Multivariate Gaussians With Imperfect Advice
42 pages
Gonzalez 2020
No ratings yet
Gonzalez 2020
79 pages
Achieving Robust Data-Driven Contextual Decision Making in A Data Augmentation Way
No ratings yet
Achieving Robust Data-Driven Contextual Decision Making in A Data Augmentation Way
30 pages
Cooper Box00018 fld00018 bdl0001 Doc0001
No ratings yet
Cooper Box00018 fld00018 bdl0001 Doc0001
18 pages
A-A Regularized Robust Design Criterion For Uncertain Data
No ratings yet
A-A Regularized Robust Design Criterion For Uncertain Data
23 pages
Bias Variance Tradeoff
No ratings yet
Bias Variance Tradeoff
71 pages
Weatherwax Epstein Hastie Solution Manual
No ratings yet
Weatherwax Epstein Hastie Solution Manual
147 pages
Uncertainty Notes
No ratings yet
Uncertainty Notes
166 pages
Weatherwax Epstein Hastie Solution Manual
No ratings yet
Weatherwax Epstein Hastie Solution Manual
147 pages
Figueiredo EM Algorithm
No ratings yet
Figueiredo EM Algorithm
35 pages
Harbrecht Randrhs
No ratings yet
Harbrecht Randrhs
23 pages
Stochastic Search Optimization
No ratings yet
Stochastic Search Optimization
317 pages
Gosh (2022) (CPT Convex)
No ratings yet
Gosh (2022) (CPT Convex)
12 pages
Lecture 19
No ratings yet
Lecture 19
25 pages
On Cost-Sensitive Distributionally Robust Log-Optimal Portfolio
No ratings yet
On Cost-Sensitive Distributionally Robust Log-Optimal Portfolio
14 pages
A Solution Manual and Notes For: The Elements of Statistical Learning by Jerome Friedman, Trevor Hastie, and Robert Tibshirani
No ratings yet
A Solution Manual and Notes For: The Elements of Statistical Learning by Jerome Friedman, Trevor Hastie, and Robert Tibshirani
121 pages
Lecture16 Crossvalidation
No ratings yet
Lecture16 Crossvalidation
32 pages
Ost Report On Sail
100% (1)
Ost Report On Sail
31 pages
Distributionally Robust Optimization of Moments Subject To Partial Ambiguity
No ratings yet
Distributionally Robust Optimization of Moments Subject To Partial Ambiguity
27 pages
Situation Analysis
100% (2)
Situation Analysis
6 pages
Linnear Nonlineae Numerical Method
No ratings yet
Linnear Nonlineae Numerical Method
43 pages
Data-Driven Robust Optimization
No ratings yet
Data-Driven Robust Optimization
43 pages
Internal
No ratings yet
Internal
25 pages
Efficient Inference For Differential Equation Models Without Numerical Solvers
No ratings yet
Efficient Inference For Differential Equation Models Without Numerical Solvers
12 pages
Lahore Chamber 2
No ratings yet
Lahore Chamber 2
97 pages
Teachers Guide Lower Secondary Arts
No ratings yet
Teachers Guide Lower Secondary Arts
113 pages
Chapter
No ratings yet
Chapter
46 pages
Lec 16
No ratings yet
Lec 16
10 pages
Smooth Splines Large Data
No ratings yet
Smooth Splines Large Data
22 pages
Linear Convergence of Adaptive Stochastic Gradient Descent
No ratings yet
Linear Convergence of Adaptive Stochastic Gradient Descent
19 pages
1 s2.0 S0888327021005847 Main
No ratings yet
1 s2.0 S0888327021005847 Main
19 pages
Intro 1
No ratings yet
Intro 1
12 pages
The Fundamentals of Co Miclat 1
No ratings yet
The Fundamentals of Co Miclat 1
108 pages
t4 Sol
No ratings yet
t4 Sol
8 pages
An Algorithm For Quadratic Optimization With One Quadratic Constraint and Bounds On The Variables
No ratings yet
An Algorithm For Quadratic Optimization With One Quadratic Constraint and Bounds On The Variables
9 pages
The Risk of Machine Learning
No ratings yet
The Risk of Machine Learning
66 pages
(Textbook) (Solution) The Elements of Statistical Learning
No ratings yet
(Textbook) (Solution) The Elements of Statistical Learning
147 pages
01 Intro Notes Cvxopt f22
No ratings yet
01 Intro Notes Cvxopt f22
25 pages
Convex Functions
No ratings yet
Convex Functions
13 pages
Parameterized Expectations Algorithm: Lecture Notes 8
No ratings yet
Parameterized Expectations Algorithm: Lecture Notes 8
33 pages
Bayesian EBDO Savannah Mourelatos Feb2008 V2
No ratings yet
Bayesian EBDO Savannah Mourelatos Feb2008 V2
36 pages
An Easy-To-Use Real-World Multi-Objective Optimization Problem Suite
No ratings yet
An Easy-To-Use Real-World Multi-Objective Optimization Problem Suite
21 pages
Desymm
No ratings yet
Desymm
13 pages
Sol3 2015
No ratings yet
Sol3 2015
8 pages
Identify Forms of Tangible and Intangible Heritage and The Threats To These
100% (2)
Identify Forms of Tangible and Intangible Heritage and The Threats To These
2 pages
O4MD 01 Introduction
No ratings yet
O4MD 01 Introduction
10 pages
Poor Starting Points in Machine Learning
No ratings yet
Poor Starting Points in Machine Learning
11 pages
Bowen 2009
No ratings yet
Bowen 2009
31 pages
Deep Learning - Summary - Deep - Learning
No ratings yet
Deep Learning - Summary - Deep - Learning
17 pages
EA Paper
No ratings yet
EA Paper
16 pages
Validity and Reliability in Quantitative Research
No ratings yet
Validity and Reliability in Quantitative Research
34 pages
Bootstrapping Time Series Models
No ratings yet
Bootstrapping Time Series Models
43 pages
Nigatu Nigus
No ratings yet
Nigatu Nigus
65 pages
I. Introduction To Convex Optimization: Georgia Tech ECE 8823a Notes by J. Romberg. Last Updated 13:32, January 11, 2017
No ratings yet
I. Introduction To Convex Optimization: Georgia Tech ECE 8823a Notes by J. Romberg. Last Updated 13:32, January 11, 2017
20 pages
Lion 5 Paper
No ratings yet
Lion 5 Paper
15 pages
Research Proposal Writing Format + 2016
100% (1)
Research Proposal Writing Format + 2016
6 pages
13 - Econometric Applications of Dynamic Programming: V X Uxi EV Xi
No ratings yet
13 - Econometric Applications of Dynamic Programming: V X Uxi EV Xi
5 pages
2021 Book Tach Part7 Paperportfolio
No ratings yet
2021 Book Tach Part7 Paperportfolio
8 pages
Handout 1 Introduction
No ratings yet
Handout 1 Introduction
7 pages
Stats300a Fall15 Lecture1
No ratings yet
Stats300a Fall15 Lecture1
7 pages
Alat Ukur AMS
No ratings yet
Alat Ukur AMS
14 pages
Methods in Gut Microbial Ecology For Ruminants
No ratings yet
Methods in Gut Microbial Ecology For Ruminants
235 pages
Unfaithfulness Among Married Couples: John Mapfumo
No ratings yet
Unfaithfulness Among Married Couples: John Mapfumo
13 pages
Modelling Data Uncertainty in Growth Forecasts: Karmeshu T and F. Lara-Rosano
No ratings yet
Modelling Data Uncertainty in Growth Forecasts: Karmeshu T and F. Lara-Rosano
7 pages
Meldung Einer Dissertation Tu Wien
100% (2)
Meldung Einer Dissertation Tu Wien
5 pages
Monte Carlo Tree Search: A Review of Recent Modifications and Applications
No ratings yet
Monte Carlo Tree Search: A Review of Recent Modifications and Applications
66 pages
Desc and Analytic Studies PPT Final 09252013
No ratings yet
Desc and Analytic Studies PPT Final 09252013
76 pages
Modeling and Optimization of Latency in Erasure-Coded Storage Systems
No ratings yet
Modeling and Optimization of Latency in Erasure-Coded Storage Systems
141 pages
Diss Lesson 1
No ratings yet
Diss Lesson 1
4 pages
A Project Report ON: Acc Cement
No ratings yet
A Project Report ON: Acc Cement
9 pages
Chapter 0: Introduction: 0.2.1 Examples in Machine Learning
No ratings yet
Chapter 0: Introduction: 0.2.1 Examples in Machine Learning
4 pages
Chapter 4 - Risk and Return
No ratings yet
Chapter 4 - Risk and Return
39 pages
Pygmalion Effect
No ratings yet
Pygmalion Effect
3 pages
Welded Connections of Wind Turbine Towers Under Fa
No ratings yet
Welded Connections of Wind Turbine Towers Under Fa
16 pages
Manage Business Risk
No ratings yet
Manage Business Risk
23 pages
Royal Society of Chemistry
No ratings yet
Royal Society of Chemistry
2 pages
Ual Dissertation Examples
100% (2)
Ual Dissertation Examples
8 pages
0000189230
No ratings yet
0000189230
3 pages
1 s2.0 016794739592844N Main
No ratings yet
1 s2.0 016794739592844N Main
11 pages
A Survey of Dynamic Scheduling in Manufacturing Systems: Djamila Ouelhadj Sanja Petrovic
No ratings yet
A Survey of Dynamic Scheduling in Manufacturing Systems: Djamila Ouelhadj Sanja Petrovic
15 pages
1 s2.0 S0959652617324952 Main
No ratings yet
1 s2.0 S0959652617324952 Main
17 pages
1 s2.0 S027861252300016X Main
No ratings yet
1 s2.0 S027861252300016X Main
19 pages
Tugas Individu I Gede Mahatma Yuda Bakti - Materi Presentasi KMP 611 - Desain Dan Metode Penelitian
No ratings yet
Tugas Individu I Gede Mahatma Yuda Bakti - Materi Presentasi KMP 611 - Desain Dan Metode Penelitian
26 pages
TPNC2019 Preprint
No ratings yet
TPNC2019 Preprint
12 pages
Mathematical Optimization: Fundamentals and Applications
From Everand
Mathematical Optimization: Fundamentals and Applications
Fouad Sabry
No ratings yet
Mahesh BA Resume.
No ratings yet
Mahesh BA Resume.
1 page
Research
No ratings yet
Research
6 pages
Criticises Explainable Defect Pred
No ratings yet
Criticises Explainable Defect Pred
12 pages
My Resume
No ratings yet
My Resume
1 page
Cooper H. 1989. Homework. White Plains N.Y. Longman
100% (1)
Cooper H. 1989. Homework. White Plains N.Y. Longman
4 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.