On the Impossibility of Statistically Improving Empirical Optimization: A Second-Order Stochastic Dominance Perspective
Henry Lam
Department of Industrial Engineering and Operations Research, Columbia University, New York, NY 10027,
henry.lam@columbia.edu
When the underlying probability distribution in a stochastic optimization is observed only through data,
various data-driven formulations have been studied to obtain approximate optimal solutions. We show that
no such formulations can, in a sense, theoretically improve the statistical quality of the solution obtained
from empirical optimization. We argue this by proving that the first-order behavior of the optimality gap
against the oracle best solution, which includes both the bias and variance, for any data-driven solution
is second-order stochastically dominated by empirical optimization, as long as suitable smoothness holds
with respect to the underlying distribution. We demonstrate this impossibility of improvement in a range of
examples including regularized optimization, distributionally robust optimization, parametric optimization
and Bayesian generalizations. We also discuss the connections of our results to semiparametric statistical
inference and other perspectives in the data-driven optimization literature.
Key words : empirical optimization, second-order stochastic dominance, optimality gap, regularization,
distributionally robust optimization
1. Introduction
We consider a stochastic optimization problem in the form

minx∈X Z(x)        (1)

where x is the decision variable in a known feasible region X ⊂ Rd, and Z : Rd → R is the objective function that depends on an underlying probability distribution P; we denote Z(x) = ψ(x, P).
A primary example of ψ(x, P ) is the expected value objective function EP [h(x, ξ)] where EP [·]
denotes the expectation with respect to P that generates a random object ξ ∈ Ξ. This, however,
can be more general, including for instance the (conditional) value-at-risk of ξ.
We focus on data-driven optimization where P is not known but only observed via i.i.d. data. In
this situation, the decision maker obtains a data-driven solution, say x̂, typically by solving some
reformulation of (1) that utilizes the data. Let x∗ be an (unknown) optimal solution for (1). We
are interested in the statistical properties of the optimality gap, or regret,

G(x̂) := Z(x̂) − Z(x∗)        (2)

This captures the suboptimality of x̂ relative to the oracle optimal solution x∗. (2) is a natural
metric to evaluate the quality of an obtained solution and, in the language of statistical learning,
it measures the generalization performance relative to the oracle best in terms of the objective
value. Obviously, if a solution bears a smaller optimality gap than another solution, then its true
objective value or generalization performance is also better by the same magnitude.
Data-driven optimization as discussed above arises ubiquitously across operations research and
machine learning, where the objective ψ(x, P ) ranges from an expected business revenue to the loss
of a statistical model. To obtain x̂ from data, the arguably most straightforward approach is empir-
ical optimization (EO), namely by replacing the unknown true P with the empirical distribution P̂
in the objective ψ(x, P ). For example, in expected value optimization where ψ(x, P ) = EP [h(x, ξ)],
this corresponds to minimizing EP̂ [h(x, ξ)], also known as the sample average approximation (SAA)
(Shapiro et al. 2014). Other than EO, there are plenty of actively studied reformulations, including
regularization that adds penalty terms to the objective (Friedman et al. 2001), and data-driven dis-
tributionally robust optimization (DRO) (Delage and Ye 2010, Goh and Sim 2010, Ben-Tal et al.
2013, Wiesemann et al. 2014, Lim et al. 2006, Rahimian and Mehrotra 2019) where one turns the
objective ψ(x, P ) into maxQ∈U ψ(x, Q), with U being a so-called uncertainty set or ambiguity set
that is calibrated from data and, at least intuitively speaking, has a high likelihood of containing
the true distribution.
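To make the EO/SAA baseline concrete, the following is a minimal sketch that we add for illustration (not from the paper): a newsvendor-style expected-value objective solved by SAA over a grid. The cost parameters c, p and the exponential demand distribution are purely hypothetical assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical newsvendor loss: order quantity x, random demand xi,
# h(x, xi) = c*x - p*min(x, xi) with unit cost c and selling price p.
c, p = 1.0, 3.0

def h(x, xi):
    return c * x - p * np.minimum(x, xi)

# Assumed true distribution P: exponential demand with mean 10.
n = 1000
data = rng.exponential(scale=10.0, size=n)

# Empirical optimization (SAA): minimize the empirical average of h over a grid.
grid = np.linspace(0.0, 50.0, 2001)
emp_obj = np.array([h(x, data).mean() for x in grid])
x_eo = grid[emp_obj.argmin()]

# Oracle solution x* solves P(xi > x) = c/p, the critical fractile.
x_star = -10.0 * np.log(c / p)
print(f"SAA solution: {x_eo:.2f}, oracle solution: {x_star:.2f}")
```

With n = 1000 the SAA solution lands close to the oracle order quantity; the gap between their true objective values is the quantity G in (2).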
Our main question to address is as follows: Considering the EO solution as a natural base-
line, could we possibly improve its statistical performance in terms of the optimality gap (2) (or
equivalently the attained true objective value), by incorporating some regularization or robustness-
enhancing modification? Our main assertion is that, under standard conditions, it is theoretically
impossible to improve the statistical performance of EO in this regard.
Our assertion is qualified by the following setup and conditions. First, we denote x̂EOn = x(P̂n) as the EO solution obtained from n data points, where the subscript n in x̂EOn and P̂n highlights the dependence on sample size for the solution and the empirical distribution, and here x(·) is viewed as a function on P̂n. By a modification to EO, we mean to consider a wider choice of data-driven
as a function on P̂n . By a modification to EO, we mean to consider a wider choice of data-driven
solution x̂λn = x(P̂n , λ), where λ is a tuning parameter in an expanded class of procedures that cover
EO in particular. Without loss of generality, we set x(P̂n , 0) = x(P̂n ), i.e., λ = 0 corresponds to EO.
This λ appears virtually in all common regularization approaches, and also calibration proposals
on the set size in DRO with a “consistent” uncertainty set, i.e., a set with the property that it
reduces to the singleton on the empirical distribution when its size λ is tuned to be zero. Typically,
λ is chosen depending on the sample size n, and as n grows, λ eventually shrinks to 0. For example,
in regularization λ represents a bias-variance tradeoff parameter introduced to avoid overfitting.
When n is large, the variance diminishes and so is the need to trade off a decrease in variance
with an increase in bias, deeming a shrinkage of λ to 0. Similarly, in DRO that uses a consistent
confidence region as the uncertainty set, the set size converges to zero as n gets large. Hereafter,
when the dependence of λ on n is needed, we use the notation λn .
Under the above setup, we have two main conditions. First, we focus on solutions x(·, ·) that are
smooth with respect to the distribution P and the tuning parameter λ. This condition is implied
by the smoothness of the objective function ψ(x, P ) (with respect to both x and P ). Second, we
consider the decision dimension d to be fixed or, in other words, the setting where the sample size
n is large relative to the dimension. To summarize, we look at the most basic setting of smooth
stochastic optimization in a large-sample regime, where our goal is to provide a fundamental
argument to show the superiority of EO over all other possibilities.
We now explain the impossibility of statistical improvement. By this we mean that, in the large-
sample regime, the optimality gap evaluated at the EO solution x̂EOn is always no worse than that evaluated at x̂λnn, in terms of the risk profile measured by second-order stochastic dominance, regardless of how λn depends on n. As an immediate implication, this means for any non-decreasing convex function f : R → R, we have

E[f(G(x̂EOn))] ≤ E[f(G(x̂λnn))]

up to a negligible error, where E[·] denotes the expectation with respect to the data used to obtain x̂EOn or x̂λnn. In particular, setting f(y) = y², we conclude that the mean squared error of Z(x̂EOn) against Z(x∗) is always no larger than that of Z(x̂λnn). The same conclusion holds for f(y) = y^p for any other p ≥ 1.
We will show how this impossibility result applies to all common examples of regularization
on EO, and common DROs with a consistent uncertainty set. Moreover, we also show how our
result applies to parametric settings, where the optimization involves unknown finite-dimensional
parameters that need to be estimated. In the latter settings, the most straightforward approach is
to plug in a consistent point estimate of the parameter into the optimization formulation, where this
point estimate can be obtained by any common techniques such as maximum likelihood estimation
(MLE) and the method of moments (MM) (Van der Vaart 2000, Chapters 4, 5). Here, we can again
consider adding regularization on the plugged-in optimization formulation. We may also choose
to use Bayesian approaches such as minimizing the expected posterior cost (Wu et al. 2018), and
consider regularizing the Bayesian formulation. In all the above cases, our result concludes that it is
impossible to improve the solution obtained from EO or a simple consistent parameter estimation,
in terms of the asymptotic risk profile of the optimality gap, by injecting regularization or DRO.
We will connect and contrast our results with other viewpoints in the data-driven optimization
as well as the statistics literature. A reader who is proficient in stochastic optimization may find
our claim on the superiority of EO very natural. Yet, as far as we know, our studied perspective
appears unknown in the literature. Our results are intended to guide optimizers against engaging in
suboptimal strategies in the considered basic large-sample situations, as we show in this situation
that there is no theoretical improvement in using any strategies over EO. On the other hand, they
are not intended to undermine the diverse strategies in data-driven optimization, as there are other
situations where these strategies are used for good reasons (we will discuss these in Section 4).
In the following, Section 2 presents our main result, and Section 3 discusses its applications on
the range of data-driven formulations mentioned above. Then, in Section 4, we compare our results
with established viewpoints in the DRO literature and classical statistics, and discuss scenarios
beyond our considered setting in which alternate approaches to EO offer advantages.
2. Main Results
Recall that x∗ is a minimizer of Z(x). Also recall that n is the sample size in an i.i.d. data set {ξi, i = 1, . . . , n}, and P̂n denotes its empirical distribution, i.e., P̂n(·) = (1/n) ∑ni=1 δξi(·) where δξi(·) is the Dirac measure at the i-th data point ξi. x̂EOn = x(P̂n) is the EO solution, and x̂λn = x(P̂n, λ) is the solution from the expanded data-driven procedure with tuning parameter λ.
We set up some notation. In the following, we denote EQ[·], VarQ(·) and CovQ(·, ·) as the expectation, variance and covariance under a probability distribution Q. We use "⇒" to denote weak convergence or convergence in distribution, "→p" to denote convergence in probability, "=d" to denote equality in distribution, and "a.s." to denote almost surely. We denote ‖ · ‖ as the Euclidean norm. For any deterministic sequences ak ∈ R and bk ∈ R, both indexed by a common index, say k, that goes to ∞, we say that ak = o(bk) if ak/bk → 0, ak = O(bk) if there exists a finite M > 0 such that |ak/bk| < M for all sufficiently large k, ak = ω(bk) if |ak/bk| → ∞, ak = Ω(bk) if there exists a finite M > 0 such that |ak/bk| > M for all sufficiently large k, and ak = Θ(bk) if there exist finite M1, M2 > 0 such that M1 < |ak/bk| < M2 for all sufficiently large k. We use the notations op(·) and Op(·) to denote a smaller and an at most equal stochastic order respectively. Namely, for a sequence of random vectors Ak ∈ Rd and a deterministic sequence bk ∈ R, both indexed by a common index k that goes to ∞, Ak = op(bk) means Ak/bk →p 0 as k → ∞. Correspondingly, Ak = Op(bk) means that for any ε > 0, there exist a large enough N > 0 and M > 0 such that P(‖Ak/bk‖ ≤ M) ≥ 1 − ε for any k > N. For a sequence of random variables Ak ∈ R, we say that Ak →p ∞ if for any ε > 0 and M > 0 there exists N > 0 large enough such that P(Ak > M) ≥ 1 − ε for any k > N. Similarly, Ak →p −∞ if for any ε > 0 and M > 0 there exists N > 0 large enough such that P(Ak < −M) ≥ 1 − ε for any k > N. Finally, we denote ess supQ f as the essential supremum of a random function f under distribution Q, and ⊤ as the transpose.
2.1. Conditions
We make three assumptions:
Assumption 1 (Optimality conditions). A true minimizer x∗ for Z(x) satisfies the second-
order optimality conditions, namely the gradient ∇Z(x∗ ) = 0 and the Hessian ∇2 Z(x∗ ) is positive
semidefinite.
Assumption 2 (Smoothness of the expanded data-driven procedure). The data-driven solution x̂λn = x(P̂n, λ) admits the expansion

x̂λn − x∗ = ⟨IF(ξ), P̂n − P⟩ + λK + op(1/√n + λ)        (3)

where

⟨IF(ξ), P̂n − P⟩ = ∫ IF(ξ) d(P̂n − P)(ξ) = EP̂n[IF(ξ)] = (1/n) ∑ni=1 IF(ξi)        (4)

as n → ∞ and λ → 0 (at any rate), for some function IF : Ξ → Rd and constant vector K ∈ Rd. Moreover, CovP(IF(ξ)), the covariance matrix of IF(ξ) under P, is entry-wise finite. Furthermore, when λ = 0, we have x̂0n = x̂EOn.
Here in (4), the first equality is the definition of the inner product ⟨·, ·⟩. The second equality follows by noting that we can always assume IF(ξ) satisfies EP[IF(ξ)] = 0 (because EP̂n[1] = 1, we can take IF(ξ) − EP[IF(ξ)] to be our new influence function if this is not the case). The third equality follows
from the definition of the empirical distribution P̂n . Note that when λ = 0, Assumption 2 implies
that the EO solution x̂EOn = x̂0n satisfies

x̂EOn − x∗ = ⟨IF(ξ), P̂n − P⟩ + op(1/√n)        (5)
Such a linear relation can in fact arise more generally, for instance under Hadamard differentiability
(Van der Vaart 2000 Chapter 20), though this is not needed for our current purpose.
Note that Assumption 2 also stipulates that K is the (partial) derivative of x(P, λ) with respect to λ. Moreover, (3) and (5) imply the consistency of the solutions x̂λn and x̂EOn, in the sense that x̂λn →p x∗ and x̂EOn →p x∗ as n → ∞ and λ → 0 (at any rate).
Definition 1 (Second-order stochastic dominance). For two random variables A and B representing losses, we say that A is second-order stochastically dominated by B if E[u(A)] ≤ E[u(B)] for any non-decreasing convex function u : R → R for which the expectations are well-defined.

The following equivalent characterization is useful:

Proposition 1 (Mean-preserving spread characterization). A is second-order stochastically dominated by B if and only if there exist random variables Ã, η and ε on a common probability space, with Ã =d A, such that

B =d Ã + η + ε        (6)

where η ≥ 0 a.s. and E[ε|Ã + η] = 0 a.s.
Note that the notion of second-order stochastic dominance depends solely on the probability distributions of A and B above. It is common to define this notion on the distributions directly, though here we use random variables for the convenience of our subsequent developments. Moreover, let us make clear that our Definition 1 applies to losses (i.e., smaller is desirable) rather than gains (i.e., bigger is desirable), the latter being more customarily used in the economics literature (Hanoch and Levy 1969, Hadar and Russell 1969, Rothschild and Stiglitz 1970). This distinction, however, is immaterial, as we can simply view a gain as a negative loss. That is, if we say that the gain −A second-order stochastically dominates the gain −B, then Definition 1 states that E[−u(−(−A))] ≥ E[−u(−(−B))] for any non-decreasing convex function u, which is equivalent to E[v(−A)] ≥ E[v(−B)] for any non-decreasing concave function v. This reduces back to the notion that a gain is preferable if it has a higher expected utility.
Proposition 1 is well-established (e.g., Shaked and Shanthikumar 2007 Theorem 4.A.5). Proposi-
tion 1 states that a second-order stochastically dominant variable B, when compared to A, contains
two additional terms η and ε. The first term is a non-negative random variable η, so that A + η
is less attractive than A in terms of first-order stochastic dominance, i.e., the distribution func-
tion of A + η is at least that of A. The second term is ε, which does not change the expectation but adds more variability to A + η. This means that B is a so-called mean-preserving spread of
A + η (Landsberger and Meilijson 1993). From a risk-averse perspective, additional uncertainty
caused by higher variability is always undesirable. Second-order stochastic dominance thus stipu-
lates that B is less desirable than A in terms of both the notions that “smaller is desirable” and
“less uncertainty is desirable”.
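The content of Proposition 1 is easy to probe by simulation. The sketch below is our own illustration (not from the paper): it builds B = A + η + ε with a non-negative shift η and independent mean-zero noise ε (so E[ε|A + η] = 0 trivially), and checks E[u(A)] ≤ E[u(B)] for two non-decreasing convex u; the baseline loss A and all constants are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 1_000_000

A = rng.normal(size=m) ** 2        # a baseline loss (chi-squared, 1 dof)
eta = 0.5                          # deterministic non-negative shift
eps = rng.normal(size=m)           # independent noise: E[eps | A + eta] = 0
B = A + eta + eps                  # a mean-preserving spread of A + eta

# For non-decreasing convex u, E[u(A)] <= E[u(B)] per Definition 1/Proposition 1.
for name, u in [("max(y-1, 0)", lambda y: np.maximum(y - 1.0, 0.0)),
                ("max(y, 0)^2", lambda y: np.maximum(y, 0.0) ** 2)]:
    print(name, u(A).mean(), u(B).mean())
```

Both test functions report a strictly larger expectation for B, consistent with B being the less desirable loss.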
Finally, we introduce the following notion that facilitates the use of stochastic dominance for
weak limits, which occurs in our studied large-sample regime:
Definition 2 (Asymptotic second-order stochastic dominance). For any two real-
valued random sequences An and Bn , we say that An is asymptotically second-order stochastically
dominated by Bn as n → ∞ if An ⇒ W0 and Bn ⇒ W1 such that W0 is second-order stochastically
dominated by W1 or, more generally, for any subsequences nk0 , nk1 → ∞ such that Ank0 and Bnk1
have weak limits, say W0 and W1 respectively, W0 is second-order stochastically dominated by W1 .
“Asymptotic” here means that the second-order stochastic dominance holds in the weak limits
or, in the case that no unique weak limits exist, then we look at all possible subsequences that have
weak limits. We also remark that the weak convergence here is broadly defined as including the
case where the limits W0 and W1 can be ±∞, in which case it means the corresponding sequences
→p ±∞.
Our main result is then as follows:

Theorem 1 (Impossibility of improving EO). Under Assumptions 1–3, nG(x̂EOn) is asymptotically second-order stochastically dominated by nG(x̂λnn) as n → ∞, for any choice of λn → 0.
Theorem 1 states that no matter how we choose λn in the expanded data-driven procedure, as
long as it goes to 0, then it is impossible to improve x̂EOn asymptotically in terms of the risk profile
of the optimality gap measured by second-order stochastic dominance. Note that the scaling n in
front of the optimality gap in the theorem is natural as this is the scaling that gives a nontrivial
limit under the first-order optimality condition.
The key in showing Theorem 1 is to decompose the weak limit of nG(x̂λnn) into that of nG(x̂EOn),
a positive term, and a conditionally unbiased noise term that leads to the mean-preserving spread
as in (6). Essentially, the two extra terms signify that introducing λn would simultaneously lead
to an extra “bias” and an extra “variability”. This behavior holds regardless of how we choose the
sequence λn . The detailed proof of Theorem 1 is in Appendix EC.2, and the next subsection gives
the roadmap and intuitive explanation.
Lemma 1. Under Assumptions 1 and 2, as n → ∞ and λ → 0, the optimality gap of x̂λn satisfies

G(x̂λn) = (1/2)⟨IF(ξ), P̂n − P⟩⊤∇²Z(x∗)⟨IF(ξ), P̂n − P⟩ + (1/2)λ²K⊤∇²Z(x∗)K + λK⊤∇²Z(x∗)⟨IF(ξ), P̂n − P⟩ + op(1/n + λ²)        (7)

The first term in (7) is precisely the first-order behavior of G(x̂EOn) (obtained by setting λ = 0), while the second and third terms are the quadratic and cross terms arising from the "bias" λK in (3) for x̂λn. In other words, Lemma 1 deduces the following relation between the optimality gaps of x̂λn and x̂EOn:

G(x̂λn) =d G(x̂EOn) + (1/2)λ²K⊤∇²Z(x∗)K + λK⊤∇²Z(x∗)⟨IF(ξ), P̂n − P⟩ + op(1/n + λ²)
or equivalently, the relation between the true objective values attained by x̂λn and x̂EOn:

Z(x̂λn) =d Z(x̂EOn) + (1/2)λ²K⊤∇²Z(x∗)K + λK⊤∇²Z(x∗)⟨IF(ξ), P̂n − P⟩ + op(1/n + λ²)        (9)
The detailed proof of Lemma 1 is left to Appendix EC.2.
Now we highlight the main intuition in obtaining Theorem 1. By the definition of ⟨IF(ξ), P̂n − P⟩ in (4), the central limit theorem (CLT) implies that

⟨IF(ξ), P̂n − P⟩ ≈ Y/√n

in distribution, where Y is a Gaussian vector with mean 0 and covariance matrix CovP(IF(ξ)). Thus we can write the expression of G(x̂λn) in Lemma 1 as

G(x̂λn) ≈ (1/(2n))Y⊤∇²Z(x∗)Y + (1/2)λ²K⊤∇²Z(x∗)K + (λ/√n)K⊤∇²Z(x∗)Y + op(1/n + λ²)        (10)

in distribution. Moreover, setting λ = 0, the EO optimality gap becomes

G(x̂EOn) ≈ (1/(2n))Y⊤∇²Z(x∗)Y + op(1/n)        (11)
Now consider all possible choices of the sequence λn in relation to n, and for each case we examine how the first three terms in (10) behave. Suppose λn = o(1/√n). Then the second and third terms are of smaller order than the first term, so that (10) reduces to (1/(2n))Y⊤∇²Z(x∗)Y + op(1/n). In this case, G(x̂λnn) and G(x̂EOn) behave the same asymptotically. In other words, the expanded procedure does not offer any first-order benefit relative to EO. Now suppose, on the other hand, that λn = ω(1/√n). Then the second and third terms are both of bigger order than the first term, with the second term the most dominant, and so (10) becomes (1/2)λn²K⊤∇²Z(x∗)K + op(λn²). In this case, G(x̂λnn) is of bigger order than G(x̂EOn), so that the expanded procedure gives a solution that is worse than EO. Thus, we are left with choosing λn = Θ(1/√n).
Suppose λn ≈ a/√n for some finite a ≠ 0 as n → ∞. We have

G(x̂λnn) ≈ (1/(2n))Y⊤∇²Z(x∗)Y + (1/(2n))a²K⊤∇²Z(x∗)K + (1/n)aK⊤∇²Z(x∗)Y + op(1/n)

so that all three terms have the same order 1/n. The coefficient of this first-order term is

(1/2)Y⊤∇²Z(x∗)Y + (1/2)a²K⊤∇²Z(x∗)K + aK⊤∇²Z(x∗)Y        (12)
Note that the second term in (12) is deterministic and always non-negative. On the other hand, the third term has conditional mean zero given the first term, namely

E[aK⊤∇²Z(x∗)Y | (1/2)Y⊤∇²Z(x∗)Y] = 0        (13)
To see this, note that since ∇²Z(x∗) is positive semidefinite by Assumption 1, Y⊤∇²Z(x∗)Y can be written as a sum of squares of linear transformations of Y, i.e., Y⊤∇²Z(x∗)Y = ‖(∇²Z(x∗))1/2 Y‖² where (∇²Z(x∗))1/2 denotes the square-root matrix of ∇²Z(x∗). Moreover, note that Y, as a mean-zero Gaussian vector, is symmetric (i.e., the densities at y and −y are the same). Thus, conditional on the knowledge of (1/2)Y⊤∇²Z(x∗)Y, it is equally likely for Yj to take the values yj and −yj for any yj, and we have E[Yj | Y⊤∇²Z(x∗)Y] = 0 for all components Yj, j = 1, . . . , d of Y. This implies (13). In other words, (12) is a mean-preserving spread of (1/2)Y⊤∇²Z(x∗)Y + (1/2)a²K⊤∇²Z(x∗)K.
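The symmetry argument behind (13) can also be checked numerically. In this small sketch of ours, an arbitrary positive definite H stands in for ∇²Z(x∗) and an arbitrary K for the bias direction; the linear term K⊤HY is odd in Y while the quadratic term is even, so their correlation with any function of the quadratic term vanishes.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 3, 1_000_000
H = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.2],
              [0.0, 0.2, 1.5]])          # positive definite stand-in for the Hessian
K = np.array([1.0, -2.0, 0.5])

Y = rng.multivariate_normal(np.zeros(d), [[1.0, 0.3, 0.0],
                                          [0.3, 1.0, 0.2],
                                          [0.0, 0.2, 1.0]], size=m)
quad = 0.5 * np.einsum("ij,jk,ik->i", Y, H, Y)   # (1/2) Y^T H Y, even in Y
lin = Y @ (H @ K)                                 # K^T H Y, odd in Y

# Since Y and -Y have the same law, lin has zero conditional mean given quad;
# hence it is uncorrelated with any function of quad:
for g in (quad, quad ** 2, np.exp(-quad)):
    print(np.corrcoef(lin, g)[0, 1])              # all approximately 0
```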
Putting the above together, we see that, in the case λn ≈ a/√n for finite a ≠ 0, the first-order term of G(x̂λnn), namely (12), is exactly decomposable into the form in (6) in Proposition 1. Compared with the first-order coefficient of G(x̂EOn), namely (1/2)Y⊤∇²Z(x∗)Y, we thus have G(x̂EOn) second-order stochastically dominated by G(x̂λnn) in terms of first-order behavior. Therefore, in all possible choices of λn, the asymptotic statistical behavior of G(x̂λnn) cannot be better than that of G(x̂EOn).
We summarize the above intuitive explanation with the following two propositions:

Proposition 2 (Trichotomy of asymptotic limits). Suppose Assumptions 1–3 hold and √n λn → a for some a ∈ [−∞, +∞] as n → ∞. If a is finite, then nG(x̂λnn) converges weakly to (1/2)(Y + aK)⊤∇²Z(x∗)(Y + aK), where Y is a Gaussian vector with mean 0 and covariance matrix CovP(IF(ξ)). If a = ∞ or −∞, then nG(x̂λnn) →p ∞.

Proposition 3 (Comparisons on the trichotomy). Under the same assumptions and notations as in Proposition 2, consider n → ∞. When a = 0, nG(x̂λnn) and nG(x̂EOn) have the same weak limit. When a = ∞ or −∞, nG(x̂λnn) →p ∞ whereas nG(x̂EOn) converges weakly to a tight random variable. When a is finite and nonzero, the weak limit of nG(x̂EOn) is second-order stochastically dominated by that of nG(x̂λnn).
Proposition 2 summarizes the asymptotic limits of G(x̂λnn) derived from Lemma 1. Proposition 3 then compares them with that of G(x̂EOn), and concludes the asymptotic second-order stochastic dominance of nG(x̂EOn) in every regime of λn. In the degenerate case where the non-degeneracy condition fails, the limits involve a coefficient that could be bigger or smaller. Thus in this degenerate case the comparison is inconclusive.
Consider now the constrained setting where (1) is augmented with constraints gj(x) ≤ 0 for j ∈ J1 and gj(x) = 0 for j ∈ J2. That is, we have |J1| inequality constraints and |J2| equality constraints, with constraint functions denoted gj(x). We could also add non-negativity constraints on x, as long as the solution satisfies the assumptions we make momentarily. We assume the following first-order optimality conditions:

Assumption 4 (Optimality conditions under constraints). A true minimizer x∗ satisfies the Karush–Kuhn–Tucker condition

∇Z(x∗) + ∑j∈B α∗j ∇gj(x∗) = 0

where the α∗j's are the Lagrange multipliers, and B indicates the binding set of constraints, i.e., B = {j ∈ J1 ∪ J2 : gj(x∗) = 0}. Moreover, ∇²Z(x∗) + ∑j∈B α∗j ∇²gj(x∗) is positive semidefinite.
The first condition in Assumption 4 is the standard KKT condition. The second condition is
the second-order optimality condition imposed on the Lagrangian. Note that we have implicitly
assumed Z and gj ’s are twice differentiable in the assumption.
In addition, we assume the following behavior on x̂λn:

Assumption 5 (Binding-set consistency). With probability tending to 1 as n → ∞ and λ → 0, the binding constraints of x̂λn, namely {j ∈ J1 ∪ J2 : gj(x̂λn) = 0}, coincide with the binding set B of the true solution x∗.
Assumption 5 means that, asymptotically, the data-driven solution retains the same set of binding
constraints as the true solution. This condition typically holds for any solution that is consistent in
converging to x∗. Next, in parallel to Assumption 3, we impose an analogous non-degeneracy condition in the constrained case.
With the above assumptions, we now argue that all the impossibility results we have discussed hold in the constrained setting:

Theorem 2 (Impossibility under constraints). Suppose Assumptions 2, 4 and 5 and the non-degeneracy condition above hold. Then nG(x̂EOn) is asymptotically second-order stochastically dominated by nG(x̂λnn) for any λn → 0.
The proof of Theorem 2 follows a similar development as in Section 2.4, but instead of using the
unconstrained first-order optimality to remove the first-order term in the Taylor series expansion
of G (x̂λn ), we use the Lagrangian (Assumption 4) to express this term in terms of the constraints,
which are in turn analyzed via another Taylor expansion. The detailed proof is in Appendix EC.2.
3.1. Regularization
Consider Z(x) := ψ(x, P) as the expected value objective function EP[h(x, ξ)] where h : X × Ξ → R. The EO objective function is then ψ(x, P̂n) = EP̂n[h(x, ξ)]. We consider x̂Reg,λn as a regularized solution obtained from

minx∈X EP̂n[h(x, ξ)] + λR(x)

for a regularizer R : X → R and regularization parameter λ ≥ 0. We have the following result:

Theorem 3 (Impossibility for regularization). Under Assumptions EC.1–EC.4 in Appendix EC.3, we have

x̂Reg,λnn − x∗ = −⟨(EP[∇²h(x∗, ξ)])⁻¹∇h(x∗, ξ), P̂n − P⟩ − λn(EP[∇²h(x∗, ξ)])⁻¹∇R(x∗) + op(1/√n + λn‖∇R(x∗)‖)        (16)
as n → ∞ and λn → 0 (at any rate relative to n). Hence, if in addition Assumptions 1 and 3 hold with K = −(EP[∇²h(x∗, ξ)])⁻¹∇R(x∗), then nG(x̂EOn) is asymptotically second-order stochastically dominated by nG(x̂Reg,λnn).
The proof of Theorem 3 is in Appendix EC.3, which requires the theory of M-estimation with nuisance parameters. In addition to smoothness (Assumptions EC.1, EC.3) and first-order optimality conditions (Assumption EC.4), we also need to control the function complexity of the loss function (Van Der Vaart and Wellner 1996) via Donsker and Glivenko-Cantelli properties (Assumption EC.2). Theorem 3 can also be translated to DRO based on the Wasserstein ball thanks to its
connection with regularization; see the next subsection.
At first glance, Theorem 3 may look contradictory to the regularization literature. For instance, even in linear regression, it is known that ridge regression with a properly chosen λ can always improve the estimation quality in large samples (Li et al. 1986, 1987). The catch here is the criterion used to measure the quality of a solution. Our considered criterion in this paper is the risk profile of the entire distribution of the optimality gap, whereas the criterion looked at in ridge regression can be seen to correspond to the expected optimality gap. This latter criterion appears reasonable for estimation problems, but it is not sufficient from an optimization viewpoint: an improvement in the expected gap does not necessarily mean the obtained solution is statistically better in terms of the attained objective value, as the variability of the optimality gap can become worse. In fact, this fundamental dilemma is precisely what we showed.
To distinguish clearly the optimality gap versus the expected optimality gap, consider the approximation of G(x̂λn) in (7). Under standard regularity conditions, its expected value, taken with respect to all the data, is

E[G(x̂λn)] = E[G(x̂EOn)] + (1/2)λ²K⊤∇²Z(x∗)K + o(1/n + λ²)        (17)

where the last equality follows by comparing with the approximation for E[G(x̂EOn)], and noting that (1/2)λ²K⊤∇²Z(x∗)K is deterministic and λK⊤∇²Z(x∗)⟨IF(ξ), P̂n − P⟩ has mean 0. The dominant term in the remainder, namely o(1/n + λ²), is actually of order O(λ/n), which comes from the cross-term of two "P̂n − P" and one "λ" (the other higher-order terms either have expectation 0 or are dominated by others). Thus, we can choose λ = k/n, for some constant k, such that the second
term in (17) and this O(λ/n) term are of the same 1/n² order and of opposite signs, thus leading to a smaller expected optimality gap than x̂EOn. However, if the variability of the optimality gap is taken into account, then we need to consider the third term of (7), which is of stochastic order λ/√n. Choosing λ = k/n then makes this term of order 1/n^{3/2}, which is larger than the improvement of order 1/n² in the expected gap and, as a result, washes away this gain.
We should also make clear that our impossibility result in Theorem 1 does apply to the comparison in terms of the expected optimality gap, because second-order stochastic dominance implies, as a particular case, that the expected values follow the corresponding ordering. This may again appear to contradict our statement above that regularization can improve the expected gap of EO. However, this latter gain is actually of higher order than 1/n, whereas in the dominant term of order 1/n in the optimality gap there is no gain, which is what our result implies. This means that, at least in low-dimensional cases, the improvement from using regularization, even focusing only on the expected optimality gap, is negligible.
Finally, we justify one of our claims above, that the criterion used in justifying the gain in ridge regression corresponds to the expected optimality gap. Consider the simple least-squares problem minβ∈Rd E(Y − X⊤β)², where ξ = (X, Y) ∈ Rd+1 follows a linear model Y = X⊤β + ε with E[ε|X] = 0, and E[XX⊤] = I. Suppose from data we obtain a coefficient estimate β̂. Then the optimality gap G(β̂), according to the least-squares objective function, is

G(β̂) = E(Y − X⊤β̂)² − E(Y − X⊤β)² = ‖β̂ − β‖²

where we have used E[YX⊤] = β⊤. That is, ‖β̂ − β‖² is the optimality gap. Thus, in the regression context, E‖β̂ − β‖² is the mean squared error (MSE) of the estimator β̂, while in optimization this is the expected optimality gap. It is shown (Li et al. 1986, 1987) that using a regularizing penalty ‖β‖² can improve the MSE of β̂. However, Theorem 3 shows it would not improve ‖β̂ − β‖² asymptotically.
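The identity G(β̂) = ‖β̂ − β‖² above is easy to confirm numerically. In this minimal sketch of ours, the design X ~ N(0, I) (so that E[XX⊤] = I) and the candidate estimate β̂ are arbitrary choices; the Monte Carlo estimate of the objective gap matches the squared parameter error.

```python
import numpy as np

rng = np.random.default_rng(4)
d, m = 5, 1_000_000
beta = rng.normal(size=d)

# Linear model with E[X X^T] = I and E[eps | X] = 0.
X = rng.normal(size=(m, d))
Y = X @ beta + rng.normal(size=m)

beta_hat = beta + np.array([0.3, -0.2, 0.0, 0.1, 0.05])  # some candidate estimate

# Monte Carlo estimate of the least-squares optimality gap:
gap = np.mean((Y - X @ beta_hat) ** 2) - np.mean((Y - X @ beta) ** 2)
print(gap, np.sum((beta_hat - beta) ** 2))   # the two nearly coincide
```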
3.2. Distributionally Robust Optimization
Data-driven DRO obtains a solution x̂DRO,λn by solving

minx∈X maxQ∈Uλ EQ[h(x, ξ)]        (18)

where Uλ is a so-called uncertainty set or ambiguity set on the space of probability distributions which, at least intuitively, is believed to contain the ground-truth P with high likelihood. The parameter λ signifies the size of the set. By properly choosing λ and accounting for the worst-case scenario in the inner maximization, (18) is hoped to output a higher-quality solution. DRO has gained surging popularity in recent years. It can be viewed as a generalization of classical deterministic RO (Ben-Tal et al. 2009, Bertsimas et al. 2011), where the uncertain parameter in an optimization problem is now the underlying probability distribution in a stochastic problem.
Before we present our impossibility result regarding DRO over EO, let us first frame the landscape
of DRO and identify the DRO types that have a legitimate possibility of beating EO statistically. In
the DRO literature, the choice of Uλ can be categorized roughly into two groups. The first group
is based on partial distributional information, such as moment and support (Ghaoui et al. 2003,
Delage and Ye 2010, Goh and Sim 2010, Wiesemann et al. 2014, Hanasusanto et al. 2015), shape
(Popescu 2005, Van Parys et al. 2016, Li et al. 2017, Lam and Mottet 2017, Chen et al. 2021) and
marginal distribution (Chen et al. 2018, Doan et al. 2015, Dhara et al. 2021). This approach has
proven useful in robustifying decisions when facing limited distributional information, or when data
is scarce, e.g., in the extremal region. In such cases, by using Uλ that captures the known partial
information, the DRO guarantees a worst-case performance bound on EP[h(x̂DRO,λn, ξ)] (by using the outer objective value, namely EP[h(x̂DRO,λn, ξ)] ≤ maxQ∈Uλ EQ[h(x̂DRO,λn, ξ)]). Additionally, if Uλ
is calibrated from data to be a high-confidence region in containing P , then such a worst-case
bound holds with at least the same statistical confidence level (e.g., Delage and Ye 2010). However,
from a large-sample standpoint, the set Uλ constructed in these approaches typically bears intrinsic looseness due to the use of only partial distributional information, and consequently the obtained solution x̂DRO,λn does not converge to x∗ as n grows (regardless of how we choose λ).
The second group of Uλ comprises neighborhood balls in the probability space, namely Uλ = {Q : D(Q, P0) ≤ λ} for some statistical distance D(·, ·) between two probability distributions, baseline distribution P0, and neighborhood size λ > 0. Common choices of D include the φ-divergence class (Ben-Tal et al. 2013, Bertsimas et al. 2018, Bayraksan and Love 2015, Jiang and Guan 2016, Lam 2016, 2018), and the Wasserstein distance (Gao and Kleywegt 2016, Chen and Paschalidis 2018, Esfahani and Kuhn 2018, Blanchet and Murthy 2019). When the ball center P0 and the ball size λ are chosen properly in relation to the data size, it can be guaranteed that a worst-case bound holds for EP[h(x̂DROn, ξ)] with high confidence and, moreover, x̂DROn converges to x∗. In other words, such DRO can provide statistically consistent solutions. For this reason, in the following we will study this approach. In particular, we will present φ-divergence-based DRO in detail, and then connect Wasserstein-based DRO to the result in Section 3.1.
We consider D represented by a φ-divergence and the ball center P0 taken as the empirical distribution P̂n. For any two distributions Q and Q′ on the same domain, let L = dQ/dQ′ be the Radon-Nikodym derivative or the likelihood ratio between Q and Q′. Then D is defined by

D(Q, Q′) = EQ′[φ(L)]        (19)

for a convex function φ with φ(1) = 0. Let φ∗(s) = supt≥0{st − φ(t)} be the convex conjugate of φ. Thanks to the properties of φ above, standard convex analysis (Rockafellar 2015) gives the basic properties of φ∗ that we use.
Together with the regularity conditions on the optimization problem in Appendix EC.4, we have
the following result:
Theorem 4 (Impossibility for divergence DRO). Let x̂D−DRO,λn be a minimizer of (18) with Uλ = {Q : D(Q, P̂n) ≤ λ} where D is the φ-divergence in (19). Under the regularity conditions in Appendix EC.4, we have

x̂D−DRO,λnn − x∗ = −⟨(EP[∇²h(x∗, ξ)])⁻¹∇h(x∗, ξ), P̂n − P⟩ − λn √(φ∗′′(0)) (EP[∇²h(x∗, ξ)])⁻¹ CovP(h(x∗, ξ), ∇h(x∗, ξ)) / √(VarP(h(x∗, ξ))) + op(1/√n + λn)        (20)

as n → ∞ and λn → 0 (at any rate relative to n). Hence, if in addition Assumptions 1 and 3 hold with K = −√(φ∗′′(0)) (EP[∇²h(x∗, ξ)])⁻¹ CovP(h(x∗, ξ), ∇h(x∗, ξ)) / √(VarP(h(x∗, ξ))), then nG(x̂EOn) is asymptotically second-order stochastically dominated by nG(x̂D−DRO,λnn).

The driver behind Theorem 4 is the expansion of the worst-case objective

maxQ∈Uλ EQ[h(x, ξ)] = EP̂n[h(x, ξ)] + λ √(φ∗′′(0)) √(VarP̂n(h(x, ξ))) + higher-order terms in λ        (21)
as λ → 0 (Lam 2016, Dupuis et al. 2016, Gotoh et al. 2018, Lam 2018, Duchi and Namkoong 2019,
Duchi et al. 2021). The relation (20) can thus be obtained formally by turning the minimization
of (21) into a root-finding (or M -estimation; Van der Vaart 2000 Chapter 5) problem via the
first-order optimality condition, and then applying the delta method. Nonetheless, the precise
development needs more technicality, with the detailed proof of Theorem 4 in Appendix EC.4. We
also point out that (20) can be viewed as a generalization of Duchi and Namkoong (2019) that
considers the special case of the χ²-distance, for which there is no higher-order term in (21), and thus
our proof directly uses the Karush–Kuhn–Tucker (KKT) condition instead of starting with the
Taylor expansion as in Duchi and Namkoong (2019). Moreover, (21) also relates to Gotoh et al.
(2021) that considers the expectation and variance of the cost function of the obtained solution
(see Section 4.2).
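To get a hands-on feel for the standard-deviation-type correction, the following is a small numerical sanity check of ours (not from the paper) for the χ²-divergence φ(t) = (t − 1)². Over a discrete χ²-divergence ball around the empirical distribution, Cauchy-Schwarz yields a closed-form inner worst case equal to the empirical mean plus the budget-scaled standard deviation; the losses h are synthetic, and the mapping between this ball-size parametrization and the λ in (20)-(21) is a calibration detail we do not attempt to match here.

```python
import numpy as np

rng = np.random.default_rng(5)
n, delta = 1000, 0.01                   # sample size, chi-squared divergence budget
h = rng.exponential(size=n)             # losses h(x, xi_i) at some fixed decision x

# Worst case over {q >= 0 : sum q = 1, (1/n) sum (n*q_i - 1)^2 <= delta}:
# writing q_i = (1 + w_i)/n, Cauchy-Schwarz gives the optimal w proportional to
# the centered losses (valid while all q_i stay non-negative, i.e., small delta).
hbar, sd = h.mean(), h.std()
w = np.sqrt(n * delta) * (h - hbar) / np.linalg.norm(h - hbar)
q = (1.0 + w) / n
assert q.min() >= 0
print(q @ h, hbar + np.sqrt(delta) * sd)   # worst case = mean + sqrt(delta) * sd

# Sanity check: random feasible reweightings never exceed the closed form.
best = -np.inf
for _ in range(2000):
    v = rng.normal(size=n)
    v -= v.mean()
    v *= np.sqrt(n * delta) / np.linalg.norm(v)
    best = max(best, (1.0 + v) @ h / n)
print(best <= q @ h + 1e-12)
```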
Next we consider D as the Wasserstein distance. The p-th order Wasserstein or optimal transport distance is defined by

D(Q, Q′) = infπ∈Π(Q,Q′) (Eπ‖ξ − ξ′‖^p)^{1/p}        (22)
where Π(Q, Q′) denotes the set of all distributions with marginals Q and Q′, and (ξ, ξ′) is distributed according to π. The definition (22) has a dual representation in terms of integral probability metrics (Sriperumbudur et al. 2012), and the norm ‖·‖ there can be replaced with more general transportation cost functions (e.g., Blanchet and Murthy 2019). The rich structural properties of Wasserstein DRO have facilitated its tight connection with machine learning and statistics (Kuhn et al. 2019, Rahimian and Mehrotra 2019, Blanchet et al. 2019, Gao et al. 2017a, Shafieezadeh-Abadeh et al. 2019, Chen and Paschalidis 2018).
Here we focus on the 1st-order Wasserstein distance (p = 1), and set the ball center P0 as P̂n, so that Uλ = {Q : D(Q, P̂n) ≤ λ} where D is defined in (22). In this case, it is known that, when Ξ is in the real space and h(x, ξ) is convex in ξ, the worst-case expectation in (18) possesses a Lipschitz-regularized reformulation as

maxQ∈Uλ EQ[h(x, ξ)] = EP̂n[h(x, ξ)] + λ Lip(h(x, ·))        (23)

for any x, where Lip(h(x, ·)) is the Lipschitz modulus of h(x, ·) given by Lip(h(x, ·)) = supξ≠ξ′ |h(x, ξ) − h(x, ξ′)|/‖ξ − ξ′‖. From (23), we immediately obtain the following:
Corollary 1 (Impossibility for Wasserstein DRO). Let x̂EOn be a minimizer of EP̂n[h(x, ξ)] and x̂W−DRO,λn a minimizer of (18) with Uλ = {Q : D(Q, P̂n) ≤ λ} and D defined in (22) with p = 1. Under Assumptions EC.1–EC.4 in Appendix EC.3 where we set R(x) = Lip(h(x, ·)), we have

x̂W−DRO,λnn − x∗ = −⟨(EP[∇²h(x∗, ξ)])⁻¹∇h(x∗, ξ), P̂n − P⟩ − λn(EP[∇²h(x∗, ξ)])⁻¹∇Lip(h(x∗, ·)) + op(1/√n + λn‖∇Lip(h(x∗, ·))‖)

as n → ∞ and λn → 0 (at any rate relative to n). Hence, if in addition Assumptions 1 and 3 hold with K = −(EP[∇²h(x∗, ξ)])⁻¹∇Lip(h(x∗, ·)), then nG(x̂EOn) is asymptotically second-order stochastically dominated by nG(x̂W−DRO,λnn).
Moreover, more generally (i.e., p ≠ 1 and h(x, ·) not necessarily convex), the worst-case objective maxQ∈Uλ EQ[h(x, ξ)] admits an expansion similar to (21) where the first-order term is λV(h(x, ·)), with V(h(x, ·)) being some variability measure of h for which Lip(h(x, ·)) is a special case (Gao et al. 2017a).
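The reformulation (23) can be checked by hand in one dimension. In this small sketch of ours, h(ξ) = |ξ| is a convex 1-Lipschitz loss at a fixed decision and the data are synthetic: transporting a single atom a distance nλ costs exactly λ in 1-Wasserstein distance and raises the empirical mean by λ · Lip, attaining the right-hand side of (23), while no transport plan of total cost λ can gain more than the Lipschitz modulus per unit of cost.

```python
import numpy as np

rng = np.random.default_rng(6)
n, lam = 500, 0.2
xi = rng.normal(size=n)

h = lambda z: np.abs(z)          # convex, 1-Lipschitz loss at a fixed decision x
emp = h(xi).mean()

# Move one atom by n*lam (W1 cost exactly lam) in the direction of steepest growth.
xi_w = xi.copy()
i = np.argmax(np.abs(xi_w))
xi_w[i] += np.sign(xi_w[i]) * n * lam

print(h(xi_w).mean(), emp + lam * 1.0)   # matches E_{P_n}[h] + lam * Lip(h)
```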
3.3. Parametric Optimization
We next consider parametric problems where Z(x) := ψ(x, θ∗) for a finite-dimensional parameter θ∗ that is estimated from data via a point estimate θ̂n. Let x̂P−EOn denote the plug-in solution obtained by minimizing ψ(x, θ̂n), and x̂P−Reg,λn the regularized solution obtained by minimizing ψ(x, θ̂n) + λR(x). We have the following result:

Theorem 5 (Impossibility for regularized parametric optimization). Suppose the point estimate θ̂n satisfies

θ̂n − θ∗ = ⟨IFθ(ξ), P̂n − P⟩ + op(1/√n)        (24)

where IFθ(ξ) : Ξ → Rd has CovP(IFθ(ξ)) that is finite. Then

x̂P−Reg,λn − x∗ = −⟨(∇²xψ(x∗, θ∗))⁻¹∇xθψ(x∗, θ∗)IFθ, P̂n − P⟩ − λ(∇²xψ(x∗, θ∗))⁻¹∇xR(x∗) + op(1/√n + λ)

as n → ∞ and λ → 0. Hence, if in addition Assumptions 1 and 3 hold with K = −(∇²xψ(x∗, θ∗))⁻¹∇xR(x∗), then nG(x̂P−EOn) is asymptotically second-order stochastically dominated by nG(x̂P−Reg,λnn) for any λn → 0.
The asymptotic (24) can be ensured by standard conditions. For instance, in the case where there is no model misspecification and we use the maximum likelihood estimator for θ̂n, (24) holds under the Lipschitzness of the log-likelihood function, with IFθ(·) = Iθ∗⁻¹ sθ∗(·) where sθ∗(·) is the score function and Iθ∗ is the Fisher information matrix (see, e.g., Van der Vaart 2000 Theorem 5.39).
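As a parametric illustration of the plug-in approach, we add the following sketch (not from the paper); the exponential demand model and the newsvendor cost are assumptions made for concreteness. The MLE of the exponential mean is the sample average, and the plug-in solution substitutes it into the closed-form θ-optimal decision.

```python
import numpy as np

rng = np.random.default_rng(7)

# Assumed parametric model: demand xi ~ Exponential(mean theta), newsvendor cost
# h(x, xi) = c*x - p*min(x, xi); the theta-optimal order is x*(theta) = -theta*log(c/p).
c, p, theta_true, n = 1.0, 3.0, 10.0, 500
data = rng.exponential(scale=theta_true, size=n)

theta_mle = data.mean()                      # MLE of the exponential mean
x_plugin = -theta_mle * np.log(c / p)        # parametric plug-in ("P-EO") solution
x_star = -theta_true * np.log(c / p)         # oracle solution
print(x_plugin, x_star)
```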
Alternatively, we may adopt a Bayesian approach and consider

x̂P−Bay,λn ∈ argminx∈X { EΘ|Dn[ψ(x, Θ)] + λR(x) }        (25)

where EΘ|Dn[·] is the expectation under the posterior distribution of θ given the collection of data Dn = {ξ1, . . . , ξn}, and we denote Θ as the random variable distributed under this posterior distribution. In other words, x̂P−Bay,0n optimizes the posterior expectation of the original objective function ψ(x, θ), while x̂P−Bay,λn imposes additionally a regularizing penalty R with the regularization parameter λ. We have the following result:
Theorem 6 (Impossibility for regularized Bayesian optimization). Suppose (24) holds and, in addition,

‖P√n(Θ−θ̂n)|Dn − N(0, J)‖TV → 0 a.s. as n → ∞        (26)

for some θ̂n, where ‖ · ‖TV denotes the total variation distance, PΘ|Dn is the posterior distribution of θ, and J is some covariance matrix. Moreover, assume that √n(Θ − θ̂n)|Dn is uniformly integrable a.s. as n → ∞. Then

x̂P−Bay,λn − x∗ = −⟨(∇²xψ(x∗, θ∗))⁻¹∇xθψ(x∗, θ∗)IFθ, P̂n − P⟩ − λ(∇²xψ(x∗, θ∗))⁻¹∇xR(x∗) + op(1/√n + λ)

as n → ∞ and λ → 0. Hence, if in addition Assumptions 1 and 3 hold with K = −(∇²xψ(x∗, θ∗))⁻¹∇xR(x∗), then nG(x̂P−EOn) is asymptotically second-order stochastically dominated by nG(x̂P−Bay,λnn) for any λn → 0.
Note that the asymptotic expressions of x̂P−Bay,λn − x∗ in Theorem 6 and x̂P−Reg,λn − x∗ in Theorem 5 are the same. This in particular implies that the simple Bayesian solution, x̂P−Bay,0n, performs asymptotically equivalently to the EO solution and, moreover, that x̂P−Bay,λn and x̂P−Reg,λn perform asymptotically equivalently.
Finally, we mention that the convergence (26) is standard, as guaranteed by the Bernstein-von Mises theorem (e.g., Theorem 10.1 and the discussion on p. 144 in Van der Vaart 2000).
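For intuition on Theorem 6, the conjugate sketch below is our own illustration; the normal model, the prior, and the quadratic objective ψ(x, θ) = (x − θ)² + 1 are all assumptions. It shows the posterior-expected-cost minimizer collapsing onto the plug-in/EO solution as n grows, in line with the Bernstein-von Mises phenomenon.

```python
import numpy as np

rng = np.random.default_rng(8)

# Assumed model: xi ~ N(theta, 1) with conjugate prior theta ~ N(0, tau2);
# objective psi(x, theta) = E[(x - xi)^2] = (x - theta)^2 + 1.
theta_true, tau2, n = 1.5, 100.0, 200
data = rng.normal(theta_true, 1.0, size=n)

# Posterior: theta | data ~ N(m_n, v_n).
v_n = 1.0 / (1.0 / tau2 + n)
m_n = v_n * data.sum()

# E_{Theta|Dn}[psi(x, Theta)] = (x - m_n)^2 + v_n + 1 is minimized at x = m_n,
# which is asymptotically equivalent to the EO/plug-in solution xbar.
print(m_n, data.mean())
```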
4.1. Comparison with Worst-Case Performance Guarantees in DRO
A basic guarantee offered by DRO is that the worst-case objective value bounds the true objective value of the obtained solution whenever the true distribution lies in the uncertainty set. Moreover, when data are observed, suppose Uλ is constructed as a high-confidence region for P, i.e., P(P ∈ Uλ) ≥ 1 − α for some confidence level 1 − α. Then this confidence level can be translated into at least the same confidence on a bound of the performance of x̂DRO via the worst-case objective value, given by

P( Z(x̂DRO) ≤ maxQ∈Uλ EQ[h(x̂DRO, ξ)] ) ≥ 1 − α        (27)
where P denotes the probability with respect to the data. The above argument can be made in finite samples or asymptotically (in the latter case, the asymptotic confidence guarantee of the uncertainty set translates to an asymptotic confidence guarantee on the performance bound). Such guarantees have been studied in, e.g., Delage and Ye (2010), Goh and Sim (2010), Hanasusanto et al. (2015), Lam and Mottet (2017), Ben-Tal et al. (2013), Bertsimas et al. (2018), Jiang and Guan (2016), Esfahani and Kuhn (2018). Van Parys et al. (2020) and Sutter et al.
(2020) in particular call the complement of the probability in the left hand side of (27) the out-of-
sample disappointment.
Our results in Sections 2 and 3.2 differ from the bound (27) in two important aspects. The first is that we measure the quality of a solution by a ranking of the true objective value. That is, an obtained solution x̂ that has a smaller value of Z(x̂) is regarded as more desirable. This is different from (27), which guarantees the validity of the estimated objective value in bounding the true value of an obtained solution. Note that the latter validity does not necessarily imply that the obtained solution performs better in the true objective value. In fact, we show that DRO cannot be superior to EO in the latter aspect, at least in the large-sample regime that we consider.
Regarding which criterion, (27) or ours, an optimizer should use: this may depend on the particular situation of interest. In terms of comparing solution quality, we believe there should be little argument against our criterion of ranking the attained true objective value, as this appears the most direct measurement of solution performance. On the other hand, in some high-stakes situations, an optimizer may want to obtain a reliable upper estimate, or to ensure a low enough upper bound, of the attained objective value, in which case the conventional DRO guarantee (27) would be useful.
Our second distinction from bound (27) is that we study the error relative to the oracle best
solution, i.e., our approximation is on the optimality gap Z(x̂) − Z(x∗ ) for an obtained solution
x̂. A claimed drawback of using bound (27) is that it could be loose, thus unable to detect the
over-conservativeness of DRO (e.g., one can simply take Uλ to be extremely large, so that (27) trivially holds). This latter criticism is resolved to an extent by a series of works showing that, by choosing Uλ properly, the worst-case objective value maxQ∈Uλ EQ[h(x, ξ)] differs from the true value EP[h(x, ξ)] only by a small amount. This includes the empirical likelihood (Lam and Zhou 2017,
Lam 2019, Duchi and Namkoong 2019, Duchi et al. 2021), Bayesian (Gupta 2019), and large devi-
ations perspective (Van Parys et al. 2020) in divergence DRO, and the profile likelihood and vari-
ability regularization for Wasserstein DRO (Blanchet et al. 2019, Gao et al. 2017b). Moreover, it
can be shown that certain divergence DRO gives rise to the best possible bound in the form of (27),
which is argued via a match of the statistical performance with the CLT (Lam 2019, Gupta 2019,
Duchi et al. 2021) or the Bernstein bound (Duchi and Namkoong 2019), or a “meta-optimization”
that gives the tightest such bound subject to an allowed large deviations rate (Van Parys et al.
2020). Nonetheless, all these works mainly focus on an upper bound on Z(x̂) instead of the opti-
mality gap Z(x̂) − Z(x∗ ), the latter being more challenging due to the unknown oracle true optimal
value Z(x∗ ).
4.2. Comparison with the Calibration Viewpoint of Gotoh et al. (2018, 2021)
Gotoh et al. (2018, 2021) consider solving

minx∈X maxQ { EQ[h(x, ξ)] − (1/λ)D(Q, P̂n) }        (28)

where Q lies in the relevant space of probability distributions. This formulation is akin to the divergence DRO we discussed in Section 3.2, but using the Lagrangian formulation directly instead of starting with the notion of an uncertainty set. The parameter λ in (28) plays a similar role as the ball size of our uncertainty set.
The main insight from Gotoh et al. (2018, 2021) is a desirable improvement in the bias-variance tradeoff using the solution obtained in (28), which we call a "Lagrangian (L)-DRO" solution x̂L−DRO,λn. More precisely, Gotoh et al. (2018) concludes (Theorem 5.1 therein) expansions, as λ → 0, of the attained expected loss in (29) and the attained loss variance in (30), and Gotoh et al. (2021) develops these expansions further to center at EP[h(x̂EOn, ξ)] and VarP(h(x̂EOn, ξ)). From (29) and (30), we see that while L-DRO deteriorates the expected loss, it reduces the variance of the loss at a larger magnitude (λ versus λ²), giving an overall improvement in the mean squared error. In this sense, injecting a small λ > 0 in L-DRO is desirable compared to EO with λ = 0.
Putting aside the technical differences in using L-DRO versus the common DRO in (18) (the latter requires an extra layer of analysis in the Lagrangian reformulation), we point out two conceptual distinctions between Gotoh et al. (2018, 2021) and our results in Sections 2 and 3.2. The first is the criterion for measuring the quality of an obtained solution x̂, in particular the role of the variance of the loss function h(x, ξ), VarP(h(x, ξ)). Our criterion is in terms of the achieved optimality gap, or equivalently Z(x̂), where Z(·) is the true objective function of the original optimization. In
particular, any risk-aware consideration should already be incorporated into the construction of the loss h. When an obtained solution x̂ is used in many future test cases, an estimate of Z(x̂), using ntest test data points, has a variance given by VarP(h(x̂, ξ))/ntest (instead of VarP(h(x̂, ξ))), and thus the variance of h plays a relatively negligible role. This is different from Gotoh et al. (2018, 2021), who take an alternate view that puts more weight on the variability of the loss function.
Our second main distinction from Gotoh et al. (2018, 2021) is our consideration of the attained
true objective value or generalization performance Z(x̂) as a random variable, and we study its
risk profile by assessing its entire distribution that exhibits the second-order stochastic dominance
relation put forth in Section 2. Here, the randomness of Z(x̂) comes from the statistical noise
from data used in constructing the obtained solution x̂. In contrast, Gotoh et al. (2018) studies
the mean of the attained objective value (and variance in the sense described above). In this latter
setting, as we have seen in (17), any expanded procedure with parameter λ satisfying Assumption
2 (not only divergence DRO) would lead to a deterioration of order λ² from EO. This does show
an inferiority of the expanded procedure, but it does not give a complete picture as the attained
objective value can be better in other distributional aspects. Our Theorem 1 stipulates that even
considering the entire distribution, an expanded procedure like DRO still cannot outperform EO.
subconvex function f. Noting from (8) that G(x̂n) ≈ (1/2)(x̂n − x∗)⊤∇²Z(x∗)(x̂n − x∗), our result implies, roughly speaking, that E[f(g(√n(x̂n − x∗)))] is at least E[f(g(√n(x̂EOn − x∗)))] asymptotically, where g(y) = ‖(∇²Z(x∗))1/2 y‖², which is convex and thus f ∘ g is subconvex by the non-decreasing convex property of f and the form of g. Our result thus, at least intuitively, coincides with the implication of the minimax theorem. However, the minimax theorem considers a perturbation of the true
tion of the minimax theorem. However, the minimax theorem considers a perturbation of the true
parameter value within a shrinking neighborhood at a specific rate (which leads to the notion of
regular estimators and also defines the plausible parameter values to be considered in the minimax
regime), and it claims the dominance relation using Anderson’s lemma, a result in convex geometry
(Van der Vaart 2000 Theorem 8.5). On the other hand, our Theorem 1 is obtained by perturbing
the data-driven procedure parametrized by λ, and analyzes the optimality gap or objective value
via second-order stochastic dominance. Our main insight to claim the superiority of EO, which uses the mean-preserving spread in risk-based ranking, appears to be novel relative to the semiparametric literature.
Moreover, it is perhaps revealing to cast our result in the context of the conventional Cramer-
Rao bound. The latter states that, under suitable smoothness conditions, maximum likelihood is
the best among all possible estimators in estimating (functions of) unknown model parameters in
terms of asymptotic variance. Suppose we use a particular objective function, namely the expected
log-likelihood, in our framework. The Cramer-Rao bound would intuitively conclude the expected
optimality gap is best attained via EO, and thus our result can be viewed as a local generalization in
that it applies to general objective functions and concludes the superiority of the entire distribution
of the optimality gap when using EO.
To explain in detail, first note that (7) stipulates that the EO solution, x̂EOn, satisfies

Z(x̂EOn) − Z(x∗) ≈ (1/2)(x̂EOn − x∗)⊤∇²Z(x∗)(x̂EOn − x∗)

Now consider, under suitable conditions like the ones for Section 3,

E[(x̂EOn − x∗)⊤∇²Z(x∗)(x̂EOn − x∗)] = E[⟨IF, P̂n − P⟩⊤∇²Z(x∗)⟨IF, P̂n − P⟩]
≈ E[((1/n) ∑ni=1 ∇h(x∗, ξi))⊤ ∇²Z(x∗)⁻¹∇²Z(x∗)∇²Z(x∗)⁻¹ ((1/n) ∑ni=1 ∇h(x∗, ξi))]
= tr(∇²Z(x∗)⁻¹ Cov(∇h(x∗, ξ)))/n        (31)
In the well-specified parametric case where ξ ∼ fx∗(·) and h(x, ξ) = −log fx(ξ), we have

tr(∇²Z(x∗)⁻¹ CovP(∇h(x∗, ξ))) = tr(∇²Z(x∗)⁻¹ EP[∇h(x∗, ξ)∇h(x∗, ξ)⊤]) = tr(∇²Z(x∗)⁻¹ ∇²Z(x∗)) = d

where we use the first-order optimality condition EP[∇h(x∗, ξ)] = 0 in the first equality and the equivalent expression of the Fisher information matrix, EP[∇h(x∗, ξ)∇h(x∗, ξ)⊤] = −EP[∇² log fx∗(ξ)] = ∇²Z(x∗), in the second equality. This shows that (31) becomes d/n in this special case.
Now consider x̂EOn as an estimator of x∗, or (∇²Z(x∗))1/2 x̂EOn as an estimator of (∇²Z(x∗))1/2 x∗. The Cramer-Rao bound states that, for any nearly unbiased estimator (∇²Z(x∗))1/2 x̂n, we have

Covx((∇²Z(x∗))1/2 x̂n) ⪰ ∇ψ(x)(nI)⁻¹∇ψ(x)⊤

where ψ(x) = Ex[(∇²Z(x∗))1/2 x̂n] = (∇²Z(x∗))1/2 Ex[x̂n], with Ex[·] being the expectation taken under fx(·), and I is the Fisher information matrix given by I = −EP[∇² log fx∗(ξ)] = ∇²Z(x∗). Now, consider any estimator x̂n that satisfies Ex[x̂n] = x + O(1/n) (such as all the data-driven solutions we discussed in Section 3 using small enough λn, i.e., λn = O(1/n)). We have ∇ψ(x∗) ≈ (∇²Z(x∗))1/2 and

E[‖(∇²Z(x∗))1/2(x̂n − x∗)‖²] ≈ tr(CovP((∇²Z(x∗))1/2 x̂n)) ≥ tr((∇²Z(x∗))1/2(nI)⁻¹(∇²Z(x∗))1/2) = d/n
Therefore, in the case where ξ ∼ fx∗(·) and h(x, ξ) = −log fx(ξ), the EO solution x̂EOn is approximately optimal in the sense that the optimality gap Z(x) − Z(x∗), which is approximately (1/2)(x − x∗)⊤∇²Z(x∗)(x − x∗) for any estimator x close to x∗ in expectation, has expectation E[Z(x) − Z(x∗)] minimized at x̂EOn up to a negligible error. Our main result is a generalization of the above. It suggests that x̂EOn not only minimizes Z(x) − Z(x∗) in expectation, but also minimizes E[g(Z(x) − Z(x∗))] for any convex non-decreasing loss function g(·), or equivalently that Z(x̂n) − Z(x∗) for any data-driven solution x̂n second-order stochastically dominates Z(x̂EOn) − Z(x∗).
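The d/n calculation in (31) can be reproduced in the Gaussian location model, a sketch we add for illustration: with negative log-likelihood loss, Z(x) − Z(x∗) = ‖x − x∗‖²/2 and the EO/MLE solution is the sample mean, so E[nG(x̂EOn)] ≈ d/2.

```python
import numpy as np

rng = np.random.default_rng(9)
d, n, reps = 4, 100, 200_000

# xi ~ N(x*, I_d) and h(x, xi) = ||x - xi||^2 / 2 (the negative log-likelihood
# up to constants), so the EO solution is xbar with xbar - x* ~ N(0, I_d / n):
dev = rng.normal(0.0, 1.0 / np.sqrt(n), size=(reps, d))
nG = n * 0.5 * (dev ** 2).sum(axis=1)
print(nG.mean())   # approaches d/2 = 2, i.e., E[G] = d/(2n), matching (31)
```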
Sections 4.1 and 4.2 when comparing with some existing DRO studies. We close this paper by discussing
some other prominent examples and the types of procedures that are demonstrably powerful:
Finite-sample performance: We have focused on, and claimed EO is best in, the large-sample regime, but for small samples the situation could be different. While it may be difficult to argue that there are procedures generally superior to EO (or vice versa) in finite samples, there exist documented situations where some approaches are consistently better than EO. For example, the so-called
operational statistic (Liyanage and Shanthikumar 2005, Chu et al. 2008) strengthens data-driven
solutions to perform better than simple choices such as EO, uniformly across all possible parameter
values (in a parametric problem) and sample size. It does so by expanding the class of data-driven
solutions in which a better solution is systematically searched via a second-stage optimization or
Bayesian analysis. In certain structural problems, for instance those arising in inventory control,
this approach provably and empirically performs better than EO.
High dimension: We have fixed the problem dimension throughout this paper. It is well-known,
however, that certain regularization schemes, e.g., L1 -penalty as in LASSO (Friedman et al. 2001),
are suited for high-dimensional problems that exhibit sparse structures. Through the equivalence
with regularization, this also implies that certain types of Wasserstein DRO enjoy similar statistical
benefits (Blanchet et al. 2019, Shafieezadeh-Abadeh et al. 2019).
Distortion of loss function properties: For problems that do not satisfy our imposed smoothness conditions, the decay rate of G(x̂EOn) could be 1/√n instead of 1/n. In this situation, χ²-divergence DRO, which exhibits variance regularization, can be used to boost the rate to 1/n in certain examples (Duchi and Namkoong 2019). Moreover, it is also shown that DRO using a distance
examples (Duchi and Namkoong 2019). Moreover, it is also shown that DRO using a distance
induced from the reproducing kernel Hilbert space (Staib and Jegelka 2019), which relates to the
so-called kernel mean matching (Gretton et al. 2009), could exhibit finite-sample optimality gap
bounds that are free of any complexity of the loss function class, which is drastically distinct from
EO’s behavior (Zeng and Lam 2021).
Data pooling and contextual optimization: When there are simultaneously many stochastic
optimization problems to solve, it is shown that introducing a shrinkage onto EO can enhance
the tradeoff between the optimality gap and instability (Gupta and Rusmevichientong 2021,
Gupta and Kallus 2021). This shrinkage is in a similar spirit as the so-called James-Stein estimator
(Cox and Hinkley 1979) in classical statistics, in which individual estimators, or the solutions of the
individual optimization problems, are adjusted by “weighting” with a pooled estimator. Relatedly,
when the considered optimization problem involves parameters or outcomes that depend on covariates, using no such covariate information, or using the naive predict-then-optimize approach, can be improved upon by integrating the prediction and EO steps, which leads to better ultimate objective value performance (e.g., Ban and Rudin 2019, Elmachtoub and Grigas 2021).
Acknowledgments
I gratefully acknowledge support from the National Science Foundation under grants CAREER CMMI-
1834710 and IIS-1849280.
References
Asmussen S, Glynn PW (2007) Stochastic Simulation: Algorithms and Analysis, volume 57 (Springer Science
& Business Media).
Ban GY, Rudin C (2019) The big data newsvendor: Practical insights from machine learning. Operations
Research 67(1):90–108.
Bayraksan G, Love DK (2015) Data-driven stochastic programming using phi-divergences. Tutorials in Oper-
ations Research, 1–19 (INFORMS).
Ben-Tal A, Den Hertog D, De Waegenaere A, Melenberg B, Rennen G (2013) Robust solutions of optimization
problems affected by uncertain probabilities. Management Science 59(2):341–357.
Bertsimas D, Brown DB, Caramanis C (2011) Theory and applications of robust optimization. SIAM Review
53(3):464–501.
Bertsimas D, Gupta V, Kallus N (2018) Robust sample average approximation. Mathematical Programming
171(1-2):217–282.
Blanchet J, Kang Y, Murthy K (2019) Robust Wasserstein profile inference and applications to machine learning. Journal of Applied Probability 56(3):830–857.
Blanchet J, Murthy K (2019) Quantifying distributional model risk via optimal transport. Mathematics of
Operations Research 44(2):565–600.
Chen L, Ma W, Natarajan K, Simchi-Levi D, Yan Z (2018) Distributionally robust linear and discrete
optimization with marginals. Available at SSRN 3159473 .
Chen R, Paschalidis IC (2018) A robust learning approach for regression models based on distributionally
robust optimization. Journal of Machine Learning Research 19(13).
Chen X, He S, Jiang B, Ryan CT, Zhang T (2021) The discrete moment problem with nonconvex shape
constraints. Operations Research 69(1):279–296.
Chu LY, Shanthikumar JG, Shen ZJM (2008) Solving operational statistics via a Bayesian analysis. Opera-
tions Research Letters 36(1):110–116.
Delage E, Ye Y (2010) Distributionally robust optimization under moment uncertainty with application to
data-driven problems. Operations Research 58(3):595–612.
Dhara A, Das B, Natarajan K (2021) Worst-case expected shortfall with univariate and bivariate marginals.
INFORMS Journal on Computing 33(1):370–389.
Doan XV, Li X, Natarajan K (2015) Robustness to dependency in portfolio optimization using overlapping
marginals. Operations Research 63(6):1468–1488.
Duchi JC, Glynn PW, Namkoong H (2021) Statistics of robust optimization: A generalized empirical likeli-
hood approach. Mathematics of Operations Research .
Duchi JC, Namkoong H (2019) Variance-based regularization with convex objectives. Journal of Machine Learning Research 20(68).
Dupuis P, Katsoulakis MA, Pantazis Y, Plechác P (2016) Path-space information bounds for uncertainty
quantification and sensitivity analysis of stochastic dynamics. SIAM/ASA Journal on Uncertainty
Quantification 4(1):80–111.
Elmachtoub AN, Grigas P (2021) Smart “predict, then optimize”. Management Science.
Esfahani PM, Kuhn D (2018) Data-driven distributionally robust optimization using the Wasserstein metric:
Performance guarantees and tractable reformulations. Mathematical Programming 171(1-2):115–166.
Friedman J, Hastie T, Tibshirani R (2001) The Elements of Statistical Learning, volume 1 (Springer Series
in Statistics New York).
Gao R, Chen X, Kleywegt AJ (2017a) Wasserstein distributional robustness and regularization in statistical
learning. arXiv preprint.
Gao R, Chen X, Kleywegt AJ (2017b) Wasserstein distributionally robust optimization and variation regu-
larization. arXiv preprint arXiv:1712.06050 .
Gao R, Kleywegt AJ (2016) Distributionally robust stochastic optimization with Wasserstein distance. arXiv
preprint arXiv:1604.02199 .
Ghaoui LE, Oks M, Oustry F (2003) Worst-case value-at-risk and robust portfolio optimization: A conic
programming approach. Operations Research 51(4):543–556.
Goh J, Sim M (2010) Distributionally robust optimization and its tractable approximations. Operations
Research 58(4-Part-1):902–917.
Gotoh JY, Kim MJ, Lim AE (2018) Robust empirical optimization is almost the same as mean–variance
optimization. Operations Research Letters 46(4):448–452.
Gotoh JY, Kim MJ, Lim AE (2021) Calibration of distributionally robust empirical optimization models.
Operations Research.
Gretton A, Smola A, Huang J, Schmittfull M, Borgwardt K, Schölkopf B (2009) Covariate shift by kernel
mean matching. Dataset Shift in Machine Learning 3(4):5.
Gupta V (2019) Near-optimal Bayesian ambiguity sets for distributionally robust optimization. Management
Science 65(9):4242–4260.
Gupta V, Rusmevichientong P (2021) Small-data, large-scale linear optimization with uncertain objectives.
Management Science 67(1):220–241.
Hadar J, Russell WR (1969) Rules for ordering uncertain prospects. The American Economic Review
59(1):25–34.
Hampel FR (1974) The influence curve and its role in robust estimation. Journal of the American Statistical
Association 69(346):383–393.
Hanasusanto GA, Roitch V, Kuhn D, Wiesemann W (2015) A distributionally robust perspective on uncer-
tainty quantification and chance constrained programming. Mathematical Programming 151(1):35–62.
Hanoch G, Levy H (1969) The efficiency analysis of choices involving risk. The Review of Economic Studies
36(3):335–346.
Hong LJ, Huang Z, Lam H (2020) Learning-based robust optimization: Procedures and statistical guarantees.
Management Science.
Jiang R, Guan Y (2016) Data-driven chance constrained stochastic program. Mathematical Programming
158(1):291–327.
Kosorok MR (2007) Introduction to Empirical Processes and Semiparametric Inference (Springer Science &
Business Media).
Kuhn D, Esfahani PM, Nguyen VA, Shafieezadeh-Abadeh S (2019) Wasserstein distributionally robust opti-
mization: Theory and applications in machine learning. Operations Research & Management Science
in the Age of Analytics, 130–166 (INFORMS).
Lam H (2016) Robust sensitivity analysis for stochastic systems. Mathematics of Operations Research
41(4):1248–1275.
Lam H (2018) Sensitivity to serial dependency of input processes: A robust approach. Management Science
64(3):1311–1327.
Lam H (2019) Recovering best statistical guarantees via the empirical divergence-based distributionally
robust optimization. Operations Research 67(4):1090–1105.
Lam H, Mottet C (2017) Tail analysis without parametric models: A worst-case perspective. Operations
Research 65(6):1696–1711.
Lam H, Zhou E (2017) The empirical likelihood approach to quantifying uncertainty in sample average
approximation. Operations Research Letters 45(4):301 – 307.
Landsberger M, Meilijson I (1993) Mean-preserving portfolio dominance. The Review of Economic Studies
60(2):479–485.
Li B, Jiang R, Mathieu JL (2017) Ambiguous risk constraints with moment and unimodality information.
Mathematical Programming 1–42.
Li KC (1986) Asymptotic optimality of CL and generalized cross-validation in ridge regression with
application to spline smoothing. The Annals of Statistics 14(3):1101–1112.
Li KC (1987) Asymptotic optimality for Cp, CL, cross-validation and generalized cross-validation:
Discrete index set. The Annals of Statistics 15(3):958–975.
Lim AE, Shanthikumar JG, Shen ZM (2006) Model uncertainty, robust optimization, and learning. Models,
Methods, and Applications for Innovative Decision Making, 66–94 (INFORMS).
Liyanage LH, Shanthikumar JG (2005) A practical inventory control policy using operational statistics.
Operations Research Letters 33(4):341–348.
Póczos B, Xiong L, Schneider J (2012) Nonparametric divergence estimation with applications to machine
learning on distributions. arXiv preprint arXiv:1202.3758 .
Popescu I (2005) A semidefinite programming approach to optimal-moment bounds for convex classes of
distributions. Mathematics of Operations Research 30(3):632–657.
Rothschild M, Stiglitz JE (1970) Increasing risk: I. A definition. Journal of Economic Theory 2(3):225–243.
Shaked M, Shanthikumar JG (2007) Stochastic Orders (Springer Science & Business Media).
Shapiro A, Dentcheva D, Ruszczyński A (2014) Lectures on Stochastic Programming: Modeling and Theory
(SIAM).
Sriperumbudur BK, Fukumizu K, Gretton A, Schölkopf B, Lanckriet GR, et al. (2012) On the empirical
estimation of integral probability metrics. Electronic Journal of Statistics 6:1550–1599.
Staib M, Jegelka S (2019) Distributionally robust optimization and generalization in kernel methods. Wal-
lach HM, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox EB, Garnett R, eds., Advances in Neural
Information Processing Systems, 9131–9141.
Sutter T, Van Parys BP, Kuhn D (2020) A general framework for optimal data-driven optimization. arXiv
preprint arXiv:2010.06606 .
Van der Vaart AW (2000) Asymptotic Statistics, volume 3 (Cambridge University Press).
Van Der Vaart AW, Wellner JA (1996) Weak Convergence and Empirical Processes (Springer).
Van Parys BP, Esfahani PM, Kuhn D (2020) From data to decisions: Distributionally robust optimization
is optimal. Management Science.
Van Parys BP, Goulart PJ, Kuhn D (2016) Generalized Gauss inequalities via semidefinite programming.
Mathematical Programming 156(1-2):271–302.
Wang Q, Kulkarni SR, Verdú S (2009) Divergence estimation for multidimensional densities via k-nearest-
neighbor distances. IEEE Transactions on Information Theory 55(5):2392–2405.
Wang Z, Glynn PW, Ye Y (2016) Likelihood robust optimization for data-driven problems. Computational
Management Science 13(2):241–261.
Wiesemann W, Kuhn D, Sim M (2014) Distributionally robust convex optimization. Operations Research
62(6):1358–1376.
Wu D, Zhu H, Zhou E (2018) A Bayesian risk approach to data-driven stochastic optimization: Formulations
and asymptotics. SIAM Journal on Optimization 28(2):1588–1612.
Zeng Y, Lam H (2021) Complexity-free generalization via distributionally robust optimization. Preprint.
Proofs of Statements
Theorem EC.1 (a.k.a. Theorem 5.31 in Van der Vaart 2000). Consider, for all x ∈ X ⊂ R^d and λ ∈ R^m such that ‖λ − λ_0‖ ≤ δ for a given point λ_0 and some δ > 0, a function ϕ_{x,λ}(·) mapping Ξ to R^d. Also consider (possibly random) sequences x_n ∈ X and λ_n for n = 1, 2, . . .. Let P be a distribution generating ξ ∈ Ξ and E_P[·] denote its expectation. Let P̂_n denote the empirical distribution of i.i.d. observations ξ_1, . . . , ξ_n and E_{P̂_n}[·] denote its expectation. We assume the following conditions:
1. {ϕ_{x,λ}(·) : ‖x − x∗‖ ≤ δ, ‖λ − λ_0‖ ≤ δ} is Donsker.
2. E_P‖ϕ_{x,λ}(ξ) − ϕ_{x∗,λ_0}(ξ)‖² → 0 as (x, λ) → (x∗, λ_0).
3. E_P[ϕ_{x∗,λ_0}(ξ)] = 0.
4. E_P[ϕ_{x,λ}(ξ)] is differentiable at x = x∗, uniformly in λ within a small neighborhood of λ_0, with non-singular derivative matrix V_{x∗,λ} = ∇E_P[ϕ_{x∗,λ}(ξ)] such that V_{x∗,λ} → V_{x∗,λ_0}.
5. √n E_{P̂_n}[ϕ_{x_n,λ_n}(ξ)] = o_p(1).
6. (x_n, λ_n) →_p (x∗, λ_0) as n → ∞.
Then
x_n − x∗ = −V_{x∗,λ_0}^{-1} E_P[ϕ_{x∗,λ_n}(ξ)] − V_{x∗,λ_0}^{-1} E_{P̂_n}[ϕ_{x∗,λ_0}(ξ)] + o_p(1/√n + ‖E_P[ϕ_{x∗,λ_n}(ξ)]‖)
Theorem EC.2 (a.k.a. Theorem 5.9 in Van der Vaart 2000). Let Ψ_n(·) be random vector-valued functions and Ψ(·) be a deterministic vector-valued function of x ∈ X such that, for every ǫ > 0,
sup_{x∈X} ‖Ψ_n(x) − Ψ(x)‖ →_p 0
and inf_{x∈X: ‖x−x∗‖≥ǫ} ‖Ψ(x)‖ > 0 = ‖Ψ(x∗)‖. Then any (possibly random) sequence x_n ∈ X with Ψ_n(x_n) = o_p(1) converges in probability to x∗.
⟨IF(ξ), P̂_n − P⟩ = O_p(1/√n), since √n⟨IF(ξ), P̂_n − P⟩ converges weakly to a Gaussian vector with mean 0 and covariance Cov_P(IF(ξ)) by the CLT. Thus
x̂_n^λ − x∗ = O_p(1/√n) + λK + o_p(1/√n + λ) = O_p(1/√n + λ)   (EC.3)
and hence
‖x̂_n^λ − x∗‖² = O_p(1/n + λ²)
which gives
r(x̂_n^λ)‖x̂_n^λ − x∗‖² = o_p(1/n + λ²)
Thus, (EC.2) becomes
Z(x̂_n^λ) − Z(x∗) = (1/2)(x̂_n^λ − x∗)^⊤ ∇²Z(x∗)(x̂_n^λ − x∗) + o_p(1/n + λ²)   (EC.4)
Now putting in (3), we get
Z(x̂_n^λ) − Z(x∗) = (1/2)(⟨IF(ξ), P̂_n − P⟩ + λK + o_p(1/√n + λ))^⊤ ∇²Z(x∗)(⟨IF(ξ), P̂_n − P⟩ + λK + o_p(1/√n + λ)) + o_p(1/n + λ²)   (EC.5)
which can be further written as
(1/2)(⟨IF(ξ), P̂_n − P⟩ + λK)^⊤ ∇²Z(x∗)(⟨IF(ξ), P̂_n − P⟩ + λK) + o_p(1/n + λ²)
or
(1/2)⟨IF(ξ), P̂_n − P⟩^⊤ ∇²Z(x∗)⟨IF(ξ), P̂_n − P⟩ + (λ²/2)K^⊤∇²Z(x∗)K + λK^⊤∇²Z(x∗)⟨IF(ξ), P̂_n − P⟩ + o_p(1/n + λ²)
Proof of Proposition 2. By Assumption 2 and the definition of ⟨IF(ξ), P̂_n − P⟩ in (4), the CLT implies that
√n⟨IF(ξ), P̂_n − P⟩ ⇒ Y
where Y is a Gaussian vector with mean 0 and covariance Cov_P(IF(ξ)). From (7), and using the continuity of the quadratic function and Slutsky's theorem, we have, when a = 0 so that √n λ_n → 0, that
nG(x̂_n^{λ_n}) = (1/2)(√n⟨IF(ξ), P̂_n − P⟩)^⊤ ∇²Z(x∗)(√n⟨IF(ξ), P̂_n − P⟩) + (nλ_n²/2) K^⊤∇²Z(x∗)K + √n λ_n K^⊤∇²Z(x∗)(√n⟨IF(ξ), P̂_n − P⟩) + o_p(1 + nλ_n²)
⇒ (1/2) Y^⊤∇²Z(x∗)Y
Likewise, when a = ∞ or −∞, which means √n λ_n → ∞ or −∞, we have
(1/λ_n²) G(x̂_n^{λ_n}) = (1/(2nλ_n²))(√n⟨IF(ξ), P̂_n − P⟩)^⊤ ∇²Z(x∗)(√n⟨IF(ξ), P̂_n − P⟩) + (1/2) K^⊤∇²Z(x∗)K + (1/(√n λ_n)) K^⊤∇²Z(x∗)(√n⟨IF(ξ), P̂_n − P⟩) + o_p(1/(nλ_n²) + 1)
⇒ (1/2) K^⊤∇²Z(x∗)K
Moreover, when √n λ_n → a for some 0 < |a| < ∞, we have
nG(x̂_n^{λ_n}) = (1/2)(√n⟨IF(ξ), P̂_n − P⟩)^⊤ ∇²Z(x∗)(√n⟨IF(ξ), P̂_n − P⟩) + (nλ_n²/2) K^⊤∇²Z(x∗)K + √n λ_n K^⊤∇²Z(x∗)(√n⟨IF(ξ), P̂_n − P⟩) + o_p(1 + nλ_n²)
⇒ (1/2) Y^⊤∇²Z(x∗)Y + (a²/2) K^⊤∇²Z(x∗)K + a K^⊤∇²Z(x∗)Y
Finally, when λ = 0, we either use the same line of arguments as above or observe that we are in the case a = 0, which reduces to
nG(x̂_n^{EO}) ⇒ (1/2) Y^⊤∇²Z(x∗)Y
This concludes the proposition.
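To make the three regimes concrete, the following Monte Carlo sketch in Python (a toy construction of ours, not from the paper; the quadratic loss, ridge penalty and all parameter values are illustrative assumptions) simulates the scaled optimality gap nG(x̂_n^{λ_n}) with λ_n = a/√n for h(x, ξ) = (x − ξ)² penalized by λx², whose regularized EO solution is the shrunken sample mean ξ̄/(1 + λ):

import numpy as np

# Toy model (assumed): Z(x) = E[(x - xi)^2], x* = E[xi] = mu, and the
# ridge-regularized EO solution is xbar / (1 + lambda).
rng = np.random.default_rng(0)
mu, sigma, n, reps = 1.0, 2.0, 10_000, 100_000

def scaled_gap(a):
    """Samples of n * G(x_hat^{lambda_n}) with lambda_n = a / sqrt(n)."""
    lam = a / np.sqrt(n)
    xbar = rng.normal(mu, sigma / np.sqrt(n), size=reps)  # exact law of the sample mean
    x_hat = xbar / (1.0 + lam)                            # regularized EO solution
    return n * (x_hat - mu) ** 2                          # gap: Z(x) - Z(x*) = (x - mu)^2

for a in [0.0, 1.0, 5.0]:
    g = scaled_gap(a)
    print(f"a = {a}: mean = {g.mean():.2f}, 90% quantile = {np.quantile(g, 0.9):.2f}")

The case a = 0 reproduces the EO limit (1/2)Y^⊤∇²Z(x∗)Y, while larger |a| visibly shifts the scaled gap upward, consistent with the deterministic extra term in the weak limit.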
Proof of Proposition 3. When a = 0, from Proposition 2 it is trivial to see that the weak limits of nG(x̂_n^{λ_n}) and nG(x̂_n^{EO}) coincide. When a = ∞ or −∞, we have nG(x̂_n^{λ_n}) = (nλ_n²)(1/λ_n²)G(x̂_n^{λ_n}) →_p ∞ since (1/2)K^⊤∇²Z(x∗)K > 0 by Assumption 3. Finally, when 0 < |a| < ∞, we compare the weak limits of nG(x̂_n^{λ_n}) and nG(x̂_n^{EO}) in Proposition 2. Note that (1/2)a²K^⊤∇²Z(x∗)K is deterministic and positive, and
E[aK^⊤∇²Z(x∗)Y | (1/2)Y^⊤∇²Z(x∗)Y] = 0   (EC.6)
This is because Y, as a mean-zero Gaussian vector, is symmetric, i.e., Y =_d −Y, and thus for any z ∈ R, Y given (1/2)Y^⊤∇²Z(x∗)Y = z has the same distribution as −Y given (1/2)(−Y)^⊤∇²Z(x∗)(−Y) = z, or equivalently (1/2)Y^⊤∇²Z(x∗)Y = z. This implies E[Y | (1/2)Y^⊤∇²Z(x∗)Y] = E[−Y | (1/2)Y^⊤∇²Z(x∗)Y], which in turn implies E[Y | (1/2)Y^⊤∇²Z(x∗)Y] = 0 and thus (EC.6). Hence, by Proposition 1, we get that (1/2)Y^⊤∇²Z(x∗)Y is second-order stochastically dominated by (1/2)Y^⊤∇²Z(x∗)Y + (1/2)a²K^⊤∇²Z(x∗)K + aK^⊤∇²Z(x∗)Y.
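This dominance can also be checked numerically. Below is a minimal sketch (our own construction; the matrix A standing in for ∇²Z(x∗), the vector K and the level a are illustrative assumptions) that samples the two weak limits and tests the stop-loss criterion E[(W₁ − t)₊] ≤ E[(W₂ − t)₊] over a grid of t, a standard integrated-tail characterization of this kind of dominance comparison between losses:

import numpy as np

# W1 = (1/2) Y'AY is the EO limit; W2 adds the deterministic and conditionally
# mean-zero terms from Proposition 2. A, K, a are illustrative assumptions.
rng = np.random.default_rng(1)
A = np.array([[2.0, 0.3], [0.3, 1.0]])   # stands in for the Hessian of Z at x*
K = np.array([0.5, -1.0])                # illustrative bias direction
a = 1.5
Y = rng.multivariate_normal(np.zeros(2), np.eye(2), size=200_000)

w1 = 0.5 * np.einsum('ni,ij,nj->n', Y, A, Y)
w2 = w1 + 0.5 * a ** 2 * (K @ A @ K) + a * (Y @ A @ K)

ts = np.linspace(np.quantile(w2, 0.01), np.quantile(w2, 0.99), 50)
ok = all(np.maximum(w1 - t, 0).mean() <= np.maximum(w2 - t, 0).mean() + 1e-3
         for t in ts)
print("stop-loss criterion holds on the grid:", ok)   # expect True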
Proof of Theorem 1. This follows immediately from Proposition 3, allowing for the weak limit
of ∞ and using the subsequence argument in Definition 2 when needed.
Proof of Theorem 2. Note that Assumption 4 gives
∇Z(x∗) = −Σ_{j∈B} α_j^∗ ∇g_j(x∗)   (EC.7)
Moreover, since Assumption 2 implies x̂_n^λ − x∗ →_p 0 as n → ∞ and λ → 0, we have, from Assumption 5, with probability converging to 1,
0 = g_j(x̂_n^λ) − g_j(x∗) = ∇g_j(x∗)^⊤(x̂_n^λ − x∗) + (1/2)(x̂_n^λ − x∗)^⊤∇²g_j(x∗)(x̂_n^λ − x∗) + o_p(‖x̂_n^λ − x∗‖²)
or
−∇g_j(x∗)^⊤(x̂_n^λ − x∗) = (1/2)(x̂_n^λ − x∗)^⊤∇²g_j(x∗)(x̂_n^λ − x∗) + o_p(‖x̂_n^λ − x∗‖²)   (EC.8)
Hence, by Taylor expansion, (EC.7) and (EC.8),
Z(x̂_n^λ) − Z(x∗)
= ∇Z(x∗)^⊤(x̂_n^λ − x∗) + (1/2)(x̂_n^λ − x∗)^⊤∇²Z(x∗)(x̂_n^λ − x∗) + o_p(‖x̂_n^λ − x∗‖²)
= −Σ_{j∈B} α_j^∗ ∇g_j(x∗)^⊤(x̂_n^λ − x∗) + (1/2)(x̂_n^λ − x∗)^⊤∇²Z(x∗)(x̂_n^λ − x∗) + o_p(‖x̂_n^λ − x∗‖²)
= (1/2)Σ_{j∈B} α_j^∗ (x̂_n^λ − x∗)^⊤∇²g_j(x∗)(x̂_n^λ − x∗) + (1/2)(x̂_n^λ − x∗)^⊤∇²Z(x∗)(x̂_n^λ − x∗) + o_p(‖x̂_n^λ − x∗‖²)
= (1/2)(x̂_n^λ − x∗)^⊤(∇²Z(x∗) + Σ_{j∈B} α_j^∗∇²g_j(x∗))(x̂_n^λ − x∗) + o_p(‖x̂_n^λ − x∗‖²)
It is now clear that the optimality gap Z(x̂_n^λ) − Z(x∗) behaves the same as (8) except that ∇²Z(x∗) is replaced by ∇²Z(x∗) + Σ_{j∈B} α_j^∗∇²g_j(x∗). Thus the proofs of all theorems go through in the same way.
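The Hessian correction can be seen concretely in a small constrained example (our own illustration, not from the paper): minimize Z(x) = E‖x − ξ‖² over the unit ball with μ = E[ξ] outside the ball, so the constraint binds, x∗ = μ/‖μ‖ and α∗ = ‖μ‖ − 1. For a nearby feasible point, the true gap matches the Lagrangian-Hessian quadratic form rather than the uncorrected one:

import numpy as np

# min E||x - xi||^2 s.t. ||x||^2 <= 1, with mu = E[xi] outside the unit ball.
# grad^2 Z = 2I and grad^2 g = 2I, so the corrected Hessian is (2 + 2 alpha*) I.
mu = np.array([1.5, 1.0])                       # assumed mean, ||mu|| > 1
x_star = mu / np.linalg.norm(mu)
alpha_star = np.linalg.norm(mu) - 1.0

theta = 1e-3                                    # small rotation along the sphere (feasible)
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
x = rot @ x_star

gap = np.sum((x - mu) ** 2) - np.sum((x_star - mu) ** 2)               # Z(x) - Z(x*)
corrected = 0.5 * (2.0 + 2.0 * alpha_star) * np.sum((x - x_star) ** 2)
uncorrected = 0.5 * 2.0 * np.sum((x - x_star) ** 2)
print(f"gap = {gap:.3e}, corrected = {corrected:.3e}, uncorrected = {uncorrected:.3e}")

Here the gap and the corrected form agree (both approximately ‖μ‖θ²), while the uncorrected form understates the curvature.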
Assumption EC.1 stipulates that h is sufficiently smooth, with gradients satisfying Lipschitzness
and curvature properties. Assumption EC.2 is a condition on the “functional complexity” of the
class {h(x, ·)} and implies that h(x, ξ) satisfies a suitable uniform law of large numbers and CLT,
properties that are commonly used to ensure the validity of EO. Assumption EC.3 stipulates that
R is sufficiently smooth with a bounded Hessian matrix.
Next we state explicitly that the oracle solution x∗ and the data-driven solution x̂_n^{Reg,λ} both satisfy corresponding first-order optimality conditions.
Assumption EC.4 part 1 signifies that the oracle solution x∗ satisfies the first-order optimality condition, and moreover is unique in this satisfaction. Part 2 of the assumption signifies that the regularized data-driven solution x̂_n^{Reg,λ} satisfies the first-order optimality condition of the regularized EO problem.
We are now ready to prove Theorem 3.
Proof of Theorem 3. We verify all the conditions in Theorem EC.1 to conclude the result, where we set x_n in Theorem EC.1 to be x̂_n^{Reg,λ_n}. First of all, by Assumptions EC.1.1 and EC.1.2, we can exchange derivative and expectation so that ∇E_P[h(x, ξ)] = E_P[∇h(x, ξ)] (e.g., Chapter VII, Proposition 2.3 in Asmussen and Glynn 2007). Now, we take ϕ_{x,λ}(ξ) in Theorem EC.1 as ∇h(x, ξ) + λ∇R(x), and set λ_0 = 0 therein. Then:
Condition 1: Assumption EC.2, together with the fact that R(x) is deterministic and satisfies Assumption EC.3.1, gives that {ϕ_{x,λ}(·) : ‖x − x∗‖ ≤ δ, |λ − λ_0| ≤ δ} is Donsker (Example 2.10.7 in Van Der Vaart and Wellner 1996). Thus Condition 1 in Theorem EC.1 is satisfied.
Condition 2: Consider
E_P‖∇h(x, ξ) + λ∇R(x) − ∇h(x∗, ξ)‖² ≤ ((E_P‖∇h(x, ξ) − ∇h(x∗, ξ)‖²)^{1/2} + λ‖∇R(x)‖)² → 0
as (x, λ) → (x∗, 0), where the first step uses Minkowski's inequality and the convergence follows from Assumption EC.1.3. Thus Condition 2 in Theorem EC.1 is satisfied.
Condition 3: Condition 3 in Theorem EC.1 is satisfied since we have argued for Condition 1 above that ∇E_P[h(x, ξ)] = E_P[∇h(x, ξ)], and that x∗ satisfies the first-order condition in Assumption EC.4.1.
Condition 4: The difference quotient ‖(∇h(x∗ + δe_j, ξ) − ∇h(x∗, ξ))/δ‖ is bounded by L_2(ξ) for δ in a neighborhood of 0, where E_P[L_2(ξ)] < ∞ and e_j denotes a vector of 0's except the j-th component being 1, and δ ≠ 0. Thus the dominated convergence theorem gives ∇E_P[∇h(x∗, ξ)] = E_P[∇²h(x∗, ξ)]. This shows that E_P[∇h(x, ξ)] is differentiable at x = x∗. Moreover, together with Assumption EC.3.1 we have
V_{x∗,λ} = E_P[∇²h(x∗, ξ)] + λ∇²R(x∗)
Furthermore, clearly
E_P[∇²h(x∗, ξ)] + λ∇²R(x∗) → E_P[∇²h(x∗, ξ)]
as λ → 0, and with Assumption EC.1.4 we have E_P[∇²h(x∗, ξ)] + λ∇²R(x∗) being positive definite for λ in a neighborhood of 0. Thus Condition 4 in Theorem EC.1 is satisfied.
Condition 5: Condition 5 in Theorem EC.1 is satisfied since ∇(E_{P̂_n}[h(x, ξ)] + λR(x)) = E_{P̂_n}[∇h(x, ξ)] + λ∇R(x) by Assumptions EC.1.1 and EC.3.1, and by Assumption EC.4.2 we have x̂_n^{Reg,λ} being a root of E_{P̂_n}[∇h(x, ξ)] + λ∇R(x) = 0.
Condition 6: We apply Theorem EC.2, setting Ψ_n(x) = E_{P̂_n}[∇h(x, ξ)] + λ_n∇R(x) and Ψ(x) = E_P[∇h(x, ξ)]. We have
sup_{x∈X} ‖E_{P̂_n}[∇h(x, ξ)] + λ_n∇R(x) − E_P[∇h(x, ξ)]‖ ≤ sup_{x∈X} ‖E_{P̂_n}[∇h(x, ξ)] − E_P[∇h(x, ξ)]‖ + λ_n sup_{x∈X} ‖∇R(x)‖ →_p 0
as n → ∞ by Minkowski's inequality and Assumption EC.3.2. This concludes sup_{x∈X} ‖Ψ_n(x) − Ψ(x)‖ →_p 0 in Theorem EC.2. Moreover, Assumption EC.4 concludes inf_{x∈X: ‖x−x∗‖≥ǫ} ‖Ψ(x)‖ > 0 = ‖Ψ(x∗)‖. Therefore, by Theorem EC.2, we have x̂_n^{Reg,λ_n} →_p x∗.
With the above six conditions, we can invoke Theorem EC.1 to obtain
x̂_n^{Reg,λ_n} − x∗
= −(E_P[∇²h(x∗, ξ)])^{-1} E_P[∇h(x∗, ξ) + λ_n∇R(x∗)] − (E_P[∇²h(x∗, ξ)])^{-1} E_{P̂_n}[∇h(x∗, ξ)] + o_p(1/√n + ‖E_P[∇h(x∗, ξ) + λ_n∇R(x∗)]‖)
= −⟨(E_P[∇²h(x∗, ξ)])^{-1}∇h(x∗, ξ), P̂_n − P⟩ − λ_n(E_P[∇²h(x∗, ξ)])^{-1}∇R(x∗) + o_p(1/√n + λ_n‖∇R(x∗)‖)
which is the conclusion of the theorem.
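As a quick sanity check of this expansion, the sketch below (a toy setup of ours with assumed problem data, not from the paper) takes h(x, ξ) = e^x − xξ with ξ exponential with mean m, so x∗ = log m and E_P[∇²h(x∗, ξ)] = m, together with R(x) = x²/2, and compares the regularized EO root against the two first-order terms above:

import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(2)
m, n = 2.0, 50_000                 # assumed E[xi] and sample size
lam = n ** -0.5                    # lambda_n -> 0 with sqrt(n) * lambda_n = 1
xi = rng.exponential(m, size=n)
x_star, V = np.log(m), m           # oracle solution and Hessian E[grad^2 h(x*, xi)]

# Regularized EO solution: root of E_{P_hat_n}[grad h(x, xi)] + lam * grad R(x) = 0.
x_hat = brentq(lambda x: np.exp(x) - xi.mean() + lam * x, -10.0, 10.0)

# First-order expansion: -V^{-1} <grad h(x*, .), P_hat_n - P> - lam * V^{-1} grad R(x*).
approx = -(np.exp(x_star) - xi.mean()) / V - lam * x_star / V
print(f"x_hat - x* = {x_hat - x_star:+.6f}, expansion = {approx:+.6f}")

The two numbers agree up to the o_p(1/√n + λ_n) remainder, which is of smaller order here.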
The first condition in Assumption EC.5 guarantees that we can use the KKT conditions to locate the optimal solution of (18) (we will soon see that the problem min_{x∈X, α≥0, β∈R} {αE_{P̂_n}[φ∗((h(x, ξ) − β)/α)] + αλ_n + β} is a dualization of (18)). The second condition ensures the obtained optimal solution is nontrivial. Both conditions need to be satisfied with overwhelming probability as n → ∞.
Analogous to Assumption EC.4, we also need the following first-order optimality condition for
the oracle solution.
Compared to Assumption EC.1, here we strengthen part 2 of the assumption to the uniform
boundedness of h and ∇h. Note that the uniform boundedness of ∇h implies L1 -Lipschitzness of h,
which is precisely Assumption EC.1.2. The stronger Assumption EC.7.2 can be potentially relaxed
but it helps streamline our mathematical arguments.
Compared to Assumption EC.2, Assumption EC.8 requires the Glivenko-Cantelli property for h and h², and the Donsker property for a larger function class {φ∗′(αh(x, ·) − β)∇h(x, ·)}. Moreover, it requires a non-degeneracy of h uniformly across all x ∈ X.
We are now ready to prove Theorem 4.
Proof of Theorem 4. Consider the inner maximization in (18). By a change of measure from Q to P̂_n, we rewrite it as
max_{L≥0} E_{P̂_n}[h(x, ξ)L]
subject to E_{P̂_n}[φ(L)] ≤ λ, E_{P̂_n}[L] = 1   (EC.9)
where the decision variable is now the likelihood ratio L = L(ξ). Next, since L ≡ 1 is in the interior of the feasible region for any λ > 0, Slater's condition holds and we rewrite (EC.9) as the Lagrangian formulation
min_{α≥0, β∈R} {αE_{P̂_n}[φ∗((h(x, ξ) − β)/α)] + αλ + β}   (EC.10)
where in the above we define 0φ∗(a/0) = ∞ if a > 0 and 0φ∗(a/0) = 0 for a ≤ 0 (i.e., when α = 0 the resulting objective function is equal to ∞ if ess sup_{P̂_n} h(x, ξ) > β, and β otherwise). Thus, (18) is equivalent to
min_{x∈X, α≥0, β} {αE_{P̂_n}[φ∗((h(x, ξ) − β)/α)] + αλ + β}   (EC.11)
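The dualization (EC.9)-(EC.11) is easy to sanity-check numerically. The sketch below (our own illustration with assumed data) takes the KL divergence φ(t) = t log t, whose conjugate is φ∗(s) = e^{s−1}; eliminating β analytically from the two-parameter dual recovers the classical one-parameter KL dual min_{α>0} α log E_{P̂_n}[e^{h/α}] + αλ, and the two numerical values should coincide:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
h = rng.normal(size=500)     # stands in for the values h(x, xi_i) at a fixed x
lam = 0.1

def dual2(v):
    """Two-parameter dual objective in (EC.11) at a fixed x, for phi*(s) = exp(s - 1)."""
    a, b = np.exp(v[0]), v[1]            # alpha > 0 via an exp reparametrization
    return a * np.mean(np.exp((h - b) / a - 1.0)) + a * lam + b

def dual1(a):
    """Classical one-parameter KL dual obtained by optimizing out beta."""
    return a * np.log(np.mean(np.exp(h / a))) + a * lam

res2 = minimize(dual2, x0=[0.0, 0.0], method='Nelder-Mead')
res1 = min(dual1(a) for a in np.linspace(0.2, 5.0, 2000))
print(f"(EC.11) inner value: {res2.fun:.5f}, one-parameter KL dual: {res1:.5f}")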
Now suppose the two events in Assumption EC.5 hold, which occur with probability reaching 1 as n → ∞. In the first event, the KKT conditions of (EC.11) uniquely identify the optimal solution. Denote (x̂_n^λ, α̂_n^λ, β̂_n^λ) as the optimal solution of (EC.11). If α̂_n^λ = 0, then β̂_n^λ = ess sup_{P̂_n} h(x̂_n^λ, ξ) and the objective value of (EC.11) becomes ess sup_{P̂_n} h(x̂_n^λ, ξ), which violates the second event. Thus, with probability reaching 1, we must have α̂_n^λ > 0. We will focus on this case for the rest of the proof and use a standard convergence in probability argument to handle the other case (which occurs with vanishing probability) when needed.
The KKT conditions in the case α̂_n^λ > 0 are precisely the first-order optimality conditions
E_{P̂_n}[φ∗′((h(x, ξ) − β)/α)∇h(x, ξ)] = 0
E_{P̂_n}[φ∗((h(x, ξ) − β)/α)] − E_{P̂_n}[φ∗′((h(x, ξ) − β)/α)(h(x, ξ) − β)/α] + λ = 0
−E_{P̂_n}[φ∗′((h(x, ξ) − β)/α)] + 1 = 0
or equivalently
E_{P̂_n}[φ∗′((h(x, ξ) − β)/α)∇h(x, ξ)] = 0
E_{P̂_n}[φ∗′((h(x, ξ) − β)/α)(h(x, ξ) − β)/α] − E_{P̂_n}[φ∗((h(x, ξ) − β)/α)] = λ
E_{P̂_n}[φ∗′((h(x, ξ) − β)/α)] = 1
For convenience, we do a change of variables and define α̃ = 1/α and β̃ = β/α. Then the above can be written as
E_{P̂_n}[φ∗′(α̃h(x, ξ) − β̃)∇h(x, ξ)] = 0   (EC.12)
E_{P̂_n}[φ∗′(α̃h(x, ξ) − β̃)(α̃h(x, ξ) − β̃)] − E_{P̂_n}[φ∗(α̃h(x, ξ) − β̃)] = λ   (EC.13)
E_{P̂_n}[φ∗′(α̃h(x, ξ) − β̃)] = 1   (EC.14)
Condition 6: We derive the asymptotic behaviors of α̃_n^λ and β̃_n^λ. First consider (EC.14). We show that we can find a unique root β̃_n = β̃_n(α̃) of (EC.14) as a function of α̃, in a neighborhood of α̃ = 0. Here and from now on, we note that the values of α̃ and β̃ could depend on x but we will suppress this dependence unless needed. Note that α̃ = β̃ = 0 solves (EC.14). Moreover, by Lemma 2, E_{P̂_n}[φ∗′(α̃h(x, ξ) − β̃)] − 1 is continuously differentiable in α̃ and β̃ both at 0 and
(∂/∂β̃)(E_{P̂_n}[φ∗′(α̃h(x, ξ) − β̃)] − 1)|_{α̃=β̃=0} = −E_{P̂_n}[φ∗″(α̃h(x, ξ) − β̃)]|_{α̃=β̃=0} = −φ∗″(0) < 0
So by the implicit function theorem, we have β̃_n(α̃) continuously differentiable in a neighborhood around α̃ = 0. Furthermore, since
(∂/∂α̃)(E_{P̂_n}[φ∗′(α̃h(x, ξ) − β̃)] − 1)|_{α̃=β̃=0} = E_{P̂_n}[φ∗″(α̃h(x, ξ) − β̃)h(x, ξ)]|_{α̃=β̃=0} = φ∗″(0)E_{P̂_n}[h(x, ξ)]
we have β̃_n′(0) = −(−φ∗″(0))^{-1}φ∗″(0)E_{P̂_n}[h(x, ξ)] = E_{P̂_n}[h(x, ξ)]. Moreover, since h is uniformly bounded by Assumption EC.7.2, we have α̃h(x, ξ) − β̃ → 0 uniformly over x, ξ as α̃, β̃ → 0. Thus, together with the mean value theorem, we have
β̃_n(α̃) = β̃_n′(ζ)α̃   (EC.15)
for some ζ between 0 and α̃, when α̃ is in a sufficiently small neighborhood of 0, where β̃_n′(ζ) converges to E_{P̂_n}[h(x, ξ)] uniformly over P̂_n and x ∈ X as α̃ → 0.
Next consider (EC.13). By Taylor expansion with the mean value form of the remainder, (EC.13) becomes
E_{P̂_n}[φ∗′(0)(α̃h(x, ξ) − β̃)] + E_{P̂_n}[φ∗″(ζ_1(x, ξ))(α̃h(x, ξ) − β̃)²] − E_{P̂_n}[φ∗′(0)(α̃h(x, ξ) − β̃)] − (1/2)E_{P̂_n}[φ∗″(ζ_2(x, ξ))(α̃h(x, ξ) − β̃)²] = λ   (EC.16)
which gives
E_{P̂_n}[φ∗″(ζ_1(x, ξ))(α̃h(x, ξ) − β̃)²] − (1/2)E_{P̂_n}[φ∗″(ζ_2(x, ξ))(α̃h(x, ξ) − β̃)²] = λ   (EC.17)
where ζ_1(x, ξ) and ζ_2(x, ξ) are some values between 0 and α̃h(x, ξ) − β̃. Let us take β̃ = β̃_n(α̃) as the root of (EC.14) described above, which is well-defined for α̃ in a neighborhood of 0. We can write the left hand side of (EC.17) as
E_{P̂_n}[φ∗″(ζ_1(x, ξ))(α̃h(x, ξ) − β̃_n(α̃))²] − (1/2)E_{P̂_n}[φ∗″(ζ_2(x, ξ))(α̃h(x, ξ) − β̃_n(α̃))²] = u_n(α̃)α̃²
where, using (EC.15),
u_n(α̃) = E_{P̂_n}[φ∗″(ζ_1(x, ξ))(h(x, ξ) − β̃_n′(ζ))²] − (1/2)E_{P̂_n}[φ∗″(ζ_2(x, ξ))(h(x, ξ) − β̃_n′(ζ))²]
Note that u_n(α̃) is continuous at α̃ = 0, and u_n(0) = φ∗″(0)Var_{P̂_n}(h(x, ξ))/2. We argue that u_n(α̃) converges to φ∗″(0)Var_P(h(x, ξ))/2 > 0 uniformly over x ∈ X a.s. as n → ∞ and α̃ → 0 (at any rate relative to n). More precisely, we consider the difference between u_n(α̃) and φ∗″(0)Var_P(h(x, ξ))/2, displayed in (EC.18); by the Glivenko-Cantelli property of h and h² in Assumption EC.8 and the continuity of φ∗″ at 0 (Lemma 2), (EC.18) goes to 0. Hence, given a sufficiently small λ and large n, we can find a root α̃_n^λ of (EC.17), or equivalently
u_n(α̃_n^λ)(α̃_n^λ)² = λ
in a small neighborhood of 0, uniformly over all x ∈ X. Moreover, note that the events represented in Assumption EC.5 occur with probability tending to 1, and thus, by the uniform non-degeneracy of the variance (Assumption EC.8.3), we have α̃_n^λ satisfying
α̃_n^λ = √(λ/u_n(α̃_n^λ)) →_p 0   (EC.19)
as n → ∞ and λ → 0. Correspondingly,
β̃_n^λ = β̃_n(α̃_n^λ) = β̃_n′(ζ_n^λ)α̃_n^λ →_p 0
where ζ_n^λ lies between 0 and α̃_n^λ. In other words, (α̃_n^λ, β̃_n^λ) →_p 0 as n → ∞ and λ → 0.
Next we show that x̂_n^λ →_p x∗. To this end, we verify the conditions in Theorem EC.2, where Ψ(x) is set to be ∇E_P[h(x, ξ)] = E_P[∇h(x, ξ)] (by the L_1-Lipschitzness of h implied by Assumption EC.7.2) and Ψ_n(x) is set to be E_{P̂_n}[φ∗′(α̃_n^{λ_n}h(x, ξ) − β̃_n^{λ_n})∇h(x, ξ)] in (EC.12). We have
sup_{x∈X} ‖E_{P̂_n}[φ∗′(α̃_n^{λ_n}h(x, ξ) − β̃_n^{λ_n})∇h(x, ξ)] − E_P[∇h(x, ξ)]‖ →_p 0
with probability tending to 1. Thus, all the conditions in Theorem EC.2 are satisfied and we have x̂_n^λ →_p x∗. Therefore, we obtain (x̂_n^λ, α̃_n^λ, β̃_n^λ) →_p (x∗, 0, 0) as n → ∞ and λ → 0.
Moreover, from (EC.15) and (EC.19), and using the uniform boundedness and continuity of h (Assumptions EC.7.1 and EC.7.2), we invoke the dominated convergence theorem to get further that
α̃_n^λ = √(2λ/(φ∗″(0)Var_P(h(x∗, ξ))))(1 + o_p(1))   (EC.20)
and
β̃_n^λ = E_P[h(x∗, ξ)]√(2λ/(φ∗″(0)Var_P(h(x∗, ξ))))(1 + o_p(1))   (EC.21)
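The rate in (EC.19)-(EC.20) can be verified numerically for a concrete divergence. In the sketch below (our own setup with assumed data), we use the normalized KL divergence φ(t) = t log t − t + 1, for which φ∗(s) = e^s − 1, φ∗′(0) = 1 and φ∗″(0) = 1; substituting the root of (EC.14), namely β̃ = log E_{P̂_n}[e^{α̃h}], reduces (EC.13) to E_{P̂_n}[z e^z] = λ with z = α̃h − β̃:

import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(4)
h = rng.normal(1.0, 2.0, size=100_000)   # stands in for h(x*, xi) under P_hat_n
lam = 1e-3

def ec13_residual(a):
    b = np.log(np.mean(np.exp(a * h)))   # root of (EC.14) for phi*(s) = e^s - 1
    z = a * h - b
    return np.mean(z * np.exp(z)) - lam  # (EC.13) after substituting (EC.14)

a_root = brentq(ec13_residual, 1e-6, 1.0)
a_pred = np.sqrt(2 * lam / np.var(h))    # (EC.20) with phi*''(0) = 1
print(f"alpha_tilde root = {a_root:.5f}, sqrt(2 lam / Var(h)) = {a_pred:.5f}")

The two values agree to leading order as λ shrinks, in line with (EC.20).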
We now continue to verify all other conditions in Theorem EC.1. As mentioned, we set the λ in the theorem as (α̃, β̃). We then set ϕ_{x,α̃,β̃}(ξ) = φ∗′(α̃h(x, ξ) − β̃)∇h(x, ξ), which is the function inside the empirical expectation in (EC.12).
Condition 2: We have
E_P‖φ∗′(α̃h(x, ξ) − β̃)∇h(x, ξ) − ∇h(x∗, ξ)‖² ≤ ((E_P‖(φ∗′(α̃h(x, ξ) − β̃) − 1)∇h(x, ξ)‖²)^{1/2} + (E_P‖∇h(x, ξ) − ∇h(x∗, ξ)‖²)^{1/2})²   (EC.22)
by Minkowski's inequality. For the first term in (EC.22), since h and ∇h are uniformly bounded (Assumption EC.7.2) and φ∗′ is continuous at 0 (Lemma 2), as α̃, β̃ → 0, we have E_P‖(φ∗′(α̃h(x, ξ) − β̃) − 1)∇h(x, ξ)‖² → 0 uniformly in x. For the second term in (EC.22), we have
E_P‖∇h(x, ξ) − ∇h(x∗, ξ)‖² ≤ E_P[L_2(ξ)²]‖x − x∗‖² → 0
as x → x∗ since E[L_2(ξ)²] < ∞ (Assumption EC.7.3). Therefore, putting these together, we have (EC.22) go to 0. This shows the satisfaction of Condition 2.
Condition 4: Since h is uniformly bounded (Assumption EC.7.2) and also L_1-Lipschitz (implied by Assumption EC.7.2), and φ∗′ is continuously differentiable at 0 (Lemma 2) and thus Lipschitz continuous in a neighborhood of 0, we have φ∗′(α̃h(x, ξ) − β̃) being Lipschitz and uniformly bounded over x, ξ a.s. for a fixed sufficiently small (α̃, β̃). On the other hand, we have ∇h being L_2-Lipschitz (implied by Assumption EC.7.3) and uniformly bounded (Assumption EC.7.2). Thus we have φ∗′(α̃h(x, ξ) − β̃)∇h(x, ξ) being Lipschitz. Moreover, by Assumption EC.7.1 and Lemma 2 again it is differentiable a.s. Thus, using the same argument used to verify Condition 4 in the proof of Theorem 3, we can exchange derivative and expectation to obtain
∇E_P[φ∗′(α̃h(x, ξ) − β̃)∇h(x, ξ)] = E_P[φ∗″(α̃h(x, ξ) − β̃)α̃∇h(x, ξ)∇h(x, ξ)^⊤] + E_P[φ∗′(α̃h(x, ξ) − β̃)∇²h(x, ξ)]
for (α̃, β̃) in a neighborhood of 0. In particular, ∇E_P[∇h(x, ξ)] = E_P[∇²h(x, ξ)].
Moreover,
E_P[φ∗″(α̃h(x∗, ξ) − β̃)α̃∇h(x∗, ξ)∇h(x∗, ξ)^⊤] + E_P[φ∗′(α̃h(x∗, ξ) − β̃)∇²h(x∗, ξ)] → E_P[∇²h(x∗, ξ)]
as (α̃, β̃) → 0, so Condition 4 in Theorem EC.1 is satisfied.
Condition 5: Since x̂_n^λ satisfies E_{P̂_n}[φ∗′(α̃_n^λh(x, ξ) − β̃_n^λ)∇h(x, ξ)] = 0 with probability tending to 1, Condition 5 is readily satisfied.
With all the conditions verified, we invoke Theorem EC.1 to obtain
x̂_n^{λ_n} − x∗
= −(E_P[∇²h(x∗, ξ)])^{-1} E_P[φ∗′(α̃_n^{λ_n}h(x∗, ξ) − β̃_n^{λ_n})∇h(x∗, ξ)] − (E_P[∇²h(x∗, ξ)])^{-1} E_{P̂_n}[∇h(x∗, ξ)] + o_p(1/√n + ‖E_P[φ∗′(α̃_n^{λ_n}h(x∗, ξ) − β̃_n^{λ_n})∇h(x∗, ξ)]‖)
= −(E_P[∇²h(x∗, ξ)])^{-1}(E_P[∇h(x∗, ξ)] + E_P[φ∗″(ζ_n)(α̃_n^{λ_n}h(x∗, ξ) − β̃_n^{λ_n})∇h(x∗, ξ)]) − (E_P[∇²h(x∗, ξ)])^{-1} E_{P̂_n}[∇h(x∗, ξ)] + o_p(1/√n + ‖E_P[φ∗″(ζ_n)(α̃_n^{λ_n}h(x∗, ξ) − β̃_n^{λ_n})∇h(x∗, ξ)]‖)
= −(E_P[∇²h(x∗, ξ)])^{-1} E_P[φ∗″(ζ_n)(α̃_n^{λ_n}h(x∗, ξ) − β̃_n^{λ_n})∇h(x∗, ξ)] − (E_P[∇²h(x∗, ξ)])^{-1} E_{P̂_n}[∇h(x∗, ξ)] + o_p(1/√n + ‖E_P[φ∗″(ζ_n)(α̃_n^{λ_n}h(x∗, ξ) − β̃_n^{λ_n})∇h(x∗, ξ)]‖)   (EC.23)
where ζ_n lies between 0 and α̃_n^{λ_n}h(x∗, ξ) − β̃_n^{λ_n}, and we have used the first-order optimality condition E_P[∇h(x∗, ξ)] = 0 derived previously. Now, substituting (EC.20) and (EC.21), and using the continuity of φ∗″ at 0 (Lemma 2), the uniform boundedness of h and ∇h (Assumption EC.7.1) and the characterizations in (EC.15) and (EC.19), we invoke the bounded convergence theorem to obtain
E_P[φ∗″(ζ_n)(α̃_n^{λ_n}h(x∗, ξ) − β̃_n^{λ_n})∇h(x∗, ξ)]
= φ∗″(0) E_P[√(2λ_n/(φ∗″(0)Var_P(h(x∗, ξ))))(h(x∗, ξ) − E_P[h(x∗, ξ)])∇h(x∗, ξ)](1 + o_p(1))
= √(2λ_nφ∗″(0)/Var_P(h(x∗, ξ))) Cov_P(h(x∗, ξ), ∇h(x∗, ξ))(1 + o_p(1))
Proof of Corollary 1. The result is immediate by using, e.g., Theorem 10 of Kuhn et al. (2019) to obtain (23), to which we then apply Theorem 3.
Now let
r(h, λ) = x(θ∗ + h, λ) − x(θ∗, 0) − (∇_x²ψ(x∗, θ∗))^{-1}∇_{xθ}ψ(x∗, θ∗)h − (∇_x²ψ(x∗, θ∗))^{-1}∇_xR(x∗)λ
By the continuous differentiability of x(·, ·) argued above, we have r(h, λ) = o(‖(h, λ)‖) as (h, λ) → 0. Thus, by the definition of convergence in probability, we have
x(θ̂_n, λ) − x(θ∗, 0) − (∇_x²ψ(x∗, θ∗))^{-1}∇_{xθ}ψ(x∗, θ∗)(θ̂_n − θ∗) − (∇_x²ψ(x∗, θ∗))^{-1}∇_xR(x∗)λ
= r(θ̂_n − θ∗, λ) = o_p(‖(θ̂_n − θ∗, λ)‖) = o_p(‖θ̂_n − θ∗‖ + |λ|) = o_p(1/√n + λ)
Noting that x̂_n^{P-Reg,λ} = x(θ̂_n, λ), this concludes the theorem.
Assumption EC.12 (Additional regularity conditions for the penalty). ∇_x²R(x) is uniformly bounded over x ∈ X.
for n → ∞ and λ → 0.
Consider the j-th component of E_{Θ|D_n}[∇_xψ(x, Θ)], namely
E_{Θ|D_n}[∇_x^jψ(x, θ̂_n + W_n/√n)]   (EC.25)
where ∇_x^j for j = 1, . . . , d denotes the j-th component of the gradient. By the mean value theorem and Assumption EC.11, we can write (EC.25) as
∇_x^jψ(x, θ̂_n) + E_{Θ|D_n}[∇_{xθ}^jψ(x, θ̂_n + ζ_n^j)(W_n/√n)]   (EC.26)
where ∇_{xθ}^jψ(x, θ) denotes the j-th row of ∇_{xθ}ψ(x, θ), and ζ_n^j is between 0 and W_n/√n. By the uniform boundedness of ∇_{xθ}ψ(x, θ) in Assumption EC.11 and the almost sure uniform integrability of W_n, we have ∇_{xθ}^jψ(x, θ̂_n + ζ_n^j)W_n also a.s. uniformly integrable. Thus, together with the continuity of ∇_{xθ}ψ(x, θ) in Assumption EC.11, θ̂_n →_p θ∗ as implied by (24), and ‖W_n − N(0, J)‖_{TV} →_p 0 in (26), we have
E_{Θ|D_n}[∇_{xθ}^jψ(x, θ̂_n + ζ_n^j)W_n] →_p 0
Thus,
E_{Θ|D_n}[∇_xψ(x, Θ)] = ∇_xψ(x, θ̂_n) + o_p(1/√n)   (EC.27)
uniformly over x ∈ X.
Now, define x̂_n^{P-Reg,λ} = x(θ̂_n, λ) as in the proof of Theorem 5, and x̂_n^{P-Bay,λ} as the root of
E_{Θ|D_n}[∇_xψ(x, Θ)] + λ∇_xR(x) = 0
which exists with probability converging to 1 as n → ∞ and coincides with the definition in Assumption EC.13, which is well-defined by using Assumption EC.11.
Now, by the first-order optimality conditions for x̂_n^{P-Reg,λ} and x̂_n^{P-EO},
∇_xψ(x̂_n^{P-Reg,λ}, θ̂_n) − ∇_xψ(x̂_n^{P-EO}, θ̂_n) = −λ∇_xR(x̂_n^{P-Reg,λ})
By the delta method, and together with the uniform boundedness of ∇_x²ψ(x, θ) and the continuity of ∇_x²ψ(x, θ) in x uniformly over θ in Assumption EC.11, we can rewrite the above as
∇_x²ψ(x̂_n^{P-EO}, θ̂_n)(x̂_n^{P-Reg,λ} − x̂_n^{P-EO}) + o_p(‖x̂_n^{P-Reg,λ} − x̂_n^{P-EO}‖) = −λ∇_xR(x̂_n^{P-Reg,λ})
Thus
‖x̂_n^{P-Reg,λ} − x̂_n^{P-EO}‖ ≤ ‖∇_x²ψ(x̂_n^{P-EO}, θ̂_n)^{-1}‖‖∇_x²ψ(x̂_n^{P-EO}, θ̂_n)(x̂_n^{P-Reg,λ} − x̂_n^{P-EO})‖ = O(λ) + o_p(‖x̂_n^{P-Reg,λ} − x̂_n^{P-EO}‖)
by using the uniform boundedness of the eigenvalues of ∇_x²ψ(x, θ) away from 0 and ∞ in Assumption EC.11, and the uniform boundedness of ∇_xR(x) in Assumption EC.12. Thus x̂_n^{P-Reg,λ} − x̂_n^{P-EO} →_p 0 as λ → 0. We then have
x̂_n^{P-Reg,λ} − x̂_n^{P-EO} = −λ∇_x²ψ(x̂_n^{P-EO}, θ̂_n)^{-1}∇_xR(x̂_n^{P-Reg,λ})(1 + o_p(1))   (EC.28)
Similarly, by the first-order optimality condition in Assumption EC.13, Assumption EC.11 and (EC.27), we have
∇_xψ(x̂_n^{P-Bay,λ}, θ̂_n) − ∇_xψ(x̂_n^{P-Bay,0}, θ̂_n) = −λ∇_xR(x̂_n^{P-Bay,λ}) + o_p(1/√n)
So, by the delta method, and together with the uniform boundedness of ∇_x²ψ(x, θ) in Assumption EC.11, we have
∇_x²ψ(x̂_n^{P-Bay,0}, θ̂_n)(x̂_n^{P-Bay,λ} − x̂_n^{P-Bay,0}) + o_p(‖x̂_n^{P-Bay,λ} − x̂_n^{P-Bay,0}‖) = −λ∇_xR(x̂_n^{P-Bay,λ}) + o_p(1/√n)
Then we have
‖x̂_n^{P-Bay,λ} − x̂_n^{P-Bay,0}‖ ≤ ‖∇_x²ψ(x̂_n^{P-Bay,0}, θ̂_n)^{-1}‖‖∇_x²ψ(x̂_n^{P-Bay,0}, θ̂_n)(x̂_n^{P-Bay,λ} − x̂_n^{P-Bay,0})‖ = O(λ) + o_p(1/√n) + o_p(‖x̂_n^{P-Bay,λ} − x̂_n^{P-Bay,0}‖)
by using the uniform boundedness of the eigenvalues of ∇_x²ψ(x, θ) away from 0 and ∞ in Assumption EC.11, and the uniform boundedness of ∇_xR(x) in Assumption EC.12. Thus x̂_n^{P-Bay,λ} − x̂_n^{P-Bay,0} →_p 0 as λ → 0 and n → ∞. We then have
x̂_n^{P-Bay,λ} − x̂_n^{P-Bay,0} = −λ∇_x²ψ(x̂_n^{P-Bay,0}, θ̂_n)^{-1}∇_xR(x̂_n^{P-Bay,λ}) + o_p(1/√n)   (EC.29)
Now, by the delta method like in the proof of Theorem 5, and together with the continuity and uniform boundedness of ∇_x²ψ(x, θ) in Assumption EC.11, we have
∇_xψ(x̂_n^{P-Bay,0}, θ̂_n) − ∇_xψ(x̂_n^{P-EO}, θ̂_n) = ∇_x²ψ(x̂_n^{P-EO}, θ̂_n)(x̂_n^{P-Bay,0} − x̂_n^{P-EO}) + o_p(‖x̂_n^{P-Bay,0} − x̂_n^{P-EO}‖)
Note also that by (EC.27) and the first-order optimality conditions in Assumption EC.9.1 and Assumption EC.11,
∇_xψ(x̂_n^{P-Bay,0}, θ̂_n) + o_p(1/√n) = 0 = ∇_xψ(x̂_n^{P-EO}, θ̂_n)
so that
∇_xψ(x̂_n^{P-Bay,0}, θ̂_n) − ∇_xψ(x̂_n^{P-EO}, θ̂_n) = o_p(1/√n)
Thus
‖x̂_n^{P-Bay,0} − x̂_n^{P-EO}‖ ≤ ‖∇_x²ψ(x̂_n^{P-EO}, θ̂_n)^{-1}‖‖∇_x²ψ(x̂_n^{P-EO}, θ̂_n)(x̂_n^{P-Bay,0} − x̂_n^{P-EO})‖ = o_p(1/√n) + o_p(‖x̂_n^{P-Bay,0} − x̂_n^{P-EO}‖)
by using the uniform boundedness of the eigenvalues of ∇_x²ψ(x, θ) away from 0 and ∞ in Assumption EC.11. This gives
x̂_n^{P-Bay,0} − x̂_n^{P-EO} = o_p(1/√n)   (EC.30)
Combining (EC.28), (EC.29) and (EC.30), we obtain
x̂_n^{P-Bay,λ} − x̂_n^{P-Reg,λ}
= (x̂_n^{P-Bay,λ} − x̂_n^{P-Bay,0}) + (x̂_n^{P-Bay,0} − x̂_n^{P-EO}) + (x̂_n^{P-EO} − x̂_n^{P-Reg,λ})
= −λ∇_x²ψ(x̂_n^{P-Bay,0}, θ̂_n)^{-1}∇_xR(x̂_n^{P-Bay,λ}) + λ∇_x²ψ(x̂_n^{P-EO}, θ̂_n)^{-1}∇_xR(x̂_n^{P-Reg,λ}) + o_p(1/√n)
= o_p(1/√n + λ)
by the consistency x̂_n^{P-Bay,λ}, x̂_n^{P-Reg,λ}, x̂_n^{P-Bay,0}, x̂_n^{P-EO} →_p x∗ implied by these equations, the continuity of ∇_x²ψ(x, θ) in Assumption EC.11 and of ∇_xR(x) in Assumption EC.12. Together with the conclusions from Theorem 5, we prove the theorem.
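The closeness in (EC.30) between the Bayesian and plug-in parametric solutions can be seen in a conjugate example. The sketch below (a normal-normal illustration of ours, with assumed prior and data parameters, not from the paper) takes ψ(x, θ) = (x − θ)², so x̂_n^{P-EO} is the MLE and x̂_n^{P-Bay,0} is the posterior mean; their difference is O(1/n), hence o_p(1/√n):

import numpy as np

rng = np.random.default_rng(5)
sigma, mu0, tau = 1.0, 0.0, 1.0          # assumed likelihood sd and N(mu0, tau^2) prior
for n in [100, 1_000, 10_000]:
    data = rng.normal(2.0, sigma, size=n)
    mle = data.mean()                    # x_hat^{P-EO} under psi(x, theta) = (x - theta)^2
    post = (n * mle / sigma ** 2 + mu0 / tau ** 2) / (n / sigma ** 2 + 1 / tau ** 2)
    print(f"n = {n:6d}: sqrt(n) * |Bay - EO| = {np.sqrt(n) * abs(post - mle):.5f}")

The scaled difference √n|x̂_n^{P-Bay,0} − x̂_n^{P-EO}| visibly vanishes as n grows, matching (EC.30).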