¹ yintat@uw.edu. This work is supported in part by CCF-1749609, DMS-1839116, DMS-2023166, Microsoft Research Faculty Fellowship, Sloan Research Fellowship and Packard Fellowships.
² vempala@gatech.edu. This work is supported in part by CCF-1563838, E2CDA-1640081, CCF-1717349 and DMS-1839323.
Contents
0.1 Examples
1 Introduction
1.1 Why non-convex functions can be difficult to optimize
1.2 Why is convexity useful? Linear Separability!
1.3 Convex problems are everywhere!
1.4 Examples of convex sets and functions
1.5 Checking convexity
1.6 Subgradients
1.7 Logconcave functions
I Optimization
2 Gradient Descent
2.1 Philosophy
2.2 Basic Algorithm
2.3 Analysis for convex functions
2.4 Strongly Convex Functions
2.5 Line Search
2.6 Generalizing Gradient Descent*
2.7 Gradient Flow
2.8 Discussion
3 Elimination
3.1 Cutting Plane Methods
3.2 Ellipsoid Method
3.3 From Volume to Function Value
3.4 Center of Gravity Method
3.5 Sphere and Parabola Methods
3.6 Lower Bounds
4 Reduction
4.1 Equivalences between Oracles
4.2 Gradient from Evaluation via Finite Difference
4.3 Separation via Membership
4.4 Composite Problem via Duality
5 Geometrization
5.1 Norms and Local Metrics
5.2 Mirror Descent
5.3 Frank–Wolfe
5.4 The Newton Method
5.5 Interior Point Method for Linear Programs
5.6 Interior Point Method for Convex Programs
6 Sparsification
6.1 Subspace embedding
6.2 Leverage Score Sampling
6.3 Stochastic Gradient Descent
6.4 Coordinate Descent
7 Acceleration
7.1 Chebyshev Polynomials
7.2 Conjugate Gradient
7.3 Accelerated Gradient Descent via Plane Search
7.4 Accelerated Gradient Descent
7.5 Accelerated Coordinate Descent
7.6 Accelerated Stochastic Descent
II Sampling
8 Gradient-based Sampling
8.1 Gradient-based methods: Langevin Dynamics
8.2 Langevin Dynamics is Gradient Descent in Density Space*
10 Geometrization
10.1 Basics of Markov chains
10.2 Conductance of the Ball Walk
10.3 Generating a warm start
10.4 Isotropic Transformation
10.5 Isoperimetry via localization
10.6 Hit-and-Run
10.7 Dikin walk
10.8 Mixing with Strong Self-Concordance
10.9 Hamiltonian Monte Carlo
11 Annealing
11.1 Simulated Annealing
11.2 Volume Computation
B Notation
B Notation
We use B(x, r) to denote the Euclidean (or ℓ₂) ball centered at x with radius r: {y : ∥y − x∥₂ ≤ r}. We use conv(X) to denote the convex hull of X, namely conv(X) = {∑ᵢ αᵢxᵢ : αᵢ ≥ 0, ∑ᵢ αᵢ = 1, xᵢ ∈ X}. We use ⟨x, y⟩ to denote the ℓ₂ inner product x⊤y of x and y. For any two points x, y ∈ Rⁿ, we view them as column vectors, and use [x, y] to denote conv({x, y}), namely, the line segment between x and y. Unless specified otherwise, ∥x∥ will be the ℓ₂ norm ∥x∥₂.
We use eᵢ to denote the coordinate vector whose i-th coordinate is 1 and all other coordinates are 0.
Functions
Definition 0.1. For any L ≥ 0, a function f : V → W is L-Lipschitz if ∥f(x) − f(y)∥_W ≤ L∥x − y∥_V, where the norms ∥·∥_V and ∥·∥_W are ℓ₂ norms if unspecified.
Definition 0.2. A function f ∈ C^k(Rⁿ) if f is k-times differentiable and its k-th derivative is continuous.
Definition 0.3. A function f : X ⊆ Rⁿ → R is called lower semi-continuous at a point x₀ ∈ X if for every real y < f(x₀) there exists a neighborhood U of x₀ such that f(x) > y for all x ∈ U. Equivalently, f is lower semi-continuous at x₀ iff
lim inf_{x→x₀} f(x) ≥ f(x₀).
The function is lower semi-continuous if it is so at every point in its domain. The definition of upper semi-continuity is analogous, with the inequalities reversed and lim inf replaced by lim sup.
Theorem 0.4 (Taylor's Remainder Theorem). For any g ∈ C^{k+1}(R) and any x and y, there is a ζ ∈ [x, y] such that
g(y) = ∑_{j=0}^{k} g^{(j)}(x) (y − x)^j / j! + g^{(k+1)}(ζ) (y − x)^{k+1} / (k + 1)!.
Linear Algebra
Definition 0.6. A real symmetric matrix A is positive semi-definite (PSD) if x⊤Ax ≥ 0 for all x ∈ Rⁿ. Equivalently, a real symmetric matrix is PSD iff all its eigenvalues are nonnegative. We write A ⪰ 0 if A is PSD, and A ⪰ B if A − B is PSD.
Definition 0.7. For any matrix A, we define its trace trA = ∑ᵢ Aᵢᵢ, Frobenius norm ∥A∥_F² = tr(A⊤A) = ∑ᵢ,ⱼ Aᵢⱼ², and operator norm ∥A∥_op = sup_{∥x∥₂=1} ∥Ax∥₂. Note that x⊤Ax = tr(Axx⊤) and, in general, tr(AB) = tr(BA).
For symmetric A, we have trA = ∑ᵢ λᵢ, ∥A∥_F² = ∑ᵢ λᵢ², and ∥A∥_op = maxᵢ |λᵢ|, where the λᵢ are the eigenvalues of A.
For a vector x, ∥x∥_A = √(x⊤Ax); for a matrix B,
∥B∥_A = sup_x ∥Bx∥_A / ∥x∥_A.
3
Contents 4
For matrices of appropriate dimensions, the Woodbury matrix identity states that
(A + UV⊤)⁻¹ = A⁻¹ − A⁻¹U(I + V⊤A⁻¹U)⁻¹V⊤A⁻¹.
Probability
Definition 0.9. The total variation (TV) or ℓ₁-distance between two distributions with densities ρ, ν supported on Ω is
d_TV(ρ, ν) = (1/2) ∫_Ω |ρ(x) − ν(x)| dx = sup_{S⊆Ω} |ρ(S) − ν(S)| = sup_{S⊆Ω} (ρ(S) − ν(S)).
Definition 0.11. The χ-squared distance of a density ρ with respect to another density ν is defined as
χ²(ρ, ν) = E_ν[(ρ(x)/ν(x) − 1)²] = E_ρ[ρ(x)/ν(x)] − 1.
Definition 0.12. The Wasserstein p-distance between two probability measures ρ, ν over a metric space M is defined as
W_p(ρ, ν) = inf_{γ∈Γ(ρ,ν)} (E_{(x,y)∼γ}[d(x, y)^p])^{1/p},
where Γ(ρ, ν) is the set of all couplings of ρ and ν (joint probability measures with support M × M whose marginals are ρ and ν).
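To make these definitions concrete, here is a small numerical sketch (assuming numpy; the discretized Gaussian densities are illustrative choices, not part of the text) computing the TV and χ-squared distances between two one-dimensional densities on a grid.

```python
import numpy as np

# Two densities on a common grid, normalized so that sum(p) * dx = 1.
xs = np.linspace(-8, 8, 4001)
dx = xs[1] - xs[0]
rho = np.exp(-xs**2 / 2)               # standard Gaussian, unnormalized
nu = np.exp(-(xs - 1)**2 / 2)          # Gaussian shifted by 1
rho /= rho.sum() * dx
nu /= nu.sum() * dx

# d_TV(rho, nu) = (1/2) * integral of |rho - nu|
tv = 0.5 * np.abs(rho - nu).sum() * dx

# chi^2(rho, nu) = E_nu[(rho/nu - 1)^2] = E_rho[rho/nu] - 1
chi2 = ((rho / nu - 1)**2 * nu).sum() * dx

print(tv)    # about 0.383 for two unit-variance Gaussians at distance 1
print(chi2)  # about e - 1 ~ 1.718 for this pair
```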
Definition 0.13. The marginal of a distribution in Rⁿ with density ν in the span of a k-dimensional subspace V is defined as
g(x) = ∫_{y∈V⊥} ν(x, y) dy
for any x ∈ V.
Note that the marginal is a convolution of the density with the indicator function of the subspace.
Definition 0.14. A distribution D with support contained in Rⁿ is said to be isotropic if E_D(x) = 0 and E_D(xx⊤) = I.
Geometry
Definition 0.15. We denote the unit Euclidean ball as Bⁿ = {x : ∥x∥₂ ≤ 1}. More generally, we define the ℓp-norm ball of radius r centered at z as B_pⁿ(z, r) = {x : ∥x − z∥_p ≤ r}.
0.1 Examples
Here is a list we want to cover:
- linear systems, SDD, M-matrices, directed Laplacians, multi-commodity flow, totally unimodular matrices (what other linear systems?)
- logistic regression, other regressions, ℓp regression, convex regression
- linear program < quadratic program < second order cone program < semidefinite program < conic program < convex program
- Example: John ellipsoid, minimum enclosing ball, geometric programming and matrix scaling?
- Shortest Path, maximum flow, min cost flow
- Example: Transportation
- Markov Decision Process
- matroid intersection, submodular minimization
Chapter 1
Introduction
In this book, we will study two topics involving convexity, namely optimization and sampling. Given a
multivariate, real-valued function f ,
1. How quickly can we find a point that minimizes f?
2. How fast can we sample a point according to the distribution with density defined by f, i.e., proportional to e^{−f}?
Optimization appears naturally across mathematics, the sciences and engineering for a variety of theoretical and practical reasons. Its study over centuries has been extremely fruitful. Sampling is motivated by the question of choosing a representative point or subset, rather than an extremal point. Rather than a feasible set, we have a distribution which assigns probabilities to subsets. The goal is to sample a point from a target distribution, i.e., the output point should lie in a given subset with probability equal to the probability of the set in the target distribution. These problems are quite closely connected; e.g., sampling from such distributions can be used to find near-optimal points. Both problems are intractable in full generality, and have exponential (in the dimension) complexity even under smoothness assumptions.
Convexity and its natural extensions are a current frontier of tractable, i.e., polynomial-time, computation. The assumption of convexity induces structure in instances that makes them amenable to efficient algorithms. For example, any local minimum of a convex function is a global minimum. Convexity is maintained by natural operations such as intersection (for sets) and addition (for functions). Perhaps less obvious, but also crucial, is that convex sets can be approximated by ellipsoids in various ways.
We will learn several techniques that lead us to polynomial-time algorithms for both problems, and (nearly) linear-time algorithms for the case when f is close to a quadratic function.
Although convex optimization has been studied since the 19th century¹ with many tight results emerging, there are still many basic open problems. Here is an example:
Open Problem. Given an n × n random 0/1 matrix A with O(n) nonzero entries and a 0/1 vector b, can we solve Ax = b in o(n²) time?
Computing the volume is an ancient problem; the early Egyptians and Greeks developed formulas for specific shapes of interest. Unlike convex optimization, even computing the volume of a convex body is intractable, as we will see later. Nevertheless, there are efficient randomized algorithms that can estimate the volume of convex bodies to arbitrary accuracy in time polynomial in the dimension and the desired accuracy. This extends to efficient algorithms for integrating logconcave functions, i.e., functions of the form e^{−f} where f is convex. The core ingredient is sampling in high dimension. Sampling and volume computation will be the motivating problems for the second part of this book. Again, many basic problems remain open. To illustrate:
Open Problem. Given a polytope defined by {x : Ax ≤ b}, can we estimate its volume to within a constant factor in nearly linear time?
¹ Augustin-Louis Cauchy introduced gradient descent in 1847.
1.1 Why non-convex functions can be difficult to optimize
where B(0ₙ, 1) is the unit ball centered at the origin 0ₙ. This function is 1-Lipschitz, i.e., for any x, y, we have |f(x) − f(y)| ≤ ∥x − y∥, and unless we query f(x) with x that is ϵ-close to x∗, it will always return ϵ. Since the region where f is not ϵ has volume ϵⁿ times the volume of the unit ball, one can show that it takes Ω((1/ϵ)ⁿ) calls to f to find x∗. The following exercise asks you to prove that this bound is tight.
Exercise 1.1. Show that if f is 1-Lipschitz on B(0ₙ, 1), we can find x∗ such that f(x∗) − min_x f(x) ≤ ϵ by evaluating f(x) at O(1/ϵ)ⁿ points.
Thus O(1/ϵ)ⁿ is the best possible bound for optimizing general 1-Lipschitz functions. Similar constructions can also be made for infinitely differentiable functions. We note that it is easy to find local minima for all the functions above. In Section 2.2, we will show that it is easy to find an approximate local minimum of a continuously differentiable function.
Definition 1.2. A set K ⊆ Rⁿ is convex if for every pair of points x, y ∈ K, we have [x, y] ⊆ K, where [x, y] = {(1 − t)x + ty : t ∈ [0, 1]} is the one-dimensional interval from x to y.
Definition 1.3. A function f : Rⁿ → R ∪ {+∞} is convex if for every x, y ∈ Rⁿ and t ∈ [0, 1], we have
f(tx + (1 − t)y) ≤ tf(x) + (1 − t)f(y).
1.2 Why is convexity useful? Linear Separability!
Exercise 1.4. Suppose we have a function f : Rⁿ → R ∪ {+∞} that has the property that for every x, y ∈ Rⁿ,
f((x + y)/2) ≤ (f(x) + f(y))/2.
Show that if f is also continuous, then f is convex.
Definition 1.5. An optimization problem min_{x∈K} f(x) is convex if K and f are convex.
Any point not in a closed convex set can be separated from the set by a hyperplane. We will see in Chapter 3 that separating hyperplanes allow us to do a form of binary search to find a point in a convex set; this is the basis of all polynomial-time algorithms for optimizing general convex functions. We will explore these binary search algorithms in Chapter 3. The following notions will be used routinely. A halfspace in Rⁿ is defined as a set of the form {x : ⟨a, x⟩ ≥ b} for some a ∈ Rⁿ, b ∈ R. A polyhedron is an intersection of halfspaces. A polytope is the convex hull of a finite set of points.
Theorem 1.6 (Hyperplane separation theorem). Let K be a nonempty closed convex set in Rⁿ and y ∉ K. Then there is a nonzero θ ∈ Rⁿ such that
⟨θ, y⟩ > max_{x∈K} ⟨θ, x⟩.
Proof. Let x∗ be a point in K closest to y, namely x∗ ∈ arg min_{x∈K} ∥x − y∥₂² (such a minimizer always exists for closed convex sets and is unique; this is sometimes called Hilbert's projection theorem but can be proved directly for this setting). Using convexity of K, for any x ∈ K and any 0 < t ≤ 1, we have that (1 − t)x∗ + tx ∈ K and hence
∥y − (1 − t)x∗ − tx∥₂² ≥ min_{x∈K} ∥y − x∥₂² = ∥y − x∗∥₂².
1.2. Why is convexity useful? Linear Separability! 9
Expanding the left-hand side and letting t → 0 shows that ⟨y − x∗, x∗ − x⟩ ≥ 0 for all x ∈ K. Now take θ = y − x∗, which is nonzero since y ∉ K. For any x ∈ K,
⟨θ, y − x⟩ = ⟨θ, y − x∗⟩ + ⟨y − x∗, x∗ − x⟩ = ∥θ∥₂² + ⟨y − x∗, x∗ − x⟩ > 0,
so ⟨θ, y⟩ > ⟨θ, x⟩ for every x ∈ K, as claimed.
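The proof above is constructive: project y onto K and take θ = y − x∗. A minimal numerical sketch of this construction, assuming numpy and using a box for K (an illustrative choice, since Euclidean projection onto a box is coordinatewise clipping):

```python
import numpy as np

def separating_hyperplane(y, lo, hi):
    """For y outside the box K = [lo, hi]^n, return theta with
    <theta, y> > max_{x in K} <theta, x>, following the proof of Theorem 1.6."""
    x_star = np.clip(y, lo, hi)   # Euclidean projection of y onto K
    theta = y - x_star            # normal vector of the separating hyperplane
    # max_{x in K} <theta, x> is attained coordinatewise at a box corner
    max_over_K = np.sum(np.where(theta > 0, hi * theta, lo * theta))
    assert theta @ y > max_over_K
    return theta

print(separating_hyperplane(np.array([2.0, -3.0, 0.5]), lo=-1.0, hi=1.0))
# [ 1. -2.  0.]: points from the box toward y
```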
Corollary 1.7. Any closed convex set K can be written as an intersection of halfspaces:
K = ⋂_{θ∈Rⁿ} {x : ⟨θ, x⟩ ≤ max_{y∈K} ⟨θ, y⟩}.
In other words, any closed convex set is the limit of a sequence of polyhedra.
Proof. Let L =def ⋂_{θ∈Rⁿ} {x : ⟨θ, x⟩ ≤ max_{y∈K} ⟨θ, y⟩}. Since K ⊆ {x : ⟨θ, x⟩ ≤ max_{y∈K} ⟨θ, y⟩} for every θ, we have K ⊆ L.
For any x ∉ K, Theorem 1.6 shows that there is a θ such that θ⊤x > max_{y∈K} θ⊤y. Hence x ∉ L, and therefore L ⊆ K.
This shows that convex optimization is related to linear programming (optimizing linear functions over polytopes) as follows:
min_{x∈K} f(x) = min_{(x,y): x∈K, y∈R, y≥f(x)} y,
where the set {(x, y) : x ∈ K, y ∈ R, y ≥ f(x)} can then be approximated by an intersection of halfspaces, namely {(x, y) : A(x; y) ≤ b} for some matrix A ∈ R^{m×(n+1)} and vector b ∈ R^m with m → +∞.
Exercise 1.8. Let A, B ⊂ Rⁿ be nonempty disjoint closed convex sets. Show that there exists a vector v ∈ Rⁿ such that sup_{x∈A} v⊤x ≤ inf_{x∈B} v⊤x. Show that strict inequality holds if the sets are also bounded.
Analogous to Theorem 1.6 for convex sets, we have a separation theorem for convex functions. In Chapter 3, we will see that this allows us to use binary search to minimize convex functions.
Theorem 1.9. Let f ∈ C¹(Rⁿ) be convex. Then, for any x, y ∈ Rⁿ,
f(y) ≥ f(x) + ∇f(x)⊤(y − x).
Later in this chapter we will see that the above condition is in fact an equivalent definition of convexity.
Proof. Fix any x, y ∈ Rⁿ. Let g(t) = f((1 − t)x + ty). Since f is convex, it is convex along every line, in particular over the segment [x, y], and so g is convex over [0, 1]. Then, we have
g(t) ≤ (1 − t)g(0) + tg(1) for all t ∈ [0, 1], so g′(0) = lim_{t→0⁺} (g(t) − g(0))/t ≤ g(1) − g(0).
Since g′(0) = ∇f(x)⊤(y − x), g(0) = f(x) and g(1) = f(y), this is the claimed inequality.
This theorem shows that ∇f(x) = 0 (a local minimality condition) implies that x is a global minimum.
Theorem 1.10 (Optimality condition for unconstrained problems). Let f ∈ C 1 (Rn ) be convex. Then,
x ∈ Rn is a minimizer of f (x) if and only if ∇f (x) = 0.
1.3 Convex problems are everywhere!
We note that the proof above is in fact constructive: if x is not a minimizer, it suggests a point with a better function value. This will be the subject of an upcoming section. For continuous convex functions, there is a weaker notion of gradient called the subdifferential, which is a set instead of a vector. Both theorems above hold with gradients replaced by subdifferentials.
where c ∈ Rⁿ is the cost vector, b ∈ Rᵐ is the intake requirement vector and A ∈ Rᵐˣⁿ is the matrix of nutrient contents of each food.²
² Unfortunately, as Nobel laureate Stigler showed [69], the optimal meal is evaporated milk, cabbage, dried navy beans, and beef liver.
Both the diet problem and the minimum cost flow problem can be reformulated into the form
min_{Ax=b, x≥0} c⊤x   (1.4)
for some vectors c, b and some matrix A. These problems are called linear programs and have many applications in resource allocation. Special cases of linear programs are also of great interest; for example, the diet problem is a packing/covering LP.
Exercise 1.11. Show how the minimum cost flow problem and the diet problem can be written in the form (1.4). Also, show that (1.4) is a convex problem.
This function is not convex in θ. More generally, one considers the objective function (to be minimized over
θ)
R(θ) = (1/n) ∑_{i=1}^{n} f(yᵢ⟨xᵢ, θ⟩) + λ∥θ∥₁   (1.5)
where f is some function such that f(z) is large when z is large and positive, and f(z) ≈ 0 when z is very negative; the term λ∥θ∥₁ is a regularization term that ensures θ is bounded. One popular choice is f(z) = log(1 + e^z), for which this problem is called logistic regression. In Section 1.5, we prove that the function (1.5) is indeed convex when f(z) = log(1 + e^z).
f = argmin_{g feasible} SurfaceArea(g)
where we say g is feasible if g(x, y) = f(x, y) for all (x, y) on the boundary of [0, 1]². One natural question (called Plateau's problem) is to find a minimal surface with a given boundary. For the particular case we consider here, we can simply use convex optimization. Note that the constraint (g is feasible) is exactly a linear subspace of the space of functions on [0, 1]². Furthermore, the objective is convex, by the fact that
SurfaceArea(f) = ∫₀¹ ∫₀¹ √(1 + (∂f(x, y)/∂x)² + (∂f(x, y)/∂y)²) dx dy.
Exercise 1.12. Show that the surface area is convex by using the definition.
Calculus of variations is the area of mathematics that studies optimization in function spaces; it shares many theorems with convex optimization.
1.4 Examples of convex sets and functions
This characterization shows that min_x f(x) is the same as min_{(x,t)∈epi f} t. Therefore, convex optimization is the same as optimizing a linear function over a convex set. Another important feature of convex functions is the following.
Fact 1.15. Any level set {x ∈ Rn : f (x) ≤ t} of a convex function f is convex.
In particular, this shows that the set of minimizers is connected. Therefore, any local minimum is a
global minimum. We note that the converse of the fact above is not true. A function is quasiconvex if every
level set is convex. Equivalently, a function f : Rn → R is quasiconvex if for every x, y ∈ Rn and t ∈ [0, 1],
f((1 − t)x + ty) ≤ max{f(x), f(y)}.
For example, Fig. 1.6 shows a function that is quasiconvex but not convex.
Finally, we note that many operations preserve convexity. Here is an example.
Exercise 1.16. Show that for a matrix A, a vector b, nonnegative scalars t₁, t₂ ≥ 0, and convex functions f₁ and f₂, the function g(x) = t₁f₁(Ax + b) + t₂f₂(x) is convex.
Here are some convex sets and functions. In Section 1.5, we illustrate how to check convexity.
Example. Convex sets: polyhedron {x : Ax ≤ b}; polytope conv({v₁, . . . , vₘ}) with v₁, . . . , vₘ ∈ Rⁿ; ellipsoid {x : x⊤Ax ≤ 1} with A ⪰ 0; positive semidefinite cone {X ∈ R^{n×n} : X ⪰ 0}; norm ball {x : ∥x∥_p ≤ 1} for all p ≥ 1.
Example. Convex functions: x; max(x, 0); eˣ; |x|^a for a ≥ 1; −log(x); x log x; ∥x∥_p for p ≥ 1; (x, y) → x²/y (for y > 0); A → −log det A over PSD matrices A; (x, Y) → x⊤Y⁻¹x (for Y ≻ 0); log ∑ᵢ e^{xᵢ}; −(∏ᵢ₌₁ⁿ xᵢ)^{1/n} (for x > 0).
Exercise 1.17. Show that the above sets and functions are all convex.
Exercise 1.18. Show that the intersection of convex sets is convex; show that for convex functions f, g, the function h(x) = max{f(x), g(x)} is also convex.
1.5 Checking convexity
Lemma 1.19. Let f ∈ C²(Rⁿ). Then, for any x, y ∈ Rⁿ, there is a z ∈ [x, y] such that
f(y) = f(x) + ∇f(x)⊤(y − x) + (1/2)(y − x)⊤∇²f(z)(y − x).
Proof. Let g(t) = f ((1 − t)x + ty). Taylor expansion (Theorem 0.4) shows that
g(1) = g(0) + g′(0) + (1/2)g″(ζ)
where ζ ∈ [0, 1]. To see the result, note that g(0) = f (x), g(1) = f (y), g ′ (0) = ∇f (x)⊤ (y − x) and
g ′′ (ζ) = (y − x)⊤ ∇2 f ((1 − ζ)x + ζy)(y − x).
Now, we show that f is convex if and only if ∇2 f (x) ⪰ 0 for all x.
Theorem 1.20. Let f ∈ C 2 (Rn ). Then, the following are equivalent:
1. f is convex.
2. f (y) ≥ f (x) + ∇f (x)⊤ (y − x) for all x, y ∈ Rn .
3. ∇2 f (x) ⪰ 0 for all x ∈ Rn .
Proof. We proved that (1) implies (2) in Theorem 1.9.
Suppose (2) holds. Then, for any x, h ∈ Rⁿ,
f(x + th) ≥ f(x) + t∇f(x)⊤h.
By Taylor expansion (Lemma 1.19), we have that
f(x + th) = f(x) + t∇f(x)⊤h + (t²/2) h⊤∇²f(z)h
where z ∈ [x, x + th]. Comparing the two expressions, we have that h⊤∇²f(z)h ≥ 0. Taking t → 0, we have z → x and hence ∇²f(z) → ∇²f(x). Therefore, we have that
h⊤∇²f(x)h ≥ 0
for all x and h. This gives (3).
Suppose (3) holds. Fix x, y ∈ Rn . Consider the function
g(λ) = f (λx + (1 − λ)y) − λf (x) − (1 − λ)f (y).
Consider λ∗ = argmaxλ∈[0,1] g(λ). If λ∗ is either 0 or 1, then we have g(λ∗ ) = 0. Otherwise, by Taylor's
theorem, there is a ζ ∈ [λ∗ , 1] such that
g(1) = g(λ∗) + g′(λ∗)(1 − λ∗) + (1/2)g″(ζ)(1 − λ∗)²
= g(λ∗) + (1/2)g″(ζ)(1 − λ∗)²
where we used that g ′ (λ∗ ) = 0. Note that
g ′ (ζ) = ∇f (ζx + (1 − ζ)y)⊤ (x − y) − f (x) + f (y),
g ′′ (ζ) = (x − y)⊤ ∇2 f (ζx + (1 − ζ)y)(x − y).
By the assumption (3), we have that g ′′ (ζ) ≥ 0 and hence 0 = g(1) ≥ g(λ∗ ). Hence, in both cases,
maxλ∈[0,1] g(λ) = g(λ∗ ) ≤ 0. This gives (1).
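Criterion (3) suggests a quick numerical sanity check for convexity: estimate the Hessian by finite differences at random points and verify that its smallest eigenvalue is (approximately) nonnegative. A sketch assuming numpy; this is a heuristic check, of course, not a proof.

```python
import numpy as np

def hessian_fd(f, x, eps=1e-4):
    """Finite-difference (central) estimate of the Hessian of f at x."""
    n = len(x)
    H = np.zeros((n, n))
    E = np.eye(n)
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(x + eps*E[i] + eps*E[j]) - f(x + eps*E[i] - eps*E[j])
                       - f(x - eps*E[i] + eps*E[j]) + f(x - eps*E[i] - eps*E[j]))
            H[i, j] /= 4 * eps**2
    return H

# log-sum-exp, one of the convex examples listed earlier
f = lambda x: np.log(np.sum(np.exp(x)))
rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.standard_normal(4)
    lam_min = np.linalg.eigvalsh(hessian_fd(f, x)).min()
    assert lam_min > -1e-6   # PSD up to numerical error, consistent with (3)
```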
Example 1.21. The function (1.5) is convex for f (z) = log(1 + exp(z)).
Proof. We write R(θ) = R₁(θ) + R₂(θ) where R₁(θ) = (1/n)∑ᵢ₌₁ⁿ f(yᵢ⟨xᵢ, θ⟩) and R₂(θ) = λ∥θ∥₁. It is easy to check that R₂ is convex, so it suffices to prove that R₁ is convex, which we do using Theorem 1.20. Note that
∇R₁(θ) = (1/n)∑ᵢ₌₁ⁿ f′(yᵢ⟨xᵢ, θ⟩) yᵢxᵢ,
∇²R₁(θ) = (1/n)∑ᵢ₌₁ⁿ f″(yᵢ⟨xᵢ, θ⟩) xᵢ(xᵢ)⊤
where we used (yᵢ)² = 1. Since xᵢ(xᵢ)⊤ ⪰ 0, it suffices to prove that f″(yᵢ⟨xᵢ, θ⟩) ≥ 0. This follows from the calculation: f′(z) = exp(z)/(1 + exp(z)) = 1 − 1/(1 + exp(z)) and f″(z) = exp(z)/(1 + exp(z))² ≥ 0.
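The gradient and Hessian formulas in this proof translate directly into code. A sketch assuming numpy (the data is synthetic), which also confirms numerically that ∇²R₁(θ) ⪰ 0:

```python
import numpy as np

def R1_grad_hess(theta, X, y):
    """Gradient and Hessian of R1(theta) = (1/n) sum_i f(y_i <x_i, theta>)
    for f(z) = log(1 + e^z), as computed in the proof above."""
    n = len(y)
    z = y * (X @ theta)                 # z_i = y_i <x_i, theta>
    fp = 1.0 / (1.0 + np.exp(-z))       # f'(z) = e^z / (1 + e^z)
    fpp = fp * (1.0 - fp)               # f''(z) = e^z / (1 + e^z)^2
    grad = (X * (fp * y)[:, None]).sum(axis=0) / n
    hess = (X.T * fpp) @ X / n          # (1/n) sum_i f''(z_i) x_i x_i^T
    return grad, hess

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = rng.choice([-1.0, 1.0], size=50)
_, H = R1_grad_hess(rng.standard_normal(5), X, y)
print(np.linalg.eigvalsh(H).min())      # >= 0 up to rounding: Hessian is PSD
```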
1.6 Subgradients
The standard definition of a convex function in terms of gradients requires differentiability. However, a more general definition allows us to avoid this requirement. For a convex function f : Rⁿ → R, we say that g : Rⁿ → Rⁿ is a subgradient if for any x, y ∈ Rⁿ,
f(y) − f(x) ≥ ⟨g(x), y − x⟩.
For the purpose of optimization algorithms, in almost all cases, a subgradient will suffice in place of a gradient.
1.7 Logconcave functions
A nonnegative function f : Rⁿ → R₊ is logconcave if its logarithm is concave, i.e., for any x, y ∈ Rⁿ and λ ∈ [0, 1], f(λx + (1 − λ)y) ≥ f(x)^λ f(y)^{1−λ}.
Exercise 1.23. Show that for any t ≥ 0, the level set L(t) = {x : f(x) ≥ t} of a logconcave function f is convex.
Example 1.24. The indicator function of a convex set K, defined by 1_K(x) = 1 if x ∈ K and 0 otherwise, is logconcave. The Gaussian density function is logconcave. The Gaussian density restricted to any convex set is logconcave.
To see that the indicator function of a convex set K is logconcave, simply consider two points x, y in the three cases: (1) both lie in K, (2) both lie outside K, and (3) one is in K and one is outside. Now check the value of the indicator along any convex combination of x and y.
Lemma 1.25 (Dinghas; Prékopa; Leindler). The product, minimum and convolution of two logconcave functions are also logconcave; in particular, any linear transformation or any marginal of a logconcave density is logconcave; the distribution function of any logconcave density is logconcave.
We next describe the basic theorem underlying the above properties. We will see their proofs in a later
chapter.
Theorem 1.26 (Prékopa–Leindler). Fix λ ∈ [0, 1]. Let f, g, h : Rⁿ → R₊ be functions satisfying h(λx + (1 − λ)y) ≥ f(x)^λ g(y)^{1−λ} for all x, y ∈ Rⁿ. Then,
∫_{Rⁿ} h ≥ (∫_{Rⁿ} f)^λ (∫_{Rⁿ} g)^{1−λ}.
An equivalent version of the lemma for sets in Rn is often useful. By a measurable set below, we mean
Lebesgue measurable, which coincides with the denition of volume (for an axis aligned box, it is the product
of the axis lengths; for any other set, it is the limit over increasingly ner partitions into boxes, of the sum
of volumes of boxes that intersect the set).
Theorem 1.27 (Brunn–Minkowski). For any λ ∈ [0, 1] and measurable sets A, B ⊂ Rⁿ such that λA + (1 − λ)B is measurable, we have
vol(λA + (1 − λ)B)^{1/n} ≥ λ vol(A)^{1/n} + (1 − λ) vol(B)^{1/n}.
Now, say we know the signal is smooth, and we model the prior as P(θ) ∝ exp(−λ∑ᵢ(θᵢ − θᵢ₊₁)²), where λ controls how smooth the signal is. Hence,
−log P(θ|y) = c + ∑ᵢ(yᵢ − θᵢ)² + λ∑ᵢ(θᵢ − θᵢ₊₁)².
Since each term in the function above is convex, so is the whole expression. Hence, the recovery question becomes a convex optimization problem:
min_θ ∑ᵢ(yᵢ − θᵢ)² + λ∑ᵢ(θᵢ − θᵢ₊₁)².
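Since the objective is a convex quadratic in θ, setting its gradient to zero gives the linear system (I + λL)θ = y, where L is the tridiagonal path-graph Laplacian with θ⊤Lθ = ∑ᵢ(θᵢ − θᵢ₊₁)². A minimal sketch assuming numpy and a synthetic signal:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
truth = np.sin(np.linspace(0, 3 * np.pi, n))     # a smooth ground-truth signal
y = truth + 0.3 * rng.standard_normal(n)         # noisy measurements

# Path-graph Laplacian L: theta @ L @ theta = sum_i (theta_i - theta_{i+1})^2.
L = np.diag(np.r_[1.0, 2.0 * np.ones(n - 2), 1.0])
L -= np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)

# Setting the gradient of the objective to zero gives (I + lam * L) theta = y.
lam = 10.0
theta = np.linalg.solve(np.eye(n) + lam * L, y)
print(np.mean((theta - truth)**2) < np.mean((y - truth)**2))  # True here: smoothing helps
```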
When we recover a signal, we want to know how confident we are, because there are many choices of θ that could explain the same measurement y. One way to do this is to sample multiple θ ∝ P(θ|y) and compute the empirical variance or other statistics. Note that
P(θ|y) ∝ e^{−∑ᵢ(yᵢ−θᵢ)² − λ∑ᵢ(θᵢ−θᵢ₊₁)²},
which is a logconcave distribution. Therefore, one can study the signal and quality of signal recovery via
logconcave sampling.
Part I
Optimization
Chapter 2
Gradient Descent
2.1 Philosophy
Optimization methods often follow this framework:
Algorithm 1: OptimizationFramework
for k = 0, 1, · · · do
Approximate f by a simpler function fk according to the current point x(k)
Do something using fk (such as set x(k+1) = arg minx fk (x))
end
The runtime depends on the number of iterations and the cost per iteration. Philosophically, the difficulties of a problem can never be created nor destroyed, only converted from one form of difficulty to another. When we decrease the number of iterations, the cost per iteration often increases. The gain of new methods often comes from avoiding some wasted computation, utilizing some forgotten information or giving a faster but tailored algorithm for a sub-problem. This is of course just an empirical observation.
One key question to answer in designing an optimization algorithm is what the problem looks like (or: how can we approximate f by a simpler function?). Here are some approximations we will use in this textbook:
- First-order Approximation: f(y) ≈ f(x) + ⟨∇f(x), y − x⟩ (Section 2.2).
- Second-order Approximation: f(y) ≈ f(x) + ⟨∇f(x), y − x⟩ + (y − x)⊤∇²f(x)(y − x) (Section 5.4).
- Stochastic/Monte-Carlo Approximation: ∑ᵢ fᵢ(x) ≈ fⱼ(x) for a random j (Section 6.3).
- Matrix Approximation: Approximate A by a simpler B with ½A ⪯ B ⪯ 2A (Section 6.1 and Section 6.2).
- Matrix Approximation: Approximate A by a low-rank matrix.
- Set Approximation: Approximate a convex set by an ellipsoid or a polytope (Section 3.1).
- Barrier Approximation: Approximate a convex set by a smooth function that blows up on the boundary (Section 5.5).
- Polynomial Approximation: Approximate a function by a polynomial (Section 7.1).
- Partial Approximation: Split the problem into two parts and approximate only one part.
- Taylor Approximation: f(y) ≈ ∑_{k=0}^{K} D^k f(x)[y − x]^k.
- Mixed ℓ₂-ℓₚ Approximation: f(y) ≈ f(x) + ⟨∇f(x), y − x⟩ + ∑ᵢ₌₁ⁿ αᵢ(yᵢ − xᵢ)² + βᵢ(yᵢ − xᵢ)^p.
Here are other approximations not covered:
- Stochastic Matrix Approximation: Approximate A by a simpler random B with B ⪯ 2A and EB ⪰ ½A.
- Homotopy Method: Approximate a function by a family of functions.
...(Please give me more examples here)...
The second question to answer is how to maintain the different approximations created in different steps. One simple way would be to forget the approximations from previous steps, but this is often not optimal. Another way is to keep all previous approximations/information (such as in Section 3.1). Often the best way is to carefully combine the previous and current approximations into a better one (such as in Section 7.4).
2.2 Basic Algorithm
By basic calculus, either the minimum (or a point achieving the minimum) is unbounded or the gradient is zero at a minimum. So we try to find a point with gradient close to zero (which, of course, does not guarantee global optimality). The basic algorithm is the following.
Algorithm 2: GradientDescent (GD)
Input: Initial point x(0) ∈ Rn , step size h > 0.
for k = 0, 1, · · · do
if ∥∇f (x(k) )∥2 ≤ ϵ then return x(k) ;
// Alternatively, one can use x(k+1) ← argminx=x(k) +t∇f (x(k) ),t∈R f (x).
x(k+1) ← x(k) − h · ∇f (x(k) ).
end
One can view gradient descent as a greedy method for solving min_{x∈Rⁿ} f(x). At a point x, gradient descent goes to the minimizer of
min_{∥δ∥₂ ≤ h∥∇f(x)∥₂} f(x) + ∇f(x)⊤δ.
The term f(x) + ∇f(x)⊤δ is simply the first-order approximation of f(x + δ). Note that in this problem, the current point x is fixed and we are optimizing over the step δ. Certainly, there is no inherent reason for using the first-order approximation and the Euclidean norm ∥x∥₂. For example, if you use a second-order approximation, you get a method involving the Hessian of f.
The step size of the algorithm is usually either a fixed constant, follows a predetermined schedule, or is determined by a line search.
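A direct implementation of Algorithm 2 with a fixed step size, as a sketch in Python (assuming numpy; the quadratic test function is an illustrative choice):

```python
import numpy as np

def gradient_descent(grad_f, x0, h, eps=1e-8, max_iter=100000):
    """Algorithm 2: iterate x <- x - h * grad_f(x) until the gradient is small."""
    x = x0
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:
            break
        x = x - h * g
    return x

# f(x) = 0.5 x^T A x - b^T x with A PSD; its gradient is Ax - b, and the
# gradient is L-Lipschitz with L the largest eigenvalue of A.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -1.0])
L = np.linalg.eigvalsh(A).max()
x = gradient_descent(lambda x: A @ x - b, np.zeros(2), h=1.0 / L)
print(np.allclose(A @ x, b))   # True: converged to the minimizer A^{-1} b
```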
If the iteration stops, we get a point with ∥∇f(x)∥₂ ≤ ϵ. Why is this good? The hope is that x is a near-minimum in its neighborhood. However, this might not be true if the gradient can fluctuate wildly:
Definition 2.1. We say f has L-Lipschitz gradient if ∇f is L-Lipschitz, namely, ∥∇f(x) − ∇f(y)∥₂ ≤ L∥x − y∥₂ for all x, y ∈ Rⁿ.
Similar to Theorem 1.20, we have the following equivalence:
Theorem 2.2. Let f ∈ C²(Rⁿ). For any L ≥ 0, the following are equivalent:
1. ∥∇f(x) − ∇f(y)∥₂ ≤ L∥x − y∥₂ for all x, y ∈ Rⁿ.
2. −LI ⪯ ∇²f(x) ⪯ LI for all x ∈ Rⁿ.
3. |f(y) − f(x) − ∇f(x)⊤(y − x)| ≤ (L/2)∥y − x∥₂² for all x, y ∈ Rⁿ.
Proof. To see that (1) implies (2), note that
∇²f(x)v = lim_{h→0} (∇f(x + hv) − ∇f(x))/h.
Since ∥(∇f(x + hv) − ∇f(x))/h∥₂ ≤ L∥hv∥₂/h = L∥v∥₂, we have ∥∇²f(x)v∥₂ ≤ L∥v∥₂, which means all eigenvalues of ∇²f(x) lie in [−L, L]. Conversely, if (2) holds, then writing ∇f(x) − ∇f(y) = ∫₀¹ ∇²f(y + t(x − y))(x − y) dt, we have that
∥∇f(x) − ∇f(y)∥₂ ≤ ∫₀¹ ∥∇²f(y + t(x − y))∥_op ∥x − y∥₂ dt ≤ L∥x − y∥₂.
One practical advantage of line search is that the algorithm does not need to know a bound on the Lipschitz constant of the gradient. The next lemma shows that the function value must decrease along the GD path for a sufficiently small step size, and the magnitude of the decrease depends on the norm of the current gradient.
Lemma 2.7. For any f ∈ C²(Rⁿ) with L-Lipschitz gradient, we have
f(x − (1/L)∇f(x)) ≤ f(x) − (1/(2L))∥∇f(x)∥₂².
Proof. Lemma 1.19 shows that
f(x − (1/L)∇f(x)) = f(x) − (1/L)∥∇f(x)∥₂² + (1/(2L²))∇f(x)⊤∇²f(z)∇f(x)
for some z ∈ [x, x − (1/L)∇f(x)]. Since ∥∇²f(z)∥_op ≤ L, we have ∇f(x)⊤∇²f(z)∇f(x) ≤ L∥∇f(x)∥₂², and hence
f(x − (1/L)∇f(x)) ≤ f(x) − (1/L)∥∇f(x)∥₂² + (1/(2L))∥∇f(x)∥₂² = f(x) − (1/(2L))∥∇f(x)∥₂².
This in turn gives a better bound on the number of iterations, because the bound in Theorem 2.6 is affected by f(x) − f∗ rather than ∥x − x∗∥₂.
Theorem 2.9. Let f ∈ C²(Rⁿ) be convex with L-Lipschitz gradient and let x∗ be any minimizer of f. With step size h = 1/L, the sequence x^(k) in GradientDescent satisfies
f(x^(k)) − f(x∗) ≤ 2LR²/(k + 4), where R = max_{f(x)≤f(x^(0))} ∥x − x∗∥₂.
This style of proof is typical in optimization. It shows that when the gradient is large, we make large progress, and when the gradient is small, we are close to optimal.
The proof above does not make essential use of any property of ℓ₂ or inner product spaces. It can be extended to work for general norms if the gradient descent step is defined using that norm. For the case of ℓ₂, one can prove that ∥x^(k) − x∗∥₂ is in fact decreasing.
Lemma 2.10. For h ≤ 2/L, we have ∥x^(k+1) − x∗∥₂ ≤ ∥x^(k) − x∗∥₂. Therefore, for a convex function f with L-Lipschitz gradient, GD with h = 1/L satisfies
f(x^(k)) − f(x∗) ≤ 2L∥x^(0) − x∗∥₂²/(k + 4).
Proof. We compute the distance to an optimal point, noting that ∇f(x∗) = 0:
∥x^(k+1) − x∗∥₂² = ∥x^(k) − x∗∥₂² − 2h⟨∇f(x^(k)), x^(k) − x∗⟩ + h²∥∇f(x^(k))∥₂² ≤ ∥x^(k) − x∗∥₂² − h(2/L − h)∥∇f(x^(k))∥₂² ≤ ∥x^(k) − x∗∥₂²,
where the first inequality uses co-coercivity, ⟨∇f(x) − ∇f(x∗), x − x∗⟩ ≥ (1/L)∥∇f(x) − ∇f(x∗)∥₂², which holds for convex f with L-Lipschitz gradient. The error estimate follows from ∥x^(k) − x∗∥₂² ≤ ∥x^(0) − x∗∥₂² for all k and the proof of Theorem 2.9.
Rewriting the bound, Theorem 2.9 shows that it takes 2L∥x^(0) − x∗∥₂²/ϵ iterations to reach error ϵ. Compared to the bound (2L/ϵ²)(f(x^(0)) − f∗) in Theorem 2.6, it seems the new result has a strictly better dependence on ϵ. However, this is not quite true, because one bound measures the error in terms of ∥∇f(x)∥₂ while the other is in terms of f(x) − f∗. For f(x) = x²/2, we have f(x) − f∗ = ½∥∇f(x)∥₂², and hence both have the same dependence on ϵ for this particular function. So, the real benefit of Theorem 2.9 is its global convergence.
2.4 Strongly Convex Functions
Recall that f is µ-strongly convex if f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ/2)∥x − y∥² for all x, y.
Lemma 2.13. Let f be µ-strongly convex. Then, for any x, y ∈ Rⁿ,
∥∇f(x)∥² ≥ 2µ(f(x) − f(y)).
Proof. By the definition of µ-strong convexity, we have
f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ/2)∥x − y∥².
Rearranging, we have
f(x) − f(y) ≤ ⟨∇f(x), x − y⟩ − (µ/2)∥x − y∥² ≤ max_∆ (∇f(x)⊤∆ − (µ/2)∥∆∥²) = (1/(2µ))∥∇f(x)∥².
This will lead to the following guarantee. Note that the error now decreases geometrically rather than
additively.
Theorem 2.14. Let f ∈ C²(Rⁿ) be µ-strongly convex with L-Lipschitz gradient and let x∗ be any minimizer of f. With step size h = 1/L, the sequence x^(k) in GradientDescent satisfies
f(x^(k)) − f(x∗) ≤ (1 − µ/L)^k (f(x^(0)) − f(x∗)).
In a later chapter, we will see that an accelerated variant of gradient descent improves this further by replacing the L/µ term with √(L/µ).
Proof. Lemma 2.7 shows that
f(x^(k+1)) − f∗ ≤ f(x^(k)) − f∗ − (1/(2L))∥∇f(x^(k))∥₂²   (2.1)
≤ f(x^(k)) − f∗ − (µ/L)(f(x^(k)) − f∗)   (2.2)
= (1 − µ/L)(f(x^(k)) − f∗)   (2.3)
where we used Lemma 2.13 in the second inequality. The conclusion follows.
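A quick numerical check of this rate on a strongly convex quadratic, as a sketch assuming numpy (here µ and L are the extreme eigenvalues of A, and the minimizer is 0):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M.T @ M + np.eye(5)                 # f(x) = 0.5 x^T A x, grad f(x) = A x
mu, L = np.linalg.eigvalsh(A).min(), np.linalg.eigvalsh(A).max()

f = lambda x: 0.5 * x @ A @ x           # f* = f(0) = 0
x = rng.standard_normal(5)
f0, k = f(x), 100
for _ in range(k):
    x = x - (1.0 / L) * (A @ x)         # gradient step with h = 1/L
print(f(x) <= (1 - mu / L)**k * f0)     # True, as guaranteed by Theorem 2.14
```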
It is natural to ask to what extent the assumption of convexity is essential for the bounds we obtained.
This is the motivation for the next exercises.
Exercise 2.16. Suppose f satisfies ⟨∇f(x), x − x∗⟩ ≥ α(f(x) − f(x∗)). Derive a bound similar to Theorem 2.9 for gradient descent.
Exercise 2.17. Suppose f satisfies ∥∇f(x)∥₂² ≥ µ(f(x) − f(x∗)). Derive a bound similar to Theorem 2.14 for gradient descent.
Exercise 2.18. Give examples of nonconvex functions satisfying the above conditions. (Note: convex functions satisfy the first with α = 1 and µ-strongly convex functions satisfy the second.)
2.5 Line Search
For step sizes h satisfying the Wolfe conditions with parameters c₁, c₂, one can show that
2(1 − c₁)/µ ≥ h ≥ (1 − c₂)/L.
As a corollary, the function value progress given by such a step is Ω(∥∇f(x)∥₂²/L). Therefore, this gives the same guarantees as Theorem 2.9 and Theorem 2.14. A common way to implement this is via a backtracking line search: the algorithm starts with a large step size and decreases it by a constant factor if the Wolfe conditions are violated. For gradient descent, the next step involves exactly computing ∇f(x + hp), and hence if our line search accepts the step size immediately, the line search costs almost nothing extra. Therefore, if we maintain a step size throughout the algorithm and decrease it only when it violates the condition, the total cost of the line search will be only an additive logarithmic number of gradient calls over the whole algorithm, which is negligible.
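A sketch of backtracking line search, assuming numpy; only the sufficient-decrease (Armijo) part of the Wolfe conditions is checked here, and the constant c1 and the shrink factor are common but illustrative choices.

```python
import numpy as np

def backtracking_step(f, grad, x, h0=1.0, c1=1e-4, shrink=0.5):
    """Shrink h from h0 until the sufficient-decrease condition
    f(x - h g) <= f(x) - c1 * h * ||g||^2 holds, then take the step."""
    g = grad(x)
    h = h0
    while f(x - h * g) > f(x) - c1 * h * (g @ g):
        h *= shrink
    return x - h * g, h

f = lambda x: 0.25 * np.sum(x**4)       # gradient x^3 is not globally Lipschitz
grad = lambda x: x**3
x = np.array([3.0, -2.0])
for _ in range(20):
    x, h = backtracking_step(f, grad, x)
print(f(x))   # the value decreases without knowing a Lipschitz bound in advance
```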
Finally, we note that for problems of the form f(x) = ∑ᵢ fᵢ(aᵢ⊤x), the bottleneck is often in computing Ax. In this case, exact line search is almost free because we can store the vectors Ax and Ah.
2.6 Generalizing Gradient Descent*
The analysis of gradient descent was based on splitting f into two terms: the first-order approximation and an ℓ₂-norm error term. More generally, we can split a function into two terms, one that is easy to optimize and another that we need to approximate with some error. More precisely, we consider the following.
Definition 2.21. We say g + h is an α-approximation to f at the point x if
- g is convex, with the same value at x, i.e., g(x) = f(x),
- h(x) = 0 and h((1 − α)x + αx̂) ≤ α²h(x̂) for all x̂,
- g(y) + αh(y) ≤ f(y) ≤ g(y) + h(y) for all y.
To understand this assumption, we note that if f is µ-strongly convex with L-Lipschitz gradient, then for any x we can use α = µ/L.
Theorem 2.22. Suppose we are given a convex function f such that we can find an α-approximation at any x. Let x∗ be any minimizer of f. Then the sequence x^(k) in GeneralizedGradientDescent satisfies
Proof. Using the fact that g^(k) + h^(k) is an upper bound on f, we have that our progress on f is at least the best possible progress on g^(k) + h^(k).
To bound the best possible progress, i.e., the RHS above, we consider x̂ = arg min_y g^(k)(y) + αh^(k)(y) and z = (1 − α)x^(k) + αx̂. We have that
where we used that g^(k) is convex together with the assumption on h in the first inequality, and that x̂ minimizes g^(k) + αh^(k) in the second.
Combining both and using the fact that g^(k) + αh^(k) is a lower bound on f, we have
Consider the constrained problem
min_{x∈K} f(x).
To apply Theorem 2.22 to F(x) = f(x) + δ_K(x), for any x, we consider functions g and h whose minimization gives the sub-problem
min_{y∈K} f(x) + ⟨∇f(x), y − x⟩ + (L/2)∥y − x∥₂².
More generally, this works for problems of the form
min_x f(x) + ϕ(x)
for some convex function ϕ(x). Theorem 2.22 requires us to solve a sub-problem of the form
min_y f(x) + ⟨∇f(x), y − x⟩ + (L/2)∥y − x∥₂² + ϕ(y).
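For ϕ(y) = λ∥y∥₁, the sub-problem above separates over coordinates and has a closed-form solution known as soft-thresholding; the resulting method is commonly called proximal gradient descent (or ISTA). A sketch assuming numpy; the sparse-regression instance is illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """argmin_y 0.5 * ||y - v||^2 + t * ||y||_1, solved coordinatewise."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_gradient(A, b, lam, iters=500):
    """Minimize f(x) + phi(x) with f(x) = 0.5 ||Ax - b||^2, phi = lam * ||x||_1.
    Each step solves min_y f(x) + <grad f(x), y - x> + (L/2)||y - x||^2 + phi(y)."""
    L = np.linalg.eigvalsh(A.T @ A).max()   # Lipschitz constant of grad f
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)
        x = soft_threshold(x - grad / L, lam / L)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100)
x_true[:5] = 1.0                            # sparse ground truth
x = prox_gradient(A, A @ x_true, lam=0.1)
print(np.count_nonzero(np.abs(x) > 1e-3))   # a sparse solution is recovered
```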
ℓp Regression
One can apply this framework to ℓp regression: min_x ∥Ax − b∥_p^p. For example, one can approximate the function f(x) = x^p by g(y) = x^p + px^{p−1}(y − x) and h(y) = p2^{p−1}(x^{p−2}(y − x)² + (y − x)^p). Using this, one can show that it suffices to solve sub-problems of mixed ℓ₂ + ℓp type, specified by some vector v and some diagonal matrix D. One can also show that Theorem 2.22 only needs the sub-problem to be solved approximately. Therefore, one can solve ℓp regression with log(1/ϵ) convergence by solving mixed ℓ₂ + ℓp regression approximately.
Other assumptions
For some special cases, it is possible to analyze this algorithm without convexity. One prominent application
is compressive sensing:
min_{∥x∥₀≤k} ∥Ax − b∥₂².
For matrices A satisfying the restricted isometry property, one can apply GeneralizedGradientDescent to solve the problem with the splitting g(x) = 2A⊤(Ax − b) + δ_{∥x∥₀≤k} and h(x) = ∥x∥². In this case, the algorithm is called iterative hard-thresholding [10], and the sub-problem has a closed-form expression.
Exercise 2.24. Give the closed form solution for the sub-problem given by the splitting above.
2.7 Gradient Flow
Gradient flow is the continuous-time analogue of gradient descent, given by the ordinary differential equation
dx_t/dt = −∇f(x_t).
This can be viewed as the canonical continuous algorithm. Finding the right discretization has led to many fruitful research directions. One benefit of the continuous view is that it simplifies some calculations. For example, for strongly convex f, Theorem 2.14 now becomes
(d/dt)(f(x_t) − f(x∗)) = ∇f(x_t)⊤ (dx_t/dt) = −∥∇f(x_t)∥₂² ≤ −2µ(f(x_t) − f(x∗))
where we used Lemma 2.13 at the end. Solving this differential inequality, we have
f(x_t) − f(x∗) ≤ e^{−2µt}(f(x_0) − f(x∗)).
Without the strong convexity assumption, gradient flow can behave wildly. For example, the length of the gradient flow can be exponential in the dimension on a unit ball [56].
We emphasize that this continuous view is mainly useful for understanding: it is indicative of, but does not necessarily imply, an algorithmic result. In some cases, effective algorithmic results can be obtained simply by discretizing time in the gradient flow. The study of such numerical methods and their convergence properties is its own field, and well-known basic methods include the forward Euler method (which results in the basic version of GD), the backward Euler method, the implicit midpoint method and Runge–Kutta methods. We will see that gradient flow and its discretization also play an important role in the development of sampling algorithms.
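To illustrate the simplest discretizations, here is a sketch (assuming numpy) of forward and backward Euler for the gradient flow of the quadratic f(x) = ½x⊤Ax, where the implicit backward step can be solved exactly as a linear system:

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # f(x) = 0.5 x^T A x, grad f(x) = A x
x_fwd = np.array([1.0, 1.0])
x_bwd = np.array([1.0, 1.0])
dt, steps, I = 0.1, 50, np.eye(2)
for _ in range(steps):
    # forward Euler: x' = x - dt * grad f(x); this is exactly gradient descent
    x_fwd = x_fwd - dt * (A @ x_fwd)
    # backward Euler: x' = x - dt * grad f(x'), i.e. solve (I + dt*A) x' = x
    x_bwd = np.linalg.solve(I + dt * A, x_bwd)
print(x_fwd, x_bwd)   # both approach the minimizer 0 of the gradient flow
```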
2.8 Discussion
Convex optimization by variants of gradient descent is a very active field with an increasing number of applications. Often, methods that are provable for the convex setting are applied as heuristics to nonconvex problems as well, most notably in deep learning. This is one of the principal features of GD: its wide applicability as an algorithmic paradigm.
Researchers are also using GD to get provably faster algorithms for classical problems. For example, [2] and [40] applied the decomposition of Eqn. (2.4) to obtain fast algorithms for ℓp regression and the ℓp flow problem. [36] showed that the ℓp flow problem can be used as a subroutine to solve the uncapacitated maximum flow problem in m^{4/3+o(1)} time. Instead of assuming h(x) converges to 0 quadratically, [55] proved Theorem 2.22 assuming h is given by some divergence function and showed its applications in D-optimal design.
Chapter 3
Elimination
Recall from Theorem 1.9 that for a convex function f, any point x and any minimizer x∗, we have f(x∗) ≥ f(x) + ⟨∇f(x), x∗ − x⟩. Since f(x∗) ≤ f(x), we know that ⟨∇f(x), x∗ − x⟩ ≤ 0. Namely, x∗ lies in a halfspace H with normal vector −∇f(x). Roughly speaking, this shows that each gradient computation cuts the set of possible solutions in half. In one dimension, this allows us to do a binary search to minimize convex functions.
It turns out that in Rⁿ, binary search still works. In this chapter, we will cover several ways to do this binary search. All of them follow the same framework, called the cutting plane method. In this method, the convex set or function of interest is given by an oracle, typically a separation oracle: for any x ∉ K ⊆ Rⁿ, the oracle finds a vector g(x) ∈ Rⁿ such that
g(x)⊤(y − x) ≤ 0 for all y ∈ K. (3.1)
Problem 3.1 (Finding a point in a convex set). Given ϵ > 0, R > 0, and a convex set K ⊆ R · Bⁿ specified by a separation oracle, find a point y ∈ K or conclude that vol K ≤ ϵⁿ. The complexity of an algorithm is measured by the number of calls to the oracle and the number of arithmetic operations.
Remark. To minimize a convex function, we set g(x) = ∇f(x) and K to be the set of (approximate) minimizers of f. In Section 3.3, we relate the problem of proving that vol(K) is small to the problem of finding an approximate minimizer of f.
In this framework, we maintain a convex set E^(k) that contains the set K. In each iteration, we choose some x^(k) based on E^(k) and query the oracle for g(x^(k)). The guarantee for g(·) implies that K lies in the halfspace
H^(k) = {y : g(x^(k))⊤(y − x^(k)) ≤ 0}
and hence K ⊆ H^(k) ∩ E^(k). The algorithm continues by choosing E^(k+1) to be a convex set that contains H^(k) ∩ E^(k).
Algorithm 4: CuttingPlaneFramework
Input: Initial set E (0) ⊆ Rn containing K .
for k = 0, · · · do
Choose a point x(k) ∈ E (k) .
if E (k) is small enough then return x(k) ;
Find E^(k+1) ⊃ E^(k) ∩ H^(k) where
H^(k) def= {x ∈ Rⁿ : g(x^(k))⊤(x − x^(k)) ≤ 0}. (3.2)
end
To analyze the algorithm, the main questions we need to answer are:
1. How do we choose x(k) and E (k+1) ?
2. How do we measure progress?
3. How quickly does the method converge?
4. How expensive is each step?
Progress on the cutting plane method is shown in the next table.
Table 3.1: Different cutting plane methods. Polylogarithmic terms omitted. The number of iterations follows from the rate.
3.2 Ellipsoid Method
The ellipsoid method maintains an ellipsoid
E^(k) = {y ∈ Rⁿ : (y − x^(k))⊤(A^(k))⁻¹(y − x^(k)) ≤ 1}
that contains K and becomes smaller in volume in each step. Note that for this to be an ellipsoid, the matrix A^(k) must be symmetric PSD. After we compute g(x^(k)) and H^(k) via (3.2), we define E^(k+1) to be the smallest volume ellipsoid containing E^(k) ∩ H^(k). The key observation is that the volume of the ellipsoid E^(k) decreases by a factor of 1 − Θ(1/n) in every iteration. This volume property holds for any halfspace through the
center of the current ellipsoid (not only for the one whose normal is the gradient), a property we will exploit
in the next chapter.
Algorithm 5: Ellipsoid
Input: Initial ellipsoid E^(0) = {y ∈ Rⁿ : (y − x^(0))⊤(A^(0))⁻¹(y − x^(0)) ≤ 1}.
for k = 0, · · · do
  if E^(k) is small enough or Oracle says YES then return x^(k);
  x^(k+1) = x^(k) − (1/(n+1)) · A^(k)g(x^(k)) / √(g(x^(k))⊤A^(k)g(x^(k))).
  A^(k+1) = (n²/(n²−1)) (A^(k) − (2/(n+1)) · A^(k)g(x^(k))g(x^(k))⊤A^(k) / (g(x^(k))⊤A^(k)g(x^(k)))).
end
Remark 3.2. We usually obtain g(x) from a separation oracle.
Lemma 3.3. In the Ellipsoid algorithm, E^(k) ∩ H^(k) ⊆ E^(k+1) and vol(E^(k+1)) ≤ exp(−1/(2n + 2)) vol(E^(k)).
... where we used that ∥x∥₂ ≤ 1 and x₁(1 + x₁) ≤ 0 (since −1 ≤ x₁ ≤ 0) at the end. This shows that x ∈ E^(k+1).
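A sketch of Algorithm 5 in Python, assuming numpy; the feasibility instance (a small ball inside a larger starting ellipsoid) and its simple separation oracle are illustrative choices.

```python
import numpy as np

def ellipsoid_method(oracle, x, A, max_iter=1000):
    """Algorithm 5. oracle(x) returns None if x is in K, otherwise a vector g
    with g^T (y - x) <= 0 for all y in K. The pair (x, A) describes the
    ellipsoid {y : (y - x)^T A^{-1} (y - x) <= 1}."""
    n = len(x)
    for _ in range(max_iter):
        g = oracle(x)
        if g is None:
            return x
        Ag = A @ g
        s = np.sqrt(g @ Ag)
        x = x - Ag / (s * (n + 1))
        A = (n**2 / (n**2 - 1.0)) * (A - (2.0 / (n + 1)) * np.outer(Ag, Ag) / s**2)
    return None

# K = ball of radius 0.1 around (0.5, 0.5); for x outside K, the vector
# x - center separates x from K.
center = np.array([0.5, 0.5])
oracle = lambda x: None if np.linalg.norm(x - center) <= 0.1 else x - center
print(ellipsoid_method(oracle, np.zeros(2), 4.0 * np.eye(2)))  # a point in K
```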
Exercise 3.5. Show that the ellipsoid E (k+1) computed above is the minimum volume ellipsoid containing
E (k) ∩ H (k) .
Exercise 3.6. Suppose that we used a box instead of an ellipsoid. Could we ensure progress in each
iteration? What about a simplex?
3.3 From Volume to Function Value
Theorem 3.7. Let f be a convex function, Ω a convex set, and V a measure of size that scales linearly, i.e., V((1 − α)x + αΩ) = αV(Ω) for any point x and α ≥ 0. Then, for the query points x^(1), . . . , x^(k) of a cutting plane method with Ω ⊆ E^(0),
min_{i=1,2,···,k} f(x^(i)) − min_{y∈Ω} f(y) ≤ (V(E^(k))/V(Ω)) · (max_{z∈Ω} f(z) − min_{x∈Ω} f(x)).
Remark 3.8. We can think of V(E) as some way to measure the size of E. It can be the radius, mean width or any other measure of size. For the ellipsoid method, we use V(E) = vol(E)^{1/n}, for which we proved the volume decrease in Lemma 3.3. We raise the volume to the power 1/n to satisfy linearity. Also note that we only guarantee that one of the previous query points has a small function value; of course, we can simply use the point with minimum function value.
Proof. Let x∗ be any minimizer of f over Ω. For any α > V(E^(k))/V(Ω) and S = (1 − α)x∗ + αΩ, by the linearity of V, we have that
V(S) = αV(Ω) > V(E^(k)).
Therefore, S is not a subset of E^(k), and hence there is a point y ∈ S \ E^(k). Since y is not in E^(k), it was eliminated by the subgradient halfspace at some step i ≤ k, namely for some i ≤ k we have (denoting the subgradient by ∇f)
∇f(x^(i))⊤(y − x^(i)) > 0.
By the convexity of f, it follows that f(x^(i)) ≤ f(y). Since y ∈ S, we have y = (1 − α)x∗ + αz for some z ∈ Ω. Thus, the convexity of f implies that
f(y) ≤ (1 − α)f(x∗) + αf(z) ≤ min_{x∈Ω} f(x) + α(max_{z∈Ω} f(z) − min_{x∈Ω} f(x)).
Therefore, we have
min_{i=1,2,···,k} f(x^(i)) − min_{x∈Ω} f(x) ≤ α(max_{z∈Ω} f(z) − min_{x∈Ω} f(x)).
Since this holds for any α > V(E^(k))/V(Ω), we have the result.
Combining Lemma 3.3 and Theorem 3.7, we have the following rate of convergence.
Theorem 3.9. Let f be a convex function on Rⁿ, E^(0) be any initial ellipsoid and Ω ⊂ E^(0) be any convex set. Suppose that for any x ∈ E^(0), we can find, in time T, a nonzero vector g(x) such that f(y) ≥ f(x) for every y with g(x)⊤(y − x) ≥ 0. Then, we have
min_{i=1,2,···,k} f(x^(i)) − min_{y∈Ω} f(y) ≤ (vol(E^(0))/vol(Ω))^{1/n} exp(−k/(2n(n + 1))) (max_{z∈Ω} f(z) − min_{x∈Ω} f(x)).
Remark 3.10. We note that for this rate of convergence (and hence the entire algorithm) to be polynomial,
we need some bound on the range of the function value. This often follows from a bound on the diameter of
the support.
Proof. Lemma 3.3 shows that the volume of the maintained ellipsoid decreases by a factor of exp(−1/(2n + 2)) in every iteration. Hence, vol^{1/n} decreases by a factor of exp(−1/(2n(n + 1))) every iteration. The bound follows by applying Theorem 3.7 with V(E) def= vol(E)^{1/n}:
min_{i=1,2,···,k} f(x^(i)) − min_{y∈Ω} f(y) ≤ (vol(E^(k))/vol(Ω))^{1/n} (max_{z∈Ω} f(z) − min_{x∈Ω} f(x))
≤ (vol(E^(0))/vol(Ω))^{1/n} exp(−k/(2n(n + 1))) (max_{z∈Ω} f(z) − min_{x∈Ω} f(x)).
Next, we note that the proof of Theorem 3.7 only used the fact that one side of the halfspace defined by the gradient has higher function value. Therefore, we can replace the gradient with the vector g(x).
To bound the time per iteration, note that we make one query to the separation oracle per iteration, and
then compute the next Ellipsoid using the formulas in Algorithm 5. The most time-consuming operation is
multiplying an n × n matrix by an n-vector, which has complexity O(n2 ).
This theorem can be used to solve many problems in polynomial time. As an illustration, we show how
to solve linear programs in polynomial time here.
Theorem 3.11. Consider a linear program min_{x∈P} c⊤x where P = {x : Ax ≥ b}. Let the diameter of P be R def= max_{x∈P} ∥x∥₂ and its volume radius be r def= vol(P)^{1/n}. Then, we can find x ∈ P with
c⊤x ≤ min_{y∈P} c⊤y + ε · (max_{z∈P} c⊤z − min_{y∈P} c⊤y)
in O(n²(n² + nnz(A)) log(R/(rε))) time, where nnz(A) is the number of non-zero elements in A.
Remark 3.12. If the dimension n is constant, this algorithm is nearly linear time (linear in the number of constraints)!
Proof. For the linear program min_{Ax≥b} c⊤x, the function we want to minimize is
L(x) = c⊤x + δ_{Ax≥b}(x), where δ_{Ax≥b}(x) = 0 if aᵢ⊤x ≥ bᵢ for all i, and +∞ otherwise. (3.3)
For this function L, we can use the separation oracle v(x) = c if Ax ≥ b, and v(x) = −aᵢ if aᵢ⊤x < bᵢ. If there are many violated constraints, any one of them will do.
In this case, we can set Ω = P. We can simply pick E^(0) to be the ball of radius R centered at 0. We apply Theorem 3.9 to find x with the claimed guarantee.
To get the solution exactly, i.e., ε = 0, we need to assume the linear program has integral (or rational) coefficients, and then the running time will depend on the sizes of the numbers in the matrix A and in the vectors b and c. It is still open how to solve linear programs in time bounded by a polynomial in only the number of variables and constraints (and not the bit sizes of the coefficients). Such a running time is called strongly polynomial.
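The separation oracle v(x) from (3.3) takes only a few lines; a sketch assuming numpy, with hypothetical small LP data for illustration:

```python
import numpy as np

def lp_separation_oracle(x, A, b, c):
    """Oracle for L(x) = c^T x + delta_{Ax >= b}(x): return c if x is
    feasible, otherwise -a_i for some violated constraint a_i^T x < b_i."""
    violated = np.flatnonzero(A @ x < b)
    if violated.size > 0:
        return -A[violated[0]]   # any violated constraint will do
    return c

A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.array([0.0, 0.0, -1.5])   # constraints: x1 >= 0, x2 >= 0, x1 + x2 <= 1.5
c = np.array([1.0, 2.0])
print(lp_separation_oracle(np.array([-1.0, 0.5]), A, b, c))
# [-1. -0.]: the halfspace cutting off the violated constraint x1 >= 0
```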
3.4 Center of Gravity Method
The measure of progress will once again be the volume (or more precisely, the volume radius) of the current set. It is clear that the volume can only decrease in each iteration. But at what rate? The following classical theorem shows that cutting through the exact center of gravity decreases the volume of the convex body by a constant factor (to at most 1 − 1/e of its previous value).
Theorem 3.14 ([29]). Let K be a convex body in Rⁿ with center of gravity z, and let H be any halfspace containing z. Then,
vol(K ∩ H) ≥ (n/(n + 1))ⁿ · vol(K).
Note that the constant on the RHS is at least 1/e. We prove the theorem later in this chapter. Unfortunately, computing the center of gravity, even of a polytope, is #P-hard [62]. For the purpose of efficient approximations, it is important to establish a stable version of the theorem that does not require an exact center of gravity.
Recall that a nonnegative function is logconcave if its logarithm is concave, i.e., for any x, y ∈ Rn and
any λ ∈ [0, 1],
f(λx + (1 − λ)y) ≥ f(x)^λ f(y)^{1−λ}.
We refer to Section 1.7 for some background on logconcave functions. A distribution p is isotropic if the
mean of a random variable drawn from the distribution is zero and the covariance is the identity matrix.
The randomized center of gravity method is defined as follows:
3.4. Center of Gravity Method 35
Algorithm 6: RandomizedCenterOfGravity
Input: Initial convex set E^(0).
for k = 0, · · · do
  if E^(k) is small enough then return x^(k);
  Let y^(1), . . . , y^(N) be uniform random points from E^(k) and set
    x^(k) = (1/N) ∑ᵢ₌₁ᴺ y^(i).
  E^(k+1) = E^(k) ∩ H^(k) where H^(k) def= {x ∈ Rⁿ : g(x^(k))⊤(x − x^(k)) ≤ 0}, obtained by querying g(x^(k)).
end
Remark 3.15. The question of how to sample (nearly) uniform random points is an important one that we will address in detail in this book. Using sampling to approximate the center of gravity leads to the reduction in the time per iteration from nⁿ to n⁴ (Table 3.1).
To prove the convergence of the method, we will use a robust version of Theorem 3.14, which will give a
similar result despite using only an approximate center of gravity.
Theorem 3.16 (Robust Grünbaum). Let p be an isotropic logconcave distribution, namely Ex∼p x = 0 and
Ex∼p x2 = 1. For any unit vector θ ∈ Rn , t ∈ R we have
1
Px∼p (x⊤ θ ≥ t) ≥ − |t| .
e
Proof. By taking the marginal with respect to the direction θ, we can assume the distribution is one-dimensional (by Lemma 1.25, the marginal is still isotropic logconcave). Let P(t) = P_{x∼p}(x⊤θ ≤ t). Note that P(t) is the convolution of p and 1_{(−∞,0]}; hence it is logconcave (Lemma 1.25). By a limit argument, we can assume P(−M) = 0 and P(M) = 1 for some large enough M (to be rigorous, we do the proof below for finite M and a RHS ϵ(M) instead of zero, then take the limit M → ∞). Since E_{x∼p} x = 0, we have that
∫_{−M}^{M} t (dP(t)/dt) dt = 0.
Integration by parts gives ∫_{−M}^{M} P(t) dt = M. Note that P(t) is increasing and logconcave; if P(0) were too small, it would make ∫_{−M}^{M} P(t) dt too small. To be precise, since P is logconcave, i.e., −log P(t) is convex, we have
−log P(t) ≥ −log P(0) − (P′(0)/P(0)) t,
or, simply, P(t) ≤ P(0)e^{αt} for some α > 0. Hence,
M = ∫_{−M}^{M} P(t) dt ≤ ∫_{−∞}^{1/α} P(0)e^{αt} dt + ∫_{1/α}^{M} 1 dt = eP(0)/α + M − 1/α.
Rearranging gives P(0) ≥ 1/e, i.e., P_{x∼p}(x⊤θ ≤ 0) ≥ 1/e; applying the same argument to −θ gives P_{x∼p}(x⊤θ ≥ 0) ≥ 1/e. Finally, for general t, since a one-dimensional isotropic logconcave density is bounded by 1 (Lemma 3.17 below), we have
P_{x∼p}(x⊤θ ≥ t) ≥ P_{x∼p}(x⊤θ ≥ 0) − |t| ≥ 1/e − |t|.
Lemma 3.17. Let p be a one-dimensional isotropic logconcave density. Then max p(x) ≤ 1.
For a proof of this (and for other properties of logconcave functions), we refer the reader to [54].
Exercise 3.18. Give a short proof that maxx p(x) = O(1) for any one-dimensional isotropic logconcave
density.
Using the robust Grünbaum theorem (Theorem 3.16), we obtain the following guarantee for the algorithm, which uses uniform random points from the current set. Obtaining such a random sample algorithmically is an interesting problem that we will study in the second part of this book.
Lemma 3.19. Suppose y^(1), . . . , y^(N) are i.i.d. uniform random points from a convex body K and ȳ = (1/N) ∑ᵢ₌₁ᴺ y^(i). Then for any halfspace H not containing ȳ,
E(vol(K ∩ H)) ≤ (1 − 1/e + √(n/N)) vol(K).
Proof. Without loss of generality, we assume that K is in isotropic position, i.e., E_K(y^(i)) = 0 and E_K(y^(i)(y^(i))⊤) = I. Then we have E(ȳ) = 0 and
E∥ȳ∥² = (1/N²) E∥∑ᵢ₌₁ᴺ y^(i)∥² = (1/N) E∥y^(1)∥² = n/N.
Therefore,
E∥ȳ∥ ≤ √(E∥ȳ∥²) = √(n/N).
Thus, we can apply Theorem 3.16 with t = √(n/N) to bound E(vol(K ∩ H))/vol(K).
Theorem 3.7 readily gives the following guarantee for convex optimization, again using volume radius as
the measure of progress.
Theorem 3.20. Let f be a convex function on Rⁿ, E^(0) be any initial set and Ω ⊂ E^(0) be any convex set. Suppose that for any x ∈ E^(0), we can find a nonzero vector g(x) such that f(y) ≥ f(x) for every y with g(x)⊤(y − x) ≥ 0. Then, using N = Cn random points per iteration for a suitable constant C, we have
E(min_{i=1,2,···,k} f(x^(i))) − min_{y∈Ω} f(y) ≤ (0.95)^{k/n} (vol(E^(0))/vol(Ω))^{1/n} (max_{z∈Ω} f(z) − min_{x∈Ω} f(x)).
Now we give a geometric proof of Theorem 3.14. Note that one can modify the proof of Theorem 3.16 to
get another proof.
Proof. Since affine transformations do not affect ratios of volumes, without loss of generality assume that the center of gravity of K is the origin and H is the halfspace {x : x₁ ≤ 0}. For each t ∈ R, let A(t) = K ∩ {x : x₁ = t} be the (n − 1)-dimensional slice of K with x₁ = t. Define r(t) as the radius of the (n − 1)-dimensional ball with the same (n − 1)-dimensional volume as A(t).
The goal of the proof is to show that the smallest possible halfspace volume is achieved for a cone by a cut perpendicular to its axis. In the first step, we symmetrize K as follows: replace each cross-section A(t) by a ball of the same volume, centered at (t, 0, . . . , 0)⊤. We claim that the resulting rotationally symmetric body is convex. To see this, note that all we have to show is that the radius function r(t) is concave. For any s, t ∈ R, and any λ ∈ [0, 1], we have by convexity of K that
any s, t ∈ R, and any λ ∈ [0, 1], we have by convexity of K that
and by the Brunn-Minkowski theorem (Theorem 1.27) applied to A(s), A(t), the function voln−1 (A(s))1/(n−1)
is a concave function and so we have
1 1 1
voln−1 (A(λs + (1 − λ)t)) n−1 ≥ λvoln−1 (A(s)) n−1 + (1 − λ)voln−1 (A(t)) n−1 .
as desired.
Next consider the subset K1 = K ∩ {x : x1 ≤ 0}. We replace this subset with a cone C having the same
base A(0) and apex at some point along the e1 axis so that the volume of the cone is the same as vol(K1 ).
Using the concavity of the radial function, this transformation can only decrease the center of gravity along
e1 . Therefore, proving a lower bound on the transformed body K1 will give a lower bound for K . So assume
we do this and the center of gravity is the origin. Next, extend the cone to the right, so that it remains a
rotational cone, and the volume in the positive halfspace along e1 is the same as vol(K \K1 ). Once again, the
center of gravity can only move to the left, and so the volume of K1 can only decrease by this transformation.
At the end we have shown that the lower bound for any convex body follows by proving for a rotational cone
with axis along the normal to the halfspace. The intersection of this cone with the halfspace is the original
cone scaled down by the ratio of the distance from the apex to the center of gravity, to the height of the
cone. So all that remains to be done is to compute the relative distance of the center of gravity from the
apex, which is exactly n/(n + 1). Now we compute the volume ratio:
n
vol(K1 ) n
= .
vol(K) n+1
Exercise 3.21. Show that for a cone of height h in Rn , i.e., the convex hull of a convex body K ⊂ Rn−1
with a single point a ∈ Rn at distance h from the hyperplane H containing K , the distance of the center of
gravity z of the cone is at distance h/(n + 1) from H .
To conclude this section, we note that the number of separation oracle queries made by the center-of-
gravity cutting plane method is asymptotically the best possible.
3.4. Center of Gravity Method 38
Theorem 3.22. Any algorithm that solves Problem 3.1 using a separation oracle needs to make Ω(n log(R/ϵ))
queries to the oracle.
Proof. Suppose K is a cube of side length ϵ contained in the cube [0, R]n . Imagine a tiling of the big cube
by cubes of side length ϵ. Consider the oracle that always returns an axis parallel halfspace that does not
cut any little cube and contains at least half of the volume of the remaining region, i.e., the set given by
the original cube intersected with all halfspaces given by the oracle so far. This is always possible since for
any halfspace either the halfspace or its complement will contain at least half the volume of any set. Thus
each query at best halves the remaining volume. To solve the problem, the algorithm needs to cut down to
a set of volume ϵn starting from a set of volume Rn . Thus it needs at least n log2 (R/ϵ) queries.
Discussion
In later chapters we will see how to implement each iteration of the center of gravity method in polynomial
time. Computing the exact center of gravity is #P-hard even for a polytope [62], but we can nd an
arbitrarily close approximation in randomized polynomial time via sampling. The method generalizes to
certain noisy computation models, e.g., when the oracle reports a function value that is within a bounded
additive error, i.e., a noisy function oracle.
3.5. Sphere and Parabola Methods 39
the function satises µ · I ⪯ ∇ f (x) ⪯ L · I for all x. Hence, if µ is not too large, this rate of decrease of the
2 L
k k
function value can be much better than the 1 − n12 rate of the ellipsoid method or the 1 − n1 rate of
the center of gravity method (recall that the convergence rate in function value is n times slower than the
convergence rate in volume). In this section, we show how to modify the ellipsoid method to get a faster
convergence rate when L µ is small. One can view this whole section as just an interpretation of accelerated
gradient descent (which we haven't seen yet) in the cutting plane framework. In a later section, we will give
another interpretation.
def def
Using x+ = x − 1
L ∇f (x) and x++ = x − µ1 ∇f (x), we can write
2
x − x++
2 ≤ ∥∇f (x)∥2 − 2 (f (x) − f (x∗ )). (3.4)
∗
2 µ2 µ
To use this formula in the cutting plane framework, we need a crude upper bound on f (x∗ ). One can simply
use f (x∗ ) ≤ f (x). Or, we can use Lemma 2.7 and get
1 1
f (x∗ ) ≤ f (x+ ) ≤ f (x − ∇f (x)) ≤ f (x) − ∥∇f (x)∥2 .
L 2L
Putting it into (3.4), we see that
x − x++
2 ≤ 1
∗ 2
∥∇f (x)∥2 − 2µ · (f (x) − f (x∗ ))
2 µ2
1
2 2
∥∇f (x)∥2 − 2µ · f (x) − f (x+ ) − f (x+ ) − f (x∗ )
≤ 2
µ µ
1 2 1 2
∥∇f (x)∥22 − f (x+ ) − f (x∗ )
≤ 2 ∥∇f (x)∥2 − 2µ ·
µ 2L µ
µ
1− L 2 2
f (x+ ) − f (x∗ ) . (3.5)
≤ ∥∇f (x)∥2 −
µ2 µ
Therefore, using the trivial bound of zero for the second term on the RHS, x∗ lies in a ball centered at x++
with radius at most r
µ ∥∇f (x)∥2
1− · . (3.6)
L µ
This suggests using balls instead of ellipsoids in a cutting plane algorithm; it would certainly be more
ecient to maintain!
We arrive at the following algorithm.
3.5. Sphere and Parabola Methods 40
Algorithm 7: SphereMethod
Input: Initial point x(0) ∈ Rn , strong convexity parameter µ, Lipschitz gradient parameter L.
Q(0) ← Rn .
for k = 0, · ·
· do
2 µ
2
1− L
Set Q = ·
∇f (x(k) )
2 .
x ∈ Rn :
x − (x(k) − µ1 ∇f (x(k) ))
≤
µ2
2
Q(k+1) ← minSphere(Q ∩ Q(k) ) where minSphere(K) is the smallest sphere covering K .
x(k+1) ← center of Q(k+1) .
end
To analyze SphereMethod, we need the following lemma, which is illustrated in Figure 3.4.
√ √
Lemma 3.23. For any g ∈ Rn and ϵ ∈ (0, 1), we have B(0, 1) ∩ B(g, ∥g∥2 1 − ϵ) ⊂ B(x, 1 − ϵ) for some
x.
Proof. By symmetry, it suces to consider the two-dimensional case, and to assume that g = ae1 . If a ≤ 1,
we can simply pick x = g . Otherwise, let (x, 0) be the center of the smallest ball containing the required
intersection, and y be its radius. (See Figure 3.4). We have x2 + y 2 = 1 and (x − a)2 + y 2 = (1 − ϵ)a2 . This
implies that
1 + ϵa2
x=
2a
and so
ϵ 1 ϵ2 a2
y2 = 1 − − 2 − ≤1−ϵ
2 4a 4
as claimed.
Lemma 3.24. Let the measure of progress V(Q) = radius(Q). Then, we have that x∗ ∈ Q(k) and V(Q(k+1) ) ≤
µ
· V(Q(k) ) for all k .
p
1− L
Remark 3.25. The function value decrease follows from Theorem 3.7.
Proof Sketch. The fact x∗ ∈ Q(k) follows directly from the denition of Q. For the decrease of radius,
suppose that n o
Q(k) = x ∈ Rn : ∥x − x(k) ∥ ≤ R(k) .
radius(Q(k+1) )2 ∇f (0)
To compute radius(Q(k) )2
, we can assume x(k) = 0 and R(k) = 1 and let g = µ . Hence, we have
n
2 µ 2
o
Q(k+1) = minSphere ∥x − g∥2 ≤ (1 − ) · ∥g∥2 ∩ {∥x∥ ≤ 1} .
L
Now Lemma 3.23 (with ϵ = L)
µ
shows that the radius(Q(k+1) )2 ≤ 1 − L.
µ
Note that this gives the same convergence rate as gradient descent for strongly convex functions. Each
iteration is much faster than the O(n2 ) time of the Ellipsoid method.
3.5. Sphere and Parabola Methods 41
radius =
p
1 − ϵ|g|2
|g| |g|
√ √ √
√
1−ϵ 1 − ϵ |g| 1 − ϵ |g|
p
1 1− ϵ
√ √ p √ p √
B(0, 1) ∩ B(g, |g| 1 − ϵ) ⊂ B(x, 1 − ϵ) B(0, 1 − ϵ|g|2 ) ∩ B(g, |g| 1 − ϵ) ⊂ B(x, 1 − ϵ)
Figure 3.4: The left diagram shows the intersection shrinks at the same rate if only one of the ball shrinks; the
right diagram shows the intersection shrinks much faster if two balls shrink at the same absolute amount.
Notice that the left-hand side is a strict inequality (unless we already solved the problem). We infer a
halfspace containing the optimum using the subgradient at x, namely
⟨∇f (x), x∗ − x⟩ ≤ 0.
As the algorithm proceeds, we nd a new point x(new) such that f (x(new) ) < f (x). Therefore, the original
inequality can be strengthened to
or
⟨∇f (x), x∗ − x⟩ < −(f (x) − f (x(new) )).
This suggests we should move earlier halfspaces and thereby reduce the measure of p the next set. We expect
merely updating the halfspaces can improve the convergence rate from 1 − L µ
to 1 − L µ
because of the right
diagram in Figure 3.4. An ecient way to manage all this information is to directly maintain a region that
contains (x∗ , f (x∗ )). Now, we can view the inequality f (y) ≥ f (x) + ⟨∇f (x), y − x⟩ as a cutting plane of the
epigraph of f and we do not need to update previous cutting planes anymore.
The next algorithm is an epigraph cutting plane method.
3.5. Sphere and Parabola Methods 42
Algorithm 8: ParabolaMethod
Input: Initial point x(0) ∈ Rn and the strong convexity parameter µ.
q (0) (y) ← −∞.
for k = 0, · · · do
1
2
Set q (k+ 2 ) (y) = f (x(k) ) + ∇f (x(k) )⊤ (y − x(k) ) + µ2
y − x(k)
.
1
Let q (k+1) = maxParabola(max(q (k+ 2 ) , q (k) )) where maxParabola(q) outputs the parabolic
function p such that p(x) ≤ q(x) for all x, and p maximizes minx p(x).
// Alternatively, one can use x(k+ 2 ) ← x(k) − L1 ∇f (x(k) ) below.
1
1
Let x(k+ 2 ) = lineSearch(x(k) , −∇f (x(k) )) where
Exercise 3.26. Show that the formula for maxParabola computes the optimal parabola.
The key fact we will be using from the formula is that when 0 < λ < 1, we have
2 2 2
µ (∥cA − cB ∥ + µ (vB − vA ))
vλ = vA + .
8 ∥cA − cB ∥2
This says that the quadratic lower bound improves a lot whenever µ2 ∥cA −cB ∥2 ≫ vB −vA or µ2 ∥cA −cB ∥2 ≪
vB − vA . Using this, we can analyze the ParabolaMethod.
Theorem 3.27. Assume that f is µ-strongly convex with L-Lipschitz gradient. Let rk = f (x(k) )−miny q (k) (y).
Then, we have that
r
2 µ
rk+1 ≤ (1 − ) · rk2 .
L
In particular, we have that
r
(k+1) 1
∗ µ k
f (x )−f ≤ (1 − ) ∥∇f (x(0) )∥2 .
2µ L
Remark 3.28. Note that the squared radius of {y : q (k) (y) = f (x(k) )} is µ2 (f (x(k) ) − miny q (k) (y)) because
q (k) (y) = miny q (k) (y) + µ2 ∥y − arg minx q (k) (x)∥2 . Hence, rk is measuring the squared radius of our quadratic
lower bound. To relate to the cutting plane framework, we can view the set as {y : q (k) (y) ≤ f (x(k) )} and
the measure V = f (x(k) ) − miny q (k) (y).
2 1 2
Proof. Fix some k . We write q (k) (y) = vA + µ
2 ∥y − cA ∥ and q (k+ 2 ) (y) = vB + µ
2 ∥y − cB ∥ with
2
Using the notation in maxParabola, we write q (k+1) (y) = vλ + µ2 ∥y − cλ ∥ . Note that rk+1
2
= f (x(k+1) ) − vλ
and rk = f (x ) − vA . Therefore, we have
2 (k)
rk2 − rk+1
2
f (x(k) ) − f (x(k+1) ) + vλ − vA
= . (3.8)
rk2 rk2
To bound the right hand side, it suces to bound vA and vλ . From the description of the algorithm
maxParabola, we see that there are three cases λ = 0, λ = 1 and 0 < λ < 1. We only focus on proving the
nontrivial case λ ∈ (0, 1). In this case, we have that
2 2 2
µ (∥cA − cB ∥ + µ (vB − vA ))
vλ = vA +
8 ∥cA − cB ∥2
2 2 (k) ∥∇f (x(k) )∥2 2
µ (∥cA − cB ∥ + µ (f (x ) − vA ) − µ2 )
= vA + .
8 ∥cA − cB ∥2
Since f ≥ q (k) , we have f (x(k) ) ≥ q (k) (x(k) ) ≥ minx q (k) (x) = vA . Next, we claim that ∥cA − cB ∥2 ≥
∥∇f (x(k) )∥2
µ2 . Using these two facts, we can prove that
2 (k) 2
µ ( µ (f (x ) − vA )) µ · rk4
vλ ≥ vA + (k) 2
= v A + .
8 ∥∇f (x )∥
2
2∥∇f (x(k) )∥2
µ
Remark 3.29. We can view (3.9) as the key equation of the proof above. It shows that the progress is roughly
∥∇f ∥2
L
µ
+ ∥∇f ∥2where the rst term comes from the progress on the function value and the second term comes
from the curvature of the cutting sphere.
Exercise 3.30. Provepthe following extension of Lemma 3.23: There exists x s.t. B(0, 1 − ϵ|g|2 ) ∩
p
√ √
B(g, |g| 1 − ϵ) ⊂ B(x, 1 − ϵ).
3.6. Lower Bounds 44
3.5.3 Discussion
This section was about the idea of managing cutting planes, and as a byproduct we get an accelerated rate
of convergence. As wep will see later, standard accelerated gradient descent does not use line search and
achieves the rate 1 − L µ
. However, it seems that the use of line search helps in practice and that with a
careful implementation, line search can be as cheap as gradient computation. For more dicult problems,
one may want to store multiple quadratic lower bounds (see [21]).
n−1 n √
L−µ L−µ X L−µ 2 µX 2 Lµ − µ 2
f (x) = − x1 + 2
(xi − xi+1 ) + x1 + xi + xn (3.10)
4 8 i=1 8 2 i=1 4
which satises µ·I ⪯ ∇2 f (x) ⪯ L·I . Assume that our algorithm satises x(k) ∈ span(x(0) , ∇f (x(0) ), · · · , ∇f (x(k−1) ))
(0)
with the initial point x = 0. Then, for k < n,
p
(k) µ 3/2 L/µ − 1 2k
f (x ) − min f (x) ≥ ( ) ( p ) (f (x(0) ) − min f (x)).
x L L/µ + 1 x
Proof. First, we check the strong convexity and smoothness. Note that f (x) = −x1 + 21 x⊤ Ax for some
matrix A. Hence, ∇2 f (x) = A. Hence, we have
n−1 n √
⊤ 2 L−µ X 2 L−µ 2 X
2 Lµ − µ 2
θ ∇ f (x)θ = (θi − θi+1 ) + θ1 + µ θi + θn
4 i=1 4 i=1
2
To lower bound the error, we note that the gradient at x(0) is of the form (?, 0, 0, 0, · · · ) and hence by the
assumption x(1) = (?, 0, 0, 0, · · · ) and ∇f (x(1) ) = (?, ?, 0, 0, · · · ). By induction, only the rst k coordinates
of x(k) are non-zero.
3.6. Lower Bounds 45
Now, we compute the minimizer of f (x). Let x∗ be the minimizer of f (x). By the optimality condition,
we have that
L−µ L−µ ∗ L−µ ∗
− + (x1 − x∗2 ) + x1 + µx∗1 = 0,
4 4 4
L−µ ∗ L−µ ∗
(xi − x∗i−1 ) + (xi − x∗i+1 ) + µx∗i = 0, for i ∈ {2, 3, · · · , n − 1}
4 4 √
L−µ ∗ ∗ Lµ + µ ∗
(xn − xn−1 ) + xn = 0.
4 2
√
L/µ−1 i
By a direct substitution, we have that x∗i = ( √ ) is a solution of the above equation. Now, we note
L/µ+1
that
n p p
X L/µ − 1 L/µ − 1
∥x(k) − x∗ ∥22 ≥ (p )2i ≥ ( p )2(k+1)
i=k+1
L/µ + 1 L/µ + 1
and that
∞ p p p
X L/µ − 1 2i L/µ − 1 2 L/µ + 1
∥x (0)
− x∗ ∥22 ≤ (p ) = (p ) .
i=1
L/µ + 1 L/µ + 1 2
Now, by smoothness and by the strong convexity of f , we have
p
f (x(k) ) − f (x∗ ) µ ∥x(k) − x∗ ∥22 µ 3/2 L/µ − 1 2k
≥ · (0) ≥ ( ) (p ) .
f (x(0) ) − f (x∗ ) L ∥x − x∗ ∥22 L L/µ + 1
Note that this worst function naturally appears in many problems. So, it is a problem we need to
address. In some sense, the proof points out a common issue of any algorithm which only uses gradient
information. Given any convex function, we construct the dependence graph G on the set of variables xi by
connecting xi to xj if ∇f (x)i depends on xj or ∇f (x)j depends on xi (given all other variables). Note that
the dependence graph G of the worst function is simply a n vertex path, whose diameter is n − 1. Also, note
that gradient descent can only transmit information from one vertex to another in each iteration. Therefore,
it takes at least Ω(diameter) time to solve the problem unless we know the solution is sparse (when L/µ is
small). However, we note that this is not a lower bound for all algorithms.
The problem (3.10) belongs to a general class of functions called Laplacian systems and it can be solved
in nearly linear time using spectral graph theory.
Chapter 4
Reduction
The extreme points of this polytope can be shown to exactly correspond to the indicator vectors of spanning
trees of G. Thus, the optimization oracle in this case is to simply nd a maximum cost spanning tree.
Exercise 4.6. Design a membership oracle for the spanning tree polytope.
1 We also omit the fth oracle VIOL dened by [28], which checks whether the convex set satises a given inequality or gives
a violating point in the convex set, since this is equivalent to the OPT oracle below up to a logarithmic factor.
2 We use a slightly dierent denition than [28] for clarity.
46
4.1. Equivalences between Oracles 47
−−−→ is Õ(1)
V AL(K) = δK ∗ M EM (K) = δK
−−→ is Õ(n)
Figure 4.1: The relationships among the four oracles for convex sets. The arrows are implications.
Note that f ∗ is convex because it is the supremum of linear functions. Also, we have f ∗ (0) = − inf x∈Rn f (x).
Note that3 δK
∗
(c) = supx∈K c⊤ x. Therefore, the validity oracle for δK is simply the evaluation oracle for δK ∗
.
The following lemma shows that the optimization oracle is simply the (sub)gradient oracle for δK . We use
∗
∇f to represent subgradient.
Lemma 4.10. For any continuous function f with dierentiable f ∗, we have that ∇f ∗ (θ) = arg maxx θ⊤ x −
f (x).
Proof. First we observe that the supremum is achieved. Fix θ. Let gθ (x) = θ⊤ x − f (x). We assume
supx gθ (x) = f ∗ (θ) is nite. Let ϵ > 0. Then, gθ (x) is a continuous function, so S = gθ−1 ([f ∗ (θ) − ϵ, ∞)) is a
closed set. The set S is not empty because there is some x for which gθ (x) ≥ supz gθ (z) − ϵ. Now suppose
for a contradiction that S is not bounded. Then, there exists a sequence xi ∈ Rn such that ∥xi ∥ ≥ i and
gθ (xi ) ≥ f ∗ (θ) − ϵ for all i. By the compactness of the unit sphere, we may assume by taking a subsequence
that xi /∥xi ∥ → u for some unit vector u ∈ Rn . Then, u⊤ xi /∥xi ∥ → u⊤ u = 1, and since ∥xi ∥ → ∞, we have
u⊤ xi → ∞. Since gθ+u (xi ) = gθ (xi )+u⊤ xi ≥ f ∗ (θ)−ϵ+u⊤ xi , this means that gθ+u (xi ) → ∞, contradicting
f ∗ (θ + u) being nite. Hence, S is bounded and thus compact, so gθ (x) attains its maximum in S , and thus
in Rn .
Let xθ ∈ arg supx θ⊤ x − f (x). By denition, we have that f ∗ (θ) = θ⊤ xθ − f (xθ ) and that f ∗ (η) ≥
η xθ − f (xθ ) for all η . Therefore,
⊤
θ (η − θ) for all η.
f ∗ (η) ≥ f ∗ (θ) + x⊤
Therefore, xθ ∈ ∇f ∗ (θ).
3 Recall that δC (x) = 0 if x ∈ C and +∞ otherwise.
4.1. Equivalences between Oracles 48
GRAD(f ∗ ) GRAD(f )
−−−→ is Õ(1)
EV AL(f ∗ ) EV AL(f )
−−→ is Õ(n)
Figure 4.2: This illustrates the relationships of oracles for a convex function f and its convex conjugate f ∗.
where H is the set of supporting planes of epif and contains all ane lower bounds on f , namely, f (x) ≥
θ⊤ x − b for all x. Alternatively, we can write
\
(x, t) : t ≥ θ⊤ x − b .
epi(f ) =
H={(θ,b): ∀x, f (x)≥θ ⊤ x−b}
For a xed θ, any feasible b satises b ≥ θ⊤ x − f (x) for all x. So, the smallest feasible value satises
Hence,
f (x) = sup θ⊤ x − b∗ = sup θ⊤ x − f ∗ (θ) = f ∗∗ (x).
(θ,b)∈H θ
∥x∥2
Exercise 4.15. Let f be a convex function with closed epigraph. Show that f = f ∗ i f (x) = 2 .
Since we can use the gradient, ∇f , to compute the gradient of the dual, ∇f ∗ (via cutting plane method),
the involution property shows that we can do the reverse use ∇f ∗ to compute ∇f . Going back to the
example about conv({ai }), since we know how to compute maxx∈conv({ai }) θ⊤ x = δconv({a
∗
i })
(θ), this reduction
gives us a way to separate conv({ai }), or equivalently, to compute the (sub)gradient of δconv({a
∗
i })
. This is
formalized in the next exercise.
Exercise 4.16. Show how to implement the separation oracle SEP for a convex set K given access to an
optimization oracle OPT for K .
Recall that for any linear space X , X ∗ denotes the dual space, i.e., the set of all linear functions on X
and that under mild assumptions4 , we have X ∗∗ = X . Therefore, there are two natural coordinate systems
to record a convex function, the primal space X and the dual space X ∗ . Under these coordinate systems,
we have the dual functions f and f ∗ .
Theorem 4.18 ([6]). Given a transformation T that maps the set of lower semi-continuous5 convex functions
onto itself such that TTϕ = ϕ and ϕ ≤ ψ =⇒ T ϕ ≥ T ψ for all lower-semi-continuous convex functions
ϕ and ψ. Then, T is essentially the convex conjugate, namely, there is an invertible symmetric linear
transformation B, a vector v0 and a constant C0 such that
In combinatorial optimization, many convex sets are given by the convex hull for some discrete objects.
In many cases, the only known way to do the separation is via such reductions. In this chapter, we will study
the following general theorem showing that optimization can be reduced to membership/evaluation with a
quadratic overhead in dimension for the number of oracle queries.
Theorem 4.19. Let K be a convex set specied by a membership oracle, a point x 0 ∈ Rn , and numbers
0 < r < R such that B(x0 , r) ⊆ K ⊆ B(x0 , R). For any f given by an evaluation
convex function oracle and
any ϵ > 0, there is a randomized algorithm that computes a point z ∈ B(K, ϵ) such that.
f (z) ≤ min f (x) + ϵ max f (x) − min f (x)
x∈K x∈K x∈K
2 nR
2
with constant probability using O n log ϵr calls to the membership oracle and evaluation oracle and
O(n3 logO(1) nR
ϵr ) total arithmetic operations.
which only takes n + 1 calls to the evaluation oracle (for computing f (x), f (x + he1 ), · · · , f (x + hen )). The
only issue is that the convex function may not be dierentiable. However, any convex Lipschitz function
is twice dierentiable almost everywhere (see the proof below). Therefore, we can simply perturb x with
random noise, then apply a nite dierence. To see the idea more precisely, we rst observe that the norm
of the Hessian can be bounded in expectation for a Lipschitz function. Note that this is Lipschitzness of the
function, not its gradient. The proof below uses the basic fact that the gradient is dened almost everywhere
for Lipschitz functions.
Lemma 4.20. For any L-Lipschitz convex function f dened in a unit ball, we have ∇2 f (x) exists almost
everywhere and that Ex∈B(0,1) ∥∇2 f (x)∥F ≤ nL.
Proof. The existence almost everywhere is classical (Alexandrov Theorem, see e.g., [59]; see also Rademacher's
theorem about the existence of the derivative almost everywhere for Lipschitz functions). We will only prove
the part Ex∈B(0,1) ∥∇2 f (x)∥F ≤ nL. Since ∇2 f ⪰ 0 (where dened), we have ∥∇2 f (x)∥F ≤ tr∇2 f (x).
Therefore, Z Z Z
∥∇2 f (x)∥F dx ≤ tr∇2 f (x)dx = ∆f (x)dx.
B(0,1) B(0,1) B(0,1)
Using Stokes' Theorem, and letting ν(x) be the normal vector at x, we have
Z Z
∆f (x)dx = ⟨∇f (x), ν(x)⟩ dx ≤ |∂B(0, 1)| · L.
B(0,1) ∂B(0,1)
Hence, we have
|∂B(0, 1)|
Ex∈B(0,1) ∥∇2 f (x)∥F ≤ L = nL.
|B(0, 1)|
[59]
Lemma 4.21 ([42]). Let B∞ (x, r) = {y : ∥x − y∥∞ ≤ r}. For any 0 < r2 ≤ r1 and any convex function f
dened on B∞ (x, r1 + r2 ) with ∥∇f (z)∥∞ ≤ L for any z ∈ B∞ (x, r1 + r2 ) we have
r2
Ey∈B∞ (x,r1 ) Ez∈B∞ (y,r2 ) ∥∇f (z) − g(y)∥1 ≤ n3/2 L
r1
Since ωi (z)dz = 0, the Poincaré inequality for a box (Theorem 4.25 below) shows that
R
B∞ (y,r2 )
Z Z
|ωi (z)| dz ≤ r2 ∥∇ωi (z)∥2 dz
B∞ (y,r2 ) B∞ (y,r2 )
Z
2
= r2
∇ f (z)ei
dz
2
B∞ (y,r2 )
XZ √
Z
2
|ωi (z)| dz ≤ nr2
∇ f (z)
dz
F
i B∞ (y,r2 ) B∞ (y,r2 )
Therefore, we have
√
Ez∈B∞ (y,r2 ) ∥∇f (z) − g(y)∥1 ≤ nr2 Ez∈B∞ (y,r2 ) ∆f (z)
√
= nr2 ∆h(y)
where h = (2r12 )n f ∗ χB∞ (0,r2 ) where χB∞ (0,r2 ) is 1 on the set B∞ (0, r2 ) and 0 on outside.
Integrating by parts, we have that
Z Z
∆h(y)dy = ⟨∇h(y), n(y)⟩ dy
B∞ (x,r1 ) ∂B∞ (x,r1 )
2
where ∆h(y) = i ddxh2 (y) and n(y) is the normal vector on ∂B∞ (x, r1 ) the boundary of the box B∞ (x, r1 ),
P
i
i.e. standard basis vectors. Since f is L-Lipschitz with respect to ∥·∥∞ so is h, i.e. ∥∇h(z)∥∞ ≤ L. Hence,
we have that
Z
1 1 nL
Ey∈B∞ (x,r1 ) ∆h(y) ≤ n
∥∇h(y)∥∞ ∥n(y)∥1 dy ≤ n
· 2n(2r1 )n−1 · L = .
(2r1 ) ∂B∞ (x,r1 ) (2r1 ) r1
Exercise 4.22. For a Lipschitz function f : Rn → R,and h = f ∗ χB∞ (0,1/2) , prove that h is Lipschitz and
∆h(y) = Ez∼B∞(y,1/2) (∆f (z)).
4.2. Gradient from Evaluation via Finite Dierence 52
This lemma shows that we can implement an approximate gradient oracle (GRAD) using an evaluation
oracle (EVAL) even for non-dierentiable functions. By the involution property again, this completes all the
reductions in Figure 4.2. With the above fact asserting that, on average, the gradient is approximated by its
average in a small ball, we now proceed to construct an approximate subgradient, using only an approximate
evaluation oracle. The parameter r2 in the algorithm is chosen to optimize the nal error of the output. By
making the ratio r2 /r1 suciently small, we can get a desired error for the subgradient.
Lemma 4.23. Let r1 > 0 and f be a convex function. Suppose that ∥∇f (z)∥∞ ≤ √L for any z ∈
B∞ (x, 2r1 ) and suppose that we can evaluate f to within ε additive error forε ≤qr1 nL. Let g̃ =
Lε 5/4
SubgradConvexFunc(f, x, r1 , ε). Then, there is random variable ζ ≥ 0 with Eζ ≤ 2 r1 n such that
for any y
f (y) ≥ f (x) + ⟨g̃, y − x⟩ − ζ ∥y − x∥∞ − 4nr1 L.
Proof. We assume that f is twice dierentiable. For general f , we can reduce to this case by viewing it as
a limit of twice-dierentiable functions.
First, we assume that we can compute f exactly, namely ε = 0. Fix i ∈ [n]. Let g(y) be the average of
4.2. Gradient from Evaluation via Finite Dierence 53
∇f over B∞ (y, r2 ). Then, for the function g̃ computed by the algorithm, we have that
f (βi ) − f (αi )
Ez |g̃i − g(y)i | = Ez
− g(y)i
2r2
Z
1 df
≤ Ez (z + sei ) − g(y)i ds
2r2 dxi
df
= Ez
(z) − g(y)i
dxi
where we used that both z + sei and z are uniform distribution on B∞ (y, r2 ) in the last line. Hence, we have
Ez ∥g̃ − ∇f (z)∥1 ≤ Ez ∥∇f (z) − g(y)∥1 + Ez ∥g̃ − g(y)∥1 ≤ 2Ez ∥∇f (z) − g(y)∥1 .
Now, using f is L-Lipschitz between x and z , we have that f (z) ≥ f (x) − L · ∥x − z∥1 . Hence, we have
Note that ∥x − z∥1 ≤ n · ∥x − z∥∞q≤ n(r1 + r2 ) by assumption. Moreover, we can apply Lemma 4.21 to
bound ∥∇f (z) − g̃∥1 and use r2 = √εrnL
1
≤ r1 to get
Note that if we happened to have an exact oracle for f , then we can make r2 arbitrarily small.
Theorem 4.25 (L1 -Poincaré inequality). Let Ω be connected, bounded and open. Then the following (best-
possible) inequality holds for any smooth function f : Ω → R:
2|S||Ω \ S|
Z
f − 1
f (x) dx
≤ sup ∥∇f ∥L1 (Ω)
|Ω| Ω
1
L (Ω) S⊂Ω |∂S||Ω|
where the supremum is over all subsets S s.t. S and Ω\S are both connected.
Exercise 4.26. Prove the inequality in Theorem 4.25 using the classical coarea formula.
4.3. Separation via Membership 54
Note that d + αx (d)x is the last point in K on the line through d ∈ K in the direction of x, and −hx (d) is
the ℓ2 distance from this boundary point to d (see Fig.4.3).
The output of the algorithm for separation is a halfspace that approximately contains K , and the input
point x is close to its bounding hyperplane. It uses a call to the subgradient function above.
We now proceed to analyze the height function.
Lemma 4.28. hx (d) is convex on K.
Proof. Let d1 , d2 ∈ K and λ ∈ [0, 1]. Now d1 + αx (d1 )x ∈ K and d2 + αx (d2 )x ∈ K . Consequently,
[λd1 + (1 − λ)d2 ] + [λ · αx (d1 ) + (1 − λ) · αx (d2 )] x ∈ K .
def
Therefore, if we let d = λd1 +(1−λ)d2 we see that αx (d) ≥ λ·αx (d1 )+(1−λ)·αx (d2 ) and hx (λd1 +(1−λd2 ) ≤
λhx (d1 ) + λhx (d2 ) as claimed.
4.3. Separation via Membership 55
Lemma 4.29. hx is
R+δ
r−δ -Lipschitz over points in B2 (0, δ) for any δ < r.
Proof. Let d1 , d2 be arbitrary points in B(0, δ). We wish to upper bound |hx (d1 ) − hx (d2 )| in terms of
∥d1 − d2 ∥2 . We assume without loss of generality that αx (d1 ) ≥ αx (d2 ) and therefore
|hx (d1 ) − hx (d2 )| = |αx (d1 ) ∥x∥2 − αx (d2 ) ∥x∥2 | = (αx (d1 ) − αx (d2 )) ∥x∥2 .
Consequently, it suces to lower bound αx (d2 ). We split the analysis into two cases.
Case 1: ∥d2 − d1 ∥2 ≥ r − δ . Since 0 ≥ hx (d1 ), hx (d2 ) ≥ −R − δ , we have that
R+δ
|hx (d1 ) − hx (d2 )| ≤ R + δ ≤ ∥d2 − d1 ∥2 .
r−δ
Case 2: ∥d2 − d1 ∥2 ≤ r − δ . We consider the point d3 = d1 + d2 −d λ
1
with λ = ∥d2 − d1 ∥2 /(r − δ). Note
that
1 1
∥d3 ∥2 ≤ ∥d1 ∥2 + ∥d2 − d1 ∥2 ≤ δ + ∥d2 − d1 ∥2 ≤ r.
λ λ
Hence, d3 ∈ K . Since λ ∈ [0, 1] and K is convex, we have that λ · d3 + (1 − λ) · [d1 + αx (d1 )x] ∈ K . Now,
we note that
λ · d3 + (1 − λ) · [d1 + αx (d1 )x] = d2 + (1 − λ) · αx (d1 )x
4.3. Separation via Membership 56
∥d2 − d1 ∥2 R+δ
|hx (d1 ) − hx (d2 )| = (αx (d1 ) − αx (d2 )) · ∥x∥2 ≤ αx (d1 ) · ∥x∥2 ≤ ∥d2 − d1 ∥2 .
r−δ r−δ
In either case, as claimed we have
R+δ
|hx (d1 ) − hx (d2 )| ≤ ∥d2 − d1 ∥2 .
r−δ
The next lemma shows that hx gives us a way to implement an approximation separation oracle and
only needs access to an approximation evaluation oracle for hx which in turn only needs an approximate
membership oracle for K . To be precise, we dene approximate oracles.
Denition 4.30 (Separation Oracle (SEP)). Queried with a vector y ∈ Rn and real numbers δ, δ ′ > 0, with
probability at least 1 − δ ′ , the oracle either
assert that y ∈ B(K, δ), or
nd a unit vector c ∈ Rn such that cT x ≤ cT y + δ for all x ∈ B(K, −δ).
We let SEPδ,δ′ (K) be the time complexity of this oracle.
Denition 4.31 (Membership Oracle (MEM)). Queried with a vector y ∈ Rn and real numbers δ, δ ′ > 0,
with probability at least 1 − δ ′ , either
assert that y ∈ B(K, δ), or
assert that y ∈
/ B(K, −δ).
We let MEMδ,δ′ (K) be the time complexity of this oracle.
We can now state the main lemma for the separation oracle. Since the algorithm is randomized, we have
a parameter ρ ∈ (0, 1) to denote the probability of failure.
Lemma 4.32. Let K be a convex set satisfying B2 (0, r) ⊂ K ⊂ B2 (0, R). Given any 0 < ρ < 1 and
0 ≤ ε ≤ r. With probability 1 − ρ, Separateε,ρ (K, x) outputs a halfspace that contains K .
Proof. When x ∈/ B2 (0, R), the algorithm outputs a valid separation for B2 (0, R). For the rest of the proof,
we assume x ∈/ B(K, −ε) (due to the membership oracle) and x ∈ B2 (0, R).
By Lemma 4.28 and Lemma 4.29, hx is convex with Lipschitz constant 3κ on B2 (0, 2r ). By our assumption
on ε and our choice of r1 , we have that B∞ (0, 2r1 ) ⊂ B2 (0, 2r ). Hence, we can apply Lemma 4.23 to get that
Therefore, we have
⟨g̃, x⟩ ≥ ∥x∥2 − ζ ∥x∥∞ − 12nr1 κ2 . (4.3)
Now, we note that x ∈
/ B(K, −ε). Using that B(0, r) ⊂ K , we have (1 − rε )K ⊂ B(K, −ε). Hence,
ε
hx (0) ≥ − 1 − ∥x∥2 ≥ − ∥x∥2 .
r
Therefore, we have
hx (0) + ⟨g̃, x⟩ ≥ −ζ ∥x∥∞ − 12nr1 κ2
4.3. Separation via Membership 57
for any qy ∈ K . Recall from Lemma 4.23 that ζ is a positive random scalar independent of y satisfying
Eζ ≤ 2 3κε r1 n
5/4
. For any y ∈ K , we have that hx (y) ≤ 0 and hence ζ̃ ≥ ⟨g̃, y − x⟩ where ζ̃ is a random
scalar independent of y satisfying
r
3κε 5/4
Eζ̃ ≤ 4 n R + 24nr1 κ2
r1
≤ 31n7/6 R2/3 ε1/3 κ.
where we used r1 = n1/6 ε1/3 R2/3 /κ and 0 ≤ ε ≤ r. The result then follows using Markov's inequality.
Exercise 4.33. Suppose we can evaluate the subgradient of hx exactly for a convex set K containing the
origin. Give a short proof that for any x ̸∈ K , we have ⟨∇hx (0), y − x⟩ ≤ 0 for all y ∈ K .
Theorem 4.34. Let K B2 (0, 1/κ) ⊂ K ⊂ B2 (0, 1).
be a convex set satisfying For any 0≤η< 1
2 , we have
that
nκ
SEPη (K) ≤ O n log MEM(η/nκ)O(1) (K).
η
Proof. First, we bound the running time. Note that the bottleneck is to compute hx with δ additive error.
Since −O(1) ≤ hx (y) ≤ 0 for all y ∈ B2 (0, O(1)), one can compute hx (y) by binary search with O(log(1/δ))
calls to the membership oracle.
Next, we check that Separateδ,ρ (K, x) is indeed a separation oracle. Note that g̃ may not be an unit
vector and we need to re-normalize the g̃ by 1/ ∥g̃∥2 . So, we need to a lower bound ∥g̃∥2 .
3
From (4.3) and our choice of r1 , if δ ≤ 106ρn6 κ6 , then we have that
r
⟨g̃, x⟩ ≥ ∥x∥2 − ζ ∥x∥∞ − 12nr1 κ2 ≥ .
4
Hence, we have that ∥g̃∥2 ≥ 4κ 1
. Therefore, this algorithm is a separation oracle with error 400 7/6 2 1/3
ρ n κ δ
and failure probability O(ρ + log(1/δ)δ).
nκ
SEPη (K) ≤ O(log( ))MEMη6 /(n7/2 κ6 ) (K).
η
Remark. In practice, the runtime is roughly 2T assuming we have enough memory. Check out google/jax
for a modern implementation.
Before proving it formally, we rst go through an example. Consider the function f (x1 , x2 ) = sin(x1 /x2 )+
x1 x2 . We use xi to denote both the input and all intermediate variables. Then, we can write the program
in T = 6 steps:
4.3. Separation via Membership 58
with simple functions fi whose derivatives we know how to compute. The key idea is compute ∂x ∂f
1
not just
for the inputs x1 and x2 , but also for all intermediate variables. Here, we use ∂xi to denote the derivative of
∂f
f with respect to xi while xing x1 , x2 , · · · , xi−1 (and other inputs if xi is an input). For the example above,
suppose we want to compute ∇f (π, 2), we can simply compute rst compute all xi from i = 1, 2, · · · , 6, then
∂xi in the reverse order from i = 6, 5, · · · , 1:
∂f
x1 = π , x2 = 2, x3 = π/2, x4 = sin(x3 ) = 1, x5 = x1 x2 = 2π , x6 = x4 + x5 = 2π + 1.
∂x
∂f
6
= 1, ∂x
∂f
5
= ∂(x∂x
4 +x5 )
5
= 1, ∂x∂f
4
= ∂(x∂x
4 +x5 )
4
= 1,
∂x3 = ∂x4 ∂x3 = 1 · cos(x3 ) = 0,
∂f ∂f ∂x4
∂x
∂f
2
∂f ∂x3
= ∂x 3 ∂x2
∂f ∂x5
+ ∂x 5 ∂x2
= 0 · (− xx21 ) + 1 · x1 = π ,
2
∂x
∂f
1
∂f ∂x3
= ∂x 3 ∂x1
∂f ∂x5
+ ∂x 5 ∂x1
= 0 · ( x12 ) + 1 · x2 = 2.
The general case is similar. See AutoDifferentiation for the algorithm.
Algorithm 12: AutoDifferentiation
Input: a function f (x1 , x2 , · · · , xn ) given by f (x1 , x2 , · · · , xn ) = xm and
xi = fi (x1 , · · · , xi−1 ) for i = n + 1, n + 2, · · · , m
for i = n + 1, n + 2, · · · , m do
Compute xi = fi (x1 , · · · , xi−1 ).
end
Let ∂f
∂xm= 1.
for i = m − 1, · · · , 1 do
Let Li be the set of j such that fj depends on xi (i.e. xj directly depends on xi ).
∂f ∂xj
Compute ∂x ∂f
.
P
i
= j∈Li ∂x j ∂xi
end
∂f ∂xj
We prove by induction that the formula ∂f
is correct. For the
P
Proof of Theorem 4.35. ∂xi = j∈Li ∂xj ∂xi
base case i = m, we have f = xm and hence ∂f
∂xm= 1. For the induction, we let Li = {xj1 , xj2 , · · · , xjk }. If
we x variables x1 , x2 , · · · , xi−1 , then f is a function of xi (and of other inputs if xi is an input). Since only
xj1 , xj2 , · · · , xjk depend on xi , we can also view f as a function of xj1 , xj2 , · · · , xjk . More precisely, we have
f (xi ) = f (xj1 (xi , xj−1 ), xj2 (xi , xj−2 ), · · · , xjk (xi , xj−k ))
where we use xj−1 to denote the variables xj2 , xj3 , · · · , xjk . By chain rule, we have
∂f X ∂f ∂xj
= .
∂xi ∂xj ∂xi
j∈Li
To bound the runtime, we dene the computation graph G be a graph on x1 , x2 , · · · , xm such that i → j
if fj depends on xi . Note that each edge is examined O(1) times whether evaluating f or its gradient. Hence,
the cost of computing f and the cost of our algorithm are both Θ(m) where m is the number of edges in G.
This completes the proof.
We note that an ecient implementation of the chain rule is the heart of the backpropagation algorithm
fo neural networks. To conclude this section, we see that Theorem 4.35 can be surprisingly useful even for
some simple explicit functions.
Corollary 4.36. If we can compute f (A) = det A exactly in time T, then we can compute A−1 exactly in
O(T ).
4.4. Composite Problem via Duality 59
Proof. Note that ∂A∂ij det A = adj(A)ji = det A · (A−1 )ji . Hence, ∇ log det A = A−⊤ . Theorem 4.35 shows
that computing A−⊤ can be done as fast as det A.
where we used h = h∗∗ in the rst line, the following minimax theorem on the second line, and the denition
of g ∗ on the third line.
Theorem 4.37 (Sion's minimax theorem). X ⊂ Rn be a compact convex set and Y ⊂ Rm be a
Let convex
set. If f : X × Y → R ∪ {+∞} such that f (x, ·) is upper semi-continuous and quasi-concave on Y for all
x ∈ X and f (·, y) is lower semi-continuous and quasi-convex on X for all y ∈ Y . Then, we have
Remark. Compactness is necessary. Consider f (x, y) = x + y . This theorem generalizes Von Neumann's
minimax theorem.
We call g(x) + h(Ax) the primal problem and g ∗ (−A⊤ θ) + h∗ (θ) the dual problem. Often, the dual
problem gives us some insight on the primal problem. However, we note that there are many ways to split
a problem into two and hence many candidate dual problems.
Example 4.38. Consider the unit capacity ow problem on a graph G = (V, E):
max c⊤ f
Af =d,−1≤f ≤1
where f ∈ RE is the ow vector, A ∈ RV ×E encodes the vertex-edge adjacency matrix with two nonzeros
per column, d is the demand vector so that Af = d is ow conservation, and c is the cost vector. We can
write the dual as follows:
When c = 0 and d = F · 1st this is the maximum ow problem with ow value F , and the dual problem
is the minimum s − t cut problem with the cut given by {v ∈ V such that ϕ(v) ≥ t}. We can view ϕ as
assigning a potential to every vertex of the graph. Note that there are |E| variables in primal and |V |
variables in dual. So, in this sense, the dual problem is easier for dense graphs. Although we do not have a
way to turn a minimum s − t cut to a maximum s − t ow in general, we will see various tricks to reconstruct
the primal solution from the dual solution by modifying the problem.
naively on the primal problem, we would get Õ(n2 (Z + n4 )) time algorithm for the primal (because there
are n2 variables) and Õ(m(Z + nω + mP 2
)) for the dual where Z is the total number of non-zeros in Ai . (The
term Z + n is the cost of computing
ω
yi Ai and nding its minimum eigenvalue.) Generally, n2 ≫ m and
hence it takes much less time to solve the dual.
We note that
Pm min b⊤ y = Pm min b⊤ y.
i=1 yi Ai ⪰C v⊤ ( i=1 yi Ai −C)v≥0 ∀∥v∥2 =1
In each step of the cutting plane method, the (sub)gradient oracle either outputs b or outputs one of the
cutting planes
Xm
v⊤ ( yi Ai − C)v ≥ 0.
i=1
Let S be the set of all cutting planes used in the algorithm. Then, the proof of the cutting plane method
shows that
Pm min b⊤ y = Pm min b⊤ y ± ε. (4.6)
i=1 yi Ai ⪰C v⊤ ( i=1 yi Ai −C)v≥0 ∀v∈S
The key idea to obtaining the primal solution is to take the dual of the right hand side (which is an
approximate dual of the original problem). Now, we have
X Xm
min b⊤ y = min max b⊤ y − λv v ⊤ ( yi Ai − C)v
v⊤ ( m y λv ≥0
P
i=1 yi Ai −C)v≥0 ∀v∈S i=1
v∈S
X m
X X
= max min C • λv vv ⊤ + b⊤ y − yi λv vv ⊤ • Ai
λv ≥0 y
v∈S i=1 v∈S
m
X
= P max min C •X + yi (bi − X • Ai )
X= v∈S λv vv ⊤ ,λv ≥0 y
i=1
= max C • X.
X= v∈S λv vv ⊤ ,λv ≥0,X•Ai =bi
P
Note
P that this is exactly the primal SDP problem, except that we restrict the set of solutions to the form
v∈S λv vv with λv ≥ 0. Also, we can write this problem as a linear program:
⊤
X
max λv v ⊤ Cv. (4.7)
λv v ⊤ Ai v=bi for all i,λv ≥0
P
v v
Therefore, we can simply solve this linear program and recover an approximate solution for the SDP. By
(4.6), we know that this is an approximate solution with the same guarantee as the dual SDP.
Now, we analyze the runtime of this algorithm. This algorithm contains two phases: solve the dual SDP
via cutting plane method, and solve the primal linearPmprogram. Note that each step of the cutting plane
method involves nding a separating hyperplane of i=1 yi Ai ⪰ C .
Exercise 4.39. Let Ω = {y ∈ Rm : m i=1 yi Ai ⪰ C}. Show that one can implement the separation oracle
P
in time O∗ (Z + nω ) via eigenvalue computation.
Therefore, the rst phase takes O∗ (m(Z + nω + m2 )) time in total. Since the cutting plane method
takes O∗ (m) steps, we have |S| = O∗ (m). In the second phase, we need to solve a linear program (4.7) with
O∗ (m) variables with O(m) constraints. It is known how to solve such linear programs in time O∗ (m2.38 )
[18]. Hence, the total cost is dominated by the rst phase
O∗ mZ + mnω + m3 .
Problem 4.40. In the rst phase, each step involves computing an eigenvector of similar matrices. So, can
we use matrix update formulas to decrease
the cost per step in the cutting plane to O (Z + n )? Namely,
∗ 2
This is the weighted b-matching problem. Typically, the number of students is much more than the number
of schools. Therefore, an algorithm with running time linear in the number of students is preferable. To
apply our framework, we let
X
K1 = {x ∈ RE , xe ≥ 0, x(a,b) ≤ 1 ∀a ∈ V1 },
(a,b)∈E
V2
K2 = {y ∈ R , yb ≤ cb },
= minn max
m
c⊤ x + δK1 (x) + θ⊤ M x − δK
∗
2
(θ)
x∈R θ∈R
= max
m
minn c⊤ x + δK1 (x) + θ⊤ M x − δK
∗
2
(θ)
θ∈R x∈R
∗
= max
m
−δK 1
(−c − M ⊤ θ) − δK
∗
2
(θ).
θ∈R
Taking the dual has two benets. First, the number of variables is smaller. Second, the gradient oracle is
something we can compute eciently. Hence, cutting plane methods can be used to solve it in O∗ (mT + m3 )
where T is the time to evaluate ∇δK∗
1
and ∇δK
∗
2
. The only problem left is to recover the primal x.
The key observation is the following lemma:
Lemma 4.41. Let xi ∈ K1 be the set of points output by the oracle
∗
∇δK 1
during the cutting plane method.
Dene yi ∈ K2 similarly. Suppose that the cutting plane method ends with the guarantee that the additive
4.4. Composite Problem via Duality 63
where K
f1 = conv(xi ) and K
f2 = conv(yi ).
Proof. Let θi be the set of directions queried by the oracle for ∇δK ∗
1
and φi be the directions queried by
the oracle for ∇δK2 . We claim that xi ∈ ∇δK
∗ ∗
f1 (θ i ) and y i ∈ ∇δ ∗
K
f2 (φ i ) . Having this, the algorithm cannot
distinguish between K1 and K1 , and between K2 and K2 . Hence, the algorithm runs exactly the same, i.e.,
f f
uses the same sequence of points. Therefore, we get the same value c⊤ x. However, by the guarantee of
cutting plane method, we have that
min c⊤ x ≤ c⊤ x ≤ min c⊤ x + ε.
x∈K
f1 ,T x∈K
f2 x∈K1 ,T x∈K2
This reduces the problem into the form minx∈K f2 c x. For the second phase, we let zi = M xi ∈ R .
f1 ,T x∈K
⊤ m
Then, we have
X
min c⊤ x = min
P P c⊤ ( t i xi )
x∈K
f1 ,M x∈K
f2 ti ≥0,si ≥0,M i ti x i = i si yi
i
X
= min P
P ti · c⊤ xi .
ti ≥0,si ≥0, i ti zi = i si yi
Note that it takes O∗ (mZ) time to write down this linear program where Z is the number of non-zeros in
M . Next, we note that this linear program has O∗ (m) variables and m constraints. Therefore, we can solve
it in O∗ (m2.38 ) time.
Therefore, the total running time is
O∗ (m(T + m2 ) + (mZ + m2.38 )).
To conclude, we have the following theorem.
Theorem 4.42. Given convex sets K1 ⊂ Rn and K2 ⊂ Rm with m ≤ n and a matrix M : Rn → Rm with
∗ ∗
Z non-zeros, let T be the cost to compute ∇δK and ∇δK . Then, we can solve the problem
1 2
min c⊤ x
x∈K1 ,M x∈K2
in time O∗ (mT + mZ + m3 ).
Remark. We hid all sorts of terms in the log term hidden in O∗ such as the diameter of the set. Also this is
the number of arithmetic operations, not the bit complexity.
Going back to the school/student problem, this algorithm gives a running time of
O∗ (|V2 ||E| + |V2 |3 )
which is linear in the number of students!
In general, this statement says that if we can split a convex problem into two parts, with both being easy
to solve and one part having fewer variables, then we can solve the entire problem in time depending on the
smaller dimension.
Exercise 4.43. How fast can we solve minx∈∩ki=1 Ki c⊤ x given the oracles ∇δK
∗
i
?
Chapter 5
Geometrization
In this chapter, we study techniques that further exploit the geometry of convex functions and associated
norms. Many of these techniques are eective in practice for large scale problems. We recall the following
relevant denitions.
For the usual Euclidean norm, the convex body is the Euclidean ball, B2n . Similarly, for an ℓp norm, the
convex body is the unit ℓp ball.
It is often useful to consider ane transformations, and the norms they induce, e.g., for a PSD matrix
A, we can dene the associated norm as follows:
√
∥x∥A = x⊤ Ax.
The convex body (unit ball) of this norm is an ellipsoid centered at zero and dened by the matrix A.
As we have encountered previously in this book, convex sets and functions have duals. For a convex body
K , the dual (or polar) is dened as follows:
K ∗ = {y : ∀x ∈ K, ⟨x, y⟩ ≤ 1} .
The dual norm for ∥.∥K is ∥.∥K ∗ . We can state a generalized Cauchy-Schwarz inequality.
Fact 5.1. For x, y ∈ Rn and any centrally symmetric convex body K, we have ⟨x, y⟩ ≤ ∥x∥K ∥y∥K ∗ .
In this chapter, an important idea will be local norms, i.e., at each point x in the domain, there could
be a dierent norm. Indeed, in p a Riemannian metric M, for every x ∈ M,there is a matrix A(x) s.t. the
norm at x is dened as ∥v∥x = v ⊤ A(x)v. A special class of Riemannian metrics of particular interest for
us will be Hessian metrics (corresponding to Hessian manifolds). Here the matrix dening the local norm is
the Hessian of a convex, twice-dierentiable function, i.e.,
A(x) = ∇2 ϕ(x)
64
5.2. Mirror Descent 65
Now, we are ready to analyze the subgradient method. It basically involves tracking the squared distance
to the optimum, ∥x(k+1) − x∗ ∥22 .
Theorem 5.4. Let f be a convex function that is G-Lipschitz in ℓ2 norm. After T steps, the subgradient
method outputs a point x such that
∥x(0) − x∗ ∥22 h
f (x) ≤ f (x∗ ) + + G2
2hT 2
where x∗ is any point that minimizes f over D.
∥x(0) −x∗ ∥2
Remark. If the distance ∥x(0) − x∗ ∥ and the Lipschitz constant G are known, we can pick h = √
G T
and get
G · ∥x(0) − x∗ ∥2
f (x) ≤ f (x∗ ) + √ .
T
5.2. Mirror Descent 66
where we used Lemma 5.3 in the inequality. Since x∗ lies on the −g (k) direction, we expect x(k+1) is closer
to x∗ than x(k) if the step size is small enough (or if we ignore the second order term ∥g (k) ∥22 ). To bound
the distance improvement, we apply the denition of subgradient and get
D E
f (x∗ ) ≥ f (x(k) ) + g (k) , x∗ − x(k) .
Note that this equation shows that if the error f (x(k) ) − f (x∗ ) is larger, then we move faster towards the
optimum. Rearranging the terms, we have
1 (k) h
f (x(k) ) − f (x∗ ) ≤ ∥x − x∗ ∥22 − ∥x(k+1) − x∗ ∥22 + G2 .
2h 2
We sum over all iterations, to get
T −1
1 X 1 1 h
(f (x(k) ) − f (x∗ )) ≤ · (∥x(0) − x∗ ∥22 − ∥x(T ) − x∗ ∥22 ) + G2
T T 2h 2
k=0
∥x(0) − x∗ ∥22 h
≤ + G2 .
2hT 2
The result follows from observing that for a convex function,
T −1 T −1
!
1 X (k) 1 X
f x − f (x∗ ) ≤ (f (x(k) ) − f (x∗ )).
T T
k=0 k=0
On the other hand, its gradient g lives in ℓ∞ space; the dual space of ℓ1 is ℓ∞ (in general, ℓp is dual to ℓq
where (1/p) + (1/q) = 1). Since x and g are not in the same space, the term x − ηg does not make sense.
More precisely, we have the following tautology (directly follows from the denition of dual space, namely
the set of all linear maps in the original space).
Denition 5.5. A Banach space over the reals is a vector space over the reals together with a norm
that denes a complete metric space, i.e., for any Cauchy sequence, X = (xi )∞
i=1 , there is a vector x s.t.
limn→∞ ∥xn − x∥ = 0.
Fact 5.6. Given any Banach space D over the reals and a continuously dierentiable function f from D to
R, its gradient ∇f (x) ∈ D∗ for any x.
In general, if the function f is on the primal space D, then the gradient g lives in the dual space D∗ .
Therefore, we need to map x from the primal space D to the dual space D∗ , update its position, then map
the point back to the original space D.
In fact, Lemma 5.6 gives us one such map, ∇f . Consider the following algorithm: Starting at x. We use
∇f (x) to map x to the dual space y = ∇f (x). Then, we apply the gradient step on the dual y (new) = y−∇f (x)
and map it back to the primal space, namely nding x(new) such that ∇f (x(new) ) = y (new) . Note that
y (new) = 0 and hence x(new) is exactly a minimizer of f . So, if we can map it back, this algorithm solves the
problem in one step. Unfortunately, the task of mapping it back is exactly our original problem.
Instead of using the same f , mirror descent uses some other convex function Φ, called the mirror map.
For constrained problems, the mirror map may not bring the point back to a point in D. Naively, one may
consider the algorithm
Note that the rst step of nding y (t+1) involves solving an optimization problem. We will show how to do
this optimization later (see 5.1) but with a proper formulation of the algorithm which takes into account the
distance as measured by the mirror map Φ.
Denition 5.7. For any strictly convex function Φ, we dene the Bregman divergence as
DΦ (y, x) = Φ(y) − Φ(x) − ⟨∇Φ(x), y − x⟩ .
Note that DΦ (y, x) is the error of the rst order Taylor expansion of Φ at x. Due to the convexity of Φ,
we have that DΦ (y, x) ≥ 0. Also, we note that DΦ (y, x) is convex in y , but not necessarily in x.
Example 5.8. DΦ (y, x) = ∥y − x∥2 for Φ(x) = ∥x∥2 . DΦ (y, x) = i yi log xyii − yi + xi for Φ(x) =
P P P
xi log xi .
P
Note that this is a natural generalization of x(k+1) = arg minx∈D hg (k)⊤ x + ∥x − x(k) ∥2 .
Theorem 5.10. Let f G-Lipschitz convex function on D with respect to some norm ∥ · ∥. Let Φ be
be a
a ρ-strongly convex function on D with respect to ∥ · ∥ with squared diameter R2 = supx∈D Φ(x) − Φ(x(0) ).
Then, mirror descent outputs x such that
R2 h
f (x) − min f (x) ≤ + G2 .
x hT 2ρ
Remark 5.11. We say a function f is ρ strongly convex with respect to the norm ∥ · ∥ if for any x, y , we have
ρ
f (y) ≥ f (x) + ⟨∇f (x), y − x⟩ + ∥y − x∥2 .
2
The usual strong convexity is with respect to the Euclidean norm.
q q
Remark 5.12. Picking h = G T , we get the rate f (x) ≤ minx f (x) + GR
R 2ρ 2
ρT .
where we used Lemma 5.9 in the inequality. Using the denition of subgradient, we have that
D E
f (x∗ ) ≥ f (x(k) ) + g (k) , x∗ − x(k) .
5.2. Mirror Descent 69
DΦ (x∗ , x(k+1) ) ≤ DΦ (x∗ , x(k) ) − h · (f (x(k) ) − f (x∗ )) + DΦ (x(k) , y (k+1) ) − DΦ (x(k+1) , y (k+1) ).
(hG)2
DΦ (x∗ , x(k+1) ) ≤ DΦ (x∗ , x(k) ) − h · (f (x(k) ) − f (x∗ )) + .
2ρ
Rearranging the terms, we have
1 hG2
f (x(k) ) − f (x∗ ) ≤ DΦ (x∗ , x(k) ) − DΦ (x∗ , x(k+1) ) + .
h 2ρ
Summing over all iterations, we have
T −1
1 X 1 1 h
(f (x(k) ) − f (x∗ )) ≤ · (DΦ (x∗ , x(0) ) − DΦ (x∗ , x(T ) )) + G2
T T h 2ρ
k=0
1 h
≤ DΦ (x∗ , x(0) ) + G2
hT 2ρ
R2 h 2
≤ + G .
hT 2ρ
Note that
(k) (k) (k) (k)
X X X
DΦ (x, x(k) ) = xi log xi − xi log xi − (1 + log xi )(xi − xi )
i
X xi
= xi log (k)
xi
(k)
where we used that xi . Hence, the step is simply
P P
i xi = i
X xi
x(k+1) = arg P min hg (k)⊤ x + xi log (k)
.
xi =1,xi ≥0 xi
for some normalization constant Z . Note that this algorithm multiplies the current x with a multiplicative
factor and hence it is also called multiplicative weight update.
1
Φ(y) − Φ(x) − ⟨∇Φ(x), y − x⟩ ≥ ∥y − x∥21 .
2
Therefore, Φ is 1-strongly convex in ∥ · ∥1 . Hence, ρ = 1.
Diameter Direct calculation shows that − log n ≤ Φ(x) ≤ 0. We start at x(0) = n1 (1, . . . , 1)T . Hence,
R2 = log n.
Result
Theorem
P 5.13. Let f be a 1-Lipschitz function on ∥ · ∥1 . Then, mirror descent with the mirror map
Φ(x) = i xi log xi . r
2 log n
T
f (x ) − min f (x) ≤
.
T x
5.3 FrankWolfe
Mirror descent is not suitable for all spaces. The guarantee of mirror descent crucially depends on the fact
that there is a 1-strongly convex mirror map Φ such that maxx Φ(x) − minx Φ(x) is small on the domain.
For some domains such as {x : ∥x∥∞ ≤ 1}, the range maxx Φ(x) − minx Φ(x) can be large.
Lemma 5.14. Let Φ be a 1-strongly convex function on Rn over ∥ · ∥ ∞ ≤ 1. Then,
n
max Φ(x) ≥ min Φ(x) + .
∥x∥∞ ≤1 ∥x∥∞ ≤1 2
Remark. This inequality is tight because 12 ∥x∥2 is 1-strongly convex in ∥ · ∥∞ and its value is between 0 and
2.
n
5.3.1 Algorithm
Now we give another geometry dependent algorithm that relies on a dierent set of assumptions. The
problem we study in this section is of the form
min f (x)
x∈D
for f such that ∇f is Lipschitz in a certain sense. The algorithm reduces the problem to a sequence of
linear optimization problems.
Algorithm 15: FrankWolfe
Input: Initial point x(0) ∈ Rn , step size h > 0.
for k = 0, 1, · · · , T − 1 do
Compute y (k) = arg miny∈D y, ∇f (x(k) ) .
Analysis
Theorem 5.15. Let f be a convex function on a convex set D with a constant Cf such that
1
f ((1 − h)x + hy)) ≤ f (x) + h ⟨∇f (x), y − x⟩ + Cf h2 .
2
Then, for any x, y ∈ D and h ∈ [0, 1], we have
2Cf
f (x(k) ) − f (x∗ ) ≤ .
k+2
Remark. If ∇f is L-Lipschitz with respect to the norm ∥ · ∥ over the domain D, then Cf ≤ L · diam∥·∥ (D)2 .
5.4. The Newton Method 72
where we used the fact that y (k) = arg miny∈D y, ∇f (x(k) ) and that x∗ ∈ D. Hence, we have that
1
f (x(k+1) ) ≤ f (x(k) ) − hk (f (x(k) ) − f (x∗ )) + Cf h2k .
2
Let ek = f (x(k) ) − f (x∗ ). Then,
1
ek+1 ≤ (1 − hk )ek + Cf h2k .
2
2Cf
Note that e0 = f (x(0) ) − f ∗ ≤ 12 Cf . By induction, we have that ek ≤ k+2 .
Remark. Note that this proof is in fact the same as Theorem 2.9.
Finding the zeros of the right hand side, we have the Newton step
g(x(k) )
x(k+1) = x(k) − .
g ′ (x(k) )
In high dimension, we can approximate the function by its Jacobian g(x) ∼ g(x(k) ) + Dg(x(k) )(x − x(k) ) and
this gives the step
−1
x(k+1) = x(k) − Dg(x(k) ) g(x(k) ).
When the function g(x) = ∇f (x), then the Newton step becomes
−1
x(k+1) = x(k) − ∇2 f (x(k) ) ∇f (x(k) ).
To see why ane-invariance is important for optimization, we consider the following function
100 2 1 2
f (x1 , x2 ) = x + x .
2 1 2 2
The gradient descent for this function is
(x1 , x2 ) ← (x1 , x2 ) − h∇f (x1 , x2 ) = ((1 − 100h)x1 , (1 − h)x2 ).
We need h < 100 1
in order the rst coordinate to converge, but this will make the second coordinate converges
too slowly. In general, we may want to take dierent step sizes for dierent directions and Newton method
gives the best step if the function is quadratic.
For many classes of functions, gradient methods converge to the solution linearly (namely, it takes
c · log 1ϵ iterations for some c depending on the problem) while the Newton method converges to the solution
quadratically (namely, it takes c′ · log log 1ϵ for some c′ ) if the starting point is suciently close to a root.
However, each step of Newton method involves solving a linear system, which can be much more expensive.
Furthermore, Newton method may not converges if the starting point is far away.
Algorithm 16: NewtonMethod
Input: Initial point x(0) ∈ Rn
for k = 0, 1, · · · , T − 1 do
−1
x(k+1) = x(k) − Dg(x(k) ) g(x(k) ).
end
return x(T ) .
Theorem 5.16 (Quadratic convergence). Assume that g : Rn → Rn is twice continuously dierentiable.
(k) (k) ∗
Let x be the sequence given by the Newton method. Suppose that x converges to some x such that
g(x∗ ) = 0 and Dg(x∗ ) is invertible. Then, for k large enough, we have
Hence, we have
Z 1
(k) −1 ∗
0 = Dg(x ) g(x (k)
)+x −x (k)
+ (1 − s)Dg(x(k) )−1 D2 g(x(k) + se(k) )[e(k) , e(k) ]ds.
0
Since x (k+1)
=x (k)
− Dg(x g(x ), we have
(k) −1
) (k)
Z 1
∗
x (k+1)
−x = (1 − s)Dg(x(k) )−1 D2 g(x(k) + se(k) )[e(k) , e(k) ]ds.
0
So, we have
Z 1
∗
∥x (k+1)
−x ∥≤ (1 − s)∥Dg(x(k) )−1 D2 g(x(k) + se(k) )[e(k) , e(k) ]∥ds
0
Z 1
≤ (1 − s)∥Dg(x(k) )−1 D2 g(x(k) + se(k) )∥op ds · ∥x(k) − x∗ ∥2 .
0
This argument above uses only that g ∈ C 2 and does not require convexity. Without further global
assumptions on g , the Newton method does not always converge to a root. The argument above only shows
that if the algorithm converges, then it converges quadratically eventually. We call this local convergence
since it only gives a bound when x(k) is close enough to x∗ . In comparison, all earlier analyses in this
book are about global convergence, bounding the total number of iterations. In practice, both analyses are
important; global convergence makes sure the algorithm is robust and local quadratic convergence makes sure
the algorithm converges to machine accuracy quickly√ enough. The local quadratic convergence is particularly
important for simple problems such as computing x.
λ1 ≤ x(k) ≤ λ1 + ε · (x(0) − λ1 ).
Proof. Since the roots of g are real, we can write
n
Y
g(x) = an · (x − λi )
i=1
g(x) 1
′
=P 1 .
g (x) i x−λi
applied to a degree n real-rooted polynomial, starting with x(0) > λ1 , in each iteration the distance to the
largest root decreases by a factor of 1 − n1/k
1
.
Such a dependence of log 1ϵ is called linear convergence. The convergence of the Newton method can be
quadratic, when close enough to a root.
Theorem 5.19. Assume that |f ′ (x∗ )| ≥ α at f (x∗ ) = 0 and f ' is L-Lipschitz. Then, if |x0 − x∗ | ≤ α
2L ,
2(k)
α L (k)
|x(k) − x∗ | ≤ |x − x∗ | .
L α
5.4. The Newton Method 75
Z x∗ Z x∗
f (x∗ ) = f (x(k) ) + f ′ (z)dz = f (x(k) ) + f ′ (x(k) )(x∗ − x(k) ) + (f ′ (z) − f ′ (x(k) ))dz.
xt xt
Therefore,
Z x∗
∗ 1
x (k+1)
− x = ′ (k) (f ′ (z) − f ′ (x(k) ))dz.
f (x ) xt
Z x∗
L
≤ |z − x(k) |dz
|f ′ (x(k) )| xt
L|x∗ − x(k) |2
= .
2|f ′ (x(k) )|
So, if |x(k) − x∗ | ≤ 2L ,
α
L (k) 1
|x(k+1) − x∗ | ≤ |x − x∗ |2 ≤ |x(k) − x∗ |
α 2
and 2
L (k+1) L (k)
|x − x∗ | ≤ |x − x∗ | .
α α
After t steps,
2(k)
L (k) L
|x − x∗ | ≤ ϵ0
α α
polynomial.
Moving back to optimization, we can view the goal as nding a root of ∇f (x) = 0. Newton's iteration is
the update
x(k+1) = x(k) − (∇2 f (x(k) ))−1 ∇f (x(k) ).
By the above proof, Newton's iteration has quadratic convergence from points close enough to the optimal.
Quasi-Newton Method
When the Jacobian of g is not available, we can approximate it using the function itself. For one dimension,
we can approximate the Newton method and get the following secant method
x(k) − x(k−1)
x(k+1) = x(k) − g(x(k) )
g(x(k) ) − g(x(k−1) )
g(x(k) )−g(x(k−1) )
where we approximated the g ′ (x(k) ) by x(k) −x(k−1)
. For nice enough function, the convergence rate
√
1+ 5
satises εk+1 ≤ C · εk , which is super linearly but not quadratic.
2
For higher dimension, we need to approximate the Jacobian of g . Let J (k) be the approximate Jacobian
we maintained in the k th iteration. Similar to the secant method, we want to enforce
In dimension higher than one, this does not uniquely dene the J (k+1) . One natural choice is to nd J (k+1)
that is closest to J (k) while satisfying the equation above. Solving the problem
Exercise 5.20. Prove that the equation (5.3) is indeed the minimizer of (5.2).
When g is given by the gradient of a convex function f , we know that the Jacobian of g satises
Dg = ∇2 f (x) ⪰ 0. Therefore, we should impose some conditions such that J (k) ⪰ 0 for all k .
BFGS algorithm
BroydenFletcherGoldfarbShanno (BFGS) algorithm is one of the most popular quasi-Newton methods.
The algorithm maintains an approximate Hessian J (k) such that
J (k+1) (x(k) − x(k−1) ) = ∇f (x(k) ) − ∇f (x(k−1) )
J (k+1) is close to J (k) .
J (k) ≻ 0.
To achieve all of these conditions, the natural optimization is
2
−1 1
def
1
W 2 J −1 − J (k)
J (k+1) = arg min W 2
Js=y,J=J ⊤
F
R1
where s(k) = x(k) − x(k−1) , y (k) = ∇f (x(k) ) − ∇f (x(k−1) ) and W = 0 ∇2 f (x(k−1) + s(x(k) − x(k−1) ))ds
1
(or any W such that W y = s). In some sense, the W − 2 is just a correct change of variables so that the
algorithm is ane invariant. Solving the equation above [25], one obtain the update
−1 −1
J (k+1) = (I − ρk · s(k) y (k)⊤ ) J (k) (I − ρk · y (k) s(k)⊤ ) + ρk · s(k) s(k)⊤ (5.4)
where ρk = 1
y (k)⊤ s(k)
. Alternatively, one can also show that [26]
−1
To implement the BFGS algorithm, it suces to compute J (k) ∇f (x(k) ). Therefore, we can di-
−1 −1
rectly use the recursive formula (5.4) to compute J (k) ∇f (x(k) ) instead of maintaining J (k) or J (k)
explicitly.
In practice, the recursive formula J (k) becomes too expensive and hence one can stop the recursive
formula after constant steps, which gives the limited-memory BFGS algorithm.
where A ∈ Rm×n . The diculty of linear programs is the constraint x ≥ 0. Without this constraint, we can
simply solve it as a linear system. One natural idea to solve linear programs is to replace the hard constraint
x ≥ 0 by some smooth function. So, let us consider the following regularized version of the linear program
for some t ≥ 0:
Xn
(Pt ) : ⊤
min c x − t ln xi subject to Ax = b.
x
i=1
We will explain the reason of choosing ln x in more detail later. For now, we can think it as a nice function
that blows up to ∞ as x approaches zero.
One can think that − ln x gives a force from every constraint x ≥ 0 to make sure x ≥ 0 is true. Since
the gradient of − ln x blows up when x = 0, when x is close enough, the force is large enough to counter the
cost c. When t → 0, then the problem (Pt ) is closer to the original problem (P) and hence the minimizer of
(Pt ) is closer to a minimizer of (P).
First, we give a formula for the minimizer of (Pt ).
Lemma 5.21 (Existence and Uniqueness of central path). If the polytope {Ax = b, x ≥ 0} has an interior,
then the optimum of (Pt ) is unique and is given by the solution of the following system:
xs = t,
Ax = b,
A⊤ y + s = c,
(x, s) ≥ 0
where the variables si are additional slack variables, and xs = t is shorthand of xi si = t for all i.
Proof. The optimality condition, using dual variables y for the Lagrangian of Ax = b is given by
t
c− = A⊤ y.
x
Write si = xi ,
t
to get the formula. The solution is unique because the function − ln x is strictly convex.
Denition 5.22. We dene the central path Ct = (x(t) , y (t) , s(t) ) as the sequence of points satisfying
x(t) s(t) = t,
Ax(t) = b,
A⊤ y (t) + s(t) = c,
(x(t) , s(t) ) ≥ 0.
To give another interpretation of the central path, note that the dual problem is
0 ≤ c⊤ x − b⊤ y = c⊤ x − x⊤ A⊤ y = x⊤ s.
Hence, (x, y, s) solves the linear program if it satises the central path equation with t = 0. Therefore,
following the central path is a balanced way to decrease xi si uniformly to 0. We can formalize the intuition
that for small t, x(t) is a good approximation of the primal solution. In fact t itself is a bound on the error
of the current solution.
5.5. Interior Point Method for Linear Programs 78
(x + δx )(s + δs ) ≈ t,
A(x + δx ) = b,
⊤
A (y + δy ) + (s + δs ) = c.
(Omitted the non-negative conditions.) Using our assumption on (x, y, s) and noting that δx · δs is small,
the equation can simplied as follows. We use the notation X = Diag(x), S = Diag(s).
0 A⊤ I
δx 0
A 0 0 δy = 0 .
S 0 X δs t − xs
This is a linear system and hence we can solve it exactly.
Exercise 5.24. Let r = t−xs. Prove that Sδx = (I−P )r and Xδs = P r where P = XA⊤ (AS −1 XA⊤ )−1 AS −1 .
First, we show that x(new) = x + δx and s(new) = s + δs are feasible.
Lemma 5.25. Suppose i (xi si − t)2 ≤ ε2 t2 with ε < 21 . Then,x(new) (new)
P
i >0 and si >0 for all i.
Proof. Note that P 2 = P . However, in general P ̸= P ⊤ ,i.e. P might not be an orthogonal projection matrix.
1 1 1 1
It will be convienient to consider the orthogonal projection matrix P = S − 2 X 2 A⊤ (AS −1 XA⊤ )−1 AS − 2 X 2 .
Note that
1 1 1 1
X −1 δx = S − 2 X − 2 (I − P )S − 2 X − 2 r.
By the assumption for each i, xi si ≥ (1 − ε)t. Therefore, we have
1 1 1
∥X −1 δx ∥2 ≤ p ∥(I − P )S − 2 X − 2 r∥2
(1 − ϵ)t
1 1 1
≤p ∥S − 2 X − 2 r∥2
(1 − ϵ)t
1 ϵ
≤ ∥r∥2 ≤ .
(1 − ϵ)t 1−ϵ
1 1 1 1
Similarly, we have S −1 δs = S − 2 X − 2 P S − 2 X − 2 r. Hence, we have ∥S −1 δs ∥2 ≤ 1−ϵ
ϵ
. Therefore, when ϵ < 21 ,
we have both ∥X δx ∥∞ and ∥S δs ∥∞ less than 1, which shows that both x
−1 −1 (new)
and s(new) are positive.
Next, we show that xs is closer to t after one Newton step.
Lemma 5.26. If i (xi si − t)2 ≤ ε2 t2 with ϵ < 41 , we have that
P
X 2
(new) (new)
− t ≤ ϵ4 + 16ϵ5 t2 .
xi si
i
5.5. Interior Point Method for Linear Programs 79
X (new) (new)
2 X X 2
X δx,i 2 δs,i 2
LHS = xi si −t = 2
(xi si +xi δs,i +si δx,i +δx,i δs,i −t) = 2 2
δx,i δs,i ≤ ((1 + ϵ)t) ·
i i i i
xi si
where in the last step we used x2i s2i ≤ (1 + ε)2 t2 . Using the previous lemma, we have that
2
LHS ≤ ((1 + ϵ)t) · ∥X −1 δx ∥24 ∥S −1 δs ∥24
2
≤ ((1 + ϵ)t) · ∥X −1 δx ∥22 ∥S −1 δs ∥22
4
2 ϵ
≤ ((1 + ϵ)t)
1−ϵ
4 5 2
≤ ϵ + 16ϵ t
2 t2
Let Φ = i (xi si − t) be the error of the current iteration. We always maintain Φ ≤ 16 for the
P
Proof.
t2
current (x, y, s) and t. At each step, we use Lemma 5.26 which makes Φ ≤ 50 . Then, we decrease t to
t(1 − 101√n ). Note that
Lemma 5.28. Consider a linear programminAx=b,x≥0 c⊤ x with n variables and d constraints. Assume that
1. Diameter: For any x≥0 Ax = b, we have that ∥x∥∞ ≤ R.
with
2. Lipschitz constant of the objective: ∥c∥∞ ≤ L.
⊤
For any 0 < δ ≤ 1, the modied linear program minAx=b,x≥0 c x with
1
1
δ/L · c
A 0 Rb − A1n Rb
A= ,b = , and c= 0
1⊤
n 1 0 n+1
1
satises the
following:
1n + Lδ · c
1n
0d
1. x= 1 ,y=
and s = 1 are feasible primal dual vectors.
−1
1 1
2
2. For any feasible primal dual vectors (x, y, s) with duality gap at most δ , the vector x̂ = R·x1:n (x1:n are
the rst n coordinates of x) is an approximate solution to the original linear program in the following
5.5. Interior Point Method for Linear Programs 80
sense
c⊤ x̂ ≤ min c⊤ x + LR · δ,
Ax=b,x≥0
X
∥Ax̂ − b∥1 ≤ 4nδ · R |Ai,j | + ∥b∥1 ,
i,j
x̂ ≥ 0.
Part 1. For the rst result, straightforward calculations show that (x, y, s) ∈ R(n+2)×(d+1)×(n+2) are
feasible, i.e.,
1
1n 1
A 0 Rb − A1n b
Ax = ·1= R =b
1⊤
n 1 0 n+1
1
and
A⊤ 1n + Lδ · c
1n
⊤ 0
A y+s= 0 1· d + 1
1 ⊤ ⊤ ⊤ −1
b − 1n A 0 1
R
1n + Lδ · c
−1n
= −1 + 1
0 1
δ
L ·c
= 0 =c
1
For any optimal x ∈ Rn in the original LP, we consider the following x ∈ Rn+2
1
R xP
n
x = n + 1 − R1 i=1 xi (5.5)
0
and c ∈ Rn+2
· c⊤
δ
L
c= 0 (5.6)
1
We want to argue that x ∈ Rn+2 is feasible in the modied LP. It is obvious that x ≥ 0, it remains to show
Ax = b ∈ Rd+1 . We have
1
A 0 R1 b − A1n R xP
1 1
1 n Ax Rb
Ax = ⊤ · n + 1 − R x R
i=1 i = n + 1 = n + 1 = b,
1n 1 0
0
where the third step follows from Ax = b, and the last step follows from denition of b.
Therefore, using the denition of x in (5.5) we have that
1
R xP δ δ
n
OPT ≤ c⊤ x = Lδ · c⊤ 0 1 · n + 1 − R1 i=1 xi = · c⊤ x = · OPT. (5.7)
LR LR
0
5.5. Interior Point Method for Linear Programs 81
where the rst step follows from modied program is solving a minimization problem, the second step follows
from denition of x ∈ Rn+2 (5.5) and c ∈ Rn+2 (5.6), the last step follows from x ∈ Rn is an optimal solution
in the original linear program.
x1:n
Given a feasible (x, y, s) ∈ R(n+2)×(d+1)×(n+2) with duality gap δ 2 , we can write x = τ ∈ Rn+2 for
θ
some τ ≥ 0, θ ≥ 0. We can compute c⊤ x which is Lδ · c⊤ x1:n + θ. Then, we have
δ ⊤ δ
· c x1:n + θ ≤ OPT + δ 2 ≤ · OPT + δ 2 , (5.8)
L LR
where the rst step follows from denition of duality gap, the last step follows from (5.7).
Hence, we can upper bound the OPT of the transformed program as follows:
LR δ ⊤ RL δ
c⊤ x̂ = R · c⊤ x1:n = · c x1:n ≤ ( · OPT + δ 2 ) = OPT + LR · δ,
δ L δ LR
where the rst step follows by x̂ = R · x1:n , the third step follows by (5.8).
Note that
δ ⊤ δ δ 1 δ n
c x1:n ≥ − ∥c∥∞ ∥x1:n ∥1 = − ∥c∥∞ ∥ x∥1 ≥ − ∥c∥∞ ∥x∥∞ ≥ −δn, (5.9)
L L L R L R
where the second step follows from denition x ∈ Rn+2 , and the last step follows from ∥c∥∞ ≤ L and
∥x∥∞ ≤ R.
We can upper bound the θ in the following sense,
δ
θ≤ · OPT + δ 2 + δn ≤ 2nδ + δ 2 ≤ 4nδ (5.10)
LR
where the rst step follows from (5.8) and (5.9), the second step follows by OPT = minAx=b,x≥0 c⊤ x ≤ nLR
(because ∥c∥∞ ≤ L and ∥x∥∞ ≤ R), and the last step follows from δ ≤ 1 ≤ n.
The constraint in the new polytope shows that
1 1
Ax1:n + ( b − A1n )θ = b.
R R
Using x̂ = Rx1:n ∈ Rn , we have
1 1 1
A x̂ + ( b − A1n )θ = b.
R R R
Rewriting it, we have Ax̂ − b = (RA1n − b)θ ∈ Rd and hence
∥Ax̂ − b∥1 = ∥(RA1n − b)θ∥1 ≤ θ(∥RA1n ∥1 + ∥b∥1 ) ≤ θ · (R∥A∥1 + ∥b∥1 ) ≤ 4nδ · (R∥A∥1 + ∥b∥1 ),
where the second step follows from triangle inequality, the third step follows from ∥A1n ∥1 ≤ ∥A∥1 (because
the denition of entry-wise ℓ1 norm), and the last step follows from (5.10).
√
5.5.4 Why n?
The central path is the solution to the following ODE
d d
St xt + Xt st = 1,
dt dt
d
A xt = 0,
dt
d d
A⊤ yt + st = 0.
dt dt
5.6. Interior Point Method for Convex Programs 82
−1
Solving this linear system, we have that St dx
dt = (I−Pt )1 and Xt dt = Pt 1 where Pt = Xt A (ASt Xt A )
t dst ⊤ ⊤ −1
ASt−1 .
Using that xt st = t, we have that
d ln xt d ln st
= (I − Pt )1 and = Pt 1.
d ln t d ln t
Note that √
∥Pt 1∥∞ ≤ ∥Pt 1∥2 = n.
Hence, xt and st can change by at most a constant factor when we change t by a 1 ± √1
n
factor.
Exercise 5.29. If we are given x such that ∥ln x − ln xt ∥∞ = O(1), then we can nd xt by solving Õ(1)
linear systems.
min f (x)
x
Hence, it suces to study the problem minx∈K c⊤ x. Similar to the case of linear programs, we replace the
hard constraint x ∈ K by a soft constraint as follows:
where ϕ(x) is a convex function such that ϕ(x) → +∞ as x → ∂K . Note that we put the parameter t in
front of the cost c⊤ x instead of ϕ as in the last lecture, it is slightly more
Pn convenient here. We say ϕ is a
barrier for K . To be concrete, we can always keep in mind ϕ(x) = − i=1 ln xi . As before, we dene the
central path.
Denition 5.30. The central path xt = arg minx ϕt (x).
The interior point method follows the following framework:
1. Find x close to x1 .
2. While t is not tiny,
(a) Move x closer to xt
(b) t → (1 + h) · t.
5.6.1 Self-concordance
In this section, we give a general analysis for the Newton method. In the next section, we will use this to
show that interior point method can be generalized to convex optimization. A key property of the Newton
method is that it is invariant under linear transformation. In general, whenever a method uses k th order
information, we need to assume the k th derivative is continuous. Otherwise, the k th derivative is not useful
for algorithmic purposes. For the Newton method, it is convenient to assume that the Hessian is Lipschitz.
Since the method is invariant under linear transformation, it only makes sense to impose an assumption that
is invariant under linear transformation.
5.6. Interior Point Method for Convex Programs 83
Denition 5.31. Given a convex function f : Rn → R, and any point x ∈ Rn , dene the norm ∥.∥x as
2
∥v∥x = v ⊤ ∇2 f (x)v.
Remark. The constant 2 is chosen so that − ln(x) exactly satises the assumption and it is not very important,
in that by scaling f , we can change any constant to any other constant.
Exercise 5.32. Show that the following property is equivalent fo self-concordance as dened above: re-
stricted on any straight line g(t) = f (x + th), we have g ′′′ (t) ≤ 2g ′′ (t)3/2 .
Exercise 5.33. Show that the functions x⊤ Ax, − ln x, − ln(1 − x2i ), − ln det X are self-concordant under
P
suitable nonnegativity conditions.
The self-concordance condition says that locally, the Hessian does not change too fast, i.e., the change in
the Hessian is bounded by its magnitude (to the power 1.5). We will skip the proof of the lemma below.
From the self-concordance condition, we have the following more directly usable property.
Lemma 5.35. For a self-concordant function f and any x ∈ domf and any ∥y − x∥x < 1, we have that
1
(1 − ∥y − x∥x )2 ∇2 f (x) ⪯ ∇2 f (y) ⪯ ∇2 f (x).
(1 − ∥y − x∥x )2
By self-concordance, we have
2
|α′ (t)| ≤ 2 ∥y − x∥x+t(y−x) ∥u∥x+t(y−x) . (5.11)
3
For u = y − x, we have |α′ (t)| ≤ 2α(t) 2 . Hence, we have d
dt
√1 ≥ −1. Integrating both sides wrt t, we
α(t)
have
1 1 1
p ≥p −t= − t.
α(t) α(0) ∥x − y∥x
Rearranging it gives
2
2 1 ∥x − y∥x
∥y − x∥x+t(y−x) = α(t) ≤ 1 = .
( ∥x−y∥ − t)2 (1 − t ∥x − y∥x )2
x
Not all convex functions are self-concordant. However, for our purpose, it suces to show that we can
construct a self-concordant barrier for any convex set.
Unfortunately, this is an existence result and the barrier function is expensive to compute.In practice,
we construct self-concordant barriers out of simpler ones:
Lemma 5.38. The following functions are self-concordant barriers. We use ν -sc as a short form for ν -self-
concordant barrier.
− ln x is 1-sc for {x ≥ 0}.
− ln cos(x) is 1-sc for {|x| ≤ π2 }.
2
− ln(t2 − ∥x∥ ) is 2-sc for {t ≥ ∥x∥2 }.
− ln det X is n-sc for {X ∈ Rn×n , X ⪰ 0}.
− ln x − ln(ln x + t) is 2-sc for {x ≥ 0, t ≥ − ln x}.
− ln t − ln(ln t − x) is 2-sc for {t ≥ ex }.
− ln x − ln(t − x ln x) is 2-sc for {x ≥ 0, t ≥ x ln x}.
−2 ln t − ln(t2/p − x2 ) is 4-sc for {t ≥ |x|p } for p ≥ 1.
− ln x − ln(tp − x) is 2-sc for {tp ≥ x ≥ 0} for 0 < p ≤ 1.
− ln t − ln(x − t−1/p ) is 2-sc for {x > 0, t ≥ x−p } for p ≥ 1.
− ln x − ln(t − x−p ) is 2-sc for {x > 0, t ≥ x−p } for p ≥ 1.
The following lemma shows how we can combine barriers.
Lemma 5.39. If ϕ1 and ϕ2 are ν1 and ν2 -self concordant barriers for K1 and K2 respectively, then ϕ1 + ϕ2
is a ν1 + ν2 self concordant barrier for K1 ∩ K2 .
Lemma 5.40. If ϕ is a ν -self concordant barrier for K, then ϕ(Ax + b) is ν -self concordant for {y : Ay + b ∈
K}.
Exercise 5.41. Using the lemmas above, prove that −
Pm
i=1 i x − bi ) is an m-self concordant barrier
ln(a⊤
for the convex set {Ax ≥ b}.
r2
∥∇f (x′ )∥∇2 f (x′ )−1 ≤
.
(1 − r)2
Remark 5.43. Note that ∥∇f (x)∥∇2 f (x)−1 =
∇ f (x) ∇f (x)
is the step size of the Newton method.
2 −1
x
This is a measurement of the error, since the goal is to nd x with ∇f (x) = 0.
5.6. Interior Point Method for Convex Programs 85
Finally, we bound the error of the current iterate in terms of ∥∇f (x)∥∇2 f (x)−1 .
Lemma 5.44. x such that ∥∇f (x)∥∇2 f (x)−1 ≤
Given
1
6 , we have that
∥x − x ∥x∗ ≤ 2∥∇f (x)∥∇2 f (x)−1 ,
∗
Z 1
2 ∗ ∗ ∗
∥∇f (x)∥∇2 f (x)−1 =
∇ f (x + t(x − x ))(x − x ) dt
0 ∇2 f (x)−1
Z 1
≥ (1 − (1 − t)r)2 ∥x − x∗ ∥x dt
0
r2
3r
= 1−r+ r≥ .
3 4
5.6. Interior Point Method for Convex Programs 86
where we used r ≤ 1
4 at the end.
⟨∇ϕ(x), y − x⟩ ≤ ν.
Proof. Let α(t) = ⟨∇ϕ(zt ), y − x⟩ where zt = x + t(y − x). Then, we have
α′ (t) = ∇2 ϕ(zt )(y − x), y − x .
Note that √
α(t) ≤ ∥∇ϕ(zt )∥∇2 ϕ(zt )−1 ∥y − x∥∇2 ϕ(zt ) ≤ v ∥y − x∥∇2 ϕ(zt ) .
Hence, we have α′ (t) ≥ v1 α(t)2 . If α(0) ≤ 0, then we are done. Otherwise, α is increasing and hence α(1) > 0.
Since α(1)
1 1
≤ α(0) − v1 . So, α(0) ≤ v .
Lemma 5.46 (Duality Gap). Suppose that ϕ is a ν -self concordant barrier, we have that
ν
⟨c, xt ⟩ ≤ ⟨c, x∗ ⟩ + .
t
1
More generally, for any x such that ∥tc + ∇ϕ(x)∥(∇2 ϕ(x))−1 ≤ 6 , we have that
√
ν+ ν
⟨c, x⟩ ≤ ⟨c, x∗ ⟩ + .
t
5.6. Interior Point Method for Convex Programs 87
1
⟨c, x − xt ⟩ ≤ ∥tc + ∇ϕ(x)∥∇2 ϕ(x)−1 + ∥∇ϕ(x)∥∇2 ϕ(x)−1
3t √
1 1 √ ν
≤ + ν ≤ .
3t 6 t
Theorem 5.47. Given a ν -self concordant barrier ϕ and its minimizer. We can nd x ∈ K such that
c⊤ x ≤ c⊤ x∗ + ϵ in
√ ν
O( ν log( ∥c∥∇2 ϕ(x)−1 ))
ϵ
iterations.
Proof. We prove by induction that ∥∇ft (x)∥∇2 ft (x))−1 ≤ 16 at the beginning of each iteration. This is true
at the beginning by the denition of initial t. By Lemma 5.42, after the Newton step, we have
1/6 2 1
∥∇ft (x)∥∇2 ft (x)−1 ≤ ( ) = .
1 − 1/6 25
Sparsication
In this chapter, we study some randomization techniques for faster convex optimization.
where A ∈ Rn×d with n ≥ d (we assume this throughout the chapter). The gradient of the function is
2A⊤ Ax − 2A⊤ b. Setting it to zero, and assuming A⊤ A is invertible, the solution is given by
x = (A⊤ A)−1 A⊤ b.
If AT A is not invertible, we use its pseudo-inverse. If the matrix A⊤ A ∈ Rd×d is given, then we can solve the
equation above in time dω , the current complexity of matrix multiplication. If n > dω , then the bottleneck
is simply to compute A⊤ A. The following lemma shows that it suces to approximate A⊤ A.
The simplest iteration is the Richardson iteration:
To ensure this converges, we scale down AT A by its largest eigenvalue so that AT A ≺ I . This gives a bound
of O(κ(A⊤ A) log(1/ϵ)) on the number of iterations to get ϵ error where κ(A⊤ A) = λmax (A⊤ A)/λmin (A⊤ A).
More generally, one can use pre-conditioning. Recall that for a vector v , the norm ∥v∥M is dened as v T M v .
Lemma 6.1. ⊤
Given a matrix M such that A A ⪯ M ⪯ κ · A⊤ A for some κ ≥ 1. Consider the algorithm
(k+1) (k) −1 ⊤ (k) ⊤
x =x − M (A Ax − A b) . Then, we have that
k
1
∥x(k) − x∗ ∥M ≤ 1− ∥x(0) − x∗ ∥M .
κ
Remark 6.2. The proof also shows why the choice of norm above is the natural one. In this norm, the
residual drops geometrically.
Proof. Using x∗ = (A⊤ A)−1 A⊤ b, i.e., A⊤ b = (A⊤ A)x∗ , and the formula of x(k+1) , we have
88
6.1. Subspace embedding 89
1 1
where H = M − 2 A⊤ AM − 2 . Note that the eigenvalues of H lie between 1/κ and 1 and hence
2 1 2
λmax (I − H) ≤ 1 − .
κ
Note that x⊤ A⊤ Ax = ∥Ax∥2 . Alternatively, we can think that our goal is to approximate A by a smaller
matrix B s.t. ∥Ax∥2 is close to ∥Bx∥2 for all x. In this section, we show that we can simply take B = ΠA
for a random matrix Π with relatively few rows. With this choice of M , we can run the Richardson iteration.
We need to see if this will make the entire procedure more ecient.
for all y ∈ S .
In this section, we focus on the case that S is a d-dimensional subspace in Rn , namely S = {Ax : x ∈ Rd }.
Consider the SVD A = U ΣV ⊤ . For any y ∈ S , we have that
Exercise 6.4. For any d-dimensional subspace S , any embedding with distortion ϵ < 1 must have at least
d rows.
This embedding is not useful for solving the least squares problem because the solution of the least square
problem is simply a closed form of the SVD decomposition x = V Σ−1 U ⊤ b and nding the SVD is usually
more expensive.
Denition 6.5. A random matrix Π ∈ Rm×n is a (d, ϵ, δ)-oblivious subspace embedding (OSE) if for any
xed d-dimensional subspace S ⊂ Rn , Π is an embedding for S with distortion ϵ with probability at least
1 − δ.
Lemma 6.6. We call Π a (d, ϵ, δ) OSE if for any matrix U ∈ Rn×d with orthonormal columns, we have that
P(∥U ⊤ Π⊤ ΠU − Id ∥op ≤ ϵ) ≥ 1 − δ.
Proof. Let S be the subspace with an orthonormal basis U ∈ Rn×d , namely S = {y : y = U z} Then, the
condition
(1 − ϵ)∥y∥2 ≤ ∥Πy∥2 ≤ (1 + ϵ)∥y∥2
can be rewritten as
(1 − ϵ)U ⊤ U ⪯ U ⊤ Π⊤ ΠU ⪯ (1 + ϵ)U ⊤ U.
Using U ⊤ U = Id , we have
(1 − ϵ)Id ⪯ U ⊤ Π⊤ ΠU ⪯ (1 + ϵ)Id .
6.1. Subspace embedding 90
for all unit vectors a. An OSE for d = 1 is given by the Johnson-Lindenstrauss Lemma. The original version
was for a uniform random subspace of dimension d, but later versions extended this to Gaussian, Bernoulli
and more general random matrices [73].
Lemma 6.7 (Johnson-Lindenstrauss Lemma). Π ∈ Rm×n be a random matrix with i.i.d entries
Let from
N (0, √1m ) or uniformly sampled from ± √1m with m = Θ( ϵ12 log( 1δ )). Then, Π is a (1, ϵ, δ) OSE.
We will skip the proof for this as we will prove a more general result later. Next, we show that any OSE
for d = 1 is a OSE for general d. Therefore, it suces to focus on the case d = 1. First, we need a lemma
about ϵ-net on S n−1 .
Lemma 6.8. For any ϵ > 0 and any n ∈ N, there are at most (1 + 2ϵ )n unit vectors x i ∈ Rn such that for
n
any unit vector x ∈ R , there is an i such that ∥x − xi ∥2 ≤ ϵ.
Proof. Let N = {xi }i=1 be a 2 -net for S . Then, for any x, we have x1 ∈ N such that ∥x − x1 ∥2 ≤ 21 .
1 d−1
Using the 12 -net guarantee on the vector x − x1 , we can nd x2 ∈ N and 0 ≤ t2 ≤ 21 such that
1
∥x − x1 − t2 x2 ∥ ≤ .
4
P∞
Continuing similarly, we have x = with 0 ≤ ti ≤ 2i−1
i=1 ti xi
1
. Hence, we have that
X
x⊤ (U ⊤ Π⊤ ΠU − Id )x = ti tj x⊤ ⊤ ⊤
i (U Π ΠU − Id )xj
i,j
X
ti tj · max x⊤ U ⊤ Π⊤ ΠU − Id x
≤
x∈N
i,j
≤ 4 max x⊤ U ⊤ Π⊤ ΠU − Id x
x∈N
= 4 max ∥Πx∥2 − 1
x∈U N
This reduction and the Johnson-Lindenstrauss Lemma shows that a random ± √1m is a (d, ϵ, δ) OSE with
1 1
m = Θ( 2
(d + log( ))).
ϵ δ
As we discussed before any (d, ϵ, δ) OSE should have at least d rows. Therefore, the number of rows of this
OSE is tight for the regime ϵ = Θ(1). We only need an OSE for ϵ = Θ(1) because of Lemma 6.1; by iterating
we can get any ϵ with an overhead of log(1/ϵ). Unfortunately, computing ΠA is in fact more expensive than
A⊤ A. The rst involves multiplying Θ(d) × n and n × d matrix, the second one involves multiplying d × n
and n × d matrix.
n = d = 1, ∥Πx∥2 is the number of non-zeros in the only column. Therefore, we indeed need that s scales
like 1/ϵ2 . The advantage of a sparse embedding is that applying it can be much more ecient.
Remark 6.10. It turns out that one can select exactly s non-zeros for each column. This allows us to use
s = Θ( log(d/δ)
ϵ ). The proof of this is slightly more complicated due to the lack of independence [16].
To analyze U Π ΠU , we note that
⊤ ⊤
m
X
U ⊤ Π⊤ ΠU = (ΠU )⊤
r (ΠU )r .
r=1
Since each row of Π is independent, we can use matrix concentration bounds to analyze the sum above. See
[70] for a survey on matrix concentration bounds.
Theorem 6.11 (Matrix Cherno). Suppose we have a sequence of independent, random, self-adjoint ma-
R n
trices Mj ∈ Rn×n such that EMj = I and 0 ⪯ Mj ⪯ R · I . Then, for T = ε2 log δ ,
T
1X
(1 − O(ε))I ⪯ Mj ⪯ (1 + O(ε))I
T j=1
Mr ⪯ m · πr⊤ U U ⊤ πr · I. (6.1)
With small probability, πr⊤ U U ⊤ πr can be huge. However, as long as πr⊤ U U ⊤ πr is bounded by R with
probability 1 − δ , then we can still use the matrix Cherno bound above. To bound πr⊤ U U ⊤ πr , we will use
the following large deviation inequality.
Lemma 6.13. Assume that s≫ m
n log( 1δ ),log2 ( 1δ )/ϵ2 and m ≫ d log( 1δ )/ϵ2 . Then,
ϵ2
πr⊤ U U ⊤ πr ≤
log(d/δ)
with probability 1 − δ. (Here ≫ means greater by a suciently large constant factor).
6.1. Subspace embedding 92
πr⊤ U U ⊤ πr = σ ⊤ P σ.
Note that U U ⊤ is a projection matrix and hence ∥P ∥op ≤ 1. Since P ⪰ 0, we have that
√ q √
∥P ∥F = trP 2 ≤ ∥P ∥op · trP ≤ trP .
with probability 1 − δ . Note that trU U ⊤ = trU ⊤ U = d and that P is a random diagonal block of U U ⊤ of
size at most 2sn/m. By the Cherno bound, one can show that
4sd
trP ≤ .
m
Hence, we have r !
4d 1 sd 1 1
σ⊤ P σ ≤ + O · log + log
m s m δ δ
ϵ2
with probability 1−δ . Using s ≫ log2 ( dδ )/ϵ2 and m ≫ d log( dδ )/ϵ2 , we have σ ⊤ P σ ≤ log(d/δ) with probability
1 − δ.
Theorem 6.14. Consider a random sparse matrix Π ∈ Rm×n where each entry is ± √1s with probability
s d log( d
δ) log2 ( d
δ)
m and 0 otherwise. There exist constants c1 , c2 such that for m = c1 ϵ2 and s = c2 ϵ2 Π is an
O(d, ϵ, δ)-OSE.
ϵ2
πr⊤ U U ⊤ πr ≤
log(d/δ)
with high probability. Under this event, using Theorem 6.11 and (6.1), we have that
m
X
(1 − O(ε))I ⪯ πr⊤ U U ⊤ πr ⪯ (1 + O(ε))I.
r=1
6.2. Leverage Score Sampling 93
Proof. The guarantee of x follows from the denition of OSE and Lemma 6.1. We note that ΠA simply
involves duplicating each row of A into roughly s many rows in ΠA. Hence, the cost of computing ΠA is
O(s · nnz(A)) = O(nnz(A))
e . The cost of computing M takes O(de ω ) time. Computing M −1 takes O(d e ω)
time. The loop takes O(nnz(A)) time. This explain the total time.
e
Linear regression can also be solved in time O nnz(A) + dO(1) via a very sparse embedding: each column
of Π picks exactly one nonzero entry in a random location. This was analyzed by Clarkson and Woodru
[76]. See also [34] for a survey.
Remark. Here we use O∗ to emphasize there is some dependence on log( 1ϵ ) suppressed for notational sim-
plicity. Lemma 6.1 shows that once we can solve the system with constant approximation, we can repeat
log( 1ϵ ) times to get an ϵ-accurate solution.
Note that σ(A) is the diagonal of the projection matrix A(A⊤ A)+ A⊤ . Since 0 ⪯ A(A⊤ A)+ A⊤ ⪯ I , we
have that 0 ≤ σi (A) ≤ 1. Moreover, since A(A⊤ A)+ A⊤ is a projection matrix, the sum of A's leverage scores
(its trace) is equal to the rank of A:
n
X
σi (A) = tr(A(A⊤ A)+ A⊤ ) = rank(A(A⊤ A)+ A⊤ ) = rank(A) ≤ d. (6.2)
i=1
6.2. Leverage Score Sampling 94
The leverage score measures the importance of a row in forming the row space of A. If a row has a
component orthogonal to all other rows, its leverage score is 1. Removing it would decrease the rank of A,
completely changing its row space. The coherence of A is ∥σ(A)∥∞ . If A has low coherence, no particular
row is especially important. If A has high coherence, it contains at least one row whose removal would
signicantly aect the composition of A's row space. The following two characterizations help with this
intuition.
Lemma 6.18. For all A ∈ Rn×d and i ∈ [n] we have that
2
σi (A) = min ∥x∥2 .
A⊤ x=ai
ai a⊤ ⊤
i ⪯ t · A A. (6.3)
Sampling rows from A according to their exact leverage scores gives a spectral approximation for A with
high probability. Sampling by leverage score overestimates also suces.
Lemma 6.20. Given a vector u of leverage score overestimates, i.e., σi (A) ≤ ui for all i, dene
1 ui
X= ai a⊤
i with probability pi = .
pi ∥u∥1
∥u∥1 log n 1
For T = Ω( ε2 ), with probability 1− nO(1)
, we have that
T
1X
(1 − ε)A⊤ A ⪯ Xi ⪯ (1 + ε)A⊤ A
T i=1
where Xi are independent copies of X.
Proof. Note that EX = A A and that ⊤
1 ∥u∥1
0⪯X= ai a⊤
i ⪯ ai a⊤ ⊤
i ⪯ ∥u∥1 · A A.
pi σi
1 1
Now, the statement simply follows from the matrix Cherno bound with Mk = (A⊤ A)− 2 Xk (A⊤ A)− 2 and
R = ∥u∥1 .
Combining Lemma 6.20 and Lemma 6.1, we have that
T (n) = cost of computing σi + O∗ (nnz(A) + T (d log d)) (6.4)
where we used that ∥σ∥1 = O(d). However, computing σ exactly is too expensive for many purposes. In
[67], they showed that we can compute leverage scores approximately by solving only polylogarithmically
many regression problems. This result uses the fact that
2
σi (A) =
A(A⊤ A)+ A⊤ ei
2
and that by the Johnson-Lindenstrauss Lemma these lengths are preserved up to a multiplicative error if we
project these vectors to a random low-dimensional subspace.
2
In particular, this lemma shows that we can approximate σi (A) via
ΠA(A⊤ A)+ A⊤ ei
2 . The benet of
this is that we can compute ΠA(A⊤ A)+ by solving logε2 many linear systems. In other words, we have that
n
σi,S = a⊤ ⊤
i (AS AS )
−1
ai
where AS is A restricted to rows in S . The set S will be a random sample of k rows of A. Note that A⊤
S AS
is an overestimate of σi . Hence, it suces to bound ∥σi,S ∥1 . The key lemma is the following:
Lemma 6.21. We have that
n
X nd
E|S|=k σi,S∪{i} ≤ .
i=1
k
Note that i∈S σi,S∪{i} = i∈S σi,S ≤ d. Hence, the second term is bounded by n.
P P
For the rst term, we note that sample a set S of size k , then sample i ∈
/ S is same as sample a set T
of size k + 1, then sample i ∈ T . Hence, we have
E|S|=k Ei∈S
/ σi,S∪{i} = E|T |=k+1 Ei∈T σi,T
d d
≤ E|T |=k+1 =
k+1 k+1
Hence, we have that
n
X d n+1
E|S|=k σi,S∪{i} ≤ (n − k) + d = d · .
i=1
k+1 k+1
where D is a distribution of convex functions in Rd . The goal is to nd the minimizer x∗ = argminx F (x).
Suppose we observed samples f1 , f2 , · · · , fT from D. Ideally, we wish to approximate x∗ by the empirical
risk minimizer
T
(T ) def 1
X
xERM = arg min FT (x) = fi (x).
x T i=1
(T )
It is known that xERM is optimal in a certain sense in spite of its computational cost. Therefore, to discuss
the eciency of an optimization algorithm for F (x), it is helpful to consider the ratio
EF (x(T ) ) − F (x∗ )
(T )
.
EF (xERM ) − F (x∗ )
(T )
We will rst discuss the term EF (xERM ) − F (x∗ ). As an example, consider the simplest one-dimensional
problem F (x) = Eb∼N (0,I) (x − b)2 . Note that x(T ) is simply the average of T standard normal variables and
hence
(T ) (T )
EF (xERM ) − F (x∗ ) = Eb1 ,b2 ,··· ,bT Eb (xERM − b)2 − b2
(T )
= Eb1 ,b2 ,··· ,bT (xERM )2
T
1X 2 1
= Eb1 ,b2 ,··· ,bT ( bi ) = .
T i=1 T
(T ) σ2 1
EF (xERM ) − F (x∗ ) → where σ 2 = E∥∇f (x∗ )∥2(∇2 F (x∗ ))−1 .
T 2
σ2
where T is called the Cramer-Rao bound.
Lemma 6.22. Suppose that f is µ-strongly convex with Lipschitz Hessian for all f ∼D for some µ > 0.
Suppose that Ef ∼D ∥∇f (x∗ )∥2 < +∞. Then, we have that
(N )
EF (xERM ) − F (x∗ )
lim = 1.
N →+∞ σ 2 /N
Remark. The statement holds with weaker assumptions and the rate of convergence can be made quantitative.
(N ) (N )
Proof.We rst prove xERM → x∗ as N → +∞. By the optimality condition of xERM , and using Taylor's
theorem, for some x
e, we have that
(N ) (N )
0 = ∇FN (xERM ) = ∇FN (x∗ ) + ∇2 FN (e
x)(xERM − x∗ ). (6.5)
By the µ-strongly convexity, we have
(N ) 1
∥xERM − x∗ ∥22 ≤ ∥∇FN (x∗ )∥2 .
µ
Since Ef ∼D ∇f (x∗ ) = ∇F (x∗ ) = 0, we have
(N ) 1 1 X
E∥xERM − x∗ ∥22 ≤ E ∥∇fi (x∗ )∥2
µ N2 i
1
= Ef ∼D ∥∇f (x∗ )∥2
µN
6.3. Stochastic Gradient Descent 97
(N )
Therefore, xERM → x∗ as N → +∞.
Now, to compute the error, Taylor expansion of F at x∗ shows that
(N ) 1 (N ) (N )
F (xERM ) − F (x∗ ) = (x − x∗ )⊤ ∇2 F (x)(xERM − x∗ )
2 ERM
(N )
for some x between x∗ and xERM . Using this and (6.5) gives
(N ) 1
F (xERM ) − F (x∗ ) = ∇FN (x∗ )⊤ (∇2 FN (e x))−1 ∇FN (x∗ ).
x))−1 ∇2 F (x)(∇2 FN (e
2
(N )
Since xERM → x∗ and ∇2 FN ⪰ µI , we have (∇2 FN (e x))−1 → (∇2 F (x∗ ))−1 . Hence,
x))−1 ∇2 F (x)(∇2 FN (e
we have
(N ) N
lim EN · (F (xERM ) − F (x∗ )) = lim E∇FN (x∗ )⊤ (∇2 F (x∗ ))−1 ∇FN (x∗ )
N →∞ N →∞ 2
1
= E∥∇f (x∗ )∥2(∇2 F (x∗ ))−1 .
2
Now, we discuss how to achieve a bound similar to σ 2 /T using stochastic gradient descent. Since gradient
descent is a rst order method, we can only achieve a bound related to the ℓ2 norm, E∥∇f (x∗ )∥22 , instead
of the inverse Hessian norm.
Lemma 6.23. Suppose f has L-Lipschitz gradient for all f ∈ D. Let x∗ be the minimizer of F. Then, we
have
Ef ∼D ∥∇f (x) − ∇f (x∗ )∥22 ≤ 2L · (F (x) − F (x∗ )).
Proof. Let g(x) = f (x) − ∇f (x∗ )⊤ (x − x∗ ). By construction, x∗ is the minimizer of g and f (x∗ ) = g(x∗ ).
Hence, by the progress of gradient descent, we know
1
g(x∗ ) ≤ g(x) − ∥∇g(x)∥2 .
2L
Rearranging the term, we have
∥∇g(x)∥2 ≤ 2L(g(x) − g(x∗ ))
and hence
∥∇f (x) − ∇f (x∗ )∥2 ≤ 2L · (f (x) − f (x∗ ) − ∇f (x∗ )⊤ (x − x∗ )).
Taking expectation, we have the result.
Theorem 6.24. ∗
Suppose f has L-Lipschitz gradient for all f ∈ D and F is µ strongly convex. Let x be the
2 1 ∗ 2 1 (k)
minimizer of F and σ = 2 E∥∇f (x )∥ . For step size h ≤ 4L , the sequence x in StochasticGradientDescent
satises
8hσ 2 hµ k
E∥x(k) − x∗ ∥2 ≤ + (1 − ) · ∥x(0) − x∗ ∥2
µ 2
and
T −1
1 X ∥x(0) − x∗ ∥2
EF (x(k) ) − F (x∗ ) ≤ 4hσ 2 + .
T hT
k=0
6.3. Stochastic Gradient Descent 98
Using h ≤ 1
4L and F (x(k) ) − F (x∗ ) ≤ 1
2µ ∥∇F (x )∥2 ,
(k) 2
we have
hµ
E∥x(k+1) − x∗ ∥2 ≤ 4h2 σ 2 + (1 − ) · ∥x(k) − x∗ ∥2 .
2
The rst conclusion follows.
For the second conclusion, (6.6) shows that
E∥x(k) − x∗ ∥2 − E∥x(k+1) − x∗ ∥2
EF (x(k) ) − F (x∗ ) ≤ + 4hσ 2 .
h
Hence, we have
T −1
1 X ∥x(0) − x∗ ∥2
EF (x(k) ) − F (x∗ ) ≤ 4hσ 2 +
T hT
k=0
Exercise 6.25. Applying the theorem above twice, show that one can achieve error Õ( µT
2
σ
), which is roughly
same as the Cramer-Rao bound.
We note that for many algorithms in this book, such as mirror descent, the stochastic version is obtained
by replacing the gradient ∇F by the gradient of a sample, ∇f .
Note that if x(0) is close to x∗ , then ∇fei (x∗ ) is small because both ∇fi (x∗ ) − ∇fi (x(0) ) and ∇F (x(0) ) are
small. Formally, the variance for fe is bounded as follows:
Lemma 6.26. Suppose f has L-Lipschitz gradient for all f ∈ D. For any f ∈ D, let ∇fe(x) = ∇f (x) −
(0)
∇f (x )+ ∇F (x(0) ) for some xed x(0) . Then, we have
Theorem 6.27. f has L-Lipschitz gradient for all fi and F is µ strongly convex. Let x∗ be the mini-
Suppose
1 256L (km)
mizer of F . For step size h = 64L and m = µ , the sequence x in StochasticVarianceReducedGradient
satises
1
EF (x(km) ) − F (x∗ ) ≤ · (F (x(0) ) − F (x∗ )).
2k
In particular, it takes O((n + L 1
µ ) log( ϵ )) gradient computations to nd x such that EF (x) − F (x∗ ) ≤ ϵ ·
(0) ∗
(F (x ) − F (x )).
∇f .
Proof. The algorithm consists of T
m phases. For the rst phase, we compute ∇F (x(0) ). Lemma 6.26 shows
that
E∥∇f (x(k) ) − ∇f (x(0) ) + ∇F (x(0) )∥2 ≤ 8L · (F (x(0) ) − F (x∗ )).
Hence, Theorem 6.24 shows that
m−1
1 X F (x(0) ) − F (x∗ )
EF (x(k) ) − F (x∗ ) ≤ 16hL · (F (x(0) ) − F (x∗ )) +
m µhm
k=0
1
≤ (F (x(0) ) − F (x∗ )).
2
Hence, the error decreases by half each phase. This shows the result.
1 ∂ 1 2
Ei f (x − f (x)ei ) ≤ f (x) − ∥∇f (x)∥2 .
Li ∂xi 2L
Proof. Note that the function ζ(t) = f (x + tei ) is Li smooth. Hence, we have that
1 ∂ 1 ∂
f (x − f (x)ei ) ≤ f (x) − f (x)2 .
Li ∂xi 2Li ∂xi
6.4. Coordinate Descent 100
1 ∂ X Li 1 ∂
Ef (x − f (x)ei ) ≤ f (x) − f (x)2
Li ∂xi i
L 2Li ∂x i
1 X ∂
= f (x) − f (x)2
2L i ∂xi
1 2
= f (x) − ∥∇f (x)∥2 .
2L
Theorem 6.29 (Coordinate Descent Convergence). Given a convex function f, suppose that
∂2
∂x2i
f (x) ≤ Li
(k+1) (k) 1 ∂ (k)
P
for all x and let L= Li . Consider the algorithm x ←x − Li ∂xi f (x )ei . Then, we have that
2LR2
Ef (x(k) ) − f (x∗ ) ≤ with R= max ∥x − x∗ ∥2
k+4 f (x)≤f (x(0) )
µ Ω(k)
Ef (x(k) ) − f (x∗ ) ≤ (1 − ) (f (x(0) ) − f (x∗ )).
L
Remark. Note that Ld ≤ Lip(∇f ) ≤ L. Therefore gradient descent takes at least 1
d times as many steps as
coordinate descent while each step takes d times longer (usually).
Chapter 7
Acceleration
µ k
inf max |q(x)| ≤ 1 − .
q(0)=1 µ≤x≤L L
This corresponds to the Richardson iteration, and the above bound shows that it takes O( Lµ log( ϵ )) degree
1
Denition 7.1. For any integer d, the d'th Chebyshev polynomial is dened as the unique polynomial that
satises
Td (cos θ) = cos(dθ).
Exercise 7.2. Show that Td (x) is a degree |d| polynomial in x and that
Td+1 (x) + Td−1 (x)
xTd (x) = . (7.1)
2
Theorem 7.3 (Cherno Bound). For independent random variables Y1 , · · · , Ys such that P(Yi = 1) =
1
P(Yi = −1) = 2 , for any a ≥ 0, we have
s
X a2
P( Yi ≥ a) ≤ 2 exp(− ).
i=1
2s
√
Now we are ready to show that there is a degree Õ( s) polynomial that estimates xs .
Theorem 7.4. For any positive integers s and d, there is a polynomial p of degree d such that
d2
max |p(x) − xs | ≤ 2 exp(− ).
x∈[−1,1] 2s
101
7.2. Conjugate Gradient 102
Ps
Proof. Let Yi are i.i.d. random variable uniform on {−1, 1}. Let Zs = i=1 Yi . Note that
1
Ez∼Zs Tz = Ez∼Zs−1 (Tz+1 + Tz−1 )
2
Now, (7.1) shows that Ez∼Zs Tz (x) = Ez∼Zs−1 xTz (x). By induction, we have
Ez∼Zs Tz (x) = xs T0 (x) = xs . (7.2)
Now, we dene the polynomial p(x) = Ez∼Zs Tz (x)1|z|≤d . The error of the polynomial can be bounded as
follows
max |p(x) − xs | = max |Ez∼Zs Tz (x)1|z|>d |
x∈[−1,1] x∈[−1,1]
that
inf max |q(x)| ≤ ϵ.
q(0)=1 µ≤x≤L
Lemma 7.7. Let x(k) = argminx∈Kk f (x) be the Krylov sequence. . Then, the steps v (k) = x(k) − x(k−1) are
conjugate, namely,
v (i)⊤ Av (j) = 0 for all i ̸= j.
Proof. Assume that i < j . The optimality of x(j) shows that ∇f (x(j) ) ∈ Kj⊥ ⊂ Kj−1
⊥
and that ∇f (x(j−1) ) ∈
⊥
Kj−1 . Hence, we have
⊥
Av (j) = ∇f (x(j) ) − ∇f (x(j−1) ) ∈ Kj−1 .
Next, we note that v (i) = x(i) − x(i−1) ∈ Ki ⊂ Kj−1 . Hence, we have v (i)⊤ Av (j) = 0.
Since the steps are conjugate, v (i) forms a conjugate basis for the Krylov subspaces:
Note that x(k) = x(k−1) + v (k) . Hence, it suces to nd a formula for v (k) .
Lemma 7.8. We have
for some c0 .
For c0 , we use that r(k−1) ∈ Kk−1⊥
. This gives v (k)⊤ r(k−1) = c0 ∥r(k−1) ∥2 .
For ck−1 , since v (i)⊤
Av (k−1)
= 0 for any i ̸= k − 1, we have
∥r (k−1) ∥22
To make the formula simpler, we dene p(k) = v (k)⊤ r (k−1)
v (k) .
∥r (k−1) ∥22 (k) ∥r (k−1) ∥2 (k−1)
Lemma 7.9. We have that x(k) = x(k−1) + p(k)⊤ Ap(k)
p and p(k) = r(k−1) − ∥r (k−2) ∥22
p .
For the quantity v (k)⊤ r(k−1) , we note that f (x(k−1) + tv (k) ) is minimized at t = 1 and hence
Note that
r(k−1) = b − Ax(k−1) = r(k−2) − Av (k−1) .
Taking inner product with r(k−1) and using r(k−1) ⊥ r(k−2) gives ∥r(k−1) ∥2 = r(k−1)⊤ Av (k−1) . Put this into
(7.3) gives the result.
∥r (k) ∥2
p(k+1) = r(k) − ∥r (k−1) ∥2
p(k) .
end
1
f (x(k) ) − f (x∗ ) ≤ inf max q(λi )2 · ∥b∥2A−1
2 q(0)=1 i
Let q(x) = 1 − xp(x). Note that q(0) = 1 and deg q ≤ k and any such q is of the form 1 − xp. Hence, we
have
1
2(f (x(k) ) − f (x∗ )) = inf ∥q(A)A− 2 b∥22
deg q≤k,q(0)=1
The result follows from the fact that ∥q(A)∥2op = maxi q(λi )2 .
7.3. Accelerated Gradient Descent via Plane Search 105
It is known that for any 0 < µ ≤ L, there is a degree k polynomial q with q(0) = 1 such that
k
2
max q(λi )2 ≤ 2 1 − q
i L
µ +1
q
Therefore, it takes O( L µ log( ϵ )) iterations to nd x such that f (x) − f (x ) ≤ ϵ · ∥b∥A−1 .
1 ∗ 2
Also, we note that if there are only s distinct eigenvalues in A, then conjugate gradient nds the exact
solution in s iterations.
Theorem 7.11. Assume that f is convex with ∇2 f (x) ⪯ L · I for all x. Then, we have that
L∥x∗ − x(1) ∥2
f (x(k) ) − f (x∗ ) ≲ .
k2
Proof. Let δk = f (x(k) ) − f (x∗ ). By the convexity of f , we have
δk ≤ ∇f (x(k) )⊤ (x(k) − x∗ )
= ∇f (x(k) )⊤ (x(k) − x(1) ) + ∇f (x(k) )⊤ (x(1) − x∗ ).
Since x(k) is the minimizer on P (k−1) which contains x(1) , we have ∇f (x(k) )⊤ (x(k) − x(1) ) = 0 and hence
δk ≤ ∇f (x(k) )⊤ (x(1) − x∗ ).
Let λt = 1
∥∇f (x(t) )∥
. Note that
T
* T
+
T
X X ∇f (x(k) )
X ∇f (x(k) )
λk δk ≤ , x(1) − x∗ ∗
≤ ∥x − x (1)
∥2 ·
.
∥∇f (x (k) )∥
∥∇f (x(k) )∥
k=1 k=1 k=1 2
(s)
Pk ∇f (x )
Finally, we note that ∇f (x(k+1) ) ⊥ P (k) and hence ∇f (x(k+1) ) ⊥ s=1 ∥∇f (x(s) )∥ . Therefore, we have
2
∇f (x(k) )
2
T T
∇f (x(k) )
X
X
=
∥∇f (x(k) )∥
= T.
∥∇f (x(k) )∥
2
k=1 2 k=1
7.4. Accelerated Gradient Descent 106
Hence, we have
T
X √
λ k δk ≤ T · ∥x∗ − x(1) ∥2 .
k=1
Exercise 7.12. Solve the recursion omitted at the end of the proof.
Pk ∇f (x(s) )
Note that the proof above used only the fact that {x(1) , x(k)+ , s=1 ∥∇f (x(s) )∥
} ⊂ P . Therefore, one can
put extra vectors in P to obtain extra features. For example, if we use the subspace
k
X ∇f (x(s) )
P = x(k) + span(x(k) − x(1) , ∇f (x(k) ), (s) )∥
, x(k) − x(k−1) ),
s=1
∥∇f (x
then one can prove that this algorithm is equivalently to conjugate gradient when f is a quadratic function.
where f (x) is strongly convex and smooth and h(x) is convex. We assume that we access the function f and
h dierently via:
1. Let Tf be the cost of computing ∇f (x).
2
2. Let Th,λ be the cost of minimizing h(x) + λ2 ∥x − c∥2 exactly.
The idea is to move whatever we can optimize in ϕ to h and hopefully this makes the remaining part of ϕ,
f , as smooth and strongly convex as possible. To make the statement general, we only assume h is convex
and hence h may not be dierentiable. To handle this issue, we need to dene an approximate derivative of
h that we can compute.
Denition 7.13. We dene the gradient step
L 2
px = argminy f (x) + ∇f (x)⊤ (y − x) + ∥y − x∥ + h(y)
2
and the gradient mapping
gx = L(x − px ).
Note that if h = 0, then px = x − 1
L ∇f (x) and gx = ∇f (x). In general, if ϕ ∈ C 2 , then we have that
1 1
px = x − ∇ϕ(x) + O( 2 ).
L L
7.4. Accelerated Gradient Descent 107
Therefore, we have that gx = ∇ϕ(x) + O( L1 ). Hence, the gradient mapping is an approximation of the
gradient of ϕ that is computable in time Tg + Th,L .
The key lemma we use here is that ϕ satises a lower bound dening using gx . Ideally, we would love to
get a lower bound as follows:
µ 2
ϕ(z) ≥ ϕ(x) + gx⊤ (z − x) + ∥z − x∥2 .
2
But it is WRONG. If that was true for all z , then we would have gx = ∇ϕ(x). However, if ϕ ∈ C 2 is µ
strongly convex, then we have
µ 2
ϕ(z) ≥ ϕ(x) + ∇ϕ(x)⊤ (z − x) + ∥z − x∥2
2
1 1 2 µ 2
≥ ϕ(x − ∇ϕ(x)) + ∥∇ϕ(x)∥ + ∇ϕ(x)⊤ (z − x) + ∥z − x∥2 . (7.4)
L 2L 2
It turns out that this is true and is exactly what we need for proving gradient descent, mirror descent and
accelerated gradient descent.
Theorem 7.14. Given ϕ = f + h. Suppose that f is µ strongly convex with L-Lipschitz gradient. Then, for
any z, we have that
1 2 µ 2
ϕ(z) ≥ ϕ(px ) + gx⊤ (z − x) + ∥gx ∥2 + ∥z − x∥2 .
2L 2
2
Proof. Let f (y) = f (x) + ∇f (x)⊤ (y − x) + L2 ∥y − x∥2 and pt = px + t(z − px ). Using that p0 is the minimizer
of f + h, we have that
L 2
f (p0 ) + h(p0 ) ≤ f (pt ) + h(pt ) ≤ f (p0 ) + ∇f (p0 )⊤ (pt − p0 ) + ∥pt − p0 ∥ + h(pt ).
2
Hence, we have that
L 2
0 ≤ ∇f (p0 )⊤ (pt − p0 ) + ∥pt − p0 ∥ + h(pt ) − h(p0 )
2
Lt2 2
≤ t · ∇f (p0 )⊤ (z − p0 ) + h(z) − h(p0 ) + ∥z − p0 ∥ .
2
Taking t → 0+ , we have
∇f (px )⊤ (z − px ) + h(z) − h(px ) ≥ 0
Expanding the term ∇f (px ), we have
∇f (x)⊤ (z − px ) + L(px − x)⊤ (z − px ) + h(z) − h(px ) ≥ 0.
Equivalently,
h(z) ≥ h(px ) + ∇f (x)⊤ (px − z) + L(px − x)⊤ (px − z)
1
= h(px ) + ∇f (x)⊤ (px − z) + ∥gx ∥2 + L(px − x)⊤ (x − z)
L
1
= h(px ) + ∇f (x)⊤ (px − z) + ∥gx ∥2 + gx⊤ (z − x).
L
Using that
µ
f (z) ≥ f (x) + ∇f (x)⊤ (z − x) + ∥z − x∥2
2
and that
1
f (px ) ≤ f (x) + ∇f (x)⊤ (px − x) + ∥gx ∥2 ,
2L
we have the result:
1 µ
ϕ(z) ≥ h(px ) + f (x) + ∇f (x)⊤ (px − x) + ∥gx ∥2 + gx⊤ (z − x) + ∥z − x∥2
L 2
1 µ
≥ ϕ(px ) + ∥gx ∥2 + gx⊤ (z − x) + ∥z − x∥2 .
2L 2
7.4. Accelerated Gradient Descent 108
L 2
f (x) + ∇f (x)⊤ (px − x) + ∥px − x∥ + h(px ) ≤ f (x) + h(x).
2
Using h(px ) ≥ h(x) + ∇h(x)⊤ (px − x), we have that
L 2 L 2
0 ≥ ∇ϕ(x)⊤ (px − x) + ∥px − x∥ ≥ −G ∥px − x∥2 + ∥px − x∥ .
2 2
Hence, we have that ∥px − x∥2 ≤ 2
LG and hence ∥gx ∥2 ≤ 2G.
1 2
ϕ(px ) ≤ ϕ(x) − ∥gx ∥2 .
2L
2
This shows that each step of the gradient step decreases the function value by 2L1
∥gx ∥2 . Therefore, if
the gradient is large, then we decrease the function value by a lot. On the other hand, Putting z = x∗ for
Theorem 7.14 shows that
ϕ(x∗ ) ≥ ϕ(px ) + gx⊤ (x∗ − x).
If the gradient is small and domain is bounded, this shows that we are close to the optimal. Combining
these two facts, we can get the gradient descent.
Therefore, it suces to upper bound gx⊤(i) (x(i) − x∗ ). The following lemma shows that if gx⊤(i) (x(i) − x∗ ) is
large, then either the gradient is large or the distance to optimum moves a lot. It turns out this holds for
any vector g , not necessarily an approximate gradient.
η 2 1 2 2
g ⊤ (x − u) = ∥g∥2 + ∥x − u∥2 − ∥p − u∥2
2 2η
for any u.
7.4. Accelerated Gradient Descent 109
Note that if τ = 1, the algorithm is simply mirror descent and if τ = 0, the algorithm is gradient descent.
r
µ Ω(T )
ϕ(x) − ϕ(x∗ ) ≤ 2(1 − ) ϕ(x(1) ) − ϕ(x∗ )
L
in T steps. Furthermore, each step takes Tf + Th,L
Proof. Lemma 7.17 showed that
2
2
η 2 1
gx⊤(k+1) (z (k) ∗ ∗
∗
(7.7)
(k)
(k+1)
− x ) ≤ ∥gx(k+1) ∥2 +
z − x
−
z −x
2 2η 2 2
This shows that if the mirror descent has large error gx⊤(k+1) (z (k) − x∗ ), then the gradient descent makes
2
a large progress ( η2 ∥gx(k+1) ∥2 ).
To make the left-hand side usable, note that x(k+1) = z (k) + 1−τ τ · (y
(k)
− x(k+1) ) and hence
1−τ ⊤
gx⊤(k+1) (x(k+1) − x∗ ) = gx⊤(k+1) (z (k) − x∗ ) + · gx(k+1) (y (k) − x(k+1) )
τ
1−τ 1 2
≤ gx⊤(k+1) (z (k) − x∗ ) + (ϕ(y (k) ) − ϕ(y (k+1) ) − ∥gx(k+1) ∥2 )
τ 2L
η 1
2
2 1 − τ 1
2 2
z − x∗
−
z (k+1) − x∗
+
(k)
≤ ∥gx(k+1) ∥2 + (ϕ(y (k) ) − ϕ(y (k+1) ) − ∥g (k+1) ∥2 ).
2 2η 2 2 τ 2L x
where we used Theorem 7.14 in the middle and (7.7) at the end.
Now, we set 1−τ τ = ηL and get
2
2
1
gx⊤(k+1) (x(k+1) − x∗ ) ≤ ηL(ϕ(y (k) ) − ϕ(y (k+1) )) + ∗
∗
(k)
(k+1)
− x − − x
.
2η
z
z
2 2
7.5. Accelerated Coordinate Descent 110
PT
Taking a sum on both side and let x = 1
T k=1 px(k+1) , we have that
T
1 X
ϕ(x) − ϕ(x∗ ) ≤ (ϕ(px(k+1) ) − ϕ(x∗ ))
T
k=1
T
1 X ⊤
≤ gx(k+1) (x(k+1) − x∗ )
T
k=1
ηL 1
2
ϕ(y (1) ) − ϕ(y (T +1) ) +
z − x∗
(1)
≤
T 2ηT 2
ηL 1
≤ + ϕ(x(1) ) − ϕ(x∗ )
T 2ηT µ
µ 2 µ 2
ϕ(x) = f (x) + h(x) with f (x) = ∥x∥2 and h(x) = ℓ(x) − ∥x∥2 .
2 2
Since f is µ + smooth (YES!, I know this is also µ smooth) and µ strongly convex and since h is convex,
L
n q
we apply Theorem 7.18 and get an algorithm that takes O∗ ( nµ L
) steps. Note that each step involves
Tf + Th,µ+ L . Obviously, Tf = 0. Next, note that Th,µ+ L involves solving a problem of the form
n n
µ L 2 µ 2
yx = argminy ( + ) ∥y − x∥ + (ℓ(y) − ∥x∥ )
2 2n 2
L 2
= argminy ℓ(y) − µy ⊤ x + ∥y − x∥ .
2n
Now, we can apply Theorem 6.29 to solve this problem. It takes
L + (L/n) · n
O∗ ( ) = O∗ (n) coordinate steps.
L/n
P q
Remark 7.20. It is known how to do it in O∗ ( i Lµi ) steps [4].
7.6. Accelerated Stochastic Descent 111
Similar to the coordinate descent, we can accelerate it using the accelerated gradient descent (Theorem
7.18). To apply Theorem 7.18, we consider the function
µ 2 µ 2
ϕ(x) = f (x) + h(x) with f (x) = ∥x∥2 and h(x) = ℓ(x) − ∥x∥2 .
2 2
Since f is µ + smooth (YES!, I know this is also µ smooth) and µ strongly convex and since h is convex,
L
n q
we apply Theorem 7.18 and get an algorithm that takes O∗ (1 + nµ L
) steps. Note that each step involves
Tf + Th,µ+ L . Obviously, Tf = 0. Next, note that Th,µ+ L involves solving a problem of the form
n n
µ L 2 µ 2
yx = argminy ( + ) ∥y − x∥ + (ℓ(y) − ∥x∥ )
2 2n 2
L 2
= argminy ℓ(y) − µy ⊤ x + ∥y − x∥
2n
1X L 2
= argminy (ℓi (y) + ∥y − x∥ − µy ⊤ x)
n i 2n
Theorem 7.21. Given a convex function ℓ = n1 ℓi . Suppose that ∇2 ℓi (x) ≤ L for all i and x and that ℓ
P
is µ strongly convex. Suppose we can compute ∇ℓi in O(1) time. We have an algorithm that outputs an x
such that
Eℓ(x) − ℓ(x∗ ) ≤ ε(ℓ(x(0) ) − ℓ(x∗ ))
q
in O∗ (n + nL
µ) stochastic steps.
Part II
Sampling
112
Chapter 8
Gradient-based Sampling
for some closed set S . Then sampling according to e−f for a suciently large M would allow us to nd an
element of S , which could, e.g., be the minimizer of a hard-to-optimize function.
Consider a second example, which might appear more tractable:
1 ⊤
g(x) = e− 2 x Ax
1x≥0 .
Without the restriction to the nonnegative orthant, the target density is the Gaussian N (0, A−1 ), and can
be sampled by rst sampling the standard Gaussian N (0, I) and applying the linear transformation A−1/2 .
To sample from the standard Gaussian in Rn , we can sample each coordinate independently from N (0, 1),
a problem which has many (ecient) numerical recipes. But how can we handle the restriction? In the
course of forthcoming chapters, we will see that this problem and its generalization to sampling logconcave
densities, i.e., when f is convex, can be solved in polynomial time. It is remarkable that the polynomial-time
frontier for both optimization and sampling is essentially determined by convexity.
We begin with gradient-based sampling methods. These rely on access to ∇f . These methods will in
fact be natural algorithmic versions of continuous processes on random variables, a particularly pleasing
connection. Later we will see methods that only use access to f , and others that utilize higher derivatives,
notably the Hessian. The parallels to optimization will be pervasive and striking.
Output: xt .
113
8.1. Gradient-based methods: Langevin Dynamics 114
Here f : Rn → R is a function, xt is the random variable at time t and dWt is innitesimal Brownian
motion also known as a Wiener process. We can view it as the continuous version of the following discrete
process √
xt+1 = xt − h∇f (xt ) + 2hζt
with ζt sampled independently from N (0, I). When we take the step size h → 0, this discrete process
converges to the continuous one. We discuss the continuous version rst.
A more general form of an SDE is
where xt ∈ Rn , µ(xt , t) ∈ Rn is a time-varying vector eld and σ(xt , t) ∈ Rn×m is a time-varying linear
transformation. The simplest such process is the Wiener process: dxt = dWt which nds many applications
in applied mathematics, nance, biology and physics. Another useful process is the Ornstein-Ulhenbeck
Process:
Lemma 8.1 (Itô's lemma). For any process x t ∈ Rn satisfying dxt = µ(xt )dt + σ(xt )dWt where µ(xt ) ∈ Rn
n×m
and σ(xt ) ∈ R , we have that
1
df (xt ) = ∇f (xt )⊤ dxt + (dxt )⊤ ∇2 f (xt )(dxt )
2
1
= ∇f (xt )⊤ µ(xt )dt + ∇f (xt )⊤ σ(xt )dWt + tr(σ(xt )⊤ ∇2 f (xt )σ(xt ))dt.
2
The usual chain rule comes from using Taylor expansion and taking a limit, i.e.,
Theorem 8.2 (FokkerPlanck equation). For any process xt ∈ Rn satisfying dxt = µ(xt )dt + σ(xt )dWt
n n×m
where µ(xt ) ∈ R and σ(xt ) ∈ R with the initial point x0 drawn from p0 . Then the density pt of xt
satises the equation
dpt X ∂ 1 X ∂2
=− (µ(x)i pt (x)) + [(D(x))ij pt (x)]
dt i
∂xi 2 i,j ∂xi ∂xj
Taking derivatives on the both sides with respect to t, using Itô's lemma (Lemma 8.1), and noting that
EdWt = 0, we have that
Z
⊤ ⊤ 1 ⊤ 2
ϕ(x)dpt (x)dx = E ∇ϕ(xt ) µ(xt )dt + ∇ϕ(xt ) σ(xt )dWt + tr(σ(xt ) ∇ ϕ(xt )σ(xt ))dt
2
⊤ 1 2
= E ∇ϕ(xt ) µ(xt )dt + tr(∇ ϕ(xt )D(xt ))dt .
2
Z X ∂
= − ⟨∇ϕ(x), (pt (x)D(x)i )⟩dx
i
∂xi
XZ ∂2
= ϕ(x) [(D(x))ij pt (x)] dx.
i,j
∂xi ∂xj
Hence,
Z 2
dp t
X ∂ 1 X ∂
ϕ(x) + (µ(x)i pt (x)) − [(D(x))ij pt (x)] dx = 0
dt i
∂x i 2 i,j
∂xi ∂xj
Theorem 8.3. For any smooth function f , the density proportional to F = e−f is stationary for the Langevin
dynamics.
Proof. The FokkerPlanck equation (Theorem 8.2) shows that the distribution pt of xt satises
dpt X ∂ ∂f (x) X ∂2
= ( pt (x)) + [pt (x)] . (8.1)
dt i
∂xi ∂xi i
∂x2i
Now since pt is stationary the LHS is zero and we can rewrite the above as
dpt X ∂ ∂f (x) ∂
=0= pt (x) + pt (x)
dt i
∂xi ∂xi ∂xi
X ∂
∂f (x) ∂
= pt (x) + log pt (x)
i
∂xi ∂xi ∂xi
X ∂
∂
pt (x)
= pt (x) log −f (x) .
i
∂xi ∂xi e
Use the Fokker-Planck equation to derive a corresponding stationary density. Use Itô's lemma to derive
4
E(∥Xt ∥4 ). (Hint: Take expectation on both sides of Itô's lemma for appropriate f (Xt ), and use Edf (Xt ) =
dEf (Xt ) for continuous f and df ).
Convergence via Coupling. Next we turn to the rate of convergence, which will also prove uniqueness
of the stationary distribution for the stochastic process. For this, we assume that f is strongly convex. The
proof is via the classical coupling technique [3].
Our goal is to bound the rate at which the distribution of the current point approaches the stationary
distribution, in some chosen measure of distance between distributions (for example, the TV distance). To
do this, in the coupling technique, we consider two points which are both following the random process. One
of them is already in the stationary distribution, and therefore will stay there. The other is our point. We
will show that there is a coupling of the two distributions, i.e., a joint distribution over the two points, whose
marginals are identical to the single point processes, such that the expected distance between the two points
decreases at a certain rate. More formally, we couple two copies xt , yt of the random process with dierent
starting points (the coupling is a joint distribution D(xt , yt ) with the property that its marginal for each of
xt , yt is exactly the process) and show that their distributions get closer over time.
While the challenge usually is to nd a good coupling, in the present case, the simple identity coupling
(i.e., the same Wiener process is used for both xt and yt ) works well. The distance measure we will use here
is the Wasserstein distance (in Euclidean norm, see Denition 0.12).
Exercise 8.6. Show that for two distributions with the same nite support, computing their Wasserstein
distance reduces to a bipartite matching problem.
Lemma 8.7. Let xt , yt evolve according to the Langevin diusion for a µ-strongly convex function f : Rn →
R. Then, there is a coupling γ between xt and yt s.t.
2 2
Ext ,yt ∼γ ∥xt − yt ∥ ≤ e−2µt ∥x0 − y0 ∥ .
Proof. From the denition of LD, and by using the identity coupling, i.e., the same Gaussian dWt for both
processes xt and yt , we have that
d
(xt − yt ) = ∇f (yt ) − ∇f (xt ).
dt
Hence,
1 d
∥xt − yt ∥2 = ⟨∇f (yt ) − ∇f (xt ), xt − yt ⟩ .
2 dt
Next, from the strong convexity of f , we have
µ 2
f (yt ) − f (xt ) ≥ ∇f (xt )⊤ (yt − xt ) + ∥xt − yt ∥ ,
2
µ 2
f (xt ) − f (yt ) ≥ ∇f (yt )⊤ (xt − yt ) + ∥xt − yt ∥ .
2
8.2. Langevin Dynamics is Gradient Descent in Density Space*2 117
Therefore,
1 d
∥xt − yt ∥2 ≤ −µ∥xt − yt ∥2 .
2 dt
Hence,
d
d ∥xt − yt ∥2
log ∥xt − yt ∥2 = dt
dt ∥xt − yt ∥2
≤ −2µ.
Exercise 8.8. Give an example of a function f for which the density proportional to e−f is not stationary
for the following discretized Langevin algorithm
√
x(k+1) = x(k) − ϵ∇f (x(k) ) + 2ϵZ (k)
where Z (k) ∼ N (0, 1) are independent and the distribution of x(0) is Gaussian.
Denition 8.9. The Wasserstein space P2 (Rn ) on Rn is the manifold on the set of probability measures
on Rn such that the shortest path distance of two measures x, y in this manifold is exactly equal to the
Wasserstein distance between x and y .
We let Tp (M) refer to the tangent space at a point p in a manifold M.
Lemma 8.10. n n
For any p ∈ P2 (R ) and v ∈ Tp P2 (R ), we can write v(x) = ∇ · (p(x)∇λ(x)) for some
n
function λ on R . Furthermore, the local norm of v in this metric is given by
Proof. Let p ∈ P2 (Rn ) and v ∈ Tp P2 (Rn ). We will show that any change of density v can be represented by
a vector eld c on Rn as follows: Consider the process x0 ∼ p and dt d
xt = c(xt ). Let pt be the density of
the distribution of xt . To compute dt pt , we follow the same idea as in the proof as Theorem 8.2. For any
d
smooth function ϕ, we have that Ex∼pt ϕ(x) = Eϕ(xt ). Taking derivatives on the both sides with respect to
t, we have that
Z Z Z
d
ϕ(x) pt (x)dx = ∇ϕ(x)⊤ c(x)pt (x)dx = − ∇ · (c(x)pt (x))ϕ(x)dx
dt
where we used integration by parts at the end. Since this holds for all ϕ, we have that
dpt (x)
= −∇ · (pt (x)c(x)).
dt
1 Sections marked with * are more mathematical and can be skipped.
8.2. Langevin Dynamics is Gradient Descent in Density Space*3 118
Since we are interested only in vector elds that generate the minimum movement in Wasserstein distance,
we consider the optimization problem
Z
1
min p(x)∥c(x)∥2 dx
−∇·(pc)=v 2
where we can think v is the change of pt . Let λ(x) be the Lagrangian multiplier of the constraint −∇·(pc) = v .
Then, the problem becomes
Z Z
1
min p(x)∥c(x)∥2 dx − λ(x)∇ · (p(x)c(x))dx.
c 2
Z Z
1
= min p(x)∥c(x)∥ dx + ∇λ(x)⊤ c(x) · p(x)dx.
2
c 2
Now, we note that the problem is a pointwise optimization problem whose minimizer is given by
c(x) = −∇λ(x).
This proves that any vector eld that generates minimum movement in Wasserstein distance is a gradient
eld. Also, we have that v(x) = ∇R· (p(x)∇λ(x)). Note that the right hand side is an elliptical dierential
equation and hence for any v with v(x)dx = 0, there is an unique solution λ(x). Therefore, we can write
v(x) = ∇ · (p(x)∇λ(x)) for some λ(x).
Next, we note that the movement is given by
Z
∥v∥p = p(x)∥c(x)∥2 dx = Ex∼p ∥∇λ(x)∥2 .
2
As we discussed in the gradient descent section, one can use norms other than ℓ2 norm. For the Wasser-
stein space, we should use the local norm as given in Lemma 8.10.
Theorem 8.11. Let ρt be the density of the distribution produced by Langevin Dynamics for the target
ν = e−f (x) / e−f (y) dy . Then, we have that
R
distribution
dρ 1
= argminv∈Tp P2 (Rn ) ⟨∇F (ρ), v⟩p + ∥v∥2p .
dt 2
Namely, ρt follows continuous gradient descent in the density space for the function F (ρ) = DKL (ρ∥ν) under
the Wasserstein metric.
Solving the right hand side, we have ∇c = ∇λ and hence δ = ∇·(ρ∇c). Now, we note that ∇F (ρ) = log νρ −1.
Therefore,
dρ ρ
= ∇ · (ρ∇(log − 1))
dt ν
ρ
= ∇ · (ρ∇ log )
ν
= ∇ · (ρ∇f ) + ∆ρ
To analyze this continuous descent in Wasserstein space, we rst prove that continuous gradient descent
converges exponentially whenever F is strongly convex.
8.2. Langevin Dynamics is Gradient Descent in Density Space*4 119
2
∥∇F (x)∥x ≥ α · (F (x) − min F (y)) for all x (8.2)
y
on the manifold with the metric ∥ · ∥x where ∇ is the gradient on the manifold. Then, the process dxt =
−αt
−∇F (xt )dt converges exponentially, i.e., F (xt ) − miny F (y) ≤ e (F (x0 ) − miny F (y)).
Proof. We write
d dxt
(F (x) − min F (y)) = ⟨∇F (xt ), ⟩x = −∥∇F (xt )∥2xt ≤ −α(F (x) − min F (y)).
dt y dt t y
Finally, we note that the log-Sobolev inequality for the density ν can be re-stated as the condition (8.2).
Lemma 8.13. Fix a density ν. Then the log-Sobolev inequality, namely, for every smooth function g,
Z Z
1 2
∥∇g∥ dν ≥ α g(x)2 log g(x)2 dν
2
Z
2 Z
1
ρ(x)
dx ≥ α · ρ(x) log ρ(x) dx for all ρ.
ρ(x)
∇ log
2 ν(x)
ν(x)
Combining Lemma 8.13 and Lemma 8.12, we have the following result:
Theorem 8.14. Let f be a smooth function with log-Sobolev constant α. Then the Langevin dynamics
√
dxt = −∇f (x)dt + 2dWt
converges exponentially in KL-divergence to the density ν(x) ∝ e−f (x) with mixing rate O( α1 ), i.e., KL(xt , ν) ≤
e−2αt KL(x0 , ν).
See [44] for a tight estimate of log-Sobolev constant for logconcave measures. In particular for a logconcave
measure with support of diameter D, the log-Sobolev constant is Ω(1/D).
8.2.1 Discussion
Langevin dynamics converges quickly in continuous time for isoperimetric distributions. Turning this into
an ecient algorithm typically needs more assumptions and there is much room for choosing discretizations.
This is similar to the situation with gradient descent for optimization. As we saw in Section 8.2, it turns
out that Langevin dynamics is in fact gradient descent in the space of probability measures under the
Wasserstein metric, where the function being minimized is the KL-divergence of the current density from
the target stationary density. For more on this view of sampling as optimization over measures, see [75].
Chapter 9
vol(S ∩ H) vol(S ∩ H)
centroid(S) = centroid(S ∩ H) + centroid(S ∩ H).
vol(S) vol(S)
The following lemma has a proof similar to that of the Grunbaum theorem.
Lemma 9.2. Let K ⊆ Rn be a convex body with centroid at the origin. Suppose that for some unit vector
θ, the support of K along θ is [a, b]. Then,
b
. |a| ≥
n
Exercise 9.3. Prove Lemma 9.2. [Hint: Use Theorem 1.27.]
Using the above property, we can show that the algorithm reaches a cuboid in a small number of iterations.
Theorem 9.4. Let K be a convex body in Rn containing a cube of side length r around its centroid and
contained in a cube of side length R. Algorithm CuttingPlaneVolume correctly computes the volume of K
nR
using O(n log r ) centroid computations.
120
9.2. Optimization from Membership via Sampling 121
Proof. By Lemma 3.14, at each iteration, the volume of the remaining set $K^{(k)}$ decreases to at most $(1 - \frac{1}{e})$ of its value. When the directional width along an axis is less than $r/2$ (namely, $\max_{x \in K} e_i^\top x \le \frac{r}{2}$), the algorithm stops cutting along that axis. So, just before the last cut along any axis, the width in that direction is at least $r/2$. Then, we use the center of gravity to cut. By Lemma 9.2, the directional width along every axis of the surviving set is at least $r/(2(n+1))$. Since the set always contains the origin, and the original set contains a cube of side length r, when the algorithm stops, the remaining set is an axis-parallel cuboid with each side of length in the range $[r/(2(n+1)), r/2]$. So the final volume is at least $(r/(2(n+1)))^n$. The initial volume is at most $R^n$. Therefore the number of iterations is at most
$$\log_{\frac{1}{1-1/e}} \frac{R^n}{(r/(2(n+1)))^n} = O\left(n \log(nR/r)\right).$$
In each iteration, by Lemma 9.1, the algorithm maintains the ratio of the volume of the original K to the current $K^{(k)}$.
The above algorithm shows that computing the volume is polytime reducible to computing the centroid.
Since volume is known to be #P-hard for explicit polytopes, this means that centroid computation is also
#P-hard for polytopes [62]. In later chapters we will see randomized polytime algorithms for sampling and
hence for approximating centroid and volume.
Exercise 9.5. [12] Given a partial order P on an n-element set, it is of interest to count the number of linear extensions, i.e., total orders on the set that are consistent with P. We can define an associated polyhedron,
$$Q = \{x \in [0,1]^n : x_i < x_j \text{ if } (i,j) \in P\}.$$
Show that the number of linear extensions of P is exactly $\mathrm{vol}(Q) \cdot n!$.
The theorem says that if we sample according to $e^{-\alpha c^\top x}$ for $\alpha = n/\varepsilon$, we will get an ε-approximation to the optimum. However, sampling from such a density is not trivial. Instead, we will have to go through a sequence of overlapping distributions, starting with one that is easy to sample and ending with a distribution that is focused close to the minimum. This method is known as simulated annealing and is the subject of Chapter 11. The complexity of sampling is polynomial in the dimension and logarithmic in a suitable notion of probabilistic distance between the starting distribution and the target distribution. The sampling algorithm only uses a membership (EVAL) oracle.
Exercise 9.7. Extend Theorem 9.6 by replacing $c^\top x$ with any convex function $f(x)$.
Open Problem. Given an approximately convex function F on the unit ball such that $\max_{\|x\|_2 \le 1} |f(x) - F(x)| \le \varepsilon/n$ for some convex function f, how efficiently can we find x in the unit ball such that $F(x) \le \min_{\|x\|_2 \le 1} F(x) + O(\varepsilon)$? The current fastest algorithm takes $O(n^4 \log^{O(1)}(n/\varepsilon))$ calls to the noisy EVAL oracle for F.
Chapter 10
Geometrization
10.1 Basics of Markov chains
Example. For the ball walk in a convex body, the state space K is the convex body, and $\mathcal{A}$ is the set of all measurable subsets of K. The next step distribution is
$$P_u(\{u\}) = 1 - \frac{\mathrm{vol}(K \cap (u + \delta B^n))}{\mathrm{vol}(\delta B^n)},$$
$$P_u(A) = \frac{\mathrm{vol}(A \cap (u + \delta B^n))}{\mathrm{vol}(\delta B^n)} + 1_{u \in A}\,P_u(\{u\}).$$
The uniform distribution is stationary, i.e., $Q(A) = \frac{\mathrm{vol}(A)}{\mathrm{vol}(K)}$.
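As an illustration (a minimal sketch under our own assumptions, not code from the text), one step of the ball walk can be implemented with nothing more than a membership oracle `in_K`:

```python
import numpy as np

def ball_walk_step(u, in_K, delta, rng):
    # Propose a uniform random point in u + delta*B^n; move only if it stays in K.
    n = len(u)
    d = rng.standard_normal(n)
    d *= delta * rng.random() ** (1.0 / n) / np.linalg.norm(d)  # uniform in the ball
    y = u + d
    return y if in_K(y) else u          # rejected proposals realize P_u({u})

rng = np.random.default_rng(0)
in_cube = lambda x: np.all(np.abs(x) <= 1)   # K = [-1, 1]^10
x = np.zeros(10)
for _ in range(1000):
    x = ball_walk_step(x, in_cube, delta=0.3, rng=rng)
```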
A distribution Q is stationary if and only if $\Phi(A) = \Phi(K \setminus A)$ for every measurable subset A. The existence and uniqueness of the stationary distribution Q for general Markov chains is a subject on its own. One way to ensure uniqueness of a stationary distribution is to use lazy Markov chains. In a lazy version of a given Markov chain, at each step, with probability 1/2, we do nothing; with the remaining probability we take a step according to the Markov chain. The next fact is folklore.
Exercise 10.2. If Q is stationary w.r.t. a lazy ergodic Markov chain, then it is the unique stationary
distribution for that Markov chain.
Informally, the mixing rate of a random walk is the number of steps required to reduce some measure of
the distance of the current distribution to the stationary distribution by a constant factor. The following
notions will be useful for comparing two distributions P, Q.
1. Total variation distance is $d_{tv}(P,Q) = \sup_{A \in \mathcal{A}} |P(A) - Q(A)|$.
2. The $L_2$ or $\chi^2$-distance of P with respect to Q is
$$\chi^2(P,Q) = \int_K \left(\frac{dP}{dQ}(u) - 1\right)^2 dQ(u) = \int_K \left(\frac{dP}{dQ}(u)\right)^2 dQ(u) - 1 = \int_K \frac{dP}{dQ}(u)\,dP(u) - 1.$$
3. Warmth: P is said to be M-warm w.r.t. Q if $\sup_{A \in \mathcal{A}} \frac{P(A)}{Q(A)} \le M$.
The conductance of a subset A, the conductance of the chain, and the s-conductance are defined as
$$\phi(A) = \frac{\Phi(A)}{\min\{Q(A), Q(K \setminus A)\}}, \qquad \phi = \min_A \phi(A), \qquad \phi_s = \min_{A : s < Q(A) \le \frac{1}{2}} \frac{\Phi(A)}{Q(A) - s}.$$
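These definitions are easy to evaluate exactly for a small discrete chain. The following brute-force sketch (ours, purely illustrative) computes the conductance of a lazy walk on a path, where the bottleneck cut is the half-path:

```python
import numpy as np
from itertools import chain, combinations

# Lazy symmetric walk on a path of k vertices: move to each neighbor w.p. 1/4,
# stay otherwise.  P is symmetric, so the uniform distribution Q is stationary.
k = 8
P = np.zeros((k, k))
for i in range(k):
    for j in (i - 1, i + 1):
        if 0 <= j < k:
            P[i, j] = 0.25
    P[i, i] = 1.0 - P[i].sum()

def ergodic_flow(A):
    # Phi(A) = sum_{i in A} Q(i) P_i(K \ A), with Q uniform
    Ac = [j for j in range(k) if j not in A]
    return sum(P[i, j] for i in A for j in Ac) / k

subsets = chain.from_iterable(combinations(range(k), r) for r in range(1, k // 2 + 1))
phi = min(ergodic_flow(A) / (len(A) / k) for A in subsets)
print(phi)   # attained by a half-path cut; of order 1/k for this chain
```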
Ideally we would like to show that $d(Q_t, Q)$, the distance between the distribution after t steps and the target Q, is monotonically (and rapidly) decreasing. We consider
$$h_t(x) = \sup_{g \in G_x} \int_{u \in K} g(u)\,dQ_t(u)$$
for each $x \in [0,1]$, where $G_x$ is the set of functions defined as
$$G_x = \left\{ g : K \to [0,1] \,:\, \int_{u \in K} g(u)\,dQ(u) = x \right\}.$$
To prove inductively that this quantity decreases, we show that, for $y = \min\{x, 1-x\}$,
$$h_t(x) \le \frac{1}{2}h_{t-1}(x - 2\phi y) + \frac{1}{2}h_{t-1}(x + 2\phi y).$$
Proof. Assume that $0 \le x \le \frac{1}{2}$. We construct two functions, $g_1$ and $g_2$, and use these to bound $h_t(x)$. Let A be a subset that achieves $h_t(x)$. Define
$$g_1(u) = \begin{cases} 2P_u(A) - 1 & \text{if } u \in A, \\ 0 & \text{if } u \notin A, \end{cases} \qquad g_2(u) = \begin{cases} 1 & \text{if } u \in A, \\ 2P_u(A) & \text{if } u \notin A. \end{cases}$$
Note that $\frac{1}{2}(g_1 + g_2)(u) = P_u(A)$ for all $u \in K$, which means that
$$\frac{1}{2}\int_{u \in K} g_1(u)\,dQ_{t-1}(u) + \frac{1}{2}\int_{u \in K} g_2(u)\,dQ_{t-1}(u) = \int_{u \in K} P_u(A)\,dQ_{t-1}(u) = Q_t(A).$$
Since the walk is lazy, $P_u(A) \ge \frac{1}{2}$ iff $u \in A$; hence the range of the functions $g_1, g_2$ is $[0,1]$. We let
$$x_1 = \int_{u \in K} g_1(u)\,dQ(u) \quad \text{and} \quad x_2 = \int_{u \in K} g_2(u)\,dQ(u),$$
so that $g_i \in G_{x_i}$ and hence $Q_t(A) \le \frac{1}{2}h_{t-1}(x_1) + \frac{1}{2}h_{t-1}(x_2)$. Moreover, $x_1 + x_2 = 2\int_{u \in K} P_u(A)\,dQ(u) = 2Q(A) = 2x$, since Q is stationary.
Next,
$$\begin{aligned} x_1 &= \int_{u \in K} g_1(u)\,dQ(u) \\ &= 2\int_A P_u(A)\,dQ(u) - \int_A dQ(u) \\ &= 2\int_A (1 - P_u(K \setminus A))\,dQ(u) - x \\ &= x - 2\int_A P_u(K \setminus A)\,dQ(u) \\ &= x - 2\Phi(A) \\ &\le x - 2\phi x = x(1 - 2\phi). \end{aligned}$$
Thus we have $x_1 \le x(1 - 2\phi) \le x \le x(1 + 2\phi) \le x_2$, with $x_1 + x_2 = 2x$. Since $h_{t-1}$ is concave, the chord of $h_{t-1}$ from $x_1$ to $x_2$ lies below the chord from $x(1-2\phi)$ to $x(1+2\phi)$ at their common midpoint x. Therefore,
$$h_t(x) \le \frac{1}{2}h_{t-1}(x(1 - 2\phi)) + \frac{1}{2}h_{t-1}(x(1 + 2\phi)).$$
Then
$$h_t(x) \le C_0 + C_1 \min\left\{\sqrt{x - s},\ \sqrt{1 - x - s}\right\}\left(1 - \frac{\phi_s^2}{2}\right)^t.$$
The proof is by induction on t.
Corollary 10.7. We have:
1. Let $M = \sup_A Q_0(A)/Q(A)$. Then,
$$d_{TV}(Q_t, Q) \le \sqrt{M}\left(1 - \frac{\phi^2}{2}\right)^t.$$
2. Let $0 < s \le \frac{1}{2}$ and $H_s = \sup\{|Q_0(A) - Q(A)| : Q(A) \le s\}$. Then,
$$d_{TV}(Q_t, Q) \le H_s + \frac{H_s}{s}\left(1 - \frac{\phi_s^2}{2}\right)^t.$$
This parameter allows us to show convergence of the current distribution to the target in relative entropy. Recall that the relative entropy of a distribution P with respect to a distribution Q is
$$H_Q(P) = \int_K \frac{dP}{dQ}(x)\log\frac{dP}{dQ}(x)\,dQ(x).$$
Theorem 10.8. For a Markov chain with distribution $Q_t$ at time t and log-Sobolev parameter $\rho$, we have $H_Q(Q_t) \le (1 - \rho)^t\,H_Q(Q_0)$.
To convey the main ideas of the analysis, we focus on the first approach here. The goal is to show that the conductance of any subset is large, i.e., the probability of crossing over in one step is at least proportional to the measure of the set or its complement, whichever is smaller. First, we argue that the one-step distributions of two points will have a significant overlap if the points are sufficiently close.

Setting $t = \ell/2$, this says that if the total variation distance between the one-step distributions from u, v is greater than $1 - \ell/2$, then the distance between them is at least $\frac{\ell\delta}{2\sqrt{n}}$. What this effectively says is that points close to the internal boundary of a subset are likely to cross over to the other side. To complete a proof, we would need to show that the internal boundary of any subset is large if the subset (or its complement) is large, a purely geometric property.
Theorem 10.11 (Isoperimetry). Let $S_1, S_2, S_3$ be a partition of a convex body K of diameter D. Then,
$$\mathrm{vol}(S_3) \ge \frac{2\,d(S_1, S_2)}{D}\min\{\mathrm{vol}(S_1), \mathrm{vol}(S_2)\}.$$
This can be generalized to any logconcave measure. We will discuss this and other extensions in detail later. But first we bound the conductance.
Theorem 10.12. Let K be a convex body in $\mathbb{R}^n$ of diameter D containing the unit ball and with every $u \in K$ having $\ell(u) \ge \ell$. Then the conductance of the ball walk on K with step size $\delta$ is
$$\Omega\left(\frac{\ell^2\delta}{\sqrt{n}\,D}\right).$$
Proof. Let $K = S_1 \cup S_2$ be a partition into measurable sets. We will prove that
$$\int_{S_1} P_x(S_2)\,dx \ge \frac{\ell^2\delta}{16\sqrt{n}\,D}\min\{\mathrm{vol}(S_1), \mathrm{vol}(S_2)\}. \tag{10.1}$$
Consider the points that are deep inside these sets, i.e., unlikely to jump out of the set:
$$S_1' = \left\{x \in S_1 : P_x(S_2) < \frac{\ell}{4}\right\} \quad \text{and} \quad S_2' = \left\{x \in S_2 : P_x(S_1) < \frac{\ell}{4}\right\}.$$
For any $u \in S_1'$ and $v \in S_2'$, the one-step distributions $P_u, P_v$ have total variation distance greater than $1 - \ell/2$, so
$$\|u - v\| \ge \frac{\ell\delta}{2\sqrt{n}}.$$
Thus $d(S_1', S_2') \ge \ell\delta/(2\sqrt{n})$. Applying Theorem 10.11 to the partition $S_1', S_2', S_3'$, we have
$$\mathrm{vol}(S_3') \ge \frac{\ell\delta}{\sqrt{n}\,D}\min\{\mathrm{vol}(S_1'), \mathrm{vol}(S_2')\} \ge \frac{\ell\delta}{2\sqrt{n}\,D}\min\{\mathrm{vol}(S_1), \mathrm{vol}(S_2)\}.$$
Corollary 10.13. The ball walk in a convex body with local conductance at least $\ell$ everywhere has mixing rate $O(nD^2/(\delta^2\ell^4))$.
Using the construction above of adding a small ball to every point of K, we can take $\delta = 1/n^{3/2}$ and $\ell = \Omega(1)$, and thus get a polynomial bound of $O(n^4D^2)$ on the mixing time. As we will see presently, this can be improved to $n^2D^2$ by avoiding the blow-up and analyzing the average local conductance. The example of starting near a corner (say of a hypercube) shows that this cannot work in general; however, from a warm start, it will suffice to bound the average local conductance rather than the minimum.
Theorem 10.14. From a warm start, the ball walk in a convex body of diameter D containing a unit ball has a mixing rate of $O(n^2D^2)$ steps.
This is based on two ideas: (1) most points of a convex body containing a unit ball have large local conductance, and we can use $\delta = 1/\sqrt{n}$ instead of $1/n^{3/2}$; (2) the s-conductance is large and hence the walk mixes from a suitably warm start.
Lemma 10.15. Let K be a convex body containing a unit ball. For the ball walk with step size $\delta$, let $K_\delta = \left\{u \in K : \ell(u) \ge \frac{3}{4}\right\}$. Then $K_\delta$ is a convex set and $\mathrm{vol}(K_\delta) \ge (1 - 2\delta\sqrt{n})\,\mathrm{vol}(K)$.
Lemma 10.17. Let L be any measurable subset of the boundary of a convex body K and let $S_L = \{(x,y) : x \in K,\ y \notin K,\ \|x - y\| \le \delta,\ [x,y] \cap L \ne \emptyset\}$. Then we have
$$\mathrm{vol}_{2n}(S_L) \le \frac{\delta}{n+1}\cdot\frac{\mathrm{vol}(B^{n-1})}{\mathrm{vol}(B^n)}\cdot\mathrm{vol}_{n-1}(L)\,\mathrm{vol}(\delta B^n).$$
Proof. It suffices to consider the case when L is infinitesimally small; then we can assume that the surface of K is locally a hyperplane and compute the measure of $S_L$ explicitly.
Proof. As before, we consider the following partition of K. Let $K = S_1 \cup S_2$ be a partition into measurable sets. We will prove that
$$\int_{S_1} P_x(S_2)\,dx \ge \frac{\delta}{C\sqrt{n}\,D}\min\left\{\mathrm{vol}(S_1) - \frac{s}{2},\ \mathrm{vol}(S_2) - \frac{s}{2}\right\}. \tag{10.2}$$
Since the uniform distribution is stationary,
$$\int_{S_1} P_x(S_2)\,dx = \int_{S_2} P_x(S_1)\,dx.$$
Moreover, $\mathrm{vol}(S_i'') \ge \mathrm{vol}(S_i') - s$.
Speedy walk
In the above analysis of the ball walk, the dependence on the error parameter ε, the distance to the target distribution, is polynomial in $1/\varepsilon$ rather than in its logarithm. The speedy walk is a way to improve the analysis. In the speedy walk, at a point x, we sample the next step uniformly from the intersection $(x + \delta B^n) \cap K$. The resulting Markov chain is the subsequence of proper steps of the ball walk.
Exercise 10.20. Show that the stationary density of the speedy walk in a convex body is proportional to
the local conductance.
Any distribution with bounded second moments has an isotropic transformation. It is clear that satisfying the first condition is merely a translation, so assume the mean is zero. For the second, suppose the covariance matrix is $E_Q(xx^\top) = A$. Then consider $y = A^{-1/2}x$. It is easy to see that $E(yy^\top) = I$.
Theorem 10.22. For a convex body in isotropic position (i.e., the uniform distribution over the body is isotropic), we have
$$\sqrt{\frac{n+2}{n}}\,B^n \subseteq K \subseteq \sqrt{n(n+2)}\,B^n.$$
Thus the effective diameter is O(n). If we could place a convex body in isotropic position before sampling, we would have a poly(n) algorithm. In fact, it is even better than this, as most points are within distance $O(\sqrt{n})$ of the center of gravity. We quote a theorem due to Paouris.
How do we compute an isotropic transformation? This is easy from the definition: all we need is to estimate the covariance matrix, which can be done from random samples. Thus, if we could sample from K, we could compute an isotropic transformation for it. This appears cyclic: we need isotropy for efficient sampling and efficient sampling for isotropy. The solution is simply to bootstrap them.
Algorithm 29: IsotropicTransform
Input: membership oracle for K s.t. $B^n \subseteq K \subseteq DB^n$.
Let x be a random point in $B^n$, $A = I$ and $K_i = 2^{i/n}B^n \cap K$.
for $i = 1, \cdots, n \log D$ do
    1. Use the ball walk from x to generate N random points $x_1, \ldots, x_N$ in $AK_i$.
    2. Compute $C = \frac{1}{N}\sum_{j=1}^{N} x_j x_j^\top$ and set $A = C^{-1/2}A$.
    3. Set $x = x_N$.
end
return x.
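The following sketch (ours) isolates the bootstrapping step, assuming an abstract sampler oracle `sample_body(A, x0, N)` that returns N approximate samples from the current transformed body (e.g., by running the ball walk):

```python
import numpy as np

def isotropy_bootstrap(sample_body, n, phases, N):
    A = np.eye(n)
    x = np.zeros(n)
    for _ in range(phases):
        pts = sample_body(A, x, N)       # step 1: N samples from the current body
        C = pts.T @ pts / N              # step 2: empirical second-moment matrix
        w, V = np.linalg.eigh(C)
        C_inv_half = V @ np.diag(w ** -0.5) @ V.T
        A = C_inv_half @ A               # whiten: the new body is near-isotropic
        x = C_inv_half @ pts[-1]         # step 3: warm start for the next phase
    return A
```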
We will choose N large enough so that after the transformation Ki is 2-isotropic and therefore Ki+1 is
6-isotropic. We can bound N as follows.
Exercise 10.24. Show that if K is isotropic, then with $N = O(n^2)$, the matrix $A = \frac{1}{N}\sum_{i=1}^N x_i x_i^\top$ for N random samples from K satisfies $\|A - I\|_{op} \le 0.5$.
A tight bound on the sample complexity was established by [1] (see also [11, 63, 68]).
Theorem 10.25. For an isotropic logconcave distribution Q in $\mathbb{R}^n$, the empirical covariance A of $N = O(n)$ random samples satisfies $\|A - I\|_{op} \le 0.5$.
Thus the overall algorithm needs O(n log D) phases, with O(n) samples in each phase from a near-isotropic
distribution, and thus poly(n) steps per sample.
Thus, for a (near-)isotropic distribution, the diameter can be replaced by $O(\sqrt{n})$, and this gives a bound of $O(n^3)$ from a warm start. One way to summarize the analysis so far is that the complexity of sampling a convex body (and in fact a logconcave density) from a warm start is $O^*(n^2/\psi^2)$ where $\psi$ is the isoperimetric ratio of the convex body. In other words, the expansion of the Markov chain reduces to the expansion of the target logconcave density. It then becomes a natural question to find the best possible estimate for the isoperimetric ratio. KLS also provided a conjecture for this.
Conjecture 10.27. The isoperimetric ratio of any isotropic logconcave density in Rn is Ω(1).
The bound of the conjecture holds for all halfspace-induced subsets. So the conjecture says that the worst isoperimetry is achieved, up to a constant factor, by a halfspace (this version does not need isotropic position). Here we discuss a powerful technique for proving such inequalities.
Classical proofs of isoperimetry for special distributions are based on different types of symmetrization that effectively identify the extremal subsets. Bounding the Cheeger constant for general convex bodies and logconcave densities is more complicated, since the extremal sets can be nonlinear and hard to describe precisely, due to the trade-off between minimizing the boundary measure of a subset and utilizing as much of the external boundary as possible. The main technique to prove bounds in the general setting has been localization, a method to reduce inequalities in high dimension to inequalities in one dimension. We now describe this technique with a few applications.
10.5.1 Localization
We will sketch a proof of the following theorem to illustrate the use of localization. This theorem was also proved by Karzanov and Khachiyan [35] using a different, more direct approach.

Theorem 10.28 ([22, 49, 35]). Let f be a logconcave function whose support has diameter D and let $\pi_f$ be the induced measure. Then for any partition of $\mathbb{R}^n$ into measurable sets $S_1, S_2, S_3$,
$$\pi_f(S_3) \ge \frac{2\,d(S_1, S_2)}{D}\min\{\pi_f(S_1), \pi_f(S_2)\}.$$
Before discussing the proof, we note that there is a variant of this result in the Riemannian setting.
Theorem 10.29 ([46]). If $K \subset (M, g)$ is a locally convex bounded domain with smooth boundary, diameter D and $\mathrm{Ric}_g \ge 0$, then the Poincaré constant is at least $\frac{\pi^2}{4D^2}$, i.e., for any smooth function $g$ with $\int g = 0$, we have that
$$\int |\nabla g(x)|^2\,dx \ge \frac{\pi^2}{4D^2}\int g(x)^2\,dx.$$
For the case of convex bodies in $\mathbb{R}^n$, this result is equivalent to Theorem 10.28 up to a constant. One benefit of localization is that it does not require a carefully crafted potential. Localization has recently been generalized to the Riemannian setting [39]. The origins of this method were in a paper by Payne and Weinberger [61].
We begin the proof of Theorem 10.28. For a proof by contradiction, assume that its conclusion fails, i.e., for some partition $S_1, S_2, S_3$ of $\mathbb{R}^n$ and logconcave density f, with $C = 2d(S_1, S_2)/D$,
$$\int_{S_3} f(x)\,dx < C\int_{S_1} f(x)\,dx \quad \text{and} \quad \int_{S_3} f(x)\,dx < C\int_{S_2} f(x)\,dx.$$
Equivalently, the functions $g = Cf\cdot 1_{S_1} - f\cdot 1_{S_3}$ and $h = Cf\cdot 1_{S_2} - f\cdot 1_{S_3}$ both have positive integrals. These inequalities are for functions on $\mathbb{R}^n$. The next lemma will help us analyze them.
Lemma 10.30 (Localization Lemma [32]). Let $g, h : \mathbb{R}^n \to \mathbb{R}$ be lower semi-continuous integrable functions such that
$$\int_{\mathbb{R}^n} g(x)\,dx > 0 \quad \text{and} \quad \int_{\mathbb{R}^n} h(x)\,dx > 0.$$
Then there exist two points $a, b \in \mathbb{R}^n$ and an affine function $\ell : [0,1] \to \mathbb{R}_+$ such that
$$\int_0^1 \ell(t)^{n-1}g((1-t)a + tb)\,dt > 0 \quad \text{and} \quad \int_0^1 \ell(t)^{n-1}h((1-t)a + tb)\,dt > 0.$$
The points a, b represent an interval and one may think of $\ell(t)^{n-1}$ as proportional to the cross-sectional area of an infinitesimal cone. The lemma says that over this cone truncated at a and b, the integrals of g and h are positive. Also, without loss of generality, we can assume that a, b are in the union of the supports of g and h.
Proof outline. The main idea is the following. Let H be any halfspace such that
$$\int_H g(x)\,dx = \frac{1}{2}\int_{\mathbb{R}^n} g(x)\,dx.$$
Thus, either H or its complementary halfspace will have positive integrals for both g and h, reducing the domain of the integrals from $\mathbb{R}^n$ to a halfspace. If we could repeat this, we might hope to reduce the dimensionality of the domain. For any $(n-2)$-dimensional affine subspace L, there is a bisecting halfspace containing L in its bounding hyperplane. To see this, let H be a halfspace containing L in its boundary. Rotating H about L, we get a family of halfspaces with the same property. This family includes $H'$, the complement of H. Since $\int_H g\,dx - \frac{1}{2}\int_{\mathbb{R}^n} g\,dx$ changes sign between H and $H'$ and varies continuously over this family, there must be a halfspace for which the function is zero.
If we take all $(n-2)$-dimensional affine subspaces defined by $\{x \in \mathbb{R}^n : x_i = r_1, x_j = r_2\}$ where $r_1, r_2$ are rational, then the intersection of all the corresponding bisecting halfspaces is a line or a point (by choosing only rational values for $x_i$, we are considering a countable intersection). To see why it is a line or a point, assume we are left with a two- or higher-dimensional set. Since the intersection is convex, there is a point in its interior with at least two coordinates that are rational, say $x_1 = r_1$ and $x_2 = r_2$. But then there is a bisecting halfspace H that contains the affine subspace given by $x_1 = r_1, x_2 = r_2$ in its boundary, and so it properly partitions the current set.
Thus the limit of this bisection process is a function supported on an interval (which could be a single point), and since the support is a limit of convex sets (intersections of halfspaces) containing this interval, the limiting function is a limit of a sequence of concave functions and is itself concave, with positive integrals. Simplifying further from concave to linear takes quite a bit of work. For the full proof, we refer the reader to [50].
Going back to the proof sketch of Theorem 10.28, we can apply the localization lemma to get an interval $[a,b]$ and an affine function $\ell$ such that
$$\int_0^1 \ell(t)^{n-1}g((1-t)a + tb)\,dt > 0 \quad \text{and} \quad \int_0^1 \ell(t)^{n-1}h((1-t)a + tb)\,dt > 0. \tag{10.4}$$
The functions g, h as we have defined them are not lower semi-continuous. However, this can be addressed by expanding $S_1$ and $S_2$ slightly so as to make them open sets, and making the support of f an open set. Since we are proving strict inequalities, these modifications do not affect the conclusion.
Let us partition $[0,1]$ into $Z_1, Z_2, Z_3$ as follows:
$$Z_i = \{t \in [0,1] : (1-t)a + tb \in S_i\}.$$
Note that for any pair of points $u \in Z_1, v \in Z_2$, $|u - v| \ge d(S_1, S_2)/D$. We can rewrite (10.4) as
$$\int_{Z_3} \ell(t)^{n-1}f((1-t)a + tb)\,dt < C\int_{Z_1} \ell(t)^{n-1}f((1-t)a + tb)\,dt$$
and
$$\int_{Z_3} \ell(t)^{n-1}f((1-t)a + tb)\,dt < C\int_{Z_2} \ell(t)^{n-1}f((1-t)a + tb)\,dt. \tag{10.5}$$
Now consider what Theorem 10.28 asserts for the function $F(t) = \ell(t)^{n-1}f((1-t)a + tb)$ (a one-dimensional logconcave function) over the interval $[0,1]$ and the partition $Z_1, Z_2, Z_3$:
$$\int_{Z_3} F(t)\,dt \ge 2\,d(Z_1, Z_2)\min\left\{\int_{Z_1} F(t)\,dt,\ \int_{Z_2} F(t)\,dt\right\}. \tag{10.6}$$
We have substituted 1 for the diameter of the interval $[0,1]$. Also, $2d(Z_1, Z_2) \ge 2d(S_1, S_2)/D = C$. Thus, Theorem 10.28 applied to the function F(t) contradicts (10.5), and so to prove the theorem in general, it suffices to prove it in the one-dimensional case. A combinatorial argument reduces this to the case when each $Z_i$ is a single interval. Proving the resulting inequality up to a factor of 2 is a simple exercise and uses only the unimodality of F. The improvement to the tight bound requires one-dimensional logconcavity. This completes the proof of Theorem 10.28.
The localization lemma has been used to prove a variety of isoperimetric inequalities. The next theorem is a refinement of Theorem 10.28, replacing the diameter by the square root of the expected squared distance of a random point from the mean. For an isotropic distribution this is an improvement from n to $\sqrt{n}$. This theorem was proved by Kannan, Lovász and Simonovits in the same paper in which they proposed the KLS conjecture.
Theorem 10.31 ([32]). For any logconcave density p in $\mathbb{R}^n$ with covariance matrix A, the KLS constant satisfies
$$\psi_p \gtrsim \frac{1}{\sqrt{\mathrm{tr}(A)}}.$$
The next theorem shows that the KLS conjecture is true for an important family of distributions. The proof is again by localization [19], and the one-dimensional inequality obtained is a Brascamp-Lieb theorem. We note that the same theorem can be obtained by other means [41].

Theorem 10.32 ([19]). Let $h(x) = f(x)\,e^{-\frac{1}{2}x^\top Bx}$ where f is a logconcave function and B is positive definite. Then h is logconcave and for any measurable subset S of $\mathbb{R}^n$,
$$\frac{h(\partial S)}{\min\{h(S),\ h(\mathbb{R}^n \setminus S)\}} \gtrsim \frac{1}{\|B^{-1}\|_{op}^{1/2}}.$$
In other words, the expansion of h is $\Omega\left(\|B^{-1}\|_{op}^{-1/2}\right)$.
The analysis of the Gaussian Cooling algorithm for volume computation [20] uses localization.
Next we mention an application to the anti-concentration of polynomials. This is a corollary of a more general result by Carbery and Wright.

Theorem 10.33 ([14]). Let q be a degree-d polynomial in $\mathbb{R}^n$. Then for a convex body $K \subset \mathbb{R}^n$ of volume 1, any $\epsilon > 0$, and x drawn uniformly from K,
$$\Pr_{x \sim K}\left(|q(x)| \le \epsilon \max_K |q(x)|\right) \lesssim d\,\epsilon^{1/d}.$$
We conclude this section with a nice interpretation of the localization lemma by Fradelizi and Guédon. They also give a version that extends localization to multiple inequalities.

Theorem 10.34 (Reformulated Localization Lemma [27]). Let K be a compact convex set in $\mathbb{R}^n$ and f be an upper semi-continuous function. Let $P_f$ be the set of logconcave distributions $\mu$ supported by K satisfying $\int f\,d\mu \ge 0$. The set of extreme points of $\mathrm{conv}\,P_f$ is exactly:
1. the Dirac measures at points x such that $f(x) \ge 0$, or
2. the distributions $\nu$ satisfying:
(a) the density function is of the form $e^\ell$ with linear $\ell$,
(b) the support equals a segment $[a,b] \subseteq K$,
(c) $\int f\,d\nu = 0$,
(d) $\int_a^x f\,d\nu > 0$ for all $x \in (a,b)$ or $\int_x^b f\,d\nu > 0$ for all $x \in (a,b)$.

Since the maximizer of any convex function is attained at an extreme point, this shows that one can optimize $\max_{\mu \in P_f} \Phi(\mu)$ for any convex $\Phi$ by checking only Dirac measures and distributions with log-affine densities.
10.6 Hit-and-Run
The ball walk does not mix rapidly from all starting points. While this hurdle can be overcome by starting with a deep point and carefully maintaining a warm start, it is natural to ask if there is a simple process that truly mixes rapidly from any starting point. Hit-and-Run satisfies this requirement.
Algorithm 30: Hit-and-Run
Input: starting point $x_0$ in a convex body K.
Repeat T times: at the current point x,
    1. Pick a uniform random direction ℓ through x.
    2. Go to a uniform random point y on the chord of K induced by ℓ.
return x.
Since hit-and-run is a symmetric Markov chain, the uniform distribution on K is stationary for it.
To sample from a general density proportional to f (x), in Step 2, we sample y according to the density
proportional to f restricted to the random line ℓ.
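A sketch of one hit-and-run step for the uniform distribution (our illustration, assuming a membership oracle `in_K` and that K lies in a ball of radius `R` around the current point; the chord endpoints are located by bisection):

```python
import numpy as np

def hit_and_run_step(x, in_K, rng, R=1e3, tol=1e-9):
    d = rng.standard_normal(len(x))
    d /= np.linalg.norm(d)               # uniform random direction

    def chord_end(sign):
        lo, hi = 0.0, R                  # in_K holds at lo*d, fails at hi*d
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if in_K(x + sign * mid * d):
                lo = mid
            else:
                hi = mid
        return lo

    t = rng.uniform(-chord_end(-1), chord_end(+1))
    return x + t * d                     # uniform point on the chord

rng = np.random.default_rng(0)
in_ball = lambda y: np.linalg.norm(y) <= 1.0
x = np.array([0.99, 0.0])                # works even from near the boundary
for _ in range(1000):
    x = hit_and_run_step(x, in_ball, rng)
```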
Next we give a formula for the next step distribution from a point u.
Lemma 10.35. The next step distribution of Hit-and-Run from a point u is given by
$$P_u(A) = \frac{2}{\mathrm{vol}(S^{n-1})}\int_A \frac{dx}{\|x - u\|^{n-1}\,\ell(u,x)}$$
where A is any measurable subset of K and $\ell(u,x)$ is the length of the chord in K through u and x.
Exercise 10.36. Prove Lemma 10.35.
The main theorem of this section is the following [51].
Theorem 10.37 ([51]). The conductance of Hit-and-Run in a convex body K containing the unit ball and of diameter D is $\Omega(1/(nD))$.

This implies a mixing time of $O\left(n^2D^2\log(M/\varepsilon)\right)$ to get to within distance $\varepsilon$ of the target density starting from an M-warm initial density. By taking one step from the initial point, we can bound M by $(D/d)^n$ where d is the minimum distance of the starting point from the boundary. Hence this gives a bound of $\tilde{O}(n^3D^2)$ from any interior starting point.
The proof of the theorem follows the same high-level outline as that of the ball walk, needing two major ingredients, namely, one-step coupling and isoperimetry. Notably, the isoperimetry is for a non-Euclidean notion of distance. We begin with some suitable definitions.

Define the median step-size function F by letting F(x) be the value such that
$$\Pr_y\left(\|x - y\| \le F(x)\right) = \frac{1}{8},$$
where y is a random step of Hit-and-Run from x.
For $u, v \in K$, let p, q be the endpoints of the chord of K through u and v, so that the points occur in the order p, u, v, q. The cross-ratio distance is
$$d_K(u,v) = \frac{\|u - v\|\,\|p - q\|}{\|p - u\|\,\|v - q\|}.$$
The first ingredient shows that if two points are close geometrically, then their next-step distributions have significant overlap: for any $u, v \in K$ with
$$d_K(u,v) < \frac{1}{8} \quad \text{and} \quad \|u - v\| \le \frac{2}{\sqrt{n}}\max\{F(u), F(v)\},$$
we have $d_{TV}(P_u, P_v) < 1 - \frac{1}{500}$.
The second ingredient is an isoperimetric inequality (independent of any algorithm). The cross-ratio distance satisfies a nice isoperimetric inequality: for any partition $S_1, S_2, S_3$ of K,
$$\mathrm{vol}(S_3) \ge d_K(S_1, S_2)\,\frac{\mathrm{vol}(S_1)\,\mathrm{vol}(S_2)}{\mathrm{vol}(K)}.$$
However, this will not suffice to prove a bound on the conductance of all subsets. The reason is that we cannot guarantee a good lower bound on the minimum distance between the subsets $S_1, S_2$. Instead, we will need a weighted isoperimetric inequality, which uses an average distance.
Theorem 10.40. Let $S_1, S_2, S_3$ be a partition of a convex body K. Let $h : K \to \mathbb{R}_+$ be a function such that for any $u \in S_1$, $v \in S_2$, and any x on the chord through u and v, we have
$$h(x) \le \frac{1}{3}\min\{1, d_K(u,v)\}.$$
Then,
$$\mathrm{vol}(S_3) \ge E_K(h(x))\min\{\mathrm{vol}(S_1), \mathrm{vol}(S_2)\}.$$
For bounding the conductance, we will use a specific function h. To introduce it, we first define a step-size function s(x):
$$s(x) = \sup\left\{t : \frac{\mathrm{vol}((x + tB^n) \cap K)}{\mathrm{vol}(tB^n)} \ge \gamma\right\}$$
for some fixed $\gamma \in (0,1]$.
Exercise 10.41. Show that the step-size function is concave over any convex body.
We will need the following relationship between the step-size function and the median step-size function.
Proof of Theorem 10.37. Let $K = S_1 \cup S_2$ be a partition of K into measurable sets. We will prove that
$$\int_{S_1} P_x(S_2)\,dx \ge \frac{c}{nD}\min\{\mathrm{vol}(S_1), \mathrm{vol}(S_2)\}. \tag{10.7}$$
Consider the points that are deep inside these sets, i.e., unlikely to jump out of the set:
$$S_1' = \left\{x \in S_1 : P_x(S_2) < \frac{1}{1000}\right\} \quad \text{and} \quad S_2' = \left\{x \in S_2 : P_x(S_1) < \frac{1}{1000}\right\}.$$
Assume the second condition above holds, so that $F(u) \le \frac{\sqrt{n}}{2}\|u - v\|$. Next, noting that x is some point on the chord through u, v, let the endpoints of the chord be p, q. Suppose WLOG that $x \in [u, q]$. Then, by the concavity of s(x), and using the second part of Lemma 10.42, we have
$$s(x) \le \frac{|x - p|}{|u - p|}\,s(u) \le 32\,\frac{|x - p|}{|u - p|}\,F(u) \le 16\sqrt{n}\,\frac{|x - p|}{|u - p|}\,\|u - v\| \le 16\,d_K(u,v)\,\sqrt{n}\,D.$$
The following lemma shows that self-concordant matrix functions also enjoy a similar regularity as the usual self-concordant functions.

Lemma 10.44. Given any self-concordant matrix function H on $K \subset \mathbb{R}^n$, define $\|v\|_x^2 = v^\top H(x)v$. Then, for any $x, y \in K$ with $\|x - y\|_x < 1$, we have
$$(1 - \|x - y\|_x)^2\,H(x) \preceq H(y) \preceq \frac{1}{(1 - \|x - y\|_x)^2}\,H(x).$$
Proof. Let $x_t = (1-t)x + ty$, $h = y - x$, and $\phi(t) = \|h\|_{x_t}^2$. Self-concordance gives $\phi'(t) \le 2\phi(t)^{3/2}$, and integrating this differential inequality shows that
$$\phi(t) \le \frac{\phi(0)}{(1 - t\sqrt{\phi(0)})^2}. \tag{10.8}$$
Now we fix any v and define $\psi(t) = v^\top H(x_t)v$. Then,
$$|\psi'(t)| = \left|v^\top \frac{d}{dt}H(x_t)\,v\right| \le 2\|h\|_{x_t}\|v\|_{x_t}^2 = 2\sqrt{\phi(t)}\,\psi(t).$$
Using (10.8) at the end, we have
$$\frac{d}{dt}\ln\psi(t) \le \frac{2\sqrt{\phi(0)}}{1 - t\sqrt{\phi(0)}}.$$
Integrating both sides from 0 to 1,
$$\ln\frac{\psi(1)}{\psi(0)} \le \int_0^1 \frac{2\sqrt{\phi(0)}}{1 - t\sqrt{\phi(0)}}\,dt = 2\ln\left(\frac{1}{1 - \sqrt{\phi(0)}}\right).$$
The result follows from this, $\psi(1) = v^\top H(y)v$, $\psi(0) = v^\top H(x)v$, and $\phi(0) = \|x - y\|_x^2$.
Many natural barriers, including the logarithmic barrier and the LS-barrier, satisfy a much stronger condition than self-concordance. However, this is not always true, as one can construct counterexamples even in one dimension.
Definition 10.45. For any convex set $K \subset \mathbb{R}^n$, we say a matrix function $H : K \to \mathbb{R}^{n \times n}$ is strongly self-concordant if for any $x \in K$ and $h \in \mathbb{R}^n$, we have
$$\left\|H(x)^{-1/2}\,DH(x)[h]\,H(x)^{-1/2}\right\|_F \le 2\|h\|_x.$$

Lemma 10.46. Let H be a strongly self-concordant matrix function on K. Then, for any $x, y \in K$ with $\|x - y\|_x < 1$,
$$\|H(x)^{-\frac{1}{2}}(H(y) - H(x))H(x)^{-\frac{1}{2}}\|_F \le \frac{\|x - y\|_x}{(1 - \|x - y\|_x)^2}.$$
Proof. Let $x_t = (1-t)x + ty$. Then, we have
$$\|H(x)^{-\frac{1}{2}}(H(y) - H(x))H(x)^{-\frac{1}{2}}\|_F \le \int_0^1 \left\|H(x)^{-\frac{1}{2}}\frac{d}{dt}H(x_t)\,H(x)^{-\frac{1}{2}}\right\|_F dt.$$
We note that H is self-concordant. Hence, Lemma 10.44 shows that
$$\begin{aligned} \left\|H(x)^{-\frac{1}{2}}\frac{d}{dt}H(x_t)\,H(x)^{-\frac{1}{2}}\right\|_F^2 &= \mathrm{tr}\left[H(x)^{-1}\frac{d}{dt}H(x_t)\,H(x)^{-1}\frac{d}{dt}H(x_t)\right] \\ &\le \frac{1}{(1 - \|x - x_t\|_x)^4}\,\mathrm{tr}\left[H(x_t)^{-1}\frac{d}{dt}H(x_t)\,H(x_t)^{-1}\frac{d}{dt}H(x_t)\right] \\ &\le \frac{4}{(1 - \|x - x_t\|_x)^4}\,\|x - x_t\|_{x_t}^2 \\ &\le \frac{4}{(1 - \|x - x_t\|_x)^6}\,\|x - x_t\|_x^2 \end{aligned}$$
10.8. Mixing with Strong Self-Concordance 144
where we used the assumption in the second inequality and Lemma 10.44 again for the last inequality. Hence,
$$\|H(x)^{-\frac{1}{2}}(H(y) - H(x))H(x)^{-\frac{1}{2}}\|_F \le \int_0^1 \frac{2\|x - x_t\|_x}{(1 - \|x - x_t\|_x)^3}\,dt = \int_0^1 \frac{2t\|x - y\|_x}{(1 - t\|x - y\|_x)^3}\,dt = \frac{\|x - y\|_x}{(1 - \|x - y\|_x)^2}.$$
We note that strong self-concordance is stronger than self-concordance, since the Frobenius norm is always at least the spectral norm. As an example, one can verify that the conditions hold for the standard logarithmic barrier.
The Dikin walk has the following guarantee.
Theorem 10.47. The mixing rate of the Dikin walk for a symmetric, strongly self-concordant matrix function
with convex log determinant is O(nν̄).
Each step of the standard Dikin walk is fast, and does not need matrix multiplication.
Theorem 10.48. The Dikin walk with the logarithmic barrier for a polytope {Ax ≥ b} can be implemented
in time O(nnz(A) + n2 ) per step while maintaining the mixing rate of O(mn).
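To make one step concrete, here is a simple sketch of the Dikin walk with the logarithmic barrier for $\{Ax \ge b\}$, with a Metropolis filter for exactness. This is our own illustration, not the fast $O(\mathrm{nnz}(A) + n^2)$ implementation of Theorem 10.48.

```python
import numpy as np

def log_barrier_hessian(A, b, x):
    # H(x) = sum_i a_i a_i^T / (a_i^T x - b_i)^2
    s = A @ x - b                        # slacks; positive for strictly feasible x
    return A.T @ (A / (s ** 2)[:, None])

def dikin_step(x, A, b, r, rng):
    n = len(x)
    H = log_barrier_hessian(A, b, x)
    u = rng.standard_normal(n)
    u *= rng.random() ** (1.0 / n) / np.linalg.norm(u)   # uniform in the unit ball
    L = np.linalg.cholesky(H)
    y = x + r * np.linalg.solve(L.T, u)  # uniform in the Dikin ellipsoid E_x(r)
    if np.any(A @ y - b <= 0):
        return x                         # proposal left the polytope
    Hy = log_barrier_hessian(A, b, y)
    if (x - y) @ Hy @ (x - y) > r * r:
        return x                         # x not in E_y(r): reject for reversibility
    accept = min(1.0, np.sqrt(np.linalg.det(Hy) / np.linalg.det(H)))
    return y if rng.random() < accept else x
```

The acceptance ratio compensates for the differing ellipsoid volumes, since $\mathrm{vol}(E_x(r))/\mathrm{vol}(E_y(r)) = \sqrt{\det H(y)/\det H(x)}$.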
The next lemma results from studying strong self-concordance for classical barriers. The KLS constant below is conjectured to be O(1) and known to be $O(\sqrt{\log n})$.

Lemma 10.49. Let $\psi_n$ be the KLS constant of isotropic logconcave densities in $\mathbb{R}^n$, namely, for any isotropic logconcave density p and any subset $S \subset \mathbb{R}^n$, we have
$$\int_{\partial S} p(x)\,dx \ge \frac{1}{\psi_n}\min\left\{\int_S p(x)\,dx,\ \int_{\mathbb{R}^n \setminus S} p(x)\,dx\right\}.$$
Let H(x) be the Hessian of the universal or entropic barrier. Then, we have
$$\left\|H(x)^{-1/2}\,DH(x)[h]\,H(x)^{-1/2}\right\|_F = O(\psi_n)\,\|h\|_x.$$
In short, the universal and entropic barriers in $\mathbb{R}^n$ are strongly self-concordant up to a scaling factor depending on $\psi_n$.
In fact, the proof shows that up to a logarithmic factor the strong self-concordance of these barriers is
equivalent to the KLS conjecture.
Proof. We have to prove two things: first, that the rejection probability is small, and second, that the ellipsoids used by the Dikin walk at x, y have large overlap. More precisely, we have
$$d_{TV}(P_x, P_y) \le \frac{1}{2}\mathrm{rej}_x + \frac{1}{2}\mathrm{rej}_y + \frac{1}{2}\frac{\mathrm{vol}(P_x \setminus P_y)}{\mathrm{vol}(P_x)} + \frac{1}{2}\frac{\mathrm{vol}(P_y \setminus P_x)}{\mathrm{vol}(P_y)} = \frac{1}{2}\mathrm{rej}_x + \frac{1}{2}\mathrm{rej}_y + 1 - \frac{1}{2}\frac{\mathrm{vol}(P_x \cap P_y)}{\mathrm{vol}(P_x)} - \frac{1}{2}\frac{\mathrm{vol}(P_x \cap P_y)}{\mathrm{vol}(P_y)}. \tag{10.9}$$
Now, we bound the fraction of volume in the intersection of the ellipsoids at x, y. Again, we can assume that $H(x) = I$. Then, strong self-concordance and Lemma 10.46 show that
$$\|H(y) - I\|_F \le 2\|x - y\|_x \le \frac{1}{4\sqrt{n}}. \tag{10.16}$$
In particular, we have that
$$\frac{1}{2}I \preceq H(y) \preceq \frac{3}{2}I. \tag{10.17}$$
We partition the eigenvalues $\lambda_i$ of H(y) into those of value at least 1 and the rest. Then consider the ellipsoid E whose eigenvalues are $\min\{1, \lambda_i\}$. This is contained in both $E_x(1)$ and $E_y(1)$. We will see that vol(E) is a constant fraction of the volume of both $E_x(1)$ and $E_y(1)$. First, we compare E and $E_x(1)$:
$$\frac{\mathrm{vol}(E)}{\mathrm{vol}(E_x(1))} = \prod_{i:\lambda_i < 1} \lambda_i = \prod_{i:\lambda_i < 1} (1 - (1 - \lambda_i)) \ge \exp\left(-2\sum_{i:\lambda_i < 1}(1 - \lambda_i)\right) \tag{10.18}$$
where we used that $1 - x \ge \exp(-2x)$ for $0 \le x \le \frac{1}{2}$, and that $\lambda_i \ge \frac{1}{2}$ by (10.17). From the inequality (10.16), it follows that
$$\sqrt{\sum_i (\lambda_i - 1)^2} \le \frac{1}{4\sqrt{n}}.$$
By Cauchy-Schwarz, $\sum_{i:\lambda_i < 1}(1 - \lambda_i) \le \sqrt{n}\sqrt{\sum_i(\lambda_i - 1)^2} \le \frac{1}{4}$, and hence
$$\frac{\mathrm{vol}(P_x \cap P_y)}{\mathrm{vol}(P_x)} = \frac{\mathrm{vol}(E)}{\mathrm{vol}(E_x(1))} \ge e^{-\frac{1}{2}}. \tag{10.19}$$
Similarly, we have
$$\frac{\mathrm{vol}(P_x \cap P_y)}{\mathrm{vol}(P_y)} = \frac{\prod_{i:\lambda_i < 1}\lambda_i}{\prod_i \lambda_i} = \frac{1}{\prod_{i:\lambda_i > 1}\lambda_i} \ge \frac{1}{\exp\left(\sum_{i:\lambda_i > 1}(\lambda_i - 1)\right)} \ge e^{-\frac{1}{4}}. \tag{10.20}$$
The next lemma establishes isoperimetry. This only needs the symmetric containment assumption. The isoperimetry is for the cross-ratio distance. For a convex body K and any two points $u, v \in K$, suppose that p, q are the endpoints of the chord through u, v in K, so that these points occur in the order p, u, v, q. Then, the cross-ratio distance between u and v is defined as
$$d_K(u,v) = \frac{\|u - v\|_2\,\|p - q\|_2}{\|p - u\|_2\,\|v - q\|_2}.$$
This distance enjoys the following isoperimetric inequality.

Theorem 10.51 ([48]). For any convex body K, any disjoint subsets $S_1, S_2$ of it, and $S_3 = K \setminus S_1 \setminus S_2$, we have
$$\mathrm{vol}(S_3) \ge d_K(S_1, S_2)\,\frac{\mathrm{vol}(S_1)\,\mathrm{vol}(S_2)}{\mathrm{vol}(K)}.$$
We now relate the cross-ratio distance to the ellipsoidal norm.

Lemma 10.52. For any $u, v \in K$, we have $d_K(u,v) \ge \frac{\|u - v\|_u}{\sqrt{\bar\nu}}$.

Proof. Consider the ellipsoid at u. For the chord $[p,q]$ induced by u, v, with these points in the order p, u, v, q, suppose that $\|p - u\|_2 \le \|v - q\|_2$. Then by Lemma 10.47, $p \in K \cap (2u - K)$, and hence $\|p - u\|_u \le \sqrt{\bar\nu}$. Therefore,
$$d_K(u,v) = \frac{\|u - v\|_2\,\|p - q\|_2}{\|p - u\|_2\,\|v - q\|_2} \ge \frac{\|u - v\|_2}{\|p - u\|_2} = \frac{\|u - v\|_u}{\|p - u\|_u} \ge \frac{\|u - v\|_u}{\sqrt{\bar\nu}}.$$
We can now prove the main conductance bound.

We follow the standard high-level outline [74]. Consider any measurable subset $S_1 \subseteq K$ and let $S_2 = K \setminus S_1$ be its complement. Define the points with low escape probability for these subsets as
$$S_i' = \left\{x \in S_i : P_x(K \setminus S_i) < \frac{1}{8}\right\}$$
and $S_3' = K \setminus S_1' \setminus S_2'$. Then, for any $u \in S_1'$, $v \in S_2'$, we have $d_{TV}(P_u, P_v) > 1 - \frac{1}{4}$. Hence, by Lemma 10.50, we have $\|u - v\|_u \ge \frac{1}{8\sqrt{n}}$. Therefore, by Lemma 10.52,
$$d_K(u,v) \ge \frac{1}{8\sqrt{n}\,\sqrt{\bar\nu}}.$$
We can now bound the conductance of $S_1$. We may assume that $\mathrm{vol}(S_i') \ge \mathrm{vol}(S_i)/2$; otherwise, it immediately follows that the conductance of $S_1$ is $\Omega(1)$. Assuming this, we have
$$\int_{S_1} P_x(S_2)\,dx \ge \int_{S_3'} \frac{1}{8}\,dx \ge \frac{1}{8}\mathrm{vol}(S_3') \ge \frac{1}{8}\,d_K(S_1', S_2')\,\frac{\mathrm{vol}(S_1')\,\mathrm{vol}(S_2')}{\mathrm{vol}(K)} \ge \frac{1}{512\sqrt{n\bar\nu}}\min\{\mathrm{vol}(S_1), \mathrm{vol}(S_2)\},$$
using isoperimetry (Theorem 10.51) in the third inequality.
10.9 Hamiltonian Monte Carlo

Hamiltonian dynamics for a Hamiltonian function H(x, v) of a position x and velocity v is given by
$$\frac{dx}{dt} = \frac{\partial H(x,v)}{\partial v}, \qquad \frac{dv}{dt} = -\frac{\partial H(x,v)}{\partial x}.$$
These equations preserve the Hamiltonian function H. In the simplest Euclidean setting, it can be defined as follows:
$$H(x,v) = f(x) + \frac{1}{2}\|v\|^2$$
so that
$$\frac{dx}{dt} = v, \qquad \frac{dv}{dt} = -\nabla f(x),$$
or
$$\frac{d^2x}{dt^2} = -\nabla f(x).$$
More generally, the Hamiltonian can depend on a function that defines a local metric:
$$H(x,v) = f(x) + \frac{1}{2}\log\left((2\pi)^n\det g(x)\right) + \frac{1}{2}v^\top g(x)^{-1}v$$
where g(x) is a matrix; when it is PSD, it defines a local norm at x. In this sense, we can view the dynamics as evolving on a manifold with local metric g(x). In this chapter, we will focus on the case when $g(x) = I$, the standard Euclidean metric.
(Riemannian) Hamiltonian Monte Carlo (RHMC) is a Markov Chain Monte Carlo method for sampling from a desired distribution. Each step of the method consists of the following: at a current point x,
1. Pick a random velocity y according to a local distribution defined by x (in the simplest setting, this is the standard Gaussian distribution for every x).
2. Move along the Hamiltonian curve defined by the Hamiltonian dynamics at (x, y) for time (distance) $\delta$.
For the choice of H above, the marginal distribution of the current point x approaches the target distribution with density proportional to $e^{-f}$. Note that HMC does not require a Metropolis filter! Thus, unlike the walks we have seen so far, its step sizes are not limited by this consideration even in high dimension. Hamiltonian Monte Carlo can be used for sampling from a general distribution $e^{-H(x,y)}$.
$$\frac{dx}{dt} = \frac{\partial H(x,y)}{\partial y}, \qquad \frac{dy}{dt} = -\frac{\partial H(x,y)}{\partial x}. \tag{10.21}$$
We define the map $T_\delta(x,y) \stackrel{\mathrm{def}}{=} (x(\delta), y(\delta))$, where $(x(t), y(t))$ follows the Hamiltonian curve with the initial condition $(x(0), y(0)) = (x, y)$.
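In practice the curve $T_\delta$ is computed numerically. A standard choice (our sketch, not prescribed by the text) is the symplectic leapfrog integrator for $H(x,v) = f(x) + \frac{1}{2}\|v\|^2$, which nearly conserves H; implementations often still add a Metropolis correction to remove the residual discretization error, although the ideal continuous dynamics needs none.

```python
import numpy as np

def leapfrog(x, v, grad_f, eps, steps):
    # Approximates the Hamiltonian curve of H(x, v) = f(x) + ||v||^2 / 2.
    v = v - 0.5 * eps * grad_f(x)
    for _ in range(steps - 1):
        x = x + eps * v
        v = v - eps * grad_f(x)
    x = x + eps * v
    v = v - 0.5 * eps * grad_f(x)
    return x, v

def hmc_step(x, grad_f, eps, steps, rng):
    v = rng.standard_normal(x.shape)               # step 1: Gaussian velocity
    x_new, _ = leapfrog(x, v, grad_f, eps, steps)  # step 2: follow the curve
    return x_new
```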
Hamiltonian Monte Carlo is a Markov chain generated by a sequence of randomly chosen Hamiltonian curves.
Time-reversibility
Lemma 10.54 (Energy conservation). For any Hamiltonian curve $(x(t), y(t))$, we have that
$$\frac{d}{dt}H(x(t), y(t)) = 0.$$
Proof. Note that
$$\frac{d}{dt}H(x(t), y(t)) = \frac{\partial H}{\partial x}\frac{dx}{dt} + \frac{\partial H}{\partial y}\frac{dy}{dt} = \frac{\partial H}{\partial x}\frac{\partial H}{\partial y} - \frac{\partial H}{\partial y}\frac{\partial H}{\partial x} = 0.$$

Similarly, one shows (Lemma 10.55) that the Jacobian $\Phi(t) = DT_t(x,y)$ of the Hamiltonian flow has constant determinant. Hence,
$$\det\Phi(t) = \det\Phi(0) = 1.$$
Using the previous two lemmas, we can see that Hamiltonian Monte Carlo indeed converges to the desired distribution.

Lemma 10.56 (Time reversibility). Let $p_x(x')$ denote the probability density of one step of Hamiltonian Monte Carlo starting at x. We have that
$$\pi(x)\,p_x(x') = \pi(x')\,p_{x'}(x).$$

Proof. Fix x and x'. Let $F_\delta^x(y)$ be the x component of $T_\delta(x,y)$. Let $V_+ = \{y : F_\delta^x(y) = x'\}$ and $V_- = \{y : F_{-\delta}^x(y) = x'\}$. Then
$$\pi(x)\,p_x(x') = \frac{1}{2}\int_{y \in V_+} \frac{e^{-H(x,y)}}{\left|\det(DF_\delta^x(y))\right|} + \frac{1}{2}\int_{y \in V_-} \frac{e^{-H(x,y)}}{\left|\det\left(DF_{-\delta}^x(y)\right)\right|}.$$
We note that this formula assumes that $DF_\delta^x$ is invertible. Sard's theorem shows that $F_\delta^x(N)$ has measure zero, where $N \stackrel{\mathrm{def}}{=} \{y : DF_\delta^x(y) \text{ is not invertible}\}$. Therefore, the formula is correct except for a measure zero subset.
By reversing time for the Hamiltonian curve, we have that for the same $V_\pm$,
$$\pi(x')\,p_{x'}(x) = \frac{1}{2}\int_{y \in V_+} \frac{e^{-H(x',y')}}{\left|\det\left(DF_{-\delta}^{x'}(y')\right)\right|} + \frac{1}{2}\int_{y \in V_-} \frac{e^{-H(x',y')}}{\left|\det\left(DF_{\delta}^{x'}(y')\right)\right|} \tag{10.22}$$
where y' denotes the y component of $T_\delta(x,y)$ and $T_{-\delta}(x,y)$ in the first and second sums respectively.
We compare the first terms in both equations. Let
$$DT_\delta(x,y) = \begin{pmatrix} A & B \\ C & D \end{pmatrix}.$$
Since $T_\delta \circ T_{-\delta} = I$ and $T_\delta(x,y) = (x', y')$, the inverse function theorem shows that $DT_{-\delta}(x', y')$ is the inverse map of $DT_\delta(x,y)$. Hence, we have that
$$DT_{-\delta}(x', y') = \begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} \cdots & -A^{-1}B(D - CA^{-1}B)^{-1} \\ \cdots & \cdots \end{pmatrix}.$$
Therefore, we have that $DF_\delta^x(y) = B$ and $DF_{-\delta}^{x'}(y') = -A^{-1}B(D - CA^{-1}B)^{-1}$. Hence, we have that
$$\left|\det DF_{-\delta}^{x'}(y')\right| = \left|\det A^{-1}\right|\cdot\left|\det B\right|\cdot\left|\det(D - CA^{-1}B)^{-1}\right| = \frac{|\det B|}{\left|\det\begin{pmatrix} A & B \\ C & D \end{pmatrix}\right|}.$$
Using that $\det(DT_t(x,y)) = \det\begin{pmatrix} A & B \\ C & D \end{pmatrix} = 1$ (Lemma 10.55), we have that
$$\left|\det DF_{-\delta}^{x'}(y')\right| = \left|\det(DF_\delta^x(y))\right|.$$
Hence,
$$\frac{1}{2}\int_{y \in V_+} \frac{e^{-H(x,y)}}{\left|\det(DF_\delta^x(y))\right|} = \frac{1}{2}\int_{y \in V_+} \frac{e^{-H(x,y)}}{\left|\det DF_{-\delta}^{x'}(y')\right|} = \frac{1}{2}\int_{y \in V_+} \frac{e^{-H(x',y')}}{\left|\det DF_{-\delta}^{x'}(y')\right|},$$
where we used that $e^{-H(x,y)} = e^{-H(x',y')}$ (Lemma 10.54) at the end.
For the second term in (10.22), by the same calculation, we have that
$$\frac{1}{2}\int_{y \in V_-} \frac{e^{-H(x,y)}}{\left|\det DF_{-\delta}^x(y)\right|} = \frac{1}{2}\int_{y \in V_-} \frac{e^{-H(x',y')}}{\left|\det DF_{\delta}^{x'}(y')\right|}.$$
Combining the two terms with (10.22) proves the claim.
Convergence
First we consider convergence in the case when $H(x,v) = f(x) + \frac{1}{2}\|v\|^2$ for a strongly convex function f, so the marginal of the stationary distribution along x is proportional to $e^{-f}$. The idea here is coupling (as we did for Langevin dynamics). We consider two separate processes x and y, with their next step directions chosen to be identical. The key lemma is that, with this coupling, the squared distance decreases up to a certain time that depends on the condition number.
Figure 10.6: (a) $E_u(1) \subseteq K \cap (2u - K) \subseteq E_u(\sqrt{\bar\nu})$. (b) Strong self-concordance measures the rate of change of the Hessian of a barrier in the Frobenius norm.
Chapter 11
Annealing
11.2 Volume Computation
Lemma 11.2. Let f be a logconcave function in $\mathbb{R}^n$. Then $Z(a) = a^n\int_{\mathbb{R}^n} f(x)^a\,dx$ is logconcave for $a \ge 0$. If f has support K, then $Z(a) = a^n\int_K f(ax)\,dx$ is logconcave for $a > 0$.
Table 11.1: The complexity of volume estimation; each step uses $\tilde{O}(n)$ bits of randomness. The last algorithm needs $\tilde{O}(mn^{\omega-1})$ arithmetic per step while the rest need $O(n^2)$ per oracle query.
In [53] this was improved by sampling from a sequence of nonuniform distributions. For X drawn from the distribution proportional to $f_i$, consider the following estimator:
$$Y = \frac{f_{i+1}(X)}{f_i(X)}.$$
We see that
$$E_{f_i}(Y) = \frac{\int f_{i+1}}{\int f_i}.$$
In the algorithms of DFK and KLS, this ratio is bounded by a constant in each phase, giving a total of $O^*(n)$ phases, since the ratio of the final to the initial integral is exponential. Instead of uniform densities, we consider
$$f_i(x) \propto \exp(-a_i\|x\|)\,\chi_K(x) \quad \text{or} \quad f_i(x) \propto \exp(-a_i\|x\|^2)\,\chi_K(x).$$
The coefficient $a_i$ (inverse temperature) will be changed by a factor of $(1 + \frac{1}{\sqrt{n}})$ in each phase, which implies that $m = \tilde{O}(\sqrt{n})$ phases suffice to reach the target distribution. This is perhaps surprising since the ratio of the initial integral to the final one is typically $n^{\Omega(n)}$. Yet the algorithm uses only $\tilde{O}(\sqrt{n})$ phases, and hence estimates a ratio of $n^{\tilde\Omega(\sqrt{n})}$ in one or more phases. The key insight is that even though the expected ratio might be large, its variance is not.
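In code, the annealing scheme is just a telescoping product of per-phase estimators. The sketch below (ours) assumes an abstract sampler oracle `sample_from(f, N)` returning N approximate samples with density proportional to f:

```python
import numpy as np

def annealing_ratio(sample_from, schedule, N):
    # schedule is the list of functions f_0, ..., f_m; the product of the
    # per-phase estimators Y = f_{i+1}(X) / f_i(X), with X ~ f_i, estimates
    # (integral of f_m) / (integral of f_0).
    total = 1.0
    for f_i, f_next in zip(schedule, schedule[1:]):
        X = sample_from(f_i, N)
        total *= np.mean([f_next(x) / f_i(x) for x in X])
    return total
```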
Lemma 11.3. For $X \sim f_i$ with $f_i(x) = e^{-a_i\|x\|}\chi_K(x)$ for a convex body K, or $f_i(x) = f(x)^{a_i}$ for a logconcave function f, the estimator $Y = \frac{f_{i+1}(X)}{f_i(X)}$ satisfies
$$\frac{E(Y^2)}{E(Y)^2} \le \left(\frac{a_{i+1}^2}{(2a_{i+1} - a_i)a_i}\right)^n,$$
which is bounded by a constant for $a_i = a_{i+1}\left(1 + \frac{1}{\sqrt{n}}\right)$.
The LV algorithm has two parts. In the first, it finds a transformation that puts the body in near-isotropic position. The complexity of this part is $\tilde{O}(n^4)$. In the second part, it runs the annealing schedule, while maintaining that the distribution being sampled is well-rounded, a weaker condition than isotropy. Well-roundedness requires that a level set of measure $\frac{1}{8}$ contains a constant-radius ball and that the trace of the covariance (the expected squared distance of a random point from the mean) is bounded by O(n), so that R/r is effectively $O(\sqrt{n})$. To achieve the complexity guarantee for the second phase, it suffices to use the KLS bound of $\psi_p \gtrsim n^{-\frac{1}{2}}$. Connecting improvements in the Cheeger constant directly to the complexity of volume computation was an open question for a couple of decades. To apply improvements in the Cheeger constant, one would need to replace well-roundedness with (near-)isotropy and maintain that. However, maintaining isotropy appears to be much harder, possibly requiring a sequence of Ω(n) distributions and Ω(n) samples from each, providing no gain over the current complexity of $O^*(n^4)$ even if the KLS conjecture turns out to be true.
A faster algorithm is known for well-rounded convex bodies (any isotropic logconcave density satisfies $\frac{R}{r} = O(\sqrt{n})$ and is well-rounded). This variant of simulated annealing, called Gaussian cooling, utilizes the fact that the KLS conjecture holds for a Gaussian density restricted by any convex body, and completely avoids computing an isotropic transformation.

Theorem 11.5 ([20]). The volume of a well-rounded convex body, i.e., with $R/r = O^*(\sqrt{n})$, can be computed using $O^*(n^3)$ oracle calls.
In 2021, it was shown that the complexity of rounding a convex body can be bounded as $O^*(n^3\psi_n^2)$ where $\psi_n$ is the KLS constant bound for any isotropic logconcave density in $\mathbb{R}^n$. Together with the next theorem, it follows that the volume of a convex body can be computed in the same complexity. The current bound on the KLS constant implies that this is in fact $O^*(n^3)$.

Theorem 11.6. A near-isotropic transformation for any convex body in $\mathbb{R}^n$ can be computed using $\tilde{O}(n^3)$ oracle calls, and the volume of any convex body in $\mathbb{R}^n$ can be computed using $O^*(n^3)$ oracle calls.
Bibliography
[1] Radosław Adamczak, Alexander Litvak, Alain Pajor, and Nicole Tomczak-Jaegermann. Quantitative estimates of the convergence of the empirical covariance matrix in log-concave ensembles. Journal of the American Mathematical Society, 23(2):535-561, 2010.
[2] Deeksha Adil, Rasmus Kyng, Richard Peng, and Sushant Sachdeva. Iterative refinement for ℓp-norm regression. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1405-1424. SIAM, 2019.
[3] David Aldous and James Fill. Reversible Markov chains and random walks on graphs. Berkeley, 1995.
[4] Zeyuan Allen-Zhu, Zheng Qu, Peter Richtárik, and Yang Yuan. Even faster accelerated coordinate descent using non-uniform sampling. In International Conference on Machine Learning, pages 1110-1119, 2016.
[5] David Applegate and Ravi Kannan. Sampling and integration of near log-concave functions. In Proceedings of the 23rd Annual ACM Symposium on Theory of Computing, May 5-8, 1991, New Orleans, Louisiana, USA, pages 156-163, 1991.
[6] Shiri Artstein-Avidan and Vitali Milman. The concept of duality in convex analysis, and the characterization of the Legendre transform. Annals of Mathematics, pages 661-674, 2009.
[7] David S. Atkinson and Pravin M. Vaidya. A cutting plane algorithm for convex programming that uses analytic centers. Mathematical Programming, 69(1-3):1-43, 1995.
[8] Alexandre Belloni, Tengyuan Liang, Hariharan Narayanan, and Alexander Rakhlin. Escaping the local minima via simulated annealing: Optimization of approximately convex functions. In Conference on Learning Theory, pages 240-265, 2015.
[9] Dimitris Bertsimas and Santosh Vempala. Solving convex programs by random walks. Journal of the ACM (JACM), 51(4):540-556, 2004.
[10] Thomas Blumensath and Mike E. Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265-274, 2009.
[11] J. Bourgain. Random points in isotropic convex sets. Convex Geometric Analysis, 34:53-58, 1996.
[12] Graham Brightwell and Peter Winkler. Counting linear extensions. Order, 8:225-242, 1991.
[13] Sébastien Bubeck, Ronen Eldan, and Yin Tat Lee. Kernel-based methods for bandit convex optimization. arXiv preprint arXiv:1607.03084, 2016.
[14] Anthony Carbery and James Wright. Distributional and Lq norm inequalities for polynomials over convex bodies in Rn. Mathematical Research Letters, 8(3):233-248, 2001.
[15] Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points I. arXiv preprint arXiv:1710.11606, 2017.
[16] Michael B. Cohen. Nearly tight oblivious subspace embeddings by trace inequalities. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 278-287. SIAM, 2016.
[17] Michael B. Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, and Aaron Sidford. Uniform sampling for matrix approximation. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pages 181-190. ACM, 2015.
[18] Michael B. Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. arXiv preprint arXiv:1810.07896, 2018.
[19] B. Cousins and S. Vempala. A cubic algorithm for computing Gaussian volume. In SODA, pages 1215-1228, 2014.
[20] B. Cousins and S. Vempala. Bypassing KLS: Gaussian cooling and an O∗(n3) volume algorithm. In STOC, pages 539-548, 2015.
[21] Dmitriy Drusvyatskiy, Maryam Fazel, and Scott Roy. An optimal first order method based on optimal quadratic averaging. arXiv preprint arXiv:1604.06543, 2016.
[22] M. E. Dyer and A. M. Frieze. Computing the volume of a convex body: a case where randomness provably helps. In Proc. of AMS Symposium on Probabilistic Combinatorics and Its Applications, pages 123-170, 1991.
[23] M. E. Dyer, A. M. Frieze, and R. Kannan. A random polynomial time algorithm for approximating the volume of convex bodies. In STOC, pages 375-381, 1989.
[24] Vitaly Feldman, Cristobal Guzman, and Santosh Vempala. Statistical query algorithms for stochastic convex optimization. CoRR, abs/1512.09170, 2015.
[26] Roger Fletcher. A new variational result for quasi-Newton formulae. SIAM Journal on Optimization, 1(1):18-21, 1991.
[27] Matthieu Fradelizi and Olivier Guédon. The extreme points of subsets of s-concave probabilities and a geometric localization theorem. Discrete & Computational Geometry, 31(2):327-335, 2004.
[28] Martin Grötschel, László Lovász, and Alexander Schrijver. Geometric Algorithms and Combinatorial Optimization, volume 2. Algorithms and Combinatorics, 1988.
[29] B. Grunbaum. Partitions of mass-distributions and convex bodies by hyperplanes. Pacific J. Math., 10:1257-1261, 1960.
[30] Haotian Jiang, Yin Tat Lee, Zhao Song, and Sam Chiu-wai Wong. An improved cutting plane method for convex optimization, convex-concave games, and its applications. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 944-953, 2020.
[31] Adam Tauman Kalai and Santosh Vempala. Simulated annealing for convex optimization. Math. Oper. Res., 31(2):253-266, February 2006.
[32] Ravi Kannan, László Lovász, and Miklós Simonovits. Isoperimetric problems for convex bodies and a localization lemma. Discrete & Computational Geometry, 13(1):541-559, 1995.
[33] Ravi Kannan, László Lovász, and Miklós Simonovits. Random walks and an O*(n5) volume algorithm for convex bodies. Random Structures and Algorithms, 11(1):1-50, 1997.
[34] Ravindran Kannan and Santosh Vempala. Randomized algorithms in numerical linear algebra. Acta Numerica, 26:95-135, 2017.
[35] Alexander Karzanov and Leonid Khachiyan. On the conductance of order Markov chains. Order, 8(1):7-15, 1991.
[36] Tarun Kathuria, Yang P. Liu, and Aaron Sidford. Unit capacity maxflow in almost O(m4/3) time. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 119-130. IEEE, 2020.
[37] L. Khachiyan, S. Tarasov, and E. Erlich. The inscribed ellipsoid method. In Soviet Math. Dokl., volume 298, 1988.
[38] Leonid G. Khachiyan. Polynomial algorithms in linear programming. USSR Computational Mathematics and Mathematical Physics, 20(1):53-72, 1980.
[39] Bo'az Klartag. Needle Decompositions in Riemannian Geometry, volume 249. American Mathematical Society, 2017.
[40] Rasmus Kyng, Richard Peng, Sushant Sachdeva, and Di Wang. Flows in almost linear time via adaptive preconditioning. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 902-913, 2019.
[41] Michel Ledoux. Concentration of measure and logarithmic Sobolev inequalities. Séminaire de probabilités de Strasbourg, 33:120-216, 1999.
[42] Yin Tat Lee, Aaron Sidford, and Santosh S. Vempala. Efficient convex optimization with membership oracles. arXiv preprint arXiv:1706.07357, 2017.
[43] Yin Tat Lee, Aaron Sidford, and Sam Chiu-wai Wong. A faster cutting plane method and its implications for combinatorial and convex optimization. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 1049-1065. IEEE, 2015.
[44] Yin Tat Lee and Santosh S. Vempala. Stochastic localization + Stieltjes barrier = tight bound for log-Sobolev. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1122-1129. ACM, 2018.
[45] A. Yu. Levin. On an algorithm for the minimization of convex functions. In Soviet Mathematics Doklady, volume 160, pages 1244-1247, 1965.
[46] Peter Li and Shing Tung Yau. Estimates of eigenvalues of a compact Riemannian manifold. Geometry of the Laplace Operator, 36:205-239, 1980.
[47] L. Lovász. How to compute the volume? Jber. d. Dt. Math.-Verein, Jubiläumstagung 1990, pages 138-151, 1990.
[48] L. Lovász. Hit-and-run mixes fast. Math. Prog., 86:443-461, 1998.
[49] L. Lovász and M. Simonovits. Mixing rate of Markov chains, an isoperimetric inequality, and computing the volume. In FOCS, pages 482-491, 1990.
[50] L. Lovász and M. Simonovits. Random walks in a convex body and an improved volume algorithm. In Random Structures and Alg., volume 4, pages 359-412, 1993.
[51] L. Lovász and S. Vempala. Hit-and-run from a corner. SIAM J. Computing, 35:985-1005, 2006.
[52] László Lovász and Miklós Simonovits. Random walks in a convex body and an improved volume algorithm. Random Structures & Algorithms, 4(4):359-412, 1993.
[53] László Lovász and Santosh Vempala. Simulated annealing in convex bodies and an O∗(n4) volume algorithm. In FOCS, pages 650-659, 2003.
[54] László Lovász and Santosh Vempala. The geometry of logconcave functions and sampling algorithms. Random Struct. Algorithms, 30(3):307-358, 2007.
[55] Haihao Lu, Robert M. Freund, and Yurii Nesterov. Relatively-smooth convex optimization by first-order methods, and applications. arXiv preprint arXiv:1610.05708, 2016.
[56] Paolo Manselli and Carlo Pucci. Maximum length of steepest descent curves for quasi-convex functions. Geometriae Dedicata, 38(2):211-227, 1991.
[57] Yu. Nesterov. Introductory lectures on convex programming volume I: Basic course. Lecture notes, 1998.
[58] Donald J. Newman. Location of the maximum on unimodal surfaces. Journal of the ACM (JACM), 12(3):395-398, 1965.
[59] Constantin P. Niculescu and Lars-Erik Persson. Convex Functions and Their Applications: A Contemporary Approach. Springer, 2018.
[60] Bernt Oksendal. Stochastic Differential Equations: An Introduction with Applications. Springer Science & Business Media, 2013.
[61] Lawrence E. Payne and Hans F. Weinberger. An optimal Poincaré inequality for convex domains. Archive for Rational Mechanics and Analysis, 5(1):286-292, 1960.
[62] Luis Rademacher. Approximating the centroid is hard. In Proceedings of the 23rd ACM Symposium on Computational Geometry, Gyeongju, South Korea, June 6-8, 2007, pages 302-305, 2007.
[63] M. Rudelson. Random vectors in the isotropic position. Journal of Functional Analysis, 164:60-72, 1999.
[64] Sushant Sachdeva and Nisheeth K. Vishnoi. Faster algorithms via approximation theory. Foundations and Trends in Theoretical Computer Science, 9(2):125-210, 2014.
[65] Naum Z. Shor. Cut-off method with space extension in convex programming problems. Cybernetics and Systems Analysis, 13(1):94-96, 1977.
[66] Miklós Simonovits. How to compute the volume in high dimension? Math. Program., 97(1-2):337-374, 2003.
[67] Daniel A. Spielman and Nikhil Srivastava. Graph sparsification by effective resistances. SIAM Journal on Computing, 40(6):1913-1926, 2011.
[68] Nikhil Srivastava and Roman Vershynin. Covariance estimation for distributions with 2+ε moments. The Annals of Probability, 41(5):3081-3111, 2013.
[69] George J. Stigler. The cost of subsistence. Journal of Farm Economics, 27(2):303-314, 1945.
[70] Joel A. Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1-230, 2015.
[71] Pravin M. Vaidya. A new algorithm for minimizing convex functions over convex sets. In 30th Annual Symposium on Foundations of Computer Science, Research Triangle Park, North Carolina, USA, 30 October - 1 November 1989, pages 338-343, 1989.
[72] S. Vempala. Geometric random walks: A survey. MSRI Combinatorial and Computational Geometry, 52:573-612, 2005.
[73] S. S. Vempala. The Random Projection Method. AMS, 2004.
[74] Santosh Vempala. Geometric random walks: a survey. Combinatorial and Computational Geometry, 52:573-612, 2005.
[75] Andre Wibisono. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. arXiv preprint arXiv:1802.08089, 2018.
[76] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1-157, 2014.
[77] David B. Yudin and Arkadii S. Nemirovski. Evaluation of the information complexity of mathematical programming problems. Ekonomika i Matematicheskie Metody, 12:128-142, 1976.
Appendix A
Calculus - Review
Similarly, we use $D^k f(x)[h_1, h_2, \cdots, h_k]$ to denote the directional k-th derivative of f at x along directions $h_1, \cdots, h_k$.
Lemma A.1. Given $A \in \mathbb{R}^{n \times d}$, let $\Phi(x) = \sum_{i=1}^n f(a_i^\top x)$ where $a_i$ is the i-th row of A. Then, we have $\nabla\Phi(x) = A^\top f'(Ax)$ and $\nabla^2\Phi(x) = A^\top\mathrm{diag}(f''(Ax))A$, where $f'(Ax)$ is the vector defined by $(f'(Ax))_i = f'(a_i^\top x)$.
Proof. We compute the directional derivative: $D\Phi(x)[h] = \sum_{i=1}^n f'(a_i^\top x)\,a_i^\top h = \langle A^\top f'(Ax), h\rangle$. Since both sides agree for all h, we have $\nabla\Phi(x) = A^\top f'(Ax)$.
Similarly, we have
$$h^\top\nabla^2\Phi(x)h = D^2\Phi(x)[h,h] = \sum_{i=1}^n (f''(Ax))_i (Ah)_i^2 = h^\top A^\top\mathrm{diag}(f''(Ax))Ah.$$
Since $\nabla^2\Phi(x) - A^\top\mathrm{diag}(f''(Ax))A$ is symmetric and both sides agree for all h, we have $\nabla^2\Phi(x) = A^\top\mathrm{diag}(f''(Ax))A$.
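A quick numerical sanity check of Lemma A.1 (ours), using $f(t) = \log(1 + e^t)$ so that $f'$ is the sigmoid function:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
x = rng.standard_normal(3)

sig = lambda t: 1 / (1 + np.exp(-t))         # f'(t) for f(t) = log(1 + e^t)
grad = A.T @ sig(A @ x)                      # A^T f'(Ax)
hess = A.T @ np.diag(sig(A @ x) * (1 - sig(A @ x))) @ A   # A^T diag(f''(Ax)) A

Phi = lambda z: np.sum(np.log1p(np.exp(A @ z)))
eps = 1e-6
fd_grad = np.array([(Phi(x + eps * e) - Phi(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
print(np.max(np.abs(grad - fd_grad)))        # agrees up to ~1e-9
```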
Exercise A.2. Use the above method to compute the gradient and Hessian of $f(X) = \log\det A^\top XA$.
Here is a more complicated example.
Lemma A.3 (Brachistochrone Problem). Let (x, u(x)) be the curve from (0,0) to (1,-1), where the first coordinate is the x axis and the second coordinate is the y axis. Suppose that this is the curve along which a bead, sliding frictionlessly under uniform gravity, takes the shortest time from (0,0) to (1,-1). Then, we have that
$$2uu'' + (u')^2 + 1 = 0.$$
Remark. Take a look at Wikipedia for the Brachistochrone curve. It is counterintuitive!
Proof. The time to traverse the curve is $T(u) = \int \frac{ds}{v(x)}$, where ds is the arc length element and v(x) is the velocity at x. By conservation of energy, i.e., the gained kinetic energy must equal the lost potential energy at every point along the curve, we know that
$$\frac{1}{2}mv(x)^2 = -mgu(x).$$
Using $ds = \sqrt{1 + u'(x)^2}\,dx$ and $v(x) = \sqrt{-2gu(x)}$, we get
$$T(u) = \int_0^1 \sqrt{\frac{1 + u'(x)^2}{-2gu(x)}}\,dx.$$
Since u is a time-minimizing curve, any local change in u cannot reduce the time, i.e.,
$$DT(u)[h] = 0$$
for any change h of the curve u. We next compute the directional derivative of T(u), i.e., $\frac{d}{dt}\big|_{t=0} T(u + th)$:
$$DT(u)[h] = -\int_0^1 \frac{\sqrt{1 + u'(x)^2}}{2u(x)\sqrt{-2gu(x)}}\,h(x)\,dx + \int_0^1 \frac{u'(x)}{\sqrt{-2gu(x)}\sqrt{1 + u'(x)^2}}\,h'(x)\,dx.$$
Note that the second term involves h'(x). To change h'(x) to h(x), we use integration by parts (with respect to x, not t!):
$$\int_0^1 \frac{u'(x)\,h'(x)}{\sqrt{-2gu(x)}\sqrt{1 + u'(x)^2}}\,dx = \left[\frac{u'(x)\,h(x)}{\sqrt{-2gu(x)}\sqrt{1 + u'(x)^2}}\right]_0^1 - \int_0^1 \frac{d}{dx}\left(\frac{u'(x)}{\sqrt{-2gu(x)}\sqrt{1 + u'(x)^2}}\right)h(x)\,dx.$$
Since the endpoints of the curve are fixed, we have $h(1) = h(0) = 0$. Hence, the first term on the right hand side is 0.
Continuing,
$$\begin{aligned} DT(u)[h] &= -\int_0^1 \frac{\sqrt{1 + u'(x)^2}}{2u(x)\sqrt{-2gu(x)}}\,h(x)\,dx - \int_0^1 \frac{d}{dx}\left(\frac{u'(x)}{\sqrt{-2gu(x)}\sqrt{1 + u'(x)^2}}\right)h(x)\,dx \\ &= -\int_0^1 \frac{\sqrt{1 + u'(x)^2}}{2u(x)\sqrt{-2gu(x)}}\,h(x)\,dx - \int_0^1 \frac{u''(x)}{\sqrt{-2gu(x)}\sqrt{1 + u'(x)^2}}\,h(x)\,dx \\ &\quad + \int_0^1 \frac{u'(x)^2}{2u(x)\sqrt{-2gu(x)}\sqrt{1 + u'(x)^2}}\,h(x)\,dx + \int_0^1 \frac{u'(x)^2\,u''(x)}{\sqrt{-2gu(x)}\,(1 + u'(x)^2)^{3/2}}\,h(x)\,dx. \end{aligned}$$
Hence, we have $DT(u)[h] = \int_0^1 a(x)h(x)\,dx$ where
$$a(x) = -\frac{\sqrt{1 + u'(x)^2}}{2u(x)\sqrt{-2gu(x)}} - \frac{u''(x)}{\sqrt{-2gu(x)}\sqrt{1 + u'(x)^2}} + \frac{u'(x)^2}{2u(x)\sqrt{-2gu(x)}\sqrt{1 + u'(x)^2}} + \frac{u'(x)^2\,u''(x)}{\sqrt{-2gu(x)}\,(1 + u'(x)^2)^{3/2}}. \tag{A.1}$$
Note that a(x) is the gradient of T. Since $DT(u)[h] = 0$ for all h(x), we have that $a(x) = 0$ for all x. Multiplying both sides of (A.1) by $2\sqrt{-2gu(x)}\,(1 + u'(x)^2)^{3/2}\,u(x)$, we have
$$0 = -(1 + u'(x)^2)^2 - 2u(x)u''(x)(1 + u'(x)^2) + u'(x)^2(1 + u'(x)^2) + 2u(x)u'(x)^2u''(x) = -1 - u'(x)^2 - 2u(x)u''(x).$$
Lemma A.4. Let $f_t(x)$ be a family of smooth, strictly convex functions of x, and let $x_t = \mathrm{argmin}_x f_t(x)$. Then,
$$\frac{dx_t}{dt} = -(\nabla^2 f_t(x_t))^{-1}\,\nabla\frac{df_t}{dt}(x_t).$$
Proof. By the optimality condition, we have $\nabla f_t(x_t) = 0$. Taking derivatives on both sides, we have
$$\nabla^2 f_t(x_t)\frac{dx_t}{dt} + \nabla\frac{df_t}{dt}(x_t) = 0.$$
Since $f_t$ is strictly convex, $\nabla^2 f_t(x_t)$ is positive definite and hence invertible, and the result follows.

In Section 5.5, we used this to compute the derivative of the central path.
A.2 Solving optimization problems by hand
Consider an optimization problem of the form
$$\min_{x \in \Omega} f(x) \quad \text{subject to} \quad h_i(x) \le 0,\ \ell_j(x) = 0,$$
for some open set $\Omega$ and continuously differentiable functions f, $h_i$ and $\ell_j$. If x is a local minimum, then x satisfies the KKT conditions:
Stationarity: $\nabla f(x) + \sum_i u_i\nabla h_i(x) + \sum_j v_j\nabla\ell_j(x) = 0$
Complementary slackness: $u_i h_i(x) = 0$ for all i
Primal feasibility: $h_i(x) \le 0$ and $\ell_j(x) = 0$ for all i, j
Dual feasibility: $u_i \ge 0$ for all i
We prove Hölder's inequality as an example.

Fact A.6. For any vectors $x, y \in \mathbb{R}^n$, we have $\|xy\|_1 \le \|x\|_p\|y\|_q$ for any $1 \le p \le \infty$ and $1 \le q \le \infty$ with $\frac{1}{p} + \frac{1}{q} = 1$.

Proof. By symmetry, it suffices to compute
$$\max_{\|x\|_p \le 1} \sum_i x_i y_i$$
for nonzero $y \ge 0$. Now, we use the KKT theorem with $f(x) = -\sum_i x_i y_i$, $h(x) = \|x\|_p - 1$ and $\Omega = \mathbb{R}^n$. By the KKT conditions, any maximizer x satisfies
$$y = u \cdot x^{p-1}.$$
Hence, we have $1 = \sum_i x_i^p = \sum_i (y_i/u)^{\frac{p}{p-1}} = \sum_i (y_i/u)^q$, and therefore $u = \|y\|_q$. Now, we can compute $\sum_i x_i y_i$ as follows:
$$\sum_i x_i y_i = \sum_i \left(\frac{y_i}{u}\right)^{\frac{1}{p-1}} y_i = \frac{1}{u^{1/(p-1)}}\sum_i y_i^{p/(p-1)} = \frac{\|y\|_q^q}{u^{q-1}} = \|y\|_q.$$
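A numerical check of the proof's conclusion (ours): for $y \ge 0$, the maximizer $x \propto y^{q-1}$ on the p-norm sphere attains $\langle x, y\rangle = \|y\|_q$.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3.0
q = p / (p - 1)
y = rng.random(6) + 0.1                      # y > 0, as in the proof

x = y ** (q - 1)
x /= np.linalg.norm(x, ord=p)                # normalize so ||x||_p = 1
print(x @ y, np.linalg.norm(y, ord=q))       # the two printed values agree
```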
Appendix B
Notation

Symbol | Description
$o_\epsilon(f)$ | $o(f)$ for any fixed $\epsilon$
$\langle a, b\rangle$ | Inner product $a^\top b$
nnz | Number of nonzeros
$\tilde{O}(\cdot)$ | Asymptotic complexity ignoring logarithmic terms
$O^*(\cdot)$ | Asymptotic complexity ignoring logarithmic terms and error terms
$B^n$ | Unit Euclidean ball in $\mathbb{R}^n$
$B_p(x,r)$ | p-norm ball of radius r centered at x
$\|\cdot\|_p$ | p-norm: $\left(\sum_i |x_i|^p\right)^{1/p}$