
Techniques in Optimization and Sampling

(book in progress with improving citations)

Yin Tat Lee¹ (University of Washington & Microsoft Research)
Santosh Vempala² (Georgia Tech)

April 18, 2023

¹ yintat@uw.edu. This work is supported in part by CCF-1749609, DMS-1839116, DMS-2023166, Microsoft Research Faculty Fellowship, Sloan Research Fellowship and Packard Fellowships.
² vempala@gatech.edu. This work is supported in part by CCF-1563838, E2CDA-1640081, CCF-1717349 and DMS-1839323.
Contents

0.1 Examples

1 Introduction
  1.1 Why non-convex functions can be difficult to optimize
  1.2 Why is convexity useful? Linear Separability!
  1.3 Convex problems are everywhere!
  1.4 Examples of convex sets and functions
  1.5 Checking convexity
  1.6 Subgradients
  1.7 Logconcave functions

I Optimization

2 Gradient Descent
  2.1 Philosophy
  2.2 Basic Algorithm
  2.3 Analysis for convex functions
  2.4 Strongly Convex Functions
  2.5 Line Search
  2.6 Generalizing Gradient Descent*
  2.7 Gradient Flow
  2.8 Discussion

3 Elimination
  3.1 Cutting Plane Methods
  3.2 Ellipsoid Method
  3.3 From Volume to Function Value
  3.4 Center of Gravity Method
  3.5 Sphere and Parabola Methods
  3.6 Lower Bounds

4 Reduction
  4.1 Equivalences between Oracles
  4.2 Gradient from Evaluation via Finite Difference
  4.3 Separation via Membership
  4.4 Composite Problem via Duality

5 Geometrization
  5.1 Norms and Local Metrics
  5.2 Mirror Descent
  5.3 Frank–Wolfe
  5.4 The Newton Method
  5.5 Interior Point Method for Linear Programs
  5.6 Interior Point Method for Convex Programs

6 Sparsification
  6.1 Subspace embedding
  6.2 Leverage Score Sampling
  6.3 Stochastic Gradient Descent
  6.4 Coordinate Descent

7 Acceleration
  7.1 Chebyshev Polynomials
  7.2 Conjugate Gradient
  7.3 Accelerated Gradient Descent via Plane Search
  7.4 Accelerated Gradient Descent
  7.5 Accelerated Coordinate Descent
  7.6 Accelerated Stochastic Descent

II Sampling

8 Gradient-based Sampling
  8.1 Gradient-based methods: Langevin Dynamics
  8.2 Langevin Dynamics is Gradient Descent in Density Space*¹

9 Elimination and Reduction
  9.1 Cutting Plane method for Volume Computation
  9.2 Optimization from Membership via Sampling

10 Geometrization
  10.1 Basics of Markov chains
  10.2 Conductance of the Ball Walk
  10.3 Generating a warm start
  10.4 Isotropic Transformation
  10.5 Isoperimetry via localization
  10.6 Hit-and-Run
  10.7 Dikin walk
  10.8 Mixing with Strong Self-Concordance
  10.9 Hamiltonian Monte Carlo

11 Annealing
  11.1 Simulated Annealing
  11.2 Volume Computation

A Calculus - Review
  A.1 Tips for Computing Gradients
  A.2 Solving optimization problems by hand

B Notation

¹ Sections marked with * are more mathematical and can be skipped.


Preliminaries

We use B(x, r) to denote the Euclidean (or ℓ2) ball centered at x with radius r: {y : ∥y − x∥2 ≤ r}. We use conv(X) to denote the convex hull of X, namely conv(X) = {Σi αi xi : αi ≥ 0, Σi αi = 1, xi ∈ X}. We use ⟨x, y⟩ to denote the ℓ2 inner product x⊤y of x and y. For any two points x, y ∈ Rn, we view them as column vectors, and use [x, y] to denote conv({x, y}), namely, the line segment between x and y. Unless specified otherwise, ∥x∥ will be the ℓ2 norm ∥x∥2.
We use ei to denote the coordinate vector whose i-th coordinate is 1 and all other coordinates are 0.

Functions
Definition 0.1. For any L ≥ 0, a function f : V → W is L-Lipschitz if ∥f(x) − f(y)∥W ≤ L∥x − y∥V, where the norms ∥·∥V and ∥·∥W are ℓ2 norms if unspecified.

Definition 0.2. A function f ∈ C^k(Rn) if f is k-times differentiable and its k-th derivative is continuous.
Definition 0.3. A function f : X ⊆ Rn → R is called lower semi-continuous at a point x0 ∈ X if for every real y < f(x0) there exists a neighborhood U of x0 such that f(x) > y for all x ∈ U. Equivalently, f is lower semi-continuous at x0 iff

lim inf_{x→x0} f(x) ≥ f(x0).

The function is lower semi-continuous if it is so at every point in its domain. The definition of upper semi-continuous is analogous, with the inequalities reversed and lim inf replaced by lim sup.

Theorem 0.4 (Taylor's Remainder Theorem). For any g ∈ C^{k+1}(R), and any x and y, there is a ζ ∈ [x, y] such that

g(y) = Σ_{j=0}^{k} g^{(j)}(x) (y − x)^j / j! + g^{(k+1)}(ζ) (y − x)^{k+1} / (k + 1)!.

Definition 0.5. The convolution of two functions f, g : Rn → R is defined as h(x) = ∫_{Rn} f(z) g(x − z) dz.

Linear Algebra
Definition 0.6. A real symmetric matrix A is positive semi-definite (PSD) if x⊤Ax ≥ 0 for all x ∈ Rn. Equivalently, a real symmetric matrix is PSD iff all its eigenvalues are nonnegative. We write A ⪰ 0 if A is PSD and A ⪰ B if A − B is PSD.

Definition 0.7. For any matrix A, we define its trace tr A = Σi Aii, Frobenius norm ∥A∥F² = tr(A⊤A) = Σ_{i,j} Aij², and operator norm ∥A∥op = sup_{∥x∥2=1} ∥Ax∥2. Note that x⊤Ax = tr(Axx⊤) and, in general, tr(AB) = tr(BA).

For symmetric A, we have tr A = Σi λi, ∥A∥F² = Σi λi² and ∥A∥op = maxi |λi|, where the λi are the eigenvalues of A.
For a vector x, ∥x∥A = √(x⊤Ax); for a matrix B,

∥B∥A = sup_x ∥Bx∥A / ∥x∥A.


Lemma 0.8 (Sherman-Morrison-Woodbury). For an invertible matrix A ∈ Rn×n and vectors u, v ∈ Rn, we have

(A + uv⊤)⁻¹ = A⁻¹ − (A⁻¹uv⊤A⁻¹) / (1 + v⊤A⁻¹u).

For matrices U, V ∈ Rn×k, we have

(A + UV⊤)⁻¹ = A⁻¹ − A⁻¹U(I + V⊤A⁻¹U)⁻¹V⊤A⁻¹.
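As a quick sanity check, both identities can be verified numerically; here is a minimal numpy sketch (the matrices are random and purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 2
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned, invertible
u, v = rng.standard_normal(n), rng.standard_normal(n)
Ainv = np.linalg.inv(A)

# Rank-one update: (A + u v^T)^{-1} = A^{-1} - A^{-1} u v^T A^{-1} / (1 + v^T A^{-1} u)
lhs = np.linalg.inv(A + np.outer(u, v))
rhs = Ainv - Ainv @ np.outer(u, v) @ Ainv / (1 + v @ Ainv @ u)
assert np.allclose(lhs, rhs)

# Rank-k update: (A + U V^T)^{-1} = A^{-1} - A^{-1} U (I + V^T A^{-1} U)^{-1} V^T A^{-1}
U, V = rng.standard_normal((n, k)), rng.standard_normal((n, k))
lhs = np.linalg.inv(A + U @ V.T)
rhs = Ainv - Ainv @ U @ np.linalg.inv(np.eye(k) + V.T @ Ainv @ U) @ V.T @ Ainv
assert np.allclose(lhs, rhs)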

Probability
Definition 0.9. The Total Variation (TV) or ℓ1-distance between two distributions with measures ρ, ν with support Ω is

d_TV(ρ, ν) = (1/2) ∫_Ω |ρ(x) − ν(x)| dx = sup_{S⊂Ω} |ρ(S) − ν(S)| = sup_{S⊂Ω} (ρ(S) − ν(S)).

The following distances are not symmetric.

Definition 0.10. The KL-divergence of a density ρ with respect to another density ν is

D_KL(ρ∥ν) = ∫ ρ(x) log(ρ(x)/ν(x)) dx.

Definition 0.11. The χ-squared distance of a density ρ with respect to another density ν is defined as

χ²(ρ, ν) = E_ν[(ρ(x)/ν(x) − 1)²] = E_ρ[ρ(x)/ν(x)] − 1.

Definition 0.12. The Wasserstein p-distance between two probability measures ρ, ν over a metric space M is defined as

Wp(ρ, ν) = inf_{γ∈Γ(ρ,ν)} (E_{(x,y)∼γ} d(x, y)^p)^{1/p}

where Γ(ρ, ν) is the set of all couplings of ρ and ν (joint probability measures with support M × M whose marginals are ρ and ν).

Definition 0.13. The marginal of a distribution in Rn with density ν in the span of a k-dimensional subspace V is defined as

g(x) = ∫_{y∈V⊥} ν(x, y) dy

for any x ∈ V.
Note that the marginal is a convolution of the density with the indicator function of the subspace.

Definition 0.14. A distribution D with support contained in Rn is said to be isotropic if E_D(x) = 0 and E_D(xx⊤) = I.
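For intuition, here is a minimal numpy sketch (on made-up Gaussian data) of the standard whitening map that puts an empirical distribution into isotropic position:

import numpy as np

rng = np.random.default_rng(1)
# Samples from a skewed, shifted Gaussian (illustrative data).
X = rng.standard_normal((10000, 3)) @ np.diag([5.0, 1.0, 0.2]) + np.array([2.0, -1.0, 0.0])

mu = X.mean(axis=0)
Sigma = np.cov(X.T)
# The affine map x -> L^{-1}(x - mu), with Sigma = L L^T, makes the
# empirical mean 0 and the empirical covariance the identity.
L = np.linalg.cholesky(Sigma)
Y = np.linalg.solve(L, (X - mu).T).T

print(np.round(Y.mean(axis=0), 3))   # ~ 0
print(np.round(np.cov(Y.T), 3))      # ~ identity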

Geometry
Definition 0.15. We denote the unit Euclidean ball as Bⁿ = {x : ∥x∥2 ≤ 1}. More generally, we define the ℓp-norm ball of radius r centered at z as Bpⁿ(z, r) = {x : ∥x − z∥p ≤ r}.

Definition 0.16. The Minkowski sum of two sets A, B ⊂ Rn is defined as A + B = {x + y : x ∈ A, y ∈ B}.



0.1 Examples
Here is a list we want to cover:
• linear systems, SDD, M-matrices, directed Laplacians, multi-commodity flow, totally unimodular matrices (what other linear systems?)
• logistic regression, other regressions, ℓp regression, convex regression
• linear program < quadratic program < second order cone program < semidefinite program < conic program < convex program
  – Example: John ellipsoid, minimum enclosing ball, geometric programming and matrix scaling?
• Shortest Path, maximum flow, min cost flow
  – Example: Transportation
• Markov Decision Process
• matroid intersection, submodular minimization
Chapter 1

Introduction

In this book, we will study two topics involving convexity, namely optimization and sampling. Given a multivariate, real-valued function f,
1. How quickly can we find a point that minimizes f?
2. How fast can we sample a point according to the distribution with density defined by f, i.e., proportional to e^{−f}?
Optimization appears naturally across mathematics, the sciences and engineering for a variety of theoretical and practical reasons. Its study over centuries has been extremely fruitful. Sampling is motivated by the question of choosing a representative point or subset (rather than an extremal point). Rather than a feasible set, we have a distribution which assigns probabilities to subsets. The goal is to sample a point from a target distribution, i.e., the output point should lie in a given subset with probability equal to the probability of the set in the target distribution. These problems are quite closely connected; e.g., sampling from such distributions can be used to find near-optimal points. These problems are intractable in full generality, and have exponential (in dimension) complexity even under smoothness assumptions.
Convexity and its natural extensions are a current frontier of tractable, i.e., polynomial-time, computation. The assumption of convexity induces structure in instances that makes them amenable to efficient algorithms. For example, a local minimum of a convex function is a global minimum. Convexity is maintained by natural operations such as intersection (for sets) and addition (for functions). Perhaps less obvious, but also crucial, is that convex sets can be approximated by ellipsoids in various ways.
We will learn several techniques that lead us to polynomial-time algorithms for both problems and (nearly) linear-time algorithms for the case when f is close to a quadratic function.
Although convex optimization has been studied since the 19th century¹, with many tight results emerging, there are still many basic open problems. Here is an example:

Open Problem. Given an n × n random 0/1 matrix A with O(n) nonzero entries and a 0/1 vector b, can we solve Ax = b in o(n²) time?

Computing the volume is an ancient problem; the early Egyptians and Greeks developed formulas for specific shapes of interest. Unlike convex optimization, even computing the volume of a convex body is intractable, as we will see later. Nevertheless, there are efficient randomized algorithms that can estimate the volume of convex bodies to arbitrary accuracy, in time polynomial in the dimension and the desired accuracy. This extends to efficient algorithms for integrating logconcave functions, i.e., functions of the form e^{−f} where f is convex. The core ingredient is sampling in high dimension. Sampling and volume computation will be the motivating problems for the second part of this book. Again, many basic problems remain open. To illustrate:

Open Problem. Given a polytope defined by {x : Ax ≤ b}, can we estimate its volume to within a constant factor in nearly linear time?
¹ Augustin-Louis Cauchy introduced gradient descent in 1847.


1.1 Why non-convex functions can be difficult to optimize


Before discussing convex functions, we note that optimizing general functions can be difficult. Consider the function

f(x) = 1 if x ≠ x∗, and f(x) = 0 if x = x∗,

and suppose that we can only access the function by computing its function value. This function f always returns 1 unless we know x∗. Hence, one can prove that it takes infinitely many calls to f to find x∗.
This function is difficult to optimize, and not merely because of its discontinuity. Similar functions can be constructed that are continuous. Consider the function f : B(0n, 1) → R defined by

f(x) = min(∥x − x∗∥2, ϵ)     (1.1)

where B(0n, 1) is the unit ball centered at the origin, 0n. This function is 1-Lipschitz, i.e., for any x, y, we have |f(x) − f(y)| ≤ ∥x − y∥2, and unless we query f(x) with x that is ϵ-close to x∗, it will always return ϵ. Since the region where f is not ϵ has volume ϵⁿ times the volume of the unit ball, one can show that it takes Ω(1/ϵⁿ) calls to f to find x∗. The following exercise asks you to prove that this bound is tight.

Figure 1.1: Optimizing general 1-Lipschitz functions

Exercise 1.1. Show that if f is 1-Lipschitz on B(0n, 1), we can find a point x̂ such that f(x̂) − min_x f(x) ≤ ϵ by evaluating f at O(1/ϵ)ⁿ points.

Thus O(1/ϵ)ⁿ is the best possible bound for optimizing general 1-Lipschitz functions. Similar constructions can also be made for infinitely differentiable functions. We note that it is easy to find local minima for all the functions above. In Section 2.2, we will show that it is easy to find an approximate local minimum of a continuously differentiable function.
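A minimal numerical sketch (in Python) of the exhaustive-search strategy behind Exercise 1.1, run on the hard instance (1.1); the hidden x∗ and the value of ϵ are illustrative:

import numpy as np

def grid_search_minimize(f, n=2, eps=0.1):
    # Evaluate f on a grid of B(0,1) with spacing eps; for a 1-Lipschitz f the
    # best grid point is within O(eps * sqrt(n)) of the true minimum value
    # (use spacing eps/sqrt(n) to get accuracy exactly eps).
    ticks = np.arange(-1, 1 + eps, eps)
    pts = np.stack([g.ravel() for g in np.meshgrid(*([ticks] * n))], axis=1)
    pts = pts[np.linalg.norm(pts, axis=1) <= 1]      # O((1/eps)^n) points
    vals = np.array([f(x) for x in pts])
    return pts[vals.argmin()], vals.min()

eps = 0.1
rng = np.random.default_rng(2)
x_star = rng.standard_normal(2)
x_star *= 0.5 / np.linalg.norm(x_star)               # hidden minimizer inside B(0,1)
f = lambda x: min(np.linalg.norm(x - x_star), eps)   # the hard instance (1.1)
x_hat, val = grid_search_minimize(f, n=2, eps=eps)
print(val, np.linalg.norm(x_hat - x_star))           # val <= min f + eps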

1.2 Why is convexity useful? Linear Separability!


In the last section, we saw that general functions are difficult to optimize because we can only find the minimum via exhaustive search. Now we define convex sets, convex functions and convex problems (see Fig. 1.3). One benefit of convexity is that it enables binary search. We will see that for convex functions, local minima are global minima, and (later in this book) that we can in fact solve convex optimization problems in polynomial time.

Denition 1.2. A set K ⊆ Rn is convex if for every pair of points x, y ∈ K , we have [x, y] ⊆ K , where
[x, y] = {(1 − t)x + ty : t ∈ [0, 1]} is the one-dimensional interval from x to y .

Denition 1.3. A function f : Rn → R ∪ {+∞} is convex if for every t ∈ [0, 1], we have
f (tx + (1 − t)y) ≤ tf (x) + (1 − t)f (y).
1.2. Why is convexity useful? Linear Separability! 8

Figure 1.2: Non-Convex Function; Convex Function

Exercise 1.4. Suppose we have a function f : Rn → R ∪ {+∞} that has the property that for every x, y ∈ Rn,

f((x + y)/2) ≤ (f(x) + f(y))/2.

Show that this implies f is convex. (You may assume f is continuous.)

Figure 1.3: Convex function; Convex set

Denition 1.5. An optimization problem minx∈K f (x) is convex if K and f are convex.
Any point not in a closed convex set can be separated from the set by a hyperplane. We will see in Chapter 3 that separating hyperplanes allow us to do binary search to find a point in a convex set; this is the basis of all polynomial-time algorithms for optimizing general convex functions. The following notions will be used routinely. A halfspace in Rn is defined as the set {x : ⟨a, x⟩ ≥ b} for some a ∈ Rn, b ∈ R. A polyhedron is the intersection of finitely many halfspaces. A polytope is the convex hull of a finite set of points.

Theorem 1.6 (Hyperplane separation theorem). Let K be a nonempty closed convex set in Rn and y ∉ K. There is a non-zero θ ∈ Rn such that

⟨θ, y⟩ > max_{x∈K} ⟨θ, x⟩.

Proof. Let x∗ be a point in K closest to y, namely x∗ ∈ argmin_{x∈K} ∥x − y∥2² (such a minimizer always exists for closed convex sets and is unique; this is sometimes called Hilbert's projection theorem but can be proved directly for this setting). Using convexity of K, for any x ∈ K and any 0 < t ≤ 1, we have that (1 − t)x∗ + tx ∈ K and hence

∥y − (1 − t)x∗ − tx∥2² ≥ min_{x∈K} ∥y − x∥2² = ∥y − x∗∥2².

Expanding the LHS, we have

∥y − (1 − t)x∗ − tx∥2² = ∥y − x∗ + t(x∗ − x)∥2²
                       = ∥y − x∗∥2² + 2t⟨y − x∗, x∗ − x⟩ + t²∥x∗ − x∥2².

Canceling the term ∥y − x∗∥2² and dividing both sides by t,

2⟨y − x∗, x∗ − x⟩ + t∥x∗ − x∥2² ≥ 0.

Taking t → 0+, we have that

⟨y − x∗, x∗ − x⟩ ≥ 0 for all x ∈ K.     (1.2)

Taking θ = y − x∗ and using (1.2), for all x ∈ K, we have that

⟨θ, y − x⟩ = ⟨θ, y − x∗⟩ + ⟨θ, x∗ − x⟩ = ∥θ∥2² + ⟨y − x∗, x∗ − x⟩ > 0

where we used that y ∉ K and hence ∥θ∥2 > 0.
Theorem 1.6 implies that polyhedra (finite intersections of halfspaces) are essentially as general as closed convex sets:

Figure 1.4: Convex set with separating hyperplane

Corollary 1.7. Any closed convex set K can be written as the intersection of halfspaces as follows:

K = ⋂_{θ∈Rn} {x : ⟨θ, x⟩ ≤ max_{y∈K} ⟨θ, y⟩}.

In other words, any closed convex set is the limit of a sequence of polyhedra.

Proof. Let L := ⋂_{θ∈Rn} {x : ⟨θ, x⟩ ≤ max_{y∈K} ⟨θ, y⟩}. Since K ⊆ {x : ⟨θ, x⟩ ≤ max_{y∈K} ⟨θ, y⟩} for every θ, we have K ⊆ L.
For any x ∉ K, Theorem 1.6 shows that there is a θ such that θ⊤x > max_{y∈K} θ⊤y. Hence, we have x ∉ L, and therefore L ⊆ K.
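The proof of Theorem 1.6 is constructive: θ = y − x∗ where x∗ is the projection of y onto K. Here is a minimal numpy sketch for the special case where K is a Euclidean ball, so the projection has a closed form (the choice of K and the numbers are purely illustrative):

import numpy as np

def separating_hyperplane(y, center, r):
    # K = B(center, r); project y onto K, then theta = y - x_star separates y from K.
    x_star = center + r * (y - center) / np.linalg.norm(y - center)  # projection (y outside K)
    theta = y - x_star
    return theta, x_star

y = np.array([3.0, 4.0])
theta, x_star = separating_hyperplane(y, center=np.zeros(2), r=1.0)
# max over K of <theta, x> is attained at center + r * theta / ||theta||:
max_over_K = theta @ (1.0 * theta / np.linalg.norm(theta))
print(theta @ y, max_over_K)   # <theta, y> is strictly larger, as Theorem 1.6 asserts

For a general K given only by a membership oracle, computing this projection is itself a convex optimization problem (cf. Section 4.3 on obtaining separation from membership).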

This shows that convex optimization is related to linear programming (optimizing linear functions over polytopes) as follows:

min_{x∈K} f(x) = min_{(x,y) : x∈K, y∈R, y≥f(x)} y

where the set {x ∈ K, y ∈ R : y ≥ f(x)} can then be approximated by an intersection of halfspaces, namely {(x, y) : A(x; y) ≤ b} for some matrix A ∈ Rm×(n+1) and vector b ∈ Rm with m → +∞.
Exercise 1.8. Let A, B ⊂ Rn be nonempty disjoint closed convex sets. Show that there exists a vector v ∈ Rn such that sup_{x∈A} v⊤x ≤ inf_{x∈B} v⊤x. Show that the inequality can be made strict if the sets are also bounded.

Analogously to the case of convex sets, we have a separation theorem for convex functions. In Chapter 3, we will see that this allows us to use binary search to minimize convex functions.

Figure 1.5: First-order condition for convexity of a function

Theorem 1.9. Let f ∈ C 1 (Rn ) be convex. Then,

f (y) ≥ f (x) + ∇f (x)⊤ (y − x) for all x, y ∈ Rn (1.3)

Later in this chapter we will see that the above condition is in fact an equivalent denition of convexity.

Proof. Fix any x, y ∈ Rn. Let g(t) = f((1 − t)x + ty). Since f is convex, it is convex along every line, in particular over the segment [x, y], and so g is convex over [0, 1]. Then, we have

g(t) ≤ (1 − t)g(0) + tg(1)

which implies that

g(1) ≥ g(0) + (g(t) − g(0))/t.

Taking t → 0+, we have that g(1) ≥ g(0) + g′(0). Using the chain rule for derivatives, we have g′(0) = ⟨∇f(x), y − x⟩ and hence

f(y) ≥ f(x) + ⟨∇f(x), y − x⟩.

This theorem shows that ∇f (x) = 0 (local minimum) implies x is a global minimum.

Theorem 1.10 (Optimality condition for unconstrained problems). Let f ∈ C 1 (Rn ) be convex. Then,
x ∈ Rn is a minimizer of f (x) if and only if ∇f (x) = 0.

Proof. If ∇f(x) ≠ 0, then

f(x − ε∇f(x)) = f(x) − ε∥∇f(x)∥2² + O(ε²)∥∇f(x)∥2² < f(x)

for small enough ε. Hence, such a point cannot be the minimizer.
On the other hand, if ∇f(x) = 0, Theorem 1.9 shows that

f(y) ≥ f(x) + ∇f(x)⊤(y − x) = f(x) for all y.

We note that the proof above is in fact constructive: if x is not a minimizer, it suggests a point with a better function value. This will be the topic of an upcoming section. For continuous convex functions, there is a weaker notion of gradient called the subdifferential, which is a set instead of a vector. Both theorems above hold with gradients replaced by subdifferentials.

1.3 Convex problems are everywhere!


In this section, we give a few examples of convex problems to illustrate the wide applicability of convex
optimization.

1.3.1 Minimum Cost Flow Problem (Computer Science)


The min cost flow problem has many applications, such as route planning, airline scheduling, image segmentation, recommendation systems, etc. In this problem, there is a graph G = (V, E) with m = |E| and n = |V|. Each edge e ∈ E has capacity ue > 0 and cost ce. The problem is to minimize the total cost of sending d units of flow from a source vertex s ∈ V to a sink vertex t ∈ V. To imagine this less abstractly, imagine we want to match every person to the best flight for them. Then we can have a source node s connected to a node for each person, with ue = 1 for all such e. Further, we take t to be connected to a node for each flight, with ue being the number of people that can fit on that flight. Then, all the remaining edges will be from people nodes to flight nodes. For any such e, ue = 1 and ce is proportional to how good that flight is for that person (does it get them where they need to go at the time they need to go?), with 0 representing a perfect flight and ∞ representing a flight they would not take even if given the option for free. Then we can compute the min-cost flow to find the best allocation of people to flights.
Formally, the problem can be written as an optimization problem min_{f∈R^{|E|}} Σ_{e∈E} ce · fe subject to the constraints:
• Capacity constraints: 0 ≤ fe ≤ ue for all e ∈ E.
• Flow conservation: Σ_{e enters v} fe = Σ_{e leaves v} fe for all v ∈ V \ {s, t}.
• Demand: Σ_{e leaves s} fe = Σ_{e enters t} fe = d.
To check that this is a convex problem, we note that the objective function is a linear function c⊤f, which is convex. The domain is the intersection of the three sets above: the first is a scaled hypercube, and the second and third are affine subspaces. All of them are convex, and so is their intersection. Therefore, this is a convex problem.
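To make the formulation concrete, here is a minimal sketch that solves a toy instance as exactly the linear program above using scipy.optimize.linprog (the 4-node graph, capacities, costs and demand are made up for illustration):

import numpy as np
from scipy.optimize import linprog

# Toy graph: nodes s=0, a=1, b=2, t=3; edges as (tail, head).
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
u = [2.0, 2.0, 1.0, 1.0, 3.0]        # capacities u_e
c = [1.0, 3.0, 1.0, 4.0, 1.0]        # costs c_e
d = 3.0                              # demand from s to t
m = len(edges)

# Equality constraints: flow conservation at internal nodes a and b,
# plus net outflow d at the source s (conservation at t is then implied).
A_eq = np.zeros((3, m))
b_eq = np.array([0.0, 0.0, d])
for j, (tail, head) in enumerate(edges):
    for row, v in enumerate([1, 2]):
        A_eq[row, j] = (tail == v) - (head == v)
    A_eq[2, j] = (tail == 0) - (head == 0)

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, ue) for ue in u])
print(res.fun, res.x)                # minimum cost and the optimal flow f_e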

1.3.2 Linear Programs (Operations Research/Economics)


Consider the diet problem: find the cheapest diet plan that satisfies given nutrient requirements. Formally, suppose there are n different foods and m different nutrients. Food i has unit cost ci, a1i units of nutrient 1, a2i units of nutrient 2, and so on. Furthermore, every day we need bj units of nutrient j. Then, the problem is simply to find an assignment x such that

min_{x≥0, Ax≥b} c⊤x

where c ∈ Rn is the cost vector, b ∈ Rm is the intake requirement vector, and A ∈ Rm×n is the matrix of nutrient contents of each food².
² Unfortunately, Nobel Laureate Stigler showed [69] that the optimal meal is evaporated milk, cabbage, dried navy beans, and beef liver.

Both the diet problem and the minimum cost flow problem can be reformulated into the form

min_{Ax=b, x≥0} c⊤x     (1.4)

for some vectors c, b and some matrix A. These problems are called linear programs and have many applications in resource allocation. Special cases of linear programs are also of great interest; for example, the diet problem is a packing/covering LP.

Exercise 1.11. Show how the minimum cost flow problem and the diet problem can be written in the form (1.4). Also, show that (1.4) is a convex problem.

1.3.3 Logistic regression (from Machine Learning)


Consider the problem of predicting the likelihood of getting diabetes in the future. Suppose we have collected many examples {(xi, yi)}_{i=1}^n where xi ∈ Rd represents the features of a person and yi ∈ {±1} represents whether that person gets diabetes. For example, the feature vector can be (age, weight, height, BMI, fasting glucose level, ...). The features in the vector may be redundant; the purpose of extra variables is to make linear functions expressive enough to be able to classify. In particular, we assume that there is a vector θ such that for most i, we have ⟨xi, θ⟩ < 0 if yi = 1 and ⟨xi, θ⟩ > 0 if yi = −1. The error of the vector θ is

(1/n) Σ_{i=1}^n 1_{yi⟨xi,θ⟩>0}.

This function is not convex in θ. More generally, one considers the objective function (to be minimized over θ)

R(θ) = (1/n) Σ_{i=1}^n f(yi⟨xi, θ⟩) + λ∥θ∥1     (1.5)

where f is some function such that f(z) is large when z is positive and large, and f(z) is close to 0 when z is highly negative, and λ∥θ∥1 is a regularization term to make sure θ is bounded. One popular choice is f(z) = log(1 + e^z), and the resulting problem is called logistic regression. In Section 1.5, we prove that the function (1.5) is indeed convex when f(z) = log(1 + e^z).
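Here is a minimal numpy sketch of the objective (1.5) and the gradient of its smooth part on synthetic data (the data, dimensions, and λ are made up; f(z) = log(1 + e^z) as in the text, so f′(z) = 1/(1 + e^{−z})):

import numpy as np

rng = np.random.default_rng(3)
n, dim, lam = 200, 5, 0.01
theta_true = rng.standard_normal(dim)
X = rng.standard_normal((n, dim))
y = -np.sign(X @ theta_true)               # so y_i <x_i, theta_true> < 0 for all i

def R(theta):
    z = y * (X @ theta)
    return np.mean(np.logaddexp(0, z)) + lam * np.abs(theta).sum()

def grad_smooth_part(theta):               # gradient of the logistic term only
    z = y * (X @ theta)
    return (X * (y / (1 + np.exp(-z)))[:, None]).mean(axis=0)

print(R(np.zeros(dim)), R(theta_true))     # the separating theta has much smaller loss
print(np.linalg.norm(grad_smooth_part(theta_true)))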

1.3.4 Minimal Surface (Physics)


A surface M ⊂ R³ is a minimal surface if it has the minimum surface area among all surfaces with the same boundary. These surfaces appear naturally and have been studied extensively, not just in R³ but also on different manifolds. For simplicity, we consider the case where the surface is parameterized by

M = {(x, y, f(x, y)) : 0 ≤ x ≤ 1, 0 ≤ y ≤ 1}.

In this case, f is a minimal surface if

f = argmin_{feasible g} SurfaceArea(g)

where we say g is feasible if g(x, y) = f(x, y) for all (x, y) on the boundary of [0, 1]². One natural question (called Plateau's problem) is to find a minimal surface with a given boundary. For this particular case, we can simply use convex optimization. Note that the constraint (g is feasible) is exactly an affine subspace of the space of functions on [0, 1]². Furthermore, the objective is convex, by using the fact that

SurfaceArea(f) = ∫₀¹∫₀¹ √(1 + (∂f(x,y)/∂x)² + (∂f(x,y)/∂y)²) dx dy.

Exercise 1.12. Show that the surface area is convex by using the definition.
Calculus of variations is the area of mathematics that studies optimization in function spaces, and many theorems are shared between it and convex optimization.
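A discretized sketch of the surface-area objective (uniform grid, forward finite differences; the grid size and test surfaces are arbitrary). The discretized objective inherits convexity from the convex integrand √(1 + ∥∇f∥²):

import numpy as np

def surface_area(F, h):
    # Approximate SurfaceArea(f) for f sampled on a grid F with spacing h.
    fx = np.diff(F, axis=0)[:, :-1] / h     # forward differences for df/dx
    fy = np.diff(F, axis=1)[:-1, :] / h     # and for df/dy
    return np.sum(np.sqrt(1.0 + fx**2 + fy**2)) * h * h

h = 1.0 / 50
xs = np.arange(0, 1 + h, h)
X, Y = np.meshgrid(xs, xs, indexing="ij")
flat = np.zeros_like(X)                     # the flat surface: area exactly 1
bumpy = 0.1 * np.sin(4 * np.pi * X) * np.sin(4 * np.pi * Y)
print(surface_area(flat, h), surface_area(bumpy, h))   # flat is smaller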

1.4 Examples of convex sets and functions


There are many important convex sets, and here we only list some that appear in this course. One of the most important classes of convex functions comes from convex sets.

Definition 1.13. For a convex set K, we define the indicator function of K by

δK(x) = 0 if x ∈ K, and δK(x) = +∞ otherwise.

We can also construct convex sets from convex functions. We define the domain dom f := {x ∈ Rn : f(x) < +∞}. The definition of a convex function shows that dom f is a convex set if f is a convex function. Alternatively, by looking at the set of points above the graph of the function, we obtain a convex set called the epigraph.

Definition 1.14. The epigraph of f : Rn → R is epi f := {(x, t) ∈ Rn × R : t ≥ f(x)}.

A function f is convex if and only if epi f is a convex set.

Figure 1.6: Epigraph of f; quasiconvex function

This characterization shows that min_x f(x) is the same as min_{(x,t)∈epi f} t. Therefore, convex optimization is the same as optimizing a linear function over a convex set. Another important property is the following.

Fact 1.15. Any level set {x ∈ Rn : f(x) ≤ t} of a convex function f is convex.

In particular, this shows that the set of minimizers is connected. Therefore, any local minimum is a global minimum. We note that the converse of the fact above is not true. A function is quasiconvex if every level set is convex. Equivalently, a function f : Rn → R is quasiconvex if for every x, y ∈ Rn and t ∈ [0, 1],

f((1 − t)x + ty) ≤ max{f(x), f(y)}.

For example, Fig. 1.6 shows a function that is quasiconvex but not convex.
Finally, we note that many operations preserve convexity. Here is an example.

Exercise 1.16. For a matrix A, vector b, nonnegative scalars t1, t2 ≥ 0, and convex functions f1 and f2, show that the function g(x) = t1 f1(Ax + b) + t2 f2(x) is convex.
Here are some convex sets and functions. In Section 1.5, we illustrate how to check convexity.

Example. Convex sets: the polyhedron {x : Ax ≤ b}, the polytope conv({v1, . . . , vm}) with v1, . . . , vm ∈ Rn, the ellipsoid {x : x⊤Ax ≤ 1} with A ⪰ 0, the positive semidefinite cone {X ∈ Rn×n : X ⪰ 0}, the norm ball {x : ∥x∥p ≤ 1} for all p ≥ 1.

Example. Convex functions: x, max(x, 0), e^x, |x|^a for a ≥ 1, −log(x), x log x, ∥x∥p for p ≥ 1, (x, y) ↦ x²/y (for y > 0), A ↦ −log det A over PSD matrices A, (x, Y) ↦ x⊤Y⁻¹x (for Y ≻ 0), log Σi e^{xi}, −(Πi xi)^{1/n}.

Exercise 1.17. Show that the above sets and functions are all convex.
Exercise 1.18. Show that the intersection of convex sets is convex; show that for convex functions f, g, the function h(x) = max{f(x), g(x)} is also convex.

1.5 Checking convexity


It is often cumbersome to check if a function is convex via (1.3). The major benet of that denition is that
it works for non-dierentiable functions and innite dimensional spaces. However, when a function is twice
dierentiable, one can check the convexity by simply checking if the Hessian is positive semi-denite. (Recall
the denition of A ⪰ 0 in 0.6.)
To show this, we rst prove the following second-order Taylor theorem. A useful idea here is to note
that a function is convex if and only if its restriction to any one-dimensional line is convex, and to dene a
suitable one-dimensional function.
Theorem 1.19. For any f ∈ C²(Rn), and any x, y ∈ Rn, there is a z ∈ [x, y] such that

f(y) = f(x) + ∇f(x)⊤(y − x) + ½(y − x)⊤∇²f(z)(y − x).
Proof. Let g(t) = f((1 − t)x + ty). Taylor expansion (Theorem 0.4) shows that

g(1) = g(0) + g′(0) + ½g″(ζ)

for some ζ ∈ [0, 1]. To see the result, note that g(0) = f(x), g(1) = f(y), g′(0) = ∇f(x)⊤(y − x) and g″(ζ) = (y − x)⊤∇²f((1 − ζ)x + ζy)(y − x).
Now, we show that f is convex if and only if ∇²f(x) ⪰ 0 for all x.

Theorem 1.20. Let f ∈ C²(Rn). Then, the following are equivalent:
1. f is convex.
2. f(y) ≥ f(x) + ∇f(x)⊤(y − x) for all x, y ∈ Rn.
3. ∇²f(x) ⪰ 0 for all x ∈ Rn.
Proof. We have proved that (1) implies (2) in Theorem 1.9.
Suppose (2) holds. Then, for any x, h ∈ Rn,

f(x + th) ≥ f(x) + t∇f(x)⊤h.

By Taylor expansion (Theorem 1.19), we have that

f(x + th) = f(x) + t∇f(x)⊤h + (t²/2) h⊤∇²f(z)h

where z ∈ [x, x + th]. Comparing the two equations, we have that h⊤∇²f(z)h ≥ 0. Taking t → 0, we have z → x and hence ∇²f(z) → ∇²f(x). Therefore, we have that

h⊤∇²f(x)h ≥ 0

for all x and h. Hence, this gives (3).
Suppose (3) holds. Fix x, y ∈ Rn. Consider the function

g(λ) = f(λx + (1 − λ)y) − λf(x) − (1 − λ)f(y).

Consider λ∗ = argmax_{λ∈[0,1]} g(λ). If λ∗ is either 0 or 1, then we have g(λ∗) = 0. Otherwise, by Taylor's theorem, there is a ζ ∈ [λ∗, 1] such that

g(1) = g(λ∗) + g′(λ∗)(1 − λ∗) + ½g″(ζ)(1 − λ∗)²
     = g(λ∗) + ½g″(ζ)(1 − λ∗)²

where we used that g′(λ∗) = 0. Note that

g′(ζ) = ∇f(ζx + (1 − ζ)y)⊤(x − y) − f(x) + f(y),
g″(ζ) = (x − y)⊤∇²f(ζx + (1 − ζ)y)(x − y).

By assumption (3), we have that g″(ζ) ≥ 0 and hence 0 = g(1) ≥ g(λ∗). Hence, in both cases, max_{λ∈[0,1]} g(λ) = g(λ∗) ≤ 0. This gives (1).

Now, as an example, we prove that the function (1.5) is convex.

Example 1.21. The function (1.5) is convex for f(z) = log(1 + exp(z)).

Proof. We write R(θ) = R1(θ) + R2(θ) where R1(θ) = (1/n) Σ_{i=1}^n f(yi⟨xi, θ⟩) and R2(θ) = λ∥θ∥1. It is easy to check that R2 is convex, so it suffices to prove that R1 is convex. Now, we use Theorem 1.20 to prove that R1 is convex. Note that

∇R1(θ) = (1/n) Σ_{i=1}^n f′(yi⟨xi, θ⟩) yi xi,
∇²R1(θ) = (1/n) Σ_{i=1}^n f″(yi⟨xi, θ⟩) xi(xi)⊤

where we used (yi)² = 1. Since xi(xi)⊤ ⪰ 0, it suffices to prove that f″(yi⟨xi, θ⟩) ≥ 0. This follows from the calculation: f′(z) = exp(z)/(1 + exp(z)) = 1 − 1/(1 + exp(z)) and f″(z) = exp(z)/(1 + exp(z))² ≥ 0.
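Theorem 1.20 also suggests a practical (heuristic) numerical test: evaluate the Hessian at a few random points and check that its smallest eigenvalue is nonnegative. A minimal numpy sketch for the logistic term R1 of Example 1.21, on synthetic data:

import numpy as np

rng = np.random.default_rng(4)
n, dim = 100, 4
X = rng.standard_normal((n, dim))
y = rng.choice([-1.0, 1.0], size=n)

def hessian_R1(theta):
    # f''(z) = exp(z) / (1 + exp(z))^2, as computed in Example 1.21;
    # the Hessian is (1/n) sum_i f''(y_i <x_i, theta>) x_i x_i^T.
    z = y * (X @ theta)
    w = np.exp(z) / (1 + np.exp(z)) ** 2
    return (X.T * w) @ X / n

for _ in range(5):
    H = hessian_R1(rng.standard_normal(dim))
    print(np.linalg.eigvalsh(H).min() >= -1e-12)   # PSD at each sampled point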

1.6 Subgradients
The standard definition of a convex function in terms of gradients requires differentiability. However, a more general definition allows us to avoid this requirement. For a convex function f : Rn → R, we say that a function g : Rn → Rn is a subgradient if it satisfies the following property: for any x, y ∈ Rn, f(y) − f(x) ≥ ⟨g(x), y − x⟩. For the purpose of optimization algorithms, in almost all cases, a subgradient suffices in place of a gradient.

Figure 1.7: Sub-gradient

1.7 Logconcave functions


Definition 1.22. A function f : Rn → R+ is logconcave if log f is concave, i.e., f is nonnegative and for any t ∈ (0, 1), we have f(tx + (1 − t)y) ≥ f(x)^t f(y)^{1−t}.

Thus any logconcave function can be viewed as e^{−f(x)} for some convex function f.

Exercise 1.23. Show that for any t ≥ 0, the level set L(t) = {x : f (x) ≥ t} of a logconcave function f is
convex.
Example 1.24. The indicator function of a convex set, 1K(x) = 1 if x ∈ K and 0 otherwise, is logconcave. The Gaussian density function is logconcave. The Gaussian density restricted to any convex set is logconcave.

To see that the indicator function of a convex set K is logconcave, simply consider two points x, y which (1) both lie in K, (2) both lie outside, or (3) one in K and one outside. Now check the value of the indicator along any convex combination of x and y.
Lemma 1.25 (Dinghas; Prékopa; Leindler). The product, minimum and convolution of two logconcave functions are also logconcave; in particular, any linear transformation or any marginal of a logconcave density is logconcave; the distribution function of any logconcave density is logconcave.

We next describe the basic theorem underlying the above properties. We will see their proofs in a later
chapter.
Theorem 1.26 (Prékopa-Leindler). Fix λ ∈ [0, 1]. Let f, g, h : Rn → R+ be functions satisfying h(λx + (1 − λ)y) ≥ f(x)^λ g(y)^{1−λ} for all x, y ∈ Rn. Then,

∫_{Rn} h ≥ (∫_{Rn} f)^λ (∫_{Rn} g)^{1−λ}.

An equivalent version of the theorem for sets in Rn is often useful. By a measurable set below, we mean a Lebesgue measurable set; this measure coincides with the definition of volume (for an axis-aligned box, it is the product of the axis lengths; for any other set, it is the limit, over increasingly finer partitions into boxes, of the sum of the volumes of boxes that intersect the set).
Theorem 1.27 (Brunn-Minkowski). For any λ ∈ [0, 1] and measurable sets A, B ⊂ Rn , we have

vol(λA + (1 − λ)B)1/n ≥ λvol(A)1/n + (1 − λ)vol(B)1/n .


An immediate consequence of the first theorem above is that any one-dimensional marginal of the uniform distribution over a convex body is logconcave; the second implies that it is in fact (1/(n − 1))-concave (a function f is s-concave if f^s is concave) if the body is in Rn.

Exercise 1.28. Prove both corollaries just mentioned.
Example 1.29. We give an example problem related to Bayesian inference. Suppose we have a signal θ = (θ1, θ2, · · · , θn) and that we can take a measurement yi of each θi. The measurement only incurs unbiased Gaussian noise, i.e., yi = θi + ϵi where ϵi ∼ N(0, 1). The question is to recover the signal θ using y. Without any prior on θ, the only sensible recovery is θ = y. With a prior, one can apply Bayes' theorem:

P(θ|y) = P(y|θ)P(θ) / P(y).

The Bayesian choice is to find θ with maximum posterior probability, namely θ = argmax_θ log P(θ|y) = argmin_θ −log P(θ|y).
Using the noise assumption, we have that

P(y|θ) ∝ exp(−½ Σi (yi − θi)²).

Now, say we know the signal is smooth and we model the prior as P(θ) ∝ exp(−λ Σi (θi − θi+1)²), where λ controls how smooth the signal is. Hence,

−log P(θ|y) = c + ½ Σi (yi − θi)² + λ Σi (θi − θi+1)².

Since each term in the function above is convex, so is the whole formula. Hence (absorbing constant factors into λ), the recovery question becomes the convex optimization problem

min_θ Σi (yi − θi)² + λ Σi (θi − θi+1)².
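Since this objective is a convex quadratic in θ, its minimizer solves the linear system (I + λD⊤D)θ = y, where D is the finite-difference matrix. Here is a minimal numpy sketch on a synthetic noisy signal (the signal, n and λ are illustrative):

import numpy as np

rng = np.random.default_rng(5)
n, lam = 200, 10.0
t = np.linspace(0, 1, n)
theta_true = np.sin(2 * np.pi * t)          # a smooth hidden signal
y = theta_true + rng.standard_normal(n)     # noisy measurements y_i = theta_i + eps_i

# (D theta)_i = theta_i - theta_{i+1}: the finite-difference operator in the prior
D = np.eye(n - 1, n) - np.eye(n - 1, n, k=1)
theta_map = np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

print(np.linalg.norm(y - theta_true) / np.sqrt(n))          # error of raw measurements
print(np.linalg.norm(theta_map - theta_true) / np.sqrt(n))  # smaller after smoothing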

When we recover a signal, we want to know how condent we are because there are many choices of
θ that could explain the same measurement y . One way to do this is to sample multiple θ ∝ P(θ|y) and
compute the empirical variance or other statistics. Note that
P 2 P 2
P(θ|y) ∝ e− i (yi −θi ) −λ i (θi −θi+1 )

which is a logconcave distribution. Therefore, one can study the signal and quality of signal recovery via
logconcave sampling.
Part I

Optimization

Chapter 2

Gradient Descent

2.1 Philosophy
Optimization methods often follow this framework:

Algorithm 1: OptimizationFramework
for k = 0, 1, · · · do
    Approximate f by a simpler function fk according to the current point x(k)
    Do something using fk (such as setting x(k+1) = argmin_x fk(x))
end

The runtime depends on the number of iterations and the cost per iteration. Philosophically, the difficulties of a problem can never be created nor destroyed, only converted from one form of difficulty to another. When we decrease the number of iterations, the cost per iteration often increases. The gain of new methods often comes from avoiding some wasted computation, utilizing some forgotten information, or giving a faster but tailored algorithm for a sub-problem. This is of course just an empirical observation.
One key question to answer in designing an optimization algorithm is what the problem looks like (or how we can approximate f by a simpler function). Here are some approximations we will use in this textbook:
• First-order Approximation: f(y) ≈ f(x) + ⟨∇f(x), y − x⟩ (Section 2.2).
• Second-order Approximation: f(y) ≈ f(x) + ⟨∇f(x), y − x⟩ + ½(y − x)⊤∇²f(x)(y − x) (Section 5.4).
• Stochastic/Monte-Carlo Approximation: Σi fi(x) ≈ fj(x) for a random j, up to scaling (Section 6.3).
• Matrix Approximation: Approximate A by a simpler B with ½A ⪯ B ⪯ 2A (Section 6.1 and Section 6.2).
• Matrix Approximation: Approximate A by a low-rank matrix.
• Set Approximation: Approximate a convex set by an ellipsoid or a polytope (Section 3.1).
• Barrier Approximation: Approximate a convex set by a smooth function that blows up on the boundary (Section 5.5).
• Polynomial Approximation: Approximate a function by a polynomial (Section 7.1).
• Partial Approximation: Split the problem into two parts and approximate only one part.
• Taylor Approximation: f(y) ≈ Σ_{k=0}^{K} D^k f(x)[y − x]^k.
• Mixed ℓ2-ℓp Approximation: f(y) ≈ f(x) + ⟨∇f(x), y − x⟩ + Σ_{i=1}^n (αi(yi − xi)² + βi(yi − xi)^p).
Here are other approximations not covered:
• Stochastic Matrix Approximation: Approximate A by a simpler random B with B ⪯ 2A and EB ⪰ ½A.
• Homotopy Method: Approximate a function by a family of functions.
• ...(Please give me more examples here)...
The second question to answer is how to maintain the different approximations created in different steps. One simple way would be to forget the approximations obtained in previous steps, but this is often not optimal. Another way is to keep all previous approximations/information (as in Section 3.1). Often the best way is to carefully combine the previous and current approximations into a better approximation (as in Section 7.4).

2.2 Basic Algorithm


Perhaps the most natural algorithm for optimization is gradient descent. In fact, it has many variants with different guarantees. Assume that the function f to be optimized is continuously differentiable. By basic calculus, either the minimum (or a point achieving the minimum) is unbounded, or the gradient is zero at a minimum. So we try to find a point with gradient close to zero (which, of course, does not guarantee global optimality). The basic algorithm is the following.
Algorithm 2: GradientDescent (GD)
Input: Initial point x(0) ∈ Rn, step size h > 0.
for k = 0, 1, · · · do
    if ∥∇f(x(k))∥2 ≤ ϵ then return x(k);
    // Alternatively, one can use line search: x(k+1) ← argmin_{x = x(k) + t∇f(x(k)), t∈R} f(x).
    x(k+1) ← x(k) − h · ∇f(x(k)).
end
One can view gradient descent as a greedy method for solving min_{x∈Rn} f(x). At a point x, gradient descent moves to the minimizer of

min_{∥δ∥2 ≤ h∥∇f(x)∥2} f(x) + ∇f(x)⊤δ.

The term f(x) + ∇f(x)⊤δ is simply the first-order approximation of f(x + δ). Note that in this problem, the current point x is fixed and we are optimizing over the step δ. Certainly, there is no inherent reason for using the first-order approximation and the Euclidean norm ∥·∥2. For example, if you use a second-order approximation, then you would get a method involving the Hessian of f.
The step size of the algorithm is usually either a fixed constant, follows a predetermined schedule, or is determined using a line search.
If the iteration stops, we get a point with ∥∇f(x)∥2 ≤ ϵ. Why is this good? The hope is that x is a near-minimum in its neighborhood. However, this might not be true if the gradient can fluctuate wildly:

Definition 2.1. We say f has L-Lipschitz gradient if ∇f is L-Lipschitz, namely, ∥∇f(x) − ∇f(y)∥2 ≤ L∥x − y∥2 for all x, y ∈ Rn.
Similar to Theorem 1.20, we have the following equivalences:

Theorem 2.2. Let f ∈ C²(Rn). For any L ≥ 0, the following are equivalent:
1. ∥∇f(x) − ∇f(y)∥2 ≤ L∥x − y∥2 for all x, y ∈ Rn.
2. −LI ⪯ ∇²f(x) ⪯ LI for all x ∈ Rn.
3. |f(y) − f(x) − ∇f(x)⊤(y − x)| ≤ (L/2)∥y − x∥2² for all x, y ∈ Rn.

Proof. Suppose (1) holds. By the definition of ∇²f, we have

∇²f(x)v = lim_{h→0} (∇f(x + hv) − ∇f(x))/h.

Since ∥∇f(x + hv) − ∇f(x)∥2 ≤ L∥hv∥2 = Lh∥v∥2, we have ∥∇²f(x)v∥2 ≤ L∥v∥2, which means all eigenvalues of ∇²f are at most L in magnitude. This proves (2).

Suppose (2) holds. By the fundamental theorem of calculus,

∇f(x) − ∇f(y) = ∫₀¹ ∇²f(y + t(x − y))(x − y) dt,

so we have that

∥∇f(x) − ∇f(y)∥2 ≤ ∫₀¹ ∥∇²f(y + t(x − y))∥op ∥x − y∥2 dt ≤ L∥x − y∥2.

This gives (1).


Suppose (2) holds. By integrating along the direction y − x from x to y, we have

f(y) = f(x) + ∇f(x)⊤(y − x) + ∫₀¹ (1 − t)(y − x)⊤∇²f(x + t(y − x))(y − x) dt.

Since −LI ⪯ ∇²f(x) ⪯ LI, we have

|(y − x)⊤∇²f(x + t(y − x))(y − x)| ≤ L∥y − x∥2².

Since ∫₀¹ (1 − t) dt = 1/2, this gives (3).

Suppose (3) holds. Then g(x) = f(x) + (L/2)∥x∥2² satisfies g(y) ≥ g(x) + ∇g(x)⊤(y − x) for all x, y ∈ Rn. So Theorem 1.20 shows that g is convex and ∇²g(x) ⪰ 0. Hence, ∇²f(x) ⪰ −LI. Similarly, by taking g(x) = (L/2)∥x∥2² − f(x), we have ∇²f(x) ⪯ LI. This gives (2).
Exercise 2.3. Prove the implication (2) ⇒ (3) above by defining a one-dimensional function.
Exercise 2.4. Prove that if f ∈ C²(Rn) has an L-Lipschitz gradient, then the function g(x) = (L/2)∥x∥2² − f(x) satisfies ⟨∇g(x) − ∇g(y), x − y⟩ ≥ 0, and so g is convex.
With the equivalent definitions in Theorem 2.2, we can take an alternative view of gradient descent. In each step, we perform

x(k+1) = argmin_y f(x(k)) + ⟨∇f(x(k)), y − x(k)⟩ + (L/2)∥y − x(k)∥2².

To see this is the same step, let

g(y) = f(x(k)) + ⟨∇f(x(k)), y − x(k)⟩ + (L/2)∥y − x(k)∥2².

The optimality condition shows that 0 = ∇g(x(k+1)) = ∇f(x(k)) + L(x(k+1) − x(k)). Hence, this gives the step x(k+1) = x(k) − (1/L)∇f(x(k)).
By Theorem 2.2, we know that g is an upper bound on f, namely g(y) ≥ f(y) for all y. In general, many optimization methods involve minimizing some upper bound function in every step. Note that the progress we make on f is at least the progress we make on g. If g were exactly f, we could make all possible progress in one step. Hence, we should expect that if g is a better approximation of f, then we make more progress. Gradient descent uses the simplest first-order approximation. Although this is not the best approximation one can come up with, it is robust enough to use in all sorts of applications.

Exercise 2.5. Show that any convex function is Lipschitz over any compact subset of the interior of its domain.
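Here is a direct Python rendering of GradientDescent (a minimal sketch; the quadratic test function and the step size 1/L are chosen to match the analysis below):

import numpy as np

def gradient_descent(grad, x0, h, eps=1e-6, max_iter=10_000):
    # x_{k+1} = x_k - h * grad(x_k), stopping when ||grad(x_k)||_2 <= eps.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:
            break
        x = x - h * g
    return x

# Test function f(x) = (1/2) x^T A x - b^T x with A PSD, so grad f(x) = A x - b
# and the gradient is L-Lipschitz with L = lambda_max(A).
A = np.diag([1.0, 10.0, 100.0])
b = np.array([1.0, 1.0, 1.0])
L = np.linalg.eigvalsh(A).max()
x_hat = gradient_descent(lambda x: A @ x - b, x0=np.zeros(3), h=1.0 / L)
print(x_hat, np.linalg.solve(A, b))   # converges to the minimizer A^{-1} b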

2.2.1 Analysis for general functions


Gradient descent works for both convex and non-convex functions. For non-convex functions, we can only find a point with small gradient (an approximate stationary point).

Theorem 2.6. Let f ∈ C²(Rn) with L-Lipschitz gradient and let x∗ be any minimizer of f. Then GradientDescent with step size h = 1/L outputs a point x such that ∥∇f(x)∥2 ≤ ϵ in at most (2L/ϵ²)(f(x(0)) − f(x∗)) iterations.

One practical advantage of line search is that the algorithm does not need to know a bound on the Lipschitz constant of the gradient. The next lemma shows that the function value must decrease along the GD path for a sufficiently small step size, and the magnitude of the decrease depends on the norm of the current gradient.
Lemma 2.7. For any f ∈ C²(Rn) with L-Lipschitz gradient, we have

f(x − (1/L)∇f(x)) ≤ f(x) − (1/(2L))∥∇f(x)∥2².

Proof. Theorem 1.19 shows that

f(x − (1/L)∇f(x)) = f(x) − (1/L)∥∇f(x)∥2² + (1/(2L²))∇f(x)⊤∇²f(z)∇f(x)

for some z ∈ [x, x − (1/L)∇f(x)]. Since ∥∇²f(z)∥op ≤ L, we have that

∇f(x)⊤∇²f(z)∇f(x) ≤ L · ∥∇f(x)∥2².



Putting this back above, we have

f(x − (1/L)∇f(x)) ≤ f(x) − (1/L)∥∇f(x)∥2² + (L/(2L²))∥∇f(x)∥2² = f(x) − (1/(2L))∥∇f(x)∥2².

We can now prove the theorem.

Proof of Theorem 2.6. At each iteration, either ∥∇f(x)∥2 ≤ ϵ and the algorithm stops, or ∥∇f(x)∥2 > ϵ and, by Lemma 2.7, the function value decreases by at least ϵ²/(2L). Since the total decrease is at most f(x(0)) − f(x∗), the number of iterations is at most (2L/ϵ²)(f(x(0)) − f(x∗)).
Despite the simplicity of the algorithm and the proof, it is known that this is the best one can do by any algorithm in this general setting [15].

2.3 Analysis for convex functions


Assuming the function is convex, we can prove that gradient descent in fact converges to the global minimum.
In particular, when ∥∇f (x)∥2 is small, convexity shows that f (x) − f ∗ is small (Theorem 1.9).
Lemma 2.8. For any convex f ∈ C¹(Rn), we have that f(x) − f(y) ≤ ∥∇f(x)∥2 · ∥x − y∥2 for all x, y.

Proof. Theorem 1.9 and the Cauchy-Schwarz inequality show that

f(x) − f(y) ≤ ⟨∇f(x), x − y⟩ ≤ ∥∇f(x)∥2 · ∥x − y∥2.
This in turn gives a better bound on the number of iterations, because the bound in Theorem 2.6 depends on f(x(0)) − f∗ rather than ∥x − x∗∥2.

Theorem 2.9. Let f ∈ C²(Rn) be convex with L-Lipschitz gradient and let x∗ be any minimizer of f. With step size h = 1/L, the sequence x(k) in GradientDescent satisfies

f(x(k)) − f(x∗) ≤ 2LR²/(k + 4), where R = max_{f(x)≤f(x(0))} ∥x − x∗∥2.

Proof. Let ϵk = f(x(k)) − f(x∗). Lemma 2.7 shows that

f(x(k+1)) = f(x(k) − (1/L)∇f(x(k))) ≤ f(x(k)) − (1/(2L))∥∇f(x(k))∥2².

Subtracting f(x∗) from both sides, we have ϵk+1 ≤ ϵk − (1/(2L))∥∇f(x(k))∥2². Since the function value is decreasing, f(x(k)) ≤ f(x(0)) and hence ∥x(k) − x∗∥2 ≤ R; so Lemma 2.8 shows that

ϵk ≤ ∥∇f(x(k))∥2 · ∥x(k) − x∗∥2 ≤ ∥∇f(x(k))∥2 · R.

Therefore, we have that

ϵk+1 ≤ ϵk − (1/(2L))(ϵk/R)².

Now, we need to solve the recursion. We note that

1/ϵk+1 − 1/ϵk = (ϵk − ϵk+1)/(ϵk ϵk+1) ≥ (ϵk − ϵk+1)/ϵk² ≥ 1/(2LR²).

Also, we have that

ϵ0 = f(x(0)) − f∗ ≤ ∇f(x∗)⊤(x(0) − x∗) + (L/2)∥x(0) − x∗∥2² ≤ LR²/2.

Therefore, after k iterations, we have

1/ϵk ≥ 1/ϵ0 + k/(2LR²) ≥ 2/(LR²) + k/(2LR²) = (k + 4)/(2LR²).

This style of proof is typical in optimization. It shows that when the gradient is large, we make large progress, and when the gradient is small, we are close to optimal.
The proof above does not make essential use of any property of ℓ2 or inner product spaces. It can be extended to work for general norms if the gradient descent step is defined using that norm. For the case of ℓ2, one can prove that ∥x(k) − x∗∥2 is in fact decreasing.
Lemma 2.10. For h ≤ 2/L, we have that ∥x(k+1) − x∗∥2 ≤ ∥x(k) − x∗∥2. Therefore, for a convex function f with L-Lipschitz gradient, GD with h = 1/L satisfies

f(x(k)) − f(x∗) ≤ 2L∥x(0) − x∗∥2²/(k + 4).

Proof. We compute the distance to an optimal point as follows, noting that ∇f(x∗) = 0:

∥x(k+1) − x∗∥2² = ∥x(k) − x∗ − h∇f(x(k))∥2²
= ∥x(k) − x∗∥2² − 2h⟨∇f(x(k)), x(k) − x∗⟩ + h²∥∇f(x(k))∥2²
= ∥x(k) − x∗∥2² − 2h⟨∇f(x(k)) − ∇f(x∗), x(k) − x∗⟩ + h²∥∇f(x(k)) − ∇f(x∗)∥2².

To handle the term ∇f(x(k)) − ∇f(x∗), we note that

∇f(x(k)) − ∇f(x∗) = H(x(k) − x∗)

with H = ∫₀¹ ∇²f(x∗ + t(x(k) − x∗)) dt. Since 0 ⪯ H ⪯ LI and hence H ⪰ (1/L)H², we have

⟨∇f(x(k)) − ∇f(x∗), x(k) − x∗⟩ = (x(k) − x∗)⊤H(x(k) − x∗)
≥ (1/L)(x(k) − x∗)⊤H²(x(k) − x∗)
= (1/L)∥∇f(x(k)) − ∇f(x∗)∥2².

Hence, we have

∥x(k+1) − x∗∥2² ≤ ∥x(k) − x∗∥2² − (2h/L − h²)∥∇f(x(k)) − ∇f(x∗)∥2²
≤ ∥x(k) − x∗∥2².

The error estimate follows from ∥x(k) − x∗∥2² ≤ ∥x(0) − x∗∥2² for all k and the proof of Theorem 2.9.
Rewriting the bound, Theorem 2.9 shows it takes 2L∥x(0) − x∗∥2²/ϵ iterations to achieve error ϵ. Compared to the bound (2L/ϵ²)(f(x(0)) − f∗) in Theorem 2.6, it seems the new result has a strictly better dependence on ϵ. However, this is not true, because one measures the error in terms of ∥∇f(x)∥2 while the other uses f(x) − f∗. For f(x) = x²/2, we have f(x) − f∗ = ½∥∇f(x)∥2², and hence both have the same dependence on ϵ for this particular function. So, the real benefit of Theorem 2.9 is its global convergence.
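Here is a quick empirical sanity check of Theorem 2.9 (a sketch in Python; the test function f(x) = x⁴ and constants are illustrative — its second derivative 12x² is bounded by L = 12 on [−1, 1], where the iterates stay, and R = 1):

# Empirical check of Theorem 2.9 on f(x) = x^4 (convex, f* = 0, minimizer x* = 0).
L, R = 12.0, 1.0
x = 1.0
for k in range(1, 10_001):
    x -= (4 * x**3) / L                         # gradient step with h = 1/L
    if k in (10, 100, 1000, 10_000):
        print(k, x**4, 2 * L * R**2 / (k + 4))  # f(x_k) <= 2 L R^2 / (k + 4)

The printout confirms the bound holds at every checkpoint (for this function the actual decay is even faster than 1/k; the theorem only gives an upper bound).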

2.4 Strongly Convex Functions


We note that a convergence rate of ϵ⁻¹ or ϵ⁻² is not great if we need to solve the problem to very high accuracy (e.g., n bits of accuracy would mean ϵ = 2⁻ⁿ). Getting to very high accuracy is sometimes important if the optimization problem is used as a subroutine. We note that for the case f(x) = ½∥x∥2², gradient descent with step size h = 1 takes exactly 1 step. Therefore, it is natural to ask if one can improve the bound for functions close to quadratics. This motivates the following assumption.
Definition 2.11. A function f ∈ C¹(Rn) is µ-strongly convex if for any x, y ∈ Rn,

f(y) ≥ f(x) + ∇f(x)⊤(y − x) + (µ/2)∥y − x∥2².

The growth of a strongly convex function around any point is lower bounded by a quadratic function. Clearly a strongly convex function is also convex, and the sum of a strongly convex function and a convex function remains strongly convex. An example of a function that is convex but not strongly convex is f(x) = |x|.
Similar to the convex case (Theorem 1.20), we have the following.
Theorem 2.12. Let f ∈ C²(Rn). For any µ ≥ 0, the following are equivalent:
1. f(tx + (1 − t)y) ≤ tf(x) + (1 − t)f(y) − ½µt(1 − t)∥x − y∥2² for all x, y ∈ Rn and t ∈ [0, 1].
2. f(y) ≥ f(x) + ∇f(x)⊤(y − x) + (µ/2)∥y − x∥2² for all x, y ∈ Rn.
3. ∇²f(x) ⪰ µI for all x ∈ Rn.

Proof. This follows from applying Theorem 1.20 to the function g(x) = f(x) − (µ/2)∥x∥2².
Next, we study gradient descent for µ-strongly convex functions.

Lemma 2.13. Let f ∈ C²(Rn) be µ-strongly convex. Then, for any x, y ∈ Rn, we have

∥∇f(x)∥² ≥ 2µ(f(x) − f(y)).

Proof. By the definition of µ-strong convexity, we have

f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ/2)∥x − y∥².

Rearranging, we have

f(x) − f(y) ≤ ⟨∇f(x), x − y⟩ − (µ/2)∥x − y∥² ≤ max_∆ (∇f(x)⊤∆ − (µ/2)∥∆∥²) = (1/(2µ))∥∇f(x)∥².

This leads to the following guarantee. Note that the error now decreases geometrically rather than additively.

Theorem 2.14. Let f ∈ C²(Rn) be µ-strongly convex with L-Lipschitz gradient and let x∗ be any minimizer of f. With step size h = 1/L, the sequence x(k) in GradientDescent satisfies

f(x(k)) − f(x∗) ≤ (1 − µ/L)^k (f(x(0)) − f(x∗)).

In a later chapter, we will see that an accelerated variant of gradient descent improves this further by replacing the L/µ term with √(L/µ).
Proof. Lemma 2.7 shows that

f(x(k+1)) − f∗ ≤ f(x(k)) − f∗ − (1/(2L))∥∇f(x(k))∥2²     (2.1)
≤ f(x(k)) − f∗ − (µ/L)(f(x(k)) − f∗)     (2.2)
= (1 − µ/L)(f(x(k)) − f∗)     (2.3)

where we used Lemma 2.13 in (2.2). The conclusion follows.

A similar bound also holds for convergence to the optimal point.


Exercise 2.15. Let f ∈ C²(Rn) be µ-strongly convex with L-Lipschitz gradient and let x∗ be the minimizer of f. With step size h = 1/L, show that the sequence x(k) in GradientDescent satisfies

∥x(k) − x∗∥2² ≤ (1 − µ/L)^k ∥x(0) − x∗∥2².

It is natural to ask to what extent the assumption of convexity is essential for the bounds we obtained.
This is the motivation for the next exercises.
Exercise 2.16. Suppose f satisfies ⟨∇f(x), x − x∗⟩ ≥ α(f(x) − f(x∗)). Derive a bound similar to Theorem 2.9 for gradient descent.
Exercise 2.17. Suppose f satisfies ∥∇f(x)∥2² ≥ µ(f(x) − f(x∗)). Derive a bound similar to Theorem 2.14 for gradient descent.
Exercise 2.18. Give examples of nonconvex functions satisfying the above conditions. (Note: convex functions satisfy the first with α = 1 and µ-strongly convex functions satisfy the second.)

2.5 Line Search


In practice, we often stop line search early because we can make larger progress in a new direction. A standard stopping condition is the Wolfe condition.

Definition 2.19. A step size h satisfies the Wolfe conditions with respect to direction p if
1. f(x + hp) ≤ f(x) + c1h · p⊤∇f(x),
2. −p⊤∇f(x + hp) ≤ −c2 p⊤∇f(x),
with 0 < c1 < c2 < 1.

Let the objective for progress be ϕ(h) = f(x) − f(x + hp). The first condition requires that the algorithm make sufficient progress (ϕ(h) ≥ c1h · ϕ′(0)). The second condition requires the slope to decrease significantly (ϕ′(h) ≤ c2ϕ′(0)). One can think of the first condition as giving an upper bound on h, while the second gives a lower bound on h. In general, a step size satisfying the Wolfe conditions will be larger than the step size h = 1/L. In particular, one can show the following.
Exercise 2.20. Suppose f is µ-strongly convex and has L-Lipschitz gradient. If a step size h satisfies the Wolfe conditions with respect to the direction p = −∇f(x), then show that

2(1 − c1)/µ ≥ h ≥ (1 − c2)/L.
As a corollary, we have that the function value progress given by such a step is Ω(∥∇f(x)∥^2/L). Therefore, this gives the same guarantees as Theorem 2.9 and Theorem 2.14. A common way to implement this is via a backtracking line search: the algorithm starts with a large step size and decreases it by a constant factor whenever the Wolfe conditions are violated. For gradient descent, the next step involves computing ∇f(x + hp) exactly anyway, and hence if our line search accepts the step size immediately, the line search costs almost nothing extra. Therefore, if we maintain a step size throughout the algorithm and decrease it only when it violates the condition, the total cost of the line search is only an additive logarithmic number of gradient calls over the whole algorithm, which is negligible.
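As a concrete illustration, here is a minimal sketch (ours) of a backtracking line search that shrinks the step until both conditions of Definition 2.19 hold; the constants c_1, c_2 and the shrink factor are illustrative choices.

import numpy as np

def wolfe_backtracking(f, grad, x, p, h0=1.0, c1=0.1, c2=0.9, shrink=0.5, max_iter=50):
    """Shrink h until the Wolfe conditions hold for direction p at x."""
    fx, gx = f(x), grad(x)
    slope = p @ gx                      # p^T grad f(x); negative for a descent direction
    h = h0
    for _ in range(max_iter):
        sufficient = f(x + h * p) <= fx + c1 * h * slope          # condition 1
        curvature = -(p @ grad(x + h * p)) <= -c2 * slope         # condition 2
        if sufficient and curvature:
            return h
        h *= shrink
    return h

# Usage on f(x) = 0.5 ||x||^2 with the steepest descent direction:
f = lambda x: 0.5 * x @ x
grad = lambda x: x
x = np.array([3.0, -2.0])
print("accepted step size:", wolfe_backtracking(f, grad, x, -grad(x)))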
Finally, we note that for problems of the form f(x) = Σ_i f_i(a_i^⊤x), the bottleneck is often in computing Ax. In this case, exact line search is almost free because we can store the vectors Ax and Ah.

2.6 Generalizing Gradient Descent*

Now, we study the properties of gradient descent that were actually used for the strongly convex case. There are many ways to generalize it. One way is to view gradient descent as approximating the function f by splitting it into two terms: one term is the first-order approximation and the second term is just an ℓ_2 norm. More generally, we can split a function into two terms, one that is easy to optimize and another that we need to approximate with some error. More precisely, we consider the following definition.
Definition 2.21. We say g + h is an α-approximation to f at the point x if
• g is convex, with the same value at x, i.e., g(x) = f(x),
• h(x) = 0 and h((1 − α)x + αx̂) ≤ α²h(x̂) for all x̂,
• g(y) + αh(y) ≤ f(y) ≤ g(y) + h(y) for all y.
To understand this assumption, we note that if f is µ-strongly convex with L-Lipschitz gradient, then for any x, we can use α = µ/L and
g(y) = f(x) + ⟨∇f(x), y − x⟩,
h(y) = (L/2)∥y − x∥^2.
The condition requires h to converge to 0 quadratically as y → x.
Now, we consider the following algorithm:
Algorithm 3: GeneralizedGradientDescent
Input: Initial point x(0) ∈ Rn , approximation factor α > 0.
for k = 0, 1, · · · do
Find an α-approximation of f at x(k) given by g (k) (x) + h(k) (x).
x(k+1) ← arg miny g (k) (y) + h(k) (y).
end

Theorem 2.22. Suppose we are given a convex function f such that we can find an α-approximation at any x. Let x* be any minimizer of f. Then the sequence x^(k) in GeneralizedGradientDescent satisfies
f(x^(k)) − f(x*) ≤ (1 − α)^k (f(x^(0)) − f(x*)).
Proof. Using the fact that g^(k) + h^(k) is an upper bound on f, we have that our progress on f is larger than the best possible progress on g^(k) + h^(k):
f(x^(k+1)) ≤ min_y g^(k)(y) + h^(k)(y).
To bound the best possible progress, i.e., the RHS above, we consider x̂ = arg min_y g^(k)(y) + αh^(k)(y) and z = (1 − α)x^(k) + αx̂. We have that
min_y g^(k)(y) + h^(k)(y) ≤ g^(k)(z) + h^(k)(z)
≤ (1 − α)g^(k)(x^(k)) + αg^(k)(x̂) + α²h^(k)(x̂)
≤ (1 − α)g^(k)(x^(k)) + α(g^(k)(x*) + αh^(k)(x*))
where we used the convexity of g^(k) and the assumption on h in the second inequality, and the fact that x̂ minimizes g^(k) + αh^(k) in the third.
Combining both and using the fact that g^(k) + αh^(k) is a lower bound on f, we have
f(x^(k+1)) ≤ (1 − α)f(x^(k)) + αf(x*).
This gives the result.
Although Theorem 2.22 looks weird, it captures many well-known theorems. Here we list some of them.
Exercise 2.23. Show that the second condition in the definition of α-approximation can be replaced by
g(y) + αh(y/α) ≤ f(y) ≤ g(y) + h(y)
while maintaining the guarantee for the convergence of generalized GD.
Projected Gradient Descent / Proximal Gradient Descent

Given a convex set K and a µ-strongly convex function f with L-Lipschitz gradient, we consider the problem
min_{x∈K} f(x).
To apply Theorem 2.22 to F(x) := f(x) + δ_K(x), for any x, we consider the functions
g(y) = f(x) + ⟨∇f(x), y − x⟩ + δ_K(y),
h(y) = (L/2)∥y − x∥_2^2.
Note that g + h is a (µ/L)-approximation to F. Theorem 2.22 shows that GeneralizedGradientDescent converges in O((L/µ) log(1/ϵ)) steps, where each step involves solving the problem
min_{y∈K} f(x) + ⟨∇f(x), y − x⟩ + (L/2)∥y − x∥_2^2.
More generally, this works for problems of the form
min_x f(x) + ϕ(x)
for some convex function ϕ(x). Theorem 2.22 requires us to solve a sub-problem of the form
min_y f(x) + ⟨∇f(x), y − x⟩ + (L/2)∥y − x∥_2^2 + ϕ(y).
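For instance, when ϕ = δ_K for the Euclidean ball K = B(0, 1), the sub-problem becomes a projected gradient step with an explicit projection. The following sketch (ours, under these illustrative assumptions) implements the resulting iteration.

import numpy as np

def project_ball(y, radius=1.0):
    """Euclidean projection onto K = B(0, radius)."""
    nrm = np.linalg.norm(y)
    return y if nrm <= radius else (radius / nrm) * y

def projected_gradient(grad, x0, L, steps=100):
    """Each step solves min_{y in K} <grad f(x), y - x> + (L/2)||y - x||^2,
    whose solution is the projection of the gradient step onto K."""
    x = x0
    for _ in range(steps):
        x = project_ball(x - grad(x) / L)
    return x

# Usage: minimize f(x) = 0.5 ||x - b||^2 over the unit ball (L = 1).
b = np.array([2.0, 0.0])
x = projected_gradient(lambda x: x - b, np.zeros(2), L=1.0)
print(x)   # approaches b/||b|| = (1, 0), the constrained minimizer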

ℓ_p Regression
One can apply this framework to ℓ_p regression min_x ∥Ax − b∥_p^p. For example, one can approximate the function f(x) = x^p by g(y) = x^p + px^{p−1}(y − x) and h(y) = p2^{p−1}(x^{p−2}(y − x)^2 + (y − x)^p). Using this, one can show that one can solve the problem
min_x ∥Ax − b∥_p^p
via sub-problems of the form
min_y v^⊤y + ∥DA(y − x)∥_2^2 + ∥A(y − x)∥_p^p   (2.4)
for some vector v and some diagonal matrix D. One can show that Theorem 2.22 only needs the sub-problem to be solved approximately. Therefore, this shows that one can solve ℓ_p regression with log(1/ϵ) convergence by solving mixed ℓ_2 + ℓ_p regression approximately.

Other assumptions
For some special cases, it is possible to analyze this algorithm without convexity. One prominent application is compressive sensing:
min_{∥x∥_0 ≤ k} ∥Ax − b∥_2^2.
For matrices A satisfying the restricted isometry property, one can apply GeneralizedGradientDescent to this problem with the splitting g(x) = 2A^⊤(Ax − b) + δ_{∥x∥_0 ≤ k} and h(x) = ∥x∥^2. In this case, the algorithm is called iterative hard-thresholding [10] and the sub-problem has a closed form expression.

Exercise 2.24. Give the closed form solution for the sub-problem given by the splitting above.
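For reference, here is a minimal sketch (ours) of the resulting iteration under these assumptions; the thresholding step below is the closed form of the sub-problem.

import numpy as np

def hard_threshold(x, k):
    """Closed form of the sub-problem: keep the k largest-magnitude entries."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def iht(A, b, k, steps=200, step_size=None):
    """Iterative hard-thresholding for min ||Ax - b||^2 s.t. ||x||_0 <= k;
    step_size ~ 1/L where L = largest eigenvalue of 2 A^T A."""
    n = A.shape[1]
    if step_size is None:
        step_size = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)
    x = np.zeros(n)
    for _ in range(steps):
        x = hard_threshold(x - step_size * 2 * A.T @ (A @ x - b), k)
    return x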
2.7 Gradient Flow

In continuous time, gradient descent follows the Ordinary Differential Equation (ODE)
dx_t/dt = −∇f(x_t).
This can be viewed as the canonical continuous algorithm. Finding the right discretization has led to many fruitful research directions. One benefit of the continuous view is to simplify some calculations. For example, for strongly convex f, Theorem 2.14 now becomes
(d/dt)(f(x_t) − f(x*)) = ∇f(x_t)^⊤ (dx_t/dt) = −∥∇f(x_t)∥_2^2 ≤ −2µ(f(x_t) − f(x*))
where we used Lemma 2.13 at the end. Solving this differential inequality, we have
f(x_t) − f(x*) ≤ e^{−2µt}(f(x_0) − f(x*)).
Without the strong convexity assumption, gradient flow can behave wildly. For example, the length of the gradient flow can be exponential in d on the unit ball [56].
We emphasize that this continuous view is mainly useful for understanding: it is indicative of, but does not necessarily imply, an algorithmic result. In some cases, effective algorithmic results can be obtained simply by discretizing time in the gradient flow. The study of such numerical methods and their convergence properties is its own field; well-known basic methods include the forward Euler method (which results in the basic version of GD), the backward Euler method, the Implicit Midpoint method and Runge–Kutta methods. We will see that gradient flow and its discretization also play an important role in the development of sampling algorithms.
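As a small illustration (ours, with an arbitrary test function), discretizing the ODE by forward Euler with step h recovers the basic gradient descent iteration x_{k+1} = x_k − h∇f(x_k):

import numpy as np

def gradient_flow_euler(grad, x0, h=0.01, steps=1000):
    """Forward Euler discretization of dx/dt = -grad f(x)."""
    x = x0
    for _ in range(steps):
        x = x - h * grad(x)   # one Euler step = one gradient descent step
    return x

# For f(x) = 0.5 mu ||x||^2, the flow is x_t = e^{-mu t} x_0; Euler with
# step h gives (1 - h mu)^k x_0, matching the e^{-2 mu t} decay of f.
mu = 2.0
x = gradient_flow_euler(lambda x: mu * x, np.array([1.0, -1.0]), h=0.01, steps=500)
print(x)  # close to exp(-mu * 5) * x0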
2.8 Discussion
Convex optimization by variants of gradient descent is a very active field with an increasing number of applications. Often, methods that are provable for the convex setting are applied as heuristics to nonconvex problems as well, most notably in deep learning. This is one of the principal features of GD: its wide applicability as an algorithmic paradigm.
Researchers are also using GD to get provably faster algorithms for classical problems. For example, [2] and [40] applied the decomposition of Eqn. (2.4) to obtain fast algorithms for ℓ_p regression and the ℓ_p flow problem. [36] showed that the ℓ_p flow problem can be used as a subroutine to solve the uncapacitated maximum flow problem in m^{4/3+o(1)} time. Instead of assuming h(x) converges to 0 quadratically, [55] proved Theorem 2.22 assuming h is given by some divergence function and showed its applications in D-optimal design.
Chapter 3

Elimination

3.1 Cutting Plane Methods

The goal of this chapter is to present some polynomial-time algorithms for convex optimization. These algorithms use knowledge of the function in a minimal way, essentially by querying the function value and weak bounds on its support/range.
Given a continuously differentiable convex function f, Theorem 1.9 shows that
f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ for all y.   (3.1)
Let x* be any minimizer of f. Replacing y with x*, we have that
f(x) ≥ f(x*) ≥ f(x) + ⟨∇f(x), x* − x⟩.
Therefore, we know that ⟨∇f(x), x* − x⟩ ≤ 0. Namely, x* lies in a halfspace H with normal vector −∇f(x). Roughly speaking, this shows that each gradient computation cuts the set of possible solutions in half. In one dimension, this allows us to do a binary search to minimize convex functions.
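Concretely, in one dimension the sign of f′ serves as the separation oracle and bisection halves the interval containing the minimizer; a minimal sketch (ours, with an illustrative function) is below.

def bisection_minimize(fprime, lo, hi, iters=50):
    """Minimize a 1-D convex function on [lo, hi] using only the sign of f'."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if fprime(mid) > 0:      # minimizer lies to the left of mid
            hi = mid
        else:                    # minimizer lies to the right (or at mid)
            lo = mid
    return (lo + hi) / 2

# Usage: f(x) = (x - 0.3)^2, f'(x) = 2 (x - 0.3)
print(bisection_minimize(lambda x: 2 * (x - 0.3), 0.0, 1.0))  # ~0.3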
It turns out that in R^n, binary search still works. In this chapter, we will cover several ways to do this binary search. All of them follow the same framework, called the cutting plane method. In this method, the convex set or function of interest is given by an oracle, typically a separation oracle: for any x ∉ K ⊆ R^n, the oracle finds a vector g(x) ∈ R^n such that
g(x)^⊤(y − x) ≤ 0 for all y ∈ K.
Cutting plane methods address the following class of problems.

Problem 3.1 (Finding a point in a convex set). Given ϵ > 0, R > 0, and a convex set K ⊆ R·B^n specified by a separation oracle, find a point y ∈ K or conclude that vol K ≤ ϵ^n. The complexity of an algorithm is measured by the number of calls to the oracle and the number of arithmetic operations.
Remark. To minimize a convex function, we set g(x) = ∇f(x) and K to be the set of (approximate) minimizers of f. In Section 3.3, we relate the problem of proving that vol(K) is small to the problem of finding an approximate minimizer of f.
In this framework, we maintain a convex set E^(k) that contains the set K. In each iteration, we choose some x^(k) based on E^(k) and query the oracle for g(x^(k)). The guarantee for g(·) implies that K lies in the halfspace
H^(k) = {y : g(x^(k))^⊤(y − x^(k)) ≤ 0}
and hence K ⊂ H^(k) ∩ E^(k). The algorithm continues by choosing E^(k+1) to be a convex set that contains H^(k) ∩ E^(k).
Algorithm 4: CuttingPlaneFramework
Input: Initial set E^(0) ⊆ R^n containing K.
for k = 0, · · · do
    Choose a point x^(k) ∈ E^(k).
    if E^(k) is small enough then return x^(k);
    Find E^(k+1) ⊃ E^(k) ∩ H^(k) where
        H^(k) := {x ∈ R^n : g(x^(k))^⊤(x − x^(k)) ≤ 0}.   (3.2)
end
To analyze the algorithm, the main questions we need to answer are:
1. How do we choose x(k) and E (k+1) ?
2. How do we measure progress?
3. How quickly does the method converge?
4. How expensive is each step?
Progress on the cutting plane method is shown in the next table.

Year              E^(k) and x^(k)            #Iter   Cost/Iter
1965 [45, 58]     Center of gravity          n       n^n
1979 [77, 65, 38] Center of ellipsoid        n^2     n^2
1988 [37]         Center of John ellipsoid   n       n^2.878
1989 [71]         Volumetric center          n       n^2.378
1995 [7]          Analytic center            n       n^2.378
2004 [9]          Center of gravity          n       n^4
2015 [43]         Hybrid center              n       n^2 (amortized)
2020 [30]         Volumetric center          n       n^2 (amortized)

Table 3.1: Different cutting plane methods; polylogarithmic terms omitted. The number of iterations follows from the rate of volume decrease.

3.2 Ellipsoid Method

Figure 3.1: Ellipsoid Method

We start by explaining the Ellipsoid method. The algorithm maintains an ellipsoid
E^(k) := {y ∈ R^n : (y − x^(k))^⊤ (A^(k))^{−1} (y − x^(k)) ≤ 1}
that contains K and becomes smaller in volume in each step. Note that for this to be an ellipsoid the matrix A^(k) must be symmetric PSD. After we compute g(x^(k)) and H^(k) via (3.2), we define E^(k+1) to be the smallest volume ellipsoid containing E^(k) ∩ H^(k). The key observation is that the volume of the ellipsoid E^(k) decreases by a factor of 1 − Θ(1/n) every iteration. This volume property holds for any halfspace through the
center of the current ellipsoid (not only for the one whose normal is the gradient), a property we will exploit
in the next chapter.
Algorithm 5: Ellipsoid
Input: Initial ellipsoid E^(0) = {y ∈ R^n : (y − x^(0))^⊤ (A^(0))^{−1} (y − x^(0)) ≤ 1}.
for k = 0, · · · do
    if E^(k) is small enough or Oracle says YES then return x^(k);
    x^(k+1) = x^(k) − (1/(n+1)) · A^(k)g(x^(k)) / √(g(x^(k))^⊤ A^(k) g(x^(k))).
    A^(k+1) = (n²/(n²−1)) (A^(k) − (2/(n+1)) · A^(k)g(x^(k))g(x^(k))^⊤A^(k) / (g(x^(k))^⊤ A^(k) g(x^(k)))).
end
Remark 3.2. We usually obtain g(x) from a separation oracle.
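A direct transcription of the update (a sketch in floating point, ignoring the numerical-stability issues a careful implementation must address):

import numpy as np

def ellipsoid_step(x, A, g):
    """One update of the Ellipsoid method: (x, A) describes
    E = {y : (y - x)^T A^{-1} (y - x) <= 1}, and g = g(x) is the cut vector."""
    n = len(x)
    Ag = A @ g
    gAg = g @ Ag
    x_new = x - (1.0 / (n + 1)) * Ag / np.sqrt(gAg)
    A_new = (n**2 / (n**2 - 1.0)) * (A - (2.0 / (n + 1)) * np.outer(Ag, Ag) / gAg)
    return x_new, A_new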

Lemma 3.3. For the Ellipsoid method (Algorithm 5), we have
vol E^(k+1) < e^{−1/(2n+2)} vol E^(k) and E^(k) ∩ H^(k) ⊂ E^(k+1).
Remark 3.4. Note that the proof below also shows that vol E^(k+1) = e^{−Θ(1/n)} vol E^(k). Therefore, the ellipsoid method does not run faster for any function, even a linear one, making it provably slow in practice.
Proof. Note that the ratio vol E^(k+1)/vol E^(k) does not change under any affine transformation, and neither does set containment. Therefore, we can apply a transformation so that A^(k) = I, x^(k) = 0 and g(x^(k)) = e_1. Then H^(k) is simply the halfspace {x : x_1 ≤ 0}. This simplifies our calculation. We need to prove two statements: vol E^(k+1) < e^{−1/(2n+2)} vol E^(k) and E^(k) ∩ H^(k) ⊂ E^(k+1).
Claim 1: vol E^(k+1) < e^{−1/(2(n+1))} vol E^(k).
Note that since g(x^(k)) = e_1, we have A^(k+1) = (n²/(n²−1))(I − (2/(n+1))e_1e_1^⊤). Therefore, we have that
vol E^(k+1)/vol E^(k) = (det A^(k+1)/det A^(k))^{1/2}
= ((n²/(n²−1))^n det(I − (2/(n+1))e_1e_1^⊤))^{1/2}
= ((n²/(n²−1))^{n−1} · (n²/(n²−1)) · ((n−1)/(n+1)))^{1/2}
= ((1 + 1/(n²−1))^{n−1} (1 − 1/(n+1))²)^{1/2}
< exp((1/2)((n−1)/(n²−1) − 2/(n+1))) = exp(−1/(2(n+1)))
where we used 1 + x ≤ e^x for all x.
Claim 2: E^(k) ∩ H^(k) ⊂ E^(k+1).
By the definition of E^(k) and the assumption A^(k) = I, we have
x^(k+1) = −e_1/(n+1)
and, for any x ∈ E^(k) ∩ H^(k), we have that ∥x∥_2 ≤ 1 and x_1 ≤ 0. By direct computation, using the fact that
(I − (2/(n+1))e_1e_1^⊤)^{−1} = I + (2/(n−1))e_1e_1^⊤,
we have
(x + e_1/(n+1))^⊤ (1 − 1/n²)(I + (2/(n−1))e_1e_1^⊤)(x + e_1/(n+1))
= (1 − 1/n²)(∥x∥² + 2x_1(1 + x_1)/(n−1) + 1/(n²−1))
≤ (1 − 1/n²)(1 + 0 + 1/(n²−1)) = 1
where we used that ∥x∥_2 ≤ 1 and x_1(1 + x_1) ≤ 0 (since −1 ≤ x_1 ≤ 0) at the end. This shows that x ∈ E^(k+1).
Exercise 3.5. Show that the ellipsoid E (k+1) computed above is the minimum volume ellipsoid containing
E (k) ∩ H (k) .
Exercise 3.6. Suppose that we used a box instead of an ellipsoid. Could we ensure progress in each
iteration? What about a simplex?

3.3 From Volume to Function Value
Lemma 3.3 shows that the volume of a set containing all minimizers decreases by a constant factor every n steps. In general, knowing that the optimal x* lies in a small volume set does not provide enough information to find a point with small function value. For example, if we only knew that x* lies in the plane {x : x_1 = 0}, we would still need to search for x* over an (n−1)-dimensional space. However, if the set is constructed by the cutting plane framework (Algorithm 4), then we can guarantee that small volume implies that any point in the set has close to optimal function value.
This implies that we can minimize any convex function with ε additive error in O(n² log(1/ε)) iterations. To make the statement more general, we note that the Ellipsoid method can be used for non-differentiable functions. In the next theorem, we allow for a general progress function and an arbitrary support Ω. The convergence argument works for any monotonic, scale-and-shift-invariant progress measure.
Theorem 3.7. Let x^(k) be the sequence of points produced by the cutting plane framework (Algorithm 4) for a convex function f. Let V be a mapping from subsets of R^n to non-negative real numbers satisfying:
1. (Linearity) For any set S ⊆ R^n, any vector y and any scalar α ≥ 0, we have V(αS + y) = αV(S), where αS + y = {αx + y : x ∈ S}.
2. (Monotonicity) For any sets T ⊂ S, we have that V(T) ≤ V(S).
Then, for any set Ω ⊆ E^(0) with V(Ω) > 0, we have
min_{i=1,2,···,k} f(x^(i)) − min_{y∈Ω} f(y) ≤ (V(E^(k))/V(Ω)) · (max_{z∈Ω} f(z) − min_{x∈Ω} f(x)).

Remark 3.8. We can think of V(E) as some way to measure the size of E. It can be the radius, the mean width, or any other measure of size. For the ellipsoid method, we use V(E) = vol(E)^{1/n}, for which we have proved volume decrease in Lemma 3.3. We raise the volume to the power 1/n to satisfy linearity. Also note that we only guarantee that one of the previous query points has a small function value; of course, we can simply use the point with minimum function value.
Proof. Let x* be any minimizer of f over Ω. For any α > V(E^(k))/V(Ω), let S = (1 − α)x* + αΩ. By the linearity of V, we have that
V(S) = αV(Ω) > V(E^(k)).
By monotonicity, S is therefore not a subset of E^(k), and hence there is a point y ∈ S \ E^(k). Since y is not in E^(k), it was eliminated by the subgradient halfspace at some step i ≤ k, namely, for some i ≤ k we have (denoting the subgradient by ∇f)
∇f(x^(i))^⊤(y − x^(i)) > 0.
By the convexity of f, it follows that f(x^(i)) ≤ f(y). Since y ∈ S, we have y = (1 − α)x* + αz for some z ∈ Ω. Thus, the convexity of f implies that
f(x^(i)) ≤ f(y) ≤ (1 − α)f(x*) + αf(z).
Therefore, we have
min_{i=1,2,···,k} f(x^(i)) − min_{x∈Ω} f(x) ≤ α (max_{z∈Ω} f(z) − min_{x∈Ω} f(x)).
Since this holds for any α > V(E^(k))/V(Ω), we have the result.
Combining Lemma 3.3 and Theorem 3.7, we have the following rate of convergence.

Theorem 3.9. Let f be a convex function on R^n, E^(0) be any initial ellipsoid and Ω ⊂ E^(0) be any convex set. Suppose that for any x ∈ E^(0), we can find, in time T, a nonzero vector g(x) such that
f(y) ≥ f(x) for any y such that g(x)^⊤(y − x) ≥ 0.
Then, we have
min_{i=1,2,···,k} f(x^(i)) − min_{y∈Ω} f(y) ≤ (vol(E^(0))/vol(Ω))^{1/n} exp(−k/(2n(n+1))) (max_{z∈Ω} f(z) − min_{x∈Ω} f(x)).
Moreover, each iteration takes O(n² + T) time.

Remark 3.10. We note that for this rate of convergence (and hence the entire algorithm) to be polynomial,
we need some bound on the range of the function value. This often follows from a bound on the diameter of
the support.

Proof. Lemma 3.3 shows that the volume of the maintained ellipsoid decreases by a factor of exp(−1/(2n+2)) in every iteration. Hence, vol^{1/n} decreases by exp(−1/(2n(n+1))) every iteration. The bound follows by applying Theorem 3.7 with V(E) := vol(E)^{1/n}. Using Theorem 3.7, we have
min_{i=1,2,···,k} f(x^(i)) − min_{y∈Ω} f(y) ≤ (vol(E^(k))/vol(Ω))^{1/n} (max_{z∈Ω} f(z) − min_{x∈Ω} f(x))
≤ (vol(E^(0))/vol(Ω))^{1/n} exp(−k/(2n(n+1))) (max_{z∈Ω} f(z) − min_{x∈Ω} f(x)).
Next, we note that the proof of Theorem 3.7 only used the fact that one side of the halfspace defined by the gradient has higher function value. Therefore, we can replace the gradient with the vector g(x).
To bound the time per iteration, note that we make one query to the separation oracle per iteration, and then compute the next ellipsoid using the formulas in Algorithm 5. The most time-consuming operation is multiplying an n × n matrix by an n-vector, which has complexity O(n²).

This theorem can be used to solve many problems in polynomial time. As an illustration, we show how
to solve linear programs in polynomial time here.

Theorem 3.11. Consider a linear program min_{x∈P} c^⊤x where P = {x : Ax ≥ b}. Let the diameter of P be R := max_{x∈P} ∥x∥_2 and its volume radius be r := vol(P)^{1/n}. Then, we can find x ∈ P for which
c^⊤x − min_{y∈P} c^⊤y ≤ ε · (max_{y∈P} c^⊤y − min_{y∈P} c^⊤y)
in O(n²(n² + nnz(A)) log(R/(rε))) time, where nnz(A) is the number of non-zero elements in A.
Remark 3.12. If the dimension n is constant, this algorithm is nearly linear time (linear in the number of
constraints)!

Proof. For the linear program min_{Ax≥b} c^⊤x, the function we want to minimize is
L(x) = c^⊤x + δ_{Ax≥b}(x), where δ_{Ax≥b}(x) = 0 if a_i^⊤x ≥ b_i for all i, and +∞ otherwise.   (3.3)
For this function L, we can use the separation oracle g(x) = c if Ax ≥ b and g(x) = −a_i if a_i^⊤x < b_i. If there are many violated constraints, any one of them will do.
In this case, we can set Ω = P. We can simply pick E^(0) to be the ball of radius R centered at 0. We apply Theorem 3.9 to find x such that
c^⊤x − min_{y∈P} c^⊤y ≤ ε · (max_{y∈P} c^⊤y − min_{y∈P} c^⊤y)
in time O(n²(n² + nnz(A)) log(R/(rε))). Note that the complexity of computing the matrix-vector product Ax is O(nnz(A)).
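The separation oracle used in this proof is straightforward to implement; a minimal sketch (ours) for P = {x : Ax ≥ b} with objective c is below. Each call costs one matrix-vector product, i.e., O(nnz(A)) time.

import numpy as np

def lp_separation_oracle(A, b, c, x):
    """Oracle for minimizing c^T x over {Ax >= b}: return the objective c
    if x is feasible, otherwise the negated row of a violated constraint."""
    slack = A @ x - b                 # one O(nnz(A)) matrix-vector product
    i = np.argmin(slack)
    if slack[i] >= 0:
        return c                      # feasible: cut along the objective
    return -A[i]                      # infeasible: cut along the violated row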
Exercise 3.13. Give a bound on R as a polynomial in n, ⟨A⟩, ⟨b⟩, where ⟨A⟩, ⟨b⟩ are the numbers of bits
needed to represent A, b whose entries are rationals.

To get the solution exactly, i.e., ε = 0, we need to assume the linear program has integral (or rational) coefficients, and then the running time will depend on the sizes of the numbers in the matrix A and in the vectors b and c. It is still open how to solve linear programs in time bounded by a polynomial in only the number of variables and constraints (and not the bit sizes of the coefficients). Such a running time is called strongly polynomial.

Open Problem. Can we solve linear programs in strongly polynomial time?

3.4 Center of Gravity Method

In the Ellipsoid method, we used an ellipsoid as the current set, and its center as the next query point. In the center of gravity method, we start with any bounded convex set containing a minimizer, e.g., a large enough cube, and the set maintained is simply the intersection of all halfspaces used so far with the original set. The query point will be the center of gravity (or barycenter) of the current set. Recall that the center of gravity of a bounded set is the average of points in the set, i.e.,
z_K = (1/vol(K)) ∫_K x dx.
The measure of progress will once again be the volume (or more precisely, the volume radius) of the current set. It is clear that the volume can only decrease in each iteration. But at what rate? The following classical theorem shows that the volume of the convex body decreases by a constant factor (to no more than 1 − 1/e of the original) when using the exact center of gravity.

Theorem 3.14 ([29]). Let K be a convex body in R^n with center of gravity z. Let H be any halfspace containing z. Then,
vol(K ∩ H) ≥ (n/(n+1))^n vol(K).
Note that the constant on the RHS is at least 1/e. We prove the theorem later in this chapter. Unfortunately, computing the center of gravity even of a polytope is #P-hard [62]. For the purpose of efficient approximations, it is important to establish a stable version of the theorem that does not require an exact center of gravity.
Recall that a nonnegative function is logconcave if its logarithm is concave, i.e., for any x, y ∈ R^n and any λ ∈ [0, 1],
f(λx + (1 − λ)y) ≥ f(x)^λ f(y)^{1−λ}.
We refer to Section 1.7 for some background on logconcave functions. A distribution p is isotropic if the mean of a random variable drawn from the distribution is zero and the covariance is the identity matrix. The randomized center of gravity method is defined as follows:
Algorithm 6: RandomizedCenterOfGravity
Input: Initial convex set E^(0).
for k = 0, · · · do
    if E^(k) is small enough then return x^(k);
    Let y^(1), . . . , y^(N) be uniform random points from E^(k) and set
        x^(k) = (1/N) Σ_{i=1}^N y^(i),
        E^(k+1) = E^(k) ∩ H^(k),
    where H^(k) := {x ∈ R^n : g(x^(k))^⊤(x − x^(k)) ≤ 0} is obtained by querying g(x^(k)).
end
Remark 3.15. The question of how to sample (nearly) uniform random points is an important one that we will address in detail in this book. Using sampling to approximate the center of gravity leads to the reduction in the time per iteration from n^n to n^4 (Table 3.1).
To prove the convergence of the method, we will use a robust version of Theorem 3.14, which will give a
similar result despite using only an approximate center of gravity.
Theorem 3.16 (Robust Grünbaum). Let p be an isotropic logconcave distribution, namely E_{x∼p} x = 0 and E_{x∼p} xx^⊤ = I. For any unit vector θ ∈ R^n and t ∈ R, we have
P_{x∼p}(x^⊤θ ≥ t) ≥ 1/e − |t|.
Proof. By taking the marginal with respect to the direction θ, we can assume the distribution is one-dimensional. Let P(t) = P_{x∼p}(x^⊤θ ≤ t). Note that P is the convolution of p and 1_{(−∞,0]}. Hence, it is logconcave (Lemma 1.25). By a limit argument, we can assume P(−M) = 0 and P(M) = 1 for some large enough M (to be rigorous, we do the proof below for finite M and a RHS ϵ(M) instead of zero, then take the limit M → ∞). Since E_{x∼p} x = 0, we have that
∫_{−M}^{M} t (dP(t)/dt) dt = 0.
Integration by parts gives that ∫_{−M}^{M} P(t)dt = M. Note that P(t) is increasing and logconcave; if P(0) were too small, it would make ∫_{−M}^{M} P(t)dt too small. To be precise, since P is logconcave, i.e., −log P(t) is convex, we have
−log P(t) ≥ −log P(0) − (P′(0)/P(0)) t.
Or we simply write P(t) ≤ P(0)e^{αt} for some α. Hence,
M = ∫_{−M}^{M} P(t)dt ≤ ∫_{−∞}^{1/α} P(0)e^{αt}dt + ∫_{1/α}^{M} 1 dt = eP(0)/α + M − 1/α.
This shows that P(0) ≥ 1/e. Applying the same argument to −θ shows that P_{x∼p}(x^⊤θ ≥ 0) ≥ 1/e as well.
Next, Lemma 3.17 shows that max_x p(x) ≤ 1. Therefore, the cumulative distribution P is 1-Lipschitz and we have
P_{x∼p}(x^⊤θ ≥ t) ≥ P_{x∼p}(x^⊤θ ≥ 0) − |t| ≥ 1/e − |t|.
Lemma 3.17. Let p be a one-dimensional isotropic logconcave density. Then max p(x) ≤ 1.
For a proof of this (and for other properties of logconcave functions), we refer the reader to [54].
Exercise 3.18. Give a short proof that maxx p(x) = O(1) for any one-dimensional isotropic logconcave
density.

Using the robust Grünbaum theorem 3.16, we get the following algorithm, which uses uniform random
points from the current set. Obtaining such a random sample algorithmically is an interesting problem that
we will study in the second part of this book.

Lemma 3.19. Suppose y^(1), . . . , y^(N) are i.i.d. uniform random points from a convex body K and ȳ = (1/N) Σ_{i=1}^N y^(i). Then for any halfspace H not containing ȳ,
E(vol(K ∩ H)) ≤ (1 − 1/e + √(n/N)) vol(K).
Proof. Without loss of generality, we assume that K is in isotropic position, i.e., E_K(y^(i)) = 0 and E_K(y^(i)(y^(i))^⊤) = I. Then we have E(ȳ) = 0 and
E∥ȳ∥² = (1/N²) Σ_{i=1}^N E∥y^(i)∥² = (1/N) E∥y^(1)∥² = n/N.
Therefore,
E∥ȳ∥ ≤ √(E∥ȳ∥²) = √(n/N).
Thus, we can apply Theorem 3.16 with t = √(n/N) to bound E(vol(K ∩ H))/vol(K).

Theorem 3.7 readily gives the following guarantee for convex optimization, again using volume radius as
the measure of progress.

Theorem 3.20. Let f be a convex function on R^n, E^(0) be any initial set and Ω ⊂ E^(0) be any convex set. Suppose that for any x ∈ E^(0), we can find a nonzero vector g(x) such that
f(y) ≥ f(x) for any y such that g(x)^⊤(y − x) ≥ 0.
Then, for the center of gravity method with N = 10n, we have
E(min_{i=1,2,···,k} f(x^(i))) − min_{y∈Ω} f(y) ≤ (0.95)^{k/n} (vol(E^(0))/vol(Ω))^{1/n} (max_{z∈Ω} f(z) − min_{x∈Ω} f(x)).

Now we give a geometric proof of Theorem 3.14. Note that one can modify the proof of Theorem 3.16 to
get another proof.

Proof. Since affine transformations do not affect ratios of volumes, without loss of generality, assume that the center of gravity of K is the origin and H is the halfspace {x : x_1 ≤ 0}. For each point t ∈ R, let A(t) = K ∩ {x : x_1 = t} be the (n−1)-dimensional slice of K with x_1 = t. Define r(t) as the radius of the (n−1)-dimensional ball with the same (n−1)-dimensional volume as A(t).
The goal of the proof is to show that the smallest possible halfspace volume is achieved for a cone by a cut perpendicular to its axis. In the first step, we symmetrize K as follows: replace each cross-section A(t) by a ball of the same volume, centered at (t, 0, . . . , 0)^⊤. We claim that the resulting rotationally symmetric body is convex. To see this, note that all we have to show is that the radius function r(t) is concave. For any s, t ∈ R and any λ ∈ [0, 1], we have by convexity of K that
λA(s) + (1 − λ)A(t) ⊆ A(λs + (1 − λ)t)
and, by the Brunn–Minkowski theorem (Theorem 1.27) applied to A(s), A(t), the function vol_{n−1}(A(s))^{1/(n−1)} is concave, so we have
vol_{n−1}(A(λs + (1 − λ)t))^{1/(n−1)} ≥ λ vol_{n−1}(A(s))^{1/(n−1)} + (1 − λ) vol_{n−1}(A(t))^{1/(n−1)}.
Figure 3.2: Affine transformations for the center of gravity method
From this and the definition of r(t), it follows that
r(λs + (1 − λ)t) ≥ λr(s) + (1 − λ)r(t)
as desired.
Next consider the subset K_1 = K ∩ {x : x_1 ≤ 0}. We replace this subset with a cone C having the same base A(0) and apex at some point along the e_1 axis, so that the volume of the cone is the same as vol(K_1). Using the concavity of the radial function, this transformation can only decrease the center of gravity along e_1. Therefore, proving a lower bound for the transformed body will give a lower bound for K. So assume we do this and the center of gravity is the origin. Next, extend the cone to the right, so that it remains a rotational cone, and the volume in the positive halfspace along e_1 is the same as vol(K \ K_1). Once again, the center of gravity can only move to the left, and so the volume of K_1 can only decrease by this transformation. At the end we have shown that the lower bound for any convex body follows by proving it for a rotational cone with axis along the normal to the halfspace. The intersection of this cone with the halfspace is the original cone scaled down by the ratio of the distance from the apex to the center of gravity, to the height of the cone. So all that remains to be done is to compute the relative distance of the center of gravity from the apex, which is exactly n/(n+1). Now we compute the volume ratio:
vol(K_1)/vol(K) = (n/(n+1))^n.

Exercise 3.21. Show that for a cone of height h in R^n, i.e., the convex hull of a convex body K ⊂ R^{n−1} with a single point a ∈ R^n at distance h from the hyperplane H containing K, the center of gravity z of the cone is at distance h/(n+1) from H.

To conclude this section, we note that the number of separation oracle queries made by the center-of-
gravity cutting plane method is asymptotically the best possible.
Theorem 3.22. Any algorithm that solves Problem 3.1 using a separation oracle needs to make Ω(n log(R/ϵ)) queries to the oracle.

Proof. Suppose K is a cube of side length ϵ contained in the cube [0, R]^n. Imagine a tiling of the big cube by cubes of side length ϵ. Consider the oracle that always returns an axis-parallel halfspace that does not cut any little cube and contains at least half of the volume of the remaining region, i.e., the set given by the original cube intersected with all halfspaces given by the oracle so far. This is always possible since for any halfspace, either the halfspace or its complement contains at least half the volume of any set. Thus each query at best halves the remaining volume. To solve the problem, the algorithm needs to cut down to a set of volume ϵ^n starting from a set of volume R^n. Thus it needs at least n log_2(R/ϵ) queries.

Figure 3.3: Complexity using separation oracle

Discussion
In later chapters we will see how to implement each iteration of the center of gravity method in polynomial time. Computing the exact center of gravity is #P-hard even for a polytope [62], but we can find an arbitrarily close approximation in randomized polynomial time via sampling. The method generalizes to certain noisy computation models, e.g., when the oracle reports a function value that is within a bounded additive error, i.e., a noisy function oracle.
3.5 Sphere and Parabola Methods
In Section 2.2, we proved that gradient descent converges at the rate (1 − µ/L)^k in function value, assuming the function satisfies µ·I ⪯ ∇²f(x) ⪯ L·I for all x. Hence, if L/µ is not too large, this rate of decrease of the function value can be much better than the (1 − 1/n²)^k rate of the ellipsoid method or the (1 − 1/n)^k rate of the center of gravity method (recall that the convergence rate in function value is n times slower than the convergence rate in volume). In this section, we show how to modify the ellipsoid method to get a faster convergence rate when L/µ is small. One can view this whole section as just an interpretation of accelerated gradient descent (which we haven't seen yet) in the cutting plane framework. In a later section, we will give another interpretation.

3.5.1 Sphere Method

We have been using convexity to find a halfspace containing the optimum with the current point on its boundary. For a strongly convex function, any optimal x* is contained in a strictly smaller region than a halfspace. In particular, we have
f(x*) ≥ f(x) + ∇f(x)^⊤(x* − x) + (µ/2)∥x* − x∥_2^2.
Completing the square, we have that
(µ/2)∥(x* − x) + (1/µ)∇f(x)∥_2^2 − ∥∇f(x)∥_2^2/(2µ) ≤ −(f(x) − f(x*)).
Using x^+ := x − (1/L)∇f(x) and x^{++} := x − (1/µ)∇f(x), we can write
∥x* − x^{++}∥_2^2 ≤ ∥∇f(x)∥_2^2/µ² − (2/µ)(f(x) − f(x*)).   (3.4)

To use this formula in the cutting plane framework, we need a crude upper bound on f(x*). One can simply use f(x*) ≤ f(x). Or, we can use Lemma 2.7 and get
f(x*) ≤ f(x^+) = f(x − (1/L)∇f(x)) ≤ f(x) − (1/(2L))∥∇f(x)∥².
Putting this into (3.4), we see that
∥x* − x^{++}∥_2^2 ≤ (1/µ²)(∥∇f(x)∥_2^2 − 2µ(f(x) − f(x*)))
= (1/µ²)∥∇f(x)∥_2^2 − (2/µ)(f(x) − f(x^+)) − (2/µ)(f(x^+) − f(x*))
≤ (1/µ²)∥∇f(x)∥_2^2 − (2/µ)·(1/(2L))∥∇f(x)∥_2^2 − (2/µ)(f(x^+) − f(x*))
= ((1 − µ/L)/µ²)∥∇f(x)∥_2^2 − (2/µ)(f(x^+) − f(x*)).   (3.5)
Therefore, using the trivial bound of zero for the second term on the RHS, x* lies in a ball centered at x^{++} with radius at most
√(1 − µ/L) · ∥∇f(x)∥_2/µ.   (3.6)
This suggests using balls instead of ellipsoids in a cutting plane algorithm; they would certainly be more efficient to maintain! We arrive at the following algorithm.
Algorithm 7: SphereMethod
Input: Initial point x^(0) ∈ R^n, strong convexity parameter µ, Lipschitz gradient parameter L.
Q^(0) ← R^n.
for k = 0, · · · do
    Set Q = {x ∈ R^n : ∥x − (x^(k) − (1/µ)∇f(x^(k)))∥² ≤ ((1 − µ/L)/µ²)·∥∇f(x^(k))∥_2^2}.
    Q^(k+1) ← minSphere(Q ∩ Q^(k)) where minSphere(K) is the smallest sphere covering K.
    x^(k+1) ← center of Q^(k+1).
end
To analyze SphereMethod, we need the following lemma, which is illustrated in Figure 3.4.
Lemma 3.23. For any g ∈ R^n and ϵ ∈ (0, 1), we have B(0, 1) ∩ B(g, ∥g∥_2 √(1−ϵ)) ⊂ B(x, √(1−ϵ)) for some x.
Proof. By symmetry, it suffices to consider the two-dimensional case, and to assume that g = ae_1. If a ≤ 1, we can simply pick x = g. Otherwise, let (x, 0) be the center of the smallest ball containing the required intersection, and y be its radius (see Figure 3.4). We have x² + y² = 1 and (x − a)² + y² = (1 − ϵ)a². This implies that
x = (1 + ϵa²)/(2a)
and so
y² = 1 − ϵ/2 − 1/(4a²) − ϵ²a²/4 ≤ 1 − ϵ
as claimed.
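The computation in the proof gives a closed form for the smallest enclosing ball; the following sketch (ours) reproduces it and numerically checks the radius bound √(1−ϵ).

import numpy as np

def min_ball_of_intersection(g, eps):
    """Smallest ball containing B(0,1) ∩ B(g, |g| sqrt(1-eps)), following
    the proof of Lemma 3.23 (the center lies on the segment from 0 to g)."""
    a = np.linalg.norm(g)
    if a <= 1.0:
        return g, a * np.sqrt(1 - eps)          # the small ball is inside B(0,1)
    t = (1 + eps * a**2) / (2 * a)              # distance of the center from the origin
    radius = np.sqrt(1 - t**2)
    return t * g / a, radius

center, r = min_ball_of_intersection(np.array([2.0, 0.0]), eps=0.25)
assert r <= np.sqrt(1 - 0.25)                   # Lemma 3.23 guarantee
print(center, r)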

Lemma 3.24. Let the measure of progress be V(Q) = radius(Q). Then, we have that x* ∈ Q^(k) and V(Q^(k+1)) ≤ √(1 − µ/L) · V(Q^(k)) for all k.
Remark 3.25. The function value decrease follows from Theorem 3.7.
Proof Sketch. The fact x* ∈ Q^(k) follows directly from the definition of Q. For the decrease of the radius, suppose that
Q^(k) = {x ∈ R^n : ∥x − x^(k)∥ ≤ R^(k)}.
Then, the new ball is given by
Q^(k+1) = minSphere({∥x − (x^(k) − (1/µ)∇f(x^(k)))∥² ≤ ((1 − µ/L)/µ²)·∥∇f(x^(k))∥_2^2} ∩ {∥x − x^(k)∥ ≤ R^(k)}).
To compute radius(Q^(k+1))²/radius(Q^(k))², we can assume x^(k) = 0 and R^(k) = 1 and let g = ∇f(0)/µ. Hence, we have
Q^(k+1) = minSphere({∥x − g∥² ≤ (1 − µ/L)·∥g∥²} ∩ {∥x∥ ≤ 1}).
Now Lemma 3.23 (with ϵ = µ/L) shows that radius(Q^(k+1))² ≤ 1 − µ/L.

Note that this gives the same convergence rate as gradient descent for strongly convex functions. Each
iteration is much faster than the O(n2 ) time of the Ellipsoid method.
Figure 3.4: The left diagram shows B(0, 1) ∩ B(g, |g|√(1−ϵ)) ⊂ B(x, √(1−ϵ)) (Lemma 3.23): the intersection shrinks at the same rate when only one of the balls shrinks. The right diagram shows that the intersection B(0, √(1−ϵ|g|²)) ∩ B(g, |g|√(1−ϵ)) shrinks much faster when both balls shrink by the same absolute amount (Exercise 3.30).

3.5.2 Parabola Method Via Epigraph

In previous sections, we have discussed cutting plane methods that maintain a region that contains the minimizer x* of f. The proof of such a cutting plane method relies on the following inequality:
f(x) > f(x*) ≥ f(x) + ⟨∇f(x), x* − x⟩.   (3.7)
Notice that the left-hand side is a strict inequality (unless we have already solved the problem). We infer a halfspace containing the optimum using the subgradient at x, namely
⟨∇f(x), x* − x⟩ ≤ 0.
As the algorithm proceeds, we find a new point x^(new) such that f(x^(new)) < f(x). Therefore, the original inequality can be strengthened to
f(x^(new)) > f(x) + ⟨∇f(x), x* − x⟩
or
⟨∇f(x), x* − x⟩ < −(f(x) − f(x^(new))).
This suggests that we should move earlier halfspaces and thereby reduce the measure of the next set. We expect that merely updating the halfspaces can improve the convergence rate from 1 − µ/L to 1 − √(µ/L) because of the right diagram in Figure 3.4. An efficient way to manage all this information is to directly maintain a region that contains (x*, f(x*)). Now, we can view the inequality f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ as a cutting plane of the epigraph of f, and we do not need to update previous cutting planes anymore.
The next algorithm is an epigraph cutting plane method.
Algorithm 8: ParabolaMethod
Input: Initial point x^(0) ∈ R^n and the strong convexity parameter µ.
q^(0)(y) ← −∞.
for k = 0, · · · do
    Set q^(k+1/2)(y) = f(x^(k)) + ∇f(x^(k))^⊤(y − x^(k)) + (µ/2)∥y − x^(k)∥².
    Let q^(k+1) = maxParabola(max(q^(k+1/2), q^(k))) where maxParabola(q) outputs the parabolic
    function p such that p(x) ≤ q(x) for all x, and p maximizes min_x p(x).
    // Alternatively, one can use x^(k+1/2) ← x^(k) − (1/L)∇f(x^(k)) below.
    Let x^(k+1/2) = lineSearch(x^(k), −∇f(x^(k))) where
        lineSearch(x, y) = argmin_{z = x + ty, t ≥ 0} f(z).
    Let x^(k+1) = lineSearch(x^(k+1/2), c^(k+1) − x^(k+1/2)) where c^(k+1) is the minimizer of q^(k+1)(y).
end
As in gradient descent, to avoid using the parameter L, we do a line search from x^(k) along the direction −∇f(x^(k)). The subroutine maxParabola has an explicit formula [21].
Algorithm 9: q = maxParabola(max(q_A, q_B))
Input: q_A(x) = v_A + (µ/2)∥x − c_A∥_2^2 and q_B(x) = v_B + (µ/2)∥x − c_B∥_2^2.
Compute λ = proj_{[0,1]}(1/2 + (v_A − v_B)/(µ∥c_A − c_B∥²)).
c_λ = λc_A + (1 − λ)c_B.
v_λ = λv_A + (1 − λ)v_B + (µ/2)λ(1 − λ)∥c_A − c_B∥².
Output: q(x) = v_λ + (µ/2)∥x − c_λ∥_2^2.
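A direct transcription of this formula (a sketch; parabolas are represented by their minimum value v and minimizer c):

import numpy as np

def max_parabola(vA, cA, vB, cB, mu):
    """Largest q(x) = v + (mu/2)||x - c||^2 lower-bounding both inputs,
    following the explicit formula of Algorithm 9."""
    d2 = np.sum((cA - cB) ** 2)
    if d2 == 0:
        return min(vA, vB), cA
    lam = np.clip(0.5 + (vA - vB) / (mu * d2), 0.0, 1.0)
    c = lam * cA + (1 - lam) * cB
    v = lam * vA + (1 - lam) * vB + 0.5 * mu * lam * (1 - lam) * d2
    return v, c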

Exercise 3.26. Show that the formula for maxParabola computes the optimal parabola.
The key fact we will use from the formula is that when 0 < λ < 1, we have
v_λ = v_A + (µ/8) (∥c_A − c_B∥² + (2/µ)(v_B − v_A))² / ∥c_A − c_B∥².
This says that the quadratic lower bound improves a lot whenever (µ/2)∥c_A − c_B∥² ≫ v_B − v_A or (µ/2)∥c_A − c_B∥² ≪ v_B − v_A. Using this, we can analyze the ParabolaMethod.
Theorem 3.27. Assume that f is µ-strongly convex with L-Lipschitz gradient. Let r_k² = f(x^(k)) − min_y q^(k)(y). Then, we have that
r_{k+1}² ≤ (1 − √(µ/L)) · r_k².
In particular, we have that
f(x^(k+1)) − f* ≤ (1 − √(µ/L))^k · (1/(2µ))∥∇f(x^(0))∥².
Remark 3.28. Note that the squared radius of {y : q^(k)(y) = f(x^(k))} is (2/µ)(f(x^(k)) − min_y q^(k)(y)), because q^(k)(y) = min_y q^(k)(y) + (µ/2)∥y − arg min_x q^(k)(x)∥². Hence, r_k² measures (up to the factor 2/µ) the squared radius of our quadratic lower bound. To relate to the cutting plane framework, we can view the set as {y : q^(k)(y) ≤ f(x^(k))} and the measure as V = f(x^(k)) − min_y q^(k)(y).
Proof. Fix some k. We write q^(k)(y) = v_A + (µ/2)∥y − c_A∥² and q^(k+1/2)(y) = v_B + (µ/2)∥y − c_B∥², with
c_A = c^(k),  c_B = x^(k) − ∇f(x^(k))/µ  and  v_B = f(x^(k)) − ∥∇f(x^(k))∥²/(2µ).
Using the notation in maxParabola, we write q^(k+1)(y) = v_λ + (µ/2)∥y − c_λ∥². Note that r_{k+1}² = f(x^(k+1)) − v_λ and r_k² = f(x^(k)) − v_A. Therefore, we have
(r_k² − r_{k+1}²)/r_k² = (f(x^(k)) − f(x^(k+1)) + v_λ − v_A)/r_k².   (3.8)
To bound the right hand side, it suffices to bound v_A and v_λ. From the description of the algorithm maxParabola, we see that there are three cases: λ = 0, λ = 1 and 0 < λ < 1. We only focus on proving the nontrivial case λ ∈ (0, 1). In this case, we have that
v_λ = v_A + (µ/8) (∥c_A − c_B∥² + (2/µ)(v_B − v_A))² / ∥c_A − c_B∥²
    = v_A + (µ/8) (∥c_A − c_B∥² + (2/µ)(f(x^(k)) − v_A) − ∥∇f(x^(k))∥²/µ²)² / ∥c_A − c_B∥².
Since f ≥ q^(k), we have f(x^(k)) ≥ q^(k)(x^(k)) ≥ min_x q^(k)(x) = v_A. Next, we claim that ∥c_A − c_B∥² ≥ ∥∇f(x^(k))∥²/µ². Using these two facts, we can prove that
v_λ ≥ v_A + (µ/8) ((2/µ)(f(x^(k)) − v_A))² / (∥∇f(x^(k))∥²/µ²) = v_A + µ·r_k⁴/(2∥∇f(x^(k))∥²).
Putting this into (3.8), we have
(r_k² − r_{k+1}²)/r_k² ≥ (f(x^(k)) − f(x^(k+1)))/r_k² + µ·r_k²/(2∥∇f(x^(k))∥²)
≥ (f(x^(k)) − f(x^(k+1/2)))/r_k² + µ·r_k²/(2∥∇f(x^(k))∥²)
≥ ∥∇f(x^(k))∥²/(2L·r_k²) + µ·r_k²/(2∥∇f(x^(k))∥²)   (3.9)
≥ √(µ/L)
where the second inequality uses f(x^(k+1)) ≤ f(x^(k+1/2)) (x^(k+1) is a line search through x^(k+1/2)), the third inequality is due to the fact that x^(k+1/2) is a line search from x^(k) along −∇f(x^(k)) together with the assumption on L (Lemma 2.7), and the last inequality follows from the AM–GM inequality.
For the final conclusion, we note that
q^(1)(y) = f(x^(0)) + ∇f(x^(0))^⊤(y − x^(0)) + (µ/2)∥y − x^(0)∥²
         = f(x^(0)) − (1/(2µ))∥∇f(x^(0))∥² + (µ/2)∥y − (x^(0) − (1/µ)∇f(x^(0)))∥².
Hence, we have v^(1) = f(x^(0)) − (1/(2µ))∥∇f(x^(0))∥², and hence r_1² = f(x^(1)) − v^(1) ≤ (1/(2µ))∥∇f(x^(0))∥².
To prove the claim, we note that x^(k) is the result of a line search of f between c^(k) and some point. Therefore, we have ∇f(x^(k)) ⊥ (x^(k) − c^(k)), and hence
∥c_A − c_B∥² = ∥c^(k) − x^(k) + ∇f(x^(k))/µ∥² ≥ ∥∇f(x^(k))∥²/µ².

Remark 3.29. We can view (3.9) as the key equation of the proof above. It shows that the progress is roughly ∥∇f∥²/(L·r_k²) + µ·r_k²/∥∇f∥², where the first term comes from the progress on the function value and the second term comes from the curvature of the cutting sphere.
Exercise 3.30. Prove the following extension of Lemma 3.23: there exists x such that B(0, √(1 − ϵ|g|²)) ∩ B(g, |g|√(1 − ϵ)) ⊂ B(x, √((1 − ϵ)(1 − ϵ|g|²))).
3.5.3 Discussion
This section was about the idea of managing cutting planes, and as a byproduct we get an accelerated rate of convergence. As we will see later, standard accelerated gradient descent does not use line search and achieves the rate 1 − √(µ/L). However, it seems that the use of line search helps in practice and that, with a careful implementation, line search can be as cheap as gradient computation. For more difficult problems, one may want to store multiple quadratic lower bounds (see [21]).

3.6 Lower Bounds

In this chapter, we have discussed a few cutting plane methods. In particular, we showed that O(min(n, √(L/µ)) log(1/ϵ)) gradient computations suffice. We conclude this section by showing that min(n, √(L/µ)) is in fact optimal among gradient-based methods. For convenience, the reader may take µ = 1 and L = κ.

Theorem 3.31. Fix any L ≥ µ > 0. Consider the function
f(x) = −((L−µ)/4)x_1 + ((L−µ)/8) Σ_{i=1}^{n−1}(x_i − x_{i+1})² + ((L−µ)/8)x_1² + (µ/2) Σ_{i=1}^{n} x_i² + ((√(Lµ) − µ)/4)x_n²   (3.10)
which satisfies µ·I ⪯ ∇²f(x) ⪯ L·I. Assume that our algorithm satisfies x^(k) ∈ span(x^(0), ∇f(x^(0)), · · ·, ∇f(x^(k−1))) with the initial point x^(0) = 0. Then, for k < n,
f(x^(k)) − min_x f(x) ≥ (µ/L)^{3/2} ((√(L/µ) − 1)/(√(L/µ) + 1))^{2k} (f(x^(0)) − min_x f(x)).

Proof. First, we check the strong convexity and smoothness. Note that f(x) = −((L−µ)/4)x_1 + (1/2)x^⊤Ax for some matrix A; hence ∇²f(x) = A, and we have
θ^⊤∇²f(x)θ = ((L−µ)/4) Σ_{i=1}^{n−1}(θ_i − θ_{i+1})² + ((L−µ)/4)θ_1² + µ Σ_{i=1}^{n} θ_i² + ((√(Lµ) − µ)/2)θ_n².
For the upper bound ∇²f(x) ⪯ L·I, we note that
θ^⊤∇²f(x)θ ≤ ((L−µ)/4) Σ_{i=1}^{n−1}(2θ_i² + 2θ_{i+1}²) + ((L−µ)/4)θ_1² + µ Σ_{i=1}^{n} θ_i² + ((L−µ)/2)θ_n²
≤ (L−µ) Σ_{i=1}^{n} θ_i² + µ Σ_{i=1}^{n} θ_i² = L Σ_{i=1}^{n} θ_i²,
where we used √(Lµ) ≤ L.
For the lower bound ∇²f(x) ⪰ µ·I, we note that
θ^⊤∇²f(x)θ ≥ µ Σ_{i=1}^{n} θ_i².
To lower bound the error, we note that the gradient at x^(0) is of the form (⋆, 0, 0, · · ·), and hence by the assumption x^(1) = (⋆, 0, 0, · · ·) and ∇f(x^(1)) = (⋆, ⋆, 0, · · ·). By induction, only the first k coordinates of x^(k) are non-zero.
Now, we compute the minimizer of f. Let x* be the minimizer of f(x). By the optimality conditions, we have that
−(L−µ)/4 + ((L−µ)/4)(x*_1 − x*_2) + ((L−µ)/4)x*_1 + µx*_1 = 0,
((L−µ)/4)(x*_i − x*_{i−1}) + ((L−µ)/4)(x*_i − x*_{i+1}) + µx*_i = 0 for i ∈ {2, 3, · · ·, n−1},
((L−µ)/4)(x*_n − x*_{n−1}) + ((√(Lµ) + µ)/2)x*_n = 0.
By direct substitution, one checks that x*_i = q^i with q := (√(L/µ) − 1)/(√(L/µ) + 1) is a solution of the above equations. Now, we note that
∥x^(k) − x*∥_2^2 ≥ Σ_{i=k+1}^{n} q^{2i} ≥ q^{2(k+1)}
and that
∥x^(0) − x*∥_2^2 ≤ Σ_{i=1}^{∞} q^{2i} = q²/(1 − q²) = q² (√(L/µ) + 1)²/(4√(L/µ)).
Now, by the smoothness and the strong convexity of f, we have
(f(x^(k)) − f(x*))/(f(x^(0)) − f(x*)) ≥ (µ/L) · ∥x^(k) − x*∥_2^2/∥x^(0) − x*∥_2^2 ≥ (µ/L)^{3/2} q^{2k},
where the last step uses 4√(L/µ)/(√(L/µ) + 1)² ≥ √(µ/L).

Note that this worst-case function arises naturally in many problems, so it is a problem we do need to address. In some sense, the proof points out a common issue for any algorithm that only uses gradient information. Given any convex function, we construct the dependence graph G on the set of variables x_i by connecting x_i to x_j if ∇f(x)_i depends on x_j or ∇f(x)_j depends on x_i (given all other variables). Note that the dependence graph G of the worst-case function above is simply an n-vertex path, whose diameter is n − 1. Also, note that gradient descent can only transmit information from one vertex to another in each iteration. Therefore, it takes at least Ω(diameter) iterations to solve the problem unless we know the solution is sparse (when L/µ is small). However, we note that this is not a lower bound for all algorithms.
The problem (3.10) belongs to a general class of functions called Laplacian systems and it can be solved
in nearly linear time using spectral graph theory.
Chapter 4

Reduction

4.1 Equivalences between Oracles


4.1.1 Oracles for Convex Sets
Grötschel, Lovász and Schrijver [28] defined five different oracles to access convex sets, showed they are equivalent, and used them to get polynomial-time algorithms for a variety of combinatorial problems (including the first ones in many cases).
Here are four basic oracles¹ for a convex set K ⊆ R^n. Later we will allow for error parameters in each oracle, i.e., approximate versions of all the oracles below. We begin with exact oracles for simplicity.
Definition 4.1 (Membership Oracle (MEM)). Queried with a vector y ∈ R^n, the oracle either
• asserts that y ∈ K, or
• asserts that y ∉ K.
Definition 4.2 (Separation Oracle (SEP)). Queried with a vector y ∈ R^n, the oracle either
• asserts that y ∈ K, or
• finds a unit vector c ∈ R^n such that c^⊤x ≤ c^⊤y for all x ∈ K.
Definition 4.3 (Validity Oracle (VAL)). Queried with a unit vector c ∈ R^n, the oracle either²
• outputs max_{x∈K} c^⊤x, or
• asserts that K is empty.
Definition 4.4 (Optimization Oracle (OPT)). Queried with a unit vector c ∈ R^n, the oracle either
• finds a vector y ∈ K such that c^⊤x ≤ c^⊤y for all x ∈ K, or
• asserts that K is empty.
According to the definitions, the separation oracle gives more information than the membership oracle, and the optimization oracle gives more information than the validity oracle. Depending on the problem, usually one of the oracles will be the preferred or more natural way to access the convex set. For example, for the polytope given by {Ax ≥ b}, the separation oracle is the preferred way because the membership oracle takes as much time as the separation oracle, while both validity and optimization involve solving a linear program. On the contrary, the preferred oracle for the convex set conv({a_i}) is the optimization oracle, because max_{x∈conv({a_i})} θ^⊤x can be solved by only checking x = a_i for each i. In combinatorial optimization, many polytopes have exponentially many vertices and constraints, but one can use combinatorial structure to solve the optimization problem efficiently.
Example 4.5. The spanning tree polytope of an undirected graph G = (V, E) is given by
P = {x ∈ R_+^{|E|} : Σ_{e∈E} x_e = |V| − 1, Σ_{(u,v)∈E∩(S×S)} x_{(u,v)} ≤ |S| − 1 for all S ⊆ V, |S| ≥ 1}.
The extreme points of this polytope can be shown to exactly correspond to the indicator vectors of spanning trees of G. Thus, the optimization oracle in this case is to simply find a maximum cost spanning tree.
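The optimization oracle thus reduces to a maximum-weight spanning tree computation; a minimal sketch (ours, using Kruskal's algorithm with union-find) is below.

def opt_spanning_tree(n, edges, theta):
    """Optimization oracle for the spanning tree polytope: maximize
    theta^T x over spanning trees of a graph on vertices 0..n-1.
    edges is a list of (u, v); theta[i] is the weight of edges[i].
    Returns the indicator vector of a maximum-weight spanning tree."""
    parent = list(range(n))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path compression
            u = parent[u]
        return u
    x = [0.0] * len(edges)
    for i in sorted(range(len(edges)), key=lambda i: -theta[i]):  # Kruskal
        u, v = edges[i]
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            x[i] = 1.0
    return x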
Exercise 4.6. Design a membership oracle for the spanning tree polytope.
¹ We also omit the fifth oracle VIOL defined by [28], which checks whether the convex set satisfies a given inequality or gives a violating point in the convex set, since this is equivalent to the OPT oracle below up to a logarithmic factor.
² We use a slightly different definition than [28] for clarity.

OPT(K) = ∂δ*_K        SEP(K) = ∂δ_K
VAL(K) = δ*_K         MEM(K) = δ_K
(one arrow type denotes a reduction using Õ(1) oracle calls, the other Õ(n))

Figure 4.1: The relationships among the four oracles for convex sets. The arrows are implications.

4.1.2 Oracles for Convex Functions

Now, we generalize the oracles for convex sets to convex functions. The membership oracle and separation oracle can be generalized as follows.
Definition 4.7 (Evaluation Oracle (EVAL)). Queried with a vector y, the oracle outputs f(y).
The next oracle generalizes gradients to subgradients so that they can be computed for general convex functions. Any vector output by the oracle is a subgradient of the function at the query point.
Definition 4.8 (Subgradient Oracle (GRAD)). Queried with a vector y, the oracle outputs f(y) and a vector g ∈ R^n such that
f(x) ≥ f(y) + g^⊤(x − y) for all x ∈ R^n   (4.1)
or outputs "f(y) undefined."
To generalize the validity oracle, we define the convex (Fenchel) conjugate of f:
Definition 4.9 (Convex Conjugate). For any function f, we define the convex conjugate
f*(θ) := sup_{x∈R^n} θ^⊤x − f(x).
Note that f* is convex because it is the supremum of linear functions. Also, we have f*(0) = −inf_{x∈R^n} f(x). Note that³ δ*_K(c) = sup_{x∈K} c^⊤x. Therefore, the validity oracle for K is simply the evaluation oracle for δ*_K.
The following lemma shows that the optimization oracle is simply the (sub)gradient oracle for δ*_K. We use ∇f to represent a subgradient.
Lemma 4.10. For any continuous function f with differentiable f*, we have that ∇f*(θ) = arg max_x θ^⊤x − f(x).
Proof. First we observe that the supremum is achieved. Fix θ. Let g_θ(x) = θ^⊤x − f(x). We assume sup_x g_θ(x) = f*(θ) is finite. Let ϵ > 0. Then, g_θ(x) is a continuous function, so S = g_θ^{−1}([f*(θ) − ϵ, ∞)) is a closed set. The set S is not empty because there is some x for which g_θ(x) ≥ sup_z g_θ(z) − ϵ. Now suppose for a contradiction that S is not bounded. Then, there exists a sequence x_i ∈ R^n such that ∥x_i∥ ≥ i and g_θ(x_i) ≥ f*(θ) − ϵ for all i. By the compactness of the unit sphere, we may assume by taking a subsequence that x_i/∥x_i∥ → u for some unit vector u ∈ R^n. Then, u^⊤x_i/∥x_i∥ → u^⊤u = 1, and since ∥x_i∥ → ∞, we have u^⊤x_i → ∞. Since g_{θ+u}(x_i) = g_θ(x_i) + u^⊤x_i ≥ f*(θ) − ϵ + u^⊤x_i, this means that g_{θ+u}(x_i) → ∞, contradicting f*(θ + u) being finite. Hence, S is bounded and thus compact, so g_θ(x) attains its maximum in S, and thus in R^n.
Let x_θ ∈ arg max_x θ^⊤x − f(x). By definition, we have that f*(θ) = θ^⊤x_θ − f(x_θ) and that f*(η) ≥ η^⊤x_θ − f(x_θ) for all η. Therefore,
f*(η) ≥ f*(θ) + x_θ^⊤(η − θ) for all η.
Therefore, x_θ ∈ ∇f*(θ).
3 Recall that δC (x) = 0 if x ∈ C and +∞ otherwise.
Note that this lemma shows that ∇δ*_K(θ) = arg max_{x∈K} θ^⊤x. So, the gradient oracle for δ*_K is exactly the optimization oracle for K.
Figure 4.2 shows the current best reductions between these four function oracles. Note that according to our definition of the GRAD oracle, it also outputs the function value (EVAL). It is not hard to show that one can use n + 1 calls to the evaluation oracle to compute one gradient approximately (using finite differences), and that one can use Õ(n) calls to the gradient oracle of f to compute one gradient of f* (using the cutting plane method). In the next section (Sec. 4.2), we will see the remaining reductions. However, it is still an open question whether all of these reductions are the best possible.
Open Problem. Prove that it takes Ω(n²) calls to the evaluation oracle of f to compute the gradient of f*.

GRAD(f*)        GRAD(f)
EVAL(f*)        EVAL(f)
(one arrow type denotes a reduction using Õ(1) oracle calls, the other Õ(n))

Figure 4.2: This illustrates the relationships of the oracles for a convex function f and its convex conjugate f*.

4.1.3 Convex Conjugate and Equivalences

Here are some examples of conjugates.
Exercise 4.11. Show that
• The conjugate of f(x) = (1/p)∥x∥_p^p is f*(θ) = (1/q)∥θ∥_q^q, where 1/p + 1/q = 1 (with p, q ≥ 1).
• The conjugate of f(x) = ⟨a, x⟩ − b is f*(θ) = b if θ = a, and +∞ otherwise.
• The conjugate of f(x) = Σ_i e^{x_i} is f*(θ) = Σ_i (θ_i log θ_i − θ_i) if θ_i ≥ 0 for all i, and +∞ otherwise.

Exercise 4.12. Show that for any x and θ, we have θ^⊤x ≤ f(x) + f*(θ).

Exercise 4.13. Prove that the gradient of f is L-Lipschitz if and only if f* is (1/L)-strongly convex.
The following lemma shows that one can recover f from f*.
Lemma 4.14 (Involution property). For any convex function f with a closed epigraph, we have that f** = f.
Proof. Since epi f is a closed, convex set, Corollary 1.7 shows that it must be an intersection of halfspaces, i.e., there is a set H such that
f(x) = sup_{(θ,b)∈H} θ^⊤x − b
where H is the set of supporting planes of epi f and contains all affine lower bounds on f, namely pairs (θ, b) with f(x) ≥ θ^⊤x − b for all x. Alternatively, we can write
epi(f) = ∩_{(θ,b) : ∀x, f(x) ≥ θ^⊤x − b} {(x, t) : t ≥ θ^⊤x − b}.
For a fixed θ, any feasible b satisfies b ≥ θ^⊤x − f(x) for all x. So, the smallest feasible value satisfies
b* = sup_x θ^⊤x − f(x) = f*(θ).
Hence,
f(x) = sup_θ θ^⊤x − b* = sup_θ θ^⊤x − f*(θ) = f**(x).

Exercise 4.15. Let f be a convex function with closed epigraph. Show that f = f* if and only if f(x) = ∥x∥²/2.

Since we can use the gradient ∇f to compute the gradient of the dual, ∇f* (via the cutting plane method), the involution property shows that we can do the reverse: use ∇f* to compute ∇f. Going back to the example about conv({a_i}), since we know how to compute max_{x∈conv({a_i})} θ^⊤x = δ*_{conv({a_i})}(θ), this reduction gives us a way to separate conv({a_i}), or equivalently, to compute the (sub)gradient of δ_{conv({a_i})}. This is formalized in the next exercise.

Exercise 4.16. Show how to implement the separation oracle SEP for a convex set K given access to an
optimization oracle OPT for K .

Recall that for any linear space X, X* denotes the dual space, i.e., the set of all linear functions on X, and that under mild assumptions⁴, we have X** = X. Therefore, there are two natural coordinate systems to record a convex function: the primal space X and the dual space X*. Under these coordinate systems, we have the dual functions f and f*.
⁴ The dual of the dual of a vector space X is isomorphic to X if and only if X is a finite-dimensional vector space.

Exercise 4.17 (Order reversal). Show that f ≥ g ⟺ f* ≤ g* (both pointwise on R^n).


Interestingly, the convex conjugate is the unique transformation on convex functions that satisfies both involution and order reversal.

Theorem 4.18 ([6]). Let T be a transformation that maps the set of lower semi-continuous⁵ convex functions onto itself such that T(Tϕ) = ϕ and ϕ ≤ ψ ⟹ Tϕ ≥ Tψ for all lower semi-continuous convex functions ϕ and ψ. Then, T is essentially the convex conjugate; namely, there is an invertible symmetric linear transformation B, a vector v_0 and a constant C_0 such that
(Tϕ)(x) = ϕ*(Bx + v_0) + v_0^⊤x + C_0.
⁵ See Def. 0.3.

In combinatorial optimization, many convex sets are given as the convex hull of some discrete objects. In many cases, the only known way to do the separation is via such reductions. In this chapter, we will study the following general theorem, showing that optimization can be reduced to membership/evaluation with a quadratic overhead in dimension for the number of oracle queries.

Theorem 4.19. Let K be a convex set specified by a membership oracle, a point x_0 ∈ R^n, and numbers 0 < r < R such that B(x_0, r) ⊆ K ⊆ B(x_0, R). For any convex function f given by an evaluation oracle and any ϵ > 0, there is a randomized algorithm that computes a point z ∈ B(K, ϵ) such that
f(z) ≤ min_{x∈K} f(x) + ϵ (max_{x∈K} f(x) − min_{x∈K} f(x))
with constant probability, using O(n² log²(nR/(ϵr))) calls to the membership oracle and the evaluation oracle and O(n³ log^{O(1)}(nR/(ϵr))) total arithmetic operations.

4.2 Gradient from Evaluation via Finite Difference

When the function f is in C¹, we can approximate the gradient of a function by calculating a finite difference for sufficiently small h:
∂f/∂x_i = (f(x + he_i) − f(x))/h + O(h),
which only takes n + 1 calls to the evaluation oracle (for computing f(x), f(x + he_1), · · · , f(x + he_n)). The only issue is that the convex function may not be differentiable. However, any convex Lipschitz function is twice differentiable almost everywhere (see the proof below). Therefore, we can simply perturb x with random noise, then apply a finite difference. To see the idea more precisely, we first observe that the norm of the Hessian can be bounded in expectation for a Lipschitz function. Note that this is Lipschitzness of the function, not of its gradient. The proof below uses the basic fact that the gradient is defined almost everywhere for Lipschitz functions.

Lemma 4.20. For any L-Lipschitz convex function f dened in a unit ball, we have ∇2 f (x) exists almost
everywhere and that Ex∈B(0,1) ∥∇2 f (x)∥F ≤ nL.
Proof. The existence almost everywhere is classical (Alexandrov Theorem, see e.g., [59]; see also Rademacher's
theorem about the existence of the derivative almost everywhere for Lipschitz functions). We will only prove
the part Ex∈B(0,1) ∥∇2 f (x)∥F ≤ nL. Since ∇2 f ⪰ 0 (where dened), we have ∥∇2 f (x)∥F ≤ tr∇2 f (x).
Therefore, Z Z Z
∥∇2 f (x)∥F dx ≤ tr∇2 f (x)dx = ∆f (x)dx.
B(0,1) B(0,1) B(0,1)

Using Stokes' Theorem, and letting ν(x) be the normal vector at x, we have
Z Z
∆f (x)dx = ⟨∇f (x), ν(x)⟩ dx ≤ |∂B(0, 1)| · L.
B(0,1) ∂B(0,1)

Hence, we have
|∂B(0, 1)|
Ex∈B(0,1) ∥∇2 f (x)∥F ≤ L = nL.
|B(0, 1)|
[59]

To turn this into an algorithm, we need to develop it a bit further.

Lemma 4.21 ([42]). Let B∞ (x, r) = {y : ∥x − y∥∞ ≤ r}. For any 0 < r2 ≤ r1 and any convex function f
dened on B∞ (x, r1 + r2 ) with ∥∇f (z)∥∞ ≤ L for any z ∈ B∞ (x, r1 + r2 ) we have

r2
Ey∈B∞ (x,r1 ) Ez∈B∞ (y,r2 ) ∥∇f (z) − g(y)∥1 ≤ n3/2 L
r1

where g(y) is the average of ∇f over B∞ (y, r2 ).


Proof. Let ωi (z) = ⟨∇f (z) − g(y), ei ⟩ for all i ∈ [n]. Then, we have that
Z XZ
∥∇f (z) − g(y)∥1 dz = |ωi (z)| dz.
B∞ (y,r2 ) i B∞ (y,r2 )

Since ωi (z)dz = 0, the Poincaré inequality for a box (Theorem 4.25 below) shows that
R
B∞ (y,r2 )
Z Z
|ωi (z)| dz ≤ r2 ∥∇ωi (z)∥2 dz
B∞ (y,r2 ) B∞ (y,r2 )
Z
2
= r2 ∇ f (z)ei dz
2
B∞ (y,r2 )
XZ √
Z
2
|ωi (z)| dz ≤ nr2 ∇ f (z) dz
F
i B∞ (y,r2 ) B∞ (y,r2 )

Since f is convex, eigenvalues of the Hessian are nonnegative and so we have


sX X
2
λi = tr∇2 f (z) = ∆f (z).

∇ f (z) =
F
λ2i ≤
i i
4.2. Gradient from Evaluation via Finite Dierence 51

Therefore, we have

Ez∈B∞ (y,r2 ) ∥∇f (z) − g(y)∥1 ≤ nr2 Ez∈B∞ (y,r2 ) ∆f (z)

= nr2 ∆h(y)

where h = (2r12 )n f ∗ χB∞ (0,r2 ) where χB∞ (0,r2 ) is 1 on the set B∞ (0, r2 ) and 0 on outside.
Integrating by parts, we have that
Z Z
∆h(y)dy = ⟨∇h(y), n(y)⟩ dy
B∞ (x,r1 ) ∂B∞ (x,r1 )

2
where ∆h(y) = i ddxh2 (y) and n(y) is the normal vector on ∂B∞ (x, r1 ) the boundary of the box B∞ (x, r1 ),
P
i
i.e. standard basis vectors. Since f is L-Lipschitz with respect to ∥·∥∞ so is h, i.e. ∥∇h(z)∥∞ ≤ L. Hence,
we have that
Z
1 1 nL
Ey∈B∞ (x,r1 ) ∆h(y) ≤ n
∥∇h(y)∥∞ ∥n(y)∥1 dy ≤ n
· 2n(2r1 )n−1 · L = .
(2r1 ) ∂B∞ (x,r1 ) (2r1 ) r1

Therefore, we have that


r2
Ey∈B∞ (x,r1 ) Ez∈B∞ (y,r2 ) ∥∇f (z) − g(y)∥1 ≤ n3/2 L.
r1

Exercise 4.22. For a Lipschitz function f : Rn → R,and h = f ∗ χB∞ (0,1/2) , prove that h is Lipschitz and
∆h(y) = Ez∼B∞(y,1/2) (∆f (z)).
4.2. Gradient from Evaluation via Finite Dierence 52

Algorithm 10: SubgradConvexFunc(f, x, r1 , ε)


Require:qr1 > 0, ∥∂f (z)∥∞ ≤ L for any z ∈ B∞ (x, 2r1 ).
Set r2 = εr1

nL
.
Sample y ∈ B∞ (x, r1 ) and z ∈ B∞ (y, r2 ) independently and uniformly at random.
for i = 1, 2, · · · , n do
Let αi and βi denote the end points of the interval B∞ (y, r2 ) ∩ {z + sei : s ∈ R}.
Set g̃i = f (βi )−f
2r2
(αi )
where we compute f to within ε additive error.
end
Output g̃ as the approximate subgradient of f at x.

This lemma shows that we can implement an approximate gradient oracle (GRAD) using an evaluation
oracle (EVAL) even for non-dierentiable functions. By the involution property again, this completes all the
reductions in Figure 4.2. With the above fact asserting that, on average, the gradient is approximated by its
average in a small ball, we now proceed to construct an approximate subgradient, using only an approximate
evaluation oracle. The parameter r2 in the algorithm is chosen to optimize the nal error of the output. By
making the ratio r2 /r1 suciently small, we can get a desired error for the subgradient.

Lemma 4.23. Let r1 > 0 and f be a convex function. Suppose that ∥∇f (z)∥∞ ≤ √L for any z ∈
B∞ (x, 2r1 ) and suppose that we can evaluate f to within ε additive error forε ≤qr1 nL. Let g̃ =
Lε 5/4
SubgradConvexFunc(f, x, r1 , ε). Then, there is random variable ζ ≥ 0 with Eζ ≤ 2 r1 n such that
for any y
f (y) ≥ f (x) + ⟨g̃, y − x⟩ − ζ ∥y − x∥∞ − 4nr1 L.

Proof. We assume that f is twice dierentiable. For general f , we can reduce to this case by viewing it as
a limit of twice-dierentiable functions.

First, we assume that we can compute f exactly, namely ε = 0. Fix i ∈ [n]. Let g(y) be the average of
4.2. Gradient from Evaluation via Finite Dierence 53

∇f over B∞ (y, r2 ). Then, for the function g̃ computed by the algorithm, we have that

f (βi ) − f (αi )
Ez |g̃i − g(y)i | = Ez
− g(y)i
2r2
Z
1 df
≤ Ez (z + sei ) − g(y)i ds
2r2 dxi

df
= Ez
(z) − g(y)i
dxi

where we used that both z + sei and z are uniform distribution on B∞ (y, r2 ) in the last line. Hence, we have

Ez ∥g̃ − ∇f (z)∥1 ≤ Ez ∥∇f (z) − g(y)∥1 + Ez ∥g̃ − g(y)∥1 ≤ 2Ez ∥∇f (z) − g(y)∥1 .

Now, applying the convexity of f yields that

f (q) ≥ f (z) + ⟨∇f (z), q − z⟩


= f (z) + ⟨g̃, q − x⟩ + ⟨∇f (z) − g̃, q − x⟩ + ⟨∇f (z), x − z⟩
≥ f (z) + ⟨g̃, q − x⟩ − ∥∇f (z) − g̃∥1 ∥q − x∥∞ − ∥∇f (z)∥∞ ∥x − z∥1 .

Now, using f is L-Lipschitz between x and z , we have that f (z) ≥ f (x) − L · ∥x − z∥1 . Hence, we have

f (q) ≥ f (x) + ⟨g̃, q − x⟩ − ∥∇f (z) − g̃∥1 ∥q − x∥∞ − 2L ∥x − z∥1 .

Note that ∥x − z∥1 ≤ n · ∥x − z∥∞q≤ n(r1 + r2 ) by assumption. Moreover, we can apply Lemma 4.21 to
bound ∥∇f (z) − g̃∥1 and use r2 = √εrnL
1
≤ r1 to get

f (q) ≥ f (x) + ⟨g̃, q − x⟩ − ζ ∥q − x∥∞ − 4nr1 L

with Eζ ≤ n3/2 rr12 L.


Since we only compute f up to ε additive error, it introduces ε
r2 additive error in g̃i . Hence, we instead
have that
r2 εn
Eζ ≤ n3/2 L + .
r1 r2
q
Setting r2 = √εrnL 1
completes the proof.

Note that if we happened to have an exact oracle for f , then we can make r2 arbitrarily small.

Exercise 4.24. What is the best possible bound in Lemma 4.21?


In the proof of Lemma 4.21, we used the following fact applied to the case when Ω is a cube. In this
case, the coecient on the RHS is given by the Cheeger or KLS constant of the cube and is 1. This is an
example of an isoperimetric inequality. Such inequalities will play an important role later in this book.

Theorem 4.25 (L1 -Poincaré inequality). Let Ω be connected, bounded and open. Then the following (best-
possible) inequality holds for any smooth function f : Ω → R:
 
2|S||Ω \ S|
Z
f − 1

f (x) dx ≤ sup ∥∇f ∥L1 (Ω)
|Ω| Ω 1
L (Ω) S⊂Ω |∂S||Ω|

where the supremum is over all subsets S s.t. S and Ω\S are both connected.

Exercise 4.26. Prove the inequality in Theorem 4.25 using the classical coarea formula.
4.3. Separation via Membership 54

4.3 Separation via Membership


In this section, we show that how to implement a separation oracle for a convex set using a nearly linear
number of queries to a membership oracle. We divide this into two steps. In the rst step (Algorithm ??),
we compute an approximate subgradient of a given Lipschitz convex function. Using this, in the second step
(Algorithm ??), we compute an approximate separating hyperplane.
Theorem 4.27. Let K be a convex body s.t. B(0, 1) ⊆ K ⊆ B(0, R). Then, for any 0 < η ≤ 1/2, we can
compute an η -approximate separation oracle for K using O(n log(nR/η)) queries to a membership oracle for
K.
Throughout this section, let K ⊆ Rn be a convex set that contains B2 (0, r) and is contained in B2 (0, R).
Recall that Bp (x, r) is the p-norm ball of radius r centered at x. By an approximate separation oracle
we mean that either the queried point x lies within distance η of K , or the oracle provides a halfspace
cT y ≤ cT x + η for all points y ∈ K at distance at least η from the boundary of K . We also dene
def
Bp (K, −δ) = {x ∈ Rn : Bp (x, δ) ⊆ K}.
Given some point x ∈/ K , we wish to separate x from K using a halfspace. To do this, we reduce this
problem to computing an approximate subgradient of a Lipschitz convex function hx (d) dened for points
in K. Roughly speaking, it is the height (or distance from the boundary) of a point d in the direction of x.
Let
αx (d) = max α and hx (d) = −αx (d) ∥x∥2 .
d+αx∈K

Note that d + αx (d)x is the last point in K on the line through d ∈ K in the direction of x, and −hx (d) is
the ℓ2 distance from this boundary point to d (see Fig.4.3).

Figure 4.3: The convex height function hx

The output of the algorithm for separation is a halfspace that approximately contains K , and the input
point x is close to its bounding hyperplane. It uses a call to the subgradient function above.
We now proceed to analyze the height function.
Lemma 4.28. hx (d) is convex on K.
Proof. Let d1 , d2 ∈ K and λ ∈ [0, 1]. Now d1 + αx (d1 )x ∈ K and d2 + αx (d2 )x ∈ K . Consequently,
[λd1 + (1 − λ)d2 ] + [λ · αx (d1 ) + (1 − λ) · αx (d2 )] x ∈ K .
def
Therefore, if we let d = λd1 +(1−λ)d2 we see that αx (d) ≥ λ·αx (d1 )+(1−λ)·αx (d2 ) and hx (λd1 +(1−λd2 ) ≤
λhx (d1 ) + λhx (d2 ) as claimed.
4.3. Separation via Membership 55

Algorithm 11: Separateε,ρ (K, x)


Require: B2 (0, r) ⊂ K ⊂ B2 (0, R).
if MEMε (K) asserts that x ∈ B(K, ϵ) then
Output:  x ∈ B(K, ε).
else if x ∈/ B2 (0, R) then
Output: the halfspace {y : 0 ≥ ⟨y − x, x⟩}.
end
Let κ = R/r, αx (d) = maxd+αx∈K α and hx (d) = −αx (d) ∥x∥2 .
The evaluation oracle of αx (d) can be implemented via binary search and MEMε (K).
Compute g̃ = SubgradConvexFunc(hx , 0, r1 , 4ε) with r1 = n1/6 ε1/3 R2/3 κ−1 and the evaluation
oracle of αx (d).
Output: the halfspace  
31 7/6 2/3 1/3
y: n R κε ≥ ⟨g̃, y − x⟩
ρ

 
Lemma 4.29. hx is
R+δ
r−δ -Lipschitz over points in B2 (0, δ) for any δ < r.

Proof. Let d1 , d2 be arbitrary points in B(0, δ). We wish to upper bound |hx (d1 ) − hx (d2 )| in terms of
∥d1 − d2 ∥2 . We assume without loss of generality that αx (d1 ) ≥ αx (d2 ) and therefore
|hx (d1 ) − hx (d2 )| = |αx (d1 ) ∥x∥2 − αx (d2 ) ∥x∥2 | = (αx (d1 ) − αx (d2 )) ∥x∥2 .
Consequently, it suces to lower bound αx (d2 ). We split the analysis into two cases.
Case 1: ∥d2 − d1 ∥2 ≥ r − δ . Since 0 ≥ hx (d1 ), hx (d2 ) ≥ −R − δ , we have that
R+δ
|hx (d1 ) − hx (d2 )| ≤ R + δ ≤ ∥d2 − d1 ∥2 .
r−δ
Case 2: ∥d2 − d1 ∥2 ≤ r − δ . We consider the point d3 = d1 + d2 −d λ
1
with λ = ∥d2 − d1 ∥2 /(r − δ). Note
that
1 1
∥d3 ∥2 ≤ ∥d1 ∥2 + ∥d2 − d1 ∥2 ≤ δ + ∥d2 − d1 ∥2 ≤ r.
λ λ
Hence, d3 ∈ K . Since λ ∈ [0, 1] and K is convex, we have that λ · d3 + (1 − λ) · [d1 + αx (d1 )x] ∈ K . Now,
we note that
λ · d3 + (1 − λ) · [d1 + αx (d1 )x] = d2 + (1 − λ) · αx (d1 )x
4.3. Separation via Membership 56

and this shows that  


∥d2 − d1 ∥2
αx (d2 ) ≥ (1 − λ) · αx (d1 ) = 1− · αx (d1 ).
r−δ
Since d1 + αx (d1 )x ∈ K ⊂ B2 (0, R), we have that αx (d1 ) · ∥x∥2 ≤ R + δ and hence

∥d2 − d1 ∥2 R+δ
|hx (d1 ) − hx (d2 )| = (αx (d1 ) − αx (d2 )) · ∥x∥2 ≤ αx (d1 ) · ∥x∥2 ≤ ∥d2 − d1 ∥2 .
r−δ r−δ
In either case, as claimed we have
R+δ
|hx (d1 ) − hx (d2 )| ≤ ∥d2 − d1 ∥2 .
r−δ

The next lemma shows that hx gives us a way to implement an approximation separation oracle and
only needs access to an approximation evaluation oracle for hx which in turn only needs an approximate
membership oracle for K . To be precise, we dene approximate oracles.

Denition 4.30 (Separation Oracle (SEP)). Queried with a vector y ∈ Rn and real numbers δ, δ ′ > 0, with
probability at least 1 − δ ′ , the oracle either
ˆ assert that y ∈ B(K, δ), or
ˆ nd a unit vector c ∈ Rn such that cT x ≤ cT y + δ for all x ∈ B(K, −δ).
We let SEPδ,δ′ (K) be the time complexity of this oracle.

Denition 4.31 (Membership Oracle (MEM)). Queried with a vector y ∈ Rn and real numbers δ, δ ′ > 0,
with probability at least 1 − δ ′ , either
ˆ assert that y ∈ B(K, δ), or
ˆ assert that y ∈
/ B(K, −δ).
We let MEMδ,δ′ (K) be the time complexity of this oracle.

We can now state the main lemma for the separation oracle. Since the algorithm is randomized, we have
a parameter ρ ∈ (0, 1) to denote the probability of failure.

Lemma 4.32. Let K be a convex set satisfying B2 (0, r) ⊂ K ⊂ B2 (0, R). Given any 0 < ρ < 1 and
0 ≤ ε ≤ r. With probability 1 − ρ, Separateε,ρ (K, x) outputs a halfspace that contains K .
Proof. When x ∈/ B2 (0, R), the algorithm outputs a valid separation for B2 (0, R). For the rest of the proof,
we assume x ∈/ B(K, −ε) (due to the membership oracle) and x ∈ B2 (0, R).
By Lemma 4.28 and Lemma 4.29, hx is convex with Lipschitz constant 3κ on B2 (0, 2r ). By our assumption
on ε and our choice of r1 , we have that B∞ (0, 2r1 ) ⊂ B2 (0, 2r ). Hence, we can apply Lemma 4.23 to get that

hx (y) ≥ hx (0) + ⟨g̃, y⟩ − ζ ∥y∥∞ − 12nr1 κ (4.2)

for any y ∈ K . Note that − κx ∈ K and hx (− κx ) = hx (0) − κ1 ∥x∥2 . Hence, we have


 
1 1 1 1
hx (0) − ∥x∥2 = hx (− x) ≥ hx (0) + g̃, − x − ζ ∥x∥∞ − 12nr1 κ.
κ κ κ κ

Therefore, we have
⟨g̃, x⟩ ≥ ∥x∥2 − ζ ∥x∥∞ − 12nr1 κ2 . (4.3)
Now, we note that x ∈
/ B(K, −ε). Using that B(0, r) ⊂ K , we have (1 − rε )K ⊂ B(K, −ε). Hence,
 ε
hx (0) ≥ − 1 − ∥x∥2 ≥ − ∥x∥2 .
r
Therefore, we have
hx (0) + ⟨g̃, x⟩ ≥ −ζ ∥x∥∞ − 12nr1 κ2
4.3. Separation via Membership 57

Combining this with (4.2), we have that

hx (y) ≥ ⟨g̃, y − x⟩ − ζ ∥y∥∞ − ζ ∥x∥∞ − 12nr1 κ − 12nr1 κ2


≥ ⟨g̃, y − x⟩ − 2ζR − 24nr1 κ2

for any qy ∈ K . Recall from Lemma 4.23 that ζ is a positive random scalar independent of y satisfying
Eζ ≤ 2 3κε r1 n
5/4
. For any y ∈ K , we have that hx (y) ≤ 0 and hence ζ̃ ≥ ⟨g̃, y − x⟩ where ζ̃ is a random
scalar independent of y satisfying
r
3κε 5/4
Eζ̃ ≤ 4 n R + 24nr1 κ2
r1
≤ 31n7/6 R2/3 ε1/3 κ.

where we used r1 = n1/6 ε1/3 R2/3 /κ and 0 ≤ ε ≤ r. The result then follows using Markov's inequality.
Exercise 4.33. Suppose we can evaluate the subgradient of hx exactly for a convex set K containing the
origin. Give a short proof that for any x ̸∈ K , we have ⟨∇hx (0), y − x⟩ ≤ 0 for all y ∈ K .
Theorem 4.34. Let K B2 (0, 1/κ) ⊂ K ⊂ B2 (0, 1).
be a convex set satisfying For any 0≤η< 1
2 , we have
that   

SEPη (K) ≤ O n log MEM(η/nκ)O(1) (K).
η
Proof. First, we bound the running time. Note that the bottleneck is to compute hx with δ additive error.
Since −O(1) ≤ hx (y) ≤ 0 for all y ∈ B2 (0, O(1)), one can compute hx (y) by binary search with O(log(1/δ))
calls to the membership oracle.
Next, we check that Separateδ,ρ (K, x) is indeed a separation oracle. Note that g̃ may not be an unit
vector and we need to re-normalize the g̃ by 1/ ∥g̃∥2 . So, we need to a lower bound ∥g̃∥2 .
3
From (4.3) and our choice of r1 , if δ ≤ 106ρn6 κ6 , then we have that
r
⟨g̃, x⟩ ≥ ∥x∥2 − ζ ∥x∥∞ − 12nr1 κ2 ≥ .
4
Hence, we have that ∥g̃∥2 ≥ 4κ 1
. Therefore, this algorithm is a separation oracle with error 400 7/6 2 1/3
ρ n κ δ
and failure probability O(ρ + log(1/δ)δ).

SEPΩ(max(n7/6 κ2 δ1/3 /ρ+ρ+log(1/δ)δ) (K) ≤ O(log(1/δ))MEMδ (K).


√  6 
Setting ρ = n7/6 κ2 δ 1/3 and δ = Θ n7/2 η
κ6
, we have that


SEPη (K) ≤ O(log( ))MEMη6 /(n7/2 κ6 ) (K).
η

4.3.1 Gradient from Evaluation via Auto Dierentiation


In this section, we give yet another way to compute gradients using evaluations.
Theorem 4.35. Given a function f : Rn → R represented by an circuit whose gates compute dierentiable
functions of a nite number of variables. Suppose that f (x) can be computed in time T (namely, the circuit
has T edges). Then we can compute ∇f (x) in O(T ) time.

Remark. In practice, the runtime is roughly 2T assuming we have enough memory. Check out google/jax
for a modern implementation.
Before proving it formally, we rst go through an example. Consider the function f (x1 , x2 ) = sin(x1 /x2 )+
x1 x2 . We use xi to denote both the input and all intermediate variables. Then, we can write the program
in T = 6 steps:
4.3. Separation via Membership 58

ˆ x1 = x1 , x2 = x2 , x3 = x1 /x2 , x4 = sin(x3 ), x5 = x1 x2 , Output x6 = x4 + x5 .


Note that each step involves computing
only few xj
z }| {
xi = fi (x1 , · · · , xi−1 )

with simple functions fi whose derivatives we know how to compute. The key idea is compute ∂x ∂f
1
not just
for the inputs x1 and x2 , but also for all intermediate variables. Here, we use ∂xi to denote the derivative of
∂f

f with respect to xi while xing x1 , x2 , · · · , xi−1 (and other inputs if xi is an input). For the example above,
suppose we want to compute ∇f (π, 2), we can simply compute rst compute all xi from i = 1, 2, · · · , 6, then
∂xi in the reverse order from i = 6, 5, · · · , 1:
∂f

ˆ x1 = π , x2 = 2, x3 = π/2, x4 = sin(x3 ) = 1, x5 = x1 x2 = 2π , x6 = x4 + x5 = 2π + 1.
ˆ ∂x
∂f
6
= 1, ∂x
∂f
5
= ∂(x∂x
4 +x5 )
5
= 1, ∂x∂f
4
= ∂(x∂x
4 +x5 )
4
= 1,
ˆ ∂x3 = ∂x4 ∂x3 = 1 · cos(x3 ) = 0,
∂f ∂f ∂x4

ˆ ∂x
∂f
2
∂f ∂x3
= ∂x 3 ∂x2
∂f ∂x5
+ ∂x 5 ∂x2
= 0 · (− xx21 ) + 1 · x1 = π ,
2

ˆ ∂x
∂f
1
∂f ∂x3
= ∂x 3 ∂x1
∂f ∂x5
+ ∂x 5 ∂x1
= 0 · ( x12 ) + 1 · x2 = 2.
The general case is similar. See AutoDifferentiation for the algorithm.
Algorithm 12: AutoDifferentiation
Input: a function f (x1 , x2 , · · · , xn ) given by f (x1 , x2 , · · · , xn ) = xm and
xi = fi (x1 , · · · , xi−1 ) for i = n + 1, n + 2, · · · , m

for i = n + 1, n + 2, · · · , m do
Compute xi = fi (x1 , · · · , xi−1 ).
end
Let ∂f
∂xm= 1.
for i = m − 1, · · · , 1 do
Let Li be the set of j such that fj depends on xi (i.e. xj directly depends on xi ).
∂f ∂xj
Compute ∂x ∂f
.
P
i
= j∈Li ∂x j ∂xi

end

∂f ∂xj
We prove by induction that the formula ∂f
is correct. For the
P
Proof of Theorem 4.35. ∂xi = j∈Li ∂xj ∂xi
base case i = m, we have f = xm and hence ∂f
∂xm= 1. For the induction, we let Li = {xj1 , xj2 , · · · , xjk }. If
we x variables x1 , x2 , · · · , xi−1 , then f is a function of xi (and of other inputs if xi is an input). Since only
xj1 , xj2 , · · · , xjk depend on xi , we can also view f as a function of xj1 , xj2 , · · · , xjk . More precisely, we have

f (xi ) = f (xj1 (xi , xj−1 ), xj2 (xi , xj−2 ), · · · , xjk (xi , xj−k ))

where we use xj−1 to denote the variables xj2 , xj3 , · · · , xjk . By chain rule, we have
∂f X ∂f ∂xj
= .
∂xi ∂xj ∂xi
j∈Li

To bound the runtime, we dene the computation graph G be a graph on x1 , x2 , · · · , xm such that i → j
if fj depends on xi . Note that each edge is examined O(1) times whether evaluating f or its gradient. Hence,
the cost of computing f and the cost of our algorithm are both Θ(m) where m is the number of edges in G.
This completes the proof.
We note that an ecient implementation of the chain rule is the heart of the backpropagation algorithm
fo neural networks. To conclude this section, we see that Theorem 4.35 can be surprisingly useful even for
some simple explicit functions.
Corollary 4.36. If we can compute f (A) = det A exactly in time T, then we can compute A−1 exactly in
O(T ).
4.4. Composite Problem via Duality 59

Proof. Note that ∂A∂ij det A = adj(A)ji = det A · (A−1 )ji . Hence, ∇ log det A = A−⊤ . Theorem 4.35 shows
that computing A−⊤ can be done as fast as det A.

4.3.2 Gradient from Evaluation via Complex Step Dierentiation


Auto dierentiation is great if the function can be computed exactly. However, it does not work well if
f can be only approximated or the computation of f involves too many variables. For example, if f is
the energy usage of some aircraft design, then f can only be computed approximately via simulation and
such computation is memory expensive (if we need to store all intermediate variables) and not exact, so the
method in the previous subsection is not practical.
To introduce complex step dierentiation, we recall the formula of nite dierence:
ˆ Forward/Backward dierence: f (x±h)−f
±h
(x)
= f ′ (x) + hf ′′ (ζ).
ˆ Central dierence: f (x+h)−f
2h
(x−h)
= f ′ (x) + O(h2 )f ′′′ (ζ).
Note that the error analysis above assumes no numerical error involved. The formula involves subtracting
two close numbers and dividing by a small number and this step creates lots of error. If we are computing
f as oating point, we have
(1 ± ϵ)f (x ± h) − (1 ± ϵ)f (x) ϵ
= f ′ (x) + O( )f (ζ1 ) + O(h)f ′′ (ζ2 )
±h h
(1 ± ϵ)f (x + h) − (1 ± ϵ)f (x − h) ϵ
= f ′ (x) + O( )f (ζ1 ) + O(h2 )f ′′′ (ζ2 )
2h h
where ϵ is the oating point precision. Suppose that f, f ′ , f ′′ , f ′′′ are all √
bounded by constants and suppose
ϵ = (10)−8 (single precision oating point), then we should pick h = ϵ for the rst case and h = ϵ1/3
in the second case. Hence, forward/backward dierence gives only accuracy ϵ1/2 ≈ (10)−4 and the central
dierence gives only accuracy ϵ2/3 ≈ (10)−5 . In reality, the error will be larger because f ′′ and f ′′′ usually
are large.
The complex step dierentiation uses the step
Imf (x + ih)
≈ f ′ (x) + O(h2 )f ′′ (ζ).
h
This formula works only for complex analytic functions, but it avoids subtracting numbers. In practice, this
really allows us to compute f ′ (x) close to machine accuracy. In general, ensuring algorithms give close to
machine accuracy is important because algorithms often stack on top of each other.

4.4 Composite Problem via Duality


4.4.1 Motivation
In this section, we will discuss a few algorithmic applications of duality. In general, two reasons to consider a
dual problem are that (1) the dual problem might have smaller dimension and (2) (sub)gradients are easier
to compute. We will see such examples.
Suppose we have a dicult convex problem minx f (x), i.e., one that does not satisfy some smoothness
condition which direct suggests a faster than worst-case optimization method. Often, we can decompose
the dicult problem as f (x) = g(x) + h(Ax) such that the (sub)gradients ∇g ∗ (x) and ∇h∗ (x) are easy to
compute. Then, it is useful to compute its dual as follows:

min g(x) + h(Ax) = min max g(x) + θ⊤ Ax − h∗ (θ)


x x θ
= max min g(x) + (A⊤ θ)⊤ x − h∗ (θ)
θ x

= max −g ∗ (−A⊤ θ) − h∗ (θ)


θ
= − min g ∗ (−A⊤ θ) + h∗ (θ)
θ
4.4. Composite Problem via Duality 60

where we used h = h∗∗ in the rst line, the following minimax theorem on the second line, and the denition
of g ∗ on the third line.
Theorem 4.37 (Sion's minimax theorem). X ⊂ Rn be a compact convex set and Y ⊂ Rm be a
Let convex
set. If f : X × Y → R ∪ {+∞} such that f (x, ·) is upper semi-continuous and quasi-concave on Y for all
x ∈ X and f (·, y) is lower semi-continuous and quasi-convex on X for all y ∈ Y . Then, we have

min sup f (x, y) = sup min f (x, y).


x∈X y∈Y y∈Y x∈X

Remark. Compactness is necessary. Consider f (x, y) = x + y . This theorem generalizes Von Neumann's
minimax theorem.
We call  g(x) + h(Ax) the primal problem and  g ∗ (−A⊤ θ) + h∗ (θ) the dual problem. Often, the dual
problem gives us some insight on the primal problem. However, we note that there are many ways to split
a problem into two and hence many candidate dual problems.
Example 4.38. Consider the unit capacity ow problem on a graph G = (V, E):
max c⊤ f
Af =d,−1≤f ≤1

where f ∈ RE is the ow vector, A ∈ RV ×E encodes the vertex-edge adjacency matrix with two nonzeros
per column, d is the demand vector so that Af = d is ow conservation, and c is the cost vector. We can
write the dual as follows:

max c⊤ f = max min c⊤ f − ϕ⊤ (Af − d)


Af =d,−1≤f ≤1 −1≤f ≤1 ϕ

= min max ϕ⊤ d + (c − A⊤ ϕ)⊤ f


ϕ −1≤f ≤1
X
= min ϕ⊤ d + |c − A⊤ ϕ|e .
ϕ
e∈E

When c = 0 and d = F · 1st this is the maximum ow problem with ow value F , and the dual problem
is the minimum s − t cut problem with the cut given by {v ∈ V such that ϕ(v) ≥ t}. We can view ϕ as
assigning a potential to every vertex of the graph. Note that there are |E| variables in primal and |V |
variables in dual. So, in this sense, the dual problem is easier for dense graphs. Although we do not have a
way to turn a minimum s − t cut to a maximum s − t ow in general, we will see various tricks to reconstruct
the primal solution from the dual solution by modifying the problem.

4.4.2 Example: Semidenite Programming


When you can solve the dual problem, the proof of the optimality of the dual problem often directly gives
you the solution of the primal problem, or gives you some way to solve the primal problem eciently. Here,
we use the semidenite programming as an concrete example. Dene • as the inner product of matrices, i.e.,
X
A•B = Aij Bij .
i,j

Consider the semidenite programming (SDP) problem:

max C • X s.t. Ai • X = bi for i = 1, 2, · · · , m (4.4)


X⪰0

and its dual


m
X
min b⊤ y s.t. y i Ai ⪰ C (4.5)
y
i=1

where X , C , Ai are n × n symmetric matrices and b, y ∈ Rm . SDP is a generalization of linear programming


and is useful for various problems involving matrices. If we apply the current-best cutting plane method
4.4. Composite Problem via Duality 61

naively on the primal problem, we would get Õ(n2 (Z + n4 )) time algorithm for the primal (because there
are n2 variables) and Õ(m(Z + nω + mP 2
)) for the dual where Z is the total number of non-zeros in Ai . (The
term Z + n is the cost of computing
ω
yi Ai and nding its minimum eigenvalue.) Generally, n2 ≫ m and
hence it takes much less time to solve the dual.
We note that
Pm min b⊤ y = Pm min b⊤ y.
i=1 yi Ai ⪰C v⊤ ( i=1 yi Ai −C)v≥0 ∀∥v∥2 =1

In each step of the cutting plane method, the (sub)gradient oracle either outputs b or outputs one of the
cutting planes
Xm
v⊤ ( yi Ai − C)v ≥ 0.
i=1

Let S be the set of all cutting planes used in the algorithm. Then, the proof of the cutting plane method
shows that
Pm min b⊤ y = Pm min b⊤ y ± ε. (4.6)
i=1 yi Ai ⪰C v⊤ ( i=1 yi Ai −C)v≥0 ∀v∈S

The key idea to obtaining the primal solution is to take the dual of the right hand side (which is an
approximate dual of the original problem). Now, we have

X Xm
min b⊤ y = min max b⊤ y − λv v ⊤ ( yi Ai − C)v
v⊤ ( m y λv ≥0
P
i=1 yi Ai −C)v≥0 ∀v∈S i=1
v∈S
X m
X X
= max min C • λv vv ⊤ + b⊤ y − yi λv vv ⊤ • Ai
λv ≥0 y
v∈S i=1 v∈S
m
X
= P max min C •X + yi (bi − X • Ai )
X= v∈S λv vv ⊤ ,λv ≥0 y
i=1
= max C • X.
X= v∈S λv vv ⊤ ,λv ≥0,X•Ai =bi
P

Note
P that this is exactly the primal SDP problem, except that we restrict the set of solutions to the form
v∈S λv vv with λv ≥ 0. Also, we can write this problem as a linear program:

X
max λv v ⊤ Cv. (4.7)
λv v ⊤ Ai v=bi for all i,λv ≥0
P
v v

Therefore, we can simply solve this linear program and recover an approximate solution for the SDP. By
(4.6), we know that this is an approximate solution with the same guarantee as the dual SDP.
Now, we analyze the runtime of this algorithm. This algorithm contains two phases: solve the dual SDP
via cutting plane method, and solve the primal linearPmprogram. Note that each step of the cutting plane
method involves nding a separating hyperplane of i=1 yi Ai ⪰ C .

Exercise 4.39. Let Ω = {y ∈ Rm : m i=1 yi Ai ⪰ C}. Show that one can implement the separation oracle
P
in time O∗ (Z + nω ) via eigenvalue computation.

Therefore, the rst phase takes O∗ (m(Z + nω + m2 )) time in total. Since the cutting plane method
takes O∗ (m) steps, we have |S| = O∗ (m). In the second phase, we need to solve a linear program (4.7) with
O∗ (m) variables with O(m) constraints. It is known how to solve such linear programs in time O∗ (m2.38 )
[18]. Hence, the total cost is dominated by the rst phase

O∗ mZ + mnω + m3 .


Problem 4.40. In the rst phase, each step involves computing an eigenvector of similar matrices. So, can
we use matrix update formulas to decrease
 the cost per step in the cutting plane to O (Z + n )? Namely,
∗ 2

can we solve SDP in time O mZ + m ?


∗ 3
4.4. Composite Problem via Duality 62

4.4.3 Duality and Convex Hull


The problem minx g(x) + h(Ax) in general can be solved by a similar trick. To make this geometric picture
clearer, we consider its linear optimization version: min(x,t1 )∈epi g,(Ax,t2 )∈epi h t1 +t2 . To simplify the notation,
we consider the problem
min c⊤ x
x∈K1 ,M x∈K2

where M ∈ Rm×n , and we have convex sets K1 ⊂ Rn and K2 ⊂ Rm .


To be concrete, let us consider the following example. Let V1 be a set of students, V2 be a set of schools.
Each edge e ∈ E represents a possible choice of a student. Let we be the happiness of a school/student if
the student is assigned to that school. Suppose that every student can only be assigned to one school and
school b can accept cb students. Then, the problem can be formulated as
X X X
max we xe subject to x(a,b) ≤ 1 ∀a ∈ V1 , x(a,b) ≤ cb ∀b ∈ V2 .
xe ≥0
e∈E (a,b)∈E (a,b)∈E

This is the weighted b-matching problem. Typically, the number of students is much more than the number
of schools. Therefore, an algorithm with running time linear in the number of students is preferable. To
apply our framework, we let
X
K1 = {x ∈ RE , xe ≥ 0, x(a,b) ≤ 1 ∀a ∈ V1 },
(a,b)∈E
V2
K2 = {y ∈ R , yb ≤ cb },

and M ∈ RE → RV2 is the map (T x)b = a:(a,b)∈E x(a,b) .


P
To further emphasize its importance, consider some general examples here:
ˆ Linear programming: minAx=b,x≥0 c⊤ x: K1 = {x ≥ 0}, K2 = {b} and M = A.
ˆ Semidenite programming: minAi •X=bi ,X⪰0 C • X : K1 = {X ⪰ 0}, K2 = {b} and M : Rn×n → Rm
dened by (M X)i = Ai • X .
ˆ Matroid intersection: minx∈M1 ∩M2 1⊤ x: K1P = M1 and K2 = M2 , M = I .
ˆ Submodular minimization: K1 = {y ∈ Rn : i∈S yi ≤ f (S) for all S ⊂ [n]}, K2 = {y ≤ 0},P M = I.
ˆ Submodular ow: K1 = {φ ∈ RE , ℓe ≤ φe ≤ ue for all e ∈ E}, K2 = {y ∈ RV : i∈S yi ≤
f (S) for all S ⊂ [n]}, M is the incidence matrix.
In all of these examples, it is easy to compute the gradient of δK

1
and δK∗
2
. For the last three examples, it is
not clear how to compute gradient of δK1 and/or δK2 directly. Furthermore, in all examples, M maps from
a larger space to the same or smaller space. Therefore, it is good to take advantage of the smaller space.
Before [43], the standard way was to use the equivalence of ∇δK ∗
1
and ∇δK1 , and apply cutting plane
methods. With the running time of cutting plane methods, such an algorithm usually had theoretical running
time at least n5 and was of little practical value.
Now we rewrite the problem as we did in the beginning:

min c⊤ x = minn c⊤ x + δK1 (x) + δK2 (M x)


x∈K1 ,M x∈K2 x∈R

= minn max
m
c⊤ x + δK1 (x) + θ⊤ M x − δK

2
(θ)
x∈R θ∈R

= max
m
minn c⊤ x + δK1 (x) + θ⊤ M x − δK

2
(θ)
θ∈R x∈R

= max
m
−δK 1
(−c − M ⊤ θ) − δK

2
(θ).
θ∈R

Taking the dual has two benets. First, the number of variables is smaller. Second, the gradient oracle is
something we can compute eciently. Hence, cutting plane methods can be used to solve it in O∗ (mT + m3 )
where T is the time to evaluate ∇δK∗
1
and ∇δK

2
. The only problem left is to recover the primal x.
The key observation is the following lemma:
Lemma 4.41. Let xi ∈ K1 be the set of points output by the oracle

∇δK 1
during the cutting plane method.
Dene yi ∈ K2 similarly. Suppose that the cutting plane method ends with the guarantee that the additive
4.4. Composite Problem via Duality 63

error is less than ε. Then, we have that

min c⊤ x ≤ min c⊤ x ≤ min c⊤ x + ε


x∈K1 ,T x∈K2 x∈K
f1 ,T x∈K
f2 x∈K1 ,T x∈K2

where K
f1 = conv(xi ) and K
f2 = conv(yi ).

Proof. Let θi be the set of directions queried by the oracle for ∇δK ∗
1
and φi be the directions queried by
the oracle for ∇δK2 . We claim that xi ∈ ∇δK
∗ ∗
f1 (θ i ) and y i ∈ ∇δ ∗
K
f2 (φ i ) . Having this, the algorithm cannot
distinguish between K1 and K1 , and between K2 and K2 . Hence, the algorithm runs exactly the same, i.e.,
f f
uses the same sequence of points. Therefore, we get the same value c⊤ x. However, by the guarantee of
cutting plane method, we have that
min c⊤ x ≤ c⊤ x ≤ min c⊤ x + ε.
x∈K
f1 ,T x∈K
f2 x∈K1 ,T x∈K2

To prove the claim, we note that xi ∈ ∇δK


f (θi ). Note that K1 ⊂ K1 and hence minx∈K
∗ f ⊤
f1 θi x ≥
1
minx∈K1 θi⊤ x. Also, note that
xi = arg min θi⊤ x ∈ K
f1
x∈K1

and hence minx∈K f1 θi x ≤ minx∈K1 θi x. Therefore, we have that minx∈K


⊤ ⊤ ⊤
f1 θi x = minx∈K1 θi⊤ x. Therefore,
xi ∈ arg minx∈K1 θi⊤ x. This proves the claim for K
f1 . The proof for K
f2 is the same.

This reduces the problem into the form minx∈K f2 c x. For the second phase, we let zi = M xi ∈ R .
f1 ,T x∈K
⊤ m

Then, we have
X
min c⊤ x = min
P P c⊤ ( t i xi )
x∈K
f1 ,M x∈K
f2 ti ≥0,si ≥0,M i ti x i = i si yi
i
X
= min P
P ti · c⊤ xi .
ti ≥0,si ≥0, i ti zi = i si yi

Note that it takes O∗ (mZ) time to write down this linear program where Z is the number of non-zeros in
M . Next, we note that this linear program has O∗ (m) variables and m constraints. Therefore, we can solve
it in O∗ (m2.38 ) time.
Therefore, the total running time is
O∗ (m(T + m2 ) + (mZ + m2.38 )).
To conclude, we have the following theorem.
Theorem 4.42. Given convex sets K1 ⊂ Rn and K2 ⊂ Rm with m ≤ n and a matrix M : Rn → Rm with
∗ ∗
Z non-zeros, let T be the cost to compute ∇δK and ∇δK . Then, we can solve the problem
1 2

min c⊤ x
x∈K1 ,M x∈K2

in time O∗ (mT + mZ + m3 ).
Remark. We hid all sorts of terms in the log term hidden in O∗ such as the diameter of the set. Also this is
the number of arithmetic operations, not the bit complexity.
Going back to the school/student problem, this algorithm gives a running time of
O∗ (|V2 ||E| + |V2 |3 )
which is linear in the number of students!
In general, this statement says that if we can split a convex problem into two parts, with both being easy
to solve and one part having fewer variables, then we can solve the entire problem in time depending on the
smaller dimension.
Exercise 4.43. How fast can we solve minx∈∩ki=1 Ki c⊤ x given the oracles ∇δK

i
?
Chapter 5

Geometrization

In this chapter, we study techniques that further exploit the geometry of convex functions and associated
norms. Many of these techniques are eective in practice for large scale problems. We recall the following
relevant denitions.

5.1 Norms and Local Metrics


Any 0-symmetric convex body K induces the following norm:

∥x∥K = inf {t : x ∈ tK} .

For the usual Euclidean norm, the convex body is the Euclidean ball, B2n . Similarly, for an ℓp norm, the
convex body is the unit ℓp ball.
It is often useful to consider ane transformations, and the norms they induce, e.g., for a PSD matrix
A, we can dene the associated norm as follows:

∥x∥A = x⊤ Ax.

The convex body (unit ball) of this norm is an ellipsoid centered at zero and dened by the matrix A.
As we have encountered previously in this book, convex sets and functions have duals. For a convex body
K , the dual (or polar) is dened as follows:

K ∗ = {y : ∀x ∈ K, ⟨x, y⟩ ≤ 1} .
The dual norm for ∥.∥K is ∥.∥K ∗ . We can state a generalized Cauchy-Schwarz inequality.

Fact 5.1. For x, y ∈ Rn and any centrally symmetric convex body K, we have ⟨x, y⟩ ≤ ∥x∥K ∥y∥K ∗ .

In this chapter, an important idea will be local norms, i.e., at each point x in the domain, there could
be a dierent norm. Indeed, in p a Riemannian metric M, for every x ∈ M,there is a matrix A(x) s.t. the
norm at x is dened as ∥v∥x = v ⊤ A(x)v. A special class of Riemannian metrics of particular interest for
us will be Hessian metrics (corresponding to Hessian manifolds). Here the matrix dening the local norm is
the Hessian of a convex, twice-dierentiable function, i.e.,

A(x) = ∇2 ϕ(x)

for some convex function ϕ.

5.2 Mirror Descent


The cutting plane method is well-suited for minimizing non-smooth convex functions with high accuracy.
However, its relatively large polynomial time complexity and quadratic space requirement are not favorable
for large scale problems. Here we discuss a dierent approach to minimize a non-smooth function with low
accuracy.

64
5.2. Mirror Descent 65

5.2.1 Subgradient method


In this section, we consider the constrained non-smooth minimization problem minx∈D f (x). Recall how to
dene gradient for non-smooth functions:
Denition 5.2. For any convex function f , we dene ∂f (x) be the set of vectors g such that
f (y) ≥ f (x) + g ⊤ (y − x) for all y ∈ Rn .

Such a vector g is called a subgradient of f at x.

Algorithm 13: SubgradientMethod


Input: Initial point x(0) ∈ Rn , step size h > 0.
for k = 0, 1, · · · , T − 2 do
Pick any g (k) ∈ ∂f (x(k) ).
y (k+1) ← x(k) − h · g (k) .
x(k+1) ← πD (y (k+1) ) where πD (y) = arg minx∈D ∥x − y∥2
end
return
PT −1
1
T k=0 x(k) .
Note that we return the average of all iterates, rather than the last iterate, as the average is better
behaved in the worst case. To analyze the algorithm, we rst need the following Pythagorean theorem. This
shows that when we project a point to a convex set, we get closer to each point in the convex set, and how
much closer depends on how much we move.
Lemma 5.3 (Pythagorean Theorem). Given a convex set D and a point y, let π(y) = arg minx∈D ∥x − y∥2 .
For any z ∈ D, we have that
∥z − π(y)∥22 + ∥π(y) − y∥22 ≤ ∥z − y∥22 .
Proof. Let h(t) = ∥(π(y) + t(z − π(y))) − y∥2 . Since π(y) is the closest point to y , h(t) must be minimized
at t = 0. So, we have that h′ (0) ≥ 0. i.e.,

h′ (0) = (π(y) − y)⊤ (z − π(y)) ≥ 0.


(If y is in the set, then the statement is trivial since π(y) = y . Otherwise, since π(y) is the closest point
in D to y , the hyperplane normal to π(y) − y through π(y) separates y from z .). Hence,

∥z − y∥22 = ∥z − π(y) + π(y) − y∥22


= ∥z − π(y)∥22 + 2(z − π(y))⊤ (π(y) − y) + ∥π(y) − y∥22
≥ ∥z − π(y)∥22 + ∥π(y) − y∥22 .

Now, we are ready to analyze the subgradient method. It basically involves tracking the squared distance
to the optimum, ∥x(k+1) − x∗ ∥22 .
Theorem 5.4. Let f be a convex function that is G-Lipschitz in ℓ2 norm. After T steps, the subgradient
method outputs a point x such that

∥x(0) − x∗ ∥22 h
f (x) ≤ f (x∗ ) + + G2
2hT 2
where x∗ is any point that minimizes f over D.
∥x(0) −x∗ ∥2
Remark. If the distance ∥x(0) − x∗ ∥ and the Lipschitz constant G are known, we can pick h = √
G T
and get
G · ∥x(0) − x∗ ∥2
f (x) ≤ f (x∗ ) + √ .
T
5.2. Mirror Descent 66

Proof. Let x∗ be any point that minimizes f . Then, we have

∥x(k+1) − x∗ ∥22 = ∥πD (y (k+1) ) − x∗ ∥22


≤ ∥y (k+1) − x∗ ∥22
= ∥x(k) − hg (k) − x∗ ∥22
D E
= ∥x(k) − x∗ ∥22 − 2h g (k) , x(k) − x∗ + h2 ∥g (k) ∥22 .

where we used Lemma 5.3 in the inequality. Since x∗ lies on the −g (k) direction, we expect x(k+1) is closer
to x∗ than x(k) if the step size is small enough (or if we ignore the second order term ∥g (k) ∥22 ). To bound
the distance improvement, we apply the denition of subgradient and get
D E
f (x∗ ) ≥ f (x(k) ) + g (k) , x∗ − x(k) .

Therefore, we have that

∥x(k+1) − x∗ ∥22 ≤ ∥x(k) − x∗ ∥22 − 2h · (f (x(k) ) − f (x∗ )) + h2 ∥g (k) ∥22


≤ ∥x(k) − x∗ ∥22 − 2h · (f (x(k) ) − f (x∗ )) + h2 G2 .

Note that this equation shows that if the error f (x(k) ) − f (x∗ ) is larger, then we move faster towards the
optimum. Rearranging the terms, we have
1  (k)  h
f (x(k) ) − f (x∗ ) ≤ ∥x − x∗ ∥22 − ∥x(k+1) − x∗ ∥22 + G2 .
2h 2
We sum over all iterations, to get
T −1
1 X 1 1 h
(f (x(k) ) − f (x∗ )) ≤ · (∥x(0) − x∗ ∥22 − ∥x(T ) − x∗ ∥22 ) + G2
T T 2h 2
k=0
∥x(0) − x∗ ∥22 h
≤ + G2 .
2hT 2
The result follows from observing that for a convex function,
T −1 T −1
!
1 X (k) 1 X
f x − f (x∗ ) ≤ (f (x(k) ) − f (x∗ )).
T T
k=0 k=0

5.2.2 Intuition of Mirror Descent


Consider using the subgradient method above to minimize f (x) = i |xi | over the unit ball B(0, 1). The
P
subgradient of f is given by sign(x) (assuming xi ̸= 0 for all i). Therefore, √ we have that the Lipshitz constant
is bounded by the Euclidean norm of a vector with ±1 entries, i.e., G = n and the set is contained p in a ball
of radius R = 1. Hence, by the main theorem of the previous section, the error is bounded by Tn after T
steps, which grows with the dimension. Intuitively, the dimension dependence comes from the fact that we
must take a tiny step size to avoid changing the variables too much. For example, if we take a constant step
size h and start at the point x = (1, n1 , n1 , · · · , n1 ), then we will get a point with xi constant in all directions
i ̸= 1, and this increases the function f dramatically from Θ(1) to Θ(nh). Therefore, to get constant error,
we need to take step size h ≤ n1 and hence we need Ω(n) iterations.
Conceptually, the step x = x − ηg does not make sense either. Imagine the problem is innitely dimen-
sional. Note that f (x) < +∞ for any
X
x ∈ ℓ1 = {x ∈ RN : |xi | < ∞}.
i
5.2. Mirror Descent 67

On the other hand, its gradient g lives in ℓ∞ space; the dual space of ℓ1 is ℓ∞ (in general, ℓp is dual to ℓq
where (1/p) + (1/q) = 1). Since x and g are not in the same space, the term x − ηg does not make sense.
More precisely, we have the following tautology (directly follows from the denition of dual space, namely
the set of all linear maps in the original space).

Denition 5.5. A Banach space over the reals is a vector space over the reals together with a norm
that denes a complete metric space, i.e., for any Cauchy sequence, X = (xi )∞
i=1 , there is a vector x s.t.
limn→∞ ∥xn − x∥ = 0.
Fact 5.6. Given any Banach space D over the reals and a continuously dierentiable function f from D to
R, its gradient ∇f (x) ∈ D∗ for any x.

In general, if the function f is on the primal space D, then the gradient g lives in the dual space D∗ .
Therefore, we need to map x from the primal space D to the dual space D∗ , update its position, then map
the point back to the original space D.
In fact, Lemma 5.6 gives us one such map, ∇f . Consider the following algorithm: Starting at x. We use
∇f (x) to map x to the dual space y = ∇f (x). Then, we apply the gradient step on the dual y (new) = y−∇f (x)
and map it back to the primal space, namely nding x(new) such that ∇f (x(new) ) = y (new) . Note that
y (new) = 0 and hence x(new) is exactly a minimizer of f . So, if we can map it back, this algorithm solves the
problem in one step. Unfortunately, the task of mapping it back is exactly our original problem.
Instead of using the same f , mirror descent uses some other convex function Φ, called the mirror map.
For constrained problems, the mirror map may not bring the point back to a point in D. Naively, one may
consider the algorithm

∇Φ(y (t+1) ) = ∇Φ(x(t) ) − h · ∇f (x(t) ),


x(t+1) = arg min ∥x − y (t+1) ∥2 .
x∈D

Note that the rst step of nding y (t+1) involves solving an optimization problem. We will show how to do
this optimization later (see 5.1) but with a proper formulation of the algorithm which takes into account the
distance as measured by the mirror map Φ.

Denition 5.7. For any strictly convex function Φ, we dene the Bregman divergence as
DΦ (y, x) = Φ(y) − Φ(x) − ⟨∇Φ(x), y − x⟩ .

Note that DΦ (y, x) is the error of the rst order Taylor expansion of Φ at x. Due to the convexity of Φ,
we have that DΦ (y, x) ≥ 0. Also, we note that DΦ (y, x) is convex in y , but not necessarily in x.

Example 5.8. DΦ (y, x) = ∥y − x∥2 for Φ(x) = ∥x∥2 . DΦ (y, x) = i yi log xyii − yi + xi for Φ(x) =
P P P
xi log xi .
P

Algorithm 14: MirrorDescent


Input: Initial point x(0) = arg minx∈D Φ(x), step size h > 0.
for k = 0, 1, · · · , T − 2 do
Pick any g (k) ∈ ∂f (x(k) ).
// As shown in (5.1), the next 2 steps can be implemented by an optimization
over DΦ .
Find y (k+1) such that ∇Φ(y (k+1) ) = ∇Φ(x(k) ) − h · g (k) .
x(k+1) ∈ πDΦ (k+1)
(y ) where πD
Φ
(y) = arg minx∈D DΦ (x, y).
end
return
PT −1
1
T k=0 x(k) .
The Mirror Descent step can be written as follows. In the second step below, we use the fact that the
5.2. Mirror Descent 68

optimization is only over x (and hence all terms in y can be ignored):

x(k+1) = arg min DΦ (x, y (k+1) )


x∈D

= arg min Φ(x) − Φ(y (k+1) ) − ∇Φ(y (k+1) )⊤ (x − y (k+1) )


x∈D

= arg min Φ(x) − ∇Φ(y (k+1) )⊤ x


x∈D

= arg min Φ(x) − ∇Φ(x(k) )⊤ x + hg (k)⊤ x


x∈D

= arg min hg (k)⊤ x + DΦ (x, x(k) ). (5.1)


x∈D

Note that this is a natural generalization of x(k+1) = arg minx∈D hg (k)⊤ x + ∥x − x(k) ∥2 .

5.2.3 Analysis of Mirror Descent


Lemma 5.9 (Pythagorean Theorem). Given a convex set D and a point y, let π(y) = arg min DΦ (x, y). For
any z ∈ D, we have that
DΦ (z, π(y)) + DΦ (π(y), y) ≤ DΦ (z, y).
Proof.Let h(t) = DΦ (π(y) + t(z − π(y)), y). Since h(t) is minimized at t = 0, we have that h′ (0) ≥ 0. Hence,
we have
h′ (0) = (∇Φ(π(y)) − ∇Φ(y))⊤ (z − π(y)) ≥ 0.
Hence,

DΦ (z, y) = DΦ (z, π(y)) + 2(z − π(y))⊤ (∇Φ(π(y)) − ∇Φ(y)) + DΦ (π(y), y)


≥ DΦ (z, π(y)) + DΦ (π(y), y).

Theorem 5.10. Let f G-Lipschitz convex function on D with respect to some norm ∥ · ∥. Let Φ be
be a
a ρ-strongly convex function on D with respect to ∥ · ∥ with squared diameter R2 = supx∈D Φ(x) − Φ(x(0) ).
Then, mirror descent outputs x such that

R2 h
f (x) − min f (x) ≤ + G2 .
x hT 2ρ

Remark 5.11. We say a function f is ρ strongly convex with respect to the norm ∥ · ∥ if for any x, y , we have
ρ
f (y) ≥ f (x) + ⟨∇f (x), y − x⟩ + ∥y − x∥2 .
2
The usual strong convexity is with respect to the Euclidean norm.
q q
Remark 5.12. Picking h = G T , we get the rate f (x) ≤ minx f (x) + GR
R 2ρ 2
ρT .

Proof. This proof is a complete mirror of the proof in Theorem 5.4.

DΦ (x∗ , x(k+1) ) ≤ DΦ (x∗ , y (k+1) ) − DΦ (x(k+1) , y (k+1) )


= DΦ (x∗ , x(k) ) + (x∗ − x(k) )⊤ (∇Φ(x(k) ) − ∇Φ(y (k+1) )) + DΦ (x(k) , y (k+1) ) − DΦ (x(k+1) , y (k+1) )
D E
= DΦ (x∗ , x(k) ) − h · g (k) , x(k) − x∗ + DΦ (x(k) , y (k+1) ) − DΦ (x(k+1) , y (k+1) )

where we used Lemma 5.9 in the inequality. Using the denition of subgradient, we have that
D E
f (x∗ ) ≥ f (x(k) ) + g (k) , x∗ − x(k) .
5.2. Mirror Descent 69

Therefore, we have that

DΦ (x∗ , x(k+1) ) ≤ DΦ (x∗ , x(k) ) − h · (f (x(k) ) − f (x∗ )) + DΦ (x(k) , y (k+1) ) − DΦ (x(k+1) , y (k+1) ).

Next we note that

DΦ (x(k) , y (k+1) ) − DΦ (x(k+1) , y (k+1) )


=Φ(x(k) ) − Φ(x(k+1) ) − ∇Φ(y (k+1) )⊤ (x(k) − x(k+1) )
D E ρ
≤ ∇Φ(x(k) ) − ∇Φ(y (k+1) ), x(k) − x(k+1) − ∥x(k) − x(k+1) ∥2
D E ρ 2
(k) (k) (k+1) (k) (k+1) 2
=h · g , x − x − ∥x − x ∥
2
ρ
≤hG∥x(k) − x(k+1) ∥ − ∥x(k) − x(k+1) ∥2
2
(hG)2
≤ .

Hence, we have

(hG)2
DΦ (x∗ , x(k+1) ) ≤ DΦ (x∗ , x(k) ) − h · (f (x(k) ) − f (x∗ )) + .

Rearranging the terms, we have

1  hG2
f (x(k) ) − f (x∗ ) ≤ DΦ (x∗ , x(k) ) − DΦ (x∗ , x(k+1) ) + .
h 2ρ
Summing over all iterations, we have
T −1
1 X 1 1 h
(f (x(k) ) − f (x∗ )) ≤ · (DΦ (x∗ , x(0) ) − DΦ (x∗ , x(T ) )) + G2
T T h 2ρ
k=0
1 h
≤ DΦ (x∗ , x(0) ) + G2
hT 2ρ
R2 h 2
≤ + G .
hT 2ρ

Using the fact that x(0) was chosen as a minimum of Φ(x),


D E
DΦ (x∗ , x(0) ) =(Φ(x∗ ) − Φ(x(0) ) − ∇Φ(x(0) ), x∗ − x(0) )
=Φ(x∗ ) − Φ(x(0) )
≤R2

The result follows from the convexity of f .

5.2.4 Multiplicative Weight Update


In this section, we discuss
P mirror descent under the map Φ(x) = xi log xi with the convex set being the
P
simplex D = {xi ≥ 0, xi = 1}.

Step Formula As we showed in (5.1), we have that


x(k+1) = arg min hg (k)⊤ x + DΦ (x, x(k) ).
x∈D
5.2. Mirror Descent 70

Note that
(k) (k) (k) (k)
X X X
DΦ (x, x(k) ) = xi log xi − xi log xi − (1 + log xi )(xi − xi )
i
X xi
= xi log (k)
xi
(k)
where we used that xi . Hence, the step is simply
P P
i xi = i
X xi
x(k+1) = arg P min hg (k)⊤ x + xi log (k)
.
xi =1,xi ≥0 xi

Note that the optimality condition is given by


(k+1)
(k) xi
hgi + log (k)
+1−λ=0
xi

for some Lagrangian multiplier λ. Rewriting, we have


(k)
(k+1) (k)
xi = e−hgi xi /Z

for some normalization constant Z . Note that this algorithm multiplies the current x with a multiplicative
factor and hence it is also called multiplicative weight update.

Strong Convexity To bound the strong convexity parameter, we note that


1
Φ(y) − Φ(x) − ⟨∇Φ(x), y − x⟩ = (y − x)⊤ ∇2 Φ(ζ)(y − x)
2
for some ζ between x and y . Since ∇2 Φ(ζ) = ζ1 , we have

X (yi − xi )2 ( i |yi − xi |)2


P X
(y − x)⊤ ∇2 Φ(ζ)(y − x) = ≥ P =( |yi − xi |)2
ζi i ζ i i

where we used that yi = 1 and so = 1. Hence,


P P P
i xi = i i ζi

1
Φ(y) − Φ(x) − ⟨∇Φ(x), y − x⟩ ≥ ∥y − x∥21 .
2
Therefore, Φ is 1-strongly convex in ∥ · ∥1 . Hence, ρ = 1.

Diameter Direct calculation shows that − log n ≤ Φ(x) ≤ 0. We start at x(0) = n1 (1, . . . , 1)T . Hence,
R2 = log n.

Result
Theorem
P 5.13. Let f be a 1-Lipschitz function on ∥ · ∥1 . Then, mirror descent with the mirror map
Φ(x) = i xi log xi . r
2 log n
T
f (x ) − min f (x) ≤
.
T x

In comparison, projected gradient descent had the bound of Tn .


p
5.3. FrankWolfe 71

5.3 FrankWolfe
Mirror descent is not suitable for all spaces. The guarantee of mirror descent crucially depends on the fact
that there is a 1-strongly convex mirror map Φ such that maxx Φ(x) − minx Φ(x) is small on the domain.
For some domains such as {x : ∥x∥∞ ≤ 1}, the range maxx Φ(x) − minx Φ(x) can be large.
Lemma 5.14. Let Φ be a 1-strongly convex function on Rn over ∥ · ∥ ∞ ≤ 1. Then,

n
max Φ(x) ≥ min Φ(x) + .
∥x∥∞ ≤1 ∥x∥∞ ≤1 2

Proof. By the strong convexity of Φ, we have that


1
Esk ∈{±1} ∀k∈[n] Φ(s1 , · · · , sn ) ≥ Esk ∈{±1} ∀k∈[n−1] Φ(s1 , · · · , sn−1 , 0) + Esn ⟨∇Φ(s1 , · · · , sn−1 , 0), sn ⟩ +
2
1
= Esk ∈{±1} ∀k∈[n−1] Φ(s1 , · · · , sn−1 , 0) +
2
..
.
n
≥ Φ(0) + .
2

Remark. This inequality is tight because 12 ∥x∥2 is 1-strongly convex in ∥ · ∥∞ and its value is between 0 and
2.
n

5.3.1 Algorithm
Now we give another geometry dependent algorithm that relies on a dierent set of assumptions. The
problem we study in this section is of the form

min f (x)
x∈D

for f such that ∇f is Lipschitz in a certain sense. The algorithm reduces the problem to a sequence of
linear optimization problems.
Algorithm 15: FrankWolfe
Input: Initial point x(0) ∈ Rn , step size h > 0.
for k = 0, 1, · · · , T − 1 do
Compute y (k) = arg miny∈D y, ∇f (x(k) ) .

x(k+1) ← (1 − hk )x(k) + hk y (k) with hk = k+2


2
.
end
return x(T ) .

Analysis
Theorem 5.15. Let f be a convex function on a convex set D with a constant Cf such that

1
f ((1 − h)x + hy)) ≤ f (x) + h ⟨∇f (x), y − x⟩ + Cf h2 .
2
Then, for any x, y ∈ D and h ∈ [0, 1], we have

2Cf
f (x(k) ) − f (x∗ ) ≤ .
k+2
Remark. If ∇f is L-Lipschitz with respect to the norm ∥ · ∥ over the domain D, then Cf ≤ L · diam∥·∥ (D)2 .
5.4. The Newton Method 72

Proof. By the denition of Cf , we have that


D E 1
f (x(k+1) ) ≤ f (x(k) ) + hk ∇f (x(k) ), y (k) − x(k) + Cf h2k .
2
Note that D E D E
f (x∗ ) ≥ f (x(k) ) + ∇f (x(k) ), x∗ − x(k) ≥ f (x(k) ) + ∇f (x(k) ), y (k) − x(k)

where we used the fact that y (k) = arg miny∈D y, ∇f (x(k) ) and that x∗ ∈ D. Hence, we have that

1
f (x(k+1) ) ≤ f (x(k) ) − hk (f (x(k) ) − f (x∗ )) + Cf h2k .
2
Let ek = f (x(k) ) − f (x∗ ). Then,
1
ek+1 ≤ (1 − hk )ek + Cf h2k .
2
2Cf
Note that e0 = f (x(0) ) − f ∗ ≤ 12 Cf . By induction, we have that ek ≤ k+2 .

Remark. Note that this proof is in fact the same as Theorem 2.9.

5.4 The Newton Method


We begin with the Newton-Raphson method for nding the zeros of a function g , i.e. nding x such that
g(x) = 0. In optimization, the goal is to nd a root of the gradient, ∇f (x) = 0. The Newton-Raphson
iteration approximates a function by its gradient/Jacobian, then guesses where the gradient intersects the
zero line as a likely zero, repeating this process until a sucient approximation has been found. In one
dimension, we approximate the function g by its gradient

g(x) ∼ g(x(k) ) + g ′ (x(k) )(x − x(k) ).

Finding the zeros of the right hand side, we have the Newton step

g(x(k) )
x(k+1) = x(k) − .
g ′ (x(k) )

In high dimension, we can approximate the function by its Jacobian g(x) ∼ g(x(k) ) + Dg(x(k) )(x − x(k) ) and
this gives the step
 −1
x(k+1) = x(k) − Dg(x(k) ) g(x(k) ).

When the function g(x) = ∇f (x), then the Newton step becomes
 −1
x(k+1) = x(k) − ∇2 f (x(k) ) ∇f (x(k) ).

Alternatively, we can derive the Newton step as


D E 1
x(k+1) ← min f (x(k) ) + ∇f (x(k) ), x − x(k) + (y − x(k) )T (∇2 f (x(k) ))(y − x(k) ).
x 2
In words, the Newton step nds the minimum of the second order approximation of the function. We note
that the Newton iteration is a natural idea, since
1. It is ane invariant (changing coordinate systems does not aect convergence).
2. It is the fastest descent direction taking the Hessian into account.
5.4. The Newton Method 73

To see why ane-invariance is important for optimization, we consider the following function
100 2 1 2
f (x1 , x2 ) = x + x .
2 1 2 2
The gradient descent for this function is
(x1 , x2 ) ← (x1 , x2 ) − h∇f (x1 , x2 ) = ((1 − 100h)x1 , (1 − h)x2 ).
We need h < 100 1
in order the rst coordinate to converge, but this will make the second coordinate converges
too slowly. In general, we may want to take dierent step sizes for dierent directions and Newton method
gives the best step if the function is quadratic.
For many classes of functions, gradient methods converge to the solution linearly (namely, it takes
c · log 1ϵ iterations for some c depending on the problem) while the Newton method converges to the solution
quadratically (namely, it takes c′ · log log 1ϵ for some c′ ) if the starting point is suciently close to a root.
However, each step of Newton method involves solving a linear system, which can be much more expensive.
Furthermore, Newton method may not converges if the starting point is far away.
Algorithm 16: NewtonMethod
Input: Initial point x(0) ∈ Rn
for k = 0, 1, · · · , T − 1 do
−1
x(k+1) = x(k) − Dg(x(k) ) g(x(k) ).
end
return x(T ) .
Theorem 5.16 (Quadratic convergence). Assume that g : Rn → Rn is twice continuously dierentiable.
(k) (k) ∗
Let x be the sequence given by the Newton method. Suppose that x converges to some x such that
g(x∗ ) = 0 and Dg(x∗ ) is invertible. Then, for k large enough, we have

∥x(k+1) − x∗ ∥ ≤ ∥(Dg(x∗ ))−1 D2 g(x∗ )∥op · ∥x(k) − x∗ ∥2 .


Proof. Let e(k) = x∗ − x(k) . Then, we have that
Z 1
g(x∗ ) = g(x(k) ) + Dg(x(k) )[e(k) ] + (1 − s)D2 g(x(k) + se(k) )[e(k) , e(k) ]ds.
0

Hence, we have
Z 1
(k) −1 ∗
0 = Dg(x ) g(x (k)
)+x −x (k)
+ (1 − s)Dg(x(k) )−1 D2 g(x(k) + se(k) )[e(k) , e(k) ]ds.
0

Since x (k+1)
=x (k)
− Dg(x g(x ), we have
(k) −1
) (k)

Z 1

x (k+1)
−x = (1 − s)Dg(x(k) )−1 D2 g(x(k) + se(k) )[e(k) , e(k) ]ds.
0

So, we have
Z 1

∥x (k+1)
−x ∥≤ (1 − s)∥Dg(x(k) )−1 D2 g(x(k) + se(k) )[e(k) , e(k) ]∥ds
0
Z 1
≤ (1 − s)∥Dg(x(k) )−1 D2 g(x(k) + se(k) )∥op ds · ∥x(k) − x∗ ∥2 .
0

Since x(k) → x∗ , we have that


Dg(x(k) )−1 D2 g(x(k) + se(k) ) → Dg(x∗ )−1 D2 g(x∗ )
and for large enough k , we can assume
∥Dg(x(k) )−1 D2 g(x(k) + se(k) )∥op ≤ 2∥Dg(x∗ )−1 D2 g(x∗ )∥op .
The result follows.
5.4. The Newton Method 74

This argument above uses only that g ∈ C 2 and does not require convexity. Without further global
assumptions on g , the Newton method does not always converge to a root. The argument above only shows
that if the algorithm converges, then it converges quadratically eventually. We call this local convergence
since it only gives a bound when x(k) is close enough to x∗ . In comparison, all earlier analyses in this
book are about global convergence, bounding the total number of iterations. In practice, both analyses are
important; global convergence makes sure the algorithm is robust and local quadratic convergence makes sure
the algorithm converges to machine accuracy quickly√ enough. The local quadratic convergence is particularly
important for simple problems such as computing x.

Finding the largest root of a real-rooted polynomial


In general, the Newton method does not converge globally. Here, we give an application that does converge
globally.
Theorem 5.17. Given a polynomial g(x) = ni=1 ai xi with only real roots. Let λ1 be its largest root.
P
(0) 1 (k)
Suppose that λ1 ≤ x . Then, after O(n log( ϵ )) iterations, the Newton method nds x such that

λ1 ≤ x(k) ≤ λ1 + ε · (x(0) − λ1 ).
Proof. Since the roots of g are real, we can write
n
Y
g(x) = an · (x − λi )
i=1

with λn ≤ λn−1 ≤ · · · ≤ λ1 . Then, we have that g ′ (x) = an · − λj ) and hence


P Q
i j̸=i (x

g(x) 1

=P 1 .
g (x) i x−λi

For any x ≥ λ1 , we have


g(x) 1
≤ 1 = x − λ1 ,
g ′ (x) x−λ1
g(x) 1 x − λ1
≥P 1 = .
g ′ (x) i x−λ1 n

Hence, by induction, we have that x(k) ≥ λ1 every step and that


1
x(k+1) − λ1 ≤ (1 − )(x(k) − λ1 ).
n

Exercise 5.18. Consider the following iteration:


1 g (k−1) (x)
xt+1 = xt −
n1/k g (k) (x)

where g (k) (x) = i (x−λi )k .


1
For k = 1, this is exactly the Newton iteration as g (0) (x) = n. Show that when
P

applied to a degree n real-rooted polynomial, starting with x(0) > λ1 , in each iteration the distance to the
largest root decreases by a factor of 1 − n1/k
1
.
Such a dependence of log 1ϵ is called linear convergence. The convergence of the Newton method can be
quadratic, when close enough to a root.
Theorem 5.19. Assume that |f ′ (x∗ )| ≥ α at f (x∗ ) = 0 and f ' is L-Lipschitz. Then, if |x0 − x∗ | ≤ α
2L ,
 2(k)
α L (k)
|x(k) − x∗ | ≤ |x − x∗ | .
L α
5.4. The Newton Method 75

Proof. The iteration gives

x^{(k+1)} − x^* = x^{(k)} − x^* − f(x^{(k)})/f′(x^{(k)}).

By the fundamental theorem of calculus,

0 = f(x^*) = f(x^{(k)}) + ∫_{x^{(k)}}^{x^*} f′(z) dz = f(x^{(k)}) + f′(x^{(k)})(x^* − x^{(k)}) + ∫_{x^{(k)}}^{x^*} (f′(z) − f′(x^{(k)})) dz.

Therefore,

|x^{(k+1)} − x^*| = (1/|f′(x^{(k)})|) · |∫_{x^{(k)}}^{x^*} (f′(z) − f′(x^{(k)})) dz|
  ≤ (L/|f′(x^{(k)})|) · |∫_{x^{(k)}}^{x^*} |z − x^{(k)}| dz|
  = L|x^* − x^{(k)}|² / (2|f′(x^{(k)})|).

Since |f′(x^{(k)})| ≥ |f′(x^*)| − L|x^{(k)} − x^*| ≥ α − L|x^{(k)} − x^*|,

|x^{(k+1)} − x^*| ≤ (L / (2(α − L|x^{(k)} − x^*|))) · |x^{(k)} − x^*|².

So, if |x^{(k)} − x^*| ≤ α/(2L),

|x^{(k+1)} − x^*| ≤ (L/α)|x^{(k)} − x^*|² ≤ (1/2)|x^{(k)} − x^*|

and

(L/α)|x^{(k+1)} − x^*| ≤ ((L/α)|x^{(k)} − x^*|)².

After k steps, with ε₀ = |x^{(0)} − x^*|,

(L/α)|x^{(k)} − x^*| ≤ ((L/α)ε₀)^{2^k},

implying |x^{(k)} − x^*| < ε after log log(Lε₀/(αε)) steps. Note that f has not been assumed to be convex or polynomial.
Moving back to optimization, we can view the goal as finding a root of ∇f(x) = 0. Newton's iteration is the update

x^{(k+1)} = x^{(k)} − (∇²f(x^{(k)}))^{-1}∇f(x^{(k)}).

By the above proof, Newton's iteration has quadratic convergence from points close enough to the optimum.

Quasi-Newton Method
When the Jacobian of g is not available, we can approximate it using the function itself. In one dimension, we can approximate the Newton method and get the secant method

x^{(k+1)} = x^{(k)} − g(x^{(k)}) · (x^{(k)} − x^{(k−1)}) / (g(x^{(k)}) − g(x^{(k−1)})),

where we approximate g′(x^{(k)}) by (g(x^{(k)}) − g(x^{(k−1)}))/(x^{(k)} − x^{(k−1)}). For nice enough functions, the convergence rate satisfies ε_{k+1} ≤ C · ε_k^{(1+√5)/2}, which is superlinear but not quadratic.
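As a minimal sketch (the test function and starting points below are our own illustrative choices):

import numpy as np

# Secant method: approximate g'(x) by a finite difference of the last
# two iterates; no derivative of g is ever evaluated.
def secant(g, x0, x1, iters=20):
    for _ in range(iters):
        denom = g(x1) - g(x0)
        if denom == 0:                     # converged (or stalled)
            break
        x0, x1 = x1, x1 - g(x1) * (x1 - x0) / denom
    return x1

root = secant(lambda x: x**3 - 2.0, 0.0, 2.0)   # solves x^3 = 2
print(root, 2.0 ** (1.0 / 3.0))                 # both ~ 1.2599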

For higher dimensions, we need to approximate the Jacobian of g. Let J^{(k)} be the approximate Jacobian we maintain in the k-th iteration. Similar to the secant method, we want to enforce

J^{(k+1)} · (x^{(k)} − x^{(k−1)}) = g(x^{(k)}) − g(x^{(k−1)}).


In dimension higher than one, this does not uniquely define J^{(k+1)}. One natural choice is to find the J^{(k+1)} closest to J^{(k)} that satisfies the equation above. Solving the problem

min_{J^{(k+1)}·(x^{(k)}−x^{(k−1)}) = g(x^{(k)})−g(x^{(k−1)})} ∥J^{(k+1)} − J^{(k)}∥_F,   (5.2)

we have the update rule

J^{(k+1)} = J^{(k)} + ((y^{(k)} − J^{(k)}s^{(k)}) / ∥s^{(k)}∥²) · s^{(k)⊤}   (5.3)

where s^{(k)} = x^{(k)} − x^{(k−1)} and y^{(k)} = g(x^{(k)}) − g(x^{(k−1)}).

Exercise 5.20. Prove that (5.3) is indeed the minimizer of (5.2).
When g is given by the gradient of a convex function f, we know that the Jacobian of g satisfies Dg = ∇²f(x) ⪰ 0. Therefore, we should impose some conditions so that J^{(k)} ⪰ 0 for all k.

BFGS algorithm
The Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is one of the most popular quasi-Newton methods. The algorithm maintains an approximate Hessian J^{(k)} such that
• J^{(k+1)}(x^{(k)} − x^{(k−1)}) = ∇f(x^{(k)}) − ∇f(x^{(k−1)}),
• J^{(k+1)} is close to J^{(k)},
• J^{(k)} ≻ 0.
To achieve all of these conditions, the natural optimization is

J^{(k+1)} := arg min_{Js=y, J=J^⊤} ∥W^{1/2}(J^{-1} − (J^{(k)})^{-1})W^{1/2}∥²_F

where s^{(k)} = x^{(k)} − x^{(k−1)}, y^{(k)} = ∇f(x^{(k)}) − ∇f(x^{(k−1)}) and W = ∫₀¹ ∇²f(x^{(k−1)} + s(x^{(k)} − x^{(k−1)})) ds (or any W such that Ws = y). In some sense, the W^{1/2} is just the correct change of variables so that the algorithm is affine invariant. Solving the problem above [25], one obtains the update

(J^{(k+1)})^{-1} = (I − ρₖ · s^{(k)}y^{(k)⊤}) (J^{(k)})^{-1} (I − ρₖ · y^{(k)}s^{(k)⊤}) + ρₖ · s^{(k)}s^{(k)⊤}   (5.4)

where ρₖ = 1/(y^{(k)⊤}s^{(k)}). Alternatively, one can also show that [26]

J^{(k+1)} = arg min_{Js=y, J=J^⊤} D_KL(N(0, J) ∥ N(0, J^{(k)})).

To implement the BFGS algorithm, it suffices to compute (J^{(k)})^{-1}∇f(x^{(k)}). Therefore, we can directly use the recursive formula (5.4) to compute (J^{(k)})^{-1}∇f(x^{(k)}) instead of maintaining J^{(k)} or (J^{(k)})^{-1} explicitly.
In practice, the recursion for J^{(k)} becomes too expensive, and hence one can stop the recursion after a constant number of steps, which gives the limited-memory BFGS (L-BFGS) algorithm.
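The following is a minimal sketch of BFGS using the inverse-Hessian update (5.4). The quadratic test problem and the exact line search are our own choices for illustration; on a quadratic, BFGS with exact line search finds the minimizer in at most n steps.

import numpy as np

# BFGS on f(x) = 0.5 x^T A x - b^T x, maintaining H = (J^(k))^{-1}.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = np.zeros(2)
H = np.eye(2)                             # current inverse-Hessian estimate
g = A @ x - b                             # gradient of f
for _ in range(5):
    if np.linalg.norm(g) < 1e-12:
        break
    d = -H @ g                            # quasi-Newton direction
    alpha = -(g @ d) / (d @ (A @ d))      # exact line search for a quadratic
    x_new = x + alpha * d
    g_new = A @ x_new - b
    s, y = x_new - x, g_new - g
    rho = 1.0 / (y @ s)
    I = np.eye(2)
    # update (5.4): H <- (I - rho s y^T) H (I - rho y s^T) + rho s s^T
    H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)
    x, g = x_new, g_new
print(x, np.linalg.solve(A, b))           # both equal the minimizer A^{-1} b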

5.5 Interior Point Method for Linear Programs


In this section, we study interior point methods. In practice, this method reduces the problem of optimizing a convex function to solving a small number (typically fewer than 30) of linear systems. In theory, it reduces the problem to Õ(√n) linear systems.
We start by describing the interior point method for linear programs. We establish a polynomial bound and later discuss implementation details.

5.5.1 Basic Properties


We first consider the primal problem

(P): min_x c^⊤x subject to Ax = b, x ≥ 0

where A ∈ R^{m×n}. The difficulty of linear programs is the constraint x ≥ 0; without it, we could simply solve a linear system. One natural idea for solving linear programs is to replace the hard constraint x ≥ 0 by some smooth function. So, let us consider the following regularized version of the linear program for some t ≥ 0:

(P_t): min_x c^⊤x − t·Σᵢ₌₁ⁿ ln xᵢ subject to Ax = b.

We will explain the reason for choosing ln x in more detail later. For now, we can think of it as a nice function that blows up to ∞ as x approaches zero.
One can think of −ln x as exerting a force from every constraint x ≥ 0 to make sure x ≥ 0 holds. Since the gradient of −ln x blows up at x = 0, when x is close enough to zero the force is large enough to counter the cost c. When t → 0, the problem (P_t) approaches the original problem (P), and hence the minimizer of (P_t) approaches a minimizer of (P).
First, we give a formula for the minimizer of (P_t).

Lemma 5.21 (Existence and Uniqueness of central path). If the polytope {Ax = b, x ≥ 0} has an interior, then the optimum of (P_t) is unique and is given by the solution of the following system:

xs = t,
Ax = b,
A^⊤y + s = c,
(x, s) ≥ 0,

where the variables sᵢ are additional slack variables, and xs = t is shorthand for xᵢsᵢ = t for all i.

Proof. The optimality condition, using dual variables y for the Lagrangian of Ax = b, is given by

c − t/x = A^⊤y.

Write sᵢ = t/xᵢ to get the formula. The solution is unique because the function −ln x is strictly convex.

Definition 5.22. We define the central path C_t = (x^{(t)}, y^{(t)}, s^{(t)}) as the family of points satisfying

x^{(t)}s^{(t)} = t,
Ax^{(t)} = b,
A^⊤y^{(t)} + s^{(t)} = c,
(x^{(t)}, s^{(t)}) ≥ 0.

To give another interpretation of the central path, note that the dual problem is

(D): max_{y,s} b^⊤y subject to A^⊤y + s = c, s ≥ 0.

Note that for any feasible x and (y, s), we have that

0 ≤ c^⊤x − b^⊤y = c^⊤x − x^⊤A^⊤y = x^⊤s.

Hence, (x, y, s) solves the linear program if it satisfies the central path equations with t = 0. Therefore, following the central path is a balanced way to decrease xᵢsᵢ uniformly to 0. We can formalize the intuition that for small t, x^{(t)} is a good approximation of the primal solution. In fact, t itself is a bound on the error of the current solution.

Lemma 5.23 (Duality Gap). We have that

Duality Gap = c^⊤x^{(t)} − b^⊤y^{(t)} = c^⊤x^{(t)} − (x^{(t)})^⊤A^⊤y^{(t)} = (x^{(t)})^⊤s^{(t)} = tn.

The interior point method follows the following framework:
1. Find C₁.
2. Until t < ε/n,
   (a) use C_t to find C_{(1−h)t} for h = 1/(10√n).
Note that this algorithm only finds a solution with ε error. If the linear program is integral, we can simply stop at small enough ε and round to the closest integral point.

5.5.2 Following the central path


During the algorithm, we maintain a point (x, y, s) such that Ax = b, A^⊤y + s = c and xᵢsᵢ is close to t for all i. We show how to find a feasible (x + δ_x, y + δ_y, s + δ_s) that is even closer to the central path. We can write the equations as follows:

(x + δ_x)(s + δ_s) ≈ t,
A(x + δ_x) = b,
A^⊤(y + δ_y) + (s + δ_s) = c

(omitting the non-negativity conditions). Using our assumption on (x, y, s) and noting that δ_x · δ_s is small, the equations can be simplified as follows, where X = Diag(x) and S = Diag(s):

[0, A^⊤, I; A, 0, 0; S, 0, X] · (δ_x, δ_y, δ_s) = (0, 0, t − xs).

This is a linear system and hence we can solve it exactly.
This is a linear system and hence we can solve it exactly.
Exercise 5.24. Let r = t−xs. Prove that Sδx = (I−P )r and Xδs = P r where P = XA⊤ (AS −1 XA⊤ )−1 AS −1 .
First, we show that x(new) = x + δx and s(new) = s + δs are feasible.
Lemma 5.25. Suppose Σᵢ (xᵢsᵢ − t)² ≤ ε²t² with ε < 1/2. Then x_i^{(new)} > 0 and s_i^{(new)} > 0 for all i.

Proof. Note that P² = P. However, in general P ≠ P^⊤, i.e., P might not be an orthogonal projection matrix. It will be convenient to consider the orthogonal projection matrix P̄ = S^{-1/2}X^{1/2}A^⊤(AS^{-1}XA^⊤)^{-1}AS^{-1/2}X^{1/2}. Note that

X^{-1}δ_x = S^{-1/2}X^{-1/2}(I − P̄)S^{-1/2}X^{-1/2}r.

By the assumption, xᵢsᵢ ≥ (1 − ε)t for each i. Therefore, we have

∥X^{-1}δ_x∥₂ ≤ (1/√((1−ε)t)) · ∥(I − P̄)S^{-1/2}X^{-1/2}r∥₂
  ≤ (1/√((1−ε)t)) · ∥S^{-1/2}X^{-1/2}r∥₂
  ≤ (1/((1−ε)t)) · ∥r∥₂ ≤ ε/(1−ε).

Similarly, we have S^{-1}δ_s = S^{-1/2}X^{-1/2}P̄S^{-1/2}X^{-1/2}r and hence ∥S^{-1}δ_s∥₂ ≤ ε/(1−ε). Therefore, when ε < 1/2, both ∥X^{-1}δ_x∥_∞ and ∥S^{-1}δ_s∥_∞ are less than 1, which shows that both x^{(new)} and s^{(new)} are positive.
Next, we show that xs is closer to t after one Newton step.

Lemma 5.26. If Σᵢ (xᵢsᵢ − t)² ≤ ε²t² with ε < 1/4, we have that

Σᵢ (x_i^{(new)} s_i^{(new)} − t)² ≤ (ε⁴ + 16ε⁵) t².

Proof. We have that xᵢδ_{s,i} + sᵢδ_{x,i} = t − xᵢsᵢ. Using this,

LHS = Σᵢ (x_i^{(new)} s_i^{(new)} − t)² = Σᵢ (xᵢsᵢ + xᵢδ_{s,i} + sᵢδ_{x,i} + δ_{x,i}δ_{s,i} − t)² = Σᵢ δ_{x,i}²δ_{s,i}² ≤ ((1+ε)t)² · Σᵢ (δ_{x,i}/xᵢ)²(δ_{s,i}/sᵢ)²,

where in the last step we used xᵢ²sᵢ² ≤ (1+ε)²t². Using Cauchy–Schwarz and the previous lemma, we have that

LHS ≤ ((1+ε)t)² · ∥X^{-1}δ_x∥₄² ∥S^{-1}δ_s∥₄²
  ≤ ((1+ε)t)² · ∥X^{-1}δ_x∥₂² ∥S^{-1}δ_s∥₂²
  ≤ ((1+ε)t)² · (ε/(1−ε))⁴
  ≤ (ε⁴ + 16ε⁵) t².

Using this, we have the main theorem.

Theorem 5.27. We can solve a linear program to within δ error (see Lemma 5.28) in O(√n log(1/δ)) iterations, where each iteration only needs to solve one linear system.

Proof. Let Φ = Σᵢ (xᵢsᵢ − t)² be the error of the current iteration. We always maintain Φ ≤ t²/16 for the current (x, y, s) and t. At each step, we use Lemma 5.26, which makes Φ ≤ t²/50. Then, we decrease t to t(1 − h) with h = 1/(10√n). Note that

Σᵢ (xᵢsᵢ − t(1−h))² ≤ 2Φ + 2t²h²n ≤ 2t²/50 + 2t²/100 ≤ (t(1−h))²/16.

Therefore, the invariant is preserved after each step. Since t is decreased by a (1 − 1/(10√n)) factor each step, it takes O(√n log(1/δ)) iterations to decrease t from 1 to δ².
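The centering step is short enough to sketch in full. Below is a minimal numerical illustration of the short-step method, assuming we are handed a point exactly on the central path (the tiny random instance is our own choice; a complete solver would also need the initialization of Lemma 5.28 and a rounding phase).

import numpy as np

# One centering step per t-update: given (x, y, s) with Ax = b,
# A^T y + s = c and xs ~ t, solve the linear system of Section 5.5.2.
rng = np.random.default_rng(0)
m, n = 2, 5
A = rng.standard_normal((m, n))
x = np.ones(n); s = np.ones(n)             # xs = 1, so we start at t = 1
b = A @ x
y = np.zeros(m); c = A.T @ y + s           # (x, y, s) is exactly on the path
t = 1.0
for _ in range(200):
    t *= 1 - 1 / (10 * np.sqrt(n))         # t <- (1 - h) t
    r = t - x * s
    # Solve: S dx + X ds = r, A dx = 0, A^T dy + ds = 0.
    # Eliminating ds = -A^T dy gives (A S^{-1} X A^T) dy = -A S^{-1} r.
    Sinv = 1 / s
    dy = np.linalg.solve(A @ np.diag(Sinv * x) @ A.T, -A @ (Sinv * r))
    ds = -A.T @ dy
    dx = Sinv * (r - x * ds)
    x, y, s = x + dx, y + dy, s + ds
print(x @ s, n * t)                        # duality gap x^T s ~ n t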

5.5.3 Finding the initial point


The rst question is to nd C1 . This can be handled by extending the problem to slightly higher dimension.
To the reader familiar with the Simplex method, this might be reminiscent of the two phases of the simplex
method, where the purpose of the rst phase is to nd a feasible initial solution.

Lemma 5.28. Consider a linear program min_{Ax=b, x≥0} c^⊤x with n variables and d constraints. Assume that:
1. Diameter: for any x ≥ 0 with Ax = b, we have ∥x∥_∞ ≤ R.
2. Lipschitz constant of the objective: ∥c∥_∞ ≤ L.
For any 0 < δ ≤ 1, the modified linear program min_{Āx̄=b̄, x̄≥0} c̄^⊤x̄ with

Ā = [A, 0, (1/R)b − A·1ₙ ; 1ₙ^⊤, 1, 0],  b̄ = ((1/R)b, n + 1),  c̄ = ((δ/L)·c, 0, 1)

satisfies the following:
1. x̄ = (1ₙ, 1, 1), ȳ = (0_d, −1) and s̄ = (1ₙ + (δ/L)·c, 1, 1) are feasible primal-dual vectors.
2. For any feasible primal-dual vectors (x̄, ȳ, s̄) with duality gap at most δ², the vector x̂ = R·x̄_{1:n} (x̄_{1:n} being the first n coordinates of x̄) is an approximate solution to the original linear program in the following sense:

c^⊤x̂ ≤ min_{Ax=b, x≥0} c^⊤x + LR·δ,
∥Ax̂ − b∥₁ ≤ 4nδ · (R·Σ_{i,j}|A_{i,j}| + ∥b∥₁),
x̂ ≥ 0.

Part 1. For the first result, straightforward calculations show that (x̄, ȳ, s̄) ∈ R^{n+2} × R^{d+1} × R^{n+2} are feasible, i.e.,

Āx̄ = (A·1ₙ + 0 + ((1/R)b − A·1ₙ), 1ₙ^⊤1ₙ + 1) = ((1/R)b, n + 1) = b̄

and, since Ā^⊤ȳ = (−1ₙ, −1, 0),

Ā^⊤ȳ + s̄ = (−1ₙ, −1, 0) + (1ₙ + (δ/L)·c, 1, 1) = ((δ/L)·c, 0, 1) = c̄.

Part 2. For the second result, we let

OPT = min_{Ax=b, x≥0} c^⊤x  and  ŌPT = min_{Āx̄=b̄, x̄≥0} c̄^⊤x̄.

For any optimal x ∈ Rⁿ of the original LP, consider the following x̄ ∈ R^{n+2}:

x̄ = ((1/R)x, n + 1 − (1/R)Σᵢ₌₁ⁿ xᵢ, 0)   (5.5)

and recall

c̄ = ((δ/L)·c, 0, 1) ∈ R^{n+2}.   (5.6)

We want to argue that x̄ is feasible for the modified LP. It is clear that x̄ ≥ 0 (by the diameter bound, (1/R)Σᵢxᵢ ≤ n); it remains to show Āx̄ = b̄ ∈ R^{d+1}. We have

Āx̄ = ((1/R)Ax, (1/R)Σᵢxᵢ + n + 1 − (1/R)Σᵢxᵢ) = ((1/R)b, n + 1) = b̄,

where the first step follows from the definition of Ā, and the last step follows from Ax = b and the definition of b̄.
Therefore, using the definition of x̄ in (5.5), we have that

ŌPT ≤ c̄^⊤x̄ = (δ/L)·c^⊤((1/R)x) = (δ/(LR))·c^⊤x = (δ/(LR))·OPT,   (5.7)

where the first step follows from the modified program being a minimization problem, the second step follows from the definitions of x̄ (5.5) and c̄ (5.6), and the last step follows from x being an optimal solution of the original linear program.
Given feasible (x̄, ȳ, s̄) ∈ R^{n+2} × R^{d+1} × R^{n+2} with duality gap δ², we can write x̄ = (x̄_{1:n}, τ, θ) for some τ ≥ 0 and θ ≥ 0, so that c̄^⊤x̄ = (δ/L)·c^⊤x̄_{1:n} + θ. Then, we have

(δ/L)·c^⊤x̄_{1:n} + θ ≤ ŌPT + δ² ≤ (δ/(LR))·OPT + δ²,   (5.8)

where the first step follows from the definition of the duality gap and the last step follows from (5.7).
Hence, we can upper bound the value of x̂ as follows:

c^⊤x̂ = R·c^⊤x̄_{1:n} = (LR/δ)·(δ/L)·c^⊤x̄_{1:n} ≤ (LR/δ)·((δ/(LR))·OPT + δ²) = OPT + LR·δ,

where the first step follows from x̂ = R·x̄_{1:n} and the third step follows from (5.8).
Note that

(δ/L)·c^⊤x̄_{1:n} ≥ −(δ/L)∥c∥_∞∥x̄_{1:n}∥₁ ≥ −δ∥x̄_{1:n}∥₁ ≥ −2nδ,   (5.9)

where the second step follows from ∥c∥_∞ ≤ L and the last step follows from the constraint 1ₙ^⊤x̄_{1:n} + τ = n + 1 with x̄ ≥ 0.
We can upper bound θ as follows:

θ ≤ (δ/(LR))·OPT + δ² + 2nδ ≤ 3nδ + δ² ≤ 4nδ,   (5.10)

where the first step follows from (5.8) and (5.9), the second step follows from OPT ≤ nLR (because ∥c∥_∞ ≤ L and ∥x∥_∞ ≤ R), and the last step follows from δ ≤ 1 ≤ n.
The first d rows of the constraint Āx̄ = b̄ give

A·x̄_{1:n} + ((1/R)b − A·1ₙ)θ = (1/R)b.

Using x̂ = R·x̄_{1:n} ∈ Rⁿ, we have

(1/R)Ax̂ + ((1/R)b − A·1ₙ)θ = (1/R)b.

Rewriting, we have Ax̂ − b = (RA·1ₙ − b)θ ∈ R^d and hence

∥Ax̂ − b∥₁ = ∥(RA·1ₙ − b)θ∥₁ ≤ θ(∥RA·1ₙ∥₁ + ∥b∥₁) ≤ θ·(R∥A∥₁ + ∥b∥₁) ≤ 4nδ·(R∥A∥₁ + ∥b∥₁),

where the second step follows from the triangle inequality, the third step follows from ∥A·1ₙ∥₁ ≤ ∥A∥₁ (by the definition of the entry-wise ℓ₁ norm ∥A∥₁ = Σ_{i,j}|A_{i,j}|), and the last step follows from (5.10).

5.5.4 Why √n?
The central path is the solution of the following ODE:

S_t (dx_t/dt) + X_t (ds_t/dt) = 1,
A (dx_t/dt) = 0,
A^⊤ (dy_t/dt) + (ds_t/dt) = 0.

Solving this linear system, we have that S_t dx_t/dt = (I − P_t)1 and X_t ds_t/dt = P_t 1, where P_t = X_tA^⊤(AS_t^{-1}X_tA^⊤)^{-1}AS_t^{-1}. Using that x_ts_t = t, we have that

P_t = X_tA^⊤(AX_t²A^⊤)^{-1}AX_t = S_t^{-1}A^⊤(AS_t^{-2}A^⊤)^{-1}AS_t^{-1}

and that X_t^{-1}dx_t/dt = (1/t)(I − P_t)1 and S_t^{-1}ds_t/dt = (1/t)P_t1. Equivalently, we have

d ln x_t/d ln t = (I − P_t)1  and  d ln s_t/d ln t = P_t 1.

Note that, since P_t is an orthogonal projection,

∥P_t 1∥_∞ ≤ ∥P_t 1∥₂ ≤ √n.

Hence, x_t and s_t can change by at most a constant factor when we change t by a 1 ± 1/√n factor.
factor.

Exercise 5.29. If we are given x such that ∥ln x − ln x_t∥_∞ = O(1), then we can find x_t by solving Õ(1) linear systems.

5.6 Interior Point Method for Convex Programs


The interior point method can be used to optimize any convex function. For more in-depth treatment, please
see the structural programming section in [57].
Recall that any convex optimization problem

min_x f(x)

can be rewritten in the epigraph form as

min_{(x,t): f(x)≤t} t.

Hence, it suffices to study the problem min_{x∈K} c^⊤x. Similar to the case of linear programs, we replace the hard constraint x ∈ K by a soft constraint as follows:

min_x ϕ_t(x)  where  ϕ_t(x) = tc^⊤x + ϕ(x)

and ϕ(x) is a convex function such that ϕ(x) → +∞ as x → ∂K. Note that we put the parameter t in front of the cost c^⊤x instead of in front of ϕ as in the last lecture; it is slightly more convenient here. We say ϕ is a barrier for K. To be concrete, we can always keep in mind ϕ(x) = −Σᵢ₌₁ⁿ ln xᵢ. As before, we define the central path.
Definition 5.30. The central path is xₜ = arg min_x ϕₜ(x).
The interior point method follows this framework:
1. Find x close to x₁.
2. While t is not tiny:
   (a) move x closer to xₜ;
   (b) t ← (1 + h)·t.

5.6.1 Self-concordance
In this section, we give a general analysis of the Newton method. In the next section, we will use it to show that the interior point method generalizes to convex optimization. A key property of the Newton method is that it is invariant under linear transformations. In general, whenever a method uses k-th order information, we need to assume the k-th derivative is continuous; otherwise, the k-th derivative is not useful for algorithmic purposes. For the Newton method, it is convenient to assume that the Hessian is Lipschitz. Since the method is invariant under linear transformations, it only makes sense to impose an assumption that is itself invariant under linear transformations.

Definition 5.31. Given a convex function f: Rⁿ → R and any point x ∈ Rⁿ, define the norm ∥·∥ₓ by

∥v∥ₓ² = v^⊤∇²f(x)v.

We call a function f self-concordant if for any h ∈ Rⁿ and any x ∈ dom f, we have

D³f(x)[h, h, h] ≤ 2∥h∥ₓ³,

where D^k f(x)[h₁, h₂, ···, h_k] is the directional k-th derivative of f along the directions h₁, h₂, ···, h_k.

Remark. The constant 2 is chosen so that −ln(x) satisfies the assumption exactly; it is not very important, in that by scaling f we can change any constant to any other constant.

Exercise 5.32. Show that the following property is equivalent to self-concordance as defined above: restricted to any straight line, g(t) = f(x + th) satisfies g′′′(t) ≤ 2g′′(t)^{3/2}.

Exercise 5.33. Show that the functions x^⊤Ax, −ln x, −ln(1 − Σᵢxᵢ²), −ln det X are self-concordant under suitable nonnegativity conditions.

The self-concordance condition says that locally, the Hessian does not change too fast, i.e., the change in
the Hessian is bounded by its magnitude (to the power 1.5). We will skip the proof of the lemma below.

Lemma 5.34. Given a self-concordant function f, for any h₁, h₂, h₃ ∈ Rⁿ, we have

|D³f(x)[h₁, h₂, h₃]| ≤ 2∥h₁∥ₓ∥h₂∥ₓ∥h₃∥ₓ.

From the self-concordance condition, we have the following more directly usable property.

Lemma 5.35. For a self-concordant function f, any x ∈ dom f and any y with ∥y − x∥ₓ < 1, we have that

(1 − ∥y − x∥ₓ)² ∇²f(x) ⪯ ∇²f(y) ⪯ (1/(1 − ∥y − x∥ₓ)²) ∇²f(x).

Proof. Let α(t) = ⟨∇²f(x + t(y − x))u, u⟩. Then, we have that

α′(t) = D³f(x + t(y − x))[y − x, u, u].

By self-concordance, we have

|α′(t)| ≤ 2∥y − x∥_{x+t(y−x)} ∥u∥²_{x+t(y−x)}.   (5.11)

For u = y − x, this gives |α′(t)| ≤ 2α(t)^{3/2}. Hence, we have (d/dt)(1/√α(t)) ≥ −1. Integrating both sides with respect to t, we have

1/√α(t) ≥ 1/√α(0) − t = 1/∥x − y∥ₓ − t.

Rearranging gives

∥y − x∥²_{x+t(y−x)} = α(t) ≤ 1/((1/∥x − y∥ₓ) − t)² = ∥x − y∥ₓ²/(1 − t∥x − y∥ₓ)².

For general u, (5.11) gives

|α′(t)| ≤ 2 (∥x − y∥ₓ/(1 − t∥x − y∥ₓ)) α(t).

Rearranging,

(d/dt) ln α(t) ≤ 2∥x − y∥ₓ/(1 − t∥x − y∥ₓ) = −2 (d/dt) ln(1 − t∥x − y∥ₓ).

Integrating both sides from t = 0 to 1 gives the result.

5.6.2 Self-concordant barrier functions


To analyze the algorithm above, we need to assume that ϕ is well-behaved. We measure the quality of ϕ by ∥∇ϕ(x)∥_{∇²ϕ(x)^{-1}}. One can think of this as the Lipschitz constant of ϕ, measured in the local norm.

Definition 5.36. We call ϕ a ν-self-concordant barrier for K if ϕ is self-concordant, ϕ(x) → +∞ as x → ∂K, and ∥∇ϕ(x)∥²_{∇²ϕ(x)^{-1}} ≤ ν for all x.

Not all convex functions are self-concordant. However, for our purposes, it suffices to know that we can construct a self-concordant barrier for any convex set.

Theorem 5.37. Any convex set has an n-self-concordant barrier.

Unfortunately, this is an existence result, and that barrier function is expensive to compute. In practice, we construct self-concordant barriers out of simpler ones:

Lemma 5.38. The following functions are self-concordant barriers. We use ν-sc as a short form for ν-self-concordant barrier.
• −ln x is 1-sc for {x ≥ 0}.
• −ln cos(x) is 1-sc for {|x| ≤ π/2}.
• −ln(t² − ∥x∥²) is 2-sc for {t ≥ ∥x∥₂}.
• −ln det X is n-sc for {X ∈ R^{n×n}, X ⪰ 0}.
• −ln x − ln(ln x + t) is 2-sc for {x ≥ 0, t ≥ −ln x}.
• −ln t − ln(ln t − x) is 2-sc for {t ≥ eˣ}.
• −ln x − ln(t − x ln x) is 2-sc for {x ≥ 0, t ≥ x ln x}.
• −2 ln t − ln(t^{2/p} − x²) is 4-sc for {t ≥ |x|^p} for p ≥ 1.
• −ln x − ln(t^p − x) is 2-sc for {t^p ≥ x ≥ 0} for 0 < p ≤ 1.
• −ln t − ln(x − t^{-1/p}) is 2-sc for {x > 0, t ≥ x^{-p}} for p ≥ 1.
• −ln x − ln(t − x^{-p}) is 2-sc for {x > 0, t ≥ x^{-p}} for p ≥ 1.
The following lemma shows how we can combine barriers.

Lemma 5.39. If ϕ1 and ϕ2 are ν1 and ν2 -self concordant barriers for K1 and K2 respectively, then ϕ1 + ϕ2
is a ν1 + ν2 self concordant barrier for K1 ∩ K2 .

Lemma 5.40. If ϕ is a ν -self concordant barrier for K, then ϕ(Ax + b) is ν -self concordant for {y : Ay + b ∈
K}.
Exercise 5.41. Using the lemmas above, prove that −Σᵢ₌₁ᵐ ln(aᵢ^⊤x − bᵢ) is an m-self-concordant barrier for the convex set {Ax ≥ b}.

5.6.3 Convergence of Newton method for self-concordant functions


Now we are ready to study the convergence of the Newton method for self-concordant functions.

Lemma 5.42. Given a self-concordant convex function f, consider the iteration

x′ = x − (∇²f(x))^{-1}∇f(x).

Suppose that r = ∥∇f(x)∥_{∇²f(x)^{-1}} < 1. Then we have

∥∇f(x′)∥_{∇²f(x′)^{-1}} ≤ r²/(1 − r)².

Remark 5.43. Note that ∥∇f(x)∥_{∇²f(x)^{-1}} = ∥(∇²f(x))^{-1}∇f(x)∥ₓ is the length of the Newton step. It is a measure of the error, since the goal is to find x with ∇f(x) = 0.

Proof. Lemma 5.35 shows that

∇²f(x′) ⪰ (1 − r)² ∇²f(x)

and hence

∥∇f(x′)∥_{∇²f(x′)^{-1}} ≤ ∥∇f(x′)∥_{∇²f(x)^{-1}}/(1 − r).

To bound ∇f(x′), we calculate that

∇f(x′) = ∇f(x) + ∫₀¹ ∇²f(x + t(x′ − x))(x′ − x) dt
  = ∇f(x) − ∫₀¹ ∇²f(x + t(x′ − x))(∇²f(x))^{-1}∇f(x) dt
  = (∇²f(x) − ∫₀¹ ∇²f(x + t(x′ − x)) dt)(∇²f(x))^{-1}∇f(x).   (5.12)

For the term in the bracket, we use Lemma 5.35 to get (note that ∫₀¹ (1 − tr)² dt = 1 − r + r²/3):

(1 − r + r²/3) ∇²f(x) ⪯ ∫₀¹ ∇²f(x + t(x′ − x)) dt ⪯ (1/(1 − r)) ∇²f(x).

Therefore, we have

∥(∇²f(x))^{-1/2}(∇²f(x) − ∫₀¹ ∇²f(x + t(x′ − x)) dt)(∇²f(x))^{-1/2}∥_op ≤ max(r/(1 − r), r − r²/3) = r/(1 − r).

Putting this into (5.12) gives

∥∇f(x′)∥_{∇²f(x)^{-1}} = ∥∇²f(x)^{-1/2}∇f(x′)∥₂
  ≤ ∥(∇²f(x))^{-1/2}(∇²f(x) − ∫₀¹ ∇²f(x + t(x′ − x)) dt)(∇²f(x))^{-1/2}∥_op · ∥(∇²f(x))^{-1/2}∇f(x)∥₂
  ≤ (r/(1 − r)) · r = r²/(1 − r).

Combining this with the first inequality gives the result.

Finally, we bound the error of the current iterate in terms of ∥∇f(x)∥_{∇²f(x)^{-1}}.

Lemma 5.44. Given x such that ∥∇f(x)∥_{∇²f(x)^{-1}} ≤ 1/6, we have that
• ∥x − x^*∥_{x^*} ≤ 2∥∇f(x)∥_{∇²f(x)^{-1}},
• ∥x − x^*∥ₓ ≤ (4/3)∥∇f(x)∥_{∇²f(x)^{-1}},
• f(x) ≤ f(x^*) + ∥∇f(x)∥²_{∇²f(x)^{-1}}.

Proof. Let r = ∥x − x^*∥ₓ and suppose first that r ≤ 1/4. Note that

∇f(x) = ∇f(x) − ∇f(x^*) = ∫₀¹ ∇²f(x^* + t(x − x^*))(x − x^*) dt.

Using ∇²f(x^* + t(x − x^*)) ⪰ (1 − (1 − t)r)² ∇²f(x) (Lemma 5.35), we have

∥∇f(x)∥_{∇²f(x)^{-1}} = ∥∫₀¹ ∇²f(x^* + t(x − x^*))(x − x^*) dt∥_{∇²f(x)^{-1}}
  ≥ ∫₀¹ (1 − (1 − t)r)² ∥x − x^*∥ₓ dt
  = (1 − r + r²/3) r ≥ 3r/4.

Using ∥∇f(x)∥_{∇²f(x)^{-1}} ≤ 1/6, we indeed have r ≤ 1/4 (the lower bound above is a non-decreasing function of r, so r > 1/4 would force ∥∇f(x)∥_{∇²f(x)^{-1}} > 3/16 > 1/6). In particular, ∥x − x^*∥ₓ = r ≤ (4/3)∥∇f(x)∥_{∇²f(x)^{-1}}.
Using Lemma 5.35 again, we have

∥x − x^*∥_{x^*} ≤ ∥x − x^*∥ₓ/(1 − r) ≤ (4/3)·(1/(1 − 1/4))·∥∇f(x)∥_{∇²f(x)^{-1}} ≤ 2∥∇f(x)∥_{∇²f(x)^{-1}}.

For the bound on f(x), we have that

f(x) = f(x^*) + ⟨∇f(x^*), x − x^*⟩ + ∫₀¹ (1 − t)(x − x^*)^⊤∇²f(x^* + t(x − x^*))(x − x^*) dt.

Using ∇f(x^*) = 0 and ∇²f(x^* + t(x − x^*)) ⪯ (1/(1 − (1 − t)r)²) ∇²f(x), we have

f(x) ≤ f(x^*) + ∫₀¹ (1 − t)/(1 − (1 − t)r)² dt · ∥x − x^*∥ₓ²
  = f(x^*) + (1/r²)(r/(1 − r) + log(1 − r)) ∥x − x^*∥ₓ²
  ≤ f(x^*) + (1/2 + r) ∥x − x^*∥ₓ²
  ≤ f(x^*) + ∥∇f(x)∥²_{∇²f(x)^{-1}},

where we used r ≤ 1/4 at the end.

5.6.4 Main Algorithm and Analysis


Algorithm 17: InteriorPointMethod
Input: a ν-self-concordant barrier ϕ for K, and the minimizer x of ϕ.
Define f_t(x) = tc^⊤x + ϕ(x). Set t = (1/6)∥c∥^{-1}_{∇²ϕ(x)^{-1}}.
while t ≤ (ν + √ν)/ε do
  x ← x − ∇²f_t(x)^{-1}∇f_t(x).
  t ← (1 + h)t with h = 1/(9√ν).
end
return x.
We first explain the termination condition. Intuitively, min_x f_t(x) tends to optimality as t → ∞. We first need a lemma showing that the gradient of ϕ is small in the following sense.

Lemma 5.45 (Duality Gap). Suppose that ϕ is a ν-self-concordant barrier. For any x, y ∈ K, we have that

⟨∇ϕ(x), y − x⟩ ≤ ν.

Proof. Let α(t) = ⟨∇ϕ(z_t), y − x⟩ where z_t = x + t(y − x). Then, we have

α′(t) = ⟨∇²ϕ(z_t)(y − x), y − x⟩.

Note that

α(t) ≤ ∥∇ϕ(z_t)∥_{∇²ϕ(z_t)^{-1}} ∥y − x∥_{∇²ϕ(z_t)} ≤ √ν ∥y − x∥_{∇²ϕ(z_t)}.

Hence, we have α′(t) ≥ α(t)²/ν. If α(0) ≤ 0, we are done. Otherwise, α is increasing, and hence α(1) > 0. Since 1/α(1) ≤ 1/α(0) − 1/ν, we get α(0) ≤ ν.

Lemma 5.46 (Duality Gap). Suppose that ϕ is a ν-self-concordant barrier. Then we have

⟨c, x_t⟩ ≤ ⟨c, x^*⟩ + ν/t.

More generally, for any x such that ∥tc + ∇ϕ(x)∥_{∇²ϕ(x)^{-1}} ≤ 1/6, we have

⟨c, x⟩ ≤ ⟨c, x^*⟩ + (ν + √ν)/t.

Proof. Let x^* be a minimizer of c^⊤x on K. By the optimality of x_t, we have tc + ∇ϕ(x_t) = 0. Therefore, by Lemma 5.45,

⟨c, x_t⟩ − ⟨c, x^*⟩ = (1/t)⟨∇ϕ(x_t), x^* − x_t⟩ ≤ ν/t.

For the second result, let f(x) = tc^⊤x + ϕ(x). Lemma 5.44 shows that

∥x − x_t∥ₓ ≤ 2∥∇f(x)∥_{∇²f(x)^{-1}} ≤ 1/3.

Hence,

⟨c, x − x_t⟩ ≤ ∥c∥_{∇²ϕ(x)^{-1}} ∥x − x_t∥ₓ ≤ (1/3)∥c∥_{∇²ϕ(x)^{-1}}.

Using c = (tc + ∇ϕ(x))/t − ∇ϕ(x)/t, we have

⟨c, x − x_t⟩ ≤ (1/(3t))(∥tc + ∇ϕ(x)∥_{∇²ϕ(x)^{-1}} + ∥∇ϕ(x)∥_{∇²ϕ(x)^{-1}})
  ≤ (1/(3t))(1/6 + √ν) ≤ √ν/t.

This gives the result.

Hence, it suffices to end with t = (ν + √ν)/ε, which is exactly the same threshold as in the linear programming case.

Theorem 5.47. Given a ν-self-concordant barrier ϕ and its minimizer, we can find x ∈ K such that c^⊤x ≤ c^⊤x^* + ε in

O(√ν · log((ν/ε)·∥c∥_{∇²ϕ(x)^{-1}}))

iterations.

Proof. We prove by induction that ∥∇f_t(x)∥_{∇²f_t(x)^{-1}} ≤ 1/6 at the beginning of each iteration. This is true initially by the definition of the initial t. By Lemma 5.42, after the Newton step we have

∥∇f_t(x)∥_{∇²f_t(x)^{-1}} ≤ ((1/6)/(1 − 1/6))² = 1/25.

Let t′ = (1 + h)t with h = 1/(9√ν). Note that ∇f_{t′}(x) = (1 + h)tc + ∇ϕ(x) = (1 + h)∇f_t(x) − h∇ϕ(x) and ∇²f_{t′}(x) = ∇²f_t(x) = ∇²ϕ(x). Therefore, we have that

∥∇f_{t′}(x)∥_{∇²f_{t′}(x)^{-1}} = ∥(1 + h)∇f_t(x) − h∇ϕ(x)∥_{∇²f_t(x)^{-1}}
  ≤ (1 + h)∥∇f_t(x)∥_{∇²f_t(x)^{-1}} + h∥∇ϕ(x)∥_{∇²ϕ(x)^{-1}}
  ≤ (1 + h)/25 + h√ν ≤ 1/6.

This completes the induction.
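To make the scheme concrete, here is a minimal sketch of Algorithm 17 using the m-self-concordant barrier ϕ(x) = −Σᵢ log(bᵢ − aᵢ^⊤x) for a polytope {Ax ≤ b} (cf. Exercise 5.41, with reversed inequalities). The box constraints, cost vector, starting point (the analytic center), and the final accuracy are our own illustrative choices.

import numpy as np

# Barrier interior point method on the box -1 <= x_i <= 1.
A = np.vstack([np.eye(2), -np.eye(2)])
b = np.ones(4)
c = np.array([1.0, 2.0])
nu = A.shape[0]                            # barrier parameter: m = 4

def grad_hess_phi(x):
    r = b - A @ x                          # slacks; must stay positive
    g = A.T @ (1 / r)                      # gradient of the barrier
    H = A.T @ np.diag(1 / r**2) @ A        # Hessian of the barrier
    return g, H

x = np.zeros(2)                            # analytic center of the box
t, h = 1e-3, 1 / (9 * np.sqrt(nu))
while t < nu / 1e-6:                       # roughly t <= (nu + sqrt(nu))/eps
    g, H = grad_hess_phi(x)
    x = x - np.linalg.solve(H, t * c + g)  # Newton step on f_t = t c^T x + phi
    t *= 1 + h
print(x)                                   # ~ (-1, -1), the vertex minimizing c^T x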
Chapter 6

Sparsication

In this chapter, we study some randomization techniques for faster convex optimization.

6.1 Subspace embedding


In this section and the next, we consider the least squares regression problem

min_x ∥Ax − b∥₂²

where A ∈ R^{n×d} with n ≥ d (we assume this throughout the chapter). The gradient of the objective is 2A^⊤Ax − 2A^⊤b. Setting it to zero, and assuming A^⊤A is invertible, the solution is given by

x = (A^⊤A)^{-1}A^⊤b.

If A^⊤A is not invertible, we use its pseudo-inverse. If the matrix A^⊤A ∈ R^{d×d} is given, then we can solve the equation above in time d^ω, where ω is the current exponent of matrix multiplication. If n > d^ω, then the bottleneck is simply computing A^⊤A. The following lemma shows that it suffices to approximate A^⊤A.
The simplest iteration is the Richardson iteration:

x^{(k+1)} = A^⊤b + (I − A^⊤A)x^{(k)} = x^{(k)} + A^⊤b − A^⊤Ax^{(k)}.

To ensure this converges, we scale down A^⊤A by its largest eigenvalue so that A^⊤A ≺ I. This gives a bound of O(κ(A^⊤A) log(1/ε)) on the number of iterations to reach ε error, where κ(A^⊤A) = λ_max(A^⊤A)/λ_min(A^⊤A). More generally, one can use preconditioning. Recall that for a vector v, the norm ∥v∥_M is defined by ∥v∥_M² = v^⊤Mv.
Lemma 6.1. Given a matrix M such that A^⊤A ⪯ M ⪯ κ·A^⊤A for some κ ≥ 1, consider the algorithm

x^{(k+1)} = x^{(k)} − M^{-1}(A^⊤Ax^{(k)} − A^⊤b).

Then, we have that

∥x^{(k)} − x^*∥_M ≤ (1 − 1/κ)^k · ∥x^{(0)} − x^*∥_M.

Remark 6.2. The proof also shows why this choice of norm is the natural one: in this norm, the residual drops geometrically.
Proof. Using x^* = (A^⊤A)^{-1}A^⊤b, i.e., A^⊤b = (A^⊤A)x^*, and the formula for x^{(k+1)}, we have

x^{(k+1)} − x^* = x^{(k)} − M^{-1}(A^⊤Ax^{(k)} − A^⊤b) − x^*
  = x^{(k)} − M^{-1}(A^⊤Ax^{(k)} − A^⊤Ax^*) − x^*
  = (I − M^{-1}A^⊤A)(x^{(k)} − x^*).

Therefore, the norm of the error is

∥x^{(k+1)} − x^*∥_M² = (x^{(k)} − x^*)^⊤(I − A^⊤AM^{-1})M(I − M^{-1}A^⊤A)(x^{(k)} − x^*)
  = (x^{(k)} − x^*)^⊤ M^{1/2}(I − H)²M^{1/2} (x^{(k)} − x^*)

where H = M^{-1/2}A^⊤AM^{-1/2}. Note that the eigenvalues of H lie between 1/κ and 1, and hence

λ_max((I − H)²) ≤ (1 − 1/κ)².

This gives the conclusion.
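A minimal sketch of this preconditioned Richardson iteration follows. Here M is a deliberately crude spectral approximation of A^⊤A (M = 2A^⊤A, so κ = 2), chosen by us purely to make the contraction visible; the point of the coming sections is to build cheaper M.

import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
G = A.T @ A
M = 2.0 * G                                # G <= M <= 2 G, so kappa = 2
Minv = np.linalg.inv(M)

x = np.zeros(d)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
for k in range(50):
    # x^(k+1) = x^(k) - M^{-1}(A^T A x^(k) - A^T b)
    x = x - Minv @ (G @ x - A.T @ b)
print(np.linalg.norm(x - x_star))          # shrinks like (1 - 1/kappa)^k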

Note that x^⊤A^⊤Ax = ∥Ax∥². Alternatively, we can think of our goal as approximating A by a smaller matrix B such that ∥Ax∥₂ is close to ∥Bx∥₂ for all x. In this section, we show that we can simply take B = ΠA for a random matrix Π with relatively few rows. With M = B^⊤B, we can then run the Richardson iteration. We need to check that this makes the entire procedure more efficient.

Denition 6.3. A matrix Π ∈ Rm×n is an embedding for a set S ⊂ Rn with distortion ϵ if


(1 − ϵ)∥y∥2 ≤ ∥Πy∥2 ≤ (1 + ϵ)∥y∥2

for all y ∈ S .

In this section, we focus on the case where S is a d-dimensional subspace of Rⁿ, namely S = {Ax : x ∈ R^d}. Consider the SVD A = UΣV^⊤. For any y = Ax ∈ S, we have that

∥U^⊤y∥₂² = ∥U^⊤UΣV^⊤x∥₂² = x^⊤VΣ²V^⊤x = ∥Ax∥₂².

Therefore, any d-dimensional subspace has a zero-distortion embedding of size d × n.

Exercise 6.4. For any d-dimensional subspace S , any embedding with distortion ϵ < 1 must have at least
d rows.

This embedding is not useful for solving the least squares problem, because given the SVD the solution already has the closed form x = VΣ^{-1}U^⊤b, and finding the SVD is usually more expensive than the original problem.

6.1.1 Oblivious Subspace Embedding via Johnson-Lindenstrauss


Surprisingly, there are random embeddings that have small distortion without knowledge of the subspace.

Definition 6.5. A random matrix Π ∈ R^{m×n} is a (d, ε, δ)-oblivious subspace embedding (OSE) if, for any fixed d-dimensional subspace S ⊂ Rⁿ, Π is an embedding for S with distortion ε with probability at least 1 − δ.

The next lemma provides an equivalent definition.

Lemma 6.6. Π is a (d, ε, δ) OSE if and only if, for any matrix U ∈ R^{n×d} with orthonormal columns, we have

P(∥U^⊤Π^⊤ΠU − I_d∥_op ≤ ε) ≥ 1 − δ.

Proof. Let S be the subspace with orthonormal basis U ∈ R^{n×d}, namely S = {y : y = Uz}. Then, the condition

(1 − ε)∥y∥₂ ≤ ∥Πy∥₂ ≤ (1 + ε)∥y∥₂  for all y ∈ S

can be rewritten (up to a constant rescaling of ε) as

(1 − ε)U^⊤U ⪯ U^⊤Π^⊤ΠU ⪯ (1 + ε)U^⊤U.

Using U^⊤U = I_d, this becomes

(1 − ε)I_d ⪯ U^⊤Π^⊤ΠU ⪯ (1 + ε)I_d.

For the special case d = 1, the definition becomes

P(|∥Πa∥₂² − 1| ≤ ε) ≥ 1 − δ

for all unit vectors a. An OSE for d = 1 is given by the Johnson-Lindenstrauss Lemma. The original version was a projection onto a uniformly random subspace of the target dimension; later versions extended this to Gaussian, Bernoulli and more general random matrices [73].

Lemma 6.7 (Johnson-Lindenstrauss Lemma). Let Π ∈ R^{m×n} be a random matrix with i.i.d. entries distributed as (1/√m)·N(0, 1), or uniformly sampled from {±1/√m}, with m = Θ((1/ε²) log(1/δ)). Then Π is a (1, ε, δ) OSE.
We will skip the proof, as we will prove a more general result later. Next, we show that any OSE for d = 1 is an OSE for general d; therefore, it suffices to focus on the case d = 1. First, we need a lemma about ε-nets on S^{n−1}.

Lemma 6.8. For any ε > 0 and any n ∈ N, there are at most (1 + 2/ε)ⁿ unit vectors xᵢ ∈ Rⁿ such that for any unit vector x ∈ Rⁿ, there is an i with ∥x − xᵢ∥₂ ≤ ε.

Remark. We call the points {xᵢ}ᵢ an ε-net of S^{n−1}.

Proof. We consider the following algorithm:
1. V = S^{n−1}, i ← 1.
2. While there is xᵢ ∈ V:
   (a) V ← V \ B(xᵢ, ε);
   (b) i ← i + 1.
Let {x₁, x₂, ···, x_N} be the points it finds. By construction, every unit vector x satisfies ∥x − xᵢ∥ ≤ ε for some i (else the procedure would continue). Now consider balls of radius ε/2 centered at each xᵢ. All the balls B(xᵢ, ε/2) are disjoint, and they all lie in B(0, 1 + ε/2). Therefore,

N · vol(B(0, ε/2)) ≤ vol(B(0, 1 + ε/2)),

and hence N ≤ (1 + 2/ε)ⁿ.
Lemma 6.9. Suppose that Π is a (1, ε, δ) OSE. Then Π is a (d, 4ε, 5^d δ) OSE.

Proof. Let N = {xᵢ} be a (1/2)-net of S^{d−1}. For any unit vector x, there is x₁ ∈ N such that ∥x − x₁∥₂ ≤ 1/2. Using the (1/2)-net guarantee on the vector x − x₁, we can find x₂ ∈ N and 0 ≤ t₂ ≤ 1/2 such that

∥x − x₁ − t₂x₂∥ ≤ 1/4.

Continuing similarly, we can write x = Σᵢ₌₁^∞ tᵢxᵢ with 0 ≤ tᵢ ≤ 1/2^{i−1}. Hence, for any unit x,

x^⊤(U^⊤Π^⊤ΠU − I_d)x = Σ_{i,j} tᵢtⱼ · xᵢ^⊤(U^⊤Π^⊤ΠU − I_d)xⱼ
  ≤ Σ_{i,j} tᵢtⱼ · max_{x∈N} |x^⊤(U^⊤Π^⊤ΠU − I_d)x|
  ≤ 4 max_{x∈N} |x^⊤(U^⊤Π^⊤ΠU − I_d)x|
  = 4 max_{x∈UN} |∥Πx∥² − 1|,

where we used Σᵢ tᵢ ≤ 2. Now, using that Π is a (1, ε, δ) OSE, for each fixed x ∈ UN we have

P(|∥Πx∥² − 1| ≤ ε) ≥ 1 − δ.

Taking a union bound over all |N| ≤ 5^d points x ∈ UN, we have

P(x^⊤(U^⊤Π^⊤ΠU − I_d)x ≤ 4ε for all unit x) ≥ 1 − 5^d δ.

This reduction and the Johnson-Lindenstrauss Lemma show that a random ±1/√m matrix is a (d, ε, δ) OSE with

m = Θ((1/ε²)(d + log(1/δ))).

As discussed before, any (d, ε, δ) OSE must have at least d rows, so the number of rows of this OSE is tight in the regime ε = Θ(1). We only need an OSE for ε = Θ(1) because of Lemma 6.1; by iterating we can get any ε with an overhead of log(1/ε). Unfortunately, computing ΠA for a dense Π is in fact more expensive than computing A^⊤A: the first involves multiplying a Θ(d) × n matrix by an n × d matrix, the second a d × n matrix by an n × d matrix.
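The OSE property (in the form of Lemma 6.6) is easy to verify numerically. The following is a minimal check with a dense sign matrix; the dimensions are our own arbitrary choices.

import numpy as np

rng = np.random.default_rng(2)
n, d, m = 5000, 10, 1000
# orthonormal basis U of a fixed d-dimensional subspace of R^n
U, _ = np.linalg.qr(rng.standard_normal((n, d)))
# random sign embedding with rows scaled by 1/sqrt(m)
Pi = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)
E = U.T @ Pi.T @ Pi @ U - np.eye(d)
print(np.linalg.norm(E, 2))                # distortion, roughly sqrt(d/m) ~ 0.1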

6.1.2 Sparse Johnson-Lindenstrauss


We consider a sparse random matrix Π ∈ R^{m×n} where each entry is ±1/√s with probability s/m and 0 otherwise, independently. We will show that this random matrix is a (d, ε, δ)-OSE for s = Θ(log²(d/δ)/ε²) and m = Θ(d log(d/δ)/ε²). When n = d = 1, s·∥Πx∥₂² is the number of non-zeros in the only column; therefore, we indeed need s to scale like 1/ε². The advantage of a sparse embedding is that applying it can be much more efficient.

Remark 6.10. It turns out that one can select exactly s non-zeros in each column. This allows us to use s = Θ(log(d/δ)/ε). The proof is slightly more complicated due to the lack of independence [16].
To analyze U^⊤Π^⊤ΠU, we note that

U^⊤Π^⊤ΠU = Σᵣ₌₁^m (ΠU)ᵣ^⊤(ΠU)ᵣ.

Since the rows of Π are independent, we can use matrix concentration bounds to analyze the sum above. See [70] for a survey of matrix concentration bounds.
Theorem 6.11 (Matrix Chernoff). Suppose we have a sequence of independent, random, self-adjoint matrices Mⱼ ∈ R^{n×n} such that EMⱼ = I and 0 ⪯ Mⱼ ⪯ R·I. Then, for T = (R/ε²) log(n/δ),

(1 − O(ε))I ⪯ (1/T) Σⱼ₌₁^T Mⱼ ⪯ (1 + O(ε))I

with probability at least 1 − δ.

Lemma 6.12 (Hanson-Wright Inequality). For independent random variables σ₁, ···, σₙ with Eσᵢ = 0 and |σᵢ| ≤ 1, and A ∈ R^{n×n}, we have

|σ^⊤Aσ − Eσ^⊤Aσ| ≲ ∥A∥_F √(log(1/δ)) + ∥A∥_op log(1/δ)

with probability 1 − δ.
For our problem, we set Mᵣ = m·U^⊤πᵣπᵣ^⊤U, where πᵣ is the r-th row of Π (as a column vector). One can check that Mᵣ ⪰ 0 and that EMᵣ = I. Next, we note that

Mᵣ ⪯ m·πᵣ^⊤UU^⊤πᵣ · I.   (6.1)

With small probability, πᵣ^⊤UU^⊤πᵣ can be huge. However, as long as πᵣ^⊤UU^⊤πᵣ is bounded by R with probability 1 − δ, we can still use the matrix Chernoff bound above. To bound πᵣ^⊤UU^⊤πᵣ, we use the following large deviation inequality.

Lemma 6.13. Assume that s ≫ (m/n)·log(1/δ), s ≫ log²(1/δ)/ε² and m ≫ d log(1/δ)/ε². Then

πᵣ^⊤UU^⊤πᵣ ≤ ε²/log(d/δ)

with probability 1 − δ. (Here ≫ means greater by a sufficiently large constant factor.)

Proof. Note that πᵣ ∈ Rⁿ with each entry non-zero with probability s/m. By the Chernoff bound, πᵣ has at most 2sn/m non-zeros with probability 1 − δ, using sn/m ≫ log(1/δ). Let I be the set of indices at which πᵣ is non-zero. Conditioned on the set I and on |I| ≤ 2sn/m, note that σ := πᵣ|_I is a random ±1/√s vector. Let P = (UU^⊤)|_{I×I}. Then we have

πᵣ^⊤UU^⊤πᵣ = σ^⊤Pσ.

The Hanson-Wright inequality shows that

|σ^⊤Pσ − Eσ^⊤Pσ| ≲ (1/s)·∥P∥_F √(log(1/δ)) + (1/s)·∥P∥_op log(1/δ).

Note that UU^⊤ is a projection matrix and hence ∥P∥_op ≤ 1. Since P ⪰ 0, we have

∥P∥_F = √(tr P²) ≤ √(∥P∥_op · tr P) ≤ √(tr P).

Also, we have that Eσ^⊤Pσ = (1/s)·tr P. Hence,

σ^⊤Pσ ≤ (1/s)(tr P + O(√(tr P · log(1/δ)) + log(1/δ)))

with probability 1 − δ. Note that tr UU^⊤ = tr U^⊤U = d and that P is a random diagonal block of UU^⊤ of size at most 2sn/m. By the Chernoff bound, one can show that

tr P ≤ 4sd/m.

Hence, we have

σ^⊤Pσ ≤ 4d/m + O((1/s)(√((sd/m)·log(1/δ)) + log(1/δ)))

with probability 1 − δ. Using s ≫ log²(d/δ)/ε² and m ≫ d log(d/δ)/ε², we have σ^⊤Pσ ≤ ε²/log(d/δ) with probability 1 − δ.

Now we can prove the main theorem.

Theorem 6.14. Consider a random sparse matrix Π ∈ Rm×n where each entry is ± √1s with probability
s d log( d
δ) log2 ( d
δ)
m and 0 otherwise. There exist constants c1 , c2 such that for m = c1 ϵ2 and s = c2 ϵ2 Π is an
O(d, ϵ, δ)-OSE.

Proof. By the previous lemma, we have that

ϵ2
πr⊤ U U ⊤ πr ≤
log(d/δ)

with high probability. Under this event, using Theorem 6.11 and (6.1), we have that
m
X
(1 − O(ε))I ⪯ πr⊤ U U ⊤ πr ⪯ (1 + O(ε))I.
r=1

6.1.3 Back to Regression


In the last subsection, we presented a proof sketch of a suboptimal OSE. For completeness, we use a tighter version here:

Algorithm 18: RegressionUsingOSE
Input: a matrix A ∈ R^{n×d}, a vector b ∈ Rⁿ, success probability δ and target accuracy ε.
Let Π be a (d, 1/4, δ) OSE constructed by Theorem 6.14.
Compute A′ = 2ΠA, M = A′^⊤A′ and M^{-1}.
x^{(1)} = 0.
for k = 1, 2, ···, 4 log(1/ε) do
  x^{(k+1)} = x^{(k)} − M^{-1}(A^⊤Ax^{(k)} − A^⊤b)
end
return x^{(last)}.

Theorem 6.15. Given any matrix A ∈ R^{n×d} and any vector b ∈ Rⁿ, the algorithm RegressionUsingOSE returns a vector x such that

∥A^⊤Ax − A^⊤b∥_{(A^⊤A)^{-1}} ≤ ε∥A^⊤b∥_{(A^⊤A)^{-1}}

with probability 1 − δ, in Õ(nnz(A) + d^ω) time.

Proof. The guarantee on x follows from the definition of OSE and Lemma 6.1. We note that computing ΠA amounts to adding a signed copy of each row of A into roughly s rows of ΠA; hence, the cost of computing ΠA is O(s·nnz(A)) = Õ(nnz(A)). Computing M takes Õ(d^ω) time, and computing M^{-1} takes Õ(d^ω) time. The loop takes Õ(nnz(A)) time. This explains the total running time.

Linear regression can also be solved in time O(nnz(A) + d^{O(1)}) via a very sparse embedding, where each column of Π has exactly one nonzero entry in a random location. This was analyzed by Clarkson and Woodruff [76]. See also [34] for a survey.
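The following is a minimal sketch of RegressionUsingOSE. For simplicity we use a dense sign sketch rather than the sparse embedding of Theorem 6.14, so this illustrates the algorithm but not the nnz(A) running time; the sizes are our own choices.

import numpy as np

rng = np.random.default_rng(3)
n, d, m = 4000, 20, 400
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

Pi = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)
As = 2.0 * (Pi @ A)                        # A' = 2 Pi A, so A^T A <= M w.h.p.
Minv = np.linalg.inv(As.T @ As)            # preconditioner M = A'^T A'

x = np.zeros(d)
for k in range(60):
    # residual shrinks by (1 - 1/kappa) per step (Lemma 6.1)
    x = x - Minv @ (A.T @ (A @ x) - A.T @ b)
print(np.linalg.norm(A.T @ A @ x - A.T @ b))   # small: normal equations ~hold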

6.2 Leverage Score Sampling


Here we give a way to reduce a tall linear regression problem to a sequence of problems on submatrices:

Theorem 6.16 ([17]). Given a matrix A ∈ R^{n×d}, let T(m) denote the cost of solving B^⊤Bx = b, where B is any submatrix consisting of m distinct rows of A. Then, we have that

T(n) = O^*(nnz(A) + T(O(d log d))).

Remark. Here we use O^* to emphasize that some dependence on log(1/ε) is suppressed for notational simplicity. Lemma 6.1 shows that once we can solve the system with constant approximation, we can repeat log(1/ε) times to get an ε-accurate solution.

6.2.1 Leverage scores


The key concept in this reduction is the leverage score.

Definition 6.17. Given a matrix A ∈ R^{n×d}, let aᵢ^⊤ be the i-th row of A. The leverage score of the i-th row of A is

σᵢ(A) := aᵢ^⊤(A^⊤A)⁺aᵢ.

Note that σ(A) is the diagonal of the projection matrix A(A^⊤A)⁺A^⊤. Since 0 ⪯ A(A^⊤A)⁺A^⊤ ⪯ I, we have 0 ≤ σᵢ(A) ≤ 1. Moreover, since A(A^⊤A)⁺A^⊤ is a projection matrix, the sum of A's leverage scores (its trace) is equal to the rank of A:

Σᵢ₌₁ⁿ σᵢ(A) = tr(A(A^⊤A)⁺A^⊤) = rank(A(A^⊤A)⁺A^⊤) = rank(A) ≤ d.   (6.2)
i=1

The leverage score measures the importance of a row in forming the row space of A. If a row has a component orthogonal to all other rows, its leverage score is 1, and removing it would decrease the rank of A, completely changing its row space. The coherence of A is ∥σ(A)∥_∞. If A has low coherence, no particular row is especially important. If A has high coherence, it contains at least one row whose removal would significantly affect the composition of A's row space. The following two characterizations help with this intuition.

Lemma 6.18. For all A ∈ R^{n×d} and i ∈ [n], we have that

σᵢ(A) = min_{A^⊤x = aᵢ} ∥x∥₂²,

where aᵢ is the i-th row of A.

Lemma 6.19. For all A ∈ R^{n×d} and i ∈ [n], σᵢ(A) is the smallest t such that

aᵢaᵢ^⊤ ⪯ t·A^⊤A.   (6.3)
Sampling rows of A according to their exact leverage scores gives a spectral approximation of A with high probability. Sampling by leverage score overestimates also suffices.

Lemma 6.20. Given a vector u of leverage score overestimates, i.e., σᵢ(A) ≤ uᵢ for all i, define the random matrix

X = (1/pᵢ)·aᵢaᵢ^⊤  with probability pᵢ = uᵢ/∥u∥₁.

For T = Ω(∥u∥₁ log n/ε²), with probability 1 − 1/n^{O(1)}, we have that

(1 − ε)A^⊤A ⪯ (1/T) Σᵢ₌₁^T Xᵢ ⪯ (1 + ε)A^⊤A,

where the Xᵢ are independent copies of X.

Proof. Note that EX = A^⊤A and that

0 ⪯ X = (1/pᵢ)·aᵢaᵢ^⊤ = (∥u∥₁/uᵢ)·aᵢaᵢ^⊤ ⪯ (∥u∥₁/σᵢ)·aᵢaᵢ^⊤ ⪯ ∥u∥₁·A^⊤A,

where the last step uses Lemma 6.19. Now, the statement follows from the matrix Chernoff bound with Mₖ = (A^⊤A)^{-1/2}Xₖ(A^⊤A)^{-1/2} and R = ∥u∥₁.
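A minimal numerical sketch of Lemma 6.20 follows. For illustration we compute exact leverage scores via an SVD, which is of course exactly what the rest of this section works to avoid; the instance size and the sample count T are our own choices.

import numpy as np

rng = np.random.default_rng(4)
n, d = 3000, 10
A = rng.standard_normal((n, d))
U, _, _ = np.linalg.svd(A, full_matrices=False)
u = (U**2).sum(axis=1)                     # exact leverage scores sigma_i(A)
p = u / u.sum()                            # sampling probabilities u_i/||u||_1

T = 4000                                   # ~ ||u||_1 log(n)/eps^2 samples
idx = rng.choice(n, size=T, p=p)
B = A[idx] / np.sqrt(T * p[idx])[:, None]  # rescaled sample: E[B^T B] = A^T A

# eigenvalues of (A^T A)^{-1/2} B^T B (A^T A)^{-1/2} should all be ~ 1
Lch = np.linalg.cholesky(A.T @ A)
Linv = np.linalg.inv(Lch)
lam = np.linalg.eigvalsh(Linv @ (B.T @ B) @ Linv.T)
print(lam.min(), lam.max())                # both ~ 1, i.e., (1 +/- eps) approx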
Combining Lemma 6.20 and Lemma 6.1, we have that

T(n) = cost of computing σ + O^*(nnz(A) + T(d log d)),   (6.4)

where we used ∥σ∥₁ = O(d). However, computing σ exactly is too expensive for many purposes. In [67], it is shown that we can compute leverage scores approximately by solving only polylogarithmically many regression problems. This result uses the fact that

σᵢ(A) = ∥A(A^⊤A)⁺A^⊤eᵢ∥₂²

and that, by the Johnson-Lindenstrauss Lemma, these lengths are preserved up to a multiplicative error if we project these vectors to a random low-dimensional subspace. In particular, we can approximate σᵢ(A) via ∥ΠA(A^⊤A)⁺A^⊤eᵢ∥₂². The benefit is that we can compute ΠA(A^⊤A)⁺ by solving (log n)/ε² many linear systems. In other words, we have that

cost of approximating σ = O^*(nnz(A)) + O^*(T(n)).

Putting this into (6.4), we get the useless formula

T(n) = O^*(nnz(A) + T(n) + T(d log d)).

This is a chicken-and-egg problem: to solve an overdetermined system faster, we want to use leverage scores to sample the rows, and to approximate the leverage scores, we need to solve the original overdetermined system (even worse, we need to solve it a few times).

6.2.2 Uniform Sampling


The key idea that breaks this chicken-and-egg problem is uniform sampling of the rows of A. We define

σ_{i,S} = aᵢ^⊤(A_S^⊤A_S)^{-1}aᵢ,

where A_S is A restricted to the rows in S. The set S will be a random sample of k rows of A. Note that σ_{i,S∪{i}} is an overestimate of σᵢ(A), since A_{S∪{i}}^⊤A_{S∪{i}} ⪯ A^⊤A. Hence, it suffices to bound the ℓ₁ norm of these overestimates. The key lemma is the following:
Lemma 6.21. We have that

E_{|S|=k} Σᵢ₌₁ⁿ σ_{i,S∪{i}} ≤ nd/k.

Proof. Note that

E_{|S|=k} Σᵢ₌₁ⁿ σ_{i,S∪{i}} = E_{|S|=k} Σ_{i∉S} σ_{i,S∪{i}} + E_{|S|=k} Σ_{i∈S} σ_{i,S∪{i}}.

Since Σ_{i∈S} σ_{i,S∪{i}} = Σ_{i∈S} σ_{i,S} ≤ d (the sum of the leverage scores of A_S), the second term is bounded by d.
For the first term, we note that "sampling a set S of size k, then sampling i ∉ S" is the same as "sampling a set T of size k + 1, then sampling i ∈ T". Hence, we have

E_{|S|=k} E_{i∉S} σ_{i,S∪{i}} = E_{|T|=k+1} E_{i∈T} σ_{i,T} ≤ E_{|T|=k+1} d/(k+1) = d/(k+1).

Hence, we have that

E_{|S|=k} Σᵢ₌₁ⁿ σ_{i,S∪{i}} ≤ (n − k)·d/(k+1) + d = d·(n+1)/(k+1) ≤ nd/k.

Next, using the Sherman-Morrison formula, we have that

σ_{i,S∪{i}} = σ_{i,S}  if i ∈ S,   and   σ_{i,S∪{i}} = 1/(1 + 1/σ_{i,S})  otherwise.

Namely, we can compute σ_{i,S∪{i}} from σ_{i,S}. Therefore, we have that

cost of approximating the σ_{i,S∪{i}} = O^*(nnz(A)) + O^*(T(k)).

Using Lemma 6.21, we now have

T(n) = O^*(nnz(A) + T(k) + T((nd/k)·log d)).

Picking k = √(nd log d), we have that

T(n) = O^*(nnz(A) + T(√(nd log d))).

Repeating this process Õ(1) times, we have

T(n) = O^*(nnz(A) + T(O(d log d))).

6.3 Stochastic Gradient Descent


In this section, we study problems of the form min_x Σᵢ fᵢ(x), where each fᵢ is convex and the sum can be either infinite or finite.

6.3.1 Stochastic Problem


In this part, we discuss the problem

min_x F(x),  where F(x) := E_{f∼D} f(x)

and D is a distribution over convex functions on R^d. The goal is to find the minimizer x^* = argmin_x F(x). Suppose we observe samples f₁, f₂, ···, f_T from D. Ideally, we wish to approximate x^* by the empirical risk minimizer

x_ERM^{(T)} := arg min_x F_T(x),  where F_T(x) = (1/T) Σᵢ₌₁^T fᵢ(x).

It is known that x_ERM^{(T)} is optimal in a certain sense, in spite of its computational cost. Therefore, to discuss the efficiency of an optimization algorithm for F(x), it is helpful to consider the ratio

(EF(x^{(T)}) − F(x^*)) / (EF(x_ERM^{(T)}) − F(x^*)).

We first discuss the term EF(x_ERM^{(T)}) − F(x^*). As an example, consider the simplest one-dimensional problem F(x) = E_{b∼N(0,1)}(x − b)². Note that x_ERM^{(T)} is simply the average of T standard normal variables and hence

EF(x_ERM^{(T)}) − F(x^*) = E_{b₁,···,b_T} E_b (x_ERM^{(T)} − b)² − E_b b²
  = E_{b₁,···,b_T} (x_ERM^{(T)})²
  = E ((1/T) Σᵢ₌₁^T bᵢ)² = 1/T.

In general, the following lemma shows that

EF(x_ERM^{(T)}) − F(x^*) → σ²/T,  where σ² := (1/2)·E∥∇f(x^*)∥²_{(∇²F(x^*))^{-1}};

here σ²/T is the Cramér-Rao bound.
Lemma 6.22. Suppose that f is µ-strongly convex with Lipschitz Hessian for all f ∼ D, for some µ > 0, and that E_{f∼D}∥∇f(x^*)∥² < +∞. Then, we have that

lim_{N→+∞} (EF(x_ERM^{(N)}) − F(x^*)) / (σ²/N) = 1.

Remark. The statement holds under weaker assumptions, and the rate of convergence can be made quantitative.
Proof. We first prove x_ERM^{(N)} → x^* as N → +∞. By the optimality condition for x_ERM^{(N)} and Taylor's theorem, for some x̃ we have

0 = ∇F_N(x_ERM^{(N)}) = ∇F_N(x^*) + ∇²F_N(x̃)(x_ERM^{(N)} − x^*).   (6.5)

By µ-strong convexity, we have

∥x_ERM^{(N)} − x^*∥₂² ≤ (1/µ²)·∥∇F_N(x^*)∥².

Since E_{f∼D}∇f(x^*) = ∇F(x^*) = 0, we have

E∥x_ERM^{(N)} − x^*∥₂² ≤ (1/µ²)·E∥(1/N)Σᵢ∇fᵢ(x^*)∥² = (1/(µ²N))·E_{f∼D}∥∇f(x^*)∥².

Therefore, x_ERM^{(N)} → x^* as N → +∞.
Now, to compute the error, a Taylor expansion of F at x^* shows that

F(x_ERM^{(N)}) − F(x^*) = (1/2)(x_ERM^{(N)} − x^*)^⊤∇²F(x̄)(x_ERM^{(N)} − x^*)

for some x̄ between x^* and x_ERM^{(N)}. Using this and (6.5) gives

F(x_ERM^{(N)}) − F(x^*) = (1/2)∇F_N(x^*)^⊤(∇²F_N(x̃))^{-1}∇²F(x̄)(∇²F_N(x̃))^{-1}∇F_N(x^*).

Since x_ERM^{(N)} → x^* and ∇²F_N ⪰ µI, we have (∇²F_N(x̃))^{-1}∇²F(x̄)(∇²F_N(x̃))^{-1} → (∇²F(x^*))^{-1}. Hence, we have

lim_{N→∞} E N·(F(x_ERM^{(N)}) − F(x^*)) = lim_{N→∞} (N/2)·E∇F_N(x^*)^⊤(∇²F(x^*))^{-1}∇F_N(x^*)
  = (1/2)·E∥∇f(x^*)∥²_{(∇²F(x^*))^{-1}}.

Now, we discuss how to achieve a bound similar to σ²/T using stochastic gradient descent. Since gradient descent is a first-order method, we can only achieve a bound in terms of the ℓ₂ norm, E∥∇f(x^*)∥₂², instead of the inverse-Hessian norm.

Lemma 6.23. Suppose f has L-Lipschitz gradient for all f ∈ D. Let x^* be the minimizer of F. Then, we have

E_{f∼D}∥∇f(x) − ∇f(x^*)∥₂² ≤ 2L·(F(x) − F(x^*)).

Proof. Let g(x) = f(x) − ∇f(x^*)^⊤(x − x^*). By construction, x^* is a minimizer of g and g(x^*) = f(x^*). Hence, by the progress bound of one gradient descent step, we know

g(x^*) ≤ g(x) − (1/(2L))∥∇g(x)∥².

Rearranging, we have

∥∇g(x)∥² ≤ 2L(g(x) − g(x^*))

and hence

∥∇f(x) − ∇f(x^*)∥² ≤ 2L·(f(x) − f(x^*) − ∇f(x^*)^⊤(x − x^*)).

Taking the expectation over f ∼ D gives the result.

Algorithm 19: StochasticGradientDescent (SGD)


Input: Initial point x(0) ∈ Rd , step size h > 0. Access to D, a distribution on convex functions.
for k = 0, 1, · · · , T do
Sample f ∼ D.
x(k+1) ← x(k) − h · ∇f (x(k) ).
end

Theorem 6.24. Suppose f has L-Lipschitz gradient for all f ∈ D and F is µ-strongly convex. Let x^* be the minimizer of F and σ² = (1/2)E∥∇f(x^*)∥². For step size h ≤ 1/(4L), the sequence x^{(k)} in StochasticGradientDescent satisfies

E∥x^{(k)} − x^*∥² ≤ 8hσ²/µ + (1 − hµ/2)^k · ∥x^{(0)} − x^*∥²

and

(1/T) Σₖ₌₀^{T−1} EF(x^{(k)}) − F(x^*) ≤ 4hσ² + ∥x^{(0)} − x^*∥²/(hT).

Proof. Note that

∥x^{(k+1)} − x^*∥² = ∥x^{(k)} − x^*∥² − 2h⟨x^{(k)} − x^*, ∇f(x^{(k)})⟩ + h²∥∇f(x^{(k)})∥².

Taking the expectation over f and using

E∥∇f(x^{(k)})∥² ≤ 2E∥∇f(x^{(k)}) − ∇f(x^*)∥² + 2E∥∇f(x^*)∥²
  ≤ 4L·(F(x^{(k)}) − F(x^*)) + 4σ²,

we have

E∥x^{(k+1)} − x^*∥² ≤ ∥x^{(k)} − x^*∥² − 2h⟨x^{(k)} − x^*, ∇F(x^{(k)})⟩ + 4h²σ² + 4Lh²·(F(x^{(k)}) − F(x^*))
  ≤ ∥x^{(k)} − x^*∥² − (2h − 4Lh²)(F(x^{(k)}) − F(x^*)) + 4h²σ².   (6.6)

Using h ≤ 1/(4L) (so that 2h − 4Lh² ≥ h) and F(x^{(k)}) − F(x^*) ≥ (µ/2)∥x^{(k)} − x^*∥², we have

E∥x^{(k+1)} − x^*∥² ≤ 4h²σ² + (1 − hµ/2)·∥x^{(k)} − x^*∥².

The first conclusion follows by unrolling this recursion.
For the second conclusion, (6.6) shows that

EF(x^{(k)}) − F(x^*) ≤ (E∥x^{(k)} − x^*∥² − E∥x^{(k+1)} − x^*∥²)/h + 4hσ².

Summing over k, we have

(1/T) Σₖ₌₀^{T−1} EF(x^{(k)}) − F(x^*) ≤ 4hσ² + ∥x^{(0)} − x^*∥²/(hT).

Exercise 6.25. By applying the theorem above twice, show that one can achieve error Õ(σ²/(µT)), which roughly matches the Cramér-Rao bound.
We note that for many algorithms in this book, such as mirror descent, the stochastic version is obtained by replacing the gradient ∇F by the gradient of a sample, ∇f.
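As a minimal sketch of StochasticGradientDescent, consider a finite-sample least squares objective F(x) = (1/n)Σᵢ(aᵢ^⊤x − bᵢ)² with a constant step size h ≤ 1/(4L). The data and iteration count are our own choices; the residual norm at the end reflects the O(hσ²) noise floor of the theorem.

import numpy as np

rng = np.random.default_rng(5)
n, d = 1000, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
L = 2 * (A**2).sum(axis=1).max()           # f_i(x) = (a_i^T x - b_i)^2 is L-smooth
h = 1 / (4 * L)

x = np.zeros(d)
for k in range(20000):
    i = rng.integers(n)
    g = 2 * (A[i] @ x - b[i]) * A[i]       # stochastic gradient of f_i
    x = x - h * g
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(x - x_star))          # small, but limited by the noise floor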

6.3.2 Finite Sum Problem


When ∇f(x^*) = 0 for all f ∼ D, stochastic gradient descent converges exponentially. When the function is given by a finite sum F = (1/n)Σᵢfᵢ, we can reduce the variance at x^* by replacing ∇fᵢ(x) with

∇f̃ᵢ(x) = ∇fᵢ(x) − ∇fᵢ(x^{(0)}) + ∇F(x^{(0)}).

Note that if x^{(0)} is close to x^*, then ∇f̃ᵢ(x^*) is small, because both ∇fᵢ(x^*) − ∇fᵢ(x^{(0)}) and ∇F(x^{(0)}) are small. Formally, the variance of f̃ is bounded as follows:

Lemma 6.26. Suppose f has L-Lipschitz gradient for all f ∈ D. For any f ∈ D, let ∇f̃(x) = ∇f(x) − ∇f(x^{(0)}) + ∇F(x^{(0)}) for some fixed x^{(0)}. Then, we have

E∥∇f̃(x^*)∥² ≤ 8L·(F(x^{(0)}) − F(x^*)).


Proof. Note that

E∥∇f̃(x^*)∥² ≤ 2E∥∇f(x^*) − ∇f(x^{(0)})∥² + 2∥∇F(x^{(0)}) − ∇F(x^*)∥²
  ≤ 2E∥∇f(x^*) − ∇f(x^{(0)})∥² + 2E∥∇f(x^{(0)}) − ∇f(x^*)∥²
  ≤ 4L(F(x^{(0)}) − F(x^*)) + 4L(F(x^{(0)}) − F(x^*)),

where we used Lemma 6.23 at the end.

Algorithm 20: StochasticVarianceReducedGradient (SVRG)
Input: initial point x^{(0)} ∈ R^d, step size h > 0, restart time m.
x̃ = x^{(0)}.
for k = 0, 1, ···, T do
  if k is a multiple of m and k ≠ 0 then
    x^{(k)} = (1/m) Σⱼ₌₁^m x^{(k−j)}.
    x̃ = x^{(k)}.
  end
  Sample i uniformly from [n].
  x^{(k+1)} ← x^{(k)} − h·(∇fᵢ(x^{(k)}) − ∇fᵢ(x̃) + ∇F(x̃)).
end

Theorem 6.27. Suppose fᵢ has L-Lipschitz gradient for all i and F is µ-strongly convex. Let x^* be the minimizer of F. For step size h = 1/(64L) and m = 256L/µ, the sequence x^{(km)} in StochasticVarianceReducedGradient satisfies

EF(x^{(km)}) − F(x^*) ≤ (1/2^k)·(F(x^{(0)}) − F(x^*)).

In particular, it takes O((n + L/µ) log(1/ε)) gradient computations to find x such that EF(x) − F(x^*) ≤ ε·(F(x^{(0)}) − F(x^*)).

Remark. Gradient descent gives O(n·(L/µ)·log(1/ε)) instead, because computing ∇F involves computing n gradients ∇fᵢ.
Proof. The algorithm consists of T/m phases. For the first phase, we compute ∇F(x^{(0)}). Lemma 6.26 shows that

E∥∇fᵢ(x^{(k)}) − ∇fᵢ(x^{(0)}) + ∇F(x^{(0)})∥² ≤ 8L·(F(x^{(0)}) − F(x^*)).

Hence, Theorem 6.24 shows that

(1/m) Σₖ₌₀^{m−1} EF(x^{(k)}) − F(x^*) ≤ 16hL·(F(x^{(0)}) − F(x^*)) + (F(x^{(0)}) − F(x^*))/(µhm)
  ≤ (1/2)·(F(x^{(0)}) − F(x^*)).

Hence, the error decreases by half each phase. This shows the result.
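The following is a minimal sketch of SVRG on a finite-sum least squares objective. It uses the step size h = 1/(64L) in the spirit of Theorem 6.27, but two common practical variants that are our own choices: the epoch length is m = 2n rather than 256L/µ, and each epoch restarts from the last iterate rather than the epoch average.

import numpy as np

rng = np.random.default_rng(6)
n, d = 1000, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
L = 2 * (A**2).sum(axis=1).max()
h, m = 1 / (64 * L), 2 * n

def grad_i(x, i):
    return 2 * (A[i] @ x - b[i]) * A[i]

x = np.zeros(d)
for epoch in range(30):
    x_tilde = x.copy()
    full = 2 * A.T @ (A @ x_tilde - b) / n        # one full gradient per epoch
    for _ in range(m):
        i = rng.integers(n)
        # variance-reduced stochastic gradient
        x = x - h * (grad_i(x, i) - grad_i(x_tilde, i) + full)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(x - x_star))                 # decreases linearly per epoch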

6.4 Coordinate Descent


If some of the coordinates are more important than others for optimization, it makes sense to update the important coordinates more often.

Lemma 6.28. Given a convex function f, suppose that (∂²/∂xᵢ²)f(x) ≤ Lᵢ for all x, and let L = ΣᵢLᵢ. If we sample coordinate i with probability pᵢ = Lᵢ/L, then

Eᵢ f(x − (1/Lᵢ)·(∂f/∂xᵢ)(x)·eᵢ) ≤ f(x) − (1/(2L))∥∇f(x)∥₂².

Proof. Note that the function ζ(t) = f(x + teᵢ) is Lᵢ-smooth. Hence, we have that

f(x − (1/Lᵢ)·(∂f/∂xᵢ)(x)·eᵢ) ≤ f(x) − (1/(2Lᵢ))·((∂f/∂xᵢ)(x))².

Since we sample coordinate i with probability pᵢ = Lᵢ/L, we have that

E f(x − (1/Lᵢ)·(∂f/∂xᵢ)(x)·eᵢ) ≤ f(x) − Σᵢ (Lᵢ/L)·(1/(2Lᵢ))·((∂f/∂xᵢ)(x))²
  = f(x) − (1/(2L)) Σᵢ ((∂f/∂xᵢ)(x))²
  = f(x) − (1/(2L))∥∇f(x)∥₂².

By the same proof as Theorem 2.9, we have the following:

Algorithm 21: CoordinateDescent (CD)
Input: initial point x^{(0)} ∈ R^d.
for k = 0, 1, ···, T do
  Sample i with probability proportional to Lᵢ.
  x^{(k+1)} ← x^{(k)} − (1/Lᵢ)·(∂f/∂xᵢ)(x^{(k)})·eᵢ.
end

Theorem 6.29 (Coordinate Descent Convergence). Given a convex function f, suppose that (∂²/∂xᵢ²)f(x) ≤ Lᵢ for all x, and let L = ΣᵢLᵢ. Consider the algorithm x^{(k+1)} ← x^{(k)} − (1/Lᵢ)·(∂f/∂xᵢ)(x^{(k)})·eᵢ. Then, we have that

Ef(x^{(k)}) − f(x^*) ≤ 2LR²/(k + 4)  with  R = max_{f(x)≤f(x^{(0)})} ∥x − x^*∥₂

for any minimizer x^* of f. If f is µ-strongly convex, then we also have

Ef(x^{(k)}) − f(x^*) ≤ (1 − µ/L)^{Ω(k)}·(f(x^{(0)}) − f(x^*)).

Remark. Note that L/d ≤ Lip(∇f) ≤ L. Therefore, gradient descent takes at least 1/d times as many steps as coordinate descent, while each of its steps takes d times longer (usually).
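A minimal sketch of coordinate descent on a quadratic follows; the random well-conditioned instance is our own choice. For f(x) = (1/2)x^⊤Qx − c^⊤x, the coordinate-wise smoothness constant is simply Lᵢ = Qᵢᵢ, and each update is an exact coordinate minimization.

import numpy as np

rng = np.random.default_rng(7)
d = 20
B = rng.standard_normal((d, d))
Q = B @ B.T + d * np.eye(d)                # positive definite Hessian
c = rng.standard_normal(d)
Li = np.diag(Q).copy()                     # L_i = d^2 f / dx_i^2 = Q_ii
p = Li / Li.sum()                          # sample i with probability L_i / L

x = np.zeros(d)
for k in range(5000):
    i = rng.choice(d, p=p)
    gi = Q[i] @ x - c[i]                   # partial derivative at coordinate i
    x[i] -= gi / Li[i]                     # coordinate step of size 1/L_i
print(np.linalg.norm(Q @ x - c))           # ~ 0 at the minimizer Q^{-1} c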
Chapter 7

Acceleration

7.1 Chebyshev Polynomials


The goal of this section is to introduce some basic results in approximation theory. Suppose magically there were a polynomial q(x) such that q(0) = 1 and q(x) = 0 for all x > 0. Then the constant term of q is 1, and hence we could write

q(x) = 1 − x·p(x)

for some polynomial p(x). Therefore, for any positive definite matrix A, we would have

I − A·p(A) = 0

and hence A^{-1} = p(A); we could then compute A^{-1}b by calculating p(A)b, which takes deg(p)·nnz(A) time. Unfortunately, there is no such polynomial q. In this section, we discuss polynomials that satisfy the condition above approximately.
We will bound inf_{q(0)=1} max_i |q(λᵢ)| for matrices A satisfying µ·I ⪯ A ⪯ L·I. First, we note that we can choose q(x) = (1 − x/L)^k and get

inf_{q(0)=1} max_{µ≤x≤L} |q(x)| ≤ (1 − µ/L)^k.

This corresponds to the Richardson iteration, and the above bound shows that it requires degree O((L/µ)·log(1/ε)) to make the error bounded by ε. The key fact we will use is that, in general, we can approximate a polynomial of degree k by a polynomial of degree Õ(√k). The construction uses Chebyshev polynomials. This will imply an Õ(√(L/µ)) iteration bound for the conjugate gradient method of the next section.

Definition 7.1. For any integer d, the d-th Chebyshev polynomial T_d is defined as the unique polynomial satisfying

T_d(cos θ) = cos(dθ).

Exercise 7.2. Show that T_d(x) is a polynomial of degree |d| in x, and that

x·T_d(x) = (T_{d+1}(x) + T_{d−1}(x))/2.   (7.1)
Theorem 7.3 (Chernoff Bound). For independent random variables Y₁, ···, Y_s with P(Yᵢ = 1) = P(Yᵢ = −1) = 1/2, and any a ≥ 0, we have

P(|Σᵢ₌₁^s Yᵢ| ≥ a) ≤ 2 exp(−a²/(2s)).

Now we are ready to show that there is a polynomial of degree Õ(√s) that approximates x^s.

Theorem 7.4. For any positive integers s and d, there is a polynomial p of degree d such that

max_{x∈[−1,1]} |p(x) − x^s| ≤ 2 exp(−d²/(2s)).


Proof. Let the Yᵢ be i.i.d. random variables uniform on {−1, 1}, and let Z_s = Σᵢ₌₁^s Yᵢ. Note that

E_{z∼Z_s} T_z = E_{z∼Z_{s−1}} (T_{z+1} + T_{z−1})/2.

Now, (7.1) shows that E_{z∼Z_s} T_z(x) = E_{z∼Z_{s−1}} x·T_z(x). By induction, we have

E_{z∼Z_s} T_z(x) = x^s·T₀(x) = x^s.   (7.2)

Now, we define the polynomial p(x) = E_{z∼Z_s} T_z(x)·1_{|z|≤d}. The error of this polynomial can be bounded as follows:

max_{x∈[−1,1]} |p(x) − x^s| = max_{x∈[−1,1]} |E_{z∼Z_s} T_z(x)·1_{|z|>d}|
  ≤ max_{x∈[−1,1]} E_{z∼Z_s} |T_z(x)|·1_{|z|>d}
  ≤ P_{z∼Z_s}(|z| > d)
  ≤ 2 exp(−d²/(2s)),

where we used (7.2) in the first line, |T_z(x)| ≤ 1 for all x ∈ [−1, 1] in the third line, and the Chernoff bound (Theorem 7.3) in the last line.

Remark. This proof comes from [64]; see [64] for the approximation of other functions.
By scaling and translating Theorem 7.4, we have the following guarantee.

Theorem 7.5. For any 0 < µ ≤ L and any 0 < ε ≤ 1, there is a polynomial q of degree O(√(L/µ)·log(1/ε)) with q(0) = 1 such that

max_{µ≤x≤L} |q(x)| ≤ ε.

Proof. Theorem 7.4 (applied in the variable 1 − x/L) shows that there is a polynomial q of degree d such that

max_{x∈[0,L]} |q(x) − (1 − x/L)^s| ≤ 2 exp(−d²/(2s)).

Hence, we have |q(0) − 1| ≤ 2 exp(−d²/(2s)) and

max_{x∈[µ,L]} |q(x)| ≤ (1 − µ/L)^s + 2 exp(−d²/(2s)).

To make both |q(0) − 1| ≤ ε/3 and max_{x∈[µ,L]} |q(x)| ≤ ε/3, we set d = O(√(s·log(1/ε))) and s = O((L/µ)·log(1/ε)). Rescaling q, we get q(0) = 1.

7.2 Conjugate Gradient


Consider any algorithm that starts with x(0) = 0 and satises the invariant
x(k+1) ∈ x(0) + span{∇f (x(0) ), ∇f (x(1) ), · · · , ∇f (x(k) )}.
For f (x) = 21 x⊤ Ax − b⊤ x, it is easy to see that x(k) ∈ Kk dened as follows:
Denition 7.6. Dene the Krylov subspaces K0 = {0}, Kk = span{b, Ab, · · · , Ak−1 b}.
Therefore, the best such an algorithm can do is to solve minx∈Kk f (x). For the quadratic function
def 1 ⊤
2 x Ax − b x, one can compute minx∈Kk f (x) eciently using the conjugate gradient method. Note

f (x) =
that
2
arg min f (x) = arg min ∥x − x∗ ∥A
x∈Kk x∈Kk
where Ax∗ = b.
Throughout this section, we assume A is positive denite. For a given linear system Ax = b, we can
always go to AT Ax = AT b to satisfy the assumption.

Lemma 7.7. Let x^{(k)} = argmin_{x∈K_k} f(x) be the Krylov sequence. Then the steps v^{(k)} = x^{(k)} − x^{(k−1)} are conjugate, namely,

v^{(i)⊤}Av^{(j)} = 0  for all i ≠ j.

Proof. Assume that i < j. The optimality of x^{(j)} shows that ∇f(x^{(j)}) ∈ K_j^⊥ ⊂ K_{j−1}^⊥, and similarly ∇f(x^{(j−1)}) ∈ K_{j−1}^⊥. Hence, we have

Av^{(j)} = ∇f(x^{(j)}) − ∇f(x^{(j−1)}) ∈ K_{j−1}^⊥.

Next, we note that v^{(i)} = x^{(i)} − x^{(i−1)} ∈ K_i ⊂ K_{j−1}. Hence, we have v^{(i)⊤}Av^{(j)} = 0.

Since the steps are conjugate, the v^{(i)} form a conjugate basis for the Krylov subspaces:

K_k = span{v^{(1)}, v^{(2)}, ···, v^{(k)}}.

Note that x^{(k)} = x^{(k−1)} + v^{(k)}; hence, it suffices to find a formula for v^{(k)}.
Lemma 7.8. With r^{(k−1)} = b − Ax^{(k−1)}, we have

v^{(k)} = (v^{(k)⊤}r^{(k−1)}/∥r^{(k−1)}∥²) · (r^{(k−1)} − (r^{(k−1)⊤}Av^{(k−1)}/(v^{(k−1)⊤}Av^{(k−1)}))·v^{(k−1)}).

Proof. Again, the optimality condition shows that ∇f(x^{(k−1)}) ∈ K_{k−1}^⊥, and ∇f(x^{(k−1)}) = Ax^{(k−1)} − b ∈ K_k by the definition of the Krylov subspaces. Hence, we have

K_k = span{v^{(1)}, ···, v^{(k−1)}, r^{(k−1)}}

where r^{(k−1)} = b − Ax^{(k−1)}. Therefore, we can write

v^{(k)} = c₀r^{(k−1)} + Σᵢ₌₁^{k−1} cᵢv^{(i)}

for some coefficients cᵢ.
For c₀, we use that r^{(k−1)} ∈ K_{k−1}^⊥. This gives v^{(k)⊤}r^{(k−1)} = c₀∥r^{(k−1)}∥².
For c_{k−1}, since v^{(i)⊤}Av^{(k−1)} = 0 for any i ≠ k − 1, we have

0 = v^{(k)⊤}Av^{(k−1)} = c₀r^{(k−1)⊤}Av^{(k−1)} + c_{k−1}v^{(k−1)⊤}Av^{(k−1)}

and hence c_{k−1} = −c₀ · r^{(k−1)⊤}Av^{(k−1)}/(v^{(k−1)⊤}Av^{(k−1)}).
For the other cᵢ, we note that Av^{(i)} ∈ K_{i+1} ⊆ K_{k−1} and r^{(k−1)} ∈ K_{k−1}^⊥, so taking the inner product with Av^{(i)} leaves only cᵢv^{(i)⊤}Av^{(i)} = 0; hence cᵢ = 0 for all i ∉ {0, k−1}.
Substituting the cᵢ, we get

v^{(k)} = c₀ · (r^{(k−1)} − (r^{(k−1)⊤}Av^{(k−1)}/(v^{(k−1)⊤}Av^{(k−1)}))·v^{(k−1)}).

Proof. For the rst formula, note that

v (k)⊤ r(k−1) (k)


x(k) = x(k−1) + v (k) = x(k−1) + p .
∥r(k−1) ∥22

For the quantity v (k)⊤ r(k−1) , we note that f (x(k−1) + tv (k) ) is minimized at t = 1 and hence

v (k)⊤ r(k−1) = v (k)⊤ Av (k)


v (k)⊤ r(k−1) v (k)⊤ r(k−1) (k)⊤ (k)
= p Ap
∥r(k−1) ∥22 ∥r(k−1) ∥22
7.2. Conjugate Gradient 104

where we used the denition of p(k) . Simplifying gives

v (k)⊤ r(k−1) ∥r(k−1) ∥22


(k−1) 2
= (k)⊤ (k)
∥r ∥2 p Ap

and hence the rst formula.


∥r (k−1) ∥22
For the second formula, we use Lemma 7.8 and substitute p(k) = v (k)⊤ r (k−1)
v (k) . This gives

r(k−1)⊤ Av (k−1) (k−1)


p(k) = r(k−1) − p . (7.3)
∥r(k−2) ∥22

Note that
r(k−1) = b − Ax(k−1) = r(k−2) − Av (k−1) .
Taking inner product with r(k−1) and using r(k−1) ⊥ r(k−2) gives ∥r(k−1) ∥2 = r(k−1)⊤ Av (k−1) . Put this into
(7.3) gives the result.

Algorithm 22: ConjugateGradient (CG)


r(0) = b − Ax(0) . p(0) = r(0) .
for k = 1, 2, · · · do
∥r (k−1) ∥22
α= p(k)⊤ Ap(k)
.
x ←x(k)
+ αp(k) .
(k−1)
(k)
r ←r (k−1)
+ αAp(k) .
if ∥r ∥2 ≤ ϵ then return x(k) ;
(k)

∥r (k) ∥2
p(k+1) = r(k) − ∥r (k−1) ∥2
p(k) .
end

Theorem 7.10. f (x) = 12 x⊤ Ax − b⊤ x for some positive denite


Let matrix A and some vector b. Let x(k)
be the sequence produced by ConjugateGradient. Then, we have

1
f (x(k) ) − f (x∗ ) ≤ inf max q(λi )2 · ∥b∥2A−1
2 q(0)=1 i

where q searches over polynomials of degree at most k and λi are eigenvalues of A.


Proof. Note that Kk = {p(A) : deg p < k}. Hence, we have

2(f (x(k) ) − f (x∗ )) = min ∥x − x∗ ∥2A


x∈Kk

= inf ∥(p(A) − A−1 )b∥2A


deg p<k
1 1
= inf ∥A− 2 (Ap(A) − I)A− 2 b∥2A
deg p<k
1
= inf ∥(Ap(A) − I)A− 2 b∥22
deg p<k

Let q(x) = 1 − xp(x). Note that q(0) = 1 and deg q ≤ k and any such q is of the form 1 − xp. Hence, we
have
1
2(f (x(k) ) − f (x∗ )) = inf ∥q(A)A− 2 b∥22
deg q≤k,q(0)=1

≤ inf ∥q(A)∥2op · ∥b∥2A−1 .


q(0)=1

The result follows from the fact that ∥q(A)∥2op = maxi q(λi )2 .
7.3. Accelerated Gradient Descent via Plane Search 105

It is known that for any 0 < µ ≤ L, there is a degree k polynomial q with q(0) = 1 such that
 k
2
max q(λi )2 ≤ 2 1 − q 
i L
µ +1
q
Therefore, it takes O( L µ log( ϵ )) iterations to nd x such that f (x) − f (x ) ≤ ϵ · ∥b∥A−1 .
1 ∗ 2

Also, we note that if there are only s distinct eigenvalues in A, then conjugate gradient nds the exact
solution in s iterations.

7.3 Accelerated Gradient Descent via Plane Search


According to legend, the rst proof for accelerated gradient descent is due to Nemirovski in the 70s. The
rst proof involves a 2-dimensional plane search subroutine. Later on this was improved to line search and
nally Nesterov showed how to get rid of line search in 1982. We change the proof slightly to get rid of all
uses of parameters.
Algorithm 23: NemAGD
Input: Initial point x(1) ∈ Rn .
for k = 1, 2, · · · do
x(k)+ ← argminx=x(k) +t∇f (x(k) ) f (x).
Pk ∇f (x(s) )
P (k) ← x(1) + span(x(k)+ − x(1) , s=1 ∥∇f (x(s) )∥
).
∇f (x(s) )
// Alternatively, one can use P (k) ← x(k) + span(x(k) − x(1) , ∇f (x(k) ),
Pk
s=1 ∥∇f (x(s) )∥ )
without defining x(k)+ .
x (k+1)
= arg minx∈P (k) f (x).
end

Theorem 7.11. Assume that f is convex with ∇2 f (x) ⪯ L · I for all x. Then, we have that

L∥x∗ − x(1) ∥2
f (x(k) ) − f (x∗ ) ≲ .
k2
Proof. Let δk = f (x(k) ) − f (x∗ ). By the convexity of f , we have

δk ≤ ∇f (x(k) )⊤ (x(k) − x∗ )
= ∇f (x(k) )⊤ (x(k) − x(1) ) + ∇f (x(k) )⊤ (x(1) − x∗ ).

Since x(k) is the minimizer on P (k−1) which contains x(1) , we have ∇f (x(k) )⊤ (x(k) − x(1) ) = 0 and hence

δk ≤ ∇f (x(k) )⊤ (x(1) − x∗ ).

Let λt = 1
∥∇f (x(t) )∥
. Note that

T
* T
+ T
X X ∇f (x(k) ) X ∇f (x(k) )
λk δk ≤ , x(1) − x∗ ∗
≤ ∥x − x (1)
∥2 · .

∥∇f (x (k) )∥ ∥∇f (x(k) )∥
k=1 k=1 k=1 2

(s)
Pk ∇f (x )
Finally, we note that ∇f (x(k+1) ) ⊥ P (k) and hence ∇f (x(k+1) ) ⊥ s=1 ∥∇f (x(s) )∥ . Therefore, we have
2
∇f (x(k) ) 2

T T
∇f (x(k) )
X
X
= ∥∇f (x(k) )∥ = T.

∥∇f (x(k) )∥

2

k=1 2 k=1
7.4. Accelerated Gradient Descent 106

Hence, we have
T
X √
λ k δk ≤ T · ∥x∗ − x(1) ∥2 .
k=1

Since x(k)+ ∈ P , we have that


1
δk+1 = f (x(k+1) ) − f (x∗ ) ≤ f (x(k)+ ) − f (x∗ ) ≤ δk − ∥∇f (x(k) )∥22 .
2L
Hence, we have ∥∇f (x(k) )∥22 ≤ 2L(δk − δk+1 ). This gives
T
X δk √
p ≤ 2LT · ∥x∗ − x(1) ∥2
k=1
δk − δk+1

for all T . Solving this recursion gives the result.

Exercise 7.12. Solve the recursion omitted at the end of the proof.
Pk ∇f (x(s) )
Note that the proof above used only the fact that {x(1) , x(k)+ , s=1 ∥∇f (x(s) )∥
} ⊂ P . Therefore, one can
put extra vectors in P to obtain extra features. For example, if we use the subspace
k
X ∇f (x(s) )
P = x(k) + span(x(k) − x(1) , ∇f (x(k) ), (s) )∥
, x(k) − x(k−1) ),
s=1
∥∇f (x

then one can prove that this algorithm is equivalently to conjugate gradient when f is a quadratic function.

7.4 Accelerated Gradient Descent


7.4.1 Gradient Mapping
To give a general version of acceleration, we consider problem of the form
def
min ϕ(x) = f (x) + h(x)
x

where f (x) is strongly convex and smooth and h(x) is convex. We assume that we access the function f and
h dierently via:
1. Let Tf be the cost of computing ∇f (x).
2
2. Let Th,λ be the cost of minimizing h(x) + λ2 ∥x − c∥2 exactly.
The idea is to move whatever we can optimize in ϕ to h and hopefully this makes the remaining part of ϕ,
f , as smooth and strongly convex as possible. To make the statement general, we only assume h is convex
and hence h may not be dierentiable. To handle this issue, we need to dene an approximate derivative of
h that we can compute.
Denition 7.13. We dene the gradient step
L 2
px = argminy f (x) + ∇f (x)⊤ (y − x) + ∥y − x∥ + h(y)
2
and the gradient mapping
gx = L(x − px ).
Note that if h = 0, then px = x − 1
L ∇f (x) and gx = ∇f (x). In general, if ϕ ∈ C 2 , then we have that

1 1
px = x − ∇ϕ(x) + O( 2 ).
L L
7.4. Accelerated Gradient Descent 107

Therefore, we have that gx = ∇ϕ(x) + O( L1 ). Hence, the gradient mapping is an approximation of the
gradient of ϕ that is computable in time Tg + Th,L .
The key lemma we use here is that ϕ satises a lower bound dening using gx . Ideally, we would love to
get a lower bound as follows:
µ 2
ϕ(z) ≥ ϕ(x) + gx⊤ (z − x) + ∥z − x∥2 .
2
But it is WRONG. If that was true for all z , then we would have gx = ∇ϕ(x). However, if ϕ ∈ C 2 is µ
strongly convex, then we have
µ 2
ϕ(z) ≥ ϕ(x) + ∇ϕ(x)⊤ (z − x) + ∥z − x∥2
2
1 1 2 µ 2
≥ ϕ(x − ∇ϕ(x)) + ∥∇ϕ(x)∥ + ∇ϕ(x)⊤ (z − x) + ∥z − x∥2 . (7.4)
L 2L 2
It turns out that this is true and is exactly what we need for proving gradient descent, mirror descent and
accelerated gradient descent.
Theorem 7.14. Given ϕ = f + h. Suppose that f is µ strongly convex with L-Lipschitz gradient. Then, for
any z, we have that
1 2 µ 2
ϕ(z) ≥ ϕ(px ) + gx⊤ (z − x) + ∥gx ∥2 + ∥z − x∥2 .
2L 2
2
Proof. Let f (y) = f (x) + ∇f (x)⊤ (y − x) + L2 ∥y − x∥2 and pt = px + t(z − px ). Using that p0 is the minimizer
of f + h, we have that
L 2
f (p0 ) + h(p0 ) ≤ f (pt ) + h(pt ) ≤ f (p0 ) + ∇f (p0 )⊤ (pt − p0 ) + ∥pt − p0 ∥ + h(pt ).
2
Hence, we have that
L 2
0 ≤ ∇f (p0 )⊤ (pt − p0 ) + ∥pt − p0 ∥ + h(pt ) − h(p0 )
2
 Lt2 2
≤ t · ∇f (p0 )⊤ (z − p0 ) + h(z) − h(p0 ) + ∥z − p0 ∥ .
2
Taking t → 0+ , we have
∇f (px )⊤ (z − px ) + h(z) − h(px ) ≥ 0
Expanding the term ∇f (px ), we have
∇f (x)⊤ (z − px ) + L(px − x)⊤ (z − px ) + h(z) − h(px ) ≥ 0.
Equivalently,
h(z) ≥ h(px ) + ∇f (x)⊤ (px − z) + L(px − x)⊤ (px − z)
1
= h(px ) + ∇f (x)⊤ (px − z) + ∥gx ∥2 + L(px − x)⊤ (x − z)
L
1
= h(px ) + ∇f (x)⊤ (px − z) + ∥gx ∥2 + gx⊤ (z − x).
L
Using that
µ
f (z) ≥ f (x) + ∇f (x)⊤ (z − x) + ∥z − x∥2
2
and that
1
f (px ) ≤ f (x) + ∇f (x)⊤ (px − x) + ∥gx ∥2 ,
2L
we have the result:
1 µ
ϕ(z) ≥ h(px ) + f (x) + ∇f (x)⊤ (px − x) + ∥gx ∥2 + gx⊤ (z − x) + ∥z − x∥2
L 2
1 µ
≥ ϕ(px ) + ∥gx ∥2 + gx⊤ (z − x) + ∥z − x∥2 .
2L 2
7.4. Accelerated Gradient Descent 108

The next lemma shows that ∥gx ∥2 ≤ 2G if ϕ is G-Lipschitz.

Lemma 7.15. If ϕ is G-Lipschitz, then ∥gx ∥2 ≤ 2G for all x.


Proof. By the denition of gradient mapping (namely, px is the minimizer of a function), we have that

L 2
f (x) + ∇f (x)⊤ (px − x) + ∥px − x∥ + h(px ) ≤ f (x) + h(x).
2
Using h(px ) ≥ h(x) + ∇h(x)⊤ (px − x), we have that

L 2 L 2
0 ≥ ∇ϕ(x)⊤ (px − x) + ∥px − x∥ ≥ −G ∥px − x∥2 + ∥px − x∥ .
2 2
Hence, we have that ∥px − x∥2 ≤ 2
LG and hence ∥gx ∥2 ≤ 2G.

7.4.2 Gradient Descent Using Gradient Mapping


Putting z = x in Theorem 7.14, we have the following:

Lemma 7.16 (Gradient Descent Lemma). We have that

1 2
ϕ(px ) ≤ ϕ(x) − ∥gx ∥2 .
2L
2
This shows that each step of the gradient step decreases the function value by 2L1
∥gx ∥2 . Therefore, if
the gradient is large, then we decrease the function value by a lot. On the other hand, Putting z = x∗ for
Theorem 7.14 shows that
ϕ(x∗ ) ≥ ϕ(px ) + gx⊤ (x∗ − x).
If the gradient is small and domain is bounded, this shows that we are close to the optimal. Combining
these two facts, we can get the gradient descent.

7.4.3 Mirror Descent Using Gradient Mapping


Consider the mirror descent as
x(k+1) = x(k) − ηgx(k) . (7.5)
The main dierence between this and gradient descent is that we will take a larger step size η . To analyze
the mirror descent, we use Theorem 7.14 and the convexity of ϕ to get that
k k k
1X 1X 1X ⊤
ϕ( px(i) ) − ϕ(x∗ ) ≤ (ϕ(px(i) ) − ϕ(x∗ )) ≤ g (i) (x(i) − x∗ ). (7.6)
k i=1 k i=1 k i=1 x

Therefore, it suces to upper bound gx⊤(i) (x(i) − x∗ ). The following lemma shows that if gx⊤(i) (x(i) − x∗ ) is
large, then either the gradient is large or the distance to optimum moves a lot. It turns out this holds for
any vector g , not necessarily an approximate gradient.

Lemma 7.17 (Mirror Descent Lemma). Let p = x − ηg . Then, we have that

η 2 1  2 2

g ⊤ (x − u) = ∥g∥2 + ∥x − u∥2 − ∥p − u∥2
2 2η
for any u.
7.4. Accelerated Gradient Descent 109

7.4.4 Algorithm and Analysis


Recall Lemma 7.16 shows that if the gradient is large, gradient descent makes a large progress. On the other
hand, if the gradient is small, (7.6) shows that mirror descent makes a large progress. Therefore, it is natural
to combine two approaches.
Algorithm 24: AGD
Input: Initial point x(1) ∈ Rn .
y (1) ← x(1) , z (1) ← x(1) .
for k = 1, 2, · · · , T do
x(k+1) ← τ z (k) + (1 − τ )y (k) .
Perform a gradient step: y (k+1) = x(k+1) − L1 gx(k+1) .
Perform a mirror step: z (k+1) = z (k) − ηgx(k+1) .
end
Return px(k+1) .
1
PT
T k=1

Note that if τ = 1, the algorithm is simply mirror descent and if τ = 0, the algorithm is gradient descent.

Theorem 7.18. Setting


1−τ
τ = ηL and η= √1 , we have that
µL
s
T
1 X 2 L 
ϕ( px(k+1) ) − ϕ(x∗ ) ≤ ϕ(x(1) ) − ϕ(x∗ ) .
T T µ
k=1
q
In particular, if we restart the algorithm every 4 Lµ iterations, we can nd x such that

r
µ Ω(T )  
ϕ(x) − ϕ(x∗ ) ≤ 2(1 − ) ϕ(x(1) ) − ϕ(x∗ )
L
in T steps. Furthermore, each step takes Tf + Th,L
Proof. Lemma 7.17 showed that
 2 2 
η 2 1
gx⊤(k+1) (z (k) ∗ ∗ ∗
(7.7)
(k) (k+1)
− x ) ≤ ∥gx(k+1) ∥2 + z − x − z −x
2 2η 2 2

This shows that if the mirror descent has large error gx⊤(k+1) (z (k) − x∗ ), then the gradient descent makes
2
a large progress ( η2 ∥gx(k+1) ∥2 ).
To make the left-hand side usable, note that x(k+1) = z (k) + 1−τ τ · (y
(k)
− x(k+1) ) and hence

1−τ ⊤
gx⊤(k+1) (x(k+1) − x∗ ) = gx⊤(k+1) (z (k) − x∗ ) + · gx(k+1) (y (k) − x(k+1) )
τ
1−τ 1 2
≤ gx⊤(k+1) (z (k) − x∗ ) + (ϕ(y (k) ) − ϕ(y (k+1) ) − ∥gx(k+1) ∥2 )
 τ 2L
η 1 2 2  1 − τ 1
2 2
z − x∗ − z (k+1) − x∗ +
(k)
≤ ∥gx(k+1) ∥2 + (ϕ(y (k) ) − ϕ(y (k+1) ) − ∥g (k+1) ∥2 ).

2 2η 2 2 τ 2L x

where we used Theorem 7.14 in the middle and (7.7) at the end.
Now, we set 1−τ τ = ηL and get
 2 2 
1
gx⊤(k+1) (x(k+1) − x∗ ) ≤ ηL(ϕ(y (k) ) − ϕ(y (k+1) )) + ∗ ∗
(k) (k+1)
− x − − x .

z z
2 2
7.5. Accelerated Coordinate Descent 110

PT
Taking a sum on both side and let x = 1
T k=1 px(k+1) , we have that

T
1 X
ϕ(x) − ϕ(x∗ ) ≤ (ϕ(px(k+1) ) − ϕ(x∗ ))
T
k=1
T
1 X ⊤
≤ gx(k+1) (x(k+1) − x∗ )
T
k=1
ηL   1 2
ϕ(y (1) ) − ϕ(y (T +1) ) + z − x∗
(1)


T 2ηT 2
 
ηL 1 
≤ + ϕ(x(1) ) − ϕ(x∗ )
T 2ηT µ

The conclusion follows from our setting of η .

7.5 Accelerated Coordinate Descent


∂2
Recall from Theorem 6.29 that if ∂x2i
ℓ(x) ≤ Li for all x and that ℓ is µ strongly convex, then we can nd
x (k)
in k coordinate steps such that
µ Ω(k)
Eℓ(x(k) ) − ℓ(x∗ ) ≤ (1 − ) (ℓ(x(0) ) − f (x∗ ))
L
where L = Li . Now, the really fun part is here. Consider the function
P
i

µ 2 µ 2
ϕ(x) = f (x) + h(x) with f (x) = ∥x∥2 and h(x) = ℓ(x) − ∥x∥2 .
2 2
Since f is µ + smooth (YES!, I know this is also µ smooth) and µ strongly convex and since h is convex,
L
n q
we apply Theorem 7.18 and get an algorithm that takes O∗ ( nµ L
) steps. Note that each step involves
Tf + Th,µ+ L . Obviously, Tf = 0. Next, note that Th,µ+ L involves solving a problem of the form
n n

µ L 2 µ 2
yx = argminy ( + ) ∥y − x∥ + (ℓ(y) − ∥x∥ )
2 2n 2
L 2
= argminy ℓ(y) − µy ⊤ x + ∥y − x∥ .
2n
Now, we can apply Theorem 6.29 to solve this problem. It takes

L + (L/n) · n
O∗ ( ) = O∗ (n) coordinate steps.
L/n

Therefore, in total it takes


s s
L Ln
O∗ ( ) · O∗ (n) = O∗ ( ) coordinate steps.
nµ µ

Hence, we have the following theorem

Theorem 7.19 (Accelerated Coordinate Descent Convergence). Given an µ strongly-convex function


q ℓ.
∂2 ∗ Ln
coordinate steps.
P
Suppose that
∂x2i
ℓ(x) ≤ Li for all x and let L= Li . We can minimize ℓ in O ( µ )

P q
Remark 7.20. It is known how to do it in O∗ ( i Lµi ) steps [4].
7.6. Accelerated Stochastic Descent 111

7.6 Accelerated Stochastic Descent


Pn
Recall from Theorem 6.27 that if ∇2 ℓi (x) ≤ L for all x and i and that ℓ(x) = 1
n i=1 ℓi (x) is µ strongly
convex, then we can nd x in O((n + Lµ ) log( ϵ )) stochastic steps such that
1

Eℓ(x) − ℓ(x∗ ) ≤ ϵ(ℓ(x(0) ) − ℓ(x∗ )).

Similar to the coordinate descent, we can accelerate it using the accelerated gradient descent (Theorem
7.18). To apply Theorem 7.18, we consider the function
µ 2 µ 2
ϕ(x) = f (x) + h(x) with f (x) = ∥x∥2 and h(x) = ℓ(x) − ∥x∥2 .
2 2
Since f is µ + smooth (YES!, I know this is also µ smooth) and µ strongly convex and since h is convex,
L
n q
we apply Theorem 7.18 and get an algorithm that takes O∗ (1 + nµ L
) steps. Note that each step involves
Tf + Th,µ+ L . Obviously, Tf = 0. Next, note that Th,µ+ L involves solving a problem of the form
n n

µ L 2 µ 2
yx = argminy ( + ) ∥y − x∥ + (ℓ(y) − ∥x∥ )
2 2n 2
L 2
= argminy ℓ(y) − µy ⊤ x + ∥y − x∥
2n
1X L 2
= argminy (ℓi (y) + ∥y − x∥ − µy ⊤ x)
n i 2n

Now, we can apply Theorem 6.27 to solve this problem. It takes


L
L+
O∗ (n + n
L
) = O∗ (n)
µ+ n

Therefore, in total it takes s s


L∗ ∗ ∗ L
O (1 + ) · O (n) = O (n + n ).
nµ µ

Theorem 7.21. Given a convex function ℓ = n1 ℓi . Suppose that ∇2 ℓi (x) ≤ L for all i and x and that ℓ
P
is µ strongly convex. Suppose we can compute ∇ℓi in O(1) time. We have an algorithm that outputs an x
such that
Eℓ(x) − ℓ(x∗ ) ≤ ε(ℓ(x(0) ) − ℓ(x∗ ))
q
in O∗ (n + nL
µ) stochastic steps.
Part II

Sampling

112
Chapter 8

Gradient-based Sampling

Sampling in high dimension is a fundamental problem. Informally, given access to a function f : Rn →


R ∪ {∞}, the sampling problem is to generate a point x ∈ Rn from the distribution with density proportional
to e−f (x) . Note that any density can be written in this form, so this is completely general. To make the
problem precise, we also have to specify a starting point with positive density, and an error parameter ϵ that
measures the distance of the distribution of the output from the desired target.
Unfortunately, in this generality, just like optimization, sampling is also intractable. To see this, consider
the following function: (
0 x∈S
f (x) =
M x ̸∈ S

for some closed set S . Then sampling according to e−f for a suciently large M would allow us to nd an
element of S , which could, e.g., be the minimizer of a hard-to-optimize function.
Consider a second example, which might appear more tractable:
1 ⊤
g(x) = e− 2 x Ax
1x≥0 .

Without the restriction to the nonnegative orthant, the target density is the Gaussian N (0, A−1 ), and can
be sampled by rst sampling the standard Gaussian N (0, I) and applying the linear transformation A−1/2 .
To sample from the standard Gaussian in Rn , we can sample each coordinate independently from N (0, 1),
a problem which has many (ecient) numerical recipes. But how can we handle the restriction? In the
course of forthcoming chapters, we will see that this problem and its generalization to sampling logconcave
densities, i.e., when f is convex, can be solved in polynomial time. It is remarkable that the polynomial-time
frontier for both optimization and sampling is essentially determined by convexity.
We begin with gradient-based sampling methods. These rely on access to ∇f . These methods will in
fact be natural algorithmic versions of continuous processes on random variables, a particularly pleasing
connection. Later we will see methods that only use access to f , and others that utilize higher derivatives,
notably the Hessian. The parallels to optimization will be pervasive and striking.

8.1 Gradient-based methods: Langevin Dynamics


Here we study a simple stochastic process for generating samples from a desired distribution e−f (x) . As we
will see later in this chapter, it can also be viewed as a stochastic version of gradient descent in the space
of measures. While gradient descent corresponds to an ordinary dierential equation (ODE), stochastic
gradient descent corresponds to a stochastic dierential equation (SDE).
Algorithm 25: LangevinDiffusion (LD)
Input: Initial point x0 ∈ Rn .
Solve the stochastic dierential equation

dxt = −∇f (xt )dt + 2dWt

Output: xt .

113
8.1. Gradient-based methods: Langevin Dynamics 114

Here f : Rn → R is a function, xt is the random variable at time t and dWt is innitesimal Brownian
motion also known as a Wiener process. We can view it as the continuous version of the following discrete
process √
xt+1 = xt − h∇f (xt ) + 2hζt
with ζt sampled independently from N (0, I). When we take the step size h → 0, this discrete process
converges to the continuous one. We discuss the continuous version rst.
A more general form of an SDE is

dxt = µ(xt , t)dt + σ(xt , t)dWt

where xt ∈ Rn , µ(xt , t) ∈ Rn is a time-varying vector eld and σ(xt , t) ∈ Rn×m is a time-varying linear
transformation. The simplest such process is the Wiener process: dxt = dWt which nds many applications
in applied mathematics, nance, biology and physics. Another useful process is the Ornstein-Ulhenbeck
Process:

dxt = −axt dt + σdWt


This represents a particle moving in a uid, the rst term being the force of friction exerted on the particle
and the second term being the movement caused by other particles colliding in our particle, which is modeled
using Brownian motion.
A crucial dierence between ordinary dierentials and stochastic dierentials is in the chain rule. Unlike
classical dierentials where we have df (x) = ∇f (x)dx, we have the following chain rule for stochastic calculus.

Lemma 8.1 (Itô's lemma). For any process x t ∈ Rn satisfying dxt = µ(xt )dt + σ(xt )dWt where µ(xt ) ∈ Rn
n×m
and σ(xt ) ∈ R , we have that

1
df (xt ) = ∇f (xt )⊤ dxt + (dxt )⊤ ∇2 f (xt )(dxt )
2
1
= ∇f (xt )⊤ µ(xt )dt + ∇f (xt )⊤ σ(xt )dWt + tr(σ(xt )⊤ ∇2 f (xt )σ(xt ))dt.
2
The usual chain rule comes from using Taylor expansion and taking a limit, i.e.,

f (x) + h∇f (x)⊤ y + 21 h2 ...


∇f (x + hy) = lim
h→0 h
and only the rst term survives, as the second and later terms goes to zero. But with the stochastic
component, roughly speaking, (dWt )2 = dt, and we need to keep track of the second term in the Taylor
expansion as well. For a detailed treatment of stochastic calculus, we refer the reader to a standard textbook
such as [60].
First, we will see that e−f is a stationary density for the Langevin SDE in continuous time. The proof
relies on the following general theorem about the distribution induced by an SDE.

Theorem 8.2 (FokkerPlanck equation). For any process xt ∈ Rn satisfying dxt = µ(xt )dt + σ(xt )dWt
n n×m
where µ(xt ) ∈ R and σ(xt ) ∈ R with the initial point x0 drawn from p0 . Then the density pt of xt
satises the equation

dpt X ∂ 1 X ∂2
=− (µ(x)i pt (x)) + [(D(x))ij pt (x)]
dt i
∂xi 2 i,j ∂xi ∂xj

where D(x) = σ(x)σ(x)⊤ .


Proof. For any smooth function ϕ, we have that

Ex∼pt ϕ(x) = Eϕ(xt ).


8.1. Gradient-based methods: Langevin Dynamics 115

Taking derivatives on the both sides with respect to t, using Itô's lemma (Lemma 8.1), and noting that
EdWt = 0, we have that
Z  
⊤ ⊤ 1 ⊤ 2
ϕ(x)dpt (x)dx = E ∇ϕ(xt ) µ(xt )dt + ∇ϕ(xt ) σ(xt )dWt + tr(σ(xt ) ∇ ϕ(xt )σ(xt ))dt
2
 
⊤ 1 2
= E ∇ϕ(xt ) µ(xt )dt + tr(∇ ϕ(xt )D(xt ))dt .
2

Using xt ∼ pt , we have that


Z Z
dpt 1
ϕ(x) dx = ∇ϕ(x)⊤ µ(x)pt (x) + tr(∇2 ϕ(x)D(x))pt (x)dx.
dt 2
Integrating by parts,
Z Z X ∂
∇ϕ(x)⊤ µ(x)pt (x)dx = − ϕ(x) (µi (x)pt (x))dx.
i
∂xi

Similarly, integrating by parts twice gives


Z Z
tr(σ(x) ∇ ϕ(x)σ(x))pt (x)dx = tr(∇2 ϕ(x)σ(x)σ(x)⊤ )pt (x)dx
⊤ 2

Z X ∂
= − ⟨∇ϕ(x), (pt (x)D(x)i )⟩dx
i
∂xi
XZ ∂2
= ϕ(x) [(D(x))ij pt (x)] dx.
i,j
∂xi ∂xj

Hence,  
Z 2
dp t
X ∂ 1 X ∂
ϕ(x)  + (µ(x)i pt (x)) − [(D(x))ij pt (x)] dx = 0
dt i
∂x i 2 i,j
∂xi ∂xj

for any smooth ϕ. Therefore, we have the conclusion of the theorem.


We apply the FokkerPlanck equation to the Langevin dynamics.

Theorem 8.3. For any smooth function f , the density proportional to F = e−f is stationary for the Langevin
dynamics.

Proof. The FokkerPlanck equation (Theorem 8.2) shows that the distribution pt of xt satises

dpt X ∂ ∂f (x) X ∂2
= ( pt (x)) + [pt (x)] . (8.1)
dt i
∂xi ∂xi i
∂x2i

Now since pt is stationary the LHS is zero and we can rewrite the above as

dpt X ∂  ∂f (x) ∂

=0= pt (x) + pt (x)
dt i
∂xi ∂xi ∂xi
X ∂  
∂f (x) ∂

= pt (x) + log pt (x)
i
∂xi ∂xi ∂xi
X ∂  


pt (x)

= pt (x) log −f (x) .
i
∂xi ∂xi e

We can verify that pt (x) ∝ e−f (x) is a solution.


8.1. Gradient-based methods: Langevin Dynamics 116

Exercise 8.4. Consider the equation


d 1 d2
pt (x) = pt (x)
dt 2 dx2
for x ∈ R, t > 0. Show that the density of N (0, t), i.e.
 2
1 x
ϕt (x) = √ exp −
2πt 2t

satises this equation. Then generalize the process to Rn .

Exercise 8.5. Consider the SDE with Xt ∈ Rn :



dXt = −Xt dt + 2dWt .

Use the Fokker-Planck equation to derive a corresponding stationary density. Use Itô's lemma to derive
4
E(∥Xt ∥4 ). (Hint: Take expectation on both sides of Itô's lemma for appropriate f (Xt ), and use Edf (Xt ) =
dEf (Xt ) for continuous f and df ).

Convergence via Coupling. Next we turn to the rate of convergence, which will also prove uniqueness
of the stationary distribution for the stochastic process. For this, we assume that f is strongly convex. The
proof is via the classical coupling technique [3].
Our goal is to bound the rate at which the distribution of the current point approaches the stationary
distribution, in some chosen measure of distance between distributions (for example, the TV distance). To
do this, in the coupling technique, we consider two points which are both following the random process. One
of them is already in the stationary distribution, and therefore will stay there. The other is our point. We
will show that there is a coupling of the two distributions, i.e., a joint distribution over the two points, whose
marginals are identical to the single point processes, such that the expected distance between the two points
decreases at a certain rate. More formally, we couple two copies xt , yt of the random process with dierent
starting points (the coupling is a joint distribution D(xt , yt ) with the property that its marginal for each of
xt , yt is exactly the process) and show that their distributions get closer over time.
While the challenge usually is to nd a good coupling, in the present case, the simple identity coupling
(i.e., the same Wiener process is used for both xt and yt ) works well. The distance measure we will use here
is the Wasserstein distance (in Euclidean norm, see Denition 0.12).

Exercise 8.6. Show that for two distributions with the same nite support, computing their Wasserstein
distance reduces to a bipartite matching problem.

Lemma 8.7. Let xt , yt evolve according to the Langevin diusion for a µ-strongly convex function f : Rn →
R. Then, there is a coupling γ between xt and yt s.t.

2 2
Ext ,yt ∼γ ∥xt − yt ∥ ≤ e−2µt ∥x0 − y0 ∥ .

Proof. From the denition of LD, and by using the identity coupling, i.e., the same Gaussian dWt for both
processes xt and yt , we have that
d
(xt − yt ) = ∇f (yt ) − ∇f (xt ).
dt
Hence,
1 d
∥xt − yt ∥2 = ⟨∇f (yt ) − ∇f (xt ), xt − yt ⟩ .
2 dt
Next, from the strong convexity of f , we have
µ 2
f (yt ) − f (xt ) ≥ ∇f (xt )⊤ (yt − xt ) + ∥xt − yt ∥ ,
2
µ 2
f (xt ) − f (yt ) ≥ ∇f (yt )⊤ (xt − yt ) + ∥xt − yt ∥ .
2
8.2. Langevin Dynamics is Gradient Descent in Density Space*2 117

Adding two equations together, we have


2
(∇f (xt ) − ∇f (yt ))⊤ (xt − yt ) ≥ µ ∥xt − yt ∥ .

Therefore,
1 d
∥xt − yt ∥2 ≤ −µ∥xt − yt ∥2 .
2 dt
Hence,
d
d ∥xt − yt ∥2
log ∥xt − yt ∥2 = dt
dt ∥xt − yt ∥2
≤ −2µ.

Integrating both sides from 0 to t, we get

log ∥xt − yt ∥2 − log ∥x0 − y0 ∥2 ≤ −2µt

which proves the result.

Exercise 8.8. Give an example of a function f for which the density proportional to e−f is not stationary
for the following discretized Langevin algorithm

x(k+1) = x(k) − ϵ∇f (x(k) ) + 2ϵZ (k)

where Z (k) ∼ N (0, 1) are independent and the distribution of x(0) is Gaussian.

8.2 Langevin Dynamics is Gradient Descent in Density Space*1


Here we show that Langevin dynamics is simply gradient descent for the function F (ρ) = DKL (ρ∥ν) for
two densities ρ, ν on the Wasserstein space where ν = e−f (x) / e−f (y) dy is the target and ρ is the current
R

density. For this, we rst dene the Wasserstein space.

Denition 8.9. The Wasserstein space P2 (Rn ) on Rn is the manifold on the set of probability measures
on Rn such that the shortest path distance of two measures x, y in this manifold is exactly equal to the
Wasserstein distance between x and y .
We let Tp (M) refer to the tangent space at a point p in a manifold M.

Lemma 8.10. n n
For any p ∈ P2 (R ) and v ∈ Tp P2 (R ), we can write v(x) = ∇ · (p(x)∇λ(x)) for some
n
function λ on R . Furthermore, the local norm of v in this metric is given by

∥v∥2p = Ex∼p ∥∇λ(x)∥2 .

Proof. Let p ∈ P2 (Rn ) and v ∈ Tp P2 (Rn ). We will show that any change of density v can be represented by
a vector eld c on Rn as follows: Consider the process x0 ∼ p and dt d
xt = c(xt ). Let pt be the density of
the distribution of xt . To compute dt pt , we follow the same idea as in the proof as Theorem 8.2. For any
d

smooth function ϕ, we have that Ex∼pt ϕ(x) = Eϕ(xt ). Taking derivatives on the both sides with respect to
t, we have that
Z Z Z
d
ϕ(x) pt (x)dx = ∇ϕ(x)⊤ c(x)pt (x)dx = − ∇ · (c(x)pt (x))ϕ(x)dx
dt
where we used integration by parts at the end. Since this holds for all ϕ, we have that

dpt (x)
= −∇ · (pt (x)c(x)).
dt
1 Sections marked with * are more mathematical and can be skipped.
8.2. Langevin Dynamics is Gradient Descent in Density Space*3 118

Since we are interested only in vector elds that generate the minimum movement in Wasserstein distance,
we consider the optimization problem
Z
1
min p(x)∥c(x)∥2 dx
−∇·(pc)=v 2

where we can think v is the change of pt . Let λ(x) be the Lagrangian multiplier of the constraint −∇·(pc) = v .
Then, the problem becomes
Z Z
1
min p(x)∥c(x)∥2 dx − λ(x)∇ · (p(x)c(x))dx.
c 2
Z Z
1
= min p(x)∥c(x)∥ dx + ∇λ(x)⊤ c(x) · p(x)dx.
2
c 2

Now, we note that the problem is a pointwise optimization problem whose minimizer is given by

c(x) = −∇λ(x).

This proves that any vector eld that generates minimum movement in Wasserstein distance is a gradient
eld. Also, we have that v(x) = ∇R· (p(x)∇λ(x)). Note that the right hand side is an elliptical dierential
equation and hence for any v with v(x)dx = 0, there is an unique solution λ(x). Therefore, we can write
v(x) = ∇ · (p(x)∇λ(x)) for some λ(x).
Next, we note that the movement is given by
Z
∥v∥p = p(x)∥c(x)∥2 dx = Ex∼p ∥∇λ(x)∥2 .
2

As we discussed in the gradient descent section, one can use norms other than ℓ2 norm. For the Wasser-
stein space, we should use the local norm as given in Lemma 8.10.
Theorem 8.11. Let ρt be the density of the distribution produced by Langevin Dynamics for the target
ν = e−f (x) / e−f (y) dy . Then, we have that
R
distribution

dρ 1
= argminv∈Tp P2 (Rn ) ⟨∇F (ρ), v⟩p + ∥v∥2p .
dt 2
Namely, ρt follows continuous gradient descent in the density space for the function F (ρ) = DKL (ρ∥ν) under
the Wasserstein metric.

Proof. For any function c, the optimization problem of interest satises


Z Z Z
1 1
min ⟨c, δ⟩ + ρ(x)∥∇λ(x)∥2 dx = min − ρ(x) · ∇c(x)⊤ ∇λ(x)dx + ρ(x)∥∇λ(x)∥2 dx.
δ=∇·(ρ∇λ) 2 ∇λ 2

Solving the right hand side, we have ∇c = ∇λ and hence δ = ∇·(ρ∇c). Now, we note that ∇F (ρ) = log νρ −1.
Therefore,
dρ ρ
= ∇ · (ρ∇(log − 1))
dt ν
ρ
= ∇ · (ρ∇ log )
ν
= ∇ · (ρ∇f ) + ∆ρ

which is exactly equal to (8.1).

To analyze this continuous descent in Wasserstein space, we rst prove that continuous gradient descent
converges exponentially whenever F is strongly convex.
8.2. Langevin Dynamics is Gradient Descent in Density Space*4 119

Lemma 8.12. Let F be a function satisfying Gradient Dominance:

2
∥∇F (x)∥x ≥ α · (F (x) − min F (y)) for all x (8.2)
y

on the manifold with the metric ∥ · ∥x where ∇ is the gradient on the manifold. Then, the process dxt =
−αt
−∇F (xt )dt converges exponentially, i.e., F (xt ) − miny F (y) ≤ e (F (x0 ) − miny F (y)).

Proof. We write
d dxt
(F (x) − min F (y)) = ⟨∇F (xt ), ⟩x = −∥∇F (xt )∥2xt ≤ −α(F (x) − min F (y)).
dt y dt t y

The conclusion follows.

Finally, we note that the log-Sobolev inequality for the density ν can be re-stated as the condition (8.2).

Lemma 8.13. Fix a density ν. Then the log-Sobolev inequality, namely, for every smooth function g,
Z Z
1 2
∥∇g∥ dν ≥ α g(x)2 log g(x)2 dν
2

implies the condition (8.2).


q
ρ(x)
Proof. Take g(x) = ν(x) , the log-Sobolev inequality shows that

Z 2 Z
1 ρ(x)
dx ≥ α · ρ(x) log ρ(x) dx for all ρ.
ρ(x) ∇ log

2 ν(x) ν(x)

As we calculate in Theorem 8.11, we have that


Z 2
2 ρ(x)
∥∇F (ρ)∥ρ = ρ(x) ∇ log
dx.
ν(x)

Therefore, this is exactly the condition (8.2) with coecient 2α.

Combining Lemma 8.13 and Lemma 8.12, we have the following result:

Theorem 8.14. Let f be a smooth function with log-Sobolev constant α. Then the Langevin dynamics

dxt = −∇f (x)dt + 2dWt

converges exponentially in KL-divergence to the density ν(x) ∝ e−f (x) with mixing rate O( α1 ), i.e., KL(xt , ν) ≤
e−2αt KL(x0 , ν).
See [44] for a tight estimate of log-Sobolev constant for logconcave measures. In particular for a logconcave
measure with support of diameter D, the log-Sobolev constant is Ω(1/D).

8.2.1 Discussion
Langevin dynamics converges quickly in continuous time for isoperimetric distributions. Turning this into
an ecient algorithm typically needs more assumptions and there is much room for choosing discretizations.
This is similar to the situation with gradient descent for optimization. As we saw in Section 8.2, it turns
out that Langevin dynamics is in fact gradient descent in the space of probability measures under the
Wasserstein metric, where the function being minimized is the KL-divergence of the current density from
the target stationary density. For more on this view of sampling as optimization over measures, see [75].
Chapter 9

Elimination and Reduction

9.1 Cutting Plane method for Volume Computation


The cutting plane method gives a simple algorithm for computing the volume of a bounded set, or the
integral of a function. The algorithm relies crucially on the ability to compute the center of gravity of the
original set/function restricted by halfspaces. For simplicity we describe it here for the case when the input
object is a convex body.
Algorithm 26: CuttingPlaneVolume
Input: A body K ⊂ Rn , r ∈ R, and an oracle for computing centroid.
Fix an ordering on the axes: e1 , e2 , . . . en
K (0) = K, z (0) = centroid(K (0) ), V = 1, k = 0.
for k = 0, · · · do
Let ei be an axis vector so that the width of K (k) along ei is greater than r/2.
if There is no such ei then Break.;
def 
Let a be ei or −ei with sign chosen so that H (k) = x : a⊤ x ≤ a⊤ z (k) contains z (0) .

K (k+1) ← K (k) ∩ H (k) .


z (k+1) ← centroid(K (k+1) ).
zb ← centroid(K (k) \ K (k+1) ).
∥zb−z(k+1) ∥
V ← V · zb−z(k) .
∥ ∥
end
Qn
Return V · i=1 wi (K (k+1) ) where wi (·) is the width along ei .
The idea behind the algorithm is simple: when we cut a set with a hyperplane through its centroid, the
line joining the centroids of the two sides passes through the original centroid; moreover, the ratio of the two
segments is exactly the ratio of the volumes of the two halfspaces.
Lemma 9.1. For any measurable bounded set S in Rn and any halfspace H with bounding hyperplane
containing the centroid of S, we have

vol(S ∩ H) vol(S ∩ H)
centroid(S) = centroid(S ∩ H) + centroid(S ∩ H).
vol(S) vol(S)
The following lemma has a proof similar to that of the Grunbaum theorem.
Lemma 9.2. Let K ⊆ Rn be a convex body with centroid at the origin. Suppose that for some unit vector
θ, the support of K along θ is [a, b]. Then,
b
. |a| ≥
n
Exercise 9.3. Prove Lemma 9.2. [Hint: Use Theorem 1.27.]
Using the above property, we can show that the algorithm reaches a cuboid in a small number of iterations.
Theorem 9.4. Let K be a convex body in Rn containing a cube of side length r around its centroid and
contained in a cube of side length R. Algorithm CuttingPlaneVolume correctly computes the volume of K
nR
using O(n log r ) centroid computations.

120
9.2. Optimization from Membership via Sampling 121

Proof. By Lemma 3.14, at each iteration, the volume of the remaining set K (k) decreases by a factor of
at most (1 − 1e ). When the directional width along any axis is less than r/2 (namely maxx∈K e⊤ ≤ r ),

i x 2
the algorithm stops cutting along that axis. So, just before the last cut along any axis, the width in that
direction is at least r/2. Then, we use the center of gravity to cut. By Lemma 9.2, the directional width
along every axis of the surviving set is at least r/2(n + 1). Since the set always contains the origin, and the
original set contains a cube of side length r, when the algorithm stops, it must be an axis-parallel cuboid
with each side of length in the range [r/2(n + 1), r/2]. So the nal volume is at least (r/2(n + 1))n . The
initial volume is at most Rn . Therefore the number of iterations is at most
Rn
 
log(1− 1e ) = O (n log(nR/r)) .
(r/2(n + 1))n
In each iteration, by Lemma 9.1, the algorithm maintains the ratio of the volume of the original K to the
current K (k) .
The above algorithm shows that computing the volume is polytime reducible to computing the centroid.
Since volume is known to be #P-hard for explicit polytopes, this means that centroid computation is also
#P-hard for polytopes [62]. In later chapters we will see randomized polytime algorithms for sampling and
hence for approximating centroid and volume.
Exercise 9.5. [12] Given a partial order P on an n-element set, it is of interest to count the number of linear
extensions, i.e., total orders on the set that are consistent with P . We can dene an associated polyhedron,
Q = {x ∈ [0, 1]n : xi < xj if (i, j) ∈ P } .
Show that the number of linear extensions of P is exactly vol(Q)n!.

9.2 Optimization from Membership via Sampling


In this chapter, we assumed the oracle can be computed exactly. However, it is still open to determine the
best runtime for reductions between noisy oracles, where the oracle provides answers to within some bounded
error. In particular, the question about minimizing (approximately) convex functions under noisy oracles is
an active research area, and the gaps between the lower bound and upper bound for many problems are still
quite large ([8, 24, 13] based on [31]). We will see these methods in detail later. For now, to put them in the
context of oracles, we just discuss the following theorem.
Theorem 9.6 (OPT via sampling).
T
Let F (x) = e−αc x
1K (x) for some convex set K in Rn and vector
c ∈ Rn . Suppose minx∈K cT x is bounded. Let x be a random sample from the distribution with density
proportional to F (x). Then,
n
E cT x ≤ min cT x + .

K α
Proof. We will show that the worst case is when the convex set K is an innite cone. WLOG assume that the
minimum is at x = 0. Replace every cross-section of K along c with an (n − 1)-dimensional ball of the same
volume as the cross-section. This does not aect the expectation. Suppose the expectation is E(cT x) = a.
Next, replace the subset of K with cT x ≤ a with a cone whose base is the cross-section at a and apex is
at zero. This only makes the expectation larger. Next, replace the subset of K on the right of a with the
innite conical extension of the cone to the left of a. Again, this expectation can only be higher.
Now for this innite cone, we compute, using y = cT x,
R ∞ −αy n−1
T ye y dy
E c x = R0 ∞ −αy n−1

0
e y dy
R ∞ −y n
1 e y dy
= R ∞0 −y n−1
α 0 e y dy
1 n! n
= = .
α (n − 1)! α
9.2. Optimization from Membership via Sampling 122

T
The theorem says that if we sample according to e−αc x for α = n/ε, we will get an ε-approximation to
the optimum. However, sampling from such a density is not trivial. Instead, we will have to go through a
sequence of overlapping distributions, starting with one that is easy to sample and ending with a distribution
that is focused close to the minimum. This method is known as simulated annearling and is the subject
of Chapter 11. The complexity of sampling is polynomial in the dimension and logarithmic in a suitable
notion of probabilistic distance between the starting distribution and the target distribution. The sampling
algorithm only uses a membership (EVAL) oracle.

Exercise 9.7. Extend Theorem 9.6 by replacing cT x with any convex function f (x).
Open Problem. Given an approximately convex function F on unit ball such that max∥x∥2 ≤1 |f (x) −
F (x)| ≤ ε/n for some convex function f , how eciently can we nd x in the unit ball such that F (x) ≤
min∥x∥2 ≤1 F (x) + O(ε)? The current fastest algorithm takes O(n4 logO(1) (n/ε)) calls to the noisy EVAL
oracle for F .
Chapter 10

Geometrization

In this chapter, we study polynomial-time sampling algorithms.

The Ball Walk


The simplest continuous algorithm for exploring space is Brownian motion with the ODE dxt = dWt . To
turn this into a sampling algorithm, for a convex function f , we saw an extension using the gradient of f ,
namely √
dxt = −∇f (xt ) + 2dWt
which can be used to sample according to the density proportional to e−f . In this chapter we will begin
with an even simpler method, which does not need access to the gradient or assume dierentiability, only an
oracle that can evaluate F .
Algorithm 27: BallWalk
Input: Step-size δ , number of steps T , starting point x0 in the support of target density Q.
Repeat T times: at a point x,
1. Pick a random point y in the nδ -ball centered
o at x.
Q(y)
2. Go to y with probability min 1, Q(x) .
return x.
Exercise 10.1. Show that the distribution with density Q is stationary for the ball walk in a connected,
compact full-dimensional set, i.e., if the distribution of the current point x has density Q, it remains Q.
Under mild conditions, the distribution of the current point approaches the target density Q. The main
question is the rate of convergence, which would allow us to bound the number of steps. Note that each
step involves only a function evaluation, to an oracle that outputs the value of a function proportional to
the desired density. To bound the rate of convergence (and as a result the uniqueness of the stationary
distribution), we rst develop some general tools.

10.1 Basics of Markov chains


For more detailed reading, including additional properties, see Section 1 of [52].
A Markov chain is dened using a σ -algebra (K, A), where K is the state space and A is a set of subsets of
K that is closed under complements and countable unions. For each element u of K , we have a probability
measure Pu on (K, A), i.e., each set A ∈ A has a probability Pu (A). Informally, Pu is the distribution
obtained upon taking one step from u. The triple (K, A, {Pu : u ∈ K}) along with a starting distribution
Q0 denes a Markov chain, i.e., a sequence of elements of K , w0 , w1 , . . ., where w0 is chosen from Q0 and
each subsequent wi is chosen from Pwi−1 . The choice of wi+1 depends only on wi and is independent of
w0 , . . . , wi−1 .
A distribution Q on (K, A) is called stationary if taking one step from it maintains the distribution, i.e.,
for any A ∈ A, Z
Pu (A) dQ(u) = Q(A).
K

A distribution Q is atom-free if there is no x ∈ K with Q(x) > 0.

123
10.1. Basics of Markov chains 124

Example. For the ball walk in a convex body, the state space K is the convex body, and A is the set of all
measurable subsets of K . The next step distribution is

vol (K ∩ (u + δBn ))
Pu ({u}) = 1 −
vol(δBn )

and for any measurable subset A,

vol (A ∩ (u + δBn ))
Pu (A) = + 1u∈A (u)Pu ({u})
vol(δBn )
vol(A)
The uniform distribution is stationary, i.e., Q(A) = vol(K) .

The ergodic ow of a subset A w.r.t. the distribution Q is dened as


Z
Φ(A) = Pu (K \ A) dQ(u).
A

A distribution Q is stationary if and only if Φ(A) = Φ(K \ A). The existence and uniqueness of the
stationary distribution Q for general Markov chains is a subject on its own. One way to ensure uniqueness
of a stationary distribution is to use lazy Markov chains. In a lazy version of a given Markov chain, at each
step, with probability 1/2, we do nothing; with the rest we take a step according to the Markov chain. The
next theorem is folklore.

Exercise 10.2. If Q is stationary w.r.t. a lazy ergodic Markov chain, then it is the unique stationary
distribution for that Markov chain.

Informally, the mixing rate of a random walk is the number of steps required to reduce some measure of
the distance of the current distribution to the stationary distribution by a constant factor. The following
notions will be useful for comparing two distributions P, Q.
1. Total variation distance is dtv (P, Q) = supA∈A |P (A) − Q(A)|.
2. L2 or χ2 -distance of P with respect to Q is
Z  2 Z  2 Z
2 dP (u) dP (u) dP (u)
χ (P, Q) = − 1 dQ(u) = dQ(u) − 1 = dP (u) − 1.
K dQ(u) K dQ(u) K dQ(u)

P (A)
3. Warmth: P is said to be M -warm w.r.t. Q if M = supA∈A Q(A) .

Convergence via Conductance


Now we introduce an important tool to bound the rate of convergence of Qt ,the distribution after t steps to
Q. Assume that Q is the unique stationary distribution. The conductance of a subset A is dened as

Φ(A)
ϕ(A) =
min{Q(A), Q(K \ A)}

and the conductance of the Markov chain is


R
A
Pu (K \ A) dQ(u)
ϕ = min ϕ(A) = min .
A 0<Q(A)≤ 12 Q(A)

The local conductance of an element u is ℓ(u) = 1 − Pu ({u}).


For any 0 ≤ s < 21 , the s-conductance of a Markov chain is dened as

ϕ(A)
ϕs = min .
A:s<Q(A)≤ 12 Q(A) − s
10.1. Basics of Markov chains 125

Ideally we would like to show that d(Qt , Q), the distance between the distribution after t steps and the
target Q is monotonically (and rapidly) decreasing. We consider

sup Qt (A) − Q(A)


A:Q(A)=x

for each x ∈ [0, 1]. To prove inductively that this quantity decreases, Let Gx be the set of functions dened
as  Z 
Gx = g : K → [0, 1] : g(u) dQ(u) = x .
u∈K

Using this, dene


Z Z
ht (x) = sup g(u) (dQt (u) − dQ(u)) = sup g(u) dQt (u) − x.
g∈Gx u∈K g∈Gx u∈K

This function is strongly concave.


Exercise 10.3. Show that the function ht is concave, and if Q is atom-free , then ht (x) = supA:Q(A)=x Qt (A)−
Q(A) and the supremum is achieved by some subset.
If the target density Q has atoms, i.e., points of positive probability, then the function g achieving h
might have a fractional value at one atom.
Lemma 10.4. Let Q be atom-free and t ≥ 1. For any 0 ≤ x ≤ 1, let y = min{x, 1 − x}. Then,

1 1
ht (x) ≤ ht−1 (x − 2ϕy) + ht−1 (x + 2ϕy).
2 2
Proof. Assume that 0 ≤ x ≤ 21 . We construct two functions, g1 and g2 , and use these to bound ht (x). Let
A be a subset that achieves ht (x). Dene
( (
2Pu (A) − 1 if u ∈ A, 1 if u ∈ A,
g1 (u) = and g2 (u) =
0 if u ∈
/ A, 2Pu (A) if u ∈ / A.

Note that 21 (g1 + g2 )(u) = Pu (A) for all u ∈ K , which means that
Z Z Z
1 1
g1 (u) dQt−1 (u) + g2 (u) dQt−1 (u) = Pu (A) dQt−1 (u) = Qt (A).
2 u∈K 2 u∈K u∈K

Since the walk is lazy, Pu (A) ≥ 21 i u ∈ A, the range of the functions g1 , g2 is [0, 1]. We let
Z Z
x1 = g1 (u) dQ(u) and x2 = g2 (u) dQ(u),
u∈K u∈K

then g1 ∈ Gx1 and g2 ∈ Gx2 . Moreover,


Z Z Z
1 1 1
(x1 + x2 ) = g1 (u) dQ(u) + g2 (u) dQ(u) = Pu (A) dQ(u) = Q(A) = x.
2 2 u∈K 2 u∈K u∈K

since Q is stationary.

ht (x) = Qt (A) − Q(A)


Z Z
1 1
= g1 (u) dQt−1 (u) + g2 (u) dQt−1 (u) − Q(A)
2 u∈K 2 u∈K
Z Z
1 1
= g1 (u) (dQt−1 (u) − dQ(u)) + g2 (u) (dQt−1 (u) − dQ(u))
2 u∈K 2 u∈K
1 1
≤ ht−1 (x1 ) + ht−1 (x2 ).
2 2
10.1. Basics of Markov chains 126

Next,
Z
x1 = g1 (u) dQ(u)
u∈K
Z Z
= 2 Pu (A) dQ(u) − dQ(u)
u∈A u∈A
Z
= (1 − Pu (K \ A)) dQ(u) − x
2
u∈A
Z
= x−2 Pu (K \ A) dQ(u)
u∈A
= x − 2Φ(A)
≤ x − 2ϕx
= x(1 − 2ϕ).

Thus we have, x1 ≤ x(1 − 2ϕ) ≤ x ≤ x(1 + 2ϕ) ≤ x2 . Since ht−1 is concave, the chord from x1 to x2 on ht−1
lies below the chord from [x(1 − 2ϕ), x(1 + 2ϕ)]. Therefore,
1 1
ht (x) ≤ ht−1 (x(1 − 2ϕ)) + ht−1 (x(1 + 2ϕ)).
2 2

A proof along the same lines implies the following generalization.


Lemma 10.5. Let Q be atom-free and 0 ≤ s ≤ 1. For any s ≤ x ≤ 1 − s, let y = min{x − s, 1 − x − s}.
Then for any integer t > 0,
1 1
ht (x) ≤ ht−1 (x − 2ϕs y) + ht−1 (x + 2ϕs y).
2 2
These results can be extended to the case when Q has atoms with slightly weaker bounds [52].
Theorem 10.6. Let 0≤s≤1 and C0 and C1 be such that
√ √
h0 (x) ≤ C0 + C1 min{ x − s, 1 − x − s}.

Then
t
√ √ ϕ2s

ht (x) ≤ C0 + C1 min{ x − s, 1 − x − s} 1 − .
2
The proof is by induction on t.
Corollary 10.7. We have
1. Let M = supA Q0 (A)/Q(A). Then,

t
√ ϕ2

dT V (Qt , Q) ≤ M 1− .
2
1
2. Let 0<s≤ 2 and Hs = sup{|Q0 (A) − Q(A)| : Q(A) ≤ s}. Then,

t
ϕ2

Hs
dT V (Qt , Q) ≤ Hs + 1− s .
s 2

3. Let M = χ2 (Q0 , Q). Then for any ε > 0,


r t
ϕ2

M
dT V (Qt , Q) ≤ ε + 1− .
ε 2
10.2. Conductance of the Ball Walk 127

Convergence via Log-Sobolev


For a warm start, the convergence rate established by conductance is asymptotically optimal in many cases
of interest, including the ball walk for convex body. However, when the starting distribution is more focused,
e.g., a single point, then there is a signicant starting penalty usually a factor of the dimension or larger.
One way to avoid this is to observe that the conductance of smaller subsets is in fact even higher and that
one does not need to pay this large starting penalty. A classical technique in this regard is the log-Sobolev
constant. For a Markov chain with stationary density Q and transition operator P , we can dene it as
follows. R 2
x,y∈K
(f (x) − f (y)) P (x, y)dQ(x)
ρ= inf R .
g:smooth, g(x)2 dQ(x) = 1 f (x)2 log f (x)2 dQ(x)
R

This parameter allows us to show convergence of the current distribution to the target in relative entropy.
Recall that the relative entropy of a distribution P with respect to a distribution Q is
Z
P (x)
HQ (P ) = P (x) log dQ(x).
K Q(x)

Theorem 10.8. For a Markov chain with distribution Qt at time t, and log-Sobolev parameter ρ, we have

HQ (Qt ) ≤ e−2ρt HQ (Q0 ).

10.2 Conductance of the Ball Walk


In the section we bound the conductance of the ball walk when applied to the indicator function of a convex
body. At rst glance, the ball walk is not an ecient algorithm, even for uniformly sampling a convex body.
The reason is simply that the local conductance could be exponentially small (consider a point close to the
vertex of a polyhedron). We can get around this in two ways. The rst, which is simpler, but less ecient
is to smoothen the convex body by taking the Minkowski sum with a small Euclidean ball, i.e., replace K
with K + αB n .

Exercise 10.9. Let K be a convex


√ body in Rn containing the unit ball. Show that (a) vol(K + αB n ) ≤
(1 + α)n K and (b) with δ = α/ n the local conductance of every point in K + αB n is at least an absolute
constant.
Using the exercise, it suces to set α = 1/n, so that a sample from K + αB n has a large probability of
being in K and then δ = 1/n3/2 .
The second approach is to show that the local conductance is in fact large almost everywhere, and if the
starting distribution is warm
√ then these points can eectively be ignored. This will allow us to make δ
much larger, namely δ = 1/ n. Larger step sizes should allows us to prove faster mixing.

To convey the main ideas of the analysis, we focus on the rst approach here. The goal is to show that the
conductance of any subset is large, i.e., the probability of crossing over in one step is at least proportional to
the measure of the set or its complement, whichever is smaller. First, we argue that the one-step distributions
of two points will have a signicant overlap if the points are sucient close.

Lemma 10.10 (One-step overlap). Let u, v ∈ K s.t. ℓ(u), ℓ(v) ≥ ℓ and ∥u − v∥ ≤ tδ



n
. Then the one-step
distributions from them satisfy dT V (Pu , Pv ) ≤ 1 + t − ℓ.
Proof. First assume that ℓ = 1. This means that the balls of radius δ centered at u and v are fully contained in
K . The TV distance between these distributions is bounded by vol(u+δB n \v+δB n ) = vol(v+δB n \u+δB n )
divided by vol(δB n ). This is exactly the volume of a band of thickness ∥u − v∥ centered at the center of
δB n (see Fig. ). This has relative volume at most t, proving the lemma under the assumption that ℓ = 1.
For the general case, we note that when the balls are intersected with a convex body, the increase in the TV
distance is at most the probability that there is no proper move at u (or v ), i.e., 1 − ℓ. This completes the
proof.
10.2. Conductance of the Ball Walk 128
10.2. Conductance of the Ball Walk 129

Figure 10.1: One-step overlap fully contained in K

Figure 10.2: One-step overlap intersected by K


10.2. Conductance of the Ball Walk 130

Setting t = ℓ/2, this says that if the total variation distance between the one-step distributions from u, v
is greater than 1 − ℓ/2, then the distance between them is at least 2√ ℓδ
n
. What this eectively says is that
points close to the internal boundary of a subset are likely to cross over to the other side. To complete a proof
we would need to show that the internal boundary of any subset is large if the subset (or its complement) is
large, a purely geometric property.
Theorem 10.11 (Isoperimetry). Let S1 , S2 , S3 be a partition of a convex body K of diameter D. Then,

2
vol(S3 ) ≥ d(S1 , S2 ) min {vol(S1 ), vol(S2 )} .
D
This can be generalized to any logconcave measure. We will discuss this and other extensions in detail
later. But rst we bound the conductance.
Theorem 10.12. Let K be a convex body in Rn D containing the unit ball and with every u ∈ K
of diameter
having ℓ(u) ≥ ℓ. Then the conductance of the ball walk on K with step size δ is
 2 
ℓ δ
Ω √ .
nD

Proof. Let K = S1 ∪ S2 be a partition into measurable sets. We will prove that

ℓ2 δ
Z
Px (S2 ) dx ≥ √ min{vol(S1 ), vol(S2 )} (10.1)
S1 16 nD

Note that since the uniform distribution is stationary,


Z Z
Px (S2 ) dx = Px (S1 ) dx.
S1 S2

Consider the points that are deep inside these sets, i.e., unlikely to jump out of the set:
   
ℓ ℓ

S1 = x ∈ S1 : Px (S2 ) < and S2 = x ∈ S2 : Px (S1 ) <

.
4 4

Let S3′ be the rest i.e., S3′ = K \ S1′ \ S2′ .


Suppose vol(S1′ ) < vol(S1 )/2. Then
Z
ℓ ℓ
Px (S2 ) dx ≥ vol(S1 \ S1′ ) ≥ vol(S1 )
S1 4 8

which proves (10.1).


So we can assume that vol(S1′ ) ≥ vol(S1 )/2 and similarly vol(S2′ ) ≥ vol(S2 )/2. Now, for any u ∈ S1′ and
v ∈ S2′ ,

||Pu − Pv ||tv ≥ 1 − Pu (S2 ) − Pv (S1 ) > 1 − .
2
Applying Lemma 10.10 with t = ℓ/2, we get that

ℓδ
|u − v| ≥ √ .
2 n

Thus d(S1 , S2 ) ≥ ℓδ/2 n. Applying Theorem 10.11 to the partition S1′ , S2′ , S3′ , we have

ℓδ
vol(S3′ ) ≥ √ min{vol(S1′ ), vol(S2′ )}
nD
ℓδ
≥ √ min{vol(S1 ), vol(S2 )}.
2 nD
10.2. Conductance of the Ball Walk 131

We can now prove (10.1) as follows:


Z Z Z
1 1
Px (S2 ) dx = Px (S2 ) dx + Px (S1 ) dx
S1 2 S1 2 S2
1 ℓ
≥ vol(S3′ )
2 4
ℓ2 δ
≥ √ min{vol(S1 ), vol(S2 )}.
16 nD

Corollary 10.13. The ball walk in a convex body with local conductance at least ℓ everywhere has mixing
rate O(nD2 δ 2 /ℓ4 ).
Using the construction above of adding a small ball to every point of K gives a lower bound of δ = 1/n3/2
and ℓ = Ω(1) and thus a polynomial bound of O(n4 D2 ) on the mixing time. As we will see presently, this can
be improved to n2 D2 by avoiding the blow-up, and analyzing the average local conductance. The example
of starting near a corner (say of a hypercube) shows that this cannot work in general; however, from a warm
start, it will suce to bound the average local conductance rather than the minimum.

Warm Start and s-Conductance


In this section we give a better bound for the conductance of suciently large subsets resulting in the
following bound on the mixing rate of the ball walk.

Theorem 10.14. From a warm start, the ball walk in a convex body of diameter D containing a unit ball
has a mixing rate of O(n2 D2 ) steps.

This is based on two ideas: (1)√most points of a convex body containing a unit ball have large local
conductance and we can use δ = 1/ n instead of 1/n3/2 , (2) the s-conductance is large and hence the walk
mixes from a suitably warm start.

Lemma
 10.15. Let K be a convex body containing a unit ball. For the√ ball walk with δ step size, let
Kδ = u ∈ K : ℓ(u) ≥ 34 . Then Kδ is a convex set and vol(Kδ ) ≥ (1 − 2δ n)vol(K).

Exercise 10.16. Prove the rst part of the previous lemma.


The proof of the above lemma will use the following bound on crossing the boundary of K .

Lemma 10.17. Let L be any measurable subset of the boundary of a convex body K and SL = {(x, y) : x ∈ K, y ̸∈ K, ∥x − y∥ ≤
Then we have
δ vol(B n−1 )
vol2n (SL ) ≤ voln−1 (L)vol(δB n ).
(n + 1) vol(B n )
Proof. It suces to consider the case when L is innitesimally small; and then we can assume that the
surface of K is locally a hyperplane and compute the measure of SL explicitly.

Theorem 10.18. The s-conductance of the ball walk with δ= s



4 n
step size in a convex body K of diameter
D and containing the unit ball satises
s
ϕs ≳ .
nD
This proof is similar to that of Theorem 10.12, with one important extension. Rather than applying the
argument to a partition of the original convex body K , we restrict the argument to Kδ ,the points in K that
have high local conductance. Since this subset takes up most of K , and is convex, we will able to use its
isoperimetry to lower bound the conductance.
10.2. Conductance of the Ball Walk 132

Figure 10.3: Partitions of convex body


10.2. Conductance of the Ball Walk 133

Proof. As before, we consider the following partition of K . Let K = S1 ∪ S2 be a partition into measurable
sets. We will prove that

Z
δ s s
Px (S2 ) dx ≥ √ min{vol(S1 ) − , vol(S2 ) − } (10.2)
S1 C nD 2 2
Since the uniform distribution is stationary,
Z Z
Px (S2 ) dx = Px (S1 ) dx.
S1 S2

Consider the points that are deep inside these sets:


   
1 1

S1 = x ∈ S1 : Px (S2 ) < and S2 = x ∈ S2 : Px (S1 ) <

.
8 8
Let S3′ be the rest i.e., S3′ = K \ S1′ \ S2′ . Recall that
 
3
Kδ = x ∈ K : ℓ(x) ≥
4

and dene Si′′ = Si′ ∩ Kδ . Note that by Lemma 10.15, for δ ≤ s/(4 n),

vol(Si′′ ) ≥ vol(Si′ ) − s.

Suppose vol(S1′′ ) < vol(S1 ∩ Kδ )/2. Then


Z
1 1 1  s
Px (S2 ) dx ≥ vol(S1 ∩ Kδ \ S1′′ ) ≥ vol(S1 ∩ Kδ ) ≥ vol(S1 ) −
S1 8 16 16 2

which proves (10.2).


So we can assume that vol(S1′′ ) ≥ vol(S1 ∩ Kδ )/2 and similarly vol(S2′′ ) ≥ vol(S2 ∩ Kδ )/2. Now, for any
u ∈ S1′′ and v ∈ S2′′ ,
1
dT V (Pu , Pv ) ≥ 1 − Pu (S2 ) − Pv (S1 ) > 1 − .
4
Applying Lemma 10.10 with t = 3/8, we get that

|u − v| ≥ √ .
8 n

Thus d(S1′′ , S2′′ ) ≥ 3δ/(8 n). Applying Theorem 10.11 to the partition S1′′ , S2′′ , S3′′ of Kδ , we have

vol(S3′′ ) ≥ √ min{vol(S1′′ ), vol(S2′′ )}
4 nD
3δ s s
≥ √ min{vol(S1 ) − , vol(S2 ) − }.
8 nD 2 2
We can now prove (10.2) as follows:
Z Z Z
1 1
Px (S2 ) dx = Px (S2 ) dx + Px (S1 ) dx
S1 2 S1 2 S2
1 1
≥ vol(S3′′ ) ·
2 8

≥ √ min{vol(S1 ), vol(S2 )}
128 nD
which implies that the conductance is at least 3s/(512nD).
The mixing rate then follows by applying Theorem 10.6 and Corollary 10.7.

Tightness of the bound


The mixing rate of O(n²D²) for the ball walk is in fact the best possible even from a warm start (with the
assumption of a unit ball inside the convex body and diameter D). To see this, consider a cylinder whose
cross-section is a unit ball and whose axis is [0, D] along e_1. Suppose the starting distribution is uniform in the part
of the cylinder in [0, D/3]. Then we claim that to reach the opposite third of the cylinder needs Ω(n²D²)
steps with high probability. Each step has length at most δ in a random direction, and this is about δ/√n
along e_1. Viewing this as an unbiased random walk along e_1, the effective diameter is D/(δ/√n) and hence
the number of steps to cross an interval of length D/3 is Ω(nD²/δ²) = Ω(n²D²).

Exercise 10.19. Prove the above claim rigorously.

Speedy walk
In the above analysis of the ball walk, the dependence on the error parameter ε, the distance to the target
distribution, is polynomial in 1/ε rather than its logarithm. The speedy walk is a way to improve the analysis.
In the speedy walk, at a point x, we sample the next step uniformly from (x + δB^n) ∩ K.
The resulting Markov chain is the subsequence of proper steps of the ball walk.

Exercise 10.20. Show that the stationary density of the speedy walk in a convex body is proportional to
the local conductance.

Theorem 10.21. The conductance of the speedy walk is Ω(1/nD).


To analyze the ball walk, we then need to show that the number of wasted steps is not too large. This
follows from the assumption of a warm start and Lemma 10.15.

10.3 Generating a warm start


To get the mixing rate of O(n²D²), we need a warm start, i.e., a distribution whose density at any point is
within a constant factor of the target density.
Algorithm 28: Warm Start
Input: membership oracle for K s.t. B^n ⊆ K ⊆ DB^n.
Let x be a random point in B^n. Define K_i = 2^{i/n} B^n ∩ K.
for i = 1, · · · , n log D do
    1. Use the ball walk from x to generate a random point y in K_i.
    2. Set x = y.
end
return x.
Since K_{i+1} ⊆ 2^{1/n} K_i, we have vol(K_{i+1}) ≤ 2 vol(K_i) and hence a 2-warm start is maintained. Once we
have a random point from K, subsequent random points can be generated by simply continuing the ball
walk; thus the cost of the warm start is only for the first sample.
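The bootstrapping structure is easy to express in code. Below is a minimal numpy sketch of Algorithm 28, assuming only a membership oracle in_K for K; the step size and the number of ball-walk steps per phase are illustrative placeholders rather than the tuned values from the analysis.

```python
import numpy as np

def uniform_in_ball(n, rng):
    # uniform random point in the unit ball B^n
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)
    return u * rng.random() ** (1.0 / n)

def ball_walk(x, radius, in_K, delta, steps, rng):
    # ball walk restricted to K_i = radius * B^n intersected with K
    for _ in range(steps):
        y = x + delta * uniform_in_ball(len(x), rng)
        if np.linalg.norm(y) <= radius and in_K(y):
            x = y  # proper step; otherwise stay put (improper step)
    return x

def warm_start(in_K, n, D, steps_per_phase, rng):
    x = 0.5 * uniform_in_ball(n, rng)   # start well inside the unit ball
    delta = 1.0 / np.sqrt(n)            # step size suggested by the analysis above
    for i in range(1, int(np.ceil(n * np.log2(D))) + 1):
        x = ball_walk(x, 2 ** (i / n), in_K, delta, steps_per_phase, rng)
    return x
```

Each phase starts from the output of the previous one, which is exactly the 2-warm start argument above.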

10.4 Isotropic Transformation


The complexity of sampling with the ball walk is polynomial in n, D and log(1/ε) to get within ε of the
target density. This is not a polynomial algorithm since the dependence is on D and not log D. To get a
polynomial algorithm, we need one more ingredient.
We say that a distribution Q is isotropic if E_Q(x) = 0 and E_Q(xx^T) = I, i.e., the mean is zero and the
covariance matrix (exists and) is the identity. We say that the distribution is C-isotropic if the eigenvalues
of its covariance matrix are in [1/C, C]. An affine transformation is said to be an isotropic transformation if
the resulting distribution is isotropic.

Any distribution with bounded second moments has an isotropic transformation. It is clear that satisfying
the first condition is merely a translation, so assume the mean is zero. For the second, suppose the covariance
matrix is E_Q(xx^T) = A. Then consider y = A^{−1/2}x. It is easy to see that

    E(yy^T) = A^{−1/2} E(xx^T) A^{−1/2} = I.
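In code, the transformation is just whitening by the empirical mean and covariance. A small numpy illustration follows; the sample data here is arbitrary and serves only as a sanity check.

```python
import numpy as np

def isotropic_transform(X):
    """X: (N, n) array of samples; returns the map x -> A^{-1/2}(x - mean)."""
    mu = X.mean(axis=0)
    A = np.cov(X, rowvar=False)                 # empirical covariance matrix
    w, V = np.linalg.eigh(A)                    # A = V diag(w) V^T, w > 0
    A_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T   # A^{-1/2}
    return lambda x: A_inv_sqrt @ (x - mu)

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 3)) @ np.diag([3.0, 1.0, 0.2]) + np.array([1.0, -2.0, 0.5])
T = isotropic_transform(X)
Y = np.apply_along_axis(T, 1, X)
print(np.round(np.cov(Y, rowvar=False), 2))     # approximately the identity
```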
For convex bodies, isotropic position comes with a strong guarantee.

Theorem 10.22. For a convex body in isotropic position (i.e., the uniform distribution over the body is
isotropic), we have

    √((n + 2)/n) B^n ⊆ K ⊆ √(n(n + 2)) B^n.
Thus the effective diameter is O(n). If we could place a convex body in isotropic position before sampling,
we would have a poly(n) algorithm. In fact, it is even better than this as most points are within distance
O(√n) of the center of gravity. We quote a theorem due to Paouris.

Theorem 10.23. For an isotropic logconcave density p in R^n and any t ≥ 1,

    Pr_p(∥x∥ ≥ ct√n) ≤ e^{−t√n}.

How do we compute an isotropic transformation? From the definition, all we need is to estimate the
covariance matrix, which can be done from random samples. Thus, if we can sample from K, we can compute
an isotropic transformation for it. This appears cyclic: we need isotropy for efficient sampling and efficient
sampling for isotropy. The solution is simply to bootstrap them.
Algorithm 29: IsotropicTransform
Input: membership oracle for K s.t. B^n ⊆ K ⊆ DB^n.
Let x be a random point in B^n, A = I and K_i = 2^{i/n} B^n ∩ K.
for i = 1, · · · , n log D do
    1. Use the ball walk from x to generate N random points x_1, . . . , x_N in AK_i.
    2. Compute C = (1/N) ∑_{j=1}^N x_j x_j^T and set A = C^{−1/2} A.
    3. Set x = x_N.
end
return x.
We will choose N large enough so that after the transformation K_i is 2-isotropic and therefore K_{i+1} is
6-isotropic. We can bound N as follows.
Exercise 10.24. Show that if K is isotropic, then with N = O(n²), the matrix A = (1/N) ∑_{i=1}^N x_i x_i^T for N
random samples from K satisfies ∥A − I∥_op ≤ 0.5.
A tight bound on the sample complexity was established by [1] (see also [11, 63, 68]).

Theorem 10.25. For an isotropic logconcave distribution Q in R^n, the empirical covariance matrix A of
N = O(n) random samples satisfies ∥A − I∥_op ≤ 0.5 with high probability.
Thus the overall algorithm needs O(n log D) phases, with O(n) samples in each phase from a near-isotropic
distribution, and thus poly(n) steps per sample.

10.5 Isoperimetry via localization


Theorem 10.11 was refined by KLS [32] as follows (we state it here for logconcave densities).

Theorem 10.26. For any partition S1, S2, S3 of R^n and any logconcave measure µ on R^n,

    µ(S3) ≥ (ln 2 / E_µ(∥x − x̄∥)) d(S1, S2) min{µ(S1), µ(S2)}.


Thus, for a (near-)isotropic distribution, the diameter can be replaced by O( n) and this gives a bound
of O(n3 ) from a warm start. One way to summarize the analysis so far is that the complexity of sampling a
convex body (and in fact a logconcave density) from a warm start is O∗ (n2 /ψ 2 ) where ψ is the isoperimetric
ratio of the convex body. In other words, the expansion of the Markov chain reduces to the expansion of
the target logconcave density. It then becomes a natural question to nd the best possible estimate for the
isoperimetric ratio. KLS also provided a conjecture for this.
Conjecture 10.27. The isoperimetric ratio of any isotropic logconcave density in Rn is Ω(1).
The bound of the conjecture holds for all halfspace induced subsets. So the conjecture says that the
worst isoperimetry is achieved up to a constant factor by a halfspace (this version does not need isotropic
position). Here we discuss a powerful technique for proving such inequalities.
Classical proofs of isoperimetry for special distributions are based on different types of symmetrization
that effectively identify the extremal subsets. Bounding the Cheeger constant for general convex bodies
and logconcave densities is more complicated since the extremal sets can be nonlinear and hard to describe
precisely, due to the trade-off between minimizing the boundary measure of a subset and utilizing as much
of the external boundary as possible. The main technique to prove bounds in the general setting has been
localization, a method to reduce inequalities in high dimension to inequalities in one dimension. We now
describe this technique with a few applications.

10.5.1 Localization
We will sketch a proof of the following theorem to illustrate the use of localization. This theorem was also
proved by Karzanov and Khachiyan [35] using a different, more direct approach.
Theorem 10.28 ([22, 49, 35]). Let f be a logconcave function whose support has diameter D and let π_f be
the induced measure. Then for any partition of R^n into measurable sets S1, S2, S3,

    π_f(S3) ≥ (2d(S1, S2)/D) min{π_f(S1), π_f(S2)}.
Before discussing the proof, we note that there is a variant of this result in the Riemannian setting.
Theorem 10.29 ([46]). If K ⊂ (M, g) is a locally convex bounded domain with smooth boundary, diameter
D and Ric_g ≥ 0, then the Poincaré constant is at least π²/(4D²), i.e., for any function g with ∫ g = 0, we have that

    ∫ |∇g(x)|² dx ≥ (π²/(4D²)) ∫ g(x)² dx.

For the case of convex bodies in R^n, this result is equivalent to Theorem 10.28 up to a constant. One
benefit of localization is that it does not require a carefully crafted potential. Localization has recently been
generalized to the Riemannian setting [39]. The origins of this method were in a paper by Payne and Weinberger
[61].
We begin the proof of Theorem 10.28. For a proof by contradiction, let us assume the contrary of its
conclusion, i.e., for some partition S1, S2, S3 of R^n and logconcave density f, assume that

    ∫_{S3} f(x) dx < C ∫_{S1} f(x) dx  and  ∫_{S3} f(x) dx < C ∫_{S2} f(x) dx

where C = 2d(S1, S2)/D. This can be reformulated as

    ∫_{R^n} g(x) dx > 0  and  ∫_{R^n} h(x) dx > 0    (10.3)

where

    g(x) = { Cf(x) if x ∈ S1;  0 if x ∈ S2;  −f(x) if x ∈ S3 }

and

    h(x) = { 0 if x ∈ S1;  Cf(x) if x ∈ S2;  −f(x) if x ∈ S3 }.
These inequalities are for functions in Rn . The next lemma will help us analyze them.

Lemma 10.30 (Localization Lemma [32]). Let g, h : R^n → R be lower semi-continuous integrable functions
such that

    ∫_{R^n} g(x) dx > 0  and  ∫_{R^n} h(x) dx > 0.

Then there exist two points a, b ∈ R^n and an affine function ℓ : [0, 1] → R_+ such that

    ∫_0^1 ℓ(t)^{n−1} g((1 − t)a + tb) dt > 0  and  ∫_0^1 ℓ(t)^{n−1} h((1 − t)a + tb) dt > 0.

The points a, b represent an interval and one may think of ℓ(t)^{n−1} as proportional to the cross-sectional
area of an infinitesimal cone. The lemma says that over this cone truncated at a and b, the integrals of g
and h are positive. Also, without loss of generality, we can assume that a, b are in the union of the supports
of g and h.

Proof outline. The main idea is the following. Let H be any halfspace such that

    ∫_H g(x) dx = (1/2) ∫_{R^n} g(x) dx.

Let us call this a bisecting halfspace. Now either

    ∫_H h(x) dx > 0  or  ∫_{R^n \ H} h(x) dx > 0.

Thus, either H or its complementary halfspace will have positive integrals for both g and h, reducing the
domain of the integrals from R^n to a halfspace. If we could repeat this, we might hope to reduce the
dimensionality of the domain. For any (n − 2)-dimensional affine subspace L, there is a bisecting halfspace
containing L in its bounding hyperplane. To see this, let H be a halfspace containing L in its boundary.
Rotating H about L we get a family of halfspaces with the same property. This family includes H′, the
complementary halfspace of H. The function ∫_H g − ∫_{R^n \ H} g switches sign from H to H′. Since this is a
continuous family, there must be a halfspace for which the function is zero.
If we take all (n − 2)-dimensional affine subspaces defined by {x ∈ R^n : x_i = r_1, x_j = r_2} where r_1, r_2 are
rational, then the intersection of all the corresponding bisecting halfspaces is a line or a point (by choosing
only rational values for x_i, we are considering a countable intersection). To see why it is a line or a point,
assume we are left with a set of dimension two or higher. Since the intersection is convex, there is a point
in its interior with at least two coordinates that are rational, say x_1 = r_1 and x_2 = r_2. But then there is a
bisecting halfspace H that contains the affine subspace given by x_1 = r_1, x_2 = r_2 in its boundary, and so it
properly partitions the current set.
Thus the limit of this bisection process is a function supported on an interval (which could be a single
point), and since the function itself is a limit of convex sets (intersections of halfspaces) containing this interval,
it is a limit of a sequence of concave functions and is itself concave, with positive integrals. Simplifying
further from concave to linear takes quite a bit of work. For the full proof, we refer the reader to [50].

Going back to the proof sketch of Theorem 10.28, we can apply the localization lemma to get an interval
[a, b] and an affine function ℓ such that

    ∫_0^1 ℓ(t)^{n−1} g((1 − t)a + tb) dt > 0  and  ∫_0^1 ℓ(t)^{n−1} h((1 − t)a + tb) dt > 0.    (10.4)

The functions g, h as we have defined them are not lower semi-continuous. However, this can be addressed
by expanding S1 and S2 slightly so as to make them open sets, and making the support of f an open set.
Since we are proving strict inequalities, these modifications do not affect the conclusion.
Let us partition [0, 1] into Z1 , Z2 , Z3 as follows:

Zi = {t ∈ [0, 1] : (1 − t)a + tb ∈ Si }.

Note that for any pair of points u ∈ Z1, v ∈ Z2, |u − v| ≥ d(S1, S2)/D. We can rewrite (10.4) as

    ∫_{Z3} ℓ(t)^{n−1} f((1 − t)a + tb) dt < C ∫_{Z1} ℓ(t)^{n−1} f((1 − t)a + tb) dt

and

    ∫_{Z3} ℓ(t)^{n−1} f((1 − t)a + tb) dt < C ∫_{Z2} ℓ(t)^{n−1} f((1 − t)a + tb) dt.

The functions f and ℓ(·)^{n−1} are both logconcave, so F(t) = ℓ(t)^{n−1} f((1 − t)a + tb) is also logconcave. We
get

    ∫_{Z3} F(t) dt < C min{∫_{Z1} F(t) dt, ∫_{Z2} F(t) dt}.    (10.5)

Now consider what Theorem 10.28 asserts for the function F(t) over the interval [0, 1] and the partition
Z1, Z2, Z3:

    ∫_{Z3} F(t) dt ≥ 2d(Z1, Z2) min{∫_{Z1} F(t) dt, ∫_{Z2} F(t) dt}.    (10.6)

We have substituted 1 for the diameter of the interval [0, 1]. Also, 2d(Z1, Z2) ≥ 2d(S1, S2)/D = C. Thus,
Theorem 10.28 applied to the function F(t) contradicts (10.5); hence to prove the theorem in general, it
suffices to prove it in the one-dimensional case. A combinatorial argument reduces this to the case when
each Z_i is a single interval. Proving the resulting inequality up to a factor of 2 is a simple exercise and uses
only the unimodality of F. The improvement to the tight bound requires one-dimensional logconcavity. This
completes the proof of Theorem 10.28.
The localization lemma has been used to prove a variety of isoperimetric inequalities. The next theorem is
a refinement of Theorem 10.28, replacing the diameter by the square root of the expected squared distance
of a random point from the mean. For an isotropic distribution this is an improvement from n to √n.
This theorem was proved by Kannan-Lovász-Simonovits in the same paper in which they proposed the KLS
conjecture.

Theorem 10.31 ([32]). For any logconcave density p in R^n with covariance matrix A, the KLS constant
satisfies

    ψ_p ≳ 1/√(tr(A)).
The next theorem shows that the KLS conjecture is true for an important family of distributions. The
proof is again by localization [19], and the one-dimensional inequality obtained is a Brascamp-Lieb Theorem.
We note that the same theorem can be obtained by other means [41, ?].

Theorem 10.32. Let h(x) = f(x)e^{−x^⊤Bx/2} / ∫ f(y)e^{−y^⊤By/2} dy where f : R^n → R_+ is an integrable logconcave
function and B is positive definite. Then h is logconcave and for any measurable subset S of R^n,

    h(∂S) / min{h(S), h(R^n \ S)} ≳ 1/∥B^{−1}∥_op^{1/2}.

In other words, the expansion of h is Ω(∥B^{−1}∥_op^{−1/2}).

The analysis of the Gaussian Cooling algorithm for volume computation [20] uses localization.
Next we mention an application to the anti-concentration of polynomials. This is a corollary of a more
general result by Carbery and Wright.

Theorem 10.33 ([14]). Let q be a degree d polynomial in R^n. Then for a convex body K ⊂ R^n of volume
1, any ϵ > 0, and x drawn uniformly from K,

    Pr_{x∼K}(|q(x)| ≤ ϵ max_K |q(x)|) ≲ d ϵ^{1/d}.

We conclude this section with a nice interpretation of the localization lemma by Fradelizi and Guedon.
They also give a version that extends localization to multiple inequalities.
Theorem 10.34 (Reformulated Localization Lemma [27]). Let K be a compact convex set in R^n and f be
an upper semi-continuous function. Let P_f be the set of logconcave distributions µ supported on K satisfying
∫ f dµ ≥ 0. The set of extreme points of conv P_f is exactly:
1. the Dirac measures at points x such that f(x) ≥ 0, or
2. the distributions ν satisfying
   (a) the density function is of the form e^ℓ with linear ℓ,
   (b) the support equals a segment [a, b] ⊂ K,
   (c) ∫ f dν = 0,
   (d) ∫_a^x f dν > 0 for all x ∈ (a, b), or ∫_x^b f dν > 0 for all x ∈ (a, b).
Since the maximum of a convex function is attained at an extreme point, this shows that one can
optimize max_{µ∈P_f} Φ(µ) for any convex Φ by checking only Dirac measures and log-affine densities.

10.6 Hit-and-Run
The ball walk does not mix rapidly from all starting points. While this hurdle can be overcome by starting
with a deep point and carefully maintaining a warm start, it is natural to ask if there is a simple process
that does truly mix rapidly from any starting point. Hit-and-Run satisfies this requirement.
Algorithm 30: Hit-and-Run
Input: starting point x_0 in a convex body K.
Repeat T times: at current point x,
    1. Pick a uniform random direction ℓ through x.
    2. Go to a uniform random point y on the chord of K induced by ℓ.
return x.
Since hit-and-run is a symmetric Markov chain, the uniform distribution on K is stationary for it.
To sample from a general density proportional to f (x), in Step 2, we sample y according to the density
proportional to f restricted to the random line ℓ.
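To make the step concrete, here is a minimal numpy sketch of one Hit-and-Run step for the uniform distribution, assuming only a membership oracle in_K for K; locating the chord endpoints by doubling plus bisection is an implementation choice, not part of the algorithm's specification above.

```python
import numpy as np

def chord_end(x, d, in_K, tol=1e-8):
    # largest t >= 0 with x + t*d still in K, found by doubling then bisection
    hi = 1.0
    while in_K(x + hi * d):
        hi *= 2.0
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if in_K(x + mid * d) else (lo, mid)
    return lo

def hit_and_run_step(x, in_K, rng):
    d = rng.standard_normal(len(x))
    d /= np.linalg.norm(d)                     # uniform random direction through x
    t_fwd = chord_end(x, d, in_K)              # forward chord endpoint
    t_bwd = chord_end(x, -d, in_K)             # backward chord endpoint
    return x + rng.uniform(-t_bwd, t_fwd) * d  # uniform point on the chord
```

For a general density proportional to f, the last line would instead sample t from the one-dimensional density proportional to f(x + td) on [−t_bwd, t_fwd].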
Next we give a formula for the next step distribution from a point u.
Lemma 10.35. The next step distribution of Hit-and-Run from a point u is given by

    P_u(A) = (2 / vol(S^{n−1})) ∫_A dx / (∥x − u∥^{n−1} ℓ(u, x))

where A is any measurable subset of K and ℓ(u, x) is the length of the chord in K through u and x.
Exercise 10.36. Prove Lemma 10.35.
The main theorem of this section is the following [51].
Theorem 10.37 ([51]). The conductance of Hit-and-Run in a convex body K containing the unit ball and of
diameter D is Ω(1/nD).
This implies a mixing time of O(n²D² log(M/ε)) to get to within distance ε of the target density starting
from an M-warm initial density. By taking one step from the initial point, we can bound M by (D/d)^n
where d is the minimum distance of the starting point from the boundary. Hence this gives a bound of
Õ(n³D²) from any interior starting point.
The proof of the theorem follows the same high-level outline as that of the ball walk, needing two major
ingredients, namely, one-step coupling and isoperimetry. Notably, the isoperimetry is for a non-Euclidean
notion of distance. We begin with some suitable definitions.
Define the median step-size function F as the F(x) such that

    Pr(∥x − y∥ ≤ F(x)) = 1/8

where y is a random step from x.

We also need a non-Euclidean notion of distance, namely the classical cross-ratio distance. For points
u, v in K, inducing a chord [p, q] with these points in the order p, u, v, q, the cross-ratio distance is

    d_K(u, v) = (∥u − v∥ ∥p − q∥) / (∥p − u∥ ∥v − q∥).

It is related to the Hilbert distance (which is a true distance) as follows:

    d_H(u, v) = ln(1 + d_K(u, v)).
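For concreteness, here is the cross-ratio distance as a small helper, taking the four collinear points in the order p, u, v, q (computing the chord endpoints p, q themselves would use a routine like chord_end above):

```python
import numpy as np

def cross_ratio_distance(p, u, v, q):
    # d_K(u, v) for a chord of K with endpoints p, q, and p, u, v, q collinear in order
    norm = np.linalg.norm
    return (norm(u - v) * norm(p - q)) / (norm(p - u) * norm(v - q))
```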

The first ingredient shows that if two points are close geometrically, then their next-step distributions
have significant overlap.

Lemma 10.38. For two points u, v ∈ K with

    d_K(u, v) < 1/8  and  ∥u − v∥ ≤ (2/√n) max{F(u), F(v)}

we have d_{TV}(P_u, P_v) < 1 − 1/500.

The second ingredient is an isoperimetric inequality (independent of any algorithm). The cross-ratio
distance satisfies a nice isoperimetric inequality.

Theorem 10.39. For any partition S1, S2, S3 of a convex body K,

    vol(S3) ≥ d_K(S1, S2) vol(S1) vol(S2) / vol(K).

However, this will not suffice to prove a bound on the conductance of all subsets. The reason is that we
cannot guarantee a good lower bound on the minimum distance between subsets S1, S2. Instead, we will
need a weighted isoperimetric inequality, which uses an average distance.

Theorem 10.40. Let S1, S2, S3 be a partition of a convex body K. Let h : K → R_+ be a function such that for
any u ∈ S1, v ∈ S2, and any x on the chord through u and v, we have

    h(x) ≤ (1/3) min{1, d_K(u, v)}.

Then,

    vol(S3) ≥ E_K(h(x)) min{vol(S1), vol(S2)}.

For bounding the conductance, we will use a specific function h. To introduce it, we first define a step-size
function s(x):

    s(x) = sup{t : vol((x + tB^n) ∩ K) / vol(tB^n) ≥ γ}

for some fixed γ ∈ (0, 1].

Exercise 10.41. Show that the step-size function is concave over any convex body.
We will need the following relationship between the step-size function and the median step function.

Lemma 10.42. We have

    E_K(s(x)) ≥ (1 − γ)/√n.

Moreover, for γ ≥ 63/64, we have F(x) ≥ s(x)/32.
In the proof of Theorem 10.37, we will set h(x) = s(x)/(48√n D). We are now ready for that proof.

Proof of Thm. 10.37. Let K = S1 ∪ S2 be a partition of K into measurable sets. We will prove that

    ∫_{S1} P_x(S2) dx ≥ (c/nD) min{vol(S1), vol(S2)}.    (10.7)

Note that since the uniform distribution is stationary,

    ∫_{S1} P_x(S2) dx = ∫_{S2} P_x(S1) dx.

Consider the points that are deep inside these sets, i.e., unlikely to jump out of the set:

    S1′ = {x ∈ S1 : P_x(S2) < 1/1000}  and  S2′ = {x ∈ S2 : P_x(S1) < 1/1000}.

Let S3′ be the rest, i.e., S3′ = K \ S1′ \ S2′.

Suppose vol(S1′) < vol(S1)/2. Then

    ∫_{S1} P_x(S2) dx ≥ (1/1000) vol(S1 \ S1′) ≥ (1/2000) vol(S1)

which proves (10.7).


So we can assume that vol(S1′ ) ≥ vol(S1 )/2 and similarly vol(S2′ ) ≥ vol(S2 )/2. Now, for any u ∈ S1′ and
v ∈ S2′ ,
1
dT V (Pu , Pv ) ≥ 1 − Pu (S2 ) − Pv (S1 ) > 1 − .
500
Applying Lemma 10.38, we get that that one of the following holds:
1 2
dK (u, v) ≥ or ∥u − v∥ ≥ √ max {F (u), F (v)}
8 n

We now verify the condition of Theorem 10.40 using h(x) = s(x)/(48√n D), where s(x) is defined with γ = 63/64.
To see that this is a valid choice, first note that if d_K(u, v) ≥ 1/8, then we have h(x) ≤ d_K(u, v)/3, as needed. So we can
assume the second condition above holds. Next, noting that x is some point on the chord through u, v, let
the endpoints of the chord be p, q. Suppose WLOG that x ∈ [u, q]. Then, by the concavity of s(x), and using
the second part of Lemma 10.42, we have

    s(x) ≤ (|x − p| / |u − p|) s(u)
         ≤ 32 (|x − p| / |u − p|) F(u)
         ≤ 16√n (|x − p| / |u − p|) |u − v|
         ≤ 16 d_K(u, v) √n D

which again implies the desired condition on h.


Now, applying Theorem 10.40 to the partition S1′ , S2′ , S3′ , and using the rst part of Lemma 10.42, we
have

vol(S3′ ) ≥ EK (h(x)) min{vol(S1′ ), vol(S2′ )}


1
≥ min{vol(S1 ), vol(S2 )}.
4000nD

We can now prove (10.7) as follows:

    ∫_{S1} P_x(S2) dx = (1/2) ∫_{S1} P_x(S2) dx + (1/2) ∫_{S2} P_x(S1) dx
                     ≥ (1/2) vol(S3′) · (1/1000)
                     ≥ (1/(2^{23} nD)) min{vol(S1′), vol(S2′)}
                     ≥ (1/(2^{24} nD)) min{vol(S1), vol(S2)}.

10.7 Dikin walk


Both the ball walk and hit-and-run have a dependence on the roundness of the target distribution, e.g.,
via its diameter or average distance to the center of gravity. Reducing this dependence to logarithmic by
rounding is polynomial time but expensive. The current best rounding algorithm (which achieves near-isotropic
position) has complexity O∗(n³). Here we describe a different approach, which is affine-invariant,
but requires more knowledge of the convex body. In particular, we will focus on the special case of sampling
an explicit polytope P = {x : Ax ≥ b}.
The general Dikin walk is defined as follows. For a convex set K with a positive definite matrix H(u) for
each point u ∈ K, let

    E_u(r) = {x : (x − u)^⊤ H(u)(x − u) ≤ r}.

Algorithm 31: DikinWalk
Input: starting point x_0 in a polytope P = {x : Ax ≥ b}.
Set r = 1/8.
Repeat T times: at current point x,
    1. Pick y from E_x(r).
    2. Go to y with probability min{1, vol(E_x(r))/vol(E_y(r))}.
return x.
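A compact numpy sketch of one step for the log barrier follows, where H(x) = A^⊤S^{−2}A with S = diag(Ax − b); the acceptance ratio vol(E_x(r))/vol(E_y(r)) equals √(det H(y)/det H(x)). This is an illustrative implementation only, not the fast per-step version discussed later in this section.

```python
import numpy as np

def log_barrier_hessian(A, b, x):
    s = A @ x - b                     # slacks, positive inside P
    return A.T @ (A / s[:, None] ** 2)

def dikin_step(A, b, x, r, rng):
    n = len(x)
    H = log_barrier_hessian(A, b, x)
    # sample u uniformly from the unit ball, then map it into the ellipsoid E_x(r)
    u = rng.standard_normal(n)
    u *= rng.random() ** (1.0 / n) / np.linalg.norm(u)
    L = np.linalg.cholesky(H)
    y = x + np.sqrt(r) * np.linalg.solve(L.T, u)
    if np.any(A @ y - b <= 0):
        return x                      # numerical guard; E_x(r) lies inside P for r <= 1
    Hy = log_barrier_hessian(A, b, y)
    # log of vol(E_x(r))/vol(E_y(r)) = (1/2) (log det H(y) - log det H(x))
    log_ratio = 0.5 * (np.linalg.slogdet(Hy)[1] - np.linalg.slogdet(H)[1])
    return y if np.log(rng.random()) < min(0.0, log_ratio) else x
```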

10.7.1 Strong Self-Concordance


We require a family of matrices to have the following properties. Usually, but not necessarily, these matrices
come from the Hessian of some convex function.
Definition 10.43 (Symmetric self-concordance). For any convex set K ⊂ R^n, we call a matrix function
H : K → R^{n×n} self-concordant if for any u ∈ K and direction h, we have

    −2∥h∥_{H(u)} H(u) ⪯ (d/dt) H(u + th)|_{t=0} ⪯ 2∥h∥_{H(u)} H(u).

We call H a symmetric ν̄-self-concordant barrier if H is self-concordant and for any u ∈ K,

    E_u(1) ⊆ K ∩ (2u − K) ⊆ E_u(√ν̄).

The following lemma shows that self-concordant matrix functions also enjoy a similar regularity as the
usual self-concordant functions.
Lemma 10.44. Given any self-concordant matrix function H on K ⊂ R^n, we define ∥v∥_x² = v^⊤H(x)v.
Then, for any x, y ∈ K with ∥x − y∥_x < 1, we have

    (1 − ∥x − y∥_x)² H(x) ⪯ H(y) ⪯ (1/(1 − ∥x − y∥_x)²) H(x).

Proof. Let h = y − x, x_t = x + th and ϕ(t) = h^⊤H(x_t)h. Then,

    |ϕ′(t)| = |h^⊤ (d/dt)H(x_t) h| ≤ 2∥h∥³_{x_t} = 2ϕ(t)^{3/2}.

Hence, we have |(d/dt)(1/√ϕ(t))| ≤ 1. Therefore, 1/√ϕ(t) ≥ 1/√ϕ(0) − t and hence

    ϕ(t) ≤ ϕ(0)/(1 − t√ϕ(0))².    (10.8)

Now we fix any v and define ψ(t) = v^⊤H(x_t)v. Then,

    |ψ′(t)| = |v^⊤ (d/dt)H(x_t) v| ≤ 2∥h∥_{x_t} ∥v∥²_{x_t} = 2√ϕ(t) ψ(t).

Using (10.8) at the end, we have

    |(d/dt) ln ψ(t)| ≤ 2√ϕ(0)/(1 − t√ϕ(0)).

Integrating both sides from 0 to 1,

    ln(ψ(1)/ψ(0)) ≤ 2 ∫_0^1 √ϕ(0)/(1 − t√ϕ(0)) dt = 2 ln(1/(1 − √ϕ(0))).

The result follows from this, ψ(1) = v^⊤H(y)v, ψ(0) = v^⊤H(x)v, and ϕ(0) = ∥x − y∥²_x.
Many natural barriers, including the logarithmic barrier and the LS-barrier, satisfy a much stronger
condition than self-concordance. However, this is not always true, as one can construct counterexamples even
in one dimension.
Definition 10.45. For any convex set K ⊂ R^n, we say a matrix function H : K → R^{n×n} is strongly
self-concordant if for any x ∈ K and direction h, we have

    ∥H(x)^{−1/2} DH(x)[h] H(x)^{−1/2}∥_F ≤ 2∥h∥_x

where DH(x)[h] is the directional derivative of H at x in the direction h.


Similar to Lemma 10.44, we have the following global version for strongly self-concordant matrix functions.
Lemma 10.46. Given any strongly self-concordant matrix function H on K ⊂ R^n, for any x, y ∈ K with
∥x − y∥_x < 1, we have

    ∥H(x)^{−1/2}(H(y) − H(x))H(x)^{−1/2}∥_F ≤ ∥x − y∥_x/(1 − ∥x − y∥_x)².
Proof. Let x_t = (1 − t)x + ty. Then, we have

    ∥H(x)^{−1/2}(H(y) − H(x))H(x)^{−1/2}∥_F ≤ ∫_0^1 ∥H(x)^{−1/2} (d/dt)H(x_t) H(x)^{−1/2}∥_F dt.

We note that H is self-concordant. Hence, Lemma 10.44 shows that

    ∥H(x)^{−1/2} (d/dt)H(x_t) H(x)^{−1/2}∥²_F = tr[H(x)^{−1} ((d/dt)H(x_t)) H(x)^{−1} ((d/dt)H(x_t))]
        ≤ (1/(1 − ∥x − x_t∥_x)⁴) tr[H(x_t)^{−1} ((d/dt)H(x_t)) H(x_t)^{−1} ((d/dt)H(x_t))]
        ≤ (4/(1 − ∥x − x_t∥_x)⁴) ∥x − x_t∥²_{x_t}
        ≤ (4/(1 − ∥x − x_t∥_x)⁶) ∥x − x_t∥²_x

where we used the strong self-concordance assumption in the second inequality and Lemma 10.44 again for the last
inequality. Hence,

    ∥H(x)^{−1/2}(H(y) − H(x))H(x)^{−1/2}∥_F ≤ ∫_0^1 2∥x − x_t∥_x/(1 − ∥x − x_t∥_x)³ dt
        = ∫_0^1 2t∥x − y∥_x/(1 − t∥x − y∥_x)³ dt
        = ∥x − y∥_x/(1 − ∥x − y∥_x)².

We note that strong self-concordance is stronger than self-concordance since the Frobenius norm is always
at least the spectral norm. As an example, we will verify that the conditions hold for the standard
log barrier (Lemma ??).
The Dikin walk has the following guarantee.
Theorem 10.47. The mixing rate of the Dikin walk for a symmetric, strongly self-concordant matrix function
with convex log determinant is O(nν̄).
Each step of the standard Dikin walk is fast, and does not need matrix multiplication.
Theorem 10.48. The Dikin walk with the logarithmic barrier for a polytope {Ax ≥ b} can be implemented
in time O(nnz(A) + n²) per step while maintaining the mixing rate of O(mn).
The next lemma results from studying strong self-concordance for classical barriers. The KLS constant
below is conjectured to be O(1) and known to be O(√log n).
Lemma 10.49. Let ψ_n be the KLS constant of isotropic logconcave densities in R^n, namely, for any isotropic
logconcave density p and any set S ⊂ R^n, we have

    ∫_{∂S} p(x) dx ≥ (1/ψ_n) min{∫_S p(x) dx, ∫_{R^n \ S} p(x) dx}.

Let H(x) be the Hessian of the universal or entropic barrier. Then, we have

    ∥H(x)^{−1/2} DH(x)[h] H(x)^{−1/2}∥_F = O(ψ_n) ∥h∥_x.

In short, the universal and entropic barriers in R^n are strongly self-concordant up to a scaling factor depending
on ψ_n.
In fact, the proof shows that up to a logarithmic factor the strong self-concordance of these barriers is
equivalent to the KLS conjecture.

10.8 Mixing with Strong Self-Concordance


A key ingredient of the proof of Theorem 10.47 is the following lemma.
Lemma 10.50. For two points x, y ∈ P with ∥x − y∥_x ≤ 1/(8√n), we have d_{TV}(P_x, P_y) ≤ 3/4.

Proof. We have to prove two things: first, that the rejection probability is small; second, that the ellipsoids used by
the Dikin walk at x, y have large overlap. More precisely, we have

    d_{TV}(P_x, P_y) ≤ (1/2) rej_x + (1/2) rej_y + (1/2) vol(P_x \ P_y)/vol(P_x) + (1/2) vol(P_y \ P_x)/vol(P_y)
                   = (1/2) rej_x + (1/2) rej_y + 1 − (1/2) vol(P_x ∩ P_y)/vol(P_x) − (1/2) vol(P_x ∩ P_y)/vol(P_y)    (10.9)

where rej_x and rej_y are the rejection probabilities at x and y.

For the rejection probability at x, consider the algorithm picking z from E_x(r). Let f(z) = ln det H(z).
The acceptance probability of the sample z is

    min{1, vol(E_x(r))/vol(E_z(r))} = min{1, √(det H(z)/det H(x))}.    (10.10)

By the assumption that f is a convex function, we have that

    ln(det H(z)/det H(x)) = f(z) − f(x) ≥ ⟨∇f(x), z − x⟩.    (10.11)

To simplify the notation, we assume H(x) = I. Since z is sampled from the ball of radius r centered at x,
we know that

    P(v^⊤(z − x) ≥ −ϵr∥v∥₂) ≥ 1 − e^{−nϵ²/2}.

In particular, with probability at least 0.99 in z, we have

    ⟨∇f(x), z − x⟩ ≥ −(4r/√n) ∥∇f(x)∥₂.    (10.12)
To compute ∥∇f(x)∥₂, it is easier to compute directional derivatives of f. Note that

    ∥∇f(x)∥₂ = max_{∥v∥₂=1} ∇f(x)^⊤v
             = max_{∥v∥₂=1} tr(H(x)^{−1} DH(x)[v])
             = max_{∥v∥₂=1} tr(H(x)^{−1/2} DH(x)[v] H(x)^{−1/2})
             ≤ max_{∥v∥₂=1} √n ∥H(x)^{−1/2} DH(x)[v] H(x)^{−1/2}∥_F
             ≤ 2√n    (10.13)

where the first inequality follows from |∑_{i=1}^n λ_i| ≤ √n √(∑_{i=1}^n λ_i²) and the second inequality follows from the
definition of strong self-concordance.


Combining (10.10), (10.11), (10.12) and (10.13), we see that with probability at least 0.99 in z, the acceptance
probability of the sample z is

    min{1, vol(E_x(r))/vol(E_z(r))} ≥ e^{−2r} ≥ 0.77    (10.14)

where we used that r = 1/8. Hence, the overall rejection probability rej_x (and similarly rej_y) satisfies

    rej_x ≤ 0.24 and rej_y ≤ 0.24.    (10.15)

Now, we bound the fraction of volume in the intersection of the ellipsoids at x, y. Again, we can assume
that H(x) = I. Then, strong self-concordance and Lemma 10.46 show that

    ∥H(y) − I∥_F ≤ 2∥x − y∥_x ≤ 1/(4√n).    (10.16)

In particular, we have that

    (1/2) I ⪯ H(y) ⪯ (3/2) I.    (10.17)

We partition the eigenvalues λ_i of H(y) into those of value at least 1 and the rest. Then consider the
ellipsoid E whose eigenvalues are min{1, λ_i}. This is contained in both E_x(1) and E_y(1). We will see that
vol(E) is a constant fraction of the volume of both E_x(1) and E_y(1). First, we compare E and E_x(1):

    vol(E)/vol(E_x(1)) = ∏_{i:λ_i<1} λ_i = ∏_{i:λ_i<1} (1 − (1 − λ_i)) ≥ exp(−2 ∑_{i:λ_i<1} (1 − λ_i))    (10.18)

where we used that 1 − x ≥ exp(−2x) for 0 ≤ x ≤ 12 and λi ≥ 12 (10.17). From the inequality (10.16), it
follows that sX
1
(λi − 1)2 ≤ √ .
i
4 n

Therefore, |λi − 1| ≤ 14 . Putting it into (10.18), we have


P
i:λi <1

vol(Px ∩ Py ) vol(E) 1
= ≥ e− 2 . (10.19)
vol(Px ) vol(Ex (1))
Similarly, we have
Q
vol(Px ∩ Py ) i:λi <1 λi 1 1 1
= Q =Q ≥ P ≥ e− 4 . (10.20)
vol(Py ) i:λi λi i:λi >1 λi exp( i:λi >1 (λi − 1))

Putting (10.15), (10.19) and (10.20) into (10.9), we have

    d_{TV}(P_x, P_y) ≤ 0.24/2 + 0.24/2 + 1 − e^{−1/2}/2 − e^{−1/4}/2 ≤ 3/4.

The next lemma establishes isoperimetry. This only needs the symmetric containment assumption. The
isoperimetry is for the cross-ratio distance. For a convex body K and any two points u, v ∈ K, suppose that
p, q are the endpoints of the chord through u, v in K, so that these points occur in the order p, u, v, q. Then,
the cross-ratio distance between u and v is defined as

    d_K(u, v) = (∥u − v∥₂ ∥p − q∥₂) / (∥p − u∥₂ ∥v − q∥₂).

This distance enjoys the following isoperimetric inequality.
Theorem 10.51 ([48]). For any convex body K, disjoint subsets S1, S2 of it, and S3 = K \ S1 \ S2, we
have

    vol(S3) ≥ d_K(S1, S2) vol(S1) vol(S2) / vol(K).
We now relate the cross-ratio distance to the ellipsoidal norm.

Lemma 10.52. For u, v ∈ K, d_K(u, v) ≥ ∥u − v∥_u / √ν̄.

Proof. Consider the ellipsoid at u. For the chord [p, q] induced by u, v with these points in the order p, u, v, q,
suppose that ∥p − u∥₂ ≤ ∥v − q∥₂. Then p ∈ K ∩ (2u − K), and hence by the symmetric containment
(Definition 10.43), ∥p − u∥_u ≤ √ν̄. Therefore,

    d_K(u, v) = (∥u − v∥₂ ∥p − q∥₂) / (∥p − u∥₂ ∥v − q∥₂) ≥ ∥u − v∥₂ / ∥p − u∥₂ = ∥u − v∥_u / ∥p − u∥_u ≥ ∥u − v∥_u / √ν̄.
We can now prove the main conductance bound.
We follow the standard high-level outline [74]. Consider any measurable subset S1 ⊆ K and let S2 = K \ S1
be its complement. Define the points with low escape probability for these subsets as

    Si′ = {x ∈ Si : P_x(K \ Si) < 1/8}

and S3′ = K \ S1′ \ S2′. Then, for any u ∈ S1′, v ∈ S2′, we have d_{TV}(P_u, P_v) > 1 − 1/4. Hence, by Lemma 10.50,
we have ∥u − v∥_u ≥ 1/(8√n). Therefore, by Lemma 10.52,

    d_K(u, v) ≥ 1/(8√(n ν̄)).

We can now bound the conductance of S1. We may assume that vol(Si′) ≥ vol(Si)/2; otherwise, it immediately
follows that the conductance of S1 is Ω(1). Assuming this, we have

    ∫_{S1} P_x(S2) dx ≥ ∫_{S3′} (1/8) dx ≥ (1/8) vol(S3′)
        ≥ (1/8) d_K(S1′, S2′) vol(S1′) vol(S2′)/vol(P)    (using isoperimetry, Theorem 10.51)
        ≥ (1/(512√(n ν̄))) min{vol(S1), vol(S2)}.

10.9 Hamiltonian Monte Carlo


Hamiltonian dynamics is an alternative way to formulate Newtonian mechanics. The Hamiltonian H captures
both the potential and kinetic energy of a particle as a function of its position and velocity. The dynamics
can be described by the following differential equations:

    dx/dt = ∂H(x, v)/∂v,
    dv/dt = −∂H(x, v)/∂x.

These equations preserve the Hamiltonian function H. In the simplest Euclidean setting, it can be defined
as follows:

    H(x, v) = f(x) + (1/2)∥v∥²

so that

    dx/dt = v,  dv/dt = −∇f(x)

or

    d²x/dt² = −∇f(x).
More generally, the Hamiltonian can depend on a function that defines a local metric:

    H(x, v) = f(x) + (1/2) log((2π)^n det g(x)) + (1/2) v^T g(x)^{−1} v

where g(x) is a matrix, and when it is PSD, it defines a local norm at x. In this sense, we can view the
dynamics as evolving on a manifold with local metric g(x). In this chapter, we will focus on the case
when g(x) = I, the standard Euclidean metric.
(Riemannian) Hamiltonian Monte Carlo (or RHMC) [?, ?][?, ?] is a Markov Chain Monte Carlo method
for sampling from a desired distribution. Each step of the method consists of the following: at a current
point x,
1. Pick a random velocity y according to a local distribution defined by x (in the simplest setting, this is
the standard Gaussian distribution for every x).
2. Move along the Hamiltonian curve defined by Hamiltonian dynamics at (x, y) for time (distance) δ.
For the choice of H above, the marginal distribution of the current point x approaches the target distribution
with density proportional to e^{−f}. Note that HMC does not require a Metropolis filter! Thus, unlike the walks
we have seen so far, its step sizes are not limited by this consideration even in high dimension. Hamiltonian
Monte Carlo can be used for sampling from a general distribution e^{−H(x,y)}.

Definition 10.53. Given a continuous, twice-differentiable function H : M × R^n ⊂ R^n × R^n → R (the
Hamiltonian), where M is the x-domain of H, we say (x(t), y(t)) follows a Hamiltonian curve if it satisfies the
Hamiltonian equations

    dx/dt = ∂H(x, y)/∂y,
    dy/dt = −∂H(x, y)/∂x.    (10.21)

We define the map T_δ(x, y) = (x(δ), y(δ)) where (x(t), y(t)) follows the Hamiltonian curve with the
initial condition (x(0), y(0)) = (x, y).
Hamiltonian Monte Carlo is a Markov chain generated by a sequence of randomly chosen Hamiltonian curves.

Algorithm 32: Hamiltonian Monte Carlo
Input: some initial point x^{(1)} ∈ M.
for k = 1, 2, · · · , T do
    Sample y according to e^{−H(x^{(k)}, y)}/π(x^{(k)}) where π(x) = ∫_{R^n} e^{−H(x,y)} dy.
    With probability 1/2, set (x^{(k+1)}, y^{(k+1)}) = T_δ(x^{(k)}, y).
    Otherwise, set (x^{(k+1)}, y^{(k+1)}) = T_{−δ}(x^{(k)}, y).
end
Output: (x^{(T+1)}, y^{(T+1)}).

Time-reversibility
Lemma 10.54 (Energy Conservation). For any Hamiltonian curve (x(t), y(t)), we have that

    (d/dt) H(x(t), y(t)) = 0.

Proof. Note that

    (d/dt) H(x(t), y(t)) = (∂H/∂x)(dx/dt) + (∂H/∂y)(dy/dt) = (∂H/∂x)(∂H/∂y) − (∂H/∂y)(∂H/∂x) = 0.

Lemma 10.55 (Measure Preservation). For any t ≥ 0, we have that

    det(DT_t(x, y)) = 1

where DT_t(x, y) is the Jacobian of the map T_t at the point (x, y).
Proof. Let (x(t, s), y(t, s)) be a family of Hamiltonian curves given by T_t(x + s d_x, y + s d_y). We write

    u(t) = (∂/∂s) x(t, s)|_{s=0},  v(t) = (∂/∂s) y(t, s)|_{s=0}.

By differentiating the Hamiltonian equations (10.21) w.r.t. s, we have that

    du/dt = (∂²H(x, y)/∂y∂x) u + (∂²H(x, y)/∂y∂y) v,
    dv/dt = −(∂²H(x, y)/∂x∂x) u − (∂²H(x, y)/∂x∂y) v,
    (u(0), v(0)) = (d_x, d_y).

This can be captured by the following matrix ODE, with the second derivatives evaluated along (x(t), y(t)):

    dΦ/dt = (  ∂²H/∂y∂x    ∂²H/∂y∂y
              −∂²H/∂x∂x   −∂²H/∂x∂y ) Φ(t),    Φ(0) = I

using the equation

    DT_t(x, y) (d_x; d_y) = (u(t); v(t)) = Φ(t) (d_x; d_y).

Therefore, DT_t(x, y) = Φ(t). Next, we observe that

    (d/dt) log det Φ(t) = tr(Φ(t)^{−1} (d/dt)Φ(t)) = tr(  ∂²H/∂y∂x    ∂²H/∂y∂y
                                                         −∂²H/∂x∂x   −∂²H/∂x∂y ) = 0.

Hence,

    det Φ(t) = det Φ(0) = 1.

Using the previous two lemmas, we can see that Hamiltonian Monte Carlo indeed converges to the desired
distribution.
Lemma 10.56 (Time reversibility). Let p_x(x′) denote the probability density of one step of Hamiltonian
Monte Carlo starting at x. We have that

    π(x) p_x(x′) = π(x′) p_{x′}(x)

for almost every x and x′, where π(x) = ∫_{R^n} e^{−H(x,y)} dy.
Proof. Fix x and x′. Let F_δ^x(y) be the x-component of T_δ(x, y). Let V_+ = {y : F_δ^x(y) = x′} and V_− = {y :
F_{−δ}^x(y) = x′}. Then,

    π(x) p_x(x′) = (1/2) ∫_{y∈V_+} e^{−H(x,y)} / |det(DF_δ^x(y))| + (1/2) ∫_{y∈V_−} e^{−H(x,y)} / |det(DF_{−δ}^x(y))|.

We note that this formula assumes that DF_δ^x is invertible. Sard's theorem shows that F_δ^x(N) has measure
0 where N = {y : DF_δ^x(y) is not invertible}. Therefore, the formula is correct except on a measure zero
subset.
By reversing time for the Hamiltonian curve, we have that for the same V_±,

    π(x′) p_{x′}(x) = (1/2) ∫_{y∈V_+} e^{−H(x′,y′)} / |det(DF_{−δ}^{x′}(y′))| + (1/2) ∫_{y∈V_−} e^{−H(x′,y′)} / |det(DF_δ^{x′}(y′))|    (10.22)

where y′ denotes the y-component of T_δ(x, y) and T_{−δ}(x, y) in the first and second sum respectively.

where y ′ denotes the y component of Tδ (x, y) and T−δ (x, y) in the rst
 and second  sum respectively.
A B
We compare the rst terms in both equations. Let DTδ (x, y) = . Since Tδ ◦ T−δ = I and
C D
Tδ (x, y) = (x′ , y ′ ), the inverse function theorem shows that DT−δ (x′ , y ′ ) is the inverse map of DTδ (x, y).
Hence, we have that
−1 
· · · −A−1 B(D − CA−1 B)−1
 
A B
DT−δ (x′ , y ′ ) = = .
C D ··· ···

Therefore, we have that Fδx (y) = B and F−δ
x
(y ′ ) = −A−1 B(D − CA−1 B)−1 . Hence, we have that

x′
 −1 |det B|
det DF−δ (y ′ ) = det A−1 det B det D − CA−1 B =  .

det A B

C D
 
A B
Using that det (DTt (x, y)) = det = 1 (Lemma 10.55), we have that
C D
 
x′
det DF−δ (y ′ ) = |det (DFδx (y))| .


Hence, we have that

    (1/2) ∫_{y∈V_+} e^{−H(x,y)} / |det(DF_δ^x(y))| = (1/2) ∫_{y∈V_+} e^{−H(x,y)} / |det(DF_{−δ}^{x′}(y′))|
                                                 = (1/2) ∫_{y∈V_+} e^{−H(x′,y′)} / |det(DF_{−δ}^{x′}(y′))|

where we used that e^{−H(x,y)} = e^{−H(x′,y′)} (Lemma 10.54) at the end.
For the second term in (10.22), by the same calculation, we have that

    (1/2) ∫_{y∈V_−} e^{−H(x,y)} / |det(DF_{−δ}^x(y))| = (1/2) ∫_{y∈V_−} e^{−H(x′,y′)} / |det(DF_δ^{x′}(y′))|.

Convergence
First we consider convergence in the case when H(x, v) = f(x) + (1/2)∥v∥² for a strongly convex function f,
so the marginal of the stationary distribution along x is proportional to e^{−f}. The idea here is coupling (as
we did for Langevin dynamics). We consider two separate processes x and y, with their next step directions
chosen to be identical. The key lemma is that with this setting the squared distance decreases up to a certain
time that depends on the condition number.
Figure 10.4: Hit-and-Run Algorithm

Figure 10.5: Hit-and-Run from a corner

Figure 10.6: (a) E_u(1) ⊆ K ∩ (2u − K) ⊆ E_u(√ν̄). (b) Strong self-concordance measures the rate of change of
the Hessian of a barrier in the Frobenius norm.
Chapter 11

Annealing

11.1 Simulated Annealing


In this chapter we study a sampling-based approach for optimization and integration in high dimension. The
main idea is to sample from a sequence of logconcave distributions, starting with one that is easy to integrate
and ending with the function whose integral is desired. This process is known as simulated annealing. The
same high-level algorithm can be used for optimization, volume computation/integration or rounding. For
integration, the desired integral can be expressed as the following telescoping product:

    ∫_{R^n} f = ∫ f_0 · (∫f_1/∫f_0) · (∫f_2/∫f_1) · · · (∫f_m/∫f_{m−1})

where f_m = f. Each ratio ∫f_{i+1}/∫f_i is the expectation of the estimator Y = f_{i+1}(X)/f_i(X) for X drawn from
the density proportional to f_i. What sequence of functions should we use?
the density proportional to fi . What sequence of functions should we use?


Algorithm 33: SimulatedAnnealing
1. For i = 0, . . . , m, define

    a_i = b(1 + 1/√n)^i  and  f_i(x) = f(x)^{a_i}.

2. Let X_0^1, . . . , X_0^k be independent random points with density proportional to f_0.
3. For i = 0, . . . , m − 1: starting with X_i^1, . . . , X_i^k, generate random points X_{i+1} = {X_{i+1}^1, . . . , X_{i+1}^k};
update a running estimate g based on these samples; update the isotropy transformation using the
samples.
4. Output the final estimate of g.
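The integration variant is a one-loop product of sample means. A schematic follows, assuming a generic routine sample(fi, k) that returns k approximate samples from the density proportional to fi; this routine, and the known value Z0 = ∫f_0, are the assumptions here.

```python
import numpy as np

def anneal_integral(fs, Z0, sample, k):
    """Telescoping-product estimator: fs = [f_0, ..., f_m], Z0 = integral of f_0."""
    est = Z0
    for i in range(len(fs) - 1):
        X = sample(fs[i], k)                                   # X_i^1, ..., X_i^k ~ f_i
        est *= np.mean([fs[i + 1](x) / fs[i](x) for x in X])   # ~ int f_{i+1} / int f_i
    return est
```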
For optimization, the function f_m is set to be a sufficiently high power of f, the function to be maximized,
while g is simply the maximum objective value so far. For integration and rounding, f_m = f, the target
function to be integrated or rounded. For integration, the function g starts out as the integral of f_0 and
is multiplied by the ratio ∫f_{i+1}/∫f_i in each step. For rounding, g is simply the estimate of the isotropic
transformation for the current function.


For sampling and optimization, it is natural to ask why one uses a sequence of functions rather than
jumping straight to the target function f_m. For optimizing a linear function c^T x over a convex set K, all
we need to do is sample according to the density proportional to e^{−a(c^T x)} for a sufficiently large coefficient
a (recall Theorem 9.6). The reason for using a sequence is that the function f_m and corresponding density
p_m can be very far from the starting function or distribution, and hence the mixing time can be high. The
sequence ensures that samples from the current function provide a good (warm) start for sampling from the
next function. This is captured in the next lemma.
Theorem 11.1. Let p_i(x) = f_i(x)/∫f_i with f_i defined as in the annealing algorithm for some logconcave
function f : R^n → R_+. Then a random sample X ∼ p_i satisfies

    E_{p_i}(p_i(X)/p_{i+1}(X)) = O(1).
The underlying mathematical property behind this lemma is the following.


Lemma 11.2. Let f be a logconcave function in R^n. Then Z(a) = a^n ∫_{R^n} f(x)^a dx is logconcave for a ≥ 0. If
f has support K, then Z(a) = a^n ∫_K f(ax) dx is logconcave for a > 0.

11.2 Volume Computation


For volume computation, we can apply annealing as follows. We assume that the input convex body K
contains a unit ball, has diameter bounded by D and is given by a membership oracle. The polynomial-time
algorithm of Dyer, Frieze and Kannan [23] uses a sequence of uniform distributions on convex bodies,
starting with the ball contained inside the input body K. Each body in the sequence is a ball intersected
with the given convex body K: K_i = 2^{i/n} rB ∩ K and f_i(x) is the indicator of K_i. The length of the sequence is
m = O(n log D) so that the final body is just K. A variance computation shows that O(m/ϵ²) samples per
distribution suffice to get an overall 1 + ϵ multiplicative error approximation with high probability. The total
number of samples is O∗(m²) = O∗(n²) and the complexity of the resulting algorithm is O∗(n⁵) as shown
in [33]. Table 11.1 below summarizes progress on the volume problem over the past three decades. Besides
improving the complexity of volume computation, each step has typically resulted in new techniques. For
more details, we refer the reader to surveys on the topic [66, 72].

Year/Authors | New ingredients | Steps
1989/Dyer-Frieze-Kannan [23] | Everything | n^23
1990/Lovász-Simonovits [49] | Better isoperimetry | n^16
1990/Lovász [47] | Ball walk | n^10
1991/Applegate-Kannan [5] | Logconcave sampling | n^10
1990/Dyer-Frieze [22] | Better error analysis | n^8
1993/Lovász-Simonovits [50] | Localization lemma | n^7
1997/Kannan-Lovász-Simonovits [33] | Speedy walk, isotropy | n^5
2003/Lovász-Vempala [53] | Annealing, hit-and-run | n^4
2015/Cousins-Vempala [20] (well-rounded) | Gaussian Cooling | n^3
2017/Lee-Vempala (polytopes) | Hamiltonian Walk | mn^{2/3}

Table 11.1: The complexity of volume estimation; each step uses Õ(n) bits of randomness. The last algorithm needs
Õ(mn^{ω−1}) arithmetic operations per step while the rest need O(n²) per oracle query.

In [53] this was improved by sampling from a sequence of nonuniform distributions. Then we consider
the following estimator:

    Y = f_{i+1}(X)/f_i(X).

We see that

    E_{f_i}(Y) = ∫f_{i+1} / ∫f_i.

In the algorithm of DFK and KLS, this ratio is bounded by a constant in each phase, giving a total of O∗(n)
phases since the ratio of final to initial integrals is exponential. Instead of uniform densities, we consider

    f_i(x) ∝ exp(−a_i∥x∥)χ_K(x)  or  f_i(x) ∝ exp(−a_i∥x∥²)χ_K(x).

The coefficient a_i (inverse temperature) will be changed by a factor of (1 + 1/√n) in each phase, which implies
that m = Õ(√n) phases suffice to reach the target distribution. This is perhaps surprising since the ratio
of the initial integral to the final is typically n^{Ω(n)}. Yet the algorithm uses only Õ(√n) phases, and hence
estimates a ratio of n^{Ω̃(√n)} in one or more phases. The key insight is that even though the expected ratio
might be large, its variance is not.

Lemma 11.3. For X ∼ f_i with f_i(x) = e^{−a_i∥x∥}χ_K(x) for a convex body K, or f_i(x) = f(x)^{a_i} for a
logconcave function f, the estimator Y = f_{i+1}(X)/f_i(X) satisfies

    E(Y²)/E(Y)² ≤ (a²_{i+1} / ((2a_{i+1} − a_i)a_i))^n

which is bounded by a constant for a_i = a_{i+1}(1 + 1/√n).

Theorem 11.4 ([53]). The volume of a convex body in R^n (given by a membership oracle) can be computed
to relative error ε using Õ(n⁴/ε²) oracle queries and Õ(n²) arithmetic operations per query.
e e

The LV algorithm has two parts. In the first it finds a transformation that puts the body in near-isotropic
position. The complexity of this part is Õ(n⁴). In the second part, it runs the annealing schedule,
while maintaining that the distribution being sampled is well-rounded, a weaker condition than isotropy.
Well-roundedness requires that a level set of measure 1/8 contains a constant-radius ball and that the trace of the
covariance (the expected squared distance of a random point from the mean) is bounded by O(n), so that
R/r is effectively O(√n). To achieve the complexity guarantee for the second phase, it suffices to use the
KLS bound of ψ_p ≳ n^{−1/2}. Connecting improvements in the Cheeger constant directly to the complexity of
volume computation was an open question for a couple of decades. To apply improvements in the Cheeger
constant, one would need to replace well-roundedness with (near-)isotropy and maintain that. However,
maintaining isotropy appears to be much harder, possibly requiring a sequence of Ω(n) distributions and
Ω(n) samples from each, providing no gain over the current complexity of O∗(n⁴) even if the KLS conjecture
turns out to be true.
A faster algorithm is known for well-rounded convex bodies (any isotropic logconcave density satisfies
R/r = O(√n) and is well-rounded). This variant of simulated annealing, called Gaussian cooling, utilizes the
fact that the KLS conjecture holds for a Gaussian density restricted by any convex body, and completely
avoids computing an isotropic transformation.

Theorem 11.5 ([20]). The volume of a well-rounded convex body, i.e., with R/r = O∗(√n), can be computed
using O∗(n³) oracle calls.

In 2021, it was shown that the complexity of rounding a convex body can be bounded by O∗(n³ψ_n²) where
ψ_n is the KLS constant bound for any isotropic logconcave density in R^n. Together with the next theorem,
it follows that the volume of a convex body can be computed in the same complexity. The current bound
on the KLS constant implies that this is in fact O∗(n³).

Theorem 11.6. A near-isotropic transformation for any convex body in R^n can be computed using Õ(n³)
oracle calls, and the volume of any convex body in R^n can be computed using O∗(n³) oracle calls.
Bibliography

[1] Radosław Adamczak, Alexander Litvak, Alain Pajor, and Nicole Tomczak-Jaegermann. Quantitative estimates of the convergence of the empirical covariance matrix in log-concave ensembles. Journal of the American Mathematical Society, 23(2):535–561, 2010.

[2] Deeksha Adil, Rasmus Kyng, Richard Peng, and Sushant Sachdeva. Iterative refinement for ℓp-norm regression. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1405–1424. SIAM, 2019.

[3] David Aldous and James Fill. Reversible Markov chains and random walks on graphs. Berkeley, 1995.

[4] Zeyuan Allen-Zhu, Zheng Qu, Peter Richtárik, and Yang Yuan. Even faster accelerated coordinate descent using non-uniform sampling. In International Conference on Machine Learning, pages 1110–1119, 2016.

[5] David Applegate and Ravi Kannan. Sampling and integration of near log-concave functions. In Proceedings of the 23rd Annual ACM Symposium on Theory of Computing, May 5-8, 1991, New Orleans, Louisiana, USA, pages 156–163, 1991.

[6] Shiri Artstein-Avidan and Vitali Milman. The concept of duality in convex analysis, and the characterization of the Legendre transform. Annals of Mathematics, pages 661–674, 2009.

[7] David S Atkinson and Pravin M Vaidya. A cutting plane algorithm for convex programming that uses analytic centers. Mathematical Programming, 69(1-3):1–43, 1995.

[8] Alexandre Belloni, Tengyuan Liang, Hariharan Narayanan, and Alexander Rakhlin. Escaping the local minima via simulated annealing: Optimization of approximately convex functions. In Conference on Learning Theory, pages 240–265, 2015.

[9] Dimitris Bertsimas and Santosh Vempala. Solving convex programs by random walks. Journal of the ACM (JACM), 51(4):540–556, 2004.

[10] Thomas Blumensath and Mike E Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.

[11] J. Bourgain. Random points in isotropic convex sets. Convex Geometric Analysis, 34:53–58, 1996.

[12] Graham Brightwell and Peter Winkler. Counting linear extensions. Order, 8:225–242, 1991.

[13] Sébastien Bubeck, Ronen Eldan, and Yin Tat Lee. Kernel-based methods for bandit convex optimization. arXiv preprint arXiv:1607.03084, 2016.

[14] Anthony Carbery and James Wright. Distributional and L^q norm inequalities for polynomials over convex bodies in R^n. Mathematical Research Letters, 8(3):233–248, 2001.

[15] Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points I. arXiv preprint arXiv:1710.11606, 2017.

[16] Michael B Cohen. Nearly tight oblivious subspace embeddings by trace inequalities. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 278–287. SIAM, 2016.

[17] Michael B Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, and Aaron Sidford. Uniform sampling for matrix approximation. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pages 181–190. ACM, 2015.

[18] Michael B Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. arXiv preprint arXiv:1810.07896, 2018.

[19] B. Cousins and S. Vempala. A cubic algorithm for computing Gaussian volume. In SODA, pages 1215–1228, 2014.

[20] B. Cousins and S. Vempala. Bypassing KLS: Gaussian cooling and an O∗(n³) volume algorithm. In STOC, pages 539–548, 2015.

[21] Dmitriy Drusvyatskiy, Maryam Fazel, and Scott Roy. An optimal first order method based on optimal quadratic averaging. arXiv preprint arXiv:1604.06543, 2016.

[22] M. E. Dyer and A. M. Frieze. Computing the volume of a convex body: a case where randomness provably helps. In Proc. of AMS Symposium on Probabilistic Combinatorics and Its Applications, pages 123–170, 1991.

[23] M. E. Dyer, A. M. Frieze, and R. Kannan. A random polynomial time algorithm for approximating the volume of convex bodies. In STOC, pages 375–381, 1989.

[24] Vitaly Feldman, Cristobal Guzman, and Santosh Vempala. Statistical query algorithms for stochastic convex optimization. CoRR, abs/1512.09170, 2015.

[25] R Fletcher. Practical Methods of Optimization. 1987.

[26] Roger Fletcher. A new variational result for quasi-Newton formulae. SIAM Journal on Optimization, 1(1):18–21, 1991.

[27] Matthieu Fradelizi and Olivier Guédon. The extreme points of subsets of s-concave probabilities and a geometric localization theorem. Discrete & Computational Geometry, 31(2):327–335, 2004.

[28] Martin Grötschel, László Lovász, and Alexander Schrijver. Geometric Algorithms and Combinatorial Optimization, volume 2. Algorithms and Combinatorics, 1988.

[29] B. Grunbaum. Partitions of mass-distributions and convex bodies by hyperplanes. Pacific J. Math., 10:1257–1261, 1960.

[30] Haotian Jiang, Yin Tat Lee, Zhao Song, and Sam Chiu-wai Wong. An improved cutting plane method for convex optimization, convex-concave games, and its applications. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 944–953, 2020.

[31] Adam Tauman Kalai and Santosh Vempala. Simulated annealing for convex optimization. Math. Oper. Res., 31(2):253–266, February 2006.

[32] Ravi Kannan, László Lovász, and Miklós Simonovits. Isoperimetric problems for convex bodies and a localization lemma. Discrete & Computational Geometry, 13(1):541–559, 1995.

[33] Ravi Kannan, László Lovász, and Miklós Simonovits. Random walks and an O∗(n⁵) volume algorithm for convex bodies. Random Structures and Algorithms, 11(1):1–50, 1997.

[34] Ravindran Kannan and Santosh Vempala. Randomized algorithms in numerical linear algebra. Acta Numerica, 26:95–135, 2017.

[35] Alexander Karzanov and Leonid Khachiyan. On the conductance of order Markov chains. Order, 8(1):7–15, 1991.

[36] Tarun Kathuria, Yang P Liu, and Aaron Sidford. Unit capacity maxflow in almost O(m^{4/3}) time. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 119–130. IEEE, 2020.

[37] L Khachiyan, S Tarasov, and E Erlich. The inscribed ellipsoid method. In Soviet Math. Dokl, volume 298, 1988.

[38] Leonid G Khachiyan. Polynomial algorithms in linear programming. USSR Computational Mathematics and Mathematical Physics, 20(1):53–72, 1980.

[39] Bo'az Klartag. Needle decompositions in Riemannian geometry, volume 249. American Mathematical Society, 2017.

[40] Rasmus Kyng, Richard Peng, Sushant Sachdeva, and Di Wang. Flows in almost linear time via adaptive preconditioning. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 902–913, 2019.

[41] Michel Ledoux. Concentration of measure and logarithmic Sobolev inequalities. Séminaire de Probabilités de Strasbourg, 33:120–216, 1999.

[42] Yin Tat Lee, Aaron Sidford, and Santosh S Vempala. Efficient convex optimization with membership oracles. arXiv preprint arXiv:1706.07357, 2017.

[43] Yin Tat Lee, Aaron Sidford, and Sam Chiu-wai Wong. A faster cutting plane method and its implications for combinatorial and convex optimization. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 1049–1065. IEEE, 2015.

[44] Yin Tat Lee and Santosh S Vempala. Stochastic localization + Stieltjes barrier = tight bound for log-Sobolev. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1122–1129. ACM, 2018.

[45] A Yu Levin. On an algorithm for the minimization of convex functions. In Soviet Mathematics Doklady, volume 160, pages 1244–1247, 1965.

[46] Peter Li and Shing Tung Yau. Estimates of eigenvalues of a compact Riemannian manifold. Geometry of the Laplace Operator, 36:205–239, 1980.

[47] L. Lovász. How to compute the volume? Jber. d. Dt. Math.-Verein, Jubiläumstagung 1990, pages 138–151, 1990.

[48] L. Lovász. Hit-and-run mixes fast. Math. Prog., 86:443–461, 1998.

[49] L. Lovász and M. Simonovits. Mixing rate of Markov chains, an isoperimetric inequality, and computing the volume. In FOCS, pages 482–491, 1990.

[50] L. Lovász and M. Simonovits. Random walks in a convex body and an improved volume algorithm. In Random Structures and Alg., volume 4, pages 359–412, 1993.

[51] L. Lovász and S. Vempala. Hit-and-run from a corner. SIAM J. Computing, 35:985–1005, 2006.

[52] László Lovász and Miklós Simonovits. Random walks in a convex body and an improved volume algorithm. Random Structures & Algorithms, 4(4):359–412, 1993.

[53] László Lovász and Santosh Vempala. Simulated annealing in convex bodies and an O∗(n⁴) volume algorithm. In FOCS, pages 650–659, 2003.

[54] László Lovász and Santosh Vempala. The geometry of logconcave functions and sampling algorithms. Random Struct. Algorithms, 30(3):307–358, 2007.

[55] Haihao Lu, Robert M Freund, and Yurii Nesterov. Relatively-smooth convex optimization by first-order methods, and applications. arXiv preprint arXiv:1610.05708, 2016.

[56] Paolo Manselli and Carlo Pucci. Maximum length of steepest descent curves for quasi-convex functions.
Geometriae Dedicata, 38(2):211227, 1991.

[57] Yu Nesterov. Introductory lectures on convex programming volume i: Basic course. Lecture notes, 1998.
[58] Donald J Newman. Location of the maximum on unimodal surfaces. Journal of the ACM (JACM),
12(3):395398, 1965.
[59] Constantin P.. Niculescu and Lars-Erik Persson. Convex Functions and Their Applications: A Contem-
porary Approach. Springer., 2018.

[60] Bernt Oksendal. Stochastic dierential equations: an introduction with applications. Springer Science
& Business Media, 2013.
[61] Lawrence E Payne and Hans F Weinberger. An optimal poincaré inequality for convex domains. Archive
for Rational Mechanics and Analysis, 5(1):286292, 1960.

[62] Luis Rademacher. Approximating the centroid is hard. In Proceedings of the 23rd ACM Symposium on
Computational Geometry, Gyeongju, South Korea, June 6-8, 2007, pages 302305, 2007.
[63] M. Rudelson. Random vectors in the isotropic position. Journal of Functional Analysis, 164:6072,
1999.
[64] Sushant Sachdeva, Nisheeth K Vishnoi, et al. Faster algorithms via approximation theory.
®
Foundations
and Trends in Theoretical Computer Science, 9(2):125210, 2014.

[65] Naum Z Shor. Cut-o method with space extension in convex programming problems. Cybernetics and
systems analysis, 13(1):9496, 1977.

[66] Miklós Simonovits. How to compute the volume in high dimension? Math. Program., 97(1-2):337374,
2003.
[67] Daniel A Spielman and Nikhil Srivastava. Graph sparsication by eective resistances. SIAM Journal
on Computing, 40(6):19131926, 2011.

[68] Nikhil Srivastava, Roman Vershynin, et al. Covariance estimation for distributions with 2+eps moments.
The Annals of Probability, 41(5):30813111, 2013.

[69] George J Stigler. The cost of subsistence. Journal of farm economics, 27(2):303314, 1945.
[70] Joel A Tropp et al. An introduction to matrix concentration inequalities. Foundations and Trends ®
in Machine Learning, 8(1-2):1230, 2015.

[71] Pravin M. Vaidya. A new algorithm for minimizing convex functions over convex sets. In 30th Annual
Symposium on Foundations of Computer Science, Research Triangle Park, North Carolina, USA, 30
October - 1 November 1989, pages 338343, 1989.
[72] S. Vempala. Geometric random walks: A survey. MSRI Combinatorial and Computational Geometry,
52:573612, 2005.
[73] S. S. Vempala. The Random Projection Method. AMS, 2004.
[74] Santosh Vempala. Geometric random walks: a survey. Combinatorial and computational geometry,
52(573-612):2, 2005.
[75] Andre Wibisono. Sampling as optimization in the space of measures: The langevin dynamics as a
composite optimization problem. arXiv preprint arXiv:1802.08089, 2018.
[76] David P Woodru et al. Sketching as a tool for numerical linear algebra. Foundations and Trends ® in
Theoretical Computer Science, 10(12):1157, 2014.

[77] David B Yudin and Arkadii S Nemirovski. Evaluation of the information complexity of mathematical
programming problems. Ekonomika i Matematicheskie Metody, 12:128142, 1976.
Appendix A

Calculus - Review

A.1 Tips for Computing Gradients


In this section, we give some tips on how to compute gradients and Hessians cleanly and efficiently.

A.1.1 Computing gradients via directional derivatives


Computing gradients coordinate-by-coordinate is usually not the best way: summation notation creates too many subscripts and is prone to mistakes, so we should avoid it as much as possible. Instead, it is usually better to compute the gradient via directional derivatives. Here, we give a few examples of this.
We define Df(x)[h] to be the directional derivative of f at x along the direction h. Namely,

    Df(x)[h] := (d/dt)|_{t=0} f(x + th).

Similarly, we use Dᵏf(x)[h₁, h₂, · · · , hₖ] to denote the directional k-th derivative of f at x along the directions h₁, · · · , hₖ.
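For example, for f(x) = (1/2)∥Ax − b∥₂² (a standard computation, included here for concreteness), we have

    Df(x)[h] = (Ax − b)⊤Ah = (A⊤(Ax − b))⊤h,

so ∇f(x) = A⊤(Ax − b), without writing a single coordinate.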
Lemma A.1. Given A ∈ R^{n×d} and a twice-differentiable f : R → R, let Φ(x) = Σ_{i=1}^n f(aᵢ⊤x), where aᵢ is the i-th row of A. Then, we have ∇Φ(x) = A⊤f′(Ax) and ∇²Φ(x) = A⊤ diag(f′′(Ax))A, where f′(Ax) is the vector defined by (f′(Ax))ᵢ = f′(aᵢ⊤x).

Proof. Note that

    DΦ(x)[h] = Σ_{i=1}^n f′(aᵢ⊤x) aᵢ⊤h,
    D²Φ(x)[h, h] = Σ_{i=1}^n f′′(aᵢ⊤x) (aᵢ⊤h)².

To write this in the traditional form, we note that

    ∇Φ(x)⊤h = DΦ(x)[h] = f′(Ax)⊤Ah = (A⊤f′(Ax))⊤h.

Since both sides are the same for all h, we have ∇Φ(x) = A⊤f′(Ax).
Similarly, we have

    h⊤∇²Φ(x)h = D²Φ(x)[h, h] = Σ_{i=1}^n (f′′(Ax))ᵢ (Ah)ᵢ² = h⊤A⊤ diag(f′′(Ax)) Ah.

Since ∇²Φ(x) − A⊤ diag(f′′(Ax))A is symmetric and both sides are the same for all h, we have ∇²Φ(x) = A⊤ diag(f′′(Ax))A.
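A quick way to guard against mistakes in such computations is a finite-difference check. Below is a minimal NumPy sketch of Lemma A.1; the choice f(t) = log(1 + eᵗ) is arbitrary and only for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 6, 4
    A = rng.standard_normal((n, d))
    x = rng.standard_normal(d)

    f  = lambda t: np.log1p(np.exp(t))    # f(t) = log(1 + e^t), an arbitrary smooth choice
    f1 = lambda t: 1 / (1 + np.exp(-t))   # f'
    f2 = lambda t: f1(t) * (1 - f1(t))    # f''
    Phi = lambda z: f(A @ z).sum()

    grad = A.T @ f1(A @ x)                # Lemma A.1: grad Phi(x) = A' f'(Ax)
    hess = A.T @ np.diag(f2(A @ x)) @ A   # Lemma A.1: Hessian = A' diag(f''(Ax)) A

    # Central finite differences of Phi and of the gradient formula.
    eps = 1e-6
    grad_fd = np.array([(Phi(x + eps * e) - Phi(x - eps * e)) / (2 * eps) for e in np.eye(d)])
    grad_fn = lambda z: A.T @ f1(A @ z)
    hess_fd = np.array([(grad_fn(x + eps * e) - grad_fn(x - eps * e)) / (2 * eps) for e in np.eye(d)])

    print(np.max(np.abs(grad - grad_fd)))   # tiny (up to finite-difference error)
    print(np.max(np.abs(hess - hess_fd)))   # tiny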

[Figure A.1: The fastest curve]

Exercise A.2. Use the above method to compute the gradient and Hessian of f(X) = log det(A⊤XA).
Here is a more complicated example.

Lemma A.3 (Brachistochrone Problem). Let (x, u(x)) be the curve from (0, 0) to (1, −1), where the first coordinate is the x axis and the second coordinate is the y axis. Suppose that this is the curve that takes the shortest time for a bead to slide along frictionlessly from (0, 0) to (1, −1) under uniform gravity. Then, we have that

    2uu′′ + (u′)² + 1 = 0.

Remark. Take a look at Wikipedia for the Brachistochrone curve. It is counterintuitive!

Proof. Given a curve u = u(x), the total travel time is

    T(u) = ∫₀¹ ds(x)/v(x) = ∫₀¹ √(1 + (u′(x))²)/v(x) dx,

where ds is the arc-length element and v(x) is the velocity at x. By conservation of energy, i.e., the gained kinetic energy must equal the lost potential energy at every point along the curve, we know that

    (1/2)·m·v(x)² = −m·g·u(x).

Hence, we have v(x) = √(−2gu(x)) and so

    T(u) = ∫₀¹ √((1 + (u′(x))²)/(−2gu(x))) dx.

Since u minimizes the travel time, any local change in u cannot reduce the time, i.e.,

    DT(u)[h] = 0

for any change h of the curve u. We next compute the directional derivative of T(u), i.e., (d/dt)|_{t=0} T(u + th). Since (d/dt)|_{t=0}(u + th) = h and (d/dt)|_{t=0}(u′ + th′) = h′, differentiating under the integral sign gives

    DT(u)[h] = −∫₀¹ (√(1 + u′(x)²)/(2√(−2gu(x))·u(x)))·h(x) dx + ∫₀¹ (u′(x)/(√(−2gu(x))·√(1 + u′(x)²)))·h′(x) dx.

Note that the second term involves h′(x). To change the term h′(x) to h(x), we use integration by parts (with respect to x, not t!):

    ∫₀¹ (u′(x)h′(x)/(√(−2gu(x))√(1 + (u′(x))²))) dx
        = [u′(x)h(x)/(√(−2gu(x))√(1 + (u′(x))²))]₀¹ − ∫₀¹ (d/dx)(u′(x)/(√(−2gu(x))√(1 + (u′(x))²)))·h(x) dx.

Since the endpoints of the curve are fixed, we have h(1) = h(0) = 0. Hence, the first term on the right-hand side is 0. Continuing,

    DT(u)[h] = −∫₀¹ (√(1 + u′(x)²)/(2√(−2gu(x))u(x)))·h(x) dx − ∫₀¹ (d/dx)(u′(x)/(√(−2gu(x))√(1 + u′(x)²)))·h(x) dx
             = −∫₀¹ (√(1 + (u′(x))²)/(2√(−2gu(x))u(x)))·h(x) dx − ∫₀¹ (u′′(x)/(√(−2gu(x))√(1 + u′(x)²)))·h(x) dx
               + ∫₀¹ (u′(x)²/(2√(−2gu(x))u(x)√(1 + u′(x)²)))·h(x) dx + ∫₀¹ (u′(x)²u′′(x)/(√(−2gu(x))(1 + u′(x)²)^{3/2}))·h(x) dx.

Hence, we have DT(u)[h] = ∫₀¹ a(x)h(x) dx where

    a(x) = −√(1 + (u′(x))²)/(2√(−2gu(x))u(x)) − u′′(x)/(√(−2gu(x))√(1 + u′(x)²))
           + u′(x)²/(2√(−2gu(x))u(x)√(1 + u′(x)²)) + u′(x)²u′′(x)/(√(−2gu(x))(1 + u′(x)²)^{3/2}).        (A.1)

Note that a(x) is the gradient of T. Since DT(u)[h] = 0 for all h(x), we have that a(x) = 0 for all x. Multiplying both sides of (A.1) by 2√(−2gu(x))(1 + (u′(x))²)^{3/2}·u(x), we have

    0 = −(1 + (u′(x))²)² − 2u(x)u′′(x)(1 + (u′(x))²) + u′(x)²(1 + (u′(x))²) + 2u(x)u′(x)²u′′(x)
      = −1 − u′(x)² − 2u(x)u′′(x),

which is the claimed equation 2uu′′ + (u′)² + 1 = 0.
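As a numerical sanity check, the cycloid x = r(θ − sin θ), u = −r(1 − cos θ) (the well-known closed-form solution, which we do not derive here) satisfies this equation. A minimal NumPy sketch follows; r ≈ 0.5729 is chosen so that the curve roughly passes through (1, −1), though the equation holds for every r > 0:

    import numpy as np

    r = 0.5729
    theta = np.linspace(0.5, 2.4, 20001)   # stay away from the cusp at theta = 0
    x = r * (theta - np.sin(theta))
    u = -r * (1 - np.cos(theta))

    du  = np.gradient(u, x)                # u'(x) by finite differences
    d2u = np.gradient(du, x)               # u''(x)

    residual = 2 * u * d2u + du**2 + 1
    print(np.max(np.abs(residual[100:-100])))   # small: the equation holds along the cycloid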

A.1.2 Taking derivatives on both sides


Suppose we have functions f(x, y) and g(x) such that f(x, g(x)) = 0. The implicit function theorem shows that

    Dₓf(x, g(x)) + D_y f(x, g(x)) Dg(x) = 0,

where Dₓf is the Jacobian of f with respect to the x variables and Dg(x) is the Jacobian of g. We note that this formula can be obtained by taking derivatives on both sides with respect to x. Sometimes, taking derivatives on both sides can greatly simplify calculations. Here are some examples.
Lemma A.4. Consider xₜ = argmin_{x∈R^n} fₜ(x) where the fₜ are strictly convex. Then, we have

    dxₜ/dt = −(∇²fₜ(xₜ))⁻¹ ∇(dfₜ/dt)(xₜ).

Proof. By the optimality condition, we have ∇fₜ(xₜ) = 0. Taking derivatives on both sides with respect to t, we have

    ∇²fₜ(xₜ)·(dxₜ/dt) + ∇(dfₜ/dt)(xₜ) = 0.

Since the fₜ are strictly convex, ∇²fₜ(xₜ) is positive definite and hence invertible, and the result follows.
In Section 5.5, we used this to compute the derivative of the central path.
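Here is a minimal NumPy sketch of Lemma A.4 on the test family fₜ(x) = (1/2)x⊤Ax − t·b⊤x (an arbitrary choice with the closed-form minimizer xₜ = t·A⁻¹b, so dxₜ/dt = A⁻¹b exactly):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5
    M = rng.standard_normal((n, n))
    A = M @ M.T + n * np.eye(n)        # positive definite, so f_t is strictly convex
    b = rng.standard_normal(n)

    x_t = lambda t: t * np.linalg.solve(A, b)   # argmin of f_t, in closed form

    t, eps = 1.3, 1e-6
    finite_diff = (x_t(t + eps) - x_t(t - eps)) / (2 * eps)

    # Lemma A.4: dx_t/dt = -(Hessian)^{-1} grad(df_t/dt)(x_t).
    # Here df_t/dt = -b'x, so grad(df_t/dt) = -b and the formula gives A^{-1} b.
    formula = -np.linalg.solve(A, -b)

    print(np.max(np.abs(finite_diff - formula)))   # tiny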

A.2 Solving optimization problems by hand


In this section, we introduce the KKT conditions and show how to use them to solve optimization problems by hand.

Theorem A.5 (Karush–Kuhn–Tucker theorem). Consider the optimization problem

    min_{x∈Ω} f(x) subject to hᵢ(x) ≤ 0 and ℓⱼ(x) = 0 for all i, j

for some open set Ω and continuously differentiable functions f, hᵢ and ℓⱼ. If x is a local minimum (and a mild regularity condition on the constraints holds at x), then there exist multipliers uᵢ, vⱼ such that x satisfies the KKT conditions:

• Stationarity: ∇f(x) + Σᵢ uᵢ∇hᵢ(x) + Σⱼ vⱼ∇ℓⱼ(x) = 0
• Complementary slackness: uᵢhᵢ(x) = 0 for all i
• Primal feasibility: hᵢ(x) ≤ 0 and ℓⱼ(x) = 0 for all i, j
• Dual feasibility: uᵢ ≥ 0 for all i
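To make the four conditions concrete, here is a minimal sketch on a made-up toy problem (not from the text): minimize (x₁ − 1)² + (x₂ − 1)² subject to x₁ + x₂ ≤ 1. The minimizer is the projection of (1, 1) onto the halfspace, namely (1/2, 1/2), with multiplier u = 1.

    import numpy as np

    x = np.array([0.5, 0.5])                # candidate minimizer
    u = 1.0                                 # candidate multiplier

    grad_f = 2 * (x - 1)                    # gradient of the objective
    grad_h = np.array([1.0, 1.0])           # gradient of h(x) = x1 + x2 - 1

    print(grad_f + u * grad_h)              # [0. 0.]    stationarity
    print(u * (x.sum() - 1))                # 0.0        complementary slackness
    print(x.sum() - 1 <= 0, u >= 0)         # True True  primal and dual feasibility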
We prove Hölder's inequality as an example:

Fact A.6. For any vectors x, y ∈ R^n, we have ∥xy∥₁ ≤ ∥x∥_p∥y∥_q (where xy denotes the entrywise product) for any 1 ≤ p ≤ ∞ and 1 ≤ q ≤ ∞ with 1/p + 1/q = 1.
Proof. By symmetry (flipping the signs of coordinates), it suffices to compute

    max_{∥x∥_p ≤ 1} Σᵢ xᵢyᵢ

for nonzero y with nonnegative entries. Now, we use the KKT theorem with f(x) = −Σᵢ xᵢyᵢ, h(x) = ∥x∥_p − 1 and Ω = R^n. By the KKT conditions, for any maximizer x, we have that

    ∇f(x) + u∇h(x) = 0,
    u·h(x) = 0,
    h(x) ≤ 0,  u ≥ 0.

Note that ∇f(x) = −y and (∇h(x))ᵢ = (1/p)∥x∥_p^{1−p} · p·xᵢ^{p−1} = ∥x∥_p^{1−p} · xᵢ^{p−1}. From the stationarity condition, we have

    y = u·∥x∥_p^{1−p} · x^{p−1},

where x^{p−1} denotes the entrywise power. To compute u, we note that y is nonzero and hence u ≠ 0. From complementary slackness, we have h(x) = 0 and hence ∥x∥_p = 1. Therefore, we have

    y = u · x^{p−1}.

Hence, we have 1 = Σᵢ xᵢ^p = Σᵢ (yᵢ/u)^{p/(p−1)} = Σᵢ (yᵢ/u)^q, and so u = ∥y∥_q.
Now, we can compute Σᵢ xᵢyᵢ as follows:

    Σᵢ xᵢyᵢ = Σᵢ (yᵢ/u)^{1/(p−1)} yᵢ = (1/u^{1/(p−1)}) Σᵢ yᵢ^{p/(p−1)} = ∥y∥_q.
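The equality case extracted in this proof is easy to check numerically. A minimal NumPy sketch (the exponent p = 3 and the random y are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(1)
    p = 3.0
    q = p / (p - 1)
    y = np.abs(rng.standard_normal(7))      # nonzero y with nonnegative entries

    u = np.linalg.norm(y, ord=q)            # the multiplier found in the proof
    x = (y / u) ** (1 / (p - 1))            # the maximizer x_i = (y_i/u)^{1/(p-1)}

    print(np.linalg.norm(x, ord=p))         # ~1.0: x lies on the unit p-sphere
    print(x @ y, u)                         # both ~ ||y||_q, so the inequality is tight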
Appendix B

Notation

Symbol        Description
o_ϵ(f)        o(f) for any fixed ϵ
⟨a, b⟩        Inner product a⊤b
nnz           Number of nonzeros
Õ(·)          Asymptotic complexity ignoring logarithmic terms
O∗(·)         Asymptotic complexity ignoring logarithmic terms and error terms
B_n           Unit Euclidean ball in R^n
B_p(x, r)     p-norm ball of radius r centered at x
∥·∥_p         p-norm: (Σᵢ |xᵢ|^p)^{1/p}