Lecture Notes 2
Chapter 1
Contents
1.1 Mathematical Background
1.1.1 Notation
1.1.2 The Cauchy-Schwarz inequality
1.1.3 The spectral norm
1.1.4 The mean value theorem
1.1.5 The fundamental theorem of calculus
1.1.6 Differentiability
1.2 Convex sets
1.2.1 The mean value inequality
1.3 Convex functions
1.3.1 First-order characterization of convexity
1.3.2 Second-order characterization of convexity
1.3.3 Operations that preserve convexity
1.4 Minimizing convex functions
1.4.1 Strictly convex functions
1.4.2 Example: Least squares
1.4.3 Constrained Minimization
1.5 Existence of a minimizer
1.5.1 Sublevel sets and the Weierstrass Theorem
1.6 Examples
1.6.1 Handwritten digit recognition
1.6.2 Master's Admission
1.7 Exercises
This chapter develops the basic theory of convex functions that we will
need later. Much of the material is also covered in other courses, so we will
refer to the literature for standard material and focus more on material that
we feel is less standard (but important in our context).
We also use
N = {1, 2, . . .} and R+ := {x ∈ R : x ≥ 0}
to denote the natural and non-negative real numbers, respectively. We are
freely using basic notions and material from linear algebra and analysis,
such as open and closed sets, vector spaces, matrices, continuity, conver-
gence, limits, triangle inequality, among others.
and this fraction can be used to define the angle α between u and v:
cos(α) = u⊤v / (∥u∥ ∥v∥),
where α ∈ [0, π]. The following shows the situation for two unit vectors
(∥u∥ = ∥v∥ = 1): The scalar product u⊤ v is the length of the projection of
v onto u (which is considered to be negative when α > π/2). This is just
the highschool definition of the cosine.
[Figure: two unit vectors u and v at angle α; u⊤v > 0 for α < π/2 and u⊤v < 0 for α > π/2; in the extreme cases v = u gives u⊤v = 1, and v = −u (α = π) gives u⊤v = −1.]
Proof of the Cauchy-Schwarz inequality. There are many proofs, but the authors particularly like this one: define the quadratic function
f(x) = Σ_{i=1}^{d} (u_i x + v_i)² = (Σ_{i=1}^{d} u_i²) x² + 2 (Σ_{i=1}^{d} u_i v_i) x + Σ_{i=1}^{d} v_i² =: ax² + bx + c.
We know that the equation f(x) = ax² + bx + c = 0 has the two solutions
x_{1,2} = (−b ± √(b² − 4ac)) / (2a).
This is known as the Mitternachtsformel in German-speaking countries, as
you are supposed to know it even when you are asleep at midnight.
Since by definition f(x) ≥ 0 for all x, the equation f(x) = 0 has at most one real solution, which is equivalent to the discriminant satisfying b² − 4ac ≤ 0. Plugging in the definitions of a, b, c, we get
b² − 4ac = (2 Σ_{i=1}^{d} u_i v_i)² − 4 (Σ_{i=1}^{d} u_i²)(Σ_{i=1}^{d} v_i²) = 4(u⊤v)² − 4 ∥u∥² ∥v∥² ≤ 0.
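As a quick numerical sanity check (an addition, not part of the original argument), one can verify the resulting inequality |u⊤v| ≤ ∥u∥ ∥v∥ on random vectors, for example in Python:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    d = int(rng.integers(1, 20))
    u, v = rng.standard_normal(d), rng.standard_normal(d)
    # Cauchy-Schwarz: |u^T v| <= ||u|| * ||v|| (small tolerance for rounding)
    assert abs(u @ v) <= np.linalg.norm(u) * np.linalg.norm(v) + 1e-12
```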
1.1.4 The mean value theorem
We also recall the mean value theorem that we will frequently need:
Theorem 1.3 (Mean value theorem). Let a < b be real numbers, and let h :
[a, b] → R be a continuous function that is differentiable on (a, b); we denote the
derivative by h′ . Then there exists c ∈ (a, b) such that
h′(c) = (h(b) − h(a)) / (b − a).
Geometrically, this means the following: We can interpret the value
(h(b) − h(a))/(b − a) as the slope of the line through the two points (a, h(a))
and (b, h(b)). Then the mean value theorem says that between a and b, we
find a tangent to the graph of h that has the same slope:
[Figure: the secant line through (a, h(a)) and (b, h(b)) and a parallel tangent to the graph of h at some c ∈ (a, b).]
1.1.6 Differentiability
For univariate functions f : dom(f ) → R with dom(f ) ⊆ R, differentia-
bility is covered in high school. We will need the concept for multivari-
ate and vector-valued functions f : dom(f ) → Rm with dom(f ) ⊆ Rd .
Mostly, we deal with the case m = 1: real-valued functions in d variables.
As we frequently need this material, we include a refresher here.
Recall the definition: f is differentiable at x in the interior of dom(f) if there exist a matrix A ∈ R^{m×d} and an error function r, defined in a neighborhood of 0, such that for all sufficiently small v,
f(x + v) = f(x) + Av + r(v),
where
lim_{v→0} ∥r(v)∥ / ∥v∥ = 0.
It then also follows that the matrix A is unique, and it is called the differential
or Jacobian of f at x. We will denote it by Df (x). More precisely, Df (x) is the
matrix of partial derivatives at the point x,
Df(x)_{ij} = ∂f_i(x) / ∂x_j.
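To make the definition concrete, here is a small numerical sketch (an addition, not from the notes) that checks the claimed gradient of f(x) = ∥x∥² against the definition: the remainder r(v) = f(x + v) − f(x) − ∇f(x)⊤v should vanish faster than ∥v∥.

```python
import numpy as np

f = lambda x: float(x @ x)    # f(x) = ||x||^2
grad_f = lambda x: 2 * x      # its gradient, as claimed in the text

x = np.array([1.0, -2.0, 0.5])
for eps in [1e-1, 1e-2, 1e-3, 1e-4]:
    v = eps * np.array([0.3, 0.4, -0.5])
    r = f(x + v) - f(x) - grad_f(x) @ v       # remainder r(v)
    print(eps, abs(r) / np.linalg.norm(v))    # ratio tends to 0 as v -> 0
```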
[Figure: the graph of f(y) and the affine approximation y ↦ f(x) + ∇f(x)⊤(y − x) at the point x.]
Example 1.6. Consider the function f(x) = x². We know that its derivative is f′(x) = 2x. But why? For fixed x and y = x + v, we compute
f(y) = f(x + v) = x² + 2xv + v² = f(x) + 2x · v + v²,
so with A = 2x and r(v) = v² we get |r(v)|/|v| = |v| → 0 as v → 0, hence Df(x) = f′(x) = 2x.
Here is an application of the chain rule that we will use frequently. Let
f : dom(f ) → Rm be a differentiable function with (open) convex domain,
and fix x, y ∈ dom(f). There is an open interval I containing [0, 1] such that x + t(y − x) ∈ dom(f) for all t ∈ I. Define g : I → R^d by g(t) = x + t(y − x) and set h = f ◦ g. Thus, h : I → R^m with h(t) = f(x + t(y − x)), and for all t ∈ I, the chain rule yields
h′(t) = Df(x + t(y − x)) (y − x).   (1.1)
For example, if f(x) = c⊤x for some c ∈ R^d, then ∇f(x) = c; and if f(x) = ∥x∥² = Σ_{j=1}^{d} x_j², then ∇f(x) = 2x.
To motivate it, let us consider the univariate and real-valued case first.
Let f : dom(f ) → R be differentiable and suppose that f has bounded
derivatives over an interval X ⊆ dom(f ), meaning that for some real
number B, we have |f′(x)| ≤ B for all x ∈ X. The mean value theorem then gives the mean value inequality
|f(y) − f(x)| = |f′(c)| |y − x| ≤ B |y − x|,   ∀x, y ∈ X,
with c between x and y. The multivariate analogue involves the spectral norm of the differential: if f is B-Lipschitz over an open X ⊆ dom(f), meaning (i) ∥f(y) − f(x)∥ ≤ B ∥y − x∥ for all x, y ∈ X, then (ii)
∥Df(x)∥ ≤ B, ∀x ∈ X.
Moreover, for every (not necessarily open) convex X ⊆ dom(f), (ii) implies (i), and this is the mean value inequality.
Proof. Suppose that f is B-Lipschitz over an open set X. For v ∈ R^d, v → 0, differentiability at x ∈ X yields for small v ∈ R^d that x + v ∈ X and therefore
B ∥v∥ ≥ ∥f(x + v) − f(x)∥ = ∥Df(x)v + r(v)∥ ≥ ∥Df(x)v∥ − ∥r(v)∥,
where ∥r(v)∥/∥v∥ → 0, the first inequality uses (i), and the last is the reverse triangle inequality. Rearranging and dividing by ∥v∥, we get
∥Df(x)v∥ / ∥v∥ ≤ B + ∥r(v)∥ / ∥v∥.
Let v⋆ be a unit vector such that ∥Df(x)∥ = ∥Df(x)v⋆∥ / ∥v⋆∥ and let v = tv⋆ for t → 0. Then we further get
∥Df(x)∥ ≤ B + ∥r(v)∥ / ∥v∥ → B,
We assume w.l.o.g. that f(x) ≠ f(y), as otherwise, (i) trivially holds; now we set
z = (f(y) − f(x)) / ∥f(y) − f(x)∥.
With this, the previous inequality reduces to (i), so f is indeed B-Lipschitz over X.
[Figure: for a convex function, the value f(λx + (1 − λ)y) lies below the chord value λf(x) + (1 − λ)f(y).]
so epi(f ) is a convex set. In the other direction, let epi(f ) be a convex set
and consider two points x, y ∈ dom(f ), λ ∈ [0, 1]. By convexity of epi(f ),
we have
epi(f ) ∋ λ(x, f (x)) + (1 − λ)(y, f (y)) = (λx + (1 − λ)y, λf (x) + (1 − λ)f (y)),
f(Σ_{i=1}^{m} λ_i x_i) ≤ Σ_{i=1}^{m} λ_i f(x_i).
Figure 1.4: Graph and epigraph of a non-convex function (left) and a convex function (right)
Lemma 1.15. There exists an (infinite dimensional) vector space V and a linear
function f : V → R such that f is discontinuous at all v ∈ V .
Proof. This is a classical example. Let us consider the vector space V of all
univariate polynomials; the vector space operations are addition of two
polynomials, and multiplication of a polynomial with a scalar. We con-
sider a polynomial such as 3x5 + 2x2 + 1 as a function x 7→ 3x5 + 2x2 + 1
over the domain [−1, 1].
The standard norm in a function space such as V is the supremum norm
∥ · ∥∞ , defined for any bounded function h : [−1, 1] → R via ∥h∥∞ :=
supx∈[−1,1] |h(x)|. Polynomials are continuous and as such bounded over
[−1, 1].
We now consider the linear function f : V → R defined by f (p) = p′ (0),
the derivative of p at 0. The function f is linear, simply because the deriva-
tive is a linear operator. As dom(f ) is the whole space V , dom(f ) is open.
We claim that f is discontinuous at 0 (the zero polynomial). Since f is
linear, this implies discontinuity at every polynomial p ∈ V . To prove dis-
continuity at 0, we first observe that f (0) = 0 and then show that there are
polynomials p of arbitrarily small supremum norm with f (p) = 1. Indeed,
for n, k ∈ N, n > 0, consider the polynomial
p_{n,k}(x) = (1/n) Σ_{i=0}^{k} (−1)^i (nx)^{2i+1}/(2i + 1)! = (1/n) ( nx − (nx)³/3! + (nx)⁵/5! − · · · ± (nx)^{2k+1}/(2k + 1)! ).
[Figure: the graph of a convex function f lies above the tangent hyperplane y ↦ f(x) + ∇f(x)⊤(y − x) at every point x.]
For f(x₁, x₂) = x₁² + x₂², we have ∇f(x) = (2x₁, 2x₂), hence (1.3) boils down to
y₁² + y₂² ≥ x₁² + x₂² + 2x₁(y₁ − x₁) + 2x₂(y₂ − x₂),
which after some rearranging of terms is equivalent to
(y₁ − x₁)² + (y₂ − x₂)² ≥ 0,
hence true. There are relevant convex functions that are not differentiable,
see Figure 1.6 for an example. More generally, Exercise 8 asks you to prove
that the ℓ1 -norm (or 1-norm) f (x) = ∥x∥1 is convex.
[Figure 1.6: the graph of f(x) = |x|, a convex function that is not differentiable at 0.]
Multiplying this by −1 yields (1.4).
For the other direction, suppose that monotonicty of the gradient (1.4)
holds. Then we in particular have
(∇f (x + t(y − x)) − ∇f (x))⊤ (t(y − x)) ≥ 0
for all x, y ∈ dom(f ) and t ∈ (0, 1). Dividing by t, this yields
(∇f(x + t(y − x)) − ∇f(x))⊤(y − x) ≥ 0.   (1.5)
Fix x, y ∈ dom(f ). For t ∈ [0, 1], let h(t) := f (x + t(y − x)). In our case
where f is real-valued, (1.1) yields h′ (t) = ∇f (x + t(y − x))⊤ (y − x), t ∈
(0, 1). Hence, (1.5) can be rewritten as
h′ (t) ≥ ∇f (x)⊤ (y − x), t ∈ (0, 1).
By the mean value theorem, there is c ∈ (0, 1) such that h′ (c) = h(1) − h(0).
Then
f (y) = h(1) = h(0) + h′ (c) = f (x) + h′ (c)
≥ f (x) + ∇f (x)⊤ (y − x).
This is the first-order characterization of convexity (Lemma 1.16).
(A symmetric matrix M is positive semidefinite, denoted by M ⪰ 0, if x⊤ M x ≥
0 for all x, and positive definite, denoted by M ≻ 0, if x⊤ M x > 0 for all x ̸= 0.)
h′ (t) = ∇f (x + tv)⊤ v,
h′′ (t) = v⊤ ∇2 f (x + tv)v.
The formula for h′ (t) has already been derived in the proof of Lemma 1.17,
and the formula for h′′ (t) is Exercise 9.
If f is convex, we always have h′′ (0) ≥ 0, as we will show next. Given
this, ∇2 f (x) ⪰ 0 follows for every x ∈ dom(f ): by openness of dom(f ),
for every v ∈ Rd of sufficiently small norm, there is y ∈ dom(f ) such that
v = y − x, and then v⊤ ∇2 f (x)v = h′′ (0) ≥ 0. By scaling, this inequality
extends to all v ∈ Rd .
To show h′′ (0) ≥ 0, we observe that for all sufficiently small δ, x + δv ∈
dom(f ) and hence
Geometrically, Lemma 1.18 means that the graph of f has non-negative curvature everywhere and hence “looks like a bowl”. For f(x₁, x₂) = x₁² + x₂², we have
∇²f(x) = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix},
which is a positive definite matrix. In higher dimensions, the same ar-
gument can be used to show that the squared distance dy (x) = ∥x −
y∥2 to a fixed point y is a convex function; see Exercise 4. The non-
squared Euclidean distance ∥x − y∥ is also convex in x, as a consequence
of Lemma 1.19(ii) below and the fact that every seminorm (in particular
the Euclidean norm ∥x∥) is convex (Exercise 10). The squared Euclidean
distance has the advantage that it is differentiable, while the Euclidean
distance itself (whose graph is an “ice cream cone” for d = 2) is not.
Lemma 1.19. (i) Let f₁, f₂, . . . , fₘ be convex functions and λ₁, λ₂, . . . , λₘ ∈ R₊. Then both f := max_{i=1}^{m} f_i and f := Σ_{i=1}^{m} λ_i f_i are convex on dom(f) := ∩_{i=1}^{m} dom(f_i).
(ii) Let f be a convex function with dom(f ) ⊆ Rd , g : Rm → Rd an affine
function, meaning that g(x) = Ax + b, for some matrix A ∈ Rd×m and
some vector b ∈ Rd . Then the function f ◦ g (that maps x to f (Ax + b))
is convex on dom(f ◦ g) := {x ∈ Rm : g(x) ∈ dom(f )}.
Lemma 1.21. Let x⋆ be a local minimum of a convex function f : dom(f ) → R.
Then x⋆ is a global minimum, meaning that f(x⋆) ≤ f(x) for all x ∈ dom(f).
Proof. Suppose there exists y ∈ dom(f ) such that f (y) < f (x⋆ ) and define
y′ := λx⋆ + (1 − λ)y for λ ∈ (0, 1). From convexity (1.2), we get
f(y′) ≤ λ f(x⋆) + (1 − λ) f(y) < f(x⋆).
Choosing λ so close to 1 that ∥y′ − x⋆∥ < ε yields a contradiction to x⋆ being a local minimum.
This does not mean that a convex function always has a global mini-
mum. Think of f (x) = x as a trivial example. But also if f is bounded from
below over dom(f ), it may fail to have a global minimum (f (x) = ex ).
To ensure the existence of a global minimum, we need additional condi-
tions. For example, it suffices if outside some ball B, all function values
are larger than some value f (x), x ∈ B. In this case, we can restrict f
to B, without changing the smallest attainable value. And on B (which is
compact), f attains a minimum by continuity (Lemma 1.14). An easy ex-
ample: for f (x1 , x2 ) = x21 + x22 , we know that outside any ball containing 0,
f (x) > f (0) = 0.
Another easy condition in the differentiable case is given by the follow-
ing result.
Proof. Suppose that ∇f (x)i ̸= 0 for some i. For t ∈ R, we define x(t) =
x + tei , where ei is the i-th unit vector. For |t| sufficiently small, we have
x(t) ∈ dom(f ) since dom(f ) is open. Let z(t) = f (x(t)). By the chain rule,
z ′ (0) = ∇f (x)⊤ ei = ∇f (x)i ̸= 0. Hence, z decreases in one direction as we
move away from 0, and this yields f (x(t)) < f (x) for some t, so x is not a
global minimum.
This means that the open line segment connecting (x, f (x)) and (y, f (y))
is pointwise strictly above the graph of f . For example, f (x) = x2 is strictly
convex.
Lemma 1.25 ([BV04, 3.1.4]). Suppose that dom(f ) is open and that f is twice
continuously differentiable. If the Hessian ∇2 f (x) ≻ 0 for every x ∈ dom(f )
(i.e., z⊤ ∇2 f (x)z > 0 for any z ̸= 0), then f is strictly convex.
The converse is false, though: f (x) = x4 is strictly convex but has van-
ishing second derivative at x = 0.
Lemma 1.26. Let f : dom(f ) → R be strictly convex. Then f has at most one
global minimum.
Proof. Suppose x⋆ ≠ y⋆ are two global minima with f_min = f(x⋆) = f(y⋆), and let z = ½ x⋆ + ½ y⋆. By (1.8),
f(z) < ½ f_min + ½ f_min = f_min,
a contradiction to x⋆ and y⋆ being global minima.
1.4.2 Example: Least squares
Suppose we want to fit a hyperplane to a set of data points x1 , . . . , xm in
Rd , based on the hypothesis that the points actually come (approximately)
from a hyperplane. A classical method for this is least squares. For con-
creteness, let us do this in R2 . Suppose that the data points are
(1, 10), (2, 11), (3, 11), (4, 10), (5, 9), (6, 10), (7, 9), (8, 10),
Also, for simplicity (and quite appropriately in this case), let us restrict
to fitting a linear model, or more formally to fit non-vertical lines of the
form y = w0 + w1 x. If (xi , yi ) is the i-th data point, the least squares fit
chooses w0 , w1 such that the least squares objective
f(w₀, w₁) = Σ_{i=1}^{8} (w₁ x_i + w₀ − y_i)²
is minimized. The objective f is a quadratic (hence twice continuously differentiable) function,
so we can check convexity directly using the second order condition. We have gradient
∇f(w₀, w₁) = (16w₀ + 72w₁ − 160, 72w₀ + 408w₁ − 706)
and Hessian
∇²f(w₀, w₁) = \begin{pmatrix} 16 & 72 \\ 72 & 408 \end{pmatrix}.
A symmetric 2 × 2 matrix is positive definite if its diagonal elements and its determinant are positive, which is the case here, so f is actually strictly convex and has a unique global minimum. To find it, we solve the linear
system ∇f(w₀, w₁) = (0, 0) of two equations in two unknowns and obtain the global minimum
(w₀⋆, w₁⋆) = (43/4, −1/6).
Hence, the “optimal” line is
y = −(1/6) x + 43/4,
see Figure 1.7 (right).
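As a quick check (an added sketch, not part of the notes), the same fit can be reproduced numerically via the least squares solver of NumPy:

```python
import numpy as np

x = np.arange(1, 9, dtype=float)
y = np.array([10, 11, 11, 10, 9, 10, 9, 10], dtype=float)
A = np.column_stack([np.ones_like(x), x])       # columns for w0 and w1
w0, w1 = np.linalg.lstsq(A, y, rcond=None)[0]   # minimizes sum (w1*x_i + w0 - y_i)^2
print(w0, w1)   # approximately 10.75 = 43/4 and -0.1667 = -1/6
```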
f (x) ≤ f (y) ∀y ∈ X.
∇f (x⋆ )⊤ (x − x⋆ ) ≥ 0 ∀x ∈ X.
If X does not contain the global minimum, then Lemma 1.28 has a
nice geometric interpretation. Namely, it means that X is contained in the
halfspace {x ∈ Rd : ∇f (x⋆ )⊤ (x − x⋆ ) ≥ 0} (normal vector ∇f (x⋆ ) at x⋆
pointing into the halfspace); see Figure 1.8. In still other words, x − x⋆
forms a non-obtuse angle with ∇f (x⋆ ) for all x ∈ X.
[Figure 1.8: the set X is contained in the halfspace {x : ∇f(x⋆)⊤(x − x⋆) ≥ 0}; the normal vector ∇f(x⋆) at x⋆ points into the halfspace.]
or
minimize f(x) subject to x ∈ X.   (1.11)
convex function. To avoid technicalities, we restrict ourselves to the case
dom(f ) = Rd .
Figure 1.9: Sublevel set of a non-convex function (left) and a convex function (right)
It is easy to see from the definition that every sublevel set of a convex
function is convex. Moreover, as a consequence of continuity of f , sublevel
sets are closed. The following (known as the Weierstrass Theorem) just
formalizes an argument that we have made earlier.
Theorem 1.30. Let f : R^d → R be a continuous function, and suppose there is a nonempty and bounded sublevel set f^{≤α}. Then f has a global minimum.
Proof. As the set (−∞, α] is closed, its pre-image f^{≤α} under the continuous function f is closed. We know that f—as a continuous function—attains a minimum over the (non-empty) closed and bounded (= compact) set f^{≤α} at some x⋆. This x⋆ is also a global minimum as it has value f(x⋆) ≤ α, while any x ∉ f^{≤α} has value f(x) > α ≥ f(x⋆).
Note that Theorem 1.30 holds for convex functions as convexity on Rd
implies continuity (Exercise 3).
1.6 Examples
In the following two sections, we give two examples of convex function
minimization tasks that arise from machine learning applications.
Figure 1.10: Some training images from the MNIST data set (picture from http://corochann.com/mnist-dataset-introduction-1138.html)
The classical approach is the following. We represent an image as a
feature vector x ∈ R784 , where xi is the gray value of the i-th pixel (in some
order). During the training phase, we compute a matrix W ∈ R10×784 and
then use the vector y = W x ∈ R10 to predict the digit seen in an arbitrary
image x. The idea is that yj , j = 0, . . . , 9 corresponds to the probability
of the digit being j. This does not work directly, since the entries of y
may be negative and generally do not sum up to 1. But we can convert y
to a vector z of actual probabilities, such that a small yj leads to a small
probability zj and a large yj to a large probability zj . How to do this is not
canonical, but here is a well-known formula that works:
z_j = z_j(y) = e^{y_j} / Σ_{k=0}^{9} e^{y_k}.   (1.12)
This function “punishes” images for which the correct digit j has low
probability zj (corresponding to a significantly negative value of log zj ).
In an ideal world, the correct digit would always have probability 1, re-
sulting in ℓ(W ) = 0. But under (1.12), probabilities are always strictly
between 0 and 1, so we have ℓ(W ) > 0 for all W .
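The following Python sketch (an addition; the exact definition of ℓ is not reproduced above) illustrates the softmax conversion (1.12) and a loss of the kind described: the negative log-probabilities of the correct digits, summed over the training images.

```python
import numpy as np

def softmax(y):
    """Convert scores y in R^10 into probabilities z via (1.12)."""
    e = np.exp(y - y.max())      # shifting by max(y) avoids overflow, same result
    return e / e.sum()

def loss(W, images, labels):
    """Assumed form of ell(W): sum of -log z_j over the training set,
    where j is the correct digit of each image."""
    total = 0.0
    for x, j in zip(images, labels):
        z = softmax(W @ x)
        total -= np.log(z[j])    # large when the correct digit has low probability
    return total
```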
Exercise 6 asks you to prove that ℓ is convex. In Exercise 7, you will
characterize the situations in which ℓ has a global minimum.
(rough) forecast of the applicant’s performance in the MSc program, based
on the submitted documents.1
Data on the actual performance of students admitted in the past is
available. To keep things simple in the following example, let us base
the forecast on GPA (grade point average) and TOEFL (Test of English as
a Foreign Language) only. GPA scores are normalized to a scale with a
minimum of 0.0 and a maximum of 4.0, where admission starts from 3.5.
TOEFL scores are on an integer scale between 0 and 120, where admission
starts from 100.
Table 1.1 contains the known data. GGPA (graduation grade point av-
erage on a Swiss grading scale) is the average grade obtained by an ad-
mitted student over all courses in the MSc program. The Swiss scale goes
from 1 to 6 where 1 is the lowest grade, 6 is the highest, and 4 is the lowest
passing grade.
GPA TOEFL GGPA
3.52 100 3.92
3.66 109 4.34
3.76 113 4.80
3.74 100 4.67
3.93 100 5.52
3.88 115 5.44
3.77 115 5.04
3.66 107 4.73
3.87 106 5.03
3.84 107 5.06
Table 1.1: Data for 10 admitted students: GPA and TOEFL scores (at time
of application), GGPA (at time of graduation)
squares objective would be somewhat ugly; we already saw this in our
previous example (1.9), where the data points had large second coordinate,
resulting in the w1 -scale being very different from the w2 -scale. This time,
we normalize first, so that w₁ and w₂ become comparable and allow us to
understand the relative influences of GPA and TOEFL.
The general setting is this: we have n inputs x1 , . . . , xn , where each vec-
tor xi ∈ Rd consists of d input variables; then we have n outputs y1 , . . . , yn ∈
R. Each pair (xi , yi ) is an observation. In our case, d = 2, n = 10, and for
example, ((3.93, 100), 5.52) is an observation (of a student doing very well).
With variable weights w0 , w = (w1 , . . . , wd ) ∈ Rd , we plan to minimize
the least squares objective
f(w₀, w) = Σ_{i=1}^{n} (w₀ + w⊤x_i − y_i)².
We first want to assume that the inputs and outputs are centered, mean-
ing that
(1/n) Σ_{i=1}^{n} x_i = 0,   (1/n) Σ_{i=1}^{n} y_i = 0.
After centering, the global minimum (w0⋆ , w⋆ ) of the least squares ob-
jective satisfies w0⋆ = 0 while w⋆ is unaffected by centering (Exercise 11),
so that we can simply omit the variable w0 in the sequel.
Finally, we assume that all d input variables are on the same scale,
meaning that
(1/n) Σ_{i=1}^{n} x_{ij}² = 1,   j = 1, . . . , d.
To achieve this for fixed j (assuming that no variable is 0 in all inputs), we multiply all x_{ij} by s(j) = √(n / Σ_{i=1}^{n} x_{ij}²) (which, in the optimal solution w⋆, just multiplies w_j⋆ by 1/s(j), an argument very similar to the one in Exercise 11). For our data set, the resulting normalized data are shown in Table 1.2 (right). Now the least squares objective (after omitting w₀) is
f(w₁, w₂) = Σ_{i=1}^{10} (w₁ x_{i1} + w₂ x_{i2} − y_i)²
          ≈ 10w₁² + 10w₂² + 1.99 w₁w₂ − 8.7 w₁ − 2.79 w₂ + 2.09.
This is minimized at
(w₁⋆, w₂⋆) ≈ (0.43, 0.097),   (1.15)
so the predicted outputs are y_i⋆ ≈ 0.43 x_{i1} + 0.097 x_{i2}
in the normalized data. This can quickly be checked, and the results are
not perfect, but not too bad, either; see Table 1.3 (ignore the last column
for now).
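As a quick check (an added sketch), the minimizer (1.15) can be recovered by solving the 2 × 2 linear system obtained from setting the gradient of the quadratic above to zero:

```python
import numpy as np

# gradient of 10*w1^2 + 10*w2^2 + 1.99*w1*w2 - 8.7*w1 - 2.79*w2 + 2.09 set to zero
H = np.array([[20.0, 1.99],
              [1.99, 20.0]])
rhs = np.array([8.7, 2.79])
print(np.linalg.solve(H, rhs))   # approximately [0.43, 0.097]
```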
What we also see from (1.15) is that the first input variable (GPA) has a
much higher influence on the output (GGPA) than the second one (TOEFL).
In fact, if we drop the second one altogether, we obtain outputs zi⋆ (last col-
umn in Table 1.3) that seem equivalent to the predicted outputs yi⋆ within
the level of noise that we have anyway.
We conclude that TOEFL scores are probably not indicative for the per-
formance of admitted students, so the admission committee should not
care too much about them. Requiring a minimum score of 100 might make
sense, but whenever an applicant reaches at least this score, the actual
value does not matter.
xi1 xi2 yi yi⋆ zi⋆
-2.04 -1.28 -0.94 -1.00 -0.87
-0.88 0.32 -0.52 -0.35 -0.37
-0.05 1.03 -0.05 0.08 -0.02
-0.16 -1.28 -0.18 -0.19 -0.07
1.42 -1.28 0.67 0.49 0.61
1.02 1.39 0.59 0.57 0.44
0.06 1.39 0.19 0.16 0.03
-0.88 -0.04 -0.12 -0.38 -0.37
0.89 -0.21 0.17 0.36 0.38
0.62 -0.04 0.21 0.26 0.27
Table 1.3: Outputs yi⋆ predicted by the linear model (1.15) and by the model
zi⋆ = 0.43xi1 that simply ignores the second input variable
33
Pd
j=1 |wj |):
Pn ⊤ 2
minimize i=1 ∥w xi − yi ∥ (1.16)
subject to ∥w∥1 ≤ R,
where R ∈ R+ is some parameter. In our case, if we for example
[Figure: the contour line 10w₁² + 10w₂² + 1.99w₁w₂ − 8.7w₁ − 2.79w₂ + 2.09 = 0.75 of the least squares objective and the point (0.43, 0.097), for R = 0.4.]
Even though we have presented a toy example in this section, the back-
ground is real. The theory of admission and in particular performance
forecasts has been developed in a recent PhD thesis by Zimmermann [Zim16].
1.7 Exercises
Exercise 1. Prove that a differentiable function is continuous!
Exercise 4. Prove that the function dy : Rd → R, x 7→ ∥x − y∥2 is strictly
convex for any y ∈ Rd . (Use Lemma 1.25.)
Exercise 5. Prove Lemma 1.19! Can (ii) be generalized to show that for two
convex functions f, g, the function f ◦ g is convex as well?
Exercise 7. Consider the logistic regression problem with two classes. Given a
training set P consisting of datapoint and label pairs (x, y) where x ∈ Rd and
y ∈ {−1, +1}, we define our loss ℓ for weight vector w ∈ Rd to be
ℓ(w) = Σ_{(x,y)∈P} (− ln z(y w⊤x)),
y(w⊤ x) ≥ 0 .
y(w⊤ x) = 0 .
Exercise 8. Prove that the function f(x) = ∥x∥₁ = Σ_{i=1}^{d} |x_i| (ℓ₁-norm) is convex!
Exercise 10. A seminorm is a function f : Rd → R satisfying the following two
properties for all x, y ∈ Rd and all λ ∈ R.
Prove that w0⋆ = 0. Also, suppose x′i and yi′ are such that for all i, x′i = xi + q,
yi′ = yi + r. Show that (w0 , w) minimizes f if and only if (w0 − w⊤ q + r, w)
minimizes
f′(w₀, w) = Σ_{i=1}^{n} (w₀ + w⊤x_i′ − y_i′)².
Chapter 2
Gradient Descent
Contents
2.1 Overview
2.1.1 Convergence rates
2.2 The algorithm
2.3 Vanilla analysis
2.4 Lipschitz convex functions: O(1/ε²) steps
2.5 Smooth convex functions: O(1/ε) steps
2.6 Acceleration for smooth convex functions: O(1/√ε) steps
2.7 Interlude
2.8 Smooth and strongly convex functions: O(log(1/ε)) steps
2.9 Exercises
2.1 Overview
The gradient descent algorithm (including variants such as projected or
stochastic gradient descent) is the most useful workhorse for minimizing
loss functions in practice. The algorithm is extremely simple and surpris-
ingly robust in the sense that it also works well for many loss functions
that are not convex. While it is easy to construct (artificial) non-convex
functions on which gradient descent goes completely astray, such func-
tions do not seem to be typical in practice; however, understanding this
on a theoretical level is an open problem, and only a few results exist in this direction.
The vast majority of theoretical results concerning the performance of
gradient descent hold for convex functions only. In this and the following
chapters, we will present some of these results, but maybe more impor-
tantly, the main ideas behind them. As it turns out, the number of ideas
that we need is rather small, and typically, they are shared between dif-
ferent results. Our approach is therefore to fully develop each idea once,
in the context of a concrete result. If the idea reappears, we will typically
only discuss the changes that are necessary in order to establish a new re-
sult from this idea. In order to avoid boredom from ideas that reappear
too often, we omit other results and variants that one could also get along
the lines of what we discuss.
Let f : Rd → R be a convex and differentiable function. We also assume
that f has a global minimum x⋆ , and the goal is to find (an approximation
of) x⋆ . This usually means that for a given ε > 0, we want to find x ∈ Rd
such that
f (x) − f (x⋆ ) < ε.
Notice that we are not making an attempt to get near to x⋆ itself — there
can be several minima y⋆ ̸= x⋆ with f (x⋆ ) = f (y⋆ ).
Gradient descent is an iterative method, meaning that it generates a sequence x₀, x₁, . . . of solutions such that in some iteration T, we eventually have f(x_T) − f(x⋆) < ε.
Table 2.1 gives an overview of the results that we will prove. They con-
cern several variants of gradient descent as well as several classes of func-
tions. The significance of each algorithm and function class will briefly be
discussed when it first appears.
In Chapter 6, we will also look at gradient descent on functions that
gradient descent: Thm. 2.1, O(1/ε²) (Lipschitz convex); Thm. 2.8, O(1/ε) (smooth convex); Thm. 2.14, O(log(1/ε)) (smooth & strongly convex)
accelerated gradient descent: Thm. 2.9, O(1/√ε) (smooth convex)
projected gradient descent: Thm. 3.2, O(1/ε²) (Lipschitz convex); Thm. 3.4, O(1/ε) (smooth convex); Thm. 3.5, O(log(1/ε)) (smooth & strongly convex)
proximal gradient descent: Thm. 3.14, O(1/ε) (smooth convex)
subgradient descent: Thm. 4.7, O(1/ε²) (Lipschitz convex); Thm. 4.11, O(1/ε) (strongly convex)
stochastic gradient descent: Thm. 5.1, O(1/ε²) (Lipschitz convex); Thm. 5.2, O(1/ε) (strongly convex)
Table 2.1: Results on gradient descent. Next to each theorem, the number of steps is given which the respective variant needs on the respective function class to achieve additive approximation error at most ε.
are not convex. In this case, provably small approximation error can still
be obtained for some particularly well-behaved functions (we will give an
example). For smooth (but not necessarily convex) functions, we gener-
ally cannot show convergence in error, but a (much) weaker convergence
property still holds.
error measures. An algorithm is said to exhibit (at least) linear convergence
whenever there is a real number 0 < c < 1 such that
εt+1 ≤ cεt for all sufficiently large t.
The word linear comes from the fact that the error in step t + 1 is bounded
by a linear function of the error in step t.
This means that for t large enough, the error goes down by at least a
constant factor in each step. Linear convergence implies that an error of at
most ε is achieved within O(log(1/ε)) iterations. For example, this is the
bound provided by Theorem 2.14 (last entry in the first row of Table 2.1),
and it is proved by showing linear convergence of the algorithm.
The term superlinear convergence refers to an algorithm for which there
are constants r > 1 and c > 0 such that
εt+1 ≤ c(εt )r for all sufficiently large t.
The case r = 2 is known as quadratic convergence. Under quadratic con-
vergence, an error of at most ε is achieved within O(log log(1/ε)) iterations.
We will see an algorithm with quadratic convergence in Chapter 7.
If a (converging) algorithm does not exhibit at least linear convergence,
we say that it has sublinear convergence. One can also quantify sublinear
convergence more precisely if needed.
To get any decrease in function value at all, we have to choose vt such that
∇f (xt )⊤ vt < 0. But among all steps vt of the same length, we should in
fact choose the one with the most negative value of ∇f (xt )⊤ vt , so that we
maximize our decrease in function value. This is achieved when vt points
into the direction of the negative gradient −∇f (xt ). But as differentiability
guarantees decrease only for small steps, we also want to control how far
we go along the direction of the negative gradient.
Therefore, the step of gradient descent is defined by
x_{t+1} := x_t − γ ∇f(x_t).   (2.1)
Here, γ > 0 is a fixed stepsize, but it may also make sense to have γ depend
on t. For now, γ is fixed. We hope that for some reasonably small integer
t, in the t-th iteration we get that f (xt ) − f (x⋆ ) < ε; see Figure 2.1 for an
example.
Now it becomes clear why we are assuming that dom(f ) = Rd : The
update step (2.1) may in principle take us “anywhere”, so in order to get
a well-defined algorithm, we want to make sure that f is defined and dif-
ferentiable everywhere.
The choice of γ is critical for the performance. If γ is too small, the
process might take too long, and if γ is too large, we are in danger of
overshooting. It is not clear at this point whether there is a “right” stepsize.
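To make the iteration concrete, here is a minimal Python sketch of (2.1) with a fixed stepsize; the function, gradient, stepsize, and starting point below are illustrative choices, not taken from the notes.

```python
import numpy as np

def gradient_descent(grad_f, x0, gamma, num_steps):
    """Iterate x_{t+1} = x_t - gamma * grad_f(x_t) and return all iterates."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(num_steps):
        xs.append(xs[-1] - gamma * grad_f(xs[-1]))
    return xs

# example: f(x) = ||x||^2 with gradient 2x, global minimum at 0
iterates = gradient_descent(lambda x: 2 * x, x0=[4.0, -2.0], gamma=0.1, num_steps=50)
print(iterates[-1])   # close to [0, 0]
```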
g_t⊤(x_t − x⋆) = (1/γ) (x_t − x_{t+1})⊤(x_t − x⋆).   (2.3)
[Figure 2.1: iterates x₀, x₁, . . . , x₅ of gradient descent on a function of two variables x₁, x₂.]
Now we apply (somewhat out of the blue, but this will clear up in the next
step) the basic vector equation 2v⊤ w = ∥v∥2 + ∥w∥2 − ∥v − w∥2 (a.k.a. the
cosine theorem) to rewrite the same expression as
g_t⊤(x_t − x⋆) = (1/(2γ)) (∥x_t − x_{t+1}∥² + ∥x_t − x⋆∥² − ∥x_{t+1} − x⋆∥²)
             = (1/(2γ)) (γ² ∥g_t∥² + ∥x_t − x⋆∥² − ∥x_{t+1} − x⋆∥²)
             = (γ/2) ∥g_t∥² + (1/(2γ)) (∥x_t − x⋆∥² − ∥x_{t+1} − x⋆∥²).   (2.4)
Next we sum this up over the iterations t, so that the latter two terms in the bracket cancel in a telescoping sum:
Σ_{t=0}^{T−1} g_t⊤(x_t − x⋆) = (γ/2) Σ_{t=0}^{T−1} ∥g_t∥² + (1/(2γ)) (∥x_0 − x⋆∥² − ∥x_T − x⋆∥²)
                            ≤ (γ/2) Σ_{t=0}^{T−1} ∥g_t∥² + (1/(2γ)) ∥x_0 − x⋆∥².   (2.5)
This gives us an upper bound for the average error f (xt ) − f (x⋆ ), t =
0, . . . , T − 1, hence in particular for the error incurred by the iterate with
the smallest function value. The last iterate is not necessarily the best one:
gradient descent with fixed stepsize γ will in general also make steps that
overshoot and actually increase the function value; see Exercise 15(i).
The question is of course: is this result any good? In general, the an-
swer is no. A dependence on ∥x0 − x⋆ ∥ is to be expected (the further we
start from x⋆ , the longer we will take); the dependence on the squared gra-
dients ∥gt ∥2 is more of an issue, and if we cannot control them, we cannot
say much.
Assuming bounded gradients rules out many interesting functions,
though. For example, f (x) = x2 (a supermodel in the world of convex
functions) already doesn’t qualify, as ∇f (x) = 2x—and this is unbounded
as x tends to infinity. But let’s care about supermodels later.
Theorem 2.1. Let f : R^d → R be convex and differentiable with a global minimum x⋆; furthermore, suppose that ∥x_0 − x⋆∥ ≤ R and ∥∇f(x)∥ ≤ B for all x. Choosing the stepsize
γ := R / (B√T),
gradient descent (2.1) yields
(1/T) Σ_{t=0}^{T−1} (f(x_t) − f(x⋆)) ≤ RB/√T.
Hence, to achieve average error at most ε, it suffices to perform
T ≥ R²B²/ε²
many iterations. This is not particularly good when it comes to concrete
numbers (think of desired error ε = 10−6 when R, B are somewhat larger).
On the other hand, the number of steps does not depend on d, the di-
mension of the space. This is very important since we often optimize in
high-dimensional spaces. Of course, R and B may depend on d, but in
many relevant cases, this dependence is mild.
What happens if we don’t know R and/or B? An idea is to “guess”
R and B, run gradient descent with T and γ resulting from the guess,
check whether the result has absolute error at most ε, and repeat with a
different guess otherwise. This fails, however, since in order to compute
the absolute error, we need to know f (x⋆ ) which we typically don’t. But
Exercise 16 asks you to show that knowing R is sufficient.
Next we want to look at functions for which f (y) can be bounded from
above by f (x)+∇f (x)⊤ (y−x), up to at most quadratic error. The following
definition applies to all differentiable functions, convexity is not required.
f(y) ≤ f(x) + ∇f(x)⊤(y − x) + (L/2) ∥x − y∥²,   ∀x, y ∈ X.   (2.8)
If X = dom(f ), f is simply called smooth.
Recall that (2.7) says that for any x, the graph of f is above its tangential
hyperplane at (x, f (x)). In contrast, (2.8) says that for any x ∈ X, the
graph of f is below a not-too-steep tangential paraboloid at (x, f (x)); see
Figure 2.2.
This notion of smoothness has become standard in convex optimiza-
tion, but the naming is somewhat unfortunate, since there is an (older)
definition of a smooth function in mathematical analysis where it means a
function that is infinitely often differentiable.
We have the following simple characterization of smoothness.
Lemma 2.3 (Exercise 13). Suppose that dom(f ) is open and convex, and that
f : dom(f ) → R is differentiable. Let L ∈ R+ . Then the following two state-
ments are equivalent.
[Figure 2.2: the graph of f lies above its tangent hyperplane f(x) + ∇f(x)⊤(y − x) and, by smoothness, below the paraboloid f(x) + ∇f(x)⊤(y − x) + (L/2)∥x − y∥².]
Let us discuss some cases. If L = 0, (2.7) and (2.8) together require that
Q⊤)x, where ½(Q + Q⊤) is symmetric. Therefore, we can assume without loss of generality that Q is symmetric, i.e., it suffices to show that quadratic functions defined by symmetric matrices are smooth.
Lemma 2.4 (Exercise 14). Let f (x) = x⊤ Qx+b⊤ x+c, where Q is a symmetric
(d × d) matrix, b ∈ Rd , c ∈ R. Then f is smooth with parameter 2 ∥Q∥, where
∥Q∥ is the spectral norm of Q (Definition 1.2).
Lemma 2.6 (Exercise 17).
(i) Let f₁, f₂, . . . , fₘ be smooth with parameters L₁, L₂, . . . , Lₘ, and let λ₁, λ₂, . . . , λₘ ∈ R₊. Then the function f := Σ_{i=1}^{m} λ_i f_i is smooth with parameter Σ_{i=1}^{m} λ_i L_i over dom(f) := ∩_{i=1}^{m} dom(f_i).
The obvious question resulting from this was whether there actually
exists a first-order method that has additive error O(1/T 2 ) after T steps, on
every smooth function. This was answered in the affirmative by Nesterov
in 1983 when he proposed an algorithm that is now known as (Nesterov’s)
accelerated gradient descent [Nes83]. Nesterov’s book (Sections 2.1 and 2.2) is a comprehensive source for both lower and upper bounds [Nes18].
It is not easy to understand why the accelerated gradient descent algo-
rithm is an optimal first-order method, and how Nesterov even arrived at
it. A number of alternative derivations of optimal algorithms have been
given by other authors, usually claiming that they provide a more natural
or easier-to-grasp approach. However, each alternative approach requires
some understanding of other things, and there is no well-established “sim-
plest approach”. Here, we simply throw the algorithm at the reader, with-
out any attempt to motivate it beyond some obvious words. Then we
present a short proof that the algorithm is indeed optimal.
Let f : Rd → R be convex, differentiable, and smooth with parame-
ter L. Accelerated gradient descent is the following algorithm: choose z0 =
y0 = x0 arbitrary. For t ≥ 0, set
y_{t+1} := x_t − (1/L) ∇f(x_t),   (2.11)
z_{t+1} := z_t − ((t + 1)/(2L)) ∇f(x_t),   (2.12)
x_{t+1} := ((t + 1)/(t + 3)) y_{t+1} + (2/(t + 3)) z_{t+1}.   (2.13)
This means, we are performing a normal “smooth step” from xt to obtain
yt+1 and a more aggressive step from zt to get zt+1 . The next iterate xt+1
is a weighted average of yt+1 and zt+1 , where we compensate for the more
aggressive step by giving zt+1 a relatively low weight.
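A direct transcription of (2.11), (2.12), and (2.13) into Python might look as follows (a sketch; the function, its gradient, and the smoothness parameter L are placeholders to be supplied):

```python
import numpy as np

def accelerated_gradient_descent(grad_f, L, x0, num_steps):
    """Nesterov's accelerated gradient descent, steps (2.11)-(2.13)."""
    x = np.asarray(x0, dtype=float)
    y, z = x.copy(), x.copy()
    for t in range(num_steps):
        g = grad_f(x)
        y = x - g / L                        # (2.11): ordinary smooth step
        z = z - (t + 1) / (2 * L) * g        # (2.12): more aggressive step
        x = ((t + 1) * y + 2 * z) / (t + 3)  # (2.13): weighted average
    return y
```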
Theorem 2.9. Let f : Rd → R be convex and differentiable with a global min-
imum x⋆ ; furthermore, suppose that f is smooth with parameter L according
to (2.8). Accelerated gradient descent (2.11), (2.12), and (2.13), yields
f(y_T) − f(x⋆) ≤ 2L ∥z_0 − x⋆∥² / (T(T + 1)),   T > 0.
Comparing this bound with the one from Theorem 2.8, we see that the
error is now indeed O(1/T 2 ) instead of O(1/T ); to reach error at most ε,
accelerated gradient descent therefore only needs O(1/√ε) steps instead of O(1/ε).
Proof. The analysis uses a potential function argument [BG17]. We assign a
potential Φ(t) to each time t and show that Φ(t + 1) ≤ Φ(t). The potential
is
Φ(t) := t(t + 1) (f (yt ) − f (x⋆ )) + 2L ∥zt − x⋆ ∥2 .
If we can show that the potential always decreases, we get
T(T + 1) (f(y_T) − f(x⋆)) ≤ Φ(T) ≤ Φ(0) = 2L ∥z_0 − x⋆∥²,
which is the statement of the theorem. To bound Φ(t + 1) − Φ(t), we use three ingredients: (i) sufficient decrease (Lemma 2.7) for step (2.11):
f(y_{t+1}) ≤ f(x_t) − (1/(2L)) ∥∇f(x_t)∥²;   (2.14)
(ii) the vanilla analysis (Section 2.3) for step (2.12) with γ = (t+1)/(2L), g_t = ∇f(x_t):
g_t⊤(z_t − x⋆) = ((t+1)/(4L)) ∥g_t∥² + (L/(t+1)) (∥z_t − x⋆∥² − ∥z_{t+1} − x⋆∥²);   (2.15)
(iii) convexity:
f(x_t) − f(v) ≤ g_t⊤(x_t − v),   v ∈ {y_t, x⋆}.   (2.16)
Now,
∆ := (Φ(t + 1) − Φ(t)) / (t + 1)
can be bounded as follows:
∆ = t (f(y_{t+1}) − f(y_t)) + 2 (f(y_{t+1}) − f(x⋆)) + (2L/(t+1)) (∥z_{t+1} − x⋆∥² − ∥z_t − x⋆∥²)
  = t (f(y_{t+1}) − f(y_t)) + 2 (f(y_{t+1}) − f(x⋆)) + ((t+1)/(2L)) ∥g_t∥² − 2 g_t⊤(z_t − x⋆)   [by (2.15)]
  ≤ t (f(x_t) − f(y_t)) + 2 (f(x_t) − f(x⋆)) − (1/(2L)) ∥g_t∥² − 2 g_t⊤(z_t − x⋆)   [by (2.14)]
  ≤ t (f(x_t) − f(y_t)) + 2 (f(x_t) − f(x⋆)) − 2 g_t⊤(z_t − x⋆)
  ≤ t g_t⊤(x_t − y_t) + 2 g_t⊤(x_t − x⋆) − 2 g_t⊤(z_t − x⋆)   [by (2.16)]
  = g_t⊤((t + 2) x_t − t y_t − 2 z_t)
  = g_t⊤ 0 = 0   [by (2.13)].
Hence, we indeed have Φ(t + 1) ≤ Φ(t).
2.7 Interlude
Let us get back to the supermodel f (x) = x2 (that is smooth with param-
eter L = 2, as we observed before). According to Theorem 2.8, gradient
descent (2.1) with stepsize γ = 1/2 satisfies
f(x_T) ≤ (1/T) x_0².   (2.17)
Here we used that the minimizer is x⋆ = 0. Let us check how good this
bound really is. For our concrete function and concrete stepsize, (2.1) reads
as
x_{t+1} = x_t − (1/2) ∇f(x_t) = x_t − x_t = 0,
so we are always done after one step! But we will see in the next section
that this is only because the function is particularly beautiful, and on top of
that, we have picked the best possible smoothness parameter. To simulate
a more realistic situation here, let us assume that we have not looked at the
supermodel too closely and found it to be smooth with parameter L = 4
only (which is a suboptimal but still valid parameter). In this case, γ = 1/4
and (2.1) becomes
x_{t+1} = x_t − (1/4) ∇f(x_t) = x_t − x_t/2 = x_t/2.
So, we in fact have
f(x_T) = f(x_0/2^T) = (1/2^{2T}) x_0².   (2.18)
This is still vastly better than the bound of (2.17)! While (2.17) requires
T ≈ x_0²/ε to achieve f(x_T) ≤ ε, (2.18) requires only
T ≈ ½ log(x_0²/ε),
which is an exponential improvement in the number of steps.
and therefore says that every convex function satisfies (2.19) with µ = 0.
In the spirit of Lemma 2.3 for smooth functions, we can characterize
strong convexity via convexity of another function.
[Figure: a function that is both smooth and strongly convex is sandwiched between the lower paraboloid f(x) + ∇f(x)⊤(y − x) + (µ/2)∥x − y∥² and the upper paraboloid f(x) + ∇f(x)⊤(y − x) + (L/2)∥x − y∥².]
Lemma 2.11 (Exercise 20). Suppose that dom(f ) is open and convex, and that
f : dom(f ) → R is differentiable. Let µ ∈ R+ . Then the following two state-
ments are equivalent.
Lemma 2.13 (Exercise 22). Let f : Rd → R be strongly convex with parameter
µ > 0 and smooth with parameter µ. Prove that f is of the form
f(x) = (µ/2) ∥x − b∥² + c,
where b ∈ Rd , c ∈ R.
f(x_t) − f(x⋆) ≤ (1/(2γ)) (γ² ∥∇f(x_t)∥² + ∥x_t − x⋆∥² − ∥x_{t+1} − x⋆∥²) − (µ/2) ∥x_t − x⋆∥².   (2.20)
Rewriting this yields a bound on ∥xt+1 − x⋆ ∥2 in terms of ∥xt − x⋆ ∥2 , along
with some “noise” that we still need to take care of:
∥xt+1 −x⋆ ∥2 ≤ 2γ(f (x⋆ )−f (xt ))+γ 2 ∥∇f (xt )∥2 +(1−µγ)∥xt −x⋆ ∥2 . (2.21)
γ := 1/L,
gradient descent (2.1) with arbitrary x_0 satisfies the following two properties.
(i) ∥x_{t+1} − x⋆∥² ≤ (1 − µ/L) ∥x_t − x⋆∥² for all t ≥ 0;
(ii) f(x_T) − f(x⋆) ≤ (L/2) (1 − µ/L)^T ∥x_0 − x⋆∥², T > 0.
Proof. For (i), we show that the noise in (2.21) disappears. By sufficient decrease (Lemma 2.7), we know that
f(x⋆) − f(x_t) ≤ f(x_{t+1}) − f(x_t) ≤ −(1/(2L)) ∥∇f(x_t)∥²,
and hence the noise can be bounded as follows (using γ = 1/L, multiplying by 2γ and rearranging the terms):
2γ (f(x⋆) − f(x_t)) + γ² ∥∇f(x_t)∥² ≤ 0.
From this, we can derive a rate in terms of the number of steps T required. Using the inequality ln(1 + x) ≤ x, it follows that after
T ≥ (L/µ) ln(R²L/(2ε))
steps (with R := ∥x_0 − x⋆∥), we have f(x_T) − f(x⋆) ≤ ε.
2.9 Exercises
Exercise 12. Let c ∈ Rd . Prove that the spectral norm of c⊤ equals the Euclidean
norm of c, meaning that
max_{x≠0} |c⊤x| / ∥x∥ = ∥c∥.
Exercise 13. Prove Lemma 2.3! (Alternative characterization of smoothness)
Exercise 14. Prove Lemma 2.4: The quadratic function f (x) = x⊤ Qx+b⊤ x+c,
Q symmetric, is smooth with parameter 2 ∥Q∥.
(i) Prove that f is strictly convex and differentiable, with a unique global min-
imum x⋆ = 0.
(ii) Prove that for every fixed stepsize γ in gradient descent (2.1) applied to f ,
there exists x0 for which f (x1 ) > f (x0 ).
(iv) Let X ⊆ R be a closed convex set such that 0 ∈ X and X ̸= {0}. Prove
that f is not smooth over X.
Exercise 16. In order to obtain average error at most ε in Theorem 2.1, we need
to choose iteration number and stepsize as
T ≥ (RB/ε)²,   γ := R/(B√T).
If R or B are unknown, we cannot do this.
Suppose now that we know R but not B. This means, we know a concrete
number R such that ∥x0 − x⋆ ∥ ≤ R; we also know that there exists a number B
such that ∥∇f (x)∥ ≤ B for all x, but we don’t know a concrete such number.
Develop an algorithm that—not knowing B—finds a vector x such that f (x)−
f (x⋆ ) < ε, using at most
O((RB/ε)²)
many gradient descent steps!
Exercise 18. In order to obtain average error at most ε in Theorem 2.8, we need
to choose
γ := 1/L,   T ≥ R²L/(2ε),
if ∥x0 − x⋆ ∥ ≤ R. If L is unknown, we cannot do this.
Now suppose that we know R but not L. This means, we know a concrete
number R such that ∥x0 − x⋆ ∥ ≤ R; we also know that there exists a number
L such that f is smooth with parameter L, but we don’t know a concrete such
number.
Develop an algorithm that—not knowing L—finds a vector x such that f (x)−
f(x⋆) < ε, using at most
O(R²L/(2ε))
many gradient descent steps!
Exercise 19. Let a ∈ R. Prove that f (x) = x4 is smooth over X = (−a, a) and
determine a concrete smoothness parameter L.
Exercise 21. Prove Lemma 2.12! (Strongly convex functions have unique global
minimum)
Exercise 22. Prove Lemma 2.13! (Strongly convex and smooth functions)
Chapter 3
Contents
3.1 The Algorithm
3.2 Bounded gradients: O(1/ε²) steps
3.3 Smooth convex functions: O(1/ε) steps
3.4 Smooth and strongly convex functions: O(log(1/ε)) steps
3.5 Projecting onto ℓ1-balls
3.6 Proximal gradient descent
3.6.1 The proximal gradient algorithm
3.6.2 Convergence in O(1/ε) steps
3.7 Exercises
3.1 The Algorithm
Another way to control gradients in (2.5) is to minimize f over a closed
convex subset X ⊆ Rd . For example, we may have a constrained opti-
mization problem to begin with (for example the LASSO in Section 1.6.2),
or we happen to know some region X containing a global minimum x⋆ , so
that we can restrict our search to that region. In this case, gradient descent
also works, but we need an additional projection step. After all, it can hap-
pen that some iteration of (2.1) takes us “into the wild” (out of X) where
we have no business to do. Projected gradient descent is the following
modification. We choose x_0 ∈ X arbitrary and for t ≥ 0 define
y_{t+1} := x_t − γ ∇f(x_t),   (3.1)
x_{t+1} := Π_X(y_{t+1}) := argmin_{x∈X} ∥x − y_{t+1}∥².   (3.2)
This means, after each iteration, we project the obtained iterate yt+1 back
to X. This may be very easy (think of X as the unit ball in which case
we just have to scale yt+1 down to length 1 if it is longer). But it may
also be very difficult. In general, computing ΠX (yt+1 ) means to solve an
auxiliary convex constrained minimization problem in each step! Here,
we are just assuming that we can do this. The projection is well-defined:
the squared distance function dy (x) := ∥x − y∥2 is strongly convex, and
hence, a unique minimum over the nonempty closed and convex set X
exists by Exercise 25.
We note that finding an initial x0 ∈ X also reduces to projection (of 0,
for example) onto X.
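As an illustration (an added sketch, not from the notes), here is (3.1)-(3.2) for the easy case mentioned above, where X is a Euclidean ball and projecting just means rescaling:

```python
import numpy as np

def project_ball(y, radius):
    """Projection onto the Euclidean ball of the given radius around 0."""
    n = np.linalg.norm(y)
    return y if n <= radius else radius * y / n

def projected_gradient_descent(grad_f, x0, gamma, radius, num_steps):
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        y = x - gamma * grad_f(x)      # gradient step (3.1)
        x = project_ball(y, radius)    # projection step (3.2)
    return x

# example: minimize ||x - c||^2 over the unit ball, with c outside the ball
c = np.array([2.0, 1.0])
x_min = projected_gradient_descent(lambda x: 2 * (x - c), [0.0, 0.0],
                                   gamma=0.1, radius=1.0, num_steps=200)
print(x_min, c / np.linalg.norm(c))    # both approximately c / ||c||
```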
We will frequently need the following
Part (i) says that the vectors x − ΠX (y) and y − ΠX (y) form an obtuse
angle, and (ii) equivalently says that the square of the long side x − y in
the triangle formed by the three points is at least the sum of squares of the
two short sides; see Figure 3.1.
62
y
α ≥ 90o
α ΠX (y)
X
x
63
Theorem 3.2. Let f : dom(f ) → R be convex and differentiable, X ⊆ dom(f )
closed and convex, x⋆ a minimizer of f over X; furthermore, suppose that ∥x0 −
x⋆ ∥ ≤ R, and that ∥∇f (x)∥ ≤ B for all x ∈ X. Choosing the constant stepsize
γ := R/(B√T),
projected gradient descent (3.1) with x_0 ∈ X yields
(1/T) Σ_{t=0}^{T−1} (f(x_t) − f(x⋆)) ≤ RB/√T.
Proof. The only required changes to the vanilla analysis are that in steps
(2.3) and (2.4), xt+1 needs to be replaced by yt+1 as this is the real next
(non-projected) gradient descent iterate after these steps; we therefore get
g_t⊤(x_t − x⋆) = (1/(2γ)) (γ² ∥g_t∥² + ∥x_t − x⋆∥² − ∥y_{t+1} − x⋆∥²).   (3.3)
By Fact 3.1(ii), ∥y_{t+1} − x⋆∥² ≥ ∥x_{t+1} − x⋆∥², so this yields
g_t⊤(x_t − x⋆) ≤ (1/(2γ)) (γ² ∥g_t∥² + ∥x_t − x⋆∥² − ∥x_{t+1} − x⋆∥²)   (3.4)
and return to the previous vanilla analysis for the remainder of the proof.
Lemma 3.3. Let f : dom(f) → R be differentiable and smooth with parameter L over a closed and convex set X ⊆ dom(f), according to (3.5). Choosing stepsize
γ := 1/L,
projected gradient descent (3.1) with arbitrary x_0 ∈ X satisfies
f(x_{t+1}) ≤ f(x_t) − (1/(2L)) ∥∇f(x_t)∥² + (L/2) ∥y_{t+1} − x_{t+1}∥²,   t ≥ 0.
More specifically, this already holds if f is smooth with parameter L over the line
segment connecting xt and xt+1 .
Proof. We proceed similar to the proof of the “unconstrained” sufficient
decrease Lemma 2.7, except that we now need to deal with projected gra-
dient descent. We again start from smoothness but then use yt+1 = xt −
∇f (xt )/L, followed by the usual equation 2v⊤ w = ∥v∥2 +∥w∥2 −∥v −w∥2 :
f(x_{t+1}) ≤ f(x_t) + ∇f(x_t)⊤(x_{t+1} − x_t) + (L/2) ∥x_t − x_{t+1}∥²
          = f(x_t) − L (y_{t+1} − x_t)⊤(x_{t+1} − x_t) + (L/2) ∥x_t − x_{t+1}∥²
          = f(x_t) − (L/2) (∥y_{t+1} − x_t∥² + ∥x_{t+1} − x_t∥² − ∥y_{t+1} − x_{t+1}∥²) + (L/2) ∥x_t − x_{t+1}∥²
          = f(x_t) − (L/2) ∥y_{t+1} − x_t∥² + (L/2) ∥y_{t+1} − x_{t+1}∥²
          = f(x_t) − (1/(2L)) ∥∇f(x_t)∥² + (L/2) ∥y_{t+1} − x_{t+1}∥².
(1/(2L)) ∥∇f(x_t)∥² ≤ f(x_t) − f(x_{t+1}) + (L/2) ∥y_{t+1} − x_{t+1}∥²   (3.6)
resulting from sufficient decrease (Lemma 3.3) to bound the squared gra-
dient ∥gt ∥2 = ∥∇f (xt )∥2 in the vanilla analysis. Unfortunately, (3.6) has
an extra term compared to what we got in the unconstrained case. But we
can compensate for this in the vanilla analysis itself. Let us go back to its
“constrained” version (3.3), featuring yt+1 instead of xt+1 :
g_t⊤(x_t − x⋆) = (1/(2γ)) (γ² ∥g_t∥² + ∥x_t − x⋆∥² − ∥y_{t+1} − x⋆∥²).
Using f (xt ) − f (x⋆ ) ≤ gt⊤ (xt − x⋆ ) from convexity, we have (with γ = 1/L)
that
Σ_{t=0}^{T−1} (f(x_t) − f(x⋆)) ≤ Σ_{t=0}^{T−1} g_t⊤(x_t − x⋆)   (3.8)
                            ≤ (1/(2L)) Σ_{t=0}^{T−1} ∥g_t∥² + (L/2) ∥x_0 − x⋆∥² − (L/2) Σ_{t=0}^{T−1} ∥y_{t+1} − x_{t+1}∥².
Plugging this into (3.8), the extra terms cancel, and we arrive—as in the
unconstrained case—at
Σ_{t=1}^{T} (f(x_t) − f(x⋆)) ≤ (L/2) ∥x_0 − x⋆∥².
The statement follows as in the proof of Theorem 2.8 from the fact that due
to sufficient decrease (Exercise 24), the last iterate is the best one.
3.5 Projecting onto ℓ1-balls
Problems that are ℓ1 -regularized appear among the most commonly used
models in machine learning and signal processing, and we have already
discussed the Lasso as an important example of that class. We will now
address how to perform projected gradient as an efficient optimization for
ℓ1 -constrained problems. Let
X = B₁(R) := { x ∈ R^d : ∥x∥₁ = Σ_{i=1}^{d} |x_i| ≤ R }
be the ℓ1 -ball of radius R > 0 around 0, i.e., the set of all points with 1-
norm at most R. Our goal is to compute ΠX (v) for a given vector v, i.e. the
projection of v onto X; see Figure 3.2.
[Figure 3.2: the ℓ₁-ball X = B₁(R) around 0 and the projection Π_X(v) of a point v onto it.]
At first sight, this may look like a rather complicated task. Geometri-
cally, X is a cross polytope (square for d = 2, octahedron for d = 3), and as such it has 2^d many facets. But we can start with some basic simplifying
observations.
Fact 3.6. We may assume without loss of generality that (i) R = 1, (ii) v_i ≥ 0 for all i, and (iii) Σ_{i=1}^{d} v_i > 1.
Proof. If we project v/R onto B1 (1), we obtain ΠX (v)/R (just scale Fig-
ure 3.2), so we can restrict to the case R = 1. For (ii), we observe that
simultaneously flipping the signs of a fixed subset of coordinates in both v and x ∈ X yields vectors v′ and x′ ∈ X such that ∥x − v∥ = ∥x′ − v′∥; thus, x minimizes the distance to v if and only if x′ minimizes the distance to v′. Hence, it suffices to compute Π_X(v) for vectors with nonnegative entries. If Σ_{i=1}^{d} v_i ≤ 1, we have Π_X(v) = v and are done, so the interesting case is (iii).
Fact 3.7. Under the assumptions of Fact 3.6, x = Π_X(v) satisfies x_i ≥ 0 for all i and Σ_{i=1}^{d} x_i = 1.
where
∆_d := { x ∈ R^d : Σ_{i=1}^{d} x_i = 1, x_i ≥ 0 ∀i }
x⋆i > 0, i ≤ p,
x⋆i = 0, i > p.
[Figure: the simplex ∆_d and the projection of a point v onto it.]
Lemma 3.11. Under the assumption of Fact 3.9, and with p as in Lemma 3.10,
x⋆_i = v_i − Θ_p,   i ≤ p,
where
Θ_p = (1/p) ( Σ_{i=1}^{p} v_i − 1 ).
Proof. Again, we argue by contradiction. If not all x⋆i − vi , i ≤ p have the
same value −Θp , then we have x⋆i −vi < x⋆j −vj for some i, j ≤ p. As before,
we can then decrease x⋆j > 0 by some small positive ε and simultaneously
increase x⋆_i by ε to obtain x ∈ ∆_d such that ∥x − v∥ < ∥x⋆ − v∥, a contradiction.
and we just need to find the right one. In order for candidate x⋆ (p) to
comply with Lemma 3.10, we must have
vp − Θp > 0, (3.13)
and this actually ensures x⋆ (p)i > 0 for all i ≤ p by the assumption of
Fact 3.9 and therefore x⋆ (p) ∈ ∆d . But there could still be several values of
p satisfying (3.13). Among them, we simply pick the one for which x⋆ (p)
minimizes the distance to v. It is not hard to see that this can be done in
time O(d log d), by first sorting v and then carefully updating the values
Θp and ∥x⋆ (p) − v∥2 as we vary p to check all candidates.
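A sketch of this candidate-checking procedure in Python (an addition; it assumes the reductions described above have already been applied, i.e. v has nonnegative entries, is sorted in decreasing order, and its entries sum to more than 1):

```python
import numpy as np

def project_to_simplex(v):
    """Project a sorted, nonnegative v with sum > 1 onto the simplex Delta_d
    by checking all candidates x*(p) and keeping the closest feasible one."""
    d = len(v)
    best, best_dist = None, np.inf
    for p in range(1, d + 1):
        theta = (v[:p].sum() - 1) / p                 # Theta_p as in Lemma 3.11
        if v[p - 1] - theta <= 0:                     # candidate violates (3.13)
            continue
        x = np.concatenate([v[:p] - theta, np.zeros(d - p)])
        dist = np.sum((x - v) ** 2)
        if dist < best_dist:
            best, best_dist = x, dist
    return best

print(project_to_simplex(np.array([0.9, 0.8, 0.1])))   # entries are >= 0 and sum to 1
```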
But actually, there is an even simpler criterion that saves us from com-
paring distances.
Lemma 3.12. Under the assumption of Fact 3.9, with x⋆(p) as in (3.12), and with
p⋆ := max{ p ∈ {1, . . . , d} : v_p − (1/p) (Σ_{i=1}^{p} v_i − 1) > 0 },
it holds that
argmin_{x∈∆_d} ∥x − v∥² = x⋆(p⋆).
The proof is Exercise 26. Together with our previous reductions, we
obtain the following result.
Theorem 3.13. Let v ∈ Rd , R ∈ R+ , X = B1 (R) the ℓ1 -ball around 0 of
radius R. The projection
To obtain the last equality, we have just completed the quadratic ∥v∥² + 2v⊤w + ∥w∥² = ∥v + w∥² for v := γ∇g(x_t) and w := y − x_t. Here it is crucial that v is independent of the optimization variable y, so the corresponding term can be ignored when taking the argmin. The scaling by 1/(2γ) is also irrelevant, but we keep it for better illustrating the next step.
The interpretation of the above equivalent reformulation of the classic
gradient step is important for us, and is what has enabled the previous
convergence analysis in Section 2.5 for smooth unconstrained optimiza-
tion: For the particular choice of stepsize γ := 1/L which we have used,
the above formulation shows that the gradient descent step exactly min-
imizes the local quadratic model of g at our current iterate xt , formed by
the smoothness property with parameter L as defined in (2.8).
A generalization of gradient descent. The proximal gradient descent
method (3.19) is also known as generalized gradient descent. In the special
case h ≡ 0, we of course recover classic gradient descent.
More interestingly, it is also a generalization of projected gradient de-
scent as we have discussed in the previous sections. Given a closed convex
set X, the indicator function of the set X is given as the convex function
ι_X : R^d → R ∪ {+∞},   x ↦ ι_X(x) := 0 if x ∈ X, and +∞ otherwise.   (3.21)
When using the indicator function of our constraint set X as h ≡ ιX , it is
easy to see that the proximal mapping simply becomes
prox_{h,γ}(z) := argmin_y { (1/(2γ)) ∥y − z∥² + ι_X(y) } = argmin_{y∈X} ∥y − z∥² = Π_X(z),
f(x_T) − f(x⋆) ≤ (L/(2T)) ∥x_0 − x⋆∥²,   T > 0.
Proof. The proof follows the vanilla analysis for the smooth case, applying
it only to g, while always keeping h separate, as in (3.17). We leave the
details as Exercise 27 for the reader.
3.7 Exercises
Exercise 23. Consider the projected gradient descent algorithm as in (3.1) and
(3.2), with a convex differentiable function f . Suppose that for some iteration t,
xt+1 = xt . Prove that in this case, xt is a minimizer of f over the closed and
convex set X!
f (xt+1 ) ≤ f (xt ).
Exercise 25. Let X ⊆ Rd be a nonempty closed and convex set, and let f be
strongly convex over X. Prove that f has a unique minimizer x⋆ over X! In
particular, for X = Rd , we obtain the existence of a unique global minimum.
Chapter 4
Subgradient Descent
Contents
4.1 Subgradients
4.2 Differentiability of convex functions
4.3 The algorithm
4.4 Lipschitz convex functions: O(1/ε²) steps
4.5 Tame strong convexity: O(1/ε) steps
4.6 Optimality of first-order methods
4.7 Exercises
4.1 Subgradients
Definition 4.1. Let f : dom(f) → R. Then g ∈ R^d is a subgradient of f at x ∈ dom(f) if
f(y) ≥ f(x) + g⊤(y − x),   ∀y ∈ dom(f).   (4.1)
[Figure: the graph of f(x) = |x| together with the lines y ↦ (1/5)y and y ↦ −(2/5)y, both lying below the graph.]
Figure 4.1: The function f (x) = |x| has subgradients g ∈ [−1, 1] at 0, since
f (y) ≥ gy for exactly g ∈ [−1, 1].
get a “first order characterization” of convexity that also covers the non-
differentiable case.
4.2 Differentiability of convex functions
Before we move on to subgradient descent, we want to get a feeling for
how “wild” non-differentiable convex functions can be. The answer is:
they are surprisingly tame. While there are continuous functions that are
nowhere differentiable (the classical example is the Weierstrass function),
convex functions cannot be as pathological. In fact, a convex function f
is differentiable almost everywhere. Formally, this means that wherever you
are in dom(f ), you find points arbitrarily close to you at which f is dif-
ferentiable. In still other words, the set of points where f is not differ-
entiable has measure 0 [Roc97, Theorem 25.5]. Again, all of this requires
dom(f ) ⊆ Rd , so let us remind ourselves that we are always in finite di-
mension throughout this text.
This does not mean that we can ignore non-differentiability in opti-
mization. For example, as Figure 4.1 demonstrates, the global minimum x⋆
can easily be a “kink”, a point where f is not differentiable. Also, while
running an iterative optimization scheme, we may always stumble upon
an intermediate kink.
An important fact is the following characterization of subdifferentials;
4.3 The algorithm
An iteration of subgradient descent is defined as
Let gt ∈ ∂f (xt )
xt+1 := xt − γt gt . (4.2)
Proof. The proof is identical to the one of Theorem 2.1 presented in Sec-
tion 2.4. The only change is that gt is a subgradient now and not a gra-
dient, so that the inequality (2.2) now follows from the subgradient prop-
erty (4.1) instead of the first-order characterization of convexity. The re-
quired bound ∥gt ∥2 ≤ B 2 follows from Lemma 4.4 (“convex and Lipschitz
= bounded subgradients”).
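For illustration (an added sketch, not from the notes), here is (4.2) on the nondifferentiable function f(x) = ∥x∥₁, using the sign vector as a subgradient and decreasing stepsizes:

```python
import numpy as np

def subgradient_descent(subgrad_f, x0, stepsizes):
    """x_{t+1} = x_t - gamma_t * g_t with g_t a subgradient at x_t."""
    x = np.asarray(x0, dtype=float)
    best = x.copy()
    for gamma in stepsizes:
        x = x - gamma * subgrad_f(x)
        if np.abs(x).sum() < np.abs(best).sum():
            best = x.copy()   # track the best iterate; the last one need not be best
    return best

# np.sign is a valid subgradient of ||.||_1 (any value in [-1, 1] works at 0)
steps = [1.0 / np.sqrt(t + 1) for t in range(1000)]
print(subgradient_descent(np.sign, [3.0, -2.0], steps))   # close to the minimizer 0
```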
4.5 Tame strong convexity: O(1/ε) steps
(Projected) gradient descent converges in O(log(1/ε)) steps for functions
that are both smooth and strongly convex. But if a function is non-differen-
tiable, then it cannot be smooth under the natural definition of smoothness
(Exercise 31). It can still be strongly convex, however, so it is natural to ask
whether strong convexity alone allows us to obtain a convergence result.
The answer is no in general, but before we discuss this, let us define strong
convexity for not necessarily differentiable functions. This is straightfor-
ward; for differentiable functions, we recover Definition 2.10. Here, we
restrict to the unconstrained case for simplicity.
Definition 4.8. Let f : dom(f) → R be convex, µ ∈ R+, µ > 0. The function f is
called strongly convex (with parameter µ) if
f(y) ≥ f(x) + g⊤(y − x) + (µ/2)∥x − y∥², ∀x, y ∈ dom(f), ∀g ∈ ∂f(x). (4.3)
Actually, requiring (4.3) only for some g ∈ ∂f (x) would be another
straightforward generalization of Definition 2.10, so which one is the “right”
one? The answer is that it does not matter if dom(f ) is open. We could
even afford to not require anything for points x where f is not differen-
tiable. This is a consequence of Theorem 4.6 (Exercise 32).
Strong convexity has the following useful characterization.
Lemma 4.9 (Exercise 33). Let f : dom(f ) → R be convex, dom(f ) open,
µ ∈ R+ , µ > 0. f is strongly convex with parameter µ if and only if fµ :
dom(f ) → R defined by
fµ(x) = f(x) − (µ/2)∥x∥², x ∈ dom(f)
is convex.
Let’s look at the problem with (sub)gradient descent on strongly con-
vex functions.
Lemma 4.10 (Exercise 34). The function f (x) = e|x| is strongly convex with
parameter µ = 1.
This function is of course far from being smooth; it grows exponen-
tially, so there can’t be any quadratic upper bounds. In fact, as strong
convexity only requires quadratic lower bounds, strongly convex functions
can be extremely fast-growing. In such a situation, (sub)gradient descent
will overshoot already for tiny step sizes and diverge.
In case of f(x) = e^{|x|}, the function is differentiable at x ≠ 0 with f′(x) = sgn(x)e^{|x|}, so the (sub)gradient step is
xt+1 = xt − γ sgn(xt) e^{|xt|}.
For |x| only mildly larger than 0, the step will overshoot the optimum
x∗ = 0 and take us (much) further away. To compensate for this, we would
need extremely small stepsizes. These in turn would lead to extremely
poor convergence for functions such as f (x) = x2 /2 (which is also strongly
convex with µ = 1) . Hence, there are no stepsizes that fit all strongly
convex functions with a fixed strong convexity parameter µ.
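The overshooting is easy to observe numerically. The following sketch (illustrative only, not from the notes; the starting point 3 and the stepsize 1/2 are made up) performs two fixed-stepsize (sub)gradient steps on f(x) = e^{|x|} and, from the same starting point, on f(x) = x²/2.

```python
import math

def step(x, gamma, grad):
    return x - gamma * grad(x)

grad_exp = lambda x: math.copysign(math.exp(abs(x)), x) if x != 0 else 0.0  # f(x) = e^{|x|}
grad_sq  = lambda x: x                                                      # f(x) = x^2 / 2

x, y, gamma = 3.0, 3.0, 0.5
for t in range(2):
    x = step(x, gamma, grad_exp)
    y = step(y, gamma, grad_sq)
    print(t, x, y)   # |x| explodes (3 -> 7.04 -> 563.4), while y steadily shrinks
```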
To succeed with (sub)gradient descent in this situation, we therefore
need to make some additional assumptions. Smoothness (quadratic upper
bounds) is such an assumption, but in the non-differentiable case, this is
precisely not an option. What people have done instead is to assume that
the subgradients gt that we encounter during the algorithm are bounded
in norm.
To ensure bounded subgradients, we could simply assume that f is
Lipschitz, but then we will only make a statement about an empty function
class. The reason is that a function cannot be globally strongly convex and
Lipschitz at the same time (Exercise 35). It can be strongly convex and
have bounded gradients over a closed and bounded set X, so analyzing
projected subgradient descent is an alternative.
But even when we optimize over Rd , we may be lucky and only hit
iterates with small subgradients. This will typically happen if we start
sufficiently close to optimality. In this case, there are step sizes γt (not
depending on the observed gradients) that give us useful error bounds.
Below, we prove such a bound for subgradient descent, and this re-
sult then clearly extends to gradient descent on differentiable and strongly
convex (but not necessarily smooth) functions. The bound on the number
of steps will be O(1/ε) which is of course much worse than O(log(1/ε)),
but still better than O(1/ε2 ) that we get in the Lipschitz case. So assum-
ing strong convexity results in a convergence behavior as in the smooth
case—if the gradients stay bounded, and this is what we mean by “tame”.
In order to analyze subgradient descent on strongly convex functions,
we will for the first time depart from algorithm variants with a constant
stepsize γ, but instead use a time-varying stepsize γt decreasing over time.
γt := 2/(µ(t + 1)), t > 0.

gt⊤(xt − x⋆) = (γt/2)∥gt∥² + (1/(2γt)) (∥xt − x⋆∥² − ∥xt+1 − x⋆∥²).
Now we plug in the lower bound gt⊤(xt − x⋆) ≥ f(xt) − f(x⋆) + (µ/2)∥xt − x⋆∥²
resulting from strong convexity to obtain (with ∥gt∥² ≤ B²) that
f(xt) − f(x⋆) ≤ (B²γt)/2 + ((γt⁻¹ − µ)/2)∥xt − x⋆∥² − (γt⁻¹/2)∥xt+1 − x⋆∥². (4.4)
Unlike in the vanilla analysis (where we had γt = γ, µ = 0), the right-hand
side does not telescope anymore when we sum over all t ≤ T ; to fix this,
we precisely need the time-varying stepsize.
Let’s make a small computation: to get telescoping behavior, we would
need that γt⁻¹ = γt+1⁻¹ − µ. For example, γt⁻¹ = µ(1 + t) satisfies this, but
our choice γt⁻¹ = µ(1 + t)/2 does not. Exercise 36 asks you to compute
what happens when we actually choose γt⁻¹ = µ(1 + t); this will let you
appreciate the seemingly “wrong” choice of γt = 2/(µ(t + 1)) here. Plugging in
this stepsize and multiplying by t on both sides, we get
t · (f(xt) − f(x⋆)) ≤ B²t/(µ(t + 1)) + (µ/4) (t(t − 1)∥xt − x⋆∥² − (t + 1)t ∥xt+1 − x⋆∥²)
≤ B²/µ + (µ/4) (t(t − 1)∥xt − x⋆∥² − (t + 1)t ∥xt+1 − x⋆∥²).
Summing from t = 1, . . . , T , we obtain a telescoping sum:
∑_{t=1}^{T} t · (f(xt) − f(x⋆)) ≤ TB²/µ + (µ/4) (0 − T(T + 1)∥x_{T+1} − x⋆∥²) ≤ TB²/µ.
Since
(2/(T(T + 1))) ∑_{t=1}^{T} t = 1,
Jensen’s inequality (Lemma 1.13) yields
f( (2/(T(T + 1))) ∑_{t=1}^{T} t · xt ) − f(x⋆) ≤ (2/(T(T + 1))) ∑_{t=1}^{T} t · (f(xt) − f(x⋆)).
Theorem 4.12 (Nesterov). For any T ≤ d − 1 and starting point x0 , there is a
function f in the problem class of B-Lipschitz functions over Rd , such that any
(sub)gradient method has an objective error at least
f(xT) − f(x⋆) ≥ RB / (2(1 + √(T + 1))).
The above theorem applies to all first-order methods which form iter-
ates by linearly combining past iterates and (sub)gradients, and requires
the dimension d to be sufficiently large.
4.7 Exercises
Exercise 28. Prove Lemma 4.2, meaning that a function that is differentiable at x
has at most one subgradient there, namely ∇f (x).
Exercise 29. Prove the easy direction of Lemma 4.3, meaning that the existence
of subgradients everywhere implies convexity!
Exercise 30. Prove Lemma 4.4 (Lipschitz continuity and bounded subgradients).
Exercise 31. Generalizing Definition 2.2, let us call a (not necessarily differen-
tiable) function f : Rd → R smooth with parameter L ∈ R+ if for all x ∈ Rd ,
there exists a subgradient gx ∈ Rd such that
f(y) ≤ f(x) + gx⊤(y − x) + (L/2)∥x − y∥², ∀y ∈ Rd.
This means that for every point x, the graph of f is below the graph of the
quadratic function f(x) + gx⊤(y − x) + (L/2)∥x − y∥².
Prove that if f is smooth according to this definition, then f is differentiable,
with gx = ∇f (x) for all x. In particular, for differentiable functions, the notion of
smoothness introduced above coincides with the one of Definition 2.2; moreover,
non-differentiable functions cannot be smooth.
Does the above hold if gx is not a subgradient?
for all x such that ∇f (x) exists, and for all y. Prove that this implies
f(y) ≥ f(x) + gx⊤(y − x) + (µ/2)∥x − y∥²
for all x, all gx ∈ ∂f (x) and all y.
Exercise 33. Prove Lemma 4.9: f is strongly convex with parameter µ over an
open domain if and only if fµ : x ↦ f(x) − (µ/2)∥x∥² is convex over the same
domain.
Exercise 34. Prove Lemma 4.10: f (x) = e|x| is strongly convex with parameter
µ = 1.
Exercise 36. Which result can you prove when you use the “telescoping stepsize”
γt = 1/(µ(t + 1))
in Theorem 4.11 instead of γt = 2/(µ(t + 1))?
Chapter 5
Stochastic Gradient Descent
Contents
5.1 The algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.2 Unbiasedness . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.3 Bounded stochastic gradients: O(1/ε2 ) steps . . . . . . . . . 91
5.4 Tame strong convexity: O(1/ε) steps . . . . . . . . . . . . . . 92
5.5 Stochastic Subgradient Descent . . . . . . . . . . . . . . . . . 93
5.6 Mini-batch variants . . . . . . . . . . . . . . . . . . . . . . . . 93
5.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.1 The algorithm
Many objective functions occurring in machine learning are formulated as
sum structured objective functions
f(x) = (1/n) ∑_{i=1}^{n} fi(x). (5.1)
Here fi is typically the cost function of the i-th datapoint, taken from a
training set of n elements in total.
We have already seen an example for this: the loss function (1.13) in
the handwritten digit recognition (Section 1.6.1) has one term for each of
the n training images x ∈ P :
ℓ(W) = − ∑_{x∈P} ln z_{d(x)}(W x).
The normalizing factor 1/n that we assume in the general setting (5.1)
will just simplify the following a bit.
An iteration of stochastic gradient descent (SGD) in its basic form is de-
fined as
sample i ∈ [n] uniformly at random
xt+1 := xt − γt ∇fi (xt ). (5.2)
This update looks almost identical to the classical gradient method, the
only difference being that we have computed the gradient not of the en-
tire f but only of one particular (randomly chosen) function fi . As we will
need varying stepsizes a bit later, we allow for the stepsize to depend on t
now.
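As a minimal illustration of update (5.2) (not from the text; the least-squares terms fi and the data below are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.standard_normal((n, d))       # one data row per term f_i
b = A @ rng.standard_normal(d)        # synthetic targets

def grad_fi(x, i):
    # gradient of f_i(x) = 0.5 * (a_i^T x - b_i)^2
    return (A[i] @ x - b[i]) * A[i]

def sgd(x0, gamma, T):
    x = x0
    for t in range(T):
        i = rng.integers(n)                 # sample i in [n] uniformly at random
        x = x - gamma(t) * grad_fi(x, i)    # x_{t+1} = x_t - gamma_t * grad f_i(x_t)
    return x

x_hat = sgd(np.zeros(d), gamma=lambda t: 0.01, T=5000)
print(np.linalg.norm(A @ x_hat - b))        # residual should be small
```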
In the above setting, the update vector gt := ∇fi (xt ) is called a stochastic
gradient. Formally, gt is a vector of d random variables, but we will also
simply call this a random variable.
The crucial advantage of SGD versus its classical gradient descent coun-
terpart is the efficiency per iteration: While computing the full gradient for
a sum structured problem (5.1) would require us to compute n individual
gradients of the fi functions, an iteration of SGD requires only a single
one of those, and therefore is n times cheaper. SGD has therefore become
the main workhorse for training machine learning models. Whether such
cheaper iterations also give similar progress is another question, which we
analyze next.
5.2 Unbiasedness
We would like to start with the vanilla analysis again, but now we can-
not bound the random variable gt⊤ (xt − x⋆ ) from below using (2.2), as the
inequality
f (xt ) − f (x⋆ ) ≤ gt⊤ (xt − x⋆ )
may hold or not hold, depending on how gt turns out. But it still holds in
expectation, as we show now.
The vector gt may be far from the true gradient, and of high variance,
but in expectation over the random choice of i, it does coincide with the
full gradient of f . We formalize this as
E[gt | xt = x] = (1/n) ∑_{i=1}^{n} ∇fi(x) = ∇f(x), x ∈ Rd. (5.3)
Here, E[gt | xt = x] is the conditional expectation of gt, given the event
{xt = x}. If this event is nonempty, linearity of conditional expectations
yields that
E[gt⊤(x − x⋆) | xt = x] = E[gt | xt = x]⊤(x − x⋆) = ∇f(x)⊤(x − x⋆).
Using the fact that {xt = x} can occur only for x in some finite set X (one
element for every choice of indices throughout all iterations), the partition
theorem further gives us
E[gt⊤(xt − x⋆)] = ∑_{x∈X} E[gt⊤(x − x⋆) | xt = x] prob(xt = x)
= ∑_{x∈X} ∇f(x)⊤(x − x⋆) prob(xt = x)
= E[∇f(xt)⊤(xt − x⋆)].
Hence, we have
E[gt⊤(xt − x⋆)] = E[∇f(xt)⊤(xt − x⋆)] ≥ E[f(xt) − f(x⋆)]. (5.4)
The last inequality is by convexity, and this means that the lower bound
(2.2) holds in expectation.
Exercise 37 lets you recall some basics around conditional expectations.
Under (5.3) we say that the stochastic gradient gt is an unbiased estimator
of the gradient, for any time-step t.
5.3 Bounded stochastic gradients: O(1/ε2) steps
To get a first result out of the vanilla analysis, we assumed in Section 2.4
that ∥∇f (x)∥2 ≤ B 2 for all x ∈ Rd , where B was a constant. Here, we
are assuming the same for the expected squared norms of our stochastic
gradients. And we are getting the same result, except that it now holds for
the expected function values.
Theorem 5.1. Let f : Rd → R be a convex and differentiable function, and let
x⋆ be a global minimum of f; furthermore, suppose that ∥x0 − x⋆∥ ≤ R, and that
E[∥gt∥²] ≤ B² for all t. Choosing the constant stepsize
γ := R/(B√T),
stochastic gradient descent (5.2) yields
(1/T) ∑_{t=0}^{T−1} E[f(xt) − f(x⋆)] ≤ RB/√T.
Proof. Taking expectations on both sides of the vanilla analysis (2.5) and
using linearity of expectations, we get
∑_{t=0}^{T−1} E[gt⊤(xt − x⋆)] ≤ (γ/2) ∑_{t=0}^{T−1} E[∥gt∥²] + (1/(2γ))∥x0 − x⋆∥². (5.5)
By (5.4),
E[f(xt) − f(x⋆)] ≤ E[gt⊤(xt − x⋆)].
Plugging this into (5.5), using E[∥gt∥²] ≤ B² and ∥x0 − x⋆∥ ≤ R, we get
∑_{t=0}^{T−1} E[f(xt) − f(x⋆)] ≤ (γ/2)B²T + (1/(2γ))R²,
and the statement follows with the choice of γ as in Theorem 2.1.
5.4 Tame strong convexity: O(1/ε) steps
It is possible to strengthen our above SGD analysis. One way to do so
is under the additional assumption of strong convexity of the objective
function f (as in Definition 2.10). Again, the proof works by “taking ex-
pectations” over a previous analysis, in this case the one for subgradient
descent in the tame strongly convex case (Theorem 4.11).
Proof. We start from the vanilla analysis (2.4) (with γ = γt ) and take expec-
tations on both sides:
E[gt⊤(xt − x⋆)] = (γt/2) E[∥gt∥²] + (1/(2γt)) (E[∥xt − x⋆∥²] − E[∥xt+1 − x⋆∥²]).
Now we use (5.4) along with strong convexity to bound the left-hand side from
below; together with E[∥gt∥²] ≤ B², this yields
E[f(xt) − f(x⋆)] ≤ (B²γt)/2 + ((γt⁻¹ − µ)/2) E[∥xt − x⋆∥²] − (γt⁻¹/2) E[∥xt+1 − x⋆∥²].
The proof continues as in Theorem 4.11, with every step being the “ex-
pected version” of the corresponding step in the earlier proof.
5.5 Stochastic Subgradient Descent
For problems which are not necessarily differentiable, we modify SGD to
use a subgradient of fi in each iteration. The update of stochastic subgra-
dient descent is given by
sample i ∈ [n] uniformly at random
let gt ∈ ∂fi (xt ) (5.6)
xt+1 := xt − γt gt .
Let gi : Rd → Rd denote the function that selects the subgradient of fi
at the current point. Then we have gt = gi (xt ) for random i. Unbiasedness
now becomes
E[gt | xt = x] = (1/n) ∑_{i=1}^{n} gi(x) =: g(x), x ∈ Rd.
5.6 Mini-batch variants
Instead of a single stochastic gradient, we can also average m of them in each iteration and use the step direction
g̃t := (1/m) ∑_{j=1}^{m} gtj,
where gtj = ∇fij(xt) for an index ij. The set of the (distinct) ij indices is
called a mini-batch, and m is the mini-batch size.
Using the step direction g̃t defines mini-batch SGD. For m = 1, we re-
cover SGD as originally defined, while for m = n we recover full gradient
descent.
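A corresponding sketch of the mini-batch variant (again with made-up least-squares terms fi and data; sampling m distinct indices and averaging their gradients is the only point being illustrated):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 100, 5, 10
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)

def minibatch_gradient(x):
    # g_tilde = (1/m) * sum_j grad f_{i_j}(x) over m distinct sampled indices
    idx = rng.choice(n, size=m, replace=False)
    residuals = A[idx] @ x - b[idx]
    return A[idx].T @ residuals / m          # average of the m individual gradients

x = np.zeros(d)
for t in range(2000):
    x = x - 0.05 * minibatch_gradient(x)     # mini-batch SGD step
print(np.linalg.norm(A @ x - b))             # close to 0 on this synthetic problem
```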
Mini-batch SGD can be advantageous in several applications. For ex-
ample, parallelization over up to m processors will easily give a speed-up
for the gradient computation, which is typically the main cost of running
SGD. Here, parallelization exploits the fact that all gtj are defined at the
same iterate xt and can therefore be computed independently.
Taking an average of many independent random variables reduces the
variance. In the context of mini-batch SGD, we obtain that for larger size
of the mini-batch m our estimate g̃t will be closer to the true gradient, in
expectation:
E[∥g̃t − ∇f(xt)∥²] = E[∥(1/m) ∑_{j=1}^{m} gtj − ∇f(xt)∥²]
= (1/m) E[∥gt1 − ∇f(xt)∥²]
= (1/m) (E[∥gt1∥²] − ∥∇f(xt)∥²) ≤ B²/m.
Using a modification of the above analysis, it is possible to use this
property to relate the above convergence rate of SGD to the rate of full
gradient descent.
5.7 Exercises
Exercise 37. Let Y be a random variable over a finite probability space (Ω, prob)
where prob : 2Ω → [0, 1]; this avoids subtleties in defining conditional probabili-
ties and expectations; and it covers the random variables occurring in SGD, since
in each step, we are randomly choosing among a finite set of n indices. Further-
more, let B ⊆ Ω be an event.
For nonempty B, the conditional expectation of Y given B is the number
E[Y | B] := ∑_{y∈Y(Ω)} y · prob(Y = y | B),
where Y = y is shorthand for the event {ω ∈ Ω : Y (ω) = y}.
Finally, for two events A and B ≠ ∅, the conditional probability prob(A | B)
is defined as
prob(A | B) := prob(A ∩ B) / prob(B).
If B = ∅, E[Y | B] can be defined arbitrarily.
Prove the following statements.
Chapter 6
Nonconvex functions
Contents
6.1 Smooth functions . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.2 Trajectory analysis . . . . . . . . . . . . . . . . . . . . . . . . 103
6.2.1 Deep linear neural networks . . . . . . . . . . . . . . 104
6.2.2 A simple nonconvex function . . . . . . . . . . . . . . 106
6.2.3 Smoothness along the trajectory . . . . . . . . . . . . 109
6.2.4 Convergence . . . . . . . . . . . . . . . . . . . . . . . 112
6.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
So far, all convergence results that we have given for variants of gra-
dient descent have been for convex functions. And there is a good reason
for this: on nonconvex functions, gradient descent can in general not be
expected to come close (in distance or function value) to the global mini-
mum x⋆ , even if there is one.
As an example, consider the nonconvex function from Figure 1.4 (left).
Figure 6.1 shows what happens if we start gradient descent somewhere “to
the right”, with a not too large stepsize so that we do not overshoot. For
any sufficiently large T , the iterate xT will be close to the local minimum
y⋆ , but not to the global minimum x⋆ .
Figure 6.2: Gradient descent may get stuck in a flat region (saddle point)
y⋆ (left), or reach neither a local minimum nor a saddle point (right).
[Two figures illustrating the quadratic upper bound f(x) + ∇f(x)⊤(y − x) + (L/2)∥x − y∥² and the linear approximation f(x) + ∇f(x)⊤(y − x), each compared with f(y).]
So far, we had only equalities, now we start estimating:
It is tempting to interpret convergence of ∥∇f (xt )∥2 to 0 as convergence
to a critical point of f (a point where the gradient vanishes). But this inter-
pretation is not fully accurate in general, as Figure 6.2 (right) shows: The
algorithm may enter a region where f asymptotically approaches some
value, without reaching it (think of the rightmost piece of the function in
the figure as f (x) = e−x ). In this case, the gradient converges to 0, but the
iterates are nowhere near a critical point.
Proof. We recall that sufficient decrease (Lemma 2.7) does not require con-
vexity, and this gives
f(xt+1) ≤ f(xt) − (1/(2L))∥∇f(xt)∥², t ≥ 0.
Rewriting this into a bound on the gradient yields
In the smooth setting, gradient descent has another interesting prop-
erty: with stepsize 1/L, it cannot overshoot. By this, we mean that it
cannot pass a critical point (in particular, not the global minimum) when
moving from xt to xt+1 . Equivalently, with a smaller stepsize, no critical
point can be reached. With stepsize 1/L, it is possible to reach a critical
point, as we have demonstrated for the supermodel function f (x) = x2 in
Section 2.7.
simplified setting that allows us to show the main ideas (and limitations)
behind one particular trajectory analysis [ACGH18].
In our simplified setting, we will look at the task of minimizing a con-
crete and very simple nonconvex function. This function turns out be
smooth along the trajectories that we analyze, and this is one important
ingredient. However, smoothness alone does not suffice to prove con-
vergence to the global minimum, let alone fast convergence: As we have
seen in the last section, we can in general only guarantee that the gradient
norms converge to 0, and at a rather slow rate. To get beyond this, we will
need to exploit additional properties of the function under consideration.
y i ≈ w ⊤ xi ,
yi ≈ W xi ,
for a weight matrix W ∈ Rm×d to be learned. The matrix that best fits this
hypothesis on the given observations is the least-squares matrix
W⋆ = argmin_{W ∈ Rm×d} ∑_{i=1}^{n} ∥W xi − yi∥².
If we let X ∈ Rd×n be the matrix whose columns are the xi and Y ∈ Rm×n
the matrix whose columns are the yi, we can equivalently write this as
W⋆ = argmin_{W ∈ Rm×d} ∥W X − Y∥²_F,
where ∥A∥F = √(∑_{i,j} a²_{ij}) is the Frobenius norm of a matrix A.
Finding W ∗ (the global minimum of a convex quadratic function) is a
simple task that boils down to solving a system of linear equations; see
also Section 1.4.2. A fancy way of saying this is that we are training a
linear neural network with one layer, see Figure 6.6 (left).
[Figure 6.6: linear neural networks with inputs x1, . . . , x5 and outputs y1, y2 — one layer with weight matrix W (left), and three layers with weight matrices W1, W2, W3 (right).]
But what if we have ℓ layers (Figure 6.6 (right))? Training such a net-
work corresponds to minimizing
∥Wℓ Wℓ−1 · · · W1 X − Y∥²_F,
over ℓ weight matrices W1, . . . , Wℓ to be learned. In case of linear neural
networks, there is no benefit in adding layers, as any linear transforma-
tion x ↦ Wℓ Wℓ−1 · · · W1 x can of course be represented as x ↦ W x with
W := Wℓ Wℓ−1 · · · W1. But from a theoretical point of view, a deep linear neu-
ral network gives us a simple playground in which we can try to under-
stand why training deep neural networks with gradient descent works,
despite the fact that the objective function is no longer convex. The hope
is that such an understanding can ultimately lead to an analysis of gradient
descent (or other suitable methods) for “real” (meaning non-linear) deep
neural networks.
In the next section, we will discuss the case where all matrices are 1 × 1,
so they are just numbers. This is arguably a toy example in our already
simple playground. Still, it gives rise to a nontrivial nonconvex function,
and the analysis of gradient descent on it will require similar ingredients
as the one on general deep linear neural networks [ACGH18].
What are the critical points, the ones where ∇f(x) vanishes? This hap-
pens when ∏_k xk = 1, in which case we have a global minimum (level 0
in Figure 6.7). But there are other critical points. Whenever at least two
of the xk are zero, the gradient also vanishes, and the value of f is 1/2 at
such a point (point 0 in Figure 6.7). This already shows that the function
cannot be convex, as for convex functions, every critical point is a global
minimum (Lemma 1.22). It is easy to see that every non-optimal critical
point must have two or more zeros.
Figure 6.7: Level sets of f(x1, x2) = ½(x1x2 − 1)²
In fact, all critical points except the global minima are saddle points.
This is because at any such point x, we can slightly perturb the (two or
more) zero entries in such a way that the product of all entries becomes
either positive or negative, so that the function value either decreases or
increases.
Figure 6.8 visualizes (scaled) negative gradients of f for d = 2; these are
the directions in which gradient descent would move from the tails of the
respective arrows. The figure already indicates that it is difficult to avoid
convergence to a global minimum, but it is possible (see Exercise 42).
We now want to show that for any dimension d, and from anywhere in
X = {x : x > 0, ∏_k xk ≤ 1}, gradient descent will converge to a global
minimum. Unfortunately, our function f is not smooth over X. For the
analysis, we will therefore show that f is smooth along the trajectory of
gradient descent.
Figure 6.8: Scaled negative gradients of f(x1, x2) = ½(x1x2 − 1)²
f(xt+1) ≤ f(xt) − (1/(2L))∥∇f(xt)∥², t ≥ 0
by Lemma 2.7.
This already shows that gradient descent cannot converge to a saddle
point: all these have (at least two) zero entries and therefore function value
1/2. But for starting point x0 ∈ X, we have f (x0 ) < 1/2, so we can never
reach a saddle while decreasing f .
But doesn’t this mean that we necessarily have to converge to a global
minimum? No, because the sublevel sets of f are unbounded, so it could in
principle happen that gradient descent runs off to infinity while constantly
improving f (xt ) (an example is gradient descent on f (x) = e−x ). Or some
other bad behavior occurs (we haven’t characterized what can go wrong).
So there is still something to prove.
How about convergence from other starting points? For x > 0 with ∏_k xk ≥
1, we also get convergence (Exercise 41). But there are also starting points
from which gradient descent will not converge to a global minimum (Ex-
ercise 42).
The following simple lemma is the key to showing that gradient de-
scent behaves nicely in our case.
Definition 6.4. Let x > 0 (componentwise), and let c ≥ 1 be a real number. x
is called c-balanced if xi ≤ cxj for all 1 ≤ i, j ≤ d.
In fact, any initial iterate x0 > 0 is c-balanced for some (possibly large) c.
Lemma 6.5. Let x > 0 be c-balanced with ∏_k xk ≤ 1. Then for any stepsize
γ > 0, x′ := x − γ∇f(x) satisfies x′ ≥ x (componentwise) and is also c-balanced.
If c = 1 (all entries of x are equal), this is easy to see since then also
all entries of ∇f (x) in (6.4) are equal.
Later we will show that for suitable
step size, we also maintain that ∏_k x′_k ≤ 1, so that gradient descent only
goes through balanced iterates.
Proof. Set ∆ := −γ(∏_k xk − 1)(∏_k xk) ≥ 0. Then the gradient descent
update assumes the form
x′_k = xk + ∆/xk ≥ xk, k = 1, . . . , d.
For i, j, we have xi ≤ cxj and xj ≤ cxi (⇔ 1/xi ≤ c/xj). We therefore get
x′_i = xi + ∆/xi ≤ cxj + ∆c/xj = cx′_j.
definition, ∇²f(x)ij is the j-th partial derivative of the i-th entry of ∇f(x).
This i-th entry is
∇f(x)_i = (∏_k xk − 1) ∏_{k≠i} xk.
Proof. The fact that ∥A∥ ≤ ∥A∥F is Exercise 43. To bound the Frobenius
norm, we use the previous lemma to compute
∇²f(x)ii = (∏_{k≠i} xk)² ≤ c²,
and for i ≠ j,
|∇²f(x)ij| ≤ 2 ∏_{k≠i} xk ∏_{k≠j} xk + ∏_{k≠i,j} xk ≤ 3c².
Hence, ∥∇²f(x)∥²_F ≤ 9d²c⁴. Taking square roots, the statement follows.
This now implies smoothness of f along the whole trajectory of gradi-
ent descent, under the usual “smooth stepsize” γ = 1/L = 1/(3dc²).
Lemma 6.8. Let x > 0 be c-balanced with ∏_k xk < 1, L = 3dc², and let γ := 1/L.
Then f is smooth with parameter L over the line segment connecting x and x′ := x − γ∇f(x).
We already know from Lemma 6.5 that x′ := x − γ∇f(x) ≥ x (componentwise) and is again c-balanced.
Proof. Imagine traveling from x to x′ along the line segment. As long as the
product of all variables remains bounded by 1, Hessians remain bounded
by Lemma 6.7, and f is smooth over the part of the segment traveled so
far, by Lemma 6.1. So f can only fail to be smooth over the whole segment
when there is y ≠ x′ on the segment such that ∏_k yk = 1. Consider the
first such y. Note that f is still smooth with parameter L over the segment
connecting x and y. Also, ∇f(x) ≠ 0 (due to x > 0, ∏_k xk < 1), so x is
not a critical point, and y results from x by a gradient descent step with
stepsize < 1/L (stepsize 1/L takes us to x′). Hence, y is also not a critical
point by Lemma 6.3, and we can’t have ∏_k yk = 1.
Consequently, f is smooth over the whole line segment connecting x
and x′ .
6.2.4 Convergence
Theorem 6.9. Let c ≥ 1 and δ > 0 such that x0 > 0 is c-balanced with δ ≤
∏_k (x0)_k < 1. Choosing stepsize
γ = 1/(3dc²),
gradient descent satisfies
f(xT) ≤ (1 − δ²/(3c⁴))^T f(x0), T ≥ 0.
This means that the loss indeed converges to its optimal value 0, and
does so with a fast exponential error decrease. Exercise 44 asks you to
prove that also the iterates themselves converge (to an optimal solution),
so gradient descent will not run off to infinity.
Proof. For each t ≥ 0, f is smooth over conv({xt , xt+1 }) with parameter
L = 3dc2 , hence Lemma 2.7 yields sufficient decrease:
f(xt+1) ≤ f(xt) − (1/(6dc²))∥∇f(xt)∥². (6.5)
For every c-balanced x with δ ≤ ∏_k xk ≤ 1, we have
∥∇f(x)∥² = 2f(x) ∑_{i=1}^{d} (∏_{k≠i} xk)²
≥ 2f(x) (d/c²) (∏_k xk)^{2−2/d}    (Lemma 6.6)
≥ 2f(x) (d/c²) (∏_k xk)²
≥ 2f(x) (d/c²) δ².
Then, (6.5) further yields
f(xt+1) ≤ f(xt) − (1/(6dc²)) · 2f(xt) (d/c²) δ² = f(xt) (1 − δ²/(3c⁴)),
proving the theorem.
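A small numerical sketch of Theorem 6.9 (illustrative only; the dimension and the 1-balanced starting point x0 = (1/2, . . . , 1/2) with δ = ∏_k(x0)_k = 2^{−d} are arbitrary choices):

```python
import numpy as np

d = 4
x = np.full(d, 0.5)                     # c-balanced with c = 1, prod(x) = 2^{-d}
c = 1.0
gamma = 1.0 / (3 * d * c**2)            # stepsize from Theorem 6.9

def f(x):
    return 0.5 * (np.prod(x) - 1.0) ** 2

def grad(x):
    p = np.prod(x)
    return (p - 1.0) * p / x            # entry i is (prod - 1) * prod_{k != i} x_k (uses x > 0)

for t in range(20001):
    if t % 5000 == 0:
        # decreases towards 0; Theorem 6.9 guarantees at least a factor (1 - delta^2/(3 c^4)) per step
        print(t, f(x))
    x = x - gamma * grad(x)
```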
This looks great: just as for strongly convex functions, we seem to have
fast convergence since the function value goes down by a constant factor
in each step. There is a catch, though. To see this, consider the starting
solution x0 = (1/2, . . . , 1/2). This is c-balanced with c = 1, but the δ that
we get is 1/2^d. Hence, the “constant factor” is
1 − 1/(3 · 4^d),
6.3 Exercises
Exercise 38. Let f : Rn → R be twice differentiable, with X ⊆ dom(f ) an open
convex set, and suppose that f is smooth with parameter L over X. Prove that
under these conditions, the largest eigenvalue of the Hessian λmax (∇2 f (x)) ≤ L
for all x ∈ X.
Exercise 39. Prove that the statement of Theorem 6.2 implies that
Exercise 40. Prove Lemma 6.3 (gradient descent does not overshoot on smooth
functions).
Exercise 41. Consider the function f(x) = ½(∏_{k=1}^{d} xk − 1)². Prove that for
any starting point x0 ∈ X = {x ∈ Rd : x > 0, ∏_k xk ≥ 1} and any ε > 0,
gradient descent attains f(xT) ≤ ε for some iteration T.
Exercise 42. Consider the function f(x) = ½(∏_{k=1}^{d} xk − 1)². Prove that for
even dimension d ≥ 2, there is a point x0 (not a critical point) such that gradient
descent does not converge to a global minimum when started at x0, regardless of
step size(s).
Exercise 43. Prove that for any matrix A, ∥A∥ ≤ ∥A∥F , where ∥·∥ is the spectral
norm and ∥·∥F the Frobenius norm.
Exercise 44. Prove that the sequence (xT)_{T≥0} of iterates in Theorem 6.9 con-
verges to an optimal solution x⋆.
Chapter 7
Newton’s Method
Contents
7.1 1-dimensional case . . . . . . . . . . . . . . . . . . . . . . . . 116
7.2 Newton’s method for optimization . . . . . . . . . . . . . . . 118
7.3 Once you’re close, you’re there. . . . . . . . . . . . . . . . . . . 120
7.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.1 1-dimensional case
The Newton method (or Newton-Raphson method, invented by Sir Isaac
Newton and formalized by Joseph Raphson) is an iterative method for
finding a zero of a differentiable univariate function f : R → R. Starting
from some number x0 , it computes
xt+1 := xt − f(xt)/f′(xt), t ≥ 0. (7.1)
Figure 7.1 shows what happens. xt+1 is the point where the tangent line
to the graph of f at (xt , f (xt )) intersects the x-axis. In formulas, xt+1 is the
solution of the linear equation
f(xt) + f′(xt)(x − xt) = 0.
The Newton step (7.1) obviously fails if f ′ (xt ) = 0 and may get out of
control if |f ′ (xt )| is very small. Any theoretical analysis will have to make
suitable assumptions to avoid this. But before going into this, we look at
Newton’s method in a benign case.
Let f(x) = x² − R, where R ∈ R+. f has two zeros, √R and −√R.
Starting for example at x0 = R, we hope to converge to √R quickly. In this
case, (7.1) becomes
xt+1 = xt − (xt² − R)/(2xt) = (1/2)(xt + R/xt). (7.2)
This is in fact the Babylonian method to compute square roots, and here we
see that it is just a special case of Newton’s method.
Can we prove that we indeed quickly converge to √R? What we im-
mediately see from (7.2) is that all iterates will be positive and hence
xt+1 = (1/2)(xt + R/xt) ≥ xt/2.
So we cannot be too fast. Suppose R ≥ 1. In order to even get xt < 2√R,
we need at least T ≥ log(R)/2 steps. It turns out that the Babylonian
method starts taking off only when xt − √R < 1/2, say (Exercise 45 asks
you to prove that it takes O(log R) steps to get there).
To watch takeoff, let us now suppose that x0 − √R < 1/2, so we are
starting close to √R already. We rewrite (7.2) as
xt+1 − √R = xt/2 + R/(2xt) − √R = (1/(2xt))(xt − √R)². (7.3)
Assuming for now that R ≥ 1/4, all iterates have value at least √R ≥
1/2, hence we get
xt+1 − √R ≤ (xt − √R)².
This means that the error goes to 0 quadratically, and
xT − √R ≤ (x0 − √R)^{2^T} < (1/2)^{2^T}, T ≥ 0. (7.4)
What does this tell us? In order to get xT − √R < ε, we only need
T = log log(1/ε) steps! Hence, it takes a while to get to roughly √R, but
from there, we achieve high accuracy very fast.
Let us do a concrete example of the practical behavior (on a computer
with IEEE 754 double arithmetic). If R = 1000, the method takes 7 steps to
get x7 − √1000 < 1/2, and then 3 more steps to get x10 equal to √1000 up to
the machine precision (53 binary digits). In this last phase, we essentially
double the number of correct digits in each iteration!
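This behavior is easy to reproduce; here is a sketch (the printed errors are what one should roughly expect, not values quoted from the text):

```python
import math

def babylonian(R, T):
    x = R                                   # start at x_0 = R
    for t in range(T):
        x = 0.5 * (x + R / x)               # x_{t+1} = (x_t + R/x_t)/2, see (7.2)
    return x

R = 1000.0
for T in (7, 8, 9, 10):
    print(T, babylonian(R, T) - math.sqrt(R))   # error drops roughly quadratically once it is < 1/2
```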
7.2 Newton’s method for optimization
Suppose we want to find a global minimum x⋆ of a differentiable con-
vex function f : R → R (assuming that a global minimum exists). Lem-
mata 1.22 and 1.23 guarantee that we can equivalently search for a zero of
the derivative f ′ . To do this, we can apply Newton’s method if f is twice
differentiable; the update step then becomes
xt+1 := xt − f′(xt)/f′′(xt) = xt − f′′(xt)⁻¹ f′(xt), t ≥ 0. (7.5)
There is no reason to restrict to d = 1. Here is Newton’s method for min-
imizing a convex function f : Rd → R. We choose x0 arbitrarily and then
iterate:
xt+1 := xt − ∇2 f (xt )−1 ∇f (xt ), t ≥ 0. (7.6)
The update vector ∇2 f (xt )−1 ∇f (xt ) is the result of a matrix-vector mul-
tiplication: we invert the Hessian at xt and multiply the result with the
gradient at xt . As before, this fails if the Hessian is not invertible, and may
get out of control if the Hessian has small norm.
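As an illustration of iteration (7.6) (a sketch only; the convex test function below is made up, and no safeguard against singular Hessians is included):

```python
import numpy as np

def newton(grad, hess, x0, T):
    x = x0
    for t in range(T):
        x = x - np.linalg.solve(hess(x), grad(x))   # x_{t+1} = x_t - Hessian^{-1} * gradient
    return x

# example: f(x) = sum_i exp(x_i) - sum_i x_i, minimized at x = 0
grad = lambda x: np.exp(x) - 1.0
hess = lambda x: np.diag(np.exp(x))

print(newton(grad, hess, x0=np.full(3, 0.5), T=6))  # very close to the minimizer 0
```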
We have introduced iteration (7.6) simply as a (more or less natural)
generalization of (7.5), but there’s more to it. If we consider (7.6) as a
special case of a general update scheme
xt+1 = xt − H(xt )∇f (xt ),
where H(xt ) ∈ Rd×d is some matrix, then we see that also gradient de-
scent (2.1) is of this form, with H(xt ) = γI. Hence, Newton’s method can
also be thought of as “adaptive gradient descent” where the adaptation is
w.r.t. the local geometry of the function at xt . Indeed, as we show next,
this allows Newton’s method to converge on all nondegenerate quadratic
functions in one step, while gradient descent only does so with the right
stepsize on “beautiful” quadratic functions whose sublevel sets are Eu-
clidean balls (Exercise 22).
Lemma 7.1. A nondegenerate quadratic function is a function of the form
f(x) = (1/2) x⊤Mx − q⊤x + c,
where M ∈ Rd×d is an invertible symmetric matrix, q ∈ Rd, c ∈ R. Let x⋆ =
M −1 q be the unique solution of ∇f (x) = 0 (the unique global minimum if f is
convex). With any starting point x0 ∈ Rd , Newton’s method (7.6) yields x1 = x⋆ .
Proof. We have ∇f(x) = Mx − q (this implies x⋆ = M⁻¹q) and ∇²f(x) =
M. Hence,
x1 = x0 − ∇²f(x0)⁻¹∇f(x0) = x0 − M⁻¹(Mx0 − q) = M⁻¹q = x⋆.
Hence, while gradient descent suffers if the coordinates are at very dif-
ferent scales, Newton’s method doesn’t.
We conclude the general exposition with another interpretation of New-
ton’s method: each step minimizes the local second-order Taylor approxi-
mation.
xt+1 = argmin_{x∈Rd} f(xt) + ∇f(xt)⊤(x − xt) + (1/2)(x − xt)⊤∇²f(xt)(x − xt).
a step size to (7.6) and always making only steps that decrease the function
value (which may not happen under the full Newton step).
An alternative is to use gradient descent to get us sufficiently close to
the global minimum, and then switch to Newton’s method for the rest. In
Chapter 2, we have seen that under favorable conditions, we may know
when gradient descent has taken us close enough.
In practice, Newton’s method is often (but not always) much faster
than gradient descent in terms of the number of iterations. The price to pay
is a higher iteration cost, since we need to compute (and invert) Hessians.
After this disclaimer, let us state the main result right away. We follow
Vishnoi [Vis15], except that we do not require convexity.
Theorem 7.4. Let f : dom(f ) → R be twice differentiable with a critical
point x⋆ . Suppose that there is a ball X ⊆ dom(f ) with center x⋆ such that
the following two properties hold.
(i) Bounded inverse Hessians: There exists a real number µ > 0 such that
∥∇²f(x)⁻¹∥ ≤ 1/µ, ∀x ∈ X.
(ii) Lipschitz continuous Hessians: There exists a real number B ≥ 0 such that
∥∇²f(x) − ∇²f(y)∥ ≤ B ∥x − y∥, ∀x, y ∈ X.
In both cases, the matrix norm is the spectral norm defined in Lemma 2.6. Prop-
erty (i) in particular stipulates that Hessians are invertible at all points in X.
Then, for xt ∈ X and xt+1 resulting from the Newton step (7.6), we have
∥xt+1 − x⋆∥ ≤ (B/(2µ)) ∥xt − x⋆∥².
As an example, let us consider a nondegenerate quadratic function f
(constant Hessian M = ∇2 f (x) for all x; see Lemma 7.1). Then f has ex-
actly one critical point x⋆ . Property (i) is satisfied with µ = 1/∥M −1 ∥ over
X = Rd ; property (ii) is satisfied for B = 0. According to the statement of
the theorem, Newton’s method will thus reach x⋆ after one step—which
we already know from Lemma 7.1.
In general, there could be several critical points for which properties
(i) and (ii) hold, and it may seem surprising that the theorem makes a
statement about all of them. But in fact, if xt is far away from such a
critical point, the statement allows xt+1 to be even further away from it; we
cannot expect to make progress towards all critical points simultaneously.
The theorem becomes interesting only if we are very close to some critical
point. In this case, we will actually converge to it. In particular, this critical
point is then isolated and the only one nearby, so that Newton’s method
cannot avoid getting there.
Corollary 7.5 (Exercise 47). With the assumptions and terminology of Theo-
rem 7.4, and if x0 ∈ X satisfies
∥x0 − x⋆∥ ≤ µ/B,
then Newton’s method (7.6) yields
∥xT − x⋆∥ ≤ (µ/B) (1/2)^{2^T − 1}, T ≥ 0.
Hence, we have a bound as in (7.4) for the last phase of the Babylonian
method: in order to get ∥xT − x⋆∥ < ε, we only need T = log log(1/ε) steps.
But before this fast behavior kicks in, we need to be µ/B-close to x⋆ al-
ready. It then necessarily follows that x0 is this close to only one critical point.
An intuitive reason for a unique critical point near x0 (and for fast con-
vergence to it) is that under our assumptions, the Hessians we encounter
are almost constant when we are close to x⋆ . This means that locally, our
function behaves almost like a nondegenerate quadratic function which
has truly constant Hessians and allows Newton’s method to converge
to its unique critical point in one step (Lemma 7.1).
Lemma 7.6 (Exercise 48). With the assumptions and terminology of Theorem 7.4,
and if x0 ∈ X satisfies
∥x0 − x⋆∥ ≤ µ/B,
then the Hessians in Newton’s method satisfy the relative error bound
∥∇²f(xt) − ∇²f(x⋆)∥ / ∥∇²f(x⋆)∥ ≤ (1/2)^{2^t − 1}, t ≥ 0.
We now still owe the reader the proof of the main convergence result,
Theorem 7.4:
Proof of Theorem 7.4. To simplify notation, let us abbreviate H := ∇2 f , x =
xt , x′ = xt+1 . Subtracting x⋆ from both sides of (7.6), we get
x′ − x⋆ = x − x⋆ − H(x)⁻¹∇f(x)
= x − x⋆ + H(x)⁻¹ (∇f(x⋆) − ∇f(x))
= x − x⋆ + H(x)⁻¹ ∫_0^1 H(x + t(x⋆ − x))(x⋆ − x) dt.
The last step, which applies the fundamental theorem of calculus, needs
some explanations. In fact, we have applied it to each component hi (t) of
the vector-valued function h(t) = ∇f (x + t(x⋆ − x)):
hi(1) − hi(0) = ∫_0^1 h′i(t) dt, i = 1, . . . , d,
where h′(t) has components h′1(t), . . . , h′d(t), and the integral is also under-
stood componentwise. Furthermore, as hi(t) = ∂f/∂xi (x + t(x⋆ − x)), the chain
rule yields h′i(t) = ∑_{j=1}^{d} ∂²f/(∂xi ∂xj) (x + t(x⋆ − x)) (x⋆_j − x_j). This summarizes to
h′(t) = H(x + t(x⋆ − x))(x⋆ − x).
Also note that we are allowed to apply the fundamental theorem of
calculus in the first place, since f is twice continuously differentiable over
X (as a consequence of assuming Lipschitz continuous Hessians), so also
h′ (t) is continuous.
After this justifying intermezzo, we further massage the expression we
have obtained last. Using
x − x⋆ = H(x)⁻¹H(x)(x − x⋆) = H(x)⁻¹ ∫_0^1 −H(x)(x⋆ − x) dt,
Taking norms, we have
∥x′ − x⋆∥ ≤ ∥H(x)⁻¹∥ · ∥ ∫_0^1 (H(x + t(x⋆ − x)) − H(x)) (x⋆ − x) dt ∥,
We can now use the properties (i) and (ii) (bounded inverse Hessians, Lip-
schitz continuous Hessians) to conclude that
∥x′ − x⋆∥ ≤ (1/µ) ∥x⋆ − x∥ ∫_0^1 B ∥t(x⋆ − x)∥ dt = (B/µ) ∥x⋆ − x∥² ∫_0^1 t dt = (B/(2µ)) ∥x⋆ − x∥².
How realistic are properties (i) and (ii)? If f is twice continuously dif-
ferentiable (meaning that the second derivative ∇2 f is continuous), then
we will always find suitable values of µ and B over a ball X with center
x⋆—provided that ∇²f(x⋆) is invertible.
Indeed, already in the one-dimensional case, we see that under f′′(x⋆) =
0 (vanishing second derivative at the global minimum), Newton’s method
will in the worst case reduce the distance to x⋆ at most by a constant factor in
each step, no matter how close to x⋆ we start. Exercise 50 asks you to find
such an example. In such a case, we have linear convergence, but the fast
quadratic convergence (O(log log(1/ε)) steps) cannot be proven.
One way to ensure bounded inverse Hessians is to require strong con-
vexity over X.
Lemma 7.7 (Exercise 52). Let f : dom(f ) → R be twice differentiable and
strongly convex with parameter µ over an open convex subset X ⊆ dom(f )
according to Definition 2.10, meaning that
f(y) ≥ f(x) + ∇f(x)⊤(y − x) + (µ/2)∥x − y∥², ∀x, y ∈ X.
Then ∇2 f (x) is invertible and ∥∇2 f (x)−1 ∥ ≤ 1/µ for all x ∈ X, where ∥ · ∥ is
the spectral norm defined in Lemma 2.6.
7.4 Exercises
Exercise 45. Consider the Babylonian method (7.2). Prove that we get xT −
√R < 1/2 for T = O(log R).
Exercise 46. Prove Lemma 7.2!
Exercise 47. Prove Corollary 7.5!
Exercise 48. Prove Lemma 7.6!
Exercise 49. Prove Lemma 7.3!
Exercise 50. Let δ > 0 be any real number. Find an example of a convex function
f : R → R such that (i) the unique global minimum x⋆ has a vanishing second
derivative f ′′ (x⋆ ) = 0, and (ii) Newton’s method satisfies
|xt+1 − x⋆ | ≥ (1 − δ)|xt − x⋆ |,
for all xt ̸= x⋆ .
Exercise 51. This exercise is just meant to recall some basics around integrals.
Show that for a vector-valued function g : R → Rd, the inequality
∥ ∫_0^1 g(t) dt ∥ ≤ ∫_0^1 ∥g(t)∥ dt
holds, where ∥ · ∥ is the 2-norm (always assuming that the functions under consid-
eration are integrable)! You may assume (i) that integrals are linear:
∫_0^1 (λ1 g1(t) + λ2 g2(t)) dt = λ1 ∫_0^1 g1(t) dt + λ2 ∫_0^1 g2(t) dt,
and (ii) that if g(t) ≥ 0 for all t ∈ [0, 1], then ∫_0^1 g(t) dt ≥ 0.
Exercise 52. Prove Lemma 7.7! You may want to proceed in the following steps.
(i) Prove that the function g(x) = f(x) − (µ/2)∥x∥² is convex over X (see also
Exercise 20).
(iii) Prove that all eigenvalues of ∇2 f (x)−1 are positive and at most 1/µ.
(iv) Prove that for a symmetric matrix M , the spectral norm ∥M ∥ is the largest
absolute eigenvalue.
Chapter 8
Quasi-Newton Methods
Contents
8.1 The secant method . . . . . . . . . . . . . . . . . . . . . . . . 128
8.2 The secant condition . . . . . . . . . . . . . . . . . . . . . . . 130
8.3 Quasi-Newton methods . . . . . . . . . . . . . . . . . . . . . 130
8.4 Greenstadt’s approach (Optional Material) . . . . . . . . . . . 131
8.4.1 The method of Lagrange multipliers . . . . . . . . . . 133
8.4.2 Application to Greenstadt’s Update . . . . . . . . . . 134
8.4.3 The Greenstadt family . . . . . . . . . . . . . . . . . . 135
8.4.4 The BFGS method . . . . . . . . . . . . . . . . . . . . 138
8.4.5 The L-BFGS method . . . . . . . . . . . . . . . . . . . 139
8.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
127
The main computational bottleneck in Newton’s method (7.6) is the
computation and inversion of the Hessian matrix in each step. This matrix
has size d × d, so it will take up to O(d3 ) time to invert it (or to solve the
system ∇2 f (xt )∆x = −∇f (xt ) that gives us the next Newton step ∆x).
Already in the 1950s, attempts were made to circumvent this costly step,
the first one going back to Davidon [Dav59].
In this chapter, we will (for a change) not prove convergence results;
rather, we focus on the development of Quasi-Newton methods, and how
state-of-the-art methods arise from first principles. To motivate the ap-
proach, let us go back to the 1-dimensional case.
8.1 The secant method
The secant method replaces the derivative f′(xt) in the Newton step (7.1) by the
finite difference approximation (f(xt) − f(xt−1))/(xt − xt−1), yielding the iteration
xt+1 := xt − f(xt) (xt − xt−1) / (f(xt) − f(xt−1)), t ≥ 1, (8.1)
which approximates the Newton step (two starting values x0, x1 need to be chosen here).
sen here). Figure 8.1 shows what the method does: it constructs the line
through the two points (xt−1 , f (xt−1 )) and (xt , f (xt )) on the graph of f ; the
next iterate xt+1 is where this line intersects the x-axis. Exercise 53 asks
you to formally prove this.
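A sketch of the secant iteration (the test function and starting values are made up; the guard avoids division by zero once the iterates have converged):

```python
def secant(f, x0, x1, T):
    for _ in range(T):
        if f(x1) == f(x0):          # guard against division by zero once converged
            break
        x0, x1 = x1, x1 - f(x1) * (x1 - x0) / (f(x1) - f(x0))   # next x-axis intersection
    return x1

# find the zero sqrt(2) of f(x) = x^2 - 2, starting from x0 = 1, x1 = 2
print(secant(lambda x: x * x - 2.0, 1.0, 2.0, 10))
```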
8.2 The secant condition
Applying finite difference approximation to the second derivative of f
(we’re still in the 1-dimensional case), we get
Ht := (f′(xt) − f′(xt−1))/(xt − xt−1) ≈ f′′(xt),
which we can write as
f ′ (xt ) − f ′ (xt−1 ) = Ht (xt − xt−1 ) ≈ f ′′ (xt )(xt − xt−1 ). (8.3)
Now, while Newton’s method for optimization uses the update step
xt+1 = xt − f ′′ (xt )−1 f ′ (xt ), t ≥ 0,
the secant method works with the approximation Ht ≈ f ′′ (xt ):
xt+1 = xt − Ht−1 f ′ (xt ), t ≥ 1. (8.4)
The fact that Ht approximates f ′′ (xt ) in the twice differentiable case
was our motivation for the secant method, but in the method itself, there
is no reference to f ′′ (which is exactly the point). All that is needed is the
secant condition from (8.3) that defines Ht :
f ′ (xt ) − f ′ (xt−1 ) = Ht (xt − xt−1 ). (8.5)
This view can be generalized to higher dimensions. If f : Rd → R is
differentiable, (8.4) becomes
xt+1 = xt − Ht−1 ∇f (xt ), t ≥ 1, (8.6)
where Ht ∈ Rd×d is now supposed to be a symmetric matrix satisfying the
d-dimensional secant condition
∇f (xt ) − ∇f (xt−1 ) = Ht (xt − xt−1 ). (8.7)
We might therefore hope that Ht ≈ ∇2 f (xt ), and this would mean that
(8.6) approximates Newton’s method. Therefore, whenever we use (8.6)
with a symmetric matrix satisfying the secant condition (8.7), we say that
we have a Quasi-Newton method.
In the 1-dimensional case, there is only one Quasi-Newton method—
the secant method (8.1). Indeed, equation (8.5) uniquely defines the num-
ber Ht in each step.
But in the d-dimensional case, the matrix Ht in the secant condition is
underdetermined, starting from d = 2: Taking the symmetry requirement
into account, (8.7) is a system of d equations in d(d + 1)/2 unknowns, so if
it is satisfiable at all, there are many solutions Ht . This raises the question
of which one to choose, and how to do so efficiently; after all, we want to
get some savings over Newton’s method.
Newton’s method is a Quasi-Newton method if and only if f is a non-
degenerate quadratic function (Exercise 54). Hence, Quasi-Newton meth-
ods do not generalize Newton’s method but form a family of related algo-
rithms.
The first Quasi-Newton method was developed by William C. Davi-
don in 1956; he desperately needed iterations that were faster than those
of Newton’s method in order to obtain results in the short time spans be-
tween expected failures of the room-sized computer that he used to run
his computations on.
But the paper he wrote about his new method got rejected for lacking
a convergence analysis, and for allegedly dubious notation. It became a
very influential Technical Report in 1959 [Dav59] and was finally officially
published in 1991, with a foreword giving the historical context [Dav91].
Ironically, Quasi-Newton methods are today the methods of choice in a
number of relevant machine learning applications.
We draw some intuition from (the analysis of) Newton’s method. Re-
call that we have shown ∇2 f (xt ) to fluctuate only very little in the region
of extremely fast convergence (Lemma 7.6); in fact, Newton’s method is
optimal (one step!) when ∇2 f (xt ) is actually constant— this is the case of
a quadratic function (Lemma 7.1). Hence, in a Quasi-Newton method, it
also makes sense to have that Ht ≈ Ht−1, or Ht⁻¹ ≈ Ht−1⁻¹.
Greenstadt’s approach from 1970 [Gre70] is to update Ht−1⁻¹ by an “error
matrix” Et to obtain
Ht⁻¹ = Ht−1⁻¹ + Et.
Moreover, the errors should be as small as possible, subject to the con-
straints that Ht−1 is symmetric and satisfies the secant condition (8.7). A
simple measure of error introduced by an update matrix E is its squared
Frobenius norm
∥E∥²_F := ∑_{i=1}^{d} ∑_{j=1}^{d} e²_{ij}.
Greenstadt’s approach can now be distilled into the following convex
constrained minimization problem in the d2 variables Eij :
∇f (x⋆ )⊤ = λ⊤ C.
∇f (x⋆ )⊤ (x − x⋆ ) = λ⊤ C(x − x⋆ ) = λ⊤ (e − e) = 0.
8.4.2 Application to Greenstadt’s Update
In order to apply this method to (8.10), we need to compute the gradient
of f(E) = ½∥AEA⊤∥²_F. Formally, this is a d²-dimensional vector, but it is
customary and more practical to write it as a matrix again,
∇f(E) = (∂f(E)/∂Eij)_{1≤i,j≤d}.
Fact 8.2 (Exercise 56). Let A, B ∈ Rd×d be two matrices. With f : Rd×d → R,
f(E) := ½∥AEB∥²_F, we have
∇f (E) = A⊤ AEBB ⊤ .
The second step is to write the system of equations Ey = r, E ⊤ − E = 0
in Greenstadt’s convex program (8.10) in matrix form Cx = e so that we
can apply the method of Lagrange multipliers according to Theorem 8.1.
As there are d + d2 equations in d2 variables, it is best to think of the
rows of C as being indexed with elements i ∈ [d] := {1, . . . , d} for the first
d equations Ey = r, and pairs (i, j) ∈ [d] × [d] for the last d2 symmetry
constraints (more than half of which are redundant but we don’t care).
Columns of C are indexed with pairs (i, j) as well.
Let us denote by λ ∈ Rd the Lagrange multipliers for the first d equa-
tions and Γ ∈ Rd×d the ones for the last d2 ones.
In column (i, j) of C corresponding to variable Eij , we have entry yj in
row i as well as entries 1 (row (j, i)) and −1 (row (i, j)). Taking the inner
product with the Lagrange multipliers, this column therefore yields
λi yj + Γji − Γij .
After aggregating these entries into a d × d matrix, Theorem 8.1 tells us
that we should aim for equality with ∇f (E) as derived in Fact 8.2. We
have proved the following intermediate result.
Lemma 8.3. An update matrix E ⋆ satisfying the constraints Ey = r (secant
condition in the next step) and E ⊤ − E = 0 (symmetry) is a minimizer of the
error function f(E) := ½∥AEA⊤∥²_F subject to the aforementioned constraints if
and only if there exists a vector λ ∈ Rd and a matrix Γ ∈ Rd×d such that
W E ⋆ W = λy⊤ + Γ⊤ − Γ, (8.11)
where W := A⊤ A (a symmetric and positive definite matrix).
Note that λy⊤ is the outer product of a column and a row vector and
hence a matrix. As we assume A to be invertible, the quadratic func-
tion f (E) is easily seen to be strongly convex and as a consequence has
a unique minimizer E ⋆ subject to the set of linear equations in (8.10) (see
Lemma 2.12 which also applies if we minimize over a closed set). Hence,
we know that the minimizer E ⋆ and corresponding Lagrange multipliers
λ, Γ exist.
To also eliminate λ, we now use (8.12)—the secant condition in the next
step—to get
Ey = ½ M (λy⊤ + yλ⊤) M y = r.
Premultiplying with 2M⁻¹ gives
2M⁻¹r = (λy⊤ + yλ⊤) M y = λ y⊤My + y λ⊤My.
Hence,
λ = (1/(y⊤My)) (2M⁻¹r − y λ⊤My). (8.17)
To get rid of λ on the right-hand side, we premultiply this with y⊤M to
obtain
z := y⊤Mλ = (1/(y⊤My)) (2y⊤r − (y⊤My)(λ⊤My)) = 2y⊤r/(y⊤My) − z,
using that y⊤Mλ = λ⊤My = z. It follows that
z = λ⊤My = y⊤r/(y⊤My).
This in turn can be substituted into the right-hand side of (8.17) to remove
λ there, and we get
λ = (1/(y⊤My)) (2M⁻¹r − (y⊤r)/(y⊤My) · y).
Consequently,
λy⊤ = (1/(y⊤My)) (2M⁻¹ry⊤ − (y⊤r)/(y⊤My) · yy⊤),
yλ⊤ = (1/(y⊤My)) (2yr⊤M⁻¹ − (y⊤r)/(y⊤My) · yy⊤).
This gives us an explicit formula for E, by substituting the previous ex-
pressions back into (8.16). For this, we compute
Mλy⊤M = (1/(y⊤My)) (2ry⊤M − (y⊤r)/(y⊤My) · Myy⊤M),
Myλ⊤M = (1/(y⊤My)) (2Myr⊤ − (y⊤r)/(y⊤My) · Myy⊤M),
and consequently,
E = ½ M (λy⊤ + yλ⊤) M = (1/(y⊤My)) (ry⊤M + Myr⊤ − (y⊤r)/(y⊤My) · Myy⊤M). (8.18)
Finally, we use r = σ − Hy to obtain the update matrix E⋆ in terms
of the original parameters H = Ht−1⁻¹ (previous approximation of the in-
verse Hessian that we now want to update to Ht⁻¹ = H′ = H + E⋆),
σ = xt − xt−1 (previous Quasi-Newton step) and y = ∇f(xt) − ∇f(xt−1)
(previous change in gradients). This gives us the Greenstadt family of
Quasi-Newton methods.
where H0 = I (or some other positive definite matrix), and Ht⁻¹ = Ht−1⁻¹ + Et is
chosen for all t ≥ 1 in such a way that Ht⁻¹ is symmetric and satisfies the secant
condition
∇f (xt ) − ∇f (xt−1 ) = Ht (xt − xt−1 ).
For any fixed t, set
H := Ht−1⁻¹,
H′ := Ht⁻¹,
σ := xt − xt−1,
y := ∇f(xt) − ∇f(xt−1),
and define
E⋆ = (1/(y⊤My)) (σy⊤M + Myσ⊤ − Hyy⊤M − Myy⊤H − (1/(y⊤My)) (y⊤σ − y⊤Hy) Myy⊤M). (8.19)
8.4.4 The BFGS method
In his paper, Greenstadt suggested two obvious choices for the matrix M
in Definition 8.4, namely M = H (the previous approximation of the in-
verse Hessian) and M = I. In the next paper of the same issue of the same
journal, Goldfarb suggested to use the matrix M = H ′ , the next approxi-
mation of the inverse Hessian. Even though we don’t yet have it, we can
use it in the formula (8.19) since we know that H ′ will by design satisfy the
secant condition H ′ y = σ. And as M always appears next to y in (8.19),
M y = H ′ y = σ, so H ′ disappears from the formula!
Definition 8.5. The BFGS method is the Greenstadt method with parameter
M = H′ = Ht⁻¹ in step t, in which case the update matrix E⋆ assumes the form
E⋆ = (1/(y⊤σ)) (2σσ⊤ − Hyσ⊤ − σy⊤H − (1/(σ⊤y)) (y⊤σ − y⊤Hy) σσ⊤)
   = (1/(y⊤σ)) (−Hyσ⊤ − σy⊤H + (1 + (y⊤Hy)/(y⊤σ)) σσ⊤), (8.20)
where H = Ht−1⁻¹, σ = xt − xt−1, y = ∇f(xt) − ∇f(xt−1).
We leave it as Exercise 57 (i) to prove that the denominator y⊤ σ appear-
ing twice in the formula is positive, unless the function f is flat between
the iterates xt−1 and xt . And under y⊤ σ > 0, the BFGS method has an-
other nice property: if the previous matrix H is positive definite, then also
the next matrix H ′ is positive definite; see Exercise 57 (ii). In this sense, the
matrices Ht−1 behave like proper inverse Hessians.
The method is named after Broyden, Fletcher, Goldfarb and Shanno
who all came up with it independently around 1970. Greenstadt’s name is
mostly forgotten.
Let’s take a step back and see what we have achieved. Recall that our
starting point was that Newton’s method needs to compute and invert
Hessian matrices in each iteration and therefore has in practice a cost of
O(d3 ) per iteration. Did we improve over this?
First of all, any method in Greenstadt’s family avoids the computation
of Hessian matrices altogether. Only gradients are needed. In the BFGS
method in particular, the cost per iteration drops to O(d2 ). Indeed, the
computation of the update matrix E ⋆ in Definition 8.5 reduces to matrix-
vector multiplications and outer-product computations, all of which can
be done in O(d2 ) time.
Newton and Quasi-Newton methods are often performed with scaled
steps. This means that the iteration becomes
xt+1 := xt − αt Ht⁻¹ ∇f(xt), (8.21)
for some αt ∈ R+. This parameter can for example be chosen such that
f (xt+1 ) is minimized (line search). Another approach is backtracking line
search where we start with αt = 1, and as long as this does not lead to
sufficient progress, we halve αt. Line search ensures that the matrices Ht⁻¹
in the BFGS method remain positive definite [Gol70].
As the Greenstadt update method just depends on the step σ = xt −
xt−1 but not on how it was obtained, the update works in exactly the same
way as before even if scaled steps are being used.
Observation 8.6. The BFGS update (8.20) can equivalently be written in the product form
H′ = (I − σy⊤/(y⊤σ)) H (I − yσ⊤/(y⊤σ)) + σσ⊤/(y⊤σ). (8.22)
To verify this, simply expand the product in the right-hand side and
compare with (8.20).
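The product form (8.22) translates directly into code; the following is a sketch only (no line search, and no safeguards beyond the assumption y⊤σ > 0 stated in the comment):

```python
import numpy as np

def bfgs_update(H, sigma, y):
    """Returns H' from (8.22), given H = H_{t-1}^{-1}, sigma = x_t - x_{t-1}, y = gradient difference."""
    d = len(sigma)
    rho = 1.0 / (y @ sigma)                  # assumes y^T sigma > 0
    I = np.eye(d)
    return (I - rho * np.outer(sigma, y)) @ H @ (I - rho * np.outer(y, sigma)) \
        + rho * np.outer(sigma, sigma)

# tiny usage example with made-up vectors
H = np.eye(2)
print(bfgs_update(H, sigma=np.array([1.0, 0.0]), y=np.array([2.0, 1.0])))
```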
We further observe that we do not need the actual matrix H′ = Ht⁻¹ to
perform the next Quasi-Newton step (8.6), but only the vector H ′ ∇f (xt ).
Here is the crucial insight.
Lemma 8.7. Let H ∈ Rd×d be symmetric, σ, y ∈ Rd with y⊤σ ≠ 0, and
H′ = (I − σy⊤/(y⊤σ)) H (I − yσ⊤/(y⊤σ)) + σσ⊤/(y⊤σ).
Let g′ ∈ Rd . Suppose that we have an oracle to compute s = Hg for any vector
g. Then s′ = H ′ g′ can be computed with one oracle call and O(d) additional
arithmetic operations, assuming that σ and y are known.
H′g′ = (I − σy⊤/(y⊤σ)) H (I − yσ⊤/(y⊤σ)) g′ + (σσ⊤/(y⊤σ)) g′,
with the abbreviations g := (I − yσ⊤/(y⊤σ)) g′, s := Hg, w := (I − σy⊤/(y⊤σ)) s, h := (σσ⊤/(y⊤σ)) g′, and z := w + h.
h = (σσ⊤/(y⊤σ)) g′ = σ (σ⊤g′)/(y⊤σ),
so h can be computed with two inner products, a real division, and a mul-
tiplication of σ with a scalar. For g, we obtain
g = (I − yσ⊤/(y⊤σ)) g′ = g′ − y (σ⊤g′)/(y⊤σ).
Finally, H′g′ = z = w + h.
How do we implement the oracle? We simply apply the previous
lemma recursively. Let
σk = xk − xk−1,
yk = ∇f(xk) − ∇f(xk−1).
By Lemma 8.7, the runtime of BFGS-STEP(t, ∇f(xt)) is O(td). For t >
d, this is slower (and needs more memory) than the standard BFGS step
according to Definition 8.5 which always takes O(d²) time.
The benefit of the recursive variant is that it can easily be adapted to
a step that is faster (and needs less memory) than the standard BFGS step.
The idea is to let the recursion bottom out after a fixed number m of recur-
sive calls (in practice, values of m ≤ 10 are not uncommon). The step then
has runtime O(md) which is a substantial saving over the standard step if
m is much smaller than d.
The only remaining question is what we return when the recursion
now bottoms out prematurely at k = t − m. As we don’t know the matrix
Ht−m⁻¹, we cannot return Ht−m⁻¹ g′ (which would be the correct output in this
case). Instead, we pretend that we have started the whole method just now
and use our initial matrix H0 instead of Ht−m. The resulting algorithm is
depicted in Algorithm 2.
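Since Algorithm 2 itself is not reproduced here, the following Python sketch shows one possible rendering of the truncated recursion (the function name, the handling of the memory m, and the choice H0 = H0_diag · I are illustrative assumptions):

```python
import numpy as np

def lbfgs_apply(g, sigmas, ys, m, H0_diag=1.0):
    """Approximates H_t^{-1} g using only the last m pairs (sigma_k, y_k),
    by applying the recursion of Lemma 8.7 and bottoming out with H0 = H0_diag * I."""
    def recurse(k, g):
        if k < 0 or len(sigmas) - 1 - k >= m:   # recursion bottoms out after m levels
            return H0_diag * g
        sigma, y = sigmas[k], ys[k]
        rho = 1.0 / (y @ sigma)
        q = g - rho * y * (sigma @ g)           # (I - y sigma^T / y^T sigma) g
        s = recurse(k - 1, q)                   # "oracle" call one level down
        w = s - rho * sigma * (y @ s)           # (I - sigma y^T / y^T sigma) s
        return w + rho * sigma * (sigma @ g)    # add (sigma sigma^T / y^T sigma) g
    return recurse(len(sigmas) - 1, g)

# made-up example: two stored pairs, memory m = 2
sigmas = [np.array([1.0, 0.0]), np.array([0.5, 0.5])]
ys     = [np.array([2.0, 1.0]), np.array([1.0, 2.0])]
print(lbfgs_apply(np.array([1.0, -1.0]), sigmas, ys, m=2))
```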
the matrix H ′ will satisfy the secant condition by design, irrespective of H.
8.5 Exercises
Exercise 53. Consider a step of the secant method:
xt+1 = xt − f(xt) (xt − xt−1)/(f(xt) − f(xt−1)), t ≥ 1.
Assuming that xt ̸= xt−1 and f (xt ) ̸= f (xt−1 ), prove that the line through
the two points (xt−1 , f (xt−1 )) and (xt , f (xt )) intersects the x-axis at the point
x = xt+1 .
Exercise 55. Prove the direction (i)⇒(ii) of Theorem 8.1! You may want to
proceed in the following steps.
Exercise 56. Prove Fact 8.2!
(ii) Prove that if H is positive definite and y⊤ σ > 0, then also H ′ is positive
definite. You may want to use the product form of the BFGS update as
developed in Observation 8.6.
Chapter 9
Coordinate Descent
Contents
9.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
9.2 Alternative analysis of gradient descent . . . . . . . . . . . . 146
9.2.1 The Polyak-Łojasiewicz inequality . . . . . . . . . . . 146
9.2.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 147
9.3 Coordinate-wise smoothness . . . . . . . . . . . . . . . . . . 148
9.4 Coordinate descent algorithms . . . . . . . . . . . . . . . . . 149
9.4.1 Randomized coordinate descent . . . . . . . . . . . . 150
9.4.2 Importance Sampling . . . . . . . . . . . . . . . . . . 152
9.4.3 Steepest coordinate descent . . . . . . . . . . . . . . . 153
9.4.4 Greedy coordinate descent . . . . . . . . . . . . . . . 156
9.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
9.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
9.1 Overview
In large-scale learning, an issue with the gradient descent algorithms dis-
cussed in Chapter 2 is that in every iteration, we need to compute the full
gradient ∇f (xt ) in order to obtain the next iterate xt+1 . If the number of
variables d is large, this can be very costly. The idea of coordinate descent
is to update only one coordinate of xt at a time, and to do this, we only
need to compute one coordinate of ∇f(xt) (one partial derivative). We expect
this to be faster by a factor of d than computing the full gradient
and updating the full iterate.
But we also expect to pay a price for this in terms of a higher number of
iterations. In this chapter, we will analyze a number of coordinate descent
variants on smooth and strongly convex functions. It turns out that in
the worst case, the number of iterations will increase by a factor of d, so
nothing is gained (but also nothing is lost).
But under suitable additional assumptions about the function f , coor-
dinate descent variants can actually lead to provable speedups. In prac-
tice, coordinate descent algorithms are popular due to their simplicity and
often good performance.
Much of this chapter’s material is from Karimi et al. [KNS16] and Nu-
tini et al. [NSL+ 15]. As a warm-up, we return to gradient descent.
9.2 Alternative analysis of gradient descent
9.2.1 The Polyak-Łojasiewicz inequality
Definition 9.1. Let f : Rd → R be differentiable with a global minimum x⋆. We
say that f satisfies the Polyak-Łojasiewicz inequality (PL inequality) if the following holds for some µ > 0:
$$\frac{1}{2}\|\nabla f(x)\|^2 \ge \mu\bigl(f(x) - f(x^\star)\bigr) \qquad \forall x \in \mathbb{R}^d. \qquad (9.1)$$
The inequality was proposed by Polyak in 1963, and also by Łojasiewicz
in the same year; see Karimi et al. and the references therein [KNS16]. It
says that the squared gradient norm at every point x is at least propor-
tional to the error in objective function value at x. It also directly implies
that every critical point (a point where ∇f (x) = 0) is a minimizer of f .
The interesting result for us is that strong convexity over Rd implies
the PL inequality.
Lemma 9.2 (Strong Convexity ⇒ PL inequality). Let f : Rd → R be dif-
ferentiable and strongly convex with parameter µ > 0 (in particular, a global
minimum x⋆ exists by Lemma 2.12). Then f satisfies the PL inequality for the
same µ.
Proof. Using strong convexity, we get
$$\begin{aligned} f(x^\star) &\ge f(x) + \nabla f(x)^\top (x^\star - x) + \frac{\mu}{2}\|x^\star - x\|^2\\ &\ge f(x) + \min_{y}\Bigl(\nabla f(x)^\top (y - x) + \frac{\mu}{2}\|y - x\|^2\Bigr)\\ &= f(x) - \frac{1}{2\mu}\|\nabla f(x)\|^2. \end{aligned}$$
The latter equation results from solving a convex minimization problem
in y by finding a critical point (Lemma 1.22). The PL inequality follows.
9.2.2 Analysis
We can now easily analyze gradient descent on smooth functions that in
addition satisfy the PL inequality. By Exercise 58, this result also covers
some nonconvex optimization problems.
Theorem 9.3. Let f : Rd → R be differentiable with a global minimum x⋆.
Suppose that f is smooth with parameter L according to (3.5) and satisfies the PL
inequality (9.1) with parameter µ > 0. Choosing stepsize γ = 1/L,
gradient descent (2.1) with arbitrary x0 satisfies
$$f(x_T) - f(x^\star) \le \left(1 - \frac{\mu}{L}\right)^T \bigl(f(x_0) - f(x^\star)\bigr), \qquad T > 0.$$
Proof. For all t, we have
$$\begin{aligned} f(x_{t+1}) &\le f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2 && \text{(sufficient decrease, Lemma 2.7)}\\ &\le f(x_t) - \frac{\mu}{L}\bigl(f(x_t) - f(x^\star)\bigr) && \text{(PL inequality (9.1))}. \end{aligned}$$
If we subtract f(x⋆) on both sides, we get
$$f(x_{t+1}) - f(x^\star) \le \left(1 - \frac{\mu}{L}\right)\bigl(f(x_t) - f(x^\star)\bigr),$$
and the statement follows.
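As a quick sanity check (our own illustration, not part of the notes), one can run gradient descent with stepsize 1/L on a strongly convex quadratic, which satisfies the PL inequality, and compare the measured error with the bound of Theorem 9.3:

```python
import numpy as np

def gd_pl_check(d=5, T=50, seed=0):
    rng = np.random.default_rng(seed)
    # Strongly convex quadratic f(x) = 1/2 x^T A x with minimum value 0 at x* = 0;
    # it is smooth with L = lambda_max(A) and satisfies the PL inequality with mu = lambda_min(A).
    A = rng.standard_normal((d, d))
    A = A @ A.T + np.eye(d)
    eigvals = np.linalg.eigvalsh(A)
    mu, L = eigvals[0], eigvals[-1]
    f = lambda x: 0.5 * x @ A @ x
    x = rng.standard_normal(d)
    f0 = f(x)
    for _ in range(T):
        x = x - (1.0 / L) * (A @ x)            # gradient step with gamma = 1/L
    print(f(x), (1 - mu / L) ** T * f0)        # measured error vs. PL bound
```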
9.3 Coordinate-wise smoothness
Definition 9.4. A differentiable function f : Rd → R is called coordinate-wise
smooth with parameter L = (L1, L2, . . . , Ld) if for every coordinate i ∈ [d],
$$f(x + \lambda e_i) \le f(x) + \lambda \nabla_i f(x) + \frac{L_i}{2}\lambda^2 \qquad \forall x \in \mathbb{R}^d,\ \forall \lambda \in \mathbb{R}. \qquad (9.2)$$
If Li = L for all i, f is said to be coordinate-wise smooth with parameter L.
If f is smooth with parameter L, then f is in particular coordinate-wise
smooth with parameter L in every coordinate: inequality (9.2) then coincides
with the regular smoothness inequality (2.8), when applied to vectors y of
the form y = x + λei.
But we may be able to say more. For example, f(x1, x2) = x1² + 10x2²
is smooth with parameter L = 20 (due to the 10x2² term, no smaller value
will do), but f is coordinate-wise smooth with parameter L = (2, 20). So
coordinate-wise smoothness allows us to obtain a more fine-grained picture
of f than smoothness.
There are even cases where the best possible smoothness parameter
is L, but we can choose coordinate-wise smoothness parameters Li (significantly)
smaller than L for all i. Consider f(x1, x2) = x1² + x2² + M x1x2
for a constant M > 0. For y = (y, y) and x = 0, smoothness requires that
$$(M+2)\,y^2 = f(y) \le \frac{L}{2}\|y\|^2 = L y^2,$$
so we need smoothness parameter L ≥ M + 2.
On the other hand, f is coordinate-wise smooth with L = (2, 2): fixing
one coordinate, we obtain a univariate function of the form x² + ax + b. This
is smooth with parameter 2 (use Lemma 2.6 (i) along with the fact that
affine functions are smooth with parameter 0).
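For this quadratic, valid coordinate-wise parameters are the diagonal entries of the (constant) Hessian, while the smallest valid global smoothness parameter is its largest eigenvalue; a tiny numerical check of the example above (our own illustration):

```python
import numpy as np

M = 10.0
H = np.array([[2.0, M], [M, 2.0]])      # Hessian of f(x1, x2) = x1^2 + x2^2 + M*x1*x2
L_global = np.linalg.eigvalsh(H)[-1]    # smallest valid global smoothness parameter: 2 + M
L_coord = np.diag(H)                    # coordinate-wise smoothness parameters: (2, 2)
print(L_global, L_coord)                # 12.0 vs. [2. 2.]
```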
9.4 Coordinate descent algorithms
In its generic form, coordinate descent performs steps of the form
$$\text{choose } i \in [d], \qquad x_{t+1} := x_t + \lambda_i e_i.$$
Here, ei denotes the i-th unit basis vector in Rd, and λi is a suitable stepsize
for the selected coordinate i. We will focus on the gradient-based choice
of the stepsize,
$$x_{t+1} := x_t - \gamma_i \nabla_i f(x_t)\, e_i. \qquad (9.4)$$
Here, ∇i f(x) denotes the i-th entry of the gradient ∇f(x), and in this
regime, we refer to γi > 0 as the stepsize.
In the coordinate-wise smooth case, we obtain a variant of sufficient
decrease for coordinate descent.
Lemma 9.5. Let f : Rd → R be differentiable and coordinate-wise smooth with
parameter L = (L1, L2, . . . , Ld) according to (9.2). With active coordinate i in
iteration t and stepsize γi = 1/Li, coordinate descent (9.4) satisfies
$$f(x_{t+1}) \le f(x_t) - \frac{1}{2L_i}\,|\nabla_i f(x_t)|^2.$$
Proof. We apply the coordinate-wise smoothness condition (9.2) with λ =
−∇i f(xt)/Li, for which we have xt+1 = xt + λei. Hence
$$\begin{aligned} f(x_{t+1}) &\le f(x_t) + \lambda \nabla_i f(x_t) + \frac{L_i}{2}\lambda^2\\ &= f(x_t) - \frac{1}{L_i}|\nabla_i f(x_t)|^2 + \frac{1}{2L_i}|\nabla_i f(x_t)|^2\\ &= f(x_t) - \frac{1}{2L_i}|\nabla_i f(x_t)|^2. \end{aligned}$$
9.4.1 Randomized coordinate descent
In randomized coordinate descent, the active coordinate is chosen uniformly
at random in every iteration:
$$\text{sample } i \in [d] \text{ uniformly at random}, \qquad x_{t+1} := x_t - \gamma_i \nabla_i f(x_t)\, e_i. \qquad (9.5)$$
If we additionally assume the PL inequality, we can obtain fast conver-
gence as follows.
Theorem 9.6. Let f : Rd → R be differentiable with a global minimum x⋆.
Suppose that f is coordinate-wise smooth with parameter L according to Definition
9.4 and satisfies the PL inequality (9.1) with parameter µ > 0. Choosing
stepsize γi = 1/L, randomized coordinate descent (9.5) with arbitrary x0 satisfies
$$\mathbb{E}\bigl[f(x_T) - f(x^\star)\bigr] \le \left(1 - \frac{\mu}{dL}\right)^T \bigl(f(x_0) - f(x^\star)\bigr), \qquad T > 0.$$
Comparing this to the result for gradient descent in Theorem 9.3, the
number of iterations to reach optimization error at most ε is by a factor
of d higher. To see this, note that (for µ/L small)
$$1 - \frac{\mu}{L} \approx \left(1 - \frac{\mu}{dL}\right)^d.$$
This means, while each iteration of coordinate descent is by a factor of d
cheaper, the number of iterations is by a factor of d higher, so we have a
zero-sum game here. But in the next section, we will refine the analysis
and show that there are cases where coordinate descent will actually be
faster. But first, let’s prove Theorem 9.6.
Proof. By definition, f is coordinate-wise smooth with (L, L, . . . , L), so sufficient
decrease according to Lemma 9.5 yields
$$f(x_{t+1}) \le f(x_t) - \frac{1}{2L}|\nabla_i f(x_t)|^2.$$
By taking the expectation of both sides with respect to the choice of i, we
have
$$\begin{aligned} \mathbb{E}\bigl[f(x_{t+1}) \mid x_t\bigr] &\le f(x_t) - \frac{1}{2L}\sum_{i=1}^d \frac{1}{d}\,|\nabla_i f(x_t)|^2\\ &= f(x_t) - \frac{1}{2dL}\|\nabla f(x_t)\|^2\\ &\le f(x_t) - \frac{\mu}{dL}\bigl(f(x_t) - f(x^\star)\bigr) \qquad \text{(PL inequality (9.1))}. \end{aligned}$$
In the second line, we conveniently used the fact that the squared Euclidean
norm is additive. Subtracting f(x⋆) from both sides, we therefore
obtain
$$\mathbb{E}\bigl[f(x_{t+1}) - f(x^\star) \mid x_t\bigr] \le \left(1 - \frac{\mu}{dL}\right)\bigl(f(x_t) - f(x^\star)\bigr).$$
Taking expectations (over xt), we obtain
$$\mathbb{E}\bigl[f(x_{t+1}) - f(x^\star)\bigr] \le \left(1 - \frac{\mu}{dL}\right)\mathbb{E}\bigl[f(x_t) - f(x^\star)\bigr].$$
The statement follows.
In the proof, we have used conditional expectations: E [f (xt+1 )|xt ] is a
random variable whose expectation is E [f (xt+1 )].
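A minimal Python sketch of randomized coordinate descent as analyzed above (our own sketch); grad_i is assumed to be an oracle returning the single partial derivative ∇i f(x):

```python
import numpy as np

def randomized_cd(grad_i, x0, L, T, seed=0):
    """Randomized coordinate descent (uniform sampling) with stepsize 1/L.
    grad_i(x, i) returns the i-th partial derivative of f at x."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    d = len(x)
    for _ in range(T):
        i = rng.integers(d)                  # sample a coordinate uniformly at random
        x[i] -= (1.0 / L) * grad_i(x, i)     # update only that coordinate
    return x
```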
9.4.2 Importance Sampling
Importance sampling chooses coordinates with larger smoothness parameters
more frequently, and uses a coordinate-dependent stepsize:
$$\text{sample } i \in [d] \text{ with probability } \frac{L_i}{\sum_{j=1}^d L_j}, \qquad x_{t+1} := x_t - \frac{1}{L_i}\nabla_i f(x_t)\, e_i. \qquad (9.6)$$
Here is the result.
Theorem 9.7. Let f : Rd → R be differentiable with a global minimum x⋆.
Suppose that f is coordinate-wise smooth with parameter L = (L1, L2, . . . , Ld)
according to (9.2) and satisfies the PL inequality (9.1) with parameter µ > 0. Let
$$\bar{L} = \frac{1}{d}\sum_{i=1}^d L_i.$$
Then coordinate descent with importance sampling (9.6) and arbitrary x0 satisfies
$$\mathbb{E}\bigl[f(x_T) - f(x^\star)\bigr] \le \left(1 - \frac{\mu}{d\bar{L}}\right)^T \bigl(f(x_0) - f(x^\star)\bigr), \qquad T > 0.$$
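Compared to the uniform variant, only the sampling distribution and the coordinate-dependent stepsize change; a sketch under the same assumptions on the grad_i oracle:

```python
import numpy as np

def importance_sampling_cd(grad_i, x0, Ls, T, seed=0):
    """Coordinate descent with sampling probabilities L_i / sum_j L_j and
    stepsize 1/L_i, as in (9.6). Ls is the vector (L_1, ..., L_d)."""
    rng = np.random.default_rng(seed)
    Ls = np.asarray(Ls, dtype=float)
    p = Ls / Ls.sum()                        # importance sampling distribution
    x = x0.copy()
    for _ in range(T):
        i = rng.choice(len(x), p=p)
        x[i] -= (1.0 / Ls[i]) * grad_i(x, i)
    return x
```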
This result is a bit disappointing: individual iterations seem to be as
costly as in gradient descent, but the number of iterations is by factor of d
larger. This comparison with Theorem 9.3 is not fully fair, though, since
in contrast to gradient descent, steepest coordinate descent requires only
coordinate-wise smoothness, and as we have seen in Section 9.3, this can
be better than global smoothness. But steepest coordinate descent also
cannot compete with randomized coordinate descent (same number of iterations,
but higher cost per iteration). However, we show next that the
algorithm allows for a speedup in certain cases; also, it may be possible to
efficiently maintain the maximum absolute gradient value throughout the
iterations, so that evaluation of the full gradient can be avoided.
to ℓ∞ -norm in the PL inequality. This has to do with convex conjugates,
but we will not go into it here.
$$f(x_T) - f(x^\star) \le \left(1 - \frac{\mu_1}{L}\right)^T \bigl(f(x_0) - f(x^\star)\bigr), \qquad T > 0.$$
Proof. By definition, f is coordinate-wise smooth with (L, L, . . . , L), so sufficient
decrease according to Lemma 9.5 yields
$$f(x_{t+1}) \le f(x_t) - \frac{1}{2L}|\nabla_i f(x_t)|^2 = f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|_\infty^2,$$
by definition of steepest coordinate descent. Using the PL inequality (9.9),
we further get
$$f(x_{t+1}) \le f(x_t) - \frac{\mu_1}{L}\bigl(f(x_t) - f(x^\star)\bigr).$$
Now we proceed as in the alternative analysis of gradient descent: Subtracting
f(x⋆) from both sides, we obtain
$$f(x_{t+1}) - f(x^\star) \le \left(1 - \frac{\mu_1}{L}\right)\bigl(f(x_t) - f(x^\star)\bigr),$$
and the statement follows.
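A sketch of steepest coordinate descent (our own illustration); for simplicity it recomputes the full gradient in every iteration, which is precisely the cost caveat discussed above:

```python
import numpy as np

def steepest_cd(grad, x0, L, T):
    """Steepest coordinate descent with stepsize 1/L; grad(x) returns the full
    gradient (in practice one would try to maintain the largest entry more
    cheaply, e.g. when gradients are sparse)."""
    x = x0.copy()
    for _ in range(T):
        g = grad(x)
        i = np.argmax(np.abs(g))       # coordinate with largest |partial derivative|
        x[i] -= (1.0 / L) * g[i]
    return x
```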
9.4.4 Greedy coordinate descent
This is a variant that does not even require f to be differentiable. In each
iteration, we make the step that maximizes the progress in the chosen coordinate.
This requires performing a line search, by solving a 1-dimensional
optimization problem:
$$\text{choose } i \in [d], \qquad \lambda^\star := \operatorname*{argmin}_{\lambda \in \mathbb{R}} f(x_t + \lambda e_i), \qquad x_{t+1} := x_t + \lambda^\star e_i. \qquad (9.10)$$
There are cases where the line search can be done exactly (analytically),
or approximately by some other means. In the differentiable case, we can
take any of the previously studied coordinate descent variants and replace
some of its steps by greedy steps if it turns out that we can perform line
search along the selected coordinate. This will not compromise the convergence
analysis, as stepwise progress can only be better.
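As an illustration of such a line search, here is a sketch of greedy coordinate descent that solves the 1-dimensional problem numerically with SciPy's scalar minimizer (a stand-in for an exact analytic solution); the cyclic coordinate choice is just one possible selection rule:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def greedy_cd(f, x0, T):
    """Greedy coordinate descent (9.10) with cyclic coordinate choice and a
    1-dimensional numerical line search along the selected coordinate."""
    x = x0.copy()
    d = len(x)
    for t in range(T):
        i = t % d                                       # cyclic choice of coordinate
        e_i = np.eye(d)[i]
        res = minimize_scalar(lambda lam: f(x + lam * e_i))
        x = x + res.x * e_i                             # x_{t+1} = x_t + lambda* e_i
    return x
```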
Some care is in order when applying the greedy variant in the nondif-
ferentiable case for which the previous variants don’t work. The algorithm
can get stuck in non-optimal points, as for example in the objective func-
tion of Figure 9.1. But not all hope is lost. There are relevant cases where
this scenario does not happen, as we show next.
Theorem 9.11. Let f : Rd → R be of the form
$$f(x) := g(x) + h(x) \quad\text{with}\quad h(x) = \sum_{i} h_i(x_i), \qquad x \in \mathbb{R}^d, \qquad (9.11)$$
Figure 9.2: The function f (x) := ∥x∥2 + ∥x∥1 . Greedy coordinate descent
cannot get stuck. Figure by Alp Yurtsever & Volkan Cevher, EPFL
One very important class of applications here consists of objective functions of
the form
$$f(x) + \lambda\|x\|_1,$$
where f is convex and smooth, and h(x) = λ∥x∥1 is a (separable) ℓ1-regularization
term. The LASSO (Section 1.6) in its regularized form gives
rise to a concrete such case:
$$\min_{x \in \mathbb{R}^d} \|Ax - b\|^2 + \lambda\|x\|_1. \qquad (9.12)$$
9.5 Summary
Coordinate descent methods are used widely in machine learning appli-
cations. Variants of coordinate methods form the state of the art for the
class of generalized linear models, including linear classifiers and regression
models, as long as separable convex regularizers are used (e.g. ℓ1 -norm or
squared ℓ2 -norm).
The following table summarizes the convergence bounds of coordi-
nate descent algorithms on coordinate-wise smooth and strongly convex
functions (we only use the PL inequality, a consequence of strong convex-
ity). The Bound column contains the factor by which the error is guaran-
teed to decrease in every step.
In the best case, Steeper (than Steepest) matches the performance of
gradient descent in terms of iteration count. The algorithm is therefore an
attractive choice for problems where we can obtain (or maintain) the steep-
est coordinate of the gradient efficiently. This includes several practical
cases, for example when the gradients are sparse, e.g. because the original
data is sparse.
Importance sampling is attractive when most coordinate-wise smooth-
ness parameters Li are much smaller than the maximum. In the best case,
it can be d times faster than gradient descent. On the downside, applying
the method requires knowing all the Li. For the other methods, an upper
bound on all Li is sufficient in order to run the algorithm.
9.6 Exercises
Exercise 58. Provide an example of a nonconvex function that satisfies the PL
inequality (9.1)!
Exercise 60. Derive the solution to exact coordinate minimization for the Lasso
problem (9.12), for the i-th coordinate. Write A−i for the n × (d − 1) matrix
obtained by removing the i-th column from A, and similarly write x−i for the
vector x with the i-th entry removed.
Exercise 61. Prove Lemma 9.9, proceeding as in the proof of Lemma 9.2!
Chapter 10
The Frank-Wolfe Algorithm
Contents
10.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
10.2 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
10.3 On linear minimization oracles . . . . . . . . . . . . . . . . . 164
10.3.1 LASSO and the ℓ1 -ball . . . . . . . . . . . . . . . . . . 164
10.3.2 Semidefinite Programming and the Spectahedron . . 165
10.4 Duality gap — A certificate for optimization quality . . . . . 166
10.5 Convergence in O(1/ε) steps . . . . . . . . . . . . . . . . . . . 167
10.5.1 Convergence analysis for γt = 2/(t + 2) . . . . . . . . 168
10.5.2 Stepsize variants . . . . . . . . . . . . . . . . . . . . . 169
10.5.3 Affine invariance . . . . . . . . . . . . . . . . . . . . . 170
10.5.4 The curvature constant . . . . . . . . . . . . . . . . . . 171
10.5.5 Convergence in duality gap . . . . . . . . . . . . . . . 173
10.6 Sparsity, extensions and use cases . . . . . . . . . . . . . . . . 174
10.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
10.1 Overview
As constrained optimization problems appear often in practice, we will
give them a second look here. We again consider problems of the form
$$\begin{array}{ll}\text{minimize} & f(x)\\ \text{subject to} & x \in X. \end{array} \qquad (10.1)$$
[Figure: the constrained optimization setting, minimizing f(x) over a constraint set X ⊆ Rd.]
The only algorithm we have discussed for this case was projected gra-
dient descent in Chapter 3. This comes with a clear downside that pro-
jections onto a set X can sometimes be very complex to compute, even in
cases when the set is convex. Would it still be possible to solve constrained
optimization problems using a gradient-based algorithm, but without any
projection steps?
From a different perspective, coordinate descent, as we have discussed
in Chapter 9, had the attractive advantage that it only modified one coor-
dinate in every step, keeping all others unchanged. Yet, it is not applicable
in the general constrained case, as we can not easily know when a coordi-
nate step would exit the constraint set X (except in easy cases when X is
defined as a product of intervals). Is there a coordinate-like algorithm also
for general constraint sets X?
It turns out the answer to both previous questions is yes. The algorithm
was discovered by Marguerite Frank and Philip Wolfe in 1956 [FW56],
161
giving rise to the name of the method. Historically, the motivation for
the method was different from the two aspects mentioned above. After
the second world war, linear programming (that is, minimizing a linear
function over a set of linear constraints) had significant impact for many in-
dustrial applications (e.g. in logistics). Given these successes with linear
objectives, Marguerite Frank and Philip Wolfe studied if similar methods
could be generalized to non-linear objectives (including quadratic as well
as general objectives), that is problems of the form (10.1).
[Figure 10.2: a Frank-Wolfe step; s minimizes the linear approximation of f at x over the constraint set X ⊆ Rd.]
and makes a step into the direction of the minimizer; see Figure 10.2.
10.3 On linear minimization oracles
The algorithm is particularly useful for cases when the constraint set X can
be described as a convex hull of a finite or otherwise “nice” set of points
A, formally conv(A) = X. We call A the atoms describing the constraint
set.
In this case, a solution to the linear subproblem LMOX defined in (10.2)
is always attained by an atom a ∈ A. Indeed, every s ∈ X = conv(A) is a
convex combination $s = \sum_{i=1}^n \lambda_i a_i$ of finitely many atoms ($\sum_{i=1}^n \lambda_i = 1$, all
λi nonnegative). It follows that for every g, there is always an atom ai such
that $g^\top s \ge a_i^\top g$. Hence, if s minimizes $g^\top z$, then there is also an atomic
minimizer.
This allows us to significantly reduce the candidate solutions for the
step directions used by the Frank-Wolfe algorithm. (Note that subprob-
lem (10.2) might still have optimal solutions which are not atoms, but there
is always at least one atomic solution LMOX (g) ∈ A).
The set A = X is a valid (but not too useful) set of atoms. The “opti-
mal” set of atoms is the set of extreme points. A point x ∈ X is extreme if
x ̸∈ conv(X \ {x}). Such an extreme point must be in every set of atoms,
but not every atom must be extreme. All that we require for A to be a set
of atoms is that conv(A) = X.
We give two interesting examples next.
10.3.1 LASSO and the ℓ1-ball
The ℓ1-ball is the convex hull of the signed unit vectors, so we can take
A = {±e1, . . . , ±ed} (the unit vectors and their negatives):
$$\mathrm{LMO}_X(g) = \operatorname*{argmin}_{z \in \{\pm e_1, \dots, \pm e_d\}} z^\top g. \qquad (10.6)$$
So we only have to look at the vector g and identify its largest coordinate
(in absolute value). This operation is of course significantly more efficient
than projection onto an ℓ1-ball. The latter we have analyzed in Section 3.5,
where we developed a more sophisticated algorithm that still did not achieve
runtime linear in the dimension.
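In code, the ℓ1-ball LMO is a single pass over g; a sketch, assuming the convention that LMOX(g) returns a minimizer of z⊤g over X as in (10.2):

```python
import numpy as np

def lmo_l1_ball(g):
    """Linear minimization oracle over the l1-ball {x : ||x||_1 <= 1}:
    returns -sign(g_i) e_i for the coordinate i with largest |g_i|."""
    i = np.argmax(np.abs(g))
    z = np.zeros_like(g, dtype=float)
    z[i] = -np.sign(g[i]) if g[i] != 0 else 1.0   # if g = 0, any atom is optimal
    return z
```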
sired convex combination of atoms. We remark that ai is a (unit length)
eigenvector of Z w.r.t. eigenvalue λi .
Lemma 10.1. Let λ1 be the smallest eigenvalue of G, and let s1 be a corresponding
eigenvector of unit length. Then we can choose LMOX(G) = s1s1⊤.
Proof. We have
$$\min_{Z \in X} G \bullet Z = \min_{\|z\|=1} G \bullet zz^\top = \min_{\|z\|=1} z^\top G z = \lambda_1.$$
The second equality follows from G • zz⊤ = z⊤Gz for all z (simple rewriting),
and the last equality is a standard result from linear algebra that can
be proved via elementary calculations, involving diagonalization of G.
Now, s1 is easily seen to attain the last minimum, hence s1s1⊤ attains the
first minimum, and LMOX(G) = s1s1⊤ follows.
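A sketch of the spectahedron LMO using a dense eigendecomposition; for large d one would instead use an iterative eigensolver such as scipy.sparse.linalg.eigsh for just the smallest eigenpair:

```python
import numpy as np

def lmo_spectahedron(G):
    """Linear minimization oracle over the spectahedron: returns s1 s1^T,
    where s1 is a unit eigenvector for the smallest eigenvalue of G."""
    eigvals, eigvecs = np.linalg.eigh(G)   # eigenvalues in ascending order
    s1 = eigvecs[:, 0]                     # eigenvector of the smallest eigenvalue
    return np.outer(s1, s1)
```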
[Figure: illustration of the duality gap g(x) at a point x, for f over the constraint set X ⊆ Rd.]
10.5.1 Convergence analysis for γt = 2/(t + 2)
Theorem 10.3. Consider the constrained minimization problem (10.1) where
f : Rd → R is convex and smooth with parameter L, and X is convex, closed
and bounded (in particular, a minimizer x⋆ of f over X exists, and all linear
minimization oracles have minimizers). With any x0 ∈ X, and with stepsizes
γt = 2/(t + 2), the Frank-Wolfe algorithm yields
$$f(x_T) - f(x^\star) \le \frac{2L\,\mathrm{diam}(X)^2}{T+1}, \qquad T \ge 1,$$
where diam(X) := maxx,y∈X ∥x − y∥ is the diameter of X (which exists since X
is closed and bounded).
The following descent lemma forms the core of the convergence proof:
Lemma 10.4. For a step xt+1 := xt + γt (s − xt ) with stepsize γt ∈ [0, 1], it holds
that
$$f(x_{t+1}) \le f(x_t) - \gamma_t\, g(x_t) + \gamma_t^2\, \frac{L}{2}\|s - x_t\|^2,$$
where s = LMOX (∇f (xt )).
Proof. From the definition of smoothness of f , we have
$$\begin{aligned} f(x_{t+1}) &= f\bigl(x_t + \gamma_t(s - x_t)\bigr)\\ &\le f(x_t) + \nabla f(x_t)^\top \gamma_t (s - x_t) + \frac{L}{2}\gamma_t^2\|s - x_t\|^2 \qquad (10.11)\\ &= f(x_t) - \gamma_t\, g(x_t) + \frac{L}{2}\gamma_t^2\|s - x_t\|^2, \end{aligned}$$
using the definition (10.9) of the duality gap.
Proof of Theorem 10.3. Writing h(x) := f(x) − f(x⋆) for the (unknown) optimization
gap at point x, and using the certificate property (10.10) of the
duality gap, that is h(x) ≤ g(x), Lemma 10.4 implies that
$$\begin{aligned} h(x_{t+1}) &\le h(x_t) - \gamma_t\, g(x_t) + \gamma_t^2\,\frac{L}{2}\|s - x_t\|^2\\ &\le h(x_t) - \gamma_t\, h(x_t) + \gamma_t^2\,\frac{L}{2}\|s - x_t\|^2\\ &= (1 - \gamma_t)\,h(x_t) + \gamma_t^2\,\frac{L}{2}\|s - x_t\|^2\\ &\le (1 - \gamma_t)\,h(x_t) + \gamma_t^2\, C, \end{aligned} \qquad (10.12)$$
where $C := \frac{L}{2}\,\mathrm{diam}(X)^2$.
The convergence proof finishes by induction. Exercise 63 asks you to
prove that for $\gamma_t = \frac{2}{t+2}$, we obtain
$$h(x_t) \le \frac{4C}{t+1}, \qquad t \ge 1.$$
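Putting the pieces together, here is a minimal Frank-Wolfe sketch with the stepsize γt = 2/(t + 2); grad and lmo are assumed oracles for ∇f and LMOX (for instance the ℓ1-ball oracle above), and the duality gap is computed as ∇f(xt)⊤(xt − s), which is the quantity g(xt) used in the analysis:

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, T):
    """Frank-Wolfe with stepsize gamma_t = 2/(t+2). grad(x) returns the
    gradient, lmo(g) a minimizer of g^T z over the constraint set X.
    Also returns the duality gaps grad(x_t)^T (x_t - s) as certificates."""
    x = np.asarray(x0, dtype=float)
    gaps = []
    for t in range(T):
        g = grad(x)
        s = lmo(g)                         # linear minimization oracle
        gaps.append(g @ (x - s))           # duality gap g(x_t)
        gamma = 2.0 / (t + 2)
        x = (1 - gamma) * x + gamma * s    # convex combination stays in X
    return x, gaps
```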
Line search stepsize. Here, γt ∈ [0, 1] is chosen such that the progress in
f-value (and hence also in h-value) is maximized,
$$\gamma_t := \operatorname*{argmin}_{\gamma \in [0,1]} f\bigl((1-\gamma)x_t + \gamma s\bigr).$$
Let yt+1 be the iterate obtained from xt with the standard stepsize µt .
From (10.12) and the definition of γt , we obtain the desired inequality
$$h(x_{t+1}) \le h(y_{t+1}) \le (1 - \mu_t)\,h(x_t) + \mu_t^2\, C. \qquad (10.13)$$
Gap-based stepsize. This chooses γt such that the right-hand side in the
first line of (10.12) is minimized. A simple calculation shows that this re-
sults in
$$\gamma_t := \min\left\{\frac{g(x_t)}{L\,\|s - x_t\|^2},\ 1\right\}.$$
Now we establish (10.13) as follows:
$$\begin{aligned} h(x_{t+1}) &\le h(x_t) - \gamma_t\, g(x_t) + \gamma_t^2\,\frac{L}{2}\|s - x_t\|^2\\ &\le h(x_t) - \mu_t\, g(x_t) + \mu_t^2\,\frac{L}{2}\|s - x_t\|^2\\ &\le h(x_t) - \mu_t\, h(x_t) + \mu_t^2\,\frac{L}{2}\|s - x_t\|^2\\ &\le (1 - \mu_t)\,h(x_t) + \mu_t^2\, C. \end{aligned}$$
Directly plugging in the definition of γt yields
$$h(x_{t+1}) \le \begin{cases} h(x_t)\left(1 - \dfrac{\gamma_t}{2}\right), & \gamma_t < 1,\\[4pt] \dfrac{1}{2}\,h(x_t), & \gamma_t = 1. \end{cases}$$
So we make progress in every iteration under the gap-based stepsize (this
is not guaranteed under the standard stepsize), but faster convergence is
not implied.
Figure 10.4: Two optimization problems (f, X) and (f ′ , X ′ ) that are equiv-
alent under an affine transformation.
where c is some constant, the linear minimization oracle in (b) returns the
step direction s′ = A−1 (s − b) ∈ X ′ corresponding to the step direction s ∈
X in (a). It follows that also the next iterates in (a) and (b) will correspond
to each other and have the same function values. In particular, after any
number of steps, both (a) and (b) will incur the same optimization error.
We will now develop a convergence bound that is invariant under affine
transformations, unlike the bound of Theorem 10.3. For this, we define
a curvature constant of the constrained optimization problem (10.1). The
quantity serves as a combined notion of complexity of both the objective
function f and the constraint set X:
$$C_{(f,X)} := \sup_{\substack{x,s \in X,\ \gamma \in (0,1]\\ y = (1-\gamma)x + \gamma s}} \frac{1}{\gamma^2}\Bigl(f(y) - f(x) - \nabla f(x)^\top (y - x)\Bigr). \qquad (10.15)$$
Theorem 10.5. With stepsizes γt = 2/(t + 2), the Frank-Wolfe algorithm yields
$$f(x_T) - f(x^\star) \le \frac{4\,C_{(f,X)}}{T+1}, \qquad T \ge 1.$$
Proof. The crucial step is to prove the following version of (10.11):
$$f(x_{t+1}) \le f(x_t) + \gamma_t\, \nabla f(x_t)^\top (s - x_t) + \gamma_t^2\, C_{(f,X)}. \qquad (10.16)$$
After this, we can follow the remainder of the proof of Theorem 10.3, with
$C_{(f,X)}$ instead of $C = \frac{L}{2}\,\mathrm{diam}(X)^2$. To show (10.16), we use
$x_{t+1} = (1-\gamma_t)x_t + \gamma_t s$ and rewrite the definition of the curvature constant (10.15) to get
$$f(x_{t+1}) - f(x_t) - \nabla f(x_t)^\top (x_{t+1} - x_t) \le \gamma_t^2\, C_{(f,X)},$$
which is (10.16), since $x_{t+1} - x_t = \gamma_t (s - x_t)$.
Lemma 10.6 (Exercise 64). Let f be a convex function which is smooth with
parameter L over X. Then
$$C_{(f,X)} \le \frac{L}{2}\,\mathrm{diam}(X)^2.$$
A similar (but more involved) analysis yields convergence of the duality gap itself: for some iterate xt with 1 ≤ t ≤ T,
$$g(x_t) \le \frac{27/2 \cdot C_{(f,X)}}{T+1}.$$
Still, compared to our previous theorem, the convergence of the gap
here is a stronger and more useful result, because g(xt) is easy to compute
in any iteration of the Frank-Wolfe algorithm and, as we have seen in
(10.10), serves as an upper bound (certificate) on the unknown primal error,
that is f(xt) − f(x⋆) ≤ g(xt).
The proof of the theorem is left as Exercise 65, and is difficult. The
argument leverages that not all gaps can be small, and will again crucially
rely on the descent Lemma 10.4.
10.6 Sparsity, extensions and use cases
A very important feature of the Frank-Wolfe algorithm has been pointed
out before, but we would like to make it explicit here. Consider the con-
vergence bound of Theorem 10.5,
$$f(x_T) - f(x^\star) \le \frac{4\,C_{(f,X)}}{T+1}, \qquad T \ge 1.$$
This means that O(1/ε) many iterations are sufficient to obtain optimality
gap at most ε. At this time, the current solution is a convex combination of
x0 and O(1/ε) many atoms of the constraint set X. Thinking of ε as a con-
stant (such as 0.01), this means that constantly many atoms are sufficient in
order to get an almost optimal solution. This is quite remarkable, and it
connects to the notion of coresets in computational geometry. A coreset is a
small subset of a given set of objects that is representative (with respect to
some measure) for the set of all objects. Some algorithms for finding small
coresets are inspired by the Frank-Wolfe algorithm [Cla10].
The algorithm and analysis above can be extended to several settings,
including
• Approximate LMO, that is we can allow a linear minimization oracle
which is not exact but is of a certain additive or multiplicative ap-
proximation quality for the subproblem (10.2). Convergence bounds
are essentially as in the exact case [Jag13].
• Randomized LMO, that is, LMOX solves the linear minimization
problem only over a random subset of X. Convergence in O(1/ε)
steps still holds [KPd18].
• Stochastic LMO, that is LMOX is fed with a stochastic gradient instead
of the true gradient [HL20].
• Unconstrained problems. This is achieved by considering growing ver-
sions of a constraint set X. For instance when X is an ℓ1 -norm ball,
the algorithm will become similar to popular steepest coordinate
methods as we have discussed in Section 9.4.3. In this case, the
resulting algorithms are also known as matching-pursuit, and are
widely used in the literature on sparse recovery of a signal, also
known as compressed sensing. For more details, we refer the reader
to [LKTJ17].
The Frank-Wolfe algorithm and its variants have many popular use-
cases. The most attractive uses are for constraint sets X where a projection
step incurs significantly higher computational cost than solving a
linear problem over X. Some examples of such sets include:
• Lasso and other L1-constrained problems, as discussed in Section 10.3.1.
• Matrix Completion. For several low-rank approximation problems,
including matrix completion as in recommender systems, the Frank-
Wolfe algorithm is a very scalable algorithm, and has much lower
iteration cost compared to projected gradient descent. For a more
formal treatment, see Exercise 66.
• Relaxation of combinatorial problems, where we would like to opti-
mize over a discrete set A (e.g. matchings, network flows etc). In
this case, the Frank-Wolfe algorithm is often used together with early
stopping, in order to achieve a good iterate xt being a combination
of at most t of the original points A.
Many of these applications can also be written as constraint sets of the
form X := conv(A) for some set of atoms A, as illustrated in the following
table:
Examples        A                      |A|   dim.   LMOX(g)
L1-ball         {±ei}                  2d    d      ±ei with argmaxi |gi|
Simplex         {ei}                   d     d      ei with argmini gi
Spectahedron    {xx⊤, ∥x∥ = 1}         ∞     d²     argmin∥x∥=1 x⊤Gx
Norms           {x, ∥x∥ ≤ 1}           ∞     d      argmin∥s∥≤1 ⟨s, g⟩
Nuclear norm    {Y, ∥Y∥∗ ≤ 1}          ∞     d²     ...
Wavelets        ...                    ∞     ∞      ...
10.7 Exercises
Exercise 63 (Induction for the Frank-Wolfe convergence analysis). Given
some constant C > 0 and a sequence of real values h0 , h1 , . . . satisfying (10.12),
i.e.
$$h_{t+1} \le (1 - \gamma_t)\,h_t + \gamma_t^2\, C \qquad \text{for } t = 0, 1, \dots$$
for $\gamma_t = \frac{2}{t+2}$, prove that
$$h_t \le \frac{4C}{t+1} \qquad \text{for } t \ge 1.$$
Exercise 64 (Relating Curvature and Smoothness). Prove Lemma 10.6:
where the optimization domain X is the set of matrices in the unit ball of the trace
norm (or nuclear norm), which is defined as the convex hull of the rank-1 matrices:
$$X := \mathrm{conv}(A) \quad\text{with}\quad A := \bigl\{\, uv^\top : u \in \mathbb{R}^n, \|u\|_2 = 1,\ v \in \mathbb{R}^m, \|v\|_2 = 1 \,\bigr\}.$$
Here Ω ⊆ [n] × [m] is the set of observed entries from a given data matrix Z
(collecting the ratings given by users to items for example).
1. Derive the LMOX for this set X for a gradient at iterate Y ∈ Rn×m .
2. Derive the projection step onto X. How do the LMOX and the projection
step compare, in terms of computational cost?
Bibliography
[ACGH18] Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A
convergence analysis of gradient descent for deep linear neu-
ral networks. CoRR, abs/1810.02281, 2018.
high dimensions. In Proceedings of the 25th International Confer-
ence on Machine Learning, pages 272–279, 2008.
[KNS16] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear Con-
vergence of Gradient and Proximal-Gradient Methods Under
the Polyak-Łojasiewicz Condition. In ECML PKDD 2016: Ma-
chine Learning and Knowledge Discovery in Databases, pages 795–
811. Springer, 2016.
[KPd18] Thomas Kerdreux, Fabian Pedregosa, and Alexandre
d’Aspremont. Frank-wolfe with subsampling oracle, 2018.
[NY83] Arkady. S. Nemirovsky and D. B. Yudin. Problem complexity
and method efficiency in optimization. Wiley, 1983.