Lecture05_descent
Spring 2020
Stanley Chan
Mathematical Background
Lecture 4: Intro to Optimization
Lecture 5: Gradient Descent
If x* is a minimizer, then a small step δ along any direction d cannot decrease f:

lim_{δ→0} (1/δ) [f(x* + δd) − f(x*)] = ∇f(x*)ᵀd ≥ 0, ∀d

=⇒ ∇f(x*)ᵀd ≥ 0, ∀d
But if x^(t) is not optimal, then we want a step that decreases f:

f(x^(t) + δd) ≤ f(x^(t))

So,

lim_{δ→0} (1/δ) [f(x^(t) + δd) − f(x^(t))] = ∇f(x^(t))ᵀd ≤ 0, for some d

=⇒ ∇f(x^(t))ᵀd ≤ 0
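The sign of ∇f(x^(t))ᵀd can be checked with a finite difference. A minimal sketch, where the quadratic f, the point x, and the direction d are our own illustrative choices:

```python
import numpy as np

# Illustrative smooth test function f(x) = 0.5 x^T H x + c^T x (our choice).
H = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([1.0, -1.0])

def f(x):
    return 0.5 * x @ H @ x + c @ x

def grad_f(x):
    return H @ x + c

x = np.array([2.0, -3.0])          # a non-optimal point
d = -grad_f(x)                     # candidate descent direction

# Finite-difference directional derivative: (f(x + delta*d) - f(x)) / delta.
delta = 1e-6
dir_deriv = (f(x + delta * d) - f(x)) / delta

# As delta -> 0 this tends to grad_f(x)^T d, which is negative for a descent direction.
print(dir_deriv, grad_f(x) @ d)
```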
E.g., if f(x) = (1/2) xᵀHx + cᵀx, then

α^(t) = − ∇f(x^(t))ᵀd^(t) / (d^(t)ᵀ H d^(t)).
3. Inexact line search:
Armijo / Wolfe conditions. See Nocedal-Wright, Chapter 3.1. © Stanley Chan 2020. All Rights Reserved.
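The exact line-search step for a quadratic can be verified numerically: α should minimize φ(a) = f(x + a·d). A sketch, where H, c, and the starting point are illustrative choices:

```python
import numpy as np

# Quadratic f(x) = 0.5 x^T H x + c^T x with an illustrative H, c (our choice).
H = np.array([[4.0, 1.0], [1.0, 3.0]])
c = np.array([-1.0, 2.0])
grad = lambda x: H @ x + c
f = lambda x: 0.5 * x @ H @ x + c @ x

x = np.array([1.0, 1.0])
d = -grad(x)                               # steepest-descent direction

# Exact line-search step for a quadratic: alpha = -grad(x)^T d / (d^T H d).
alpha = -(grad(x) @ d) / (d @ H @ d)

# alpha should minimize phi(a) = f(x + a*d): check against nearby step sizes.
phi = lambda a: f(x + a * d)
print(alpha, phi(alpha) <= phi(alpha + 1e-3), phi(alpha) <= phi(alpha - 1e-3))
```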
Convergence
Let x* be the global minimizer. Assume the following:
Assume f is twice differentiable, so that ∇²f exists.
Assume 0 ≺ λmin I ⪯ ∇²f(x) ⪯ λmax I for all x ∈ ℝⁿ.
Run gradient descent with exact line search.
Then, by Nocedal-Wright Chapter 3, Theorem 3.3,

f(x^(t+1)) − f(x*) ≤ (1 − λmin/λmax)² [f(x^(t)) − f(x*)]
                  ≤ (1 − λmin/λmax)⁴ [f(x^(t−1)) − f(x*)]
                  ⋮
                  ≤ (1 − λmin/λmax)^(2t) [f(x^(1)) − f(x*)].
Thus, f(x^(t)) → f(x*) as t → ∞.
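The per-step contraction factor (1 − λmin/λmax)² can be observed on a small quadratic whose Hessian eigenvalues we pick ourselves. A minimal sketch:

```python
import numpy as np

# Quadratic with known Hessian eigenvalues (our choice): lam_min = 1, lam_max = 10.
H = np.diag([1.0, 10.0])
lam_min, lam_max = 1.0, 10.0
f = lambda x: 0.5 * x @ H @ x          # minimizer x* = 0, so f(x*) = 0
grad = lambda x: H @ x

x = np.array([5.0, 5.0])
rho = (1 - lam_min / lam_max) ** 2     # contraction factor from the theorem

gaps = [f(x)]
for _ in range(20):
    d = -grad(x)
    alpha = -(grad(x) @ d) / (d @ H @ d)   # exact line search
    x = x + alpha * d
    gaps.append(f(x))

# Each step should satisfy f(x^(t+1)) - f* <= rho * (f(x^(t)) - f*).
ok = all(gaps[t + 1] <= rho * gaps[t] + 1e-12 for t in range(len(gaps) - 1))
print(ok, gaps[-1])
```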
Understanding Convergence
Gradient descent can be viewed as successive approximation.
Approximate the function as
f(x_t + d) ≈ f(x_t) + ∇f(x_t)ᵀd + (1/(2α)) ‖d‖².

We can show that the d minimizing this approximation is d = −α∇f(x_t).
This suggests: use a quadratic function to locally approximate f.
The iteration converges when the step size α, which sets the curvature 1/α of the approximation, is not too big.
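The claim that d = −α∇f(x_t) minimizes the quadratic model can be sanity-checked numerically; the gradient value, α, and f(x_t) below are arbitrary stand-ins:

```python
import numpy as np

# Local quadratic model around x_t: m(d) = f(x_t) + grad^T d + (1/(2*alpha)) ||d||^2.
# Setting its gradient in d to zero gives d = -alpha * grad.
grad = np.array([2.0, -1.0])   # stand-in for grad f(x_t) (our choice)
alpha = 0.5
fx = 3.0                       # stand-in value for f(x_t)

model = lambda d: fx + grad @ d + (1.0 / (2 * alpha)) * (d @ d)

d_star = -alpha * grad

# The model is strictly convex in d, so random perturbations of d_star
# should never decrease its value.
rng = np.random.default_rng(0)
worse = all(model(d_star + 0.1 * rng.standard_normal(2)) >= model(d_star)
            for _ in range(100))
print(worse)
```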
Step size: SGD with constant step size does not converge.
If θ* is a minimizer, then ∇J(θ*) = (1/N) Σ_{n=1}^{N} ∇J_n(θ*) = 0. But

(1/|B|) Σ_{n∈B} ∇J_n(θ*) ≠ 0 in general, since B is only a subset.
Typical strategy: start with a large step size and gradually decrease it:
η_t → 0, e.g., η_t = t^(−a) for some constant a > 0.
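A minimal SGD sketch with the decaying step size η_t = t^(−a), on a synthetic least-squares problem; the data, batch size, decay exponent, and the 0.1 scale factor are all our own choices:

```python
import numpy as np

# Toy objective J(theta) = (1/N) sum_n 0.5*(a_n^T theta - y_n)^2 on synthetic data;
# per-sample gradient is grad J_n(theta) = a_n (a_n^T theta - y_n).
rng = np.random.default_rng(0)
N, dim, batch = 200, 3, 10
A = rng.standard_normal((N, dim))
theta_true = np.array([1.0, -2.0, 0.5])
y = A @ theta_true                       # noiseless, so the minimizer is theta_true

theta = np.zeros(dim)
a_exp = 0.6                              # decay exponent a in eta_t = t^(-a)
for t in range(1, 2001):
    eta = 0.1 * t ** (-a_exp)            # decaying step size, eta_t -> 0
    idx = rng.integers(0, N, size=batch)
    g = A[idx].T @ (A[idx] @ theta - y[idx]) / batch   # mini-batch gradient
    theta = theta - eta * g

print(np.linalg.norm(theta - theta_true))
```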
Perspectives of SGD
y^(t+1) := x^(t+1) − η∇f(x^(t+1))
        = (y^t − ηw^t) − η∇f(y^t − ηw^t)

Assume E[w] = 0; then

E[y^(t+1)] = y^t − η∇E[f(y^t − ηw^t)]
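The expectation E[f(y − ηw)] is a smoothed version of f, which is the point of the next slide. A Monte-Carlo sketch with f(x) = |x| (our illustrative choice) shows how the noise rounds out the kink:

```python
import numpy as np

# Smoothed objective F(y) = E[f(y - eta*w)], w ~ N(0,1), for the kinked f(x) = |x|.
# (f and the Gaussian noise model are our illustrative choices.)
rng = np.random.default_rng(0)
eta = 0.5
w = rng.standard_normal(200_000)

f = np.abs
F = lambda y: np.mean(f(y - eta * w))   # Monte-Carlo estimate of the expectation

# Central second differences at the kink: f's curvature estimate blows up as
# h -> 0, while the smoothed F stays bounded -- the noise smooths the landscape.
h = 1e-2
curv_f = (f(h) + f(-h) - 2 * f(0.0)) / h**2
curv_F = (F(h) + F(-h) - 2 * F(0.0)) / h**2
print(curv_f, curv_F)
```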
Smoothing the Landscape
Gradient Descent
S. Boyd and L. Vandenberghe, “Convex Optimization”, Chapter 9.2-9.4.
J. Nocedal and S. Wright, “Numerical Optimization”, Chapter 3.1-3.3.
Y. Nesterov, “Introductory lectures on convex optimization”, Chapter 2.
CMU 10.725 Lecture: https://www.stat.cmu.edu/~ryantibs/convexopt/lectures/grad-descent.pdf
Lemma
If each index n ∈ B is a random variable with uniform distribution over {1, . . . , N}, then

E[ (1/|B|) Σ_{n∈B} ∇J_n(θ) ] = ∇J(θ).
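The lemma can be checked by Monte Carlo: average many mini-batch gradients and compare with the full gradient. J_n here is an illustrative per-sample least-squares loss on synthetic data:

```python
import numpy as np

# Check E[(1/|B|) sum_{n in B} grad J_n(theta)] = grad J(theta) by Monte Carlo,
# with J_n(theta) = 0.5*(a_n^T theta - y_n)^2 on synthetic data (our choice).
rng = np.random.default_rng(0)
N, dim, batch = 50, 4, 5
A = rng.standard_normal((N, dim))
y = rng.standard_normal(N)
theta = rng.standard_normal(dim)

batch_grad = lambda idx: A[idx].T @ (A[idx] @ theta - y[idx]) / len(idx)
full_grad = A.T @ (A @ theta - y) / N    # grad J(theta)

# Average mini-batch gradients over many uniformly drawn batches.
est = np.mean([batch_grad(rng.integers(0, N, size=batch)) for _ in range(20000)],
              axis=0)
print(np.linalg.norm(est - full_grad))
```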
x^(t+1) = x^t − α ( β g^(t−1) + (1 − β) g^t ),
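A sketch of this momentum-style update on a toy ill-conditioned quadratic; H, α, β, and the starting point are illustrative choices:

```python
import numpy as np

# Gradient update with a blend of current and previous gradients:
#   x^(t+1) = x^t - alpha * (beta * g^(t-1) + (1 - beta) * g^t)
H = np.diag([1.0, 20.0])            # ill-conditioned quadratic f(x) = 0.5 x^T H x
grad = lambda x: H @ x

alpha, beta = 0.05, 0.5
x = np.array([10.0, 1.0])
g_prev = np.zeros(2)
for _ in range(300):
    g = grad(x)
    x = x - alpha * (beta * g_prev + (1 - beta) * g)
    g_prev = g

print(np.linalg.norm(x))            # should be close to the minimizer x* = 0
```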
J(θ) = (1/2) ‖Aθ − y‖²

Then the gradient is

∇J(θ) = Aᵀ(Aθ − y),

and the gradient descent update is

θ^(t+1) = θ^t − η Aᵀ(Aθ^t − y).
Since J is a quadratic function, you can find the exact line-search step size in closed form (with d = −∇J(θ)):

η = ‖d‖² / (dᵀAᵀAd).
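Putting the pieces together, a sketch of gradient descent on the least-squares objective with this exact line-search step; the data here is synthetic:

```python
import numpy as np

# Gradient descent on J(theta) = 0.5 ||A theta - y||^2 with exact line search,
# eta = ||d||^2 / (d^T A^T A d) for d = -grad J(theta). Synthetic data (our choice).
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 5))
theta_true = rng.standard_normal(5)
y = A @ theta_true                      # consistent system, so J(theta_true) = 0

theta = np.zeros(5)
for _ in range(200):
    g = A.T @ (A @ theta - y)           # grad J(theta)
    if np.linalg.norm(g) < 1e-12:       # converged; avoid a 0/0 step below
        break
    d = -g
    Ad = A @ d
    eta = (d @ d) / (Ad @ Ad)           # exact line-search step for this quadratic
    theta = theta + eta * d

print(np.linalg.norm(theta - theta_true))
```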
Q&A 4: In finding the steepest direction, why is δ unimportant?

d = −δ ∇f(x)/‖∇f(x)‖

Any δ > 0 only scales the length of d, not its direction: the normalized gradient ∇f(x)/‖∇f(x)‖ fixes the direction, and the length is absorbed into the step size.