25 Optimization

The document provides an introduction to optimization and gradient descent in deep learning, focusing on concepts such as local and global minima, convex functions, and the gradient descent algorithm. It discusses the choice of learning rate, convergence rates, and the application of stochastic gradient descent (SGD) and mini-batch SGD. The content is aimed at students of UC Berkeley's STAT 157 course, highlighting the importance of these optimization techniques in training deep learning models.


Introduction to Deep Learning

22. Optimization, Gradient Descent

STAT 157, Spring 2019, UC Berkeley

Alex Smola and Mu Li


courses.d2l.ai/berkeley-stat-157
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Optimization

Optimization Problems

• General form:
  minimize f(x) subject to x ∈ C
• Cost function f : ℝⁿ → ℝ
• Example of a constraint set:
  C = {x | h₁(x) = 0, …, h_m(x) = 0, g₁(x) ≤ 0, …, g_r(x) ≤ 0}
• The problem is unconstrained if C = ℝⁿ

Local Minima and Global Minima

• Most optimization problems have no closed-form solution
• We therefore look for a minimum with iterative methods
• Global minimum x*:
  f(x*) ≤ f(x) ∀x ∈ C
• Local minimum x*: there exists ε > 0 such that
  f(x*) ≤ f(x) ∀x with ∥x − x*∥ ≤ ε

(Figure: f(x) = x·cos(πx), which has both a local and a global minimum)
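The distinction can be checked numerically. A minimal sketch (the grid search below is an illustrative choice, not from the slides) locating the global minimum of the example function f(x) = x·cos(πx):

```python
import numpy as np

# f(x) = x * cos(pi * x): on [-1, 2] this has a shallow local minimum
# near x = -0.27 and the global minimum near x = 1.09.
def f(x):
    return x * np.cos(np.pi * x)

xs = np.linspace(-1.0, 2.0, 100_001)   # dense grid over the interval
x_glob = xs[np.argmin(f(xs))]          # global minimizer on the grid

print(round(float(x_glob), 2))         # close to 1.09
```

A grid search like this only works in one dimension; in high dimensions iterative methods such as gradient descent are the practical option.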

Convex Set

• A subset C of ℝn is called
convex if

αx + (1 − α)y ∈ C
∀α ∈ [0,1] ∀x, y ∈ C

Convex Function

• f : C → ℝ is called convex if
f(αx + (1 − α)y)
≤ αf(x) + (1 − α)f(y)
∀α ∈ [0,1] ∀x, y ∈ C
• If the inequality is strict
whenever α ∈ (0,1) and
x ≠ y, then f is called strictly
convex

First-order condition

• f is convex if and only if

f(y) ≥ f(x) + ∇f(x)T (y − x) ∀x, y ∈ C

• If the inequality is strict whenever x ≠ y, then f is strictly convex

Second-order conditions

• f is convex if and only if

∇2 f(x) ⪰ 0 ∀x ∈ C

• f is strictly convex if

∇²f(x) ≻ 0 ∀x ∈ C

(sufficient but not necessary: f(x) = x⁴ is strictly convex although f″(0) = 0)

Convex and Non-convex Examples

• Convex
  • Linear regression f(x) = ∥Wx − b∥₂²
    ∇f(x) = 2Wᵀ(Wx − b), ∇²f(x) = 2WᵀW ⪰ 0
  • Softmax regression
• Non-convex
  • Multi-layer perceptrons
  • Convolutional neural networks
  • Recurrent neural networks
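The convexity of linear regression can be verified numerically via the second-order condition. A minimal sketch (the random W is an illustrative instance):

```python
import numpy as np

# Second-order check for f(x) = ||Wx - b||_2^2: the Hessian 2 W^T W is
# constant in x and positive semidefinite, hence f is convex.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))

H = 2 * W.T @ W                        # Hessian of the least-squares loss
eigenvalues = np.linalg.eigvalsh(H)    # symmetric matrix -> real spectrum

print(bool(np.all(eigenvalues >= -1e-10)))   # all eigenvalues >= 0, so H ⪰ 0
```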

Convex Optimization

• If f is a convex function and C is a convex set, the problem is called a convex problem
• Any local minimum is a global minimum
• The global minimum is unique if f is strictly convex
Proof

• Suppose a local minimum x (with radius ε) is not global: there exists y with f(y) < f(x)
• If ∥x − y∥ ≤ ε, then y itself contradicts locality, so assume ∥x − y∥ > ε
• Choose α = 1 − ε/∥x − y∥ and z = αx + (1 − α)y
• Then ∥x − z∥ = (1 − α)∥x − y∥ = ε
• Since f(y) < f(x), convexity gives
  f(z) ≤ αf(x) + (1 − α)f(y) < αf(x) + (1 − α)f(x) = f(x)
• This contradicts x being a local minimum

Gradient Descent

Algorithm

• Choose an initial value x₀
• At time t = 1, …, T:
  xₜ = xₜ₋₁ − η ∇f(xₜ₋₁)
• η is called the learning rate
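The update rule takes a few lines of NumPy. A minimal sketch; the quadratic objective, step size, and iteration count below are illustrative choices, not from the slides:

```python
import numpy as np

# Gradient descent on the convex quadratic f(x) = ||Wx - b||^2,
# whose gradient is 2 W^T (Wx - b).
def gd(grad, x0, eta, T):
    x = x0
    for _ in range(T):
        x = x - eta * grad(x)          # x_t = x_{t-1} - eta * grad f(x_{t-1})
    return x

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 3))
b = rng.normal(size=10)
grad = lambda x: 2 * W.T @ (W @ x - b)

x_hat = gd(grad, np.zeros(3), eta=0.01, T=5000)
x_star, *_ = np.linalg.lstsq(W, b, rcond=None)   # closed-form optimum
print(np.allclose(x_hat, x_star, atol=1e-4))
```

Because this problem also has a closed-form solution, it doubles as a check that the iterates really converge to the minimizer.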

The Choice of Learning Rate

• Given ∥Δ∥ ≤ ε, for any smooth f, by a first-order Taylor expansion
  f(x + Δ) ≈ f(x) + Δᵀ∇f(x)
• Choose a small enough learning rate η ≤ ε/∥∇f(x)∥, so that
  ∥−η ∇f(x)∥ ≤ ε
• Then
  f(x − η ∇f(x)) ≈ f(x) − η∥∇f(x)∥² ≤ f(x)

Convergence Rate

• Assume f is convex and its gradient is Lipschitz continuous with constant L (the gradient does not change dramatically):
  ∥∇f(x) − ∇f(y)∥ ≤ L∥x − y∥
• With learning rate η ≤ 1/L, after T steps
  f(x_T) − f(x*) ≤ ∥x₀ − x*∥² / (2ηT)
• Convergence rate O(1/T)
• To reach f(x_T) − f(x*) ≤ ϵ, O(1/ϵ) iterations are needed
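The bound can be checked empirically on a smooth convex problem. A sketch under illustrative assumptions (a small random least-squares instance, η = 1/L):

```python
import numpy as np

# Empirically verify f(x_T) - f(x*) <= ||x0 - x*||^2 / (2 eta T)
# for least squares, with L the Lipschitz constant of the gradient.
rng = np.random.default_rng(1)
W = rng.normal(size=(8, 4))
b = rng.normal(size=8)

f = lambda x: float(np.sum((W @ x - b) ** 2))
grad = lambda x: 2 * W.T @ (W @ x - b)

L = 2 * float(np.linalg.eigvalsh(W.T @ W)[-1])   # largest eigenvalue of 2 W^T W
eta, T = 1.0 / L, 100
x_star, *_ = np.linalg.lstsq(W, b, rcond=None)

x = np.zeros(4)
for _ in range(T):
    x = x - eta * grad(x)

bound = float(np.sum(x_star ** 2)) / (2 * eta * T)   # x0 = 0
print(f(x) - f(x_star) <= bound)                     # the bound holds
```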

Proof

• The gradient being L-Lipschitz implies
  f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (L/2)∥y − x∥²
• Plug in y = x − η ∇f(x):
  f(y) ≤ f(x) − (1 − Lη/2) η∥∇f(x)∥²
• Take 0 < η ≤ 1/L:
  f(y) ≤ f(x) − (η/2)∥∇f(x)∥²    (f decreases at every step)
Proof II

• By convexity: f(x) ≤ f(x*) + ∇f(x)ᵀ(x − x*)
• Plug this into f(y) ≤ f(x) − (η/2)∥∇f(x)∥²:
  f(y) ≤ f(x*) + ∇f(x)ᵀ(x − x*) − (η/2)∥∇f(x)∥²
  f(y) − f(x*) ≤ (2η ∇f(x)ᵀ(x − x*) − η²∥∇f(x)∥²) / 2η
  = (∥x − x*∥² + 2η ∇f(x)ᵀ(x − x*) − η²∥∇f(x)∥² − ∥x − x*∥²) / 2η
  = (∥x − x*∥² − ∥x − η ∇f(x) − x*∥²) / 2η
  = (∥x − x*∥² − ∥y − x*∥²) / 2η

Proof III

• Sum over all T steps:
  Σ_{t=1}^{T} [f(xₜ) − f(x*)] ≤ Σ_{t=1}^{T} (∥xₜ₋₁ − x*∥² − ∥xₜ − x*∥²) / 2η
  = (∥x₀ − x*∥² − ∥x_T − x*∥²) / 2η ≤ ∥x₀ − x*∥² / 2η
• Since f decreases at every step, f(x_T) ≤ f(xₜ) for all t, so
  f(x_T) − f(x*) ≤ (1/T) Σ_{t=1}^{T} [f(xₜ) − f(x*)] ≤ ∥x₀ − x*∥² / (2ηT)

Apply to Deep Learning

• f is the average loss over all training data, and x is the vector of learnable parameters:
  f(x) = (1/n) Σ_{i=1}^{n} ℓᵢ(x), where ℓᵢ(x) is the loss on the i-th example
• f is often not convex, so the preceding convergence analysis does not apply

Stochastic Gradient Descent

(Slide image: a 1000 Singapore Dollar note, ≈ 740 USD — the other "SGD")

Algorithm

• At time t, sample an example iₜ ∈ {1, …, n}:
  xₜ = xₜ₋₁ − ηₜ ∇ℓ_{iₜ}(xₜ₋₁)
• Compare to gradient descent:
  xₜ = xₜ₋₁ − η ∇f(xₜ₋₁)
  f(x) = (1/n) Σ_{i=1}^{n} ℓᵢ(x)

Sample Example

• Two rules for sampling the example iₜ at time t:
  • Random rule: choose iₜ ∈ {1, …, n} uniformly at random
  • Cyclic rule: choose iₜ = 1, 2, …, n, 1, 2, …, n (often called incremental gradient descent)
• The random rule is more common in practice, since it yields an unbiased estimate of the gradient:
  𝔼[∇ℓ_{iₜ}(x)] = ∇f(x)

Convergence Rate

• Assume f is convex; with a diminishing learning rate ηₜ, e.g. ηₜ = O(1/√t),
  𝔼[f(x_T)] − f(x*) = O(1/√T)
• Under the same assumptions, gradient descent gives
  f(x_T) − f(x*) = O(1/√T)
• If the gradient is additionally L-Lipschitz and η is fixed, gradient descent improves to
  f(x_T) − f(x*) = O(1/T)
• SGD does not improve in the same way

In Practice

• In practice the learning rate is not decreased so dramatically
• We do not need to optimize to high accuracy
• Despite needing more iterations to converge, SGD computes each gradient far faster than GD
• This matters especially for deep learning, with its complex models and large-scale datasets

Code…
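A minimal SGD sketch for per-example least-squares losses ℓᵢ(x) = (wᵢᵀx − bᵢ)². The synthetic noiseless data and the small fixed step size are illustrative simplifications: with noiseless data every per-example gradient vanishes at the solution, so a fixed η suffices here.

```python
import numpy as np

# SGD with the random rule: one uniformly sampled example per step.
rng = np.random.default_rng(0)
n, d = 1000, 3
W = rng.normal(size=(n, d))
x_true = np.array([1.0, -2.0, 0.5])
b = W @ x_true                          # noiseless targets

x = np.zeros(d)
eta = 0.005                             # small fixed step (illustrative)
for t in range(20_000):
    i = rng.integers(n)                 # sample i_t uniformly at random
    g = 2 * (W[i] @ x - b[i]) * W[i]    # gradient of the single loss l_i
    x = x - eta * g

print(np.round(x, 2))                   # close to x_true
```

Each step touches a single row of W, so the per-iteration cost is n times smaller than full gradient descent.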

Mini-batch SGD

Algorithm

• At time t, sample a random subset Iₜ ⊂ {1, …, n} with |Iₜ| = b:
  xₜ = xₜ₋₁ − (ηₜ/b) Σ_{i∈Iₜ} ∇ℓᵢ(xₜ₋₁)
• Again, this is an unbiased estimate of the gradient:
  𝔼[(1/b) Σ_{i∈Iₜ} ∇ℓᵢ(x)] = ∇f(x)
• It reduces the variance by a factor of 1/b compared to SGD

Code…
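A mini-batch variant of the same sketch; the batch size, step size, and synthetic noiseless data are illustrative choices:

```python
import numpy as np

# Mini-batch SGD: average the gradient over a random subset I_t of size
# `batch`, keeping the estimate unbiased while cutting its variance by
# roughly a factor of 1/batch.
rng = np.random.default_rng(0)
n, d, batch = 1000, 3, 32
W = rng.normal(size=(n, d))
x_true = np.array([1.0, -2.0, 0.5])
b = W @ x_true                          # noiseless targets

x = np.zeros(d)
eta = 0.05
for t in range(2000):
    idx = rng.choice(n, size=batch, replace=False)   # sample I_t
    r = W[idx] @ x - b[idx]
    g = 2 * W[idx].T @ r / batch        # averaged mini-batch gradient
    x = x - eta * g

print(np.round(x, 2))                   # close to x_true
```

The averaged gradient is less noisy than a single-example one, which in practice allows a larger step size and makes better use of vectorized hardware.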

