25 Optimization

The document provides an introduction to optimization and gradient descent in deep learning, focusing on concepts such as local and global minima, convex functions, and the gradient descent algorithm. It discusses the choice of learning rate, convergence rates, and the application of stochastic gradient descent (SGD) and mini-batch SGD. The content is aimed at students of UC Berkeley's STAT 157 course, highlighting the importance of these optimization techniques in training deep learning models.


Introduction to Deep Learning

22. Optimization, Gradient Descent

STAT 157, Spring 2019, UC Berkeley

Alex Smola and Mu Li


courses.d2l.ai/berkeley-stat-157
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Optimization

Optimization Problems

• General form:
  minimize f(x) subject to x ∈ C
• Cost function f : ℝⁿ → ℝ
• Example of a constraint set:
  C = {x | h₁(x) = 0, …, h_m(x) = 0, g₁(x) ≤ 0, …, g_r(x) ≤ 0}
• The problem is unconstrained if C = ℝⁿ

Local Minima and Global Minima

• Most optimization problems have no closed-form solution
• We therefore look for a minimum with iterative methods
• Global minimum x*:
  f(x*) ≤ f(x) ∀x ∈ C
• Local minimum x*: there exists ε > 0 such that
  f(x*) ≤ f(x) ∀x with ∥x − x*∥ ≤ ε

(Figure: f(x) = x·cos(πx), which has both a local and a global minimum)
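The distinction can be checked numerically. A minimal sketch (the grid search below is an illustrative choice, not from the slides) locating the global minimum of the example function f(x) = x·cos(πx):

```python
import numpy as np

# f(x) = x * cos(pi * x): on [-1, 2] this has a shallow local minimum
# near x = -0.27 and the global minimum near x = 1.09.
def f(x):
    return x * np.cos(np.pi * x)

xs = np.linspace(-1.0, 2.0, 100_001)   # dense grid over the interval
x_glob = xs[np.argmin(f(xs))]          # global minimizer on the grid

print(round(float(x_glob), 2))         # close to 1.09
```

A grid search like this only works in one dimension; in high dimensions iterative methods such as gradient descent are the practical option.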

Convex Set

• A subset C of ℝn is called
convex if

αx + (1 − α)y ∈ C
∀α ∈ [0,1] ∀x, y ∈ C

Convex Function

• f : C → ℝ is called convex if
f(αx + (1 − α)y)
≤ αf(x) + (1 − α)f(y)
∀α ∈ [0,1] ∀x, y ∈ C
• If the inequality is strict
whenever α ∈ (0,1) and
x ≠ y, then f is called strictly
convex

First-order condition

• f is convex if and only if

f(y) ≥ f(x) + ∇f(x)T (y − x) ∀x, y ∈ C

• If the inequality is strict whenever x ≠ y, then f is strictly convex

Second-order conditions

• f is convex if and only if

∇2 f(x) ⪰ 0 ∀x ∈ C

• f is strictly convex if

∇²f(x) ≻ 0 ∀x ∈ C

(sufficient but not necessary: f(x) = x⁴ is strictly convex although f″(0) = 0)

Convex and Non-convex Examples

• Convex
  • Linear regression f(x) = ∥Wx − b∥₂²
    ∇f(x) = 2Wᵀ(Wx − b), ∇²f(x) = 2WᵀW ⪰ 0
  • Softmax regression
• Non-convex
  • Multi-layer perceptrons
  • Convolutional neural networks
  • Recurrent neural networks
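The convexity of linear regression can be verified numerically via the second-order condition. A minimal sketch (the random W is an illustrative instance):

```python
import numpy as np

# Second-order check for f(x) = ||Wx - b||_2^2: the Hessian 2 W^T W is
# constant in x and positive semidefinite, hence f is convex.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))

H = 2 * W.T @ W                        # Hessian of the least-squares loss
eigenvalues = np.linalg.eigvalsh(H)    # symmetric matrix -> real spectrum

print(bool(np.all(eigenvalues >= -1e-10)))   # all eigenvalues >= 0, so H ⪰ 0
```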

Convex Optimization

• If f is a convex function and C is a convex set, the problem is called a convex problem
• Any local minimum is a global minimum
• The global minimum is unique if f is strictly convex
Proof

• Suppose a local minimum x (with radius ε) is not global: there exists y with f(y) < f(x)
• If ∥x − y∥ ≤ ε, then y itself contradicts locality, so assume ∥x − y∥ > ε
• Choose α = 1 − ε/∥x − y∥ and z = αx + (1 − α)y
• Then ∥x − z∥ = (1 − α)∥x − y∥ = ε
• Since f(y) < f(x), convexity gives
  f(z) ≤ αf(x) + (1 − α)f(y) < αf(x) + (1 − α)f(x) = f(x)
• This contradicts x being a local minimum

Gradient Descent

Algorithm

• Choose an initial value x₀
• At time t = 1, …, T:
  xₜ = xₜ₋₁ − η ∇f(xₜ₋₁)
• η is called the learning rate
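The update rule takes a few lines of NumPy. A minimal sketch; the quadratic objective, step size, and iteration count below are illustrative choices, not from the slides:

```python
import numpy as np

# Gradient descent on the convex quadratic f(x) = ||Wx - b||^2,
# whose gradient is 2 W^T (Wx - b).
def gd(grad, x0, eta, T):
    x = x0
    for _ in range(T):
        x = x - eta * grad(x)          # x_t = x_{t-1} - eta * grad f(x_{t-1})
    return x

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 3))
b = rng.normal(size=10)
grad = lambda x: 2 * W.T @ (W @ x - b)

x_hat = gd(grad, np.zeros(3), eta=0.01, T=5000)
x_star, *_ = np.linalg.lstsq(W, b, rcond=None)   # closed-form optimum
print(np.allclose(x_hat, x_star, atol=1e-4))
```

Because this problem also has a closed-form solution, it doubles as a check that the iterates really converge to the minimizer.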

The Choice of Learning Rate

• Given ∥Δ∥ ≤ ε, for any smooth f, by a first-order Taylor expansion
  f(x + Δ) ≈ f(x) + Δᵀ∇f(x)
• Choose a small enough learning rate η ≤ ε/∥∇f(x)∥, so that
  ∥−η ∇f(x)∥ ≤ ε
• Then
  f(x − η ∇f(x)) ≈ f(x) − η∥∇f(x)∥² ≤ f(x)

Convergence Rate

• Assume f is convex and its gradient is Lipschitz continuous with constant L (the gradient does not change dramatically):
  ∥∇f(x) − ∇f(y)∥ ≤ L∥x − y∥
• With learning rate η ≤ 1/L, after T steps
  f(x_T) − f(x*) ≤ ∥x₀ − x*∥² / (2ηT)
• Convergence rate O(1/T)
• To reach f(x_T) − f(x*) ≤ ϵ, O(1/ϵ) iterations are needed
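The bound can be checked empirically on a smooth convex problem. A sketch under illustrative assumptions (a small random least-squares instance, η = 1/L):

```python
import numpy as np

# Empirically verify f(x_T) - f(x*) <= ||x0 - x*||^2 / (2 eta T)
# for least squares, with L the Lipschitz constant of the gradient.
rng = np.random.default_rng(1)
W = rng.normal(size=(8, 4))
b = rng.normal(size=8)

f = lambda x: float(np.sum((W @ x - b) ** 2))
grad = lambda x: 2 * W.T @ (W @ x - b)

L = 2 * float(np.linalg.eigvalsh(W.T @ W)[-1])   # largest eigenvalue of 2 W^T W
eta, T = 1.0 / L, 100
x_star, *_ = np.linalg.lstsq(W, b, rcond=None)

x = np.zeros(4)
for _ in range(T):
    x = x - eta * grad(x)

bound = float(np.sum(x_star ** 2)) / (2 * eta * T)   # x0 = 0
print(f(x) - f(x_star) <= bound)                     # the bound holds
```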

Proof

• The gradient being L-Lipschitz implies
  f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (L/2)∥y − x∥²
• Plug in y = x − η ∇f(x):
  f(y) ≤ f(x) − (1 − Lη/2) η∥∇f(x)∥²
• Take 0 < η ≤ 1/L:
  f(y) ≤ f(x) − (η/2)∥∇f(x)∥²    (f decreases at every step)
Proof II

• By convexity: f(x) ≤ f(x*) + ∇f(x)ᵀ(x − x*)
• Plug this into f(y) ≤ f(x) − (η/2)∥∇f(x)∥²:
  f(y) ≤ f(x*) + ∇f(x)ᵀ(x − x*) − (η/2)∥∇f(x)∥²
  f(y) − f(x*) ≤ (2η ∇f(x)ᵀ(x − x*) − η²∥∇f(x)∥²) / 2η
  = (∥x − x*∥² + 2η ∇f(x)ᵀ(x − x*) − η²∥∇f(x)∥² − ∥x − x*∥²) / 2η
  = (∥x − x*∥² − ∥x − η ∇f(x) − x*∥²) / 2η
  = (∥x − x*∥² − ∥y − x*∥²) / 2η

Proof III

• Sum over all T steps:
  Σ_{t=1}^{T} [f(xₜ) − f(x*)] ≤ Σ_{t=1}^{T} (∥xₜ₋₁ − x*∥² − ∥xₜ − x*∥²) / 2η
  = (∥x₀ − x*∥² − ∥x_T − x*∥²) / 2η ≤ ∥x₀ − x*∥² / 2η
• Since f decreases at every step, f(x_T) ≤ f(xₜ) for all t, so
  f(x_T) − f(x*) ≤ (1/T) Σ_{t=1}^{T} [f(xₜ) − f(x*)] ≤ ∥x₀ − x*∥² / (2ηT)

Apply to Deep Learning

• f is the average loss over all training data, and x is the vector of learnable parameters:
  f(x) = (1/n) Σ_{i=1}^{n} ℓᵢ(x), where ℓᵢ(x) is the loss on the i-th example
• f is often not convex, so the preceding convergence analysis does not apply

Stochastic Gradient Descent

(Slide image: a 1000 Singapore Dollar note, ≈ 740 USD — the other "SGD")

Algorithm

• At time t, sample an example iₜ ∈ {1, …, n}:
  xₜ = xₜ₋₁ − ηₜ ∇ℓ_{iₜ}(xₜ₋₁)
• Compare to gradient descent:
  xₜ = xₜ₋₁ − η ∇f(xₜ₋₁)
  f(x) = (1/n) Σ_{i=1}^{n} ℓᵢ(x)

Sample Example

• Two rules for sampling the example iₜ at time t:
  • Random rule: choose iₜ ∈ {1, …, n} uniformly at random
  • Cyclic rule: choose iₜ = 1, 2, …, n, 1, 2, …, n (often called incremental gradient descent)
• The random rule is more common in practice, since it yields an unbiased estimate of the gradient:
  𝔼[∇ℓ_{iₜ}(x)] = ∇f(x)

Convergence Rate

• Assume f is convex; with a diminishing learning rate ηₜ, e.g. ηₜ = O(1/√t),
  𝔼[f(x_T)] − f(x*) = O(1/√T)
• Under the same assumptions, gradient descent gives
  f(x_T) − f(x*) = O(1/√T)
• If the gradient is additionally L-Lipschitz and η is fixed, gradient descent improves to
  f(x_T) − f(x*) = O(1/T)
• SGD does not improve in the same way

In Practice

• In practice the learning rate is not decreased so dramatically
• We do not need to optimize to high accuracy
• Despite needing more iterations to converge, SGD computes each gradient far faster than GD
• This matters especially for deep learning, with its complex models and large-scale datasets

Code…
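A minimal SGD sketch for per-example least-squares losses ℓᵢ(x) = (wᵢᵀx − bᵢ)². The synthetic noiseless data and the small fixed step size are illustrative simplifications: with noiseless data every per-example gradient vanishes at the solution, so a fixed η suffices here.

```python
import numpy as np

# SGD with the random rule: one uniformly sampled example per step.
rng = np.random.default_rng(0)
n, d = 1000, 3
W = rng.normal(size=(n, d))
x_true = np.array([1.0, -2.0, 0.5])
b = W @ x_true                          # noiseless targets

x = np.zeros(d)
eta = 0.005                             # small fixed step (illustrative)
for t in range(20_000):
    i = rng.integers(n)                 # sample i_t uniformly at random
    g = 2 * (W[i] @ x - b[i]) * W[i]    # gradient of the single loss l_i
    x = x - eta * g

print(np.round(x, 2))                   # close to x_true
```

Each step touches a single row of W, so the per-iteration cost is n times smaller than full gradient descent.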

Mini-batch SGD

Algorithm

• At time t, sample a random subset Iₜ ⊂ {1, …, n} with |Iₜ| = b:
  xₜ = xₜ₋₁ − (ηₜ/b) Σ_{i∈Iₜ} ∇ℓᵢ(xₜ₋₁)
• Again, this is an unbiased estimate of the gradient:
  𝔼[(1/b) Σ_{i∈Iₜ} ∇ℓᵢ(x)] = ∇f(x)
• It reduces the variance by a factor of 1/b compared to SGD

Code…
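A mini-batch variant of the same sketch; the batch size, step size, and synthetic noiseless data are illustrative choices:

```python
import numpy as np

# Mini-batch SGD: average the gradient over a random subset I_t of size
# `batch`, keeping the estimate unbiased while cutting its variance by
# roughly a factor of 1/batch.
rng = np.random.default_rng(0)
n, d, batch = 1000, 3, 32
W = rng.normal(size=(n, d))
x_true = np.array([1.0, -2.0, 0.5])
b = W @ x_true                          # noiseless targets

x = np.zeros(d)
eta = 0.05
for t in range(2000):
    idx = rng.choice(n, size=batch, replace=False)   # sample I_t
    r = W[idx] @ x - b[idx]
    g = 2 * W[idx].T @ r / batch        # averaged mini-batch gradient
    x = x - eta * g

print(np.round(x, 2))                   # close to x_true
```

The averaged gradient is less noisy than a single-example one, which in practice allows a larger step size and makes better use of vectorized hardware.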

