
ECE 595: Machine Learning I

Lecture 05 Gradient Descent

Spring 2020

Stanley Chan

School of Electrical and Computer Engineering


Purdue University

Outline

Mathematical Background
Lecture 4: Intro to Optimization
Lecture 5: Gradient Descent

Lecture 5: Gradient Descent


Gradient Descent
Descent Direction
Step Size
Convergence
Stochastic Gradient Descent
Difference between GD and SGD
Why does SGD work?

Gradient Descent
The algorithm:

$$x^{(t+1)} = x^{(t)} - \alpha^{(t)}\,\nabla f(x^{(t)}), \qquad t = 0, 1, 2, \ldots,$$

where $\alpha^{(t)}$ is called the step size.
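A minimal sketch of this update in Python (numpy), assuming the caller supplies the gradient function grad_f; the function name, stopping rule, and default values are illustrative, not part of the slides:

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, max_iter=1000, tol=1e-8):
    """Iterate x_{t+1} = x_t - alpha * grad_f(x_t) until the gradient is small."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:   # (near-)stationary point reached
            break
        x = x - alpha * g             # the descent step
    return x

# Example: f(x) = ||x||^2 / 2 has gradient x; the minimizer is the origin.
x_min = gradient_descent(lambda x: x, x0=[3.0, -4.0])
```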

Why is the direction −∇f (x )?
Recall (Lecture 4): If $x^*$ is optimal, then for every direction $d$,

$$\lim_{\epsilon \to 0} \frac{1}{\epsilon}\left[f(x^* + \epsilon d) - f(x^*)\right] = \nabla f(x^*)^T d \;\ge\; 0, \quad \forall d,$$

which gives $\nabla f(x^*)^T d \ge 0$ for all $d$.

But if $x^{(t)}$ is not optimal, then we want

$$f(x^{(t)} + d) \le f(x^{(t)}).$$

So,

$$\lim_{\epsilon \to 0} \frac{1}{\epsilon}\left[f(x^{(t)} + \epsilon d) - f(x^{(t)})\right] = \nabla f(x^{(t)})^T d \;\le\; 0 \quad \text{for some } d,$$

which gives $\nabla f(x^{(t)})^T d \le 0$.

Descent Direction
Pictorial illustration:
$\nabla f(x)$ is perpendicular to the contour.
A search direction $d$ can be either on the positive side, $\nabla f(x)^T d \ge 0$, or on the negative side, $\nabla f(x)^T d < 0$.
Only those on the negative side can reduce the cost.
All such $d$'s are called descent directions.

The Steepest d
Previous slide: If $x^{(t)}$ is not optimal yet, then some $d$ will give $\nabla f(x^{(t)})^T d \le 0$.

So, let us make $\nabla f(x^{(t)})^T d$ as negative as possible:

$$d^{(t)} = \operatorname*{argmin}_{\|d\|_2 = \delta} \; \nabla f(x^{(t)})^T d.$$

We need $\delta$ to control the magnitude; otherwise $d$ is unbounded.

The solution is
$$d^{(t)} = -\nabla f(x^{(t)}).$$
Why? By Cauchy-Schwarz,
$$\nabla f(x^{(t)})^T d \ge -\|\nabla f(x^{(t)})\|_2 \, \|d\|_2.$$
The minimum is attained when $d = -\nabla f(x^{(t)})$; set $\delta = \|\nabla f(x^{(t)})\|_2$.
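As a quick numerical check of this Cauchy-Schwarz argument (a sketch, not from the slides): sample many random unit-norm directions $d$ and verify that $\nabla f(x)^T d$ never goes below $-\|\nabla f(x)\|_2$, the value attained by the normalized negative gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
grad = rng.normal(size=5)                      # stand-in for ∇f(x)

# Inner products ∇f(x)^T d over random unit-norm directions d
d = rng.normal(size=(100000, 5))
d /= np.linalg.norm(d, axis=1, keepdims=True)
sampled_min = (d @ grad).min()

# Cauchy-Schwarz lower bound, attained at d = -∇f(x)/||∇f(x)||
bound = -np.linalg.norm(grad)
print(sampled_min, ">=", bound)                # sampled_min approaches the bound
```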
Steepest Descent Direction
Pictorial illustration:
Put a ball around the current point.
All $d$'s inside the ball are feasible.
Pick the one that minimizes $\nabla f(x)^T d$.
This direction must be parallel (but opposite in sign) to $\nabla f(x)$.

Step Size
The algorithm:
$$x^{(t+1)} = x^{(t)} - \alpha^{(t)}\,\nabla f(x^{(t)}), \qquad t = 0, 1, 2, \ldots,$$
where $\alpha^{(t)}$ is called the step size.

1. Fixed step size: $\alpha^{(t)} = \alpha$.

2. Exact line search:
$$\alpha^{(t)} = \operatorname*{argmin}_{\alpha} \; f\left(x^{(t)} + \alpha d^{(t)}\right).$$
E.g., if $f(x) = \frac{1}{2} x^T H x + c^T x$, then
$$\alpha^{(t)} = -\frac{\nabla f(x^{(t)})^T d^{(t)}}{d^{(t)T} H d^{(t)}}.$$

3. Inexact line search: Armijo / Wolfe conditions. See Nocedal-Wright Chapter 3.1.
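For the quadratic case, the exact line search step has the closed form above. A small sketch (the matrix $H$ and vector $c$ below are made up for illustration; $H$ is assumed symmetric positive definite):

```python
import numpy as np

def exact_step(H, grad, d):
    """alpha = -grad^T d / (d^T H d) for f(x) = 0.5 x^T H x + c^T x."""
    return -float(grad @ d) / float(d @ H @ d)

H = np.array([[3.0, 0.5], [0.5, 1.0]])   # symmetric positive definite
c = np.array([1.0, -2.0])
x = np.zeros(2)
grad = H @ x + c                          # ∇f(x) = Hx + c
d = -grad                                 # steepest descent direction
alpha = exact_step(H, grad, d)            # positive when d is a descent direction
x_next = x + alpha * d
```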
Convergence
Let $x^*$ be the global minimizer. Assume the following:
$f$ is twice differentiable, so that $\nabla^2 f$ exists.
$0 \prec \lambda_{\min} I \preceq \nabla^2 f(x) \preceq \lambda_{\max} I$ for all $x \in \mathbb{R}^n$.
Run gradient descent with exact line search.
Then (Nocedal-Wright Chapter 3, Theorem 3.3),
$$f(x^{(t+1)}) - f(x^*) \le \left(1 - \frac{\lambda_{\min}}{\lambda_{\max}}\right)^{2} \left(f(x^{(t)}) - f(x^*)\right)$$
$$\le \left(1 - \frac{\lambda_{\min}}{\lambda_{\max}}\right)^{4} \left(f(x^{(t-1)}) - f(x^*)\right)$$
$$\le \cdots \le \left(1 - \frac{\lambda_{\min}}{\lambda_{\max}}\right)^{2t} \left(f(x^{(1)}) - f(x^*)\right).$$
Thus, $f(x^{(t)}) \to f(x^*)$ as $t \to \infty$.
Understanding Convergence
Gradient descent can be viewed as successive approximation.
Approximate the function as
$$f(x^t + d) \approx f(x^t) + \nabla f(x^t)^T d + \frac{1}{2\alpha}\|d\|^2.$$
We can show that the $d$ minimizing this approximation is $d = -\alpha \nabla f(x^t)$: setting the gradient with respect to $d$ to zero gives $\nabla f(x^t) + d/\alpha = 0$.
This suggests: use a quadratic function to locally approximate $f$.
The iteration converges when the step size $\alpha$ (the inverse curvature of the approximation) is not too big.

Advice on Gradient Descent

Gradient descent is useful because:

It is simple to implement (compared to ADMM, FISTA, etc.)
It has low computational cost per iteration (no matrix inversion)
It requires only the first-order derivative (no Hessian)
The gradient is available in deep networks (via back propagation)
Most machine learning packages have built-in (stochastic) gradient descent

You are welcome to implement your own, but you need to be careful with:
Convex non-differentiable problems, e.g., the $\ell_1$-norm
Non-convex problems, e.g., ReLU in deep networks
Traps by local minima
Inappropriate step size, a.k.a. learning rate

Consider more "transparent" tools such as CVX when:
Formulating problems (no need to worry about the algorithm)
Trying to obtain insights

Outline

Mathematical Background
Lecture 4: Intro to Optimization
Lecture 5: Gradient Descent

Lecture 5: Gradient Descent


Gradient Descent
Descent Direction
Step Size
Convergence
Stochastic Gradient Descent
Difference between GD and SGD
Why does SGD work?

Stochastic Gradient Descent
Most loss functions in machine learning problems are separable:
$$J(\theta) = \frac{1}{N}\sum_{n=1}^{N} L(g_\theta(x_n), y_n) = \frac{1}{N}\sum_{n=1}^{N} J_n(\theta). \tag{1}$$
For example,
Square loss:
$$J(\theta) = \sum_{n=1}^{N} \left(g_\theta(x_n) - y_n\right)^2$$
Cross-entropy loss:
$$J(\theta) = -\sum_{n=1}^{N} \left[\, y_n \log g_\theta(x_n) + (1 - y_n)\log\left(1 - g_\theta(x_n)\right) \right]$$
Logistic loss:
$$J(\theta) = \sum_{n=1}^{N} \log\left(1 + e^{-y_n \theta^T x_n}\right)$$
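A sketch of these three losses for a linear model $g_\theta(x) = \theta^T x$ (with a sigmoid applied for cross-entropy); the function names are illustrative. Cross-entropy assumes labels $y_n \in \{0, 1\}$, while the logistic loss assumes $y_n \in \{-1, +1\}$:

```python
import numpy as np

def square_loss(theta, X, y):
    return np.sum((X @ theta - y) ** 2)

def cross_entropy_loss(theta, X, y):          # y in {0, 1}
    p = 1.0 / (1.0 + np.exp(-X @ theta))      # sigmoid of the linear score
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def logistic_loss(theta, X, y):               # y in {-1, +1}
    return np.sum(np.log1p(np.exp(-y * (X @ theta))))
```

Here X has shape (N, d), theta has shape (d,), and y has shape (N,).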
Full Gradient vs. Partial Gradient
Vanilla gradient descent:
$$\theta^{t+1} = \theta^{t} - \eta^{t}\, \underbrace{\nabla J(\theta^{t})}_{\text{main computation}}. \tag{2}$$

The full gradient of the loss is
$$\nabla J(\theta) = \frac{1}{N}\sum_{n=1}^{N} \nabla J_n(\theta). \tag{3}$$

Stochastic gradient descent approximates it by
$$\nabla J(\theta) \approx \frac{1}{|B|}\sum_{n \in B} \nabla J_n(\theta), \tag{4}$$

where $B \subseteq \{1, \ldots, N\}$ is a random subset and $|B|$ is the batch size.


SGD Algorithm
Algorithm (Stochastic Gradient Descent)
1. Given $\{(x_n, y_n) \mid n = 1, \ldots, N\}$.
2. Initialize $\theta$ (zero or random).
3. For $t = 1, 2, 3, \ldots$:
Draw a random subset $B \subseteq \{1, \ldots, N\}$.
Update
$$\theta^{t+1} = \theta^{t} - \eta^{t}\, \frac{1}{|B|}\sum_{n \in B} \nabla J_n(\theta^{t}). \tag{5}$$

If $|B| = 1$, then use only one sample at a time.

The approximate gradient is unbiased (see Appendix for proof):
$$\mathbb{E}\left[\frac{1}{|B|}\sum_{n \in B} \nabla J_n(\theta)\right] = \nabla J(\theta).$$
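A sketch of this loop in Python, assuming a per-sample gradient function grad_Jn(theta, n) supplied by the caller; the decaying schedule $\eta^t = \eta_0 t^{-a}$ follows the advice on the next slide, and all names and defaults are illustrative:

```python
import numpy as np

def sgd(grad_Jn, theta0, N, batch_size=32, eta0=0.1, a=0.5,
        num_iter=1000, seed=0):
    """theta <- theta - eta_t * (1/|B|) * sum_{n in B} grad_Jn(theta, n)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for t in range(1, num_iter + 1):
        batch = rng.choice(N, size=batch_size, replace=False)  # random subset B
        g = np.mean([grad_Jn(theta, n) for n in batch], axis=0)
        eta_t = eta0 * t ** (-a)        # decaying step size, eta_t -> 0
        theta = theta - eta_t * g
    return theta
```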
Interpreting SGD
Just showed that the SGD step is unbiased:
$$\mathbb{E}\left[\frac{1}{|B|}\sum_{n \in B} \nabla J_n(\theta)\right] = \nabla J(\theta).$$

An unbiased gradient implies that each update is

gradient + zero-mean noise.

Step size: SGD with a constant step size does not converge.
If $\theta^*$ is a minimizer, then $\nabla J(\theta^*) = \frac{1}{N}\sum_{n=1}^{N} \nabla J_n(\theta^*) = 0$. But
$$\frac{1}{|B|}\sum_{n \in B} \nabla J_n(\theta^*) \neq 0, \quad \text{since } B \text{ is a subset.}$$

Typical strategy: start with a large step size and gradually decrease it, $\eta^t \to 0$; e.g., $\eta^t = t^{-a}$ for some constant $a$.
Perspectives of SGD

The classical optimization literature has the following observations.

Compared to GD in convex problems:
SGD offers a trade-off between accuracy and efficiency
More iterations
Fewer gradient evaluations per iteration
Noise is a by-product

Recent studies of SGD for non-convex problems found that:
SGD works for training deep neural networks
SGD finds a solution faster
SGD finds better local minima
Noise matters
GD compared to SGD
[Slide figure not preserved in the extraction.]
Smoothing the Landscape
Analyzing SGD is an active research topic. Here is one analysis by Kleinberg et al. (ICML 2018, https://arxiv.org/pdf/1802.06175.pdf).
The SGD step can be written as GD + noise:
$$x^{t+1} = x^t - \eta\left(\nabla f(x^t) + w^t\right) = \underbrace{x^t - \eta \nabla f(x^t)}_{\;\stackrel{\text{def}}{=}\, y^t} - \,\eta w^t.$$

$y^t$ is the "ideal" location returned by GD.

Let us analyze $y^{t+1}$:
$$y^{t+1} \stackrel{\text{def}}{=} x^{t+1} - \eta \nabla f(x^{t+1}) = (y^t - \eta w^t) - \eta \nabla f(y^t - \eta w^t).$$
Assume $\mathbb{E}[w] = 0$; then
$$\mathbb{E}[y^{t+1}] = y^t - \eta \nabla\, \mathbb{E}\left[f(y^t - \eta w^t)\right].$$
Smoothing the Landscape

Let us look at $\mathbb{E}[f(y^t - \eta w^t)]$:
$$\mathbb{E}[f(y - \eta w)] = \int f(y - \eta w)\, p(w)\, dw,$$
where $p(w)$ is the distribution of $w$.

$\int f(y - \eta w)\, p(w)\, dw$ is the convolution between $f$ and $p$.
$p(w) \ge 0$ for all $w$, so the convolution always smooths the function.

The learning rate controls the smoothness:
Too small: under-smoothed. You have not yet escaped from a bad local minimum.
Too large: over-smoothed. You may miss a local minimum.

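A small numpy illustration of this convolution view (not from the slides; the test function, Gaussian noise model, and grid are made up): convolving a wiggly 1-D landscape $f$ with the density of $\eta w$ flattens it, and more so for larger $\eta$.

```python
import numpy as np

x = np.linspace(-5, 5, 2001)
f = np.sin(5 * x) + 0.1 * x**2             # wiggly landscape with many local minima

def smoothed(f, eta, sigma=1.0):
    """Convolve f with the density of eta*w, where w ~ N(0, sigma^2)."""
    s = eta * sigma
    kernel = np.exp(-x**2 / (2 * s**2))
    kernel /= kernel.sum()                 # normalize so the kernel sums to 1
    return np.convolve(f, kernel, mode="same")

f_small_eta = smoothed(f, eta=0.1)         # under-smoothed: wiggles survive
f_large_eta = smoothed(f, eta=2.0)         # over-smoothed: fine minima are erased
```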
Reading List

Gradient Descent
S. Boyd and L. Vandenberghe, "Convex Optimization", Chapters 9.2-9.4.
J. Nocedal and S. Wright, "Numerical Optimization", Chapters 3.1-3.3.
Y. Nesterov, "Introductory Lectures on Convex Optimization", Chapter 2.
CMU 10-725 Lecture: https://www.stat.cmu.edu/~ryantibs/convexopt/lectures/grad-descent.pdf

Stochastic Gradient Descent
CMU 10-725 Lecture: https://www.stat.cmu.edu/~ryantibs/convexopt/lectures/stochastic-gd.pdf
Kleinberg et al. (2018), "When Does SGD Escape Local Minima", https://arxiv.org/pdf/1802.06175.pdf

Appendix

Proof of Unbiasedness of SGD gradient

Lemma
If $n$ is a random variable with uniform distribution over $\{1, \ldots, N\}$, then
$$\mathbb{E}\left[\frac{1}{|B|}\sum_{n \in B} \nabla J_n(\theta)\right] = \nabla J(\theta).$$

Proof. Denote the density function of $n$ as $p(n) = 1/N$. Then,
$$\mathbb{E}\left[\frac{1}{|B|}\sum_{n \in B} \nabla J_n(\theta)\right]
= \frac{1}{|B|}\sum_{n \in B} \mathbb{E}\left[\nabla J_n(\theta)\right]
= \frac{1}{|B|}\sum_{n \in B} \sum_{m=1}^{N} \nabla J_m(\theta)\, p(m)$$
$$= \frac{1}{|B|}\sum_{n \in B} \frac{1}{N}\sum_{m=1}^{N} \nabla J_m(\theta)
= \frac{1}{|B|}\sum_{n \in B} \nabla J(\theta)
= \nabla J(\theta).$$

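The lemma can also be checked numerically. A sketch with a made-up per-sample objective $J_n(\theta) = \frac{1}{2}(a_n^T \theta)^2$: averaging minibatch gradients over many random draws should recover the full gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
N, dim = 50, 3
A = rng.normal(size=(N, dim))              # rows a_n define J_n(theta) = 0.5*(a_n^T theta)^2

def grad_Jn(theta, n):
    return A[n] * (A[n] @ theta)           # per-sample gradient

theta = rng.normal(size=dim)
full_grad = np.mean([grad_Jn(theta, n) for n in range(N)], axis=0)

batch_grads = []
for _ in range(20000):
    B = rng.choice(N, size=5, replace=False)
    batch_grads.append(np.mean([grad_Jn(theta, n) for n in B], axis=0))

print(np.mean(batch_grads, axis=0))        # ~ full_grad, confirming unbiasedness
print(full_grad)
```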
Q&A 1: What is momentum method?

The momentum method was originally proposed by Polyak (1964).

The momentum method says:
$$x^{t+1} = x^{t} - \alpha\left(\beta g^{t-1} + (1 - \beta)\, g^{t}\right),$$
where $g^t = \nabla f(x^t)$ and $0 < \beta < 1$ is the damping constant.

The momentum method can be applied to both gradient descent and stochastic gradient descent.
A variant is the Nesterov accelerated gradient (NAG) method (1983).
The importance of NAG is elaborated by Sutskever et al. (2013).
The key idea of NAG is to write $x^{t+1}$ as a linear combination of $x^t$ and the span of the past gradients.
Yurii Nesterov proved that such a combination is the best one can do with first-order methods.
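A minimal sketch of this damped-gradient form of the update, with $g^{t-1}$ carried between iterations; the function names, defaults, and the quadratic test problem in the usage line are illustrative:

```python
import numpy as np

def momentum_descent(grad_f, x0, alpha=0.1, beta=0.9, num_iter=500):
    """x_{t+1} = x_t - alpha * (beta * g_{t-1} + (1 - beta) * g_t)."""
    x = np.asarray(x0, dtype=float)
    g_prev = grad_f(x)                     # initialize the gradient memory
    for _ in range(num_iter):
        g = grad_f(x)
        x = x - alpha * (beta * g_prev + (1 - beta) * g)
        g_prev = g
    return x

# Example: f(x) = ||x||^2 / 2 has gradient x; the iterates approach the origin.
x_min = momentum_descent(lambda x: x, x0=[3.0, -4.0])
```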
Q&A 1: What is momentum method?

Here are some references on the momentum method.

Sutskever et al. (2013), "On the importance of initialization and momentum in deep learning", http://proceedings.mlr.press/v28/sutskever13.pdf
UIC Lecture Note: https://www2.cs.uic.edu/~zhangx/teaching/agm.pdf
Cornell Lecture Note: http://www.cs.cornell.edu/courses/cs6787/2017fa/Lecture3.pdf
Yurii Nesterov, "Introductory Lectures on Convex Optimization", 2003. (See Assumption 2.1.4 and the discussion thereafter.)
G. Goh, "Why Momentum Really Works", https://distill.pub/2017/momentum/

Q&A 2: With exact line search, will we get to a minimum in one step?
No. Exact line search only allows you to converge faster. It does not guarantee convergence in one step.
Here is an example: the Rosenbrock function.
[Slide figure not preserved in the extraction.]

Q&A 3: Any example of gradient descent?
Consider the loss function
$$J(\theta) = \frac{1}{2}\|A\theta - y\|^2.$$
Then the gradient is
$$\nabla J(\theta) = A^T(A\theta - y).$$
So the gradient descent step is
$$\theta^{t+1} = \theta^{t} - \eta\, \underbrace{A^T(A\theta^{t} - y)}_{\nabla J(\theta^t)}.$$

Since $J$ is quadratic, you can find the exact line search step size in closed form (with $d = -\nabla J(\theta)$):
$$\eta = \|d\|^2 / \left(d^T A^T A d\right).$$
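Putting the pieces of this example together in a short sketch (the matrix A and vector y below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
y = rng.normal(size=20)

theta = np.zeros(5)
for _ in range(100):
    grad = A.T @ (A @ theta - y)           # ∇J(θ) = A^T(Aθ - y)
    if np.linalg.norm(grad) < 1e-10:       # stop once (numerically) optimal
        break
    d = -grad                              # steepest descent direction
    eta = (d @ d) / (d @ (A.T @ (A @ d)))  # exact line search step size
    theta = theta + eta * d

# Compare against the least-squares solution
theta_star, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.linalg.norm(theta - theta_star))  # ~ 0
```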
Q&A 4: In finding the steepest direction, why is δ unimportant?

The constraint $\|d\| = \delta$ is necessary for minimizing $\nabla f(x)^T d$.

Without the constraint, the problem is unbounded below, and the solution is $-\infty$ times whatever direction $d$ lives on the negative half-plane of $\nabla f$.
For any $\delta$, the solution (by the Cauchy-Schwarz inequality) is
$$d = -\delta\, \frac{\nabla f(x)}{\|\nabla f(x)\|}.$$
You can show that this $d$ minimizes $\nabla f(x)^T d$ and satisfies $\|d\| = \delta$.

Now, if we use this $d$ in the gradient descent step, the step size $\alpha$ will compensate for the $\delta$.
So we can just choose $\delta = 1$ and the above derivation still works.
Q&A 5: What is a good batch size for SGD?
There is no definite answer. Generally you need to look at the validation
curve to determine if you need to increase/decrease the mini-batch size.
Here are some suggestions in the literature.
Bengio (2012), https://arxiv.org/pdf/1206.5533.pdf:
"[batch size] is typically chosen between 1 and a few hundreds, e.g. [batch size] = 32 is a good default value, with values above 10 taking advantage of the speedup of matrix-matrix products over matrix-vector products."
Masters and Luschi (2018), https://arxiv.org/abs/1804.07612:
"The presented results confirm that using small batch sizes achieves the best training stability and generalization performance, for a given computational cost, across a wide range of experiments. In all cases the best results have been obtained with batch sizes m = 32 or smaller, often as small as m = 2 or m = 4."
