0% found this document useful (0 votes)

13 views37 pages

4 - Gradient Descent and Stochastic GD

The document provides an overview of gradient descent, a method for finding local optima of differentiable functions, detailing its algorithm, step size considerations, and variations like stochastic gradient descent and momentum-based methods. It discusses the importance of learning rates, the challenges of convergence, and introduces adaptive methods such as AdaGrad and RMSProp. Additionally, it highlights techniques like backtracking line search and mini-batching to improve efficiency in optimization processes.

Uploaded by

ghughudekhecho

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

13 views37 pages

4 - Gradient Descent and Stochastic GD

Uploaded by

ghughudekhecho

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

You are on page 1/ 37

Gradient Descent

1
The content of these slides has been gathered from various online sources. We extend our sincere gratitude to everyone who has contributed their work.
Gradient Descent
• Method to find local optima of
differentiable function
• Intuition: gradient tells us direction of greatest
increase, negative gradient gives us direction
of greatest decrease
• Take steps in directions that reduce the function
value

• Definition of derivative guarantees that if we

take a small enough step in the direction of
the negative gradient, the function will
decrease in value
• How small is small enough?
Gradient Descent
• Gradient Descent Algorithm:
• Pick an initial point
• Iterate until convergence

where is the step size (learning rate)

When do we stop?

Possible Stopping Criteria: iterate until for some

How small should be?

Gradient Descent
2
𝑓 ( 𝑥 ) =𝑥

Step size:
𝑥(0) =− 4
Gradient Descent
2
𝑓 ( 𝑥 ) =𝑥

Step size:
𝑥(0) =− 4
𝑥(1)=− 4 −.8 ⋅2 ⋅( − 4)
Gradient Descent
2
𝑓 ( 𝑥 ) =𝑥

Step size:
𝑥(0) =− 4
𝑥(1)=2.4
Gradient Descent
2
𝑓 ( 𝑥 ) =𝑥

Step size:
𝑥(0) =− 4
𝑥(1)=2.4
𝑥(2) =2.4 −.8 ⋅ 2⋅2.4

𝑥(1)=0.4
Gradient Descent
2
𝑓 ( 𝑥 ) =𝑥

Step size:
𝑥(0) =− 4
𝑥(1)=2.4
𝑥(2) =−1.44
Gradient Descent
2
𝑓 ( 𝑥 ) =𝑥

Step size:
𝑥(0) =− 4
𝑥(1)=2.4
1.44
(3)
𝑥 =.864
𝑥(4 )=− 0.5184
𝑥(5) =0.31104

𝑥(30) =−8.84296 𝑒−07

Gradient Descent

• If the learning rate is big, the

weights slosh to and fro across
the ravine.
• If the learning rate is too big,
this oscillation diverges.
• What we would like to achieve:
• Move quickly in directions with
small but consistent gradients.
• Move slowly in directions with
big but inconsistent gradients.
Step size: .9

1
Gradient Descent

Step size: .2

1
Gradient Descent

Step size matters!

1
Gradient Descent

Step size matters!

1
Line Search
• Instead of picking a fixed step size that may or may
not actually result in a decrease in the function
value, we can consider minimizing the function along
the direction specified by the gradient to guarantee
that the next iteration decreases the function value
• In other words choose,

• This is called exact line search

• This optimization problem can be expensive to solve

exactly
• However, if is convex, this is a univariate convex
optimization problem

1
Backtracking Line Search
• Instead of exact line search, could use a strategy
that finds some step size that decreases the
function value (one must exist)

• Backtracking line search: start with a large step

size, , and keep shrinking it until

• This always guarantees a decrease, but it may

not decrease as much as exact line search
• Still, this is typically much faster in practice as it only
requires a few function evaluations

1
Backtracking Line Search
• To implement backtracking line search,
choose two parameters

• Set

• While

Iterations continue until

a step size is found that
• Set decreases the function
“enough”

1
Backtracking Line Search

𝛽=.99

1
Backtracking Line Search

𝛽=.3

1
Stochastic Gradient Descent
• Consider minimizing an average of functions

• This setting is common in machine learning,

where this average of functions is equivalent
to a loss function.
• Each is associated to the loss term of an
individual sample point .
• The full gradient descent step is given by

1
Stochastic Gradient Descent
• If the dataset is highly redundant, then idea is now to just use a
subset of all samples, i.e. all possible ’s to approximate the full
gradient.
• This is called stochastic gradient descent, or short, SGD. More

this approach updates weights after each case. It" ’𝒔 "called

formally, stochastic gradient repeats. The extreme version of

“online”.

where is a randomly chosen idnex at iteration k. Because we have ), the

estimate is unbiased.

The indicies are usually chosen without replacement until we complete

one full cycle through the entire data set.

2
Mini-batching
• A common technique employed with SGD
is mini-batching, where we choose a
random subset with size . We then repeat

• Because we have ), the estimate is unbiased of the

full gradient.
• Mini-batches need to be balanced for
classes

2
Complexity
• For a problem with n data points, mini-
batch size b and feature dimension d, we
obtain the following costs of standard SGD
and batch-SGD:
• full gradient: O(nd)
• mini-batch: O(bd)
• standard SGD: O(d)

2
Some observations about gradient descent
• It takes a lot of time to navigate regions
having a gentle slope
• This is because the gradient in these regions
is very small
• Can we do something better ?
• Yes, let’s take a look at ‘Momentum based
gradient descent’

2
Momentum based Gradient Descent
• Intuition
• If I am repeatedly being asked to move in the
same direction, then I should probably gain
some confidence and start taking bigger steps in
that direction
• Just as a ball gains momentum while rolling
down a slope

2
Motivation of momentum

• Let be a simple quadratic function

over :

for parameters . The unique minimizer

of is
• The gradient of is

2
Motivation of momentum

• The gradient of is

• If we run gradient descent with a

constant step size , the relation between
iterates and is

2
Motivation of momentum

• Since we want the iterates to go as fast as

possible to zero, we would like to choose such
that
and

2
Motivation of momentum

and

• If , this is fine, we can easily set

• But if is much smaller than and if set ,

and the second coordinate of the iterates, ,

diverges when . Similar observation

2
Motivation of momentum

• In this situation, gradient descent is slow.

• Till iteration 15 with and , for

• Now, the convergence will be too slow

2
Momentum based Gradient Descent
• A possible remedy to this slow
convergence is to use the
information given by the past
gradients when we define from :
• instead of moving in the direction
given by −, we move in a direction
which is a (weighted) average
between − and the previous
gradients −
• Concretely, this yields the
following iteration formula:

3
Some observations and questions
• Even in the regions having gentle slopes,
momentum-based gradient descent can take
large steps because the momentum carries it
along
• Is moving fast always good? Would there be
a situation where momentum would cause
us to run pass our goal?
• Momentum based gradient descent oscillates in
and out of the minima valley as the momentum
carries it out of the valley
• Despite oscillations, it still converges faster than
vanilla gradient descent

3
Adaptive Sub-Gradient Method
• Adaptive methods adjust the learning rate for each
parameter individually based on the history of the
gradients.
• The idea is to give frequently occurring features a smaller
learning rate and infrequent features a larger learning rate.
• Variants:
• AdaGrad: Accumulates the square of the gradients over time and
uses this information to scale the learning rate for each
parameter. It works well for sparse data but tends to make the
learning rate too small over time.
• RMSProp: Modifies AdaGrad by using an exponentially decaying
average of squared gradients instead of a cumulative sum, which
helps to prevent the learning rate from decaying too much.
• Adam: Combines the ideas of momentum and RMSProp. It uses
running averages of both the gradients and their squares, making
it one of the most widely used optimizers.

3
Adagrad
• Divide the learning rate by “average”
gradient
• High-dimensional non-convex nature of neural
networks optimization could lead to different
sensitivity on each dimension.
• The learning rate could be too small in some
dimension 𝑡 +1
and could
𝑡 𝜂 be𝑡 too large in another
𝑤 ←𝑤 − 𝑔
dimension. 𝜎

: Average gradient Estimated while

of parameter w updating the
parameters

If has small average gradient Larger learning rate

If has large average gradient Smaller learning rate

Adagrad

1 0 𝜂 0
𝑤 ←𝑤 − 0
𝑔 0
𝜎 =𝑔
0
𝜎

2 𝜂
𝑤 ←𝑤 − 1 𝑔
𝜎
1 1 1 1 02
2 √
𝜎 = [ ( 𝑔 ) +( 𝑔 ) ]
1 2

3
𝑤 ←𝑤 −
2
…… 𝜂
𝜎
2
𝑔
2 2 1 02
3 √
𝜎 = [ ( 𝑔 ) + ( 𝑔 ) +( 𝑔 ) ]
1 2 2 2

√
𝑡
1
𝑤
𝑡 +1 𝜂 𝑡
←𝑤 − 𝑡 𝑔
𝜎
𝑡
𝜎= 𝑡
∑
𝑡 +1 𝑖=0
𝑖 2
(𝑔 )
Adagrad
• Divide the learning rate by “average”
gradient
• The “average” gradient is obtained while
updating the parameters

√
𝑡
𝜂𝜂𝑡 1
𝑤
𝑡 +1 𝑡
←𝑤 −
𝜎
𝑡
𝑔
𝑡
𝜎= 𝑡
∑
𝑡 +1 𝑖=0
𝑖 2
(𝑔 )
𝜂 𝑡 +1 𝑡 𝜂 𝑡
𝜂𝑡 = 𝑤 ←𝑤 − 𝑔
√ 𝑡 +1
√
𝑡

∑ ( 𝑔𝑖 )
2

1/t decay 𝑖 =0
Adagrad
Original Gradient Descent
𝜃 ← 𝜃 −𝜂 𝛻 𝐶 ( 𝜃 )
𝑡 𝑡 −1 𝑡 −1

Each parameter w are considered separately

𝜕 𝐶 ( 𝜃𝑡 )
𝑡
𝑤𝑡 +1 ←𝑤 𝑡 − 𝜂 𝑤 𝑔𝑡 𝑔 =
𝜕𝑤

Parameter dependent
learning rate

𝜂 constant
𝜂 𝑤=

√
𝑡
Summation of the square of the
∑ ( 𝑔𝑖 )
2

𝑖=0 previous derivatives

Acknowledgement
• http://wavelab.uwaterloo.ca/wp-content/uploads/2017/04/
Lecture_3.pdf
• https://heartbeat.fritz.ai/deep-learning-best-practices-regula
rization-techniques-for-better-performance-of-neural-networ
k-94f978a4e518
• https://cedar.buffalo.edu/~srihari/CSE676/7.12%20Dropout.
pdf
• http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2017/Lect
ure/DNN%20tip.pptx
• Accelerating Deep Network Training by Reducing Internal
Covariate Shift, Jude W. Shavlik
• http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Le
cture/ForDeep.pptx
• Deep Learning Tutorial. Prof. Hung-yi Lee, NTU.

Gradient Descent
No ratings yet
Gradient Descent
17 pages
MTH601-11 Kal
83% (6)
MTH601-11 Kal
74 pages
Discussion 4 CS771
No ratings yet
Discussion 4 CS771
25 pages
Euler Equation - Wave Equation Connection: T. H. Pulliam Stanford University
No ratings yet
Euler Equation - Wave Equation Connection: T. H. Pulliam Stanford University
12 pages
Comparison of Gradient Descent Algorithms On Training Neural Networks
No ratings yet
Comparison of Gradient Descent Algorithms On Training Neural Networks
20 pages
Soal CISDM
No ratings yet
Soal CISDM
3 pages
Signals Spectra, and Signal Processing Laboratory: October 7, 2020 October 7, 2020
No ratings yet
Signals Spectra, and Signal Processing Laboratory: October 7, 2020 October 7, 2020
16 pages
13 - Chapter 5 PDF
No ratings yet
13 - Chapter 5 PDF
40 pages
VHDL Implementation of 128 Bit Pipelined Blowfish Algorithm
No ratings yet
VHDL Implementation of 128 Bit Pipelined Blowfish Algorithm
5 pages
Lecture 7 - Optimization Part I
No ratings yet
Lecture 7 - Optimization Part I
38 pages
Optimization Techniques in Deep Learning
No ratings yet
Optimization Techniques in Deep Learning
14 pages
Gradient Descent Optimization
No ratings yet
Gradient Descent Optimization
27 pages
Deep Learning
No ratings yet
Deep Learning
20 pages
Image Compression Models: Fig: Functional Block Diagram of A General Image Compression System
No ratings yet
Image Compression Models: Fig: Functional Block Diagram of A General Image Compression System
2 pages
Lecture 8 Gradient Descent For Non-Convex Functions
No ratings yet
Lecture 8 Gradient Descent For Non-Convex Functions
21 pages
Akar Persamaan 1
No ratings yet
Akar Persamaan 1
15 pages
Modular Assessment Grade 11: Statistics and Probability Mr. Antonio E. Soto JR
No ratings yet
Modular Assessment Grade 11: Statistics and Probability Mr. Antonio E. Soto JR
4 pages
L5 - UCLxDeepMind DL2020
No ratings yet
L5 - UCLxDeepMind DL2020
52 pages
Scsa1702 - Ai - Unit Ii
No ratings yet
Scsa1702 - Ai - Unit Ii
42 pages
Naval Research Laboratory Washington, DC 20375-5320 Nrl/Mr/6410!93!7192
No ratings yet
Naval Research Laboratory Washington, DC 20375-5320 Nrl/Mr/6410!93!7192
134 pages
Mlfa Autumn 23 Optimization
No ratings yet
Mlfa Autumn 23 Optimization
37 pages
An Overview of Gradient Descent Optimization Algorithms PDF
No ratings yet
An Overview of Gradient Descent Optimization Algorithms PDF
12 pages
3 Gradient Descent
No ratings yet
3 Gradient Descent
8 pages
Tut04 - One Algorithm To Optimize Them All
No ratings yet
Tut04 - One Algorithm To Optimize Them All
19 pages
ML Lecture2
No ratings yet
ML Lecture2
36 pages
A Comparative Study On Fake Job Post Prediction Using Different Data Mining Techniques
100% (1)
A Comparative Study On Fake Job Post Prediction Using Different Data Mining Techniques
5 pages
School of Advanced Sciences MAT5005: Advanced Mathematical Methods Question Bank
No ratings yet
School of Advanced Sciences MAT5005: Advanced Mathematical Methods Question Bank
2 pages
DRCT Sample Paper
No ratings yet
DRCT Sample Paper
6 pages
B.Tech Tutorial CAD Algorithm For Synthesis of Digital Systems
No ratings yet
B.Tech Tutorial CAD Algorithm For Synthesis of Digital Systems
2 pages
Gradient Descent Algorithm in Machine Learning
No ratings yet
Gradient Descent Algorithm in Machine Learning
21 pages
Linear - Programming-Notes Unit 1
No ratings yet
Linear - Programming-Notes Unit 1
31 pages
Multiclass Classification of DGA Based Malware Using NLP
No ratings yet
Multiclass Classification of DGA Based Malware Using NLP
3 pages
406 QM MCQ 75
No ratings yet
406 QM MCQ 75
3 pages
5 Optimizers
No ratings yet
5 Optimizers
10 pages
Quiz6 - Computer Vision
No ratings yet
Quiz6 - Computer Vision
3 pages
GD Types
No ratings yet
GD Types
98 pages
UNIT3
No ratings yet
UNIT3
37 pages
Deep Learning (MODULE-2)
No ratings yet
Deep Learning (MODULE-2)
86 pages
Bivariate Regression - Part I: Indep Var / Dep Var Continuous Discrete
No ratings yet
Bivariate Regression - Part I: Indep Var / Dep Var Continuous Discrete
4 pages
ML - Week 06
No ratings yet
ML - Week 06
31 pages
11 Gradient Descent
No ratings yet
11 Gradient Descent
58 pages
Gradient-Based Optimizers
No ratings yet
Gradient-Based Optimizers
54 pages
Gradient Descent Algorithm in Machine Learning: Dr. P. K. Chaurasia
No ratings yet
Gradient Descent Algorithm in Machine Learning: Dr. P. K. Chaurasia
24 pages
Gradient Descent
No ratings yet
Gradient Descent
13 pages
Gradient Descent A Fundamental Optimization Algorithm
No ratings yet
Gradient Descent A Fundamental Optimization Algorithm
30 pages
DL Unit - 2
No ratings yet
DL Unit - 2
20 pages
BME 6407 - Class 10 (April 2023)
No ratings yet
BME 6407 - Class 10 (April 2023)
31 pages
Lecture 5
No ratings yet
Lecture 5
34 pages
PCA and Convex Optimization and Bias, Variance-2
No ratings yet
PCA and Convex Optimization and Bias, Variance-2
29 pages
Deep Learning For Diagnosis and Classification of Faults in Industrial Rotating Machinery
No ratings yet
Deep Learning For Diagnosis and Classification of Faults in Industrial Rotating Machinery
23 pages
Gradient Descent
No ratings yet
Gradient Descent
27 pages
Gradient Descent Final
No ratings yet
Gradient Descent Final
27 pages
Gradient Descent
No ratings yet
Gradient Descent
5 pages
Unit3 Rev3
No ratings yet
Unit3 Rev3
201 pages
Cours 5
No ratings yet
Cours 5
23 pages
Gradient Descent DS Rohit Sharma Fench Knjs
No ratings yet
Gradient Descent DS Rohit Sharma Fench Knjs
15 pages
S09 DNN Gradients Wip
No ratings yet
S09 DNN Gradients Wip
28 pages
14-RMSProp and Adam Optimization-12!08!2024
No ratings yet
14-RMSProp and Adam Optimization-12!08!2024
2 pages
Technical Writing
No ratings yet
Technical Writing
9 pages
K Means
No ratings yet
K Means
19 pages
LInear
No ratings yet
LInear
14 pages
WINSEM2024-25 CSE4006 ETH AP2024254000693 2025-01-08 Reference-Material-I
No ratings yet
WINSEM2024-25 CSE4006 ETH AP2024254000693 2025-01-08 Reference-Material-I
40 pages
Lec05 Quantization I
No ratings yet
Lec05 Quantization I
70 pages
Gradient Descent
No ratings yet
Gradient Descent
4 pages
5.1loss Function, Optimization, GD
No ratings yet
5.1loss Function, Optimization, GD
39 pages
OEE Templet
No ratings yet
OEE Templet
2 pages
Backpropagation Optimization Tutorial
No ratings yet
Backpropagation Optimization Tutorial
14 pages
Gradient Descent
No ratings yet
Gradient Descent
15 pages
Optimization Techniques (SGD Alternatives)
No ratings yet
Optimization Techniques (SGD Alternatives)
34 pages
Wa0007
No ratings yet
Wa0007
3 pages
Gradient Descent (GD) - GD With Momentum - Nesterov Accelerated GD - Stochastic GD - OrIGINAL
No ratings yet
Gradient Descent (GD) - GD With Momentum - Nesterov Accelerated GD - Stochastic GD - OrIGINAL
25 pages
Cycle Test 1 ADSA
No ratings yet
Cycle Test 1 ADSA
1 page
Lab 6
No ratings yet
Lab 6
8 pages
알파폴드1논문
No ratings yet
알파폴드1논문
27 pages
Gradient Descent Algorithm Is A First
No ratings yet
Gradient Descent Algorithm Is A First
5 pages
Gradient Descent - PR
No ratings yet
Gradient Descent - PR
31 pages
Gradient Descent Method
No ratings yet
Gradient Descent Method
12 pages
Optim
No ratings yet
Optim
33 pages
chp2 Gradient Descent Algorithm
No ratings yet
chp2 Gradient Descent Algorithm
5 pages
ML Lec 08 Gradient Descent
No ratings yet
ML Lec 08 Gradient Descent
37 pages
DL Test-2
No ratings yet
DL Test-2
28 pages
Paper 03
No ratings yet
Paper 03
13 pages
Gradient Descent
No ratings yet
Gradient Descent
52 pages
Adam Optimizer
No ratings yet
Adam Optimizer
22 pages
Simple Equations Questions
No ratings yet
Simple Equations Questions
4 pages
Module 4 Lab 3
No ratings yet
Module 4 Lab 3
6 pages
DL Regularization
No ratings yet
DL Regularization
51 pages
Gradient Descent Algorithm.Y...
No ratings yet
Gradient Descent Algorithm.Y...
10 pages
Implement 03-1
No ratings yet
Implement 03-1
24 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

4 - Gradient Descent and Stochastic GD

Uploaded by

4 - Gradient Descent and Stochastic GD

Uploaded by

Gradient Descent

• Definition of derivative guarantees that if we

where is the step size (learning rate)

Possible Stopping Criteria: iterate until for some

How small should be?

𝑥(30) =−8.84296 𝑒−07

• If the learning rate is big, the

Step size matters!

Step size matters!

• This is called exact line search

• This optimization problem can be expensive to solve

• Backtracking line search: start with a large step

• This always guarantees a decrease, but it may

Iterations continue until

• This setting is common in machine learning,

this approach updates weights after each case. It" ’𝒔 "called

where is a randomly chosen idnex at iteration k. Because we have ), the

The indicies are usually chosen without replacement until we complete

• Because we have ), the estimate is unbiased of the

• Let be a simple quadratic function

for parameters . The unique minimizer

• If we run gradient descent with a

• Since we want the iterates to go as fast as

• If , this is fine, we can easily set

and the second coordinate of the iterates, ,

• In this situation, gradient descent is slow.

• Till iteration 15 with and , for

: Average gradient Estimated while

If has small average gradient Larger learning rate

If has large average gradient Smaller learning rate

Each parameter w are considered separately

𝑖=0 previous derivatives

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.