4 - Gradient Descent and Stochastic GD

The document provides an overview of gradient descent, a method for finding local optima of differentiable functions, detailing its algorithm, step size considerations, and variations like stochastic gradient descent and momentum-based methods. It discusses the importance of learning rates, the challenges of convergence, and introduces adaptive methods such as AdaGrad and RMSProp. Additionally, it highlights techniques like backtracking line search and mini-batching to improve efficiency in optimization processes.


Gradient Descent

The content of these slides has been gathered from various online sources. We extend our sincere gratitude to everyone who has contributed their work.
Gradient Descent
• Method to find local optima of a differentiable function
• Intuition: the gradient tells us the direction of greatest increase, so the negative gradient gives us the direction of greatest decrease
• Take steps in directions that reduce the function value
• The definition of the derivative guarantees that if we take a small enough step in the direction of the negative gradient, the function will decrease in value
• How small is small enough?
Gradient Descent
• Gradient Descent Algorithm:
  • Pick an initial point x(0)
  • Iterate until convergence:
    x(k+1) = x(k) − η ∇f(x(k))
    where η > 0 is the step size (learning rate)

When do we stop?

Possible stopping criterion: iterate until ‖∇f(x(k))‖ ≤ ε for some small tolerance ε > 0

How small should η be?
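A minimal sketch of this loop in Python (the quadratic example, the step size, and the tolerance below are illustrative choices, not values from the slides):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, eps=1e-6, max_iters=10_000):
    """Basic gradient descent: step against the gradient until it is small."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:   # stopping criterion: ||grad f(x)|| <= eps
            break
        x = x - eta * g                # x(k+1) = x(k) - eta * grad f(x(k))
    return x

# Example: minimize f(x) = x^2, whose gradient is 2x.
x_star = gradient_descent(lambda x: 2 * x, x0=[-4.0], eta=0.8)
```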


Gradient Descent

f(x) = x², step size η = 0.8

x(0) = −4
x(1) = −4 − 0.8 · 2 · (−4) = 2.4
x(2) = 2.4 − 0.8 · 2 · 2.4 = −1.44
x(3) = 0.864
x(4) = −0.5184
x(5) = 0.31104
⋮
x(30) = −8.84296e−07
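A few lines of Python reproduce this trajectory; each update multiplies the iterate by (1 − 2η) = −0.6, so the sign alternates while the magnitude shrinks geometrically:

```python
x, eta = -4.0, 0.8
for k in range(30):
    x = x - eta * 2 * x   # gradient of x^2 is 2x; equivalently x *= (1 - 2*eta)
print(x)                  # approx -8.84296e-07, matching x(30) above
```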


Gradient Descent
• If the learning rate is big, the weights slosh to and fro across the ravine.
• If the learning rate is too big, this oscillation diverges.
• What we would like to achieve:
  • Move quickly in directions with small but consistent gradients.
  • Move slowly in directions with big but inconsistent gradients.

[Plot: gradient descent with step size 0.9]
Gradient Descent

[Plot: gradient descent with step size 0.2]

Step size matters!
Line Search
• Instead of picking a fixed step size that may or may not actually result in a decrease in the function value, we can consider minimizing the function along the direction specified by the gradient to guarantee that the next iterate decreases the function value
• In other words, choose
  η(k) = argmin_{η ≥ 0} f(x(k) − η ∇f(x(k)))
• This is called exact line search
• This optimization problem can be expensive to solve exactly
• However, if f is convex, this is a univariate convex optimization problem
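As a sketch, the inner one-dimensional problem can be handed off to a generic univariate minimizer; here scipy.optimize.minimize_scalar is an illustrative choice, and the upper bound of 10 on η is an arbitrary assumption:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def exact_line_search_step(f, grad_f, x):
    """One gradient step whose length (approximately) minimizes f along -grad."""
    g = grad_f(x)
    # Univariate problem: phi(eta) = f(x - eta * g), minimized over eta >= 0.
    res = minimize_scalar(lambda eta: f(x - eta * g),
                          bounds=(0.0, 10.0), method="bounded")
    return x - res.x * g
```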

Backtracking Line Search
• Instead of exact line search, we could use a strategy that finds some step size that decreases the function value (one must exist)
• Backtracking line search: start with a large step size η and keep shrinking it until f(x − η∇f(x)) < f(x)
• This always guarantees a decrease, but it may not decrease as much as exact line search
• Still, this is typically much faster in practice as it only requires a few function evaluations
Backtracking Line Search
• To implement backtracking line search, choose two parameters α ∈ (0, 0.5), β ∈ (0, 1)
• Set η = 1
• While f(x(k) − η∇f(x(k))) > f(x(k)) − α·η·‖∇f(x(k))‖²:
    set η ← β·η
• Set x(k+1) = x(k) − η∇f(x(k))

Iterations continue until a step size is found that decreases the function "enough"
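A direct Python translation of this loop, using the conventional Armijo-style sufficient-decrease condition (α = 0.3 and β = 0.8 are illustrative choices):

```python
import numpy as np

def backtracking_step(f, grad_f, x, alpha=0.3, beta=0.8):
    """Shrink eta until the sufficient-decrease condition holds, then step."""
    g = grad_f(x)
    eta = 1.0
    while f(x - eta * g) > f(x) - alpha * eta * np.dot(g, g):
        eta *= beta        # keep shrinking the step size
    return x - eta * g
```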

Backtracking Line Search

[Plot: backtracking line search with β = 0.99]

[Plot: backtracking line search with β = 0.3]
Stochastic Gradient Descent
• Consider minimizing an average of functions:
  min_x f(x) = min_x (1/n) Σᵢ₌₁ⁿ fᵢ(x)
• This setting is common in machine learning, where this average of functions is a loss function; each fᵢ is the loss term associated with an individual sample point i
• The full gradient descent step is given by
  x(k+1) = x(k) − η_k · (1/n) Σᵢ₌₁ⁿ ∇fᵢ(x(k))
Stochastic Gradient Descent
• If the dataset is highly redundant, the idea is to use just a subset of the samples, i.e. of the fᵢ's, to approximate the full gradient.
• This is called stochastic gradient descent, or SGD for short. More formally, stochastic gradient descent repeats
  x(k+1) = x(k) − η_k ∇f_{i_k}(x(k))
  where i_k is a randomly chosen index at iteration k. Because E[∇f_{i_k}(x)] = ∇f(x), the estimate is unbiased.
• The extreme version of this approach updates the weights after each case; it is called "online" learning.
• The indices are usually chosen without replacement until we complete one full cycle through the entire data set.
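A sketch of one SGD epoch in Python, sampling indices without replacement as described; the squared-error loss on a linear model is an illustrative assumption:

```python
import numpy as np

def sgd_epoch(X, y, w, eta):
    """One pass over the data: visit samples in a random order, one at a time."""
    for i in np.random.permutation(len(y)):   # indices without replacement
        grad_i = (X[i] @ w - y[i]) * X[i]     # gradient of 0.5*(x_i.w - y_i)^2
        w = w - eta * grad_i                  # x(k+1) = x(k) - eta * grad f_i
    return w
```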

Mini-batching
• A common technique employed with SGD is mini-batching, where we choose a random subset I_k ⊆ {1, …, n} of size |I_k| = b. We then repeat
  x(k+1) = x(k) − η_k · (1/b) Σ_{i ∈ I_k} ∇fᵢ(x(k))
• Because E[(1/b) Σ_{i ∈ I_k} ∇fᵢ(x)] = ∇f(x), the estimate is an unbiased estimate of the full gradient.
• Mini-batches need to be balanced for classes
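A minimal mini-batch variant of the same loop (the batch size b = 32 is an arbitrary choice):

```python
import numpy as np

def minibatch_sgd_epoch(X, y, w, eta, b=32):
    """One pass over the data in random mini-batches of size b."""
    idx = np.random.permutation(len(y))
    for start in range(0, len(y), b):
        batch = idx[start:start + b]
        residual = X[batch] @ w - y[batch]
        grad = X[batch].T @ residual / len(batch)   # average gradient over batch
        w = w - eta * grad
    return w
```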

Complexity
• For a problem with n data points, mini-batch size b, and feature dimension d, the per-iteration costs of computing the gradient estimate are:
  • full gradient: O(nd)
  • mini-batch: O(bd)
  • standard SGD: O(d)
Some observations about gradient descent
• It takes a lot of time to navigate regions having a gentle slope
• This is because the gradient in these regions is very small
• Can we do something better?
• Yes, let's take a look at 'Momentum based gradient descent'
Momentum based Gradient Descent
• Intuition
  • If I am repeatedly being asked to move in the same direction, then I should probably gain some confidence and start taking bigger steps in that direction
  • Just as a ball gains momentum while rolling down a slope
Motivation of momentum

• Let f be a simple quadratic function over ℝ²:
  f(x₁, x₂) = (1/2)(a·x₁² + b·x₂²)
  for parameters a, b > 0. The unique minimizer of f is (0, 0).
• The gradient of f is
  ∇f(x₁, x₂) = (a·x₁, b·x₂)
• If we run gradient descent with a constant step size η, the relation between the iterates x(k+1) and x(k) is
  x₁(k+1) = (1 − η·a)·x₁(k),   x₂(k+1) = (1 − η·b)·x₂(k)
• Since we want the iterates to go as fast as possible to zero, we would like to choose η such that
  |1 − η·a| < 1   and   |1 − η·b| < 1
• If a ≈ b, this is fine: we can easily set η ≈ 1/a ≈ 1/b
• But if a is much smaller than b and we set η = 1/a, then |1 − η·b| > 1 and the second coordinate of the iterates, x₂(k), diverges (this happens whenever η > 2/b). A similar observation holds with the roles of the coordinates exchanged.
• In this situation, gradient descent is slow: any stable step size must satisfy η < 2/b, and then 1 − η·a ≈ 1, so the first coordinate barely moves per iteration.

[Plot: iterates up to iteration 15 for a ≪ b; the convergence is too slow]
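A small experiment makes the trade-off concrete; the values a = 0.1 and b = 10 are illustrative assumptions:

```python
import numpy as np

# Ill-conditioned quadratic f(x) = 0.5*(a*x1^2 + b*x2^2); values are illustrative.
a, b = 0.1, 10.0

def run_gd(eta, iters=15):
    x = np.array([1.0, 1.0])
    for _ in range(iters):
        x = x - eta * np.array([a * x[0], b * x[1]])   # gradient is (a*x1, b*x2)
    return x

print(run_gd(eta=1.9 / b))   # stable, but 1 - eta*a = 0.981: x1 only shrinks to ~0.75
print(run_gd(eta=1.0 / a))   # eta*b = 100 > 2: x2 blows up
```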

Momentum based Gradient Descent
• A possible remedy to this slow convergence is to use the information given by the past gradients when we define x(k+1) from x(k):
  • instead of moving in the direction given by −∇f(x(k)), we move in a direction which is a (weighted) average between −∇f(x(k)) and the previous gradients −∇f(x(0)), …, −∇f(x(k−1))
• Concretely, this yields the following iteration formula:
  v(k+1) = γ·v(k) + η·∇f(x(k)),   x(k+1) = x(k) − v(k+1)
  where γ ∈ [0, 1) weights the past gradients.
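A sketch of this update in Python (γ = 0.9 and the other constants are conventional choices, not values from the slides):

```python
import numpy as np

def momentum_gd(grad_f, x0, eta=0.01, gamma=0.9, iters=100):
    """Heavy-ball style momentum: accumulate a velocity from past gradients."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(iters):
        v = gamma * v + eta * grad_f(x)   # v(k+1) = gamma*v(k) + eta*grad f(x(k))
        x = x - v                         # x(k+1) = x(k) - v(k+1)
    return x
```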

Some observations and questions
• Even in regions having gentle slopes, momentum-based gradient descent can take large steps because the momentum carries it along
• Is moving fast always good? Would there be a situation where momentum would cause us to run past our goal?
  • Momentum-based gradient descent oscillates in and out of the minima valley as the momentum carries it out of the valley
  • Despite oscillations, it still converges faster than vanilla gradient descent
Adaptive Sub-Gradient Method
• Adaptive methods adjust the learning rate for each parameter individually based on the history of the gradients.
• The idea is to give frequently occurring features a smaller learning rate and infrequent features a larger learning rate.
• Variants (sketched in code below):
  • AdaGrad: Accumulates the square of the gradients over time and uses this information to scale the learning rate for each parameter. It works well for sparse data but tends to make the learning rate too small over time.
  • RMSProp: Modifies AdaGrad by using an exponentially decaying average of squared gradients instead of a cumulative sum, which helps to prevent the learning rate from decaying too much.
  • Adam: Combines the ideas of momentum and RMSProp. It uses running averages of both the gradients and their squares, making it one of the most widely used optimizers.
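As a sketch, the RMSProp and Adam updates differ from plain SGD in only a few lines; the decay rates and ε below are the conventional defaults, assumed here rather than taken from the slides (AdaGrad's accumulator is spelled out on the following slides):

```python
import numpy as np

def rmsprop_update(w, g, r, eta=0.001, rho=0.9, eps=1e-8):
    """RMSProp: exponentially decaying average of squared gradients."""
    r = rho * r + (1 - rho) * g**2
    return w - eta * g / (np.sqrt(r) + eps), r

def adam_update(w, g, m, v, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: running averages of both the gradients and their squares."""
    m = b1 * m + (1 - b1) * g              # first moment (momentum)
    v = b2 * v + (1 - b2) * g**2           # second moment (RMSProp-style)
    m_hat = m / (1 - b1**t)                # bias correction; t starts at 1
    v_hat = v / (1 - b2**t)
    return w - eta * m_hat / (np.sqrt(v_hat) + eps), m, v
```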

Adagrad
• Divide the learning rate by the "average" gradient
• The high-dimensional, non-convex nature of neural network optimization can lead to different sensitivity along each dimension: the learning rate could be too small in some dimensions and too large in others.

  w(t+1) ← w(t) − (η / σ(t)) · g(t)

  σ(t): "average" gradient of parameter w, estimated while updating the parameters

If w has a small average gradient → larger learning rate
If w has a large average gradient → smaller learning rate


Adagrad

w(1) ← w(0) − (η / σ(0)) · g(0)      σ(0) = √((g(0))²)

w(2) ← w(1) − (η / σ(1)) · g(1)      σ(1) = √((1/2)[(g(0))² + (g(1))²])

w(3) ← w(2) − (η / σ(2)) · g(2)      σ(2) = √((1/3)[(g(0))² + (g(1))² + (g(2))²])

……

w(t+1) ← w(t) − (η / σ(t)) · g(t)    σ(t) = √((1/(t+1)) Σᵢ₌₀ᵗ (g(i))²)
Adagrad
• Divide the learning rate by the "average" gradient
• The "average" gradient is obtained while updating the parameters

  w(t+1) ← w(t) − (η(t) / σ(t)) · g(t)      σ(t) = √((1/(t+1)) Σᵢ₌₀ᵗ (g(i))²)

• With a 1/t-decayed learning rate η(t) = η / √(t+1), the two factors combine into

  w(t+1) ← w(t) − (η / √(Σᵢ₌₀ᵗ (g(i))²)) · g(t)
Adagrad
Original gradient descent:
  θ(t) ← θ(t−1) − η ∇C(θ(t−1))

Each parameter w is considered separately:
  w(t+1) ← w(t) − η_w · g(t)      g(t) = ∂C(θ(t)) / ∂w

Parameter-dependent learning rate:
  η_w = η / √(Σᵢ₌₀ᵗ (g(i))²)      (summation of the squares of the previous derivatives)
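A compact Adagrad loop matching the final formula (the small constant ε is a standard numerical-stability assumption not shown on the slide):

```python
import numpy as np

def adagrad(grad_f, w0, eta=0.1, iters=100, eps=1e-8):
    """Adagrad: per-parameter step eta / sqrt(sum of past squared gradients)."""
    w = np.asarray(w0, dtype=float)
    r = np.zeros_like(w)                      # running sum of (g(i))^2, per parameter
    for _ in range(iters):
        g = grad_f(w)
        r += g**2
        w = w - eta * g / (np.sqrt(r) + eps)  # w(t+1) = w(t) - eta/sqrt(sum g^2) * g(t)
    return w
```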


Acknowledgement
• http://wavelab.uwaterloo.ca/wp-content/uploads/2017/04/Lecture_3.pdf
• https://heartbeat.fritz.ai/deep-learning-best-practices-regularization-techniques-for-better-performance-of-neural-network-94f978a4e518
• https://cedar.buffalo.edu/~srihari/CSE676/7.12%20Dropout.pdf
• http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2017/Lecture/DNN%20tip.pptx
• Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Sergey Ioffe and Christian Szegedy
• http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/ForDeep.pptx
• Deep Learning Tutorial. Prof. Hung-yi Lee, NTU.
