4 - Gradient Descent and Stochastic GD
The content of these slides has been gathered from various online sources. We extend our sincere gratitude to everyone who has contributed their work.
Gradient Descent
• Method to find local optima of a differentiable function
• Intuition: the gradient gives the direction of greatest increase, so the negative gradient gives the direction of greatest decrease
• Take steps in directions that reduce the function value
Gradient Descent
𝑓(𝑥) = 𝑥², step size: 0.8
𝑥(0) = −4
𝑥(1) = −4 − 0.8 ⋅ 2 ⋅ (−4) = 2.4
𝑥(2) = 2.4 − 0.8 ⋅ 2 ⋅ 2.4 = −1.44
𝑥(3) = 0.864
𝑥(4) = −0.5184
𝑥(5) = 0.31104
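As an illustration (not from the original slides), here is a minimal Python sketch that reproduces the iterates above; the function and helper names are my own:

def gradient_descent(x0, step_size, grad, num_steps):
    # Repeatedly step against the gradient: x <- x - step_size * grad(x)
    x = x0
    history = [x]
    for _ in range(num_steps):
        x = x - step_size * grad(x)
        history.append(x)
    return history

# f(x) = x^2, so f'(x) = 2x; starting point x(0) = -4, step size 0.8
iterates = gradient_descent(x0=-4.0, step_size=0.8, grad=lambda x: 2 * x, num_steps=5)
print(iterates)  # ≈ [-4.0, 2.4, -1.44, 0.864, -0.5184, 0.31104] (up to float rounding)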
Gradient Descent
Step size: 0.2
[figures omitted: gradient descent iterations with step size 0.2]
Line Search
• Instead of picking a fixed step size that may or may not actually result in a decrease of the function value, we can minimize the function along the direction given by the negative gradient, guaranteeing that the next iterate decreases the function value
• In other words, choose
𝜂(𝑡) = argmin_{𝜂 ≥ 0} 𝑓( 𝑥(𝑡) − 𝜂 ∇𝑓(𝑥(𝑡)) )
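As an illustration (not from the slides), exact line search can be approximated numerically; here is a minimal Python sketch using ternary search over the step size, where the bracket eta_max and the tolerance are assumptions:

def exact_line_search(f, x, g, eta_max=10.0, tol=1e-8):
    # Minimize f(x - eta * g) over eta in [0, eta_max] by ternary search,
    # assuming the 1-D function is unimodal on that interval.
    lo, hi = 0.0, eta_max
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(x - m1 * g) < f(x - m2 * g):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

# For f(x) = x^2 from x(0) = -4, the exact minimizer along -f'(x) is eta = 0.5:
f = lambda x: x * x
x = -4.0
eta = exact_line_search(f, x, 2 * x)
print(eta, x - eta * 2 * x)  # ≈ 0.5, ≈ 0.0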
Backtracking Line Search
• Instead of exact line search, we could use a strategy that finds some step size that decreases the function value (one must exist, since the negative gradient is a descent direction)
Backtracking Line Search
• To implement backtracking line search, choose two parameters 𝛼 ∈ (0, 0.5) and 𝛽 ∈ (0, 1)
• Set 𝜂 = 1
• While 𝑓( 𝑥(𝑡) − 𝜂 ∇𝑓(𝑥(𝑡)) ) > 𝑓(𝑥(𝑡)) − 𝛼 𝜂 ‖∇𝑓(𝑥(𝑡))‖², shrink the step size: 𝜂 ← 𝛽 𝜂 (see the sketch below)
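A minimal Python sketch of this backtracking loop (my own illustration; the values of alpha, beta, and the initial step are assumptions):

import numpy as np

def backtracking(f, grad_f, x, alpha=0.3, beta=0.5, eta0=1.0):
    # Shrink eta by beta until the sufficient-decrease (Armijo) condition holds.
    g = grad_f(x)
    eta = eta0
    while f(x - eta * g) > f(x) - alpha * eta * np.dot(g, g):
        eta *= beta  # step too large: shrink it
    return eta

# Example on f(x) = x^2 (treated as a length-1 array):
f = lambda x: float(x @ x)
grad_f = lambda x: 2 * x
x = np.array([-4.0])
eta = backtracking(f, grad_f, x)
print(eta, x - eta * grad_f(x))  # a step size that decreases f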
Backtracking Line Search
[figures omitted: backtracking line search with 𝛽 = 0.99 and with 𝛽 = 0.3]
Stochastic Gradient Descent
• Consider minimizing an average of functions:
min_𝑥 𝑓(𝑥) = (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑓_𝑖(𝑥)
Stochastic Gradient Descent
• If the dataset is highly redundant, the idea is to use just a subset of all samples, i.e., of the 𝑓_𝑖's, to approximate the full gradient.
• This is called stochastic gradient descent, or SGD for short. It is more "online" than full gradient descent.
Mini-batching
• A common technique employed with SGD is mini-batching, where at each step we choose a random subset 𝐵 ⊂ {1, …, 𝑛} of size 𝑏. We then repeat (see the sketch below)
𝑥(𝑡+1) = 𝑥(𝑡) − 𝜂 ⋅ (1/𝑏) ∑_{𝑖 ∈ 𝐵} ∇𝑓_𝑖( 𝑥(𝑡) )
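A minimal mini-batch SGD sketch (my own illustration; the least-squares objective and the synthetic data are made up). Setting b = 1 recovers standard SGD and b = n the full gradient:

import numpy as np

# Mini-batch SGD on f(x) = (1/n) * sum_i 0.5 * (a_i . x - y_i)^2
rng = np.random.default_rng(0)
n, d, b = 1000, 5, 32                      # samples, features, batch size
A = rng.normal(size=(n, d))
y = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def minibatch_grad(x, idx):
    # Average gradient of the selected f_i's
    Ai, yi = A[idx], y[idx]
    return Ai.T @ (Ai @ x - yi) / len(idx)

x = np.zeros(d)
eta = 0.1
for t in range(500):
    batch = rng.choice(n, size=b, replace=False)  # random subset B of size b
    x = x - eta * minibatch_grad(x, batch)

print(0.5 * np.mean((A @ x - y) ** 2))  # small average loss after training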
Complexity
• For a problem with n data points, mini-batch size b, and feature dimension d, we obtain the following per-iteration gradient costs:
• full gradient: O(nd)
• mini-batch: O(bd)
• standard SGD: O(d)
Some observations about gradient descent
• It takes a lot of time to navigate regions with a gentle slope
• This is because the gradient in these regions is very small
• Can we do something better?
• Yes, let's take a look at 'momentum-based gradient descent'
Momentum based Gradient Descent
• Intuition
• If I am repeatedly being asked to move in the same direction, then I should probably gain some confidence and start taking bigger steps in that direction
• Just as a ball gains momentum while rolling down a slope
Motivation of momentum
[worked example and figures omitted: they illustrate the slow convergence of plain gradient descent when the gradient is small along some directions]
Momentum based Gradient Descent
• A possible remedy to this slow convergence is to use the information given by the past gradients when we define 𝑥(𝑡+1) from 𝑥(𝑡):
• instead of moving in the direction given by −∇𝑓(𝑥(𝑡)), we move in a direction which is a (weighted) average between −∇𝑓(𝑥(𝑡)) and the previous gradients −∇𝑓(𝑥(𝑡−1)), −∇𝑓(𝑥(𝑡−2)), …
• Concretely, this yields the following iteration formula (see the sketch below), where 𝛾 ∈ [0, 1) is the momentum parameter and 𝑣(0) = 0:
𝑣(𝑡+1) = 𝛾 𝑣(𝑡) + ∇𝑓(𝑥(𝑡))
𝑥(𝑡+1) = 𝑥(𝑡) − 𝜂 𝑣(𝑡+1)
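A minimal sketch of this momentum update (my own illustration; the quadratic objective, step size, and gamma = 0.9 are assumed values):

import numpy as np

# Momentum (heavy-ball) gradient descent on f(x) = 0.5 * (x1^2 + 0.01 * x2^2),
# a quadratic with a very gentle slope along the second coordinate.
grad = lambda x: np.array([1.0, 0.01]) * x

x = np.array([1.0, 1.0])
v = np.zeros(2)
eta, gamma = 1.0, 0.9
for t in range(200):
    v = gamma * v + grad(x)  # (weighted) accumulation of past gradients
    x = x - eta * v          # move along the accumulated direction
print(x)  # both coordinates are driven close to the minimum at the origin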
Some observations and questions
• Even in regions with gentle slopes, momentum-based gradient descent can take large steps because the momentum carries it along
• Is moving fast always good? Would there be a situation where momentum would cause us to run past our goal?
• Momentum-based gradient descent oscillates in and out of the valley of the minimum as the momentum carries it out of the valley
• Despite the oscillations, it still converges faster than vanilla gradient descent
Adaptive Sub-Gradient Method
• Adaptive methods adjust the learning rate for each
parameter individually based on the history of the
gradients.
• The idea is to give frequently occurring features a smaller
learning rate and infrequent features a larger learning rate.
• Variants:
• AdaGrad: Accumulates the square of the gradients over time and
uses this information to scale the learning rate for each
parameter. It works well for sparse data but tends to make the
learning rate too small over time.
• RMSProp: Modifies AdaGrad by using an exponentially decaying
average of squared gradients instead of a cumulative sum, which
helps to prevent the learning rate from decaying too much.
• Adam: Combines the ideas of momentum and RMSProp. It uses running averages of both the gradients and their squares, making it one of the most widely used optimizers (the update rules of all three variants are sketched below).
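A compact sketch of the three update rules described above (my own illustration; the hyperparameter values are common defaults, not taken from the slides):

import numpy as np

eta, eps = 0.01, 1e-8                     # illustrative defaults
rho, beta1, beta2 = 0.9, 0.9, 0.999

def adagrad_step(w, g, G):
    G = G + g ** 2                        # cumulative sum of squared gradients
    return w - eta * g / (np.sqrt(G) + eps), G

def rmsprop_step(w, g, E):
    E = rho * E + (1 - rho) * g ** 2      # exponentially decaying average
    return w - eta * g / (np.sqrt(E) + eps), E

def adam_step(w, g, m, v, t):
    m = beta1 * m + (1 - beta1) * g       # running average of gradients (momentum)
    v = beta2 * v + (1 - beta2) * g ** 2  # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)          # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    return w - eta * m_hat / (np.sqrt(v_hat) + eps), m, v

Note how AdaGrad's accumulator G only grows, shrinking the effective learning rate over time, while RMSProp and Adam forget old gradients through the decaying averages.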
Adagrad
• Divide the learning rate by the "average" gradient
• The high-dimensional, non-convex nature of neural network optimization can lead to a different sensitivity along each dimension: the learning rate could be too small in one dimension and too large in another
𝑤(1) ← 𝑤(0) − (𝜂(0)/𝜎(0)) 𝑔(0),   𝜎(0) = 𝑔(0)
𝑤(2) ← 𝑤(1) − (𝜂(1)/𝜎(1)) 𝑔(1),   𝜎(1) = √( ½ [ (𝑔(0))² + (𝑔(1))² ] )
𝑤(3) ← 𝑤(2) − (𝜂(2)/𝜎(2)) 𝑔(2),   𝜎(2) = √( ⅓ [ (𝑔(0))² + (𝑔(1))² + (𝑔(2))² ] )
……
𝑤(𝑡+1) ← 𝑤(𝑡) − (𝜂(𝑡)/𝜎(𝑡)) 𝑔(𝑡),   𝜎(𝑡) = √( (1/(𝑡+1)) ∑_{𝑖=0}^{𝑡} (𝑔(𝑖))² )
Adagrad
• Divide the learning rate by the "average" gradient
• The "average" gradient is obtained while updating the parameters:
𝑤(𝑡+1) ← 𝑤(𝑡) − (𝜂(𝑡)/𝜎(𝑡)) 𝑔(𝑡),   𝜎(𝑡) = √( (1/(𝑡+1)) ∑_{𝑖=0}^{𝑡} (𝑔(𝑖))² )
• With a 1/t decay of the learning rate, 𝜂(𝑡) = 𝜂/√(𝑡+1), the factors of √(𝑡+1) cancel and the update simplifies to
𝑤(𝑡+1) ← 𝑤(𝑡) − ( 𝜂 / √( ∑_{𝑖=0}^{𝑡} (𝑔(𝑖))² ) ) 𝑔(𝑡)
Adagrad
• Original gradient descent: 𝜃(𝑡) ← 𝜃(𝑡−1) − 𝜂 ∇𝐶(𝜃(𝑡−1)), with a constant learning rate 𝜂
• Adagrad instead uses a parameter-dependent learning rate
𝜂_𝑤 = 𝜂 / √( ∑_{𝑖=0}^{𝑡} (𝑔(𝑖))² )
where the denominator is the summation of the squares of the previous derivatives with respect to 𝑤
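A minimal numeric sketch of this Adagrad update (my own illustration; the toy quadratic and the value of 𝜂 are assumptions):

import numpy as np

# Adagrad: w(t+1) = w(t) - eta / sqrt(sum_{i<=t} g(i)^2) * g(t), per dimension.
# Toy objective f(w) = 0.5 * (w1^2 + 100 * w2^2), so the two coordinates
# have very different gradient magnitudes.
grad = lambda w: np.array([1.0, 100.0]) * w

w = np.array([1.0, 1.0])
eta = 0.5
sum_sq = np.zeros(2)                      # running sum of squared gradients
for t in range(100):
    g = grad(w)
    sum_sq += g ** 2
    w = w - eta / np.sqrt(sum_sq) * g     # per-dimension effective step size
print(w)  # both coordinates shrink at a similar rate despite the 100x curvature gap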