
Deep Learning

BITS Pilani
Pilani Campus
Deep Neural Network

Disclaimer and Acknowledgement

• The content for these slides has been obtained from books and various other sources on the Internet
• I hereby acknowledge all the contributors for their material and inputs.
• I have provided source information wherever necessary
• I have added and modified the content to suit the requirements of the course



Session Agenda

• Optimization Algorithms
• Gradient Descent
• Stochastic Gradient Descent
• Mini batch Gradient Descent
• Gradient Descent with momentum
• Adagrad
• RMSProp
• Adam



Optimization Algorithm

• Optimization algorithms are used to train deep learning models.
• They are the tools that allow us to keep updating the model parameters so as to minimize the value of the loss function, as evaluated on the training set.
• In optimization, a loss function is often referred to as the objective function of the optimization problem.
• By tradition and convention, most optimization algorithms are concerned with minimization.
• The goal of optimization is to reduce the training error; the goal of deep learning is to reduce the generalization error, which also requires reducing overfitting.
Why Optimization Algorithm?

• Understanding the principles of different optimization algorithms and the role of their hyperparameters will enable us to tune the hyperparameters in a targeted manner to improve the performance of deep learning models.
Optimization
Finding Minimum of a Function
Derivatives and critical points

First-derivative test:
• f′(x) = 0 at x, f′(x) > 0 just before x, f′(x) < 0 just after x: x is a maximum
• f′(x) = 0 at x, f′(x) < 0 just before x, f′(x) > 0 just after x: x is a minimum
• f′(x) = 0 at x, and f′(x) does not change sign just before and just after x: x may be a saddle point

Second-derivative test:
• f′(x) = 0 and f″(x) < 0: x is a maximum
• f′(x) = 0 and f″(x) > 0: x is a minimum
• f′(x) = 0 and f″(x) = 0: x may be a saddle point
Reminder: The error surface for a linear neuron

• The error surface lies in a space with one horizontal axis for each weight (w1, w2, ...) and one vertical axis for the error E.
• For a linear neuron with a squared error, it is a quadratic bowl.
• For multi-layer, non-linear nets the error surface is much more complicated.
• But locally, a piece of a quadratic bowl is usually a very good approximation.
• So gradient descent on the local quadratic approximation usually works well.
Gradient Descent

One-Dimensional Gradient Descent


• In one-dimensional gradient descent, the algorithm
adjusts a single parameter to minimize a function by
following the negative direction of the gradient.
Multivariate Gradient Descent
• For multivariate functions, gradient descent adjusts
multiple parameters simultaneously by iteratively
updating them in the direction of the negative gradient.



Gradient descent
• The objective function is usually the average of the loss functions over the examples in the training dataset.
  n = number of examples in the training dataset
  f_i(x) = loss function with respect to the training example of index i
  x = parameter vector
• Objective function: f(x) = (1/n) Σ_{i=1..n} f_i(x)
• Gradient of the objective function at x: ∇f(x) = (1/n) Σ_{i=1..n} ∇f_i(x)
• Update rule: x ← x − η ∇f(x)
• In vanilla GD, the computational cost of each iteration is O(n), which grows linearly with n.
• The larger the training dataset, the higher the cost of each GD iteration (see the sketch below).
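Below is a minimal NumPy sketch of vanilla (full-batch) gradient descent on a toy least-squares objective. The synthetic data, the number of steps and the learning rate η = 0.1 are illustrative assumptions, not values from the slides; the point is that every iteration touches all n examples, i.e. the O(n) cost noted above.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))              # n = 100 examples, 3 parameters
b = A @ np.array([1.0, -2.0, 0.5])         # targets from a known parameter vector

def grad(x):
    # Gradient of the average squared error f(x) = (1/n) * sum_i (a_i . x - b_i)^2
    return 2.0 / len(b) * A.T @ (A @ x - b)

eta = 0.1                                  # learning rate (illustrative)
x = np.zeros(3)
for t in range(200):                       # each iteration uses all n examples: O(n) cost
    x = x - eta * grad(x)

print(x)                                   # approaches [1.0, -2.0, 0.5]
```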
Gradient descent

• Loss functions of most deep learning models are non-convex.
• Non-convex functions have multiple minima and are usually harder to optimize than convex functions.
Gradient descent
Learning rate
• The role of the learning rate is to moderate the degree to which weights are changed at each step.
• The learning rate η is set by the algorithm designer.
• Slow learning: if we pick η too small, we make little progress.
• Oscillations: if we pick η too large, the solution oscillates and in the worst case it might diverge.
• Optimal η: GD converges; a suitable η is often found only after multiple experiments.
Learning rate
Local minima vs global minimum for
DNN
• Tweak the learning rate:
• Gradually reduce the learning rate, then increase it and slowly reduce it again, repeating this several times.
• Increasing the learning rate reduces the stability of the algorithm, but gives it the ability to jump out of a local optimum.
• Another takeaway:
• Finding the global minimum is probably not the best outcome, as it would likely represent extreme overfitting on the training set.
• Empirical evidence shows that generalization performance is similar for local-minimum and global-minimum solutions.
Stochastic gradient descent

• SGD reduces the computational cost of each iteration.
• At each iteration of stochastic gradient descent, we uniformly sample an index i ∈ {1, ..., n} of a training example at random.
• We compute the gradient of that example's loss to update x: x ← x − η ∇f_i(x)
• The computational cost of each iteration drops from the O(n) of GD to a constant O(1), hence SGD is computationally efficient.
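A minimal sketch of SGD on the same kind of toy least-squares problem, again with an illustrative dataset and learning rate: each update samples one index uniformly and uses only that example's gradient, so the per-iteration cost is O(1).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))
b = A @ np.array([1.0, -2.0, 0.5])

eta = 0.05                                       # learning rate (illustrative)
x = np.zeros(3)
for t in range(5000):
    i = rng.integers(len(b))                     # uniformly sample an index i
    g = 2.0 * (A[i] @ x - b[i]) * A[i]           # gradient of the i-th example's loss only
    x = x - eta * g

print(x)                                         # close to [1.0, -2.0, 0.5]
```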
Stochastic gradient descent
• With a fixed learning rate, SGD keeps fluctuating and may fail to settle near the minimum.
• This leaves us with the only alternative: change the learning rate η.
• However, if we pick it too small, we will not make any meaningful progress initially.
• On the other hand, if we pick it too large, we will not get a good solution.
• The only way to resolve these conflicting goals is to reduce the learning rate dynamically as optimization progresses.
Dynamic learning rate

• Replace η with a time-dependent learning rate η(t).
• This adds to the complexity of controlling the convergence of an optimization algorithm.

1. Piecewise constant: η(t) = η_i for t_i ≤ t < t_{i+1}
   • Decrease the learning rate, e.g., whenever progress in optimization stalls.
   • This is a common strategy for training deep networks.
2. Exponential decay: η(t) = η_0 · e^(−λt)
   • Can lead to premature stopping before the algorithm has converged.
3. Polynomial decay: η(t) = η_0 · (βt + 1)^(−α), here with α = 0.5
Dynamic learning rate

Exponential decay
• The variance in the parameters is significantly reduced, but the algorithm fails to converge to the optimal solution.

Polynomial decay
• Convergence gets better after only 50 steps.
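The sketch below illustrates the three schedules; the constants (η₀, λ, α, and the boundaries and factor of the piecewise-constant schedule) are illustrative assumptions rather than values from the slides.

```python
import math

def piecewise_constant(t, eta0=0.5, boundaries=(100, 200), factor=0.1):
    # Drop the rate by a constant factor at fixed boundaries (e.g. when progress stalls).
    eta = eta0
    for boundary in boundaries:
        if t >= boundary:
            eta *= factor
    return eta

def exponential_decay(t, eta0=0.5, lam=0.01):
    # eta(t) = eta0 * exp(-lambda * t): decays quickly, may stall before convergence.
    return eta0 * math.exp(-lam * t)

def polynomial_decay(t, eta0=0.5, beta=1.0, alpha=0.5):
    # eta(t) = eta0 * (beta * t + 1)^(-alpha), with alpha = 0.5 as on the slide.
    return eta0 * (beta * t + 1) ** (-alpha)

for t in (0, 50, 200):
    print(t, piecewise_constant(t), exponential_decay(t), polynomial_decay(t))
```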
Mini batch GD

• GD is not particularly data efficient.
• SGD is not particularly computationally efficient, since CPUs and GPUs cannot exploit the full power of vectorization.
• Mini batch GD: read mini batches of data rather than single observations to update parameters.
• Weight update: w ← w − (η / |B_t|) Σ_{i ∈ B_t} ∇f_i(w), where B_t is the mini batch sampled at step t.
• Pick a mini batch size (a hyperparameter) that is large enough to offer good computational efficiency while still fitting into the memory of a GPU.
• For small training sets (e.g., fewer than 2000 examples), use GD.
• For larger sets, sizes between 64 and 512 (preferably powers of 2) are typical.
Steps for mini batch GD

1. Split the data into mini-batches, e.g., X(1) through X(1000), X(1001) through X(2000), and so on.
2. For each mini-batch, perform forward propagation using only the data in that mini-batch.
3. Compute the cost function for that mini-batch.
4. Implement backpropagation to compute gradients.
5. Update the weights and biases using the gradients (see the sketch below).

Notes:
• This enables progress in gradient descent even when only partially through the training set.
• With GD, the cost should decrease every iteration.
• With mini-batch GD, the cost might not decrease every iteration, because each iteration uses a different training batch.
• The cost function J should generally trend downwards but may oscillate due to the varying difficulty of mini-batches.
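A minimal sketch of one epoch of mini-batch gradient descent following the steps above, on a toy least-squares problem. The batch size of 64, the learning rate and the synthetic data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 3))
y = X @ np.array([1.0, -2.0, 0.5])

eta, batch_size = 0.1, 64
w = np.zeros(3)
costs = []

perm = rng.permutation(len(y))                       # step 1: shuffle, then split into mini-batches
for start in range(0, len(y), batch_size):
    idx = perm[start:start + batch_size]
    Xb, yb = X[idx], y[idx]
    err = Xb @ w - yb                                # step 2: forward pass on this mini-batch only
    costs.append(np.mean(err ** 2))                  # step 3: cost for this mini-batch (may oscillate)
    g = 2.0 / len(idx) * Xb.T @ err                  # step 4: gradient from this mini-batch
    w = w - eta * g                                  # step 5: parameter update

print(w, costs[-1])
```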
GD vs SGD vs Mini batch GD

Stochastic gradient descent: noisy, oscillates around the minimum but may not converge to it.

Gradient descent: takes consistent steps toward the minimum.

Mini batch gradient descent: less noisy than SGD; it heads toward the minimum more consistently, though it may still oscillate in a small region around it.
GD vs SGD vs Mini batch GD

GD is computationally heavy; it converges to a "flat minimum" and performs well on the test data.
Drawback of gradient based methods
• The most critical challenge in optimizing deep networks is finding the correct trajectory to move along.
• The gradient isn't usually a very good indicator of that trajectory.
• When the contours are perfectly circular, the gradient always points in the direction of the local minimum.
• However, if the contours are extremely elliptical (as is usually the case for the error surfaces of deep networks), the gradient can point as much as 90 degrees away from the correct direction!
Error surface of DNN

• A critical point can only be a local minimum if it appears as a local minimum in every single one of the d one-dimensional subspaces.
• As the dimensionality of our parameter space increases, local minima become exponentially rarer.
• As our function gains more and more dimensions (i.e., our model has more and more parameters), saddle points become exponentially more likely than local minima.
Exponentially Weighted Averages
(EWA)
• A technique used to calculate a moving average of a time series, where more recent data points are given higher weights and older data points are given lower weights.
• The weights decline exponentially as the data points get older, hence the name "exponentially weighted".
• Applied in various optimization algorithms in deep learning, such as Gradient Descent with Momentum, RMSProp, and Adam.
• The EWA is calculated using the following formula:

Vt = β ∗ Vt−1 + (1 − β) ∗ NewSample

Vt = weighted average at time t
β = coefficient of momentum, a parameter that determines the weight given to previous values (between 0 and 1)
NewSample = the new data point at time t
EWA

Vt = β ∗ Vt−1 + (1 − β) ∗ NewSample
• Roughly n ≈ 1 / (1 − β) can be interpreted as the effective number of recent observations used to compute the EWA.
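A small sketch of computing an EWA over a noisy series; the signal itself is an illustrative assumption, and β = 0.9 averages over roughly 1 / (1 − β) = 10 recent samples.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = np.sin(np.linspace(0, 6, 200)) + 0.3 * rng.normal(size=200)   # noisy series

beta = 0.9
v = 0.0
smoothed = []
for x in samples:
    v = beta * v + (1 - beta) * x     # V_t = beta * V_{t-1} + (1 - beta) * NewSample
    smoothed.append(v)

print(smoothed[-5:])                  # the EWA tracks the underlying sine with less noise
```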
Gradient Descent with Momentum
Vt = β ∗ Vt−1 + (1 − β) ∗ NewSample

• Adds a mechanism to GD for aggregating a history of past gradients in order to accelerate convergence.
• It takes knowledge from previous steps about where we should be heading, and introduces a new hyperparameter β.
• The term v_t is called momentum.
• This term accumulates the gradients from the past several steps.
• This method updates the parameters w in the direction of the weighted average of the past gradients:

v_t ← β v_{t−1} + g_{t,t−1}
w_t ← w_{t−1} − η v_t
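A minimal sketch of these two update equations on a toy quadratic loss; the loss function, η = 0.05 and β = 0.9 are illustrative assumptions, not values from the slides.

```python
import numpy as np

def grad(w):
    # Gradient of an elongated quadratic bowl 0.5 * (10 * w0^2 + w1^2)
    return np.array([10.0 * w[0], w[1]])

eta, beta = 0.05, 0.9
w = np.array([1.0, 1.0])
v = np.zeros(2)
for t in range(300):
    g = grad(w)
    v = beta * v + g        # accumulate a leaky history of past gradients
    w = w - eta * v         # step along the accumulated direction

print(w)                    # approaches the minimum at [0, 0]
```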
Gradient Descent with Momentum
Gradient Descent with Momentum
Momentum Update
Plain gradient update vs. update with momentum:

• Momentum averages out the oscillations.
• In the vertical direction, oscillations are reduced because positive and negative values average out; momentum reduces updates for dimensions whose gradients change direction.
• In the horizontal direction, the movement remains aggressive, which speeds up the descent; momentum increases updates for dimensions whose gradients keep pointing in the same direction.
Gradient Descent with Momentum

v_t ← β v_{t−1} + g_{t,t−1}
w_t ← w_{t−1} − η v_t
Gradient Descent with Momentum

• Momentum replaces gradients with a leaky average over past gradients. This accelerates convergence significantly.
• Momentum prevents stalling of the optimization process, which is much more likely to occur for stochastic gradient descent.
• The effective number of gradients is given by 1/(1 − β), due to the exponential down-weighting of past data.
• Implementation is quite straightforward, but it requires us to store an additional state vector (the momentum v).
Gradient Descent with Momentum

• Gradient descent with momentum uses the momentum of the gradient for parameter optimization.
• Movement = Negative of Gradient + Momentum.
• Even where the gradient of the cost ℒ(θ) is 0 (e.g., at a plateau or local minimum), the accumulated momentum can keep the parameters moving.

Slide credit: Hung-yi Lee, Deep Learning Tutorial
Solved problem: SGD and SGD + momentum

Consider the following loss function, with initial value w(0) = −2.8, learning rate η = 0.05 and β = 0.7. Use SGD and SGD + momentum to find the updated value w(1) after the first iteration.

L(w) = 0.3w⁴ − 0.1w³ − 2w² − 0.8w

Answer

SGD:
Iteration | g_i      | w_i
1         | −18.2943 | −1.88527
2         | −2.36614 | −1.76697

SGD + momentum:
Iteration | g_i      | v_i     | w_i
1         | −18.294  | 0       | −1.88527
2         | −2.3661  | −15.172 | −1.126
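A short script to check these numbers; the loss function, the initial value, η and β are taken from the problem statement above, and the momentum update follows v_t = β v_{t−1} + g_t, w_t = w_{t−1} − η v_t.

```python
def grad(w):
    # dL/dw = 1.2*w^3 - 0.3*w^2 - 4*w - 0.8
    return 1.2 * w**3 - 0.3 * w**2 - 4.0 * w - 0.8

eta, beta = 0.05, 0.7

# Plain SGD
w = -2.8
for _ in range(2):
    w = w - eta * grad(w)
    print("SGD w:", round(w, 4))          # -1.8853, then -1.767

# SGD with momentum
w, v = -2.8, 0.0
for _ in range(2):
    v = beta * v + grad(w)
    w = w - eta * v
    print("momentum w:", round(w, 4))     # -1.8853, then about -1.1267
```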


Adagrad - motivation
• In GD, the same learning rate is used for all parameters and all iterations.
• What happens to features that occur infrequently (sparse features)? E.g., w is associated with a sparse feature, while b is not (as x0 is always 1).
• Parameters associated with infrequent features only receive meaningful updates whenever those features occur.
Adagrad

• Decay the learning rate for each parameter in proportion to its update history.
• An individual learning rate per parameter (per feature).
• Accumulate past squared gradients in s_t (accumulate the history):

s_t = s_{t−1} + g_t²
w_t ← w_{t−1} − (η / √(s_t + ϵ)) · g_t

• ϵ is a smoothing term that avoids division by zero.
• Initialize s_0 = 0.
• One of Adagrad's main benefits is that it eliminates the need to manually tune the learning rate.
• Most implementations use a default value of 0.01 and leave it at that.
• Applications: natural language processing and image recognition (see the sketch below).
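A minimal NumPy sketch of the Adagrad update on a toy quadratic with one steep and one flat direction. The loss and ϵ are illustrative assumptions; η = 0.01 follows the common default mentioned above.

```python
import numpy as np

def grad(w):
    # Gradient of 0.5 * (10 * w0^2 + 0.1 * w1^2): a steep and a flat coordinate
    return np.array([10.0 * w[0], 0.1 * w[1]])

eta, eps = 0.01, 1e-8
w = np.array([1.0, 1.0])
s = np.zeros(2)
for t in range(1000):
    g = grad(w)
    s = s + g ** 2                        # accumulate the full history of squared gradients
    w = w - eta * g / np.sqrt(s + eps)    # per-coordinate (per-parameter) learning rate

print(w, s)                               # the steep coordinate accumulates a much larger s
```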
Adagrad

• It adapts the learning rate to the parameters:
• smaller updates (i.e., lower learning rates) for parameters associated with frequently occurring features;
• larger updates (i.e., higher learning rates) for parameters associated with infrequent features.
Drawback of AdaGrad

• AdaGrad decays the learning rate very aggressively (as the denominator keeps growing).
• As a result, after a while, the frequent parameters start receiving very small updates because of the decayed learning rate.
• The algorithm may therefore experience sluggish convergence or even premature stalling.
• To avoid this, why not decay the denominator and prevent its rapid growth?
RMSProp
• The issue:
• Adagrad accumulates the squares of the gradients g_t into a state vector s_t = s_{t−1} + g_t².
• As a result, s_t keeps growing without bound due to the lack of normalization.
• Fix: use a leaky average, in the same way we used it in the momentum method:

s_t = γ s_{t−1} + (1 − γ) g_t²
w_t ← w_{t−1} − (η / √(s_t + ϵ)) · g_t

• Parameter γ > 0 (in practice between 0 and 1); the constant ϵ > 0 is typically set to 10⁻⁶.
• Faster convergence compared to AdaGrad.
• Works well on big and redundant datasets.
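A minimal sketch of the RMSProp update on the same kind of toy loss; η, γ and the loss are illustrative assumptions, with ϵ = 10⁻⁶ as on the slide.

```python
import numpy as np

def grad(w):
    # Same elongated quadratic as before: one steep, one flat direction
    return np.array([10.0 * w[0], 0.1 * w[1]])

eta, gamma, eps = 0.01, 0.9, 1e-6
w = np.array([1.0, 1.0])
s = np.zeros(2)
for t in range(500):
    g = grad(w)
    s = gamma * s + (1 - gamma) * g ** 2      # leaky average instead of an unbounded running sum
    w = w - eta * g / np.sqrt(s + eps)

print(w)
```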
Review of techniques learned so far

1. Stochastic gradient descent
○ more computationally efficient than gradient descent when solving optimization problems
○ Mini batch stochastic gradient descent affords significant additional efficiency arising from vectorization, using larger sets of observations in one mini batch.
○ This is the key to efficient multi-machine, multi-GPU and overall parallel processing.
2. Momentum
○ added a mechanism for aggregating a history of past gradients to accelerate convergence.
3. Adagrad
○ used per-coordinate scaling to allow for a computationally efficient preconditioner.
4. RMSProp
○ leaky average + dynamic learning rate
Adam-Adaptive Moment Estimation

• Adam combines all of these techniques into one efficient learning algorithm.
• It computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients.
• It utilizes the momentum concept from "SGD with momentum" and the adaptive learning rate from "RMSProp":
• V for momentum
• S for RMSProp
• Adam can diverge due to poor variance control (a disadvantage).
Adam Algorithm

• First moment (the mean) of the gradients: v_t ← β₁ v_{t−1} + (1 − β₁) g_t
• Second moment (uncentered variance) of the gradients: s_t ← β₂ s_{t−1} + (1 − β₂) g_t²
• β₁ and β₂ are nonnegative weighting parameters. Common choices are β₁ = 0.9 and β₂ = 0.999.
• Initialize v₀ = s₀ = 0.
• Normalize the state variables (bias correction): v̂_t = v_t / (1 − β₁ᵗ), ŝ_t = s_t / (1 − β₂ᵗ). Bias correction helps Adam slightly outperform RMSProp.
• Rescale the gradient: g′_t = η v̂_t / (√ŝ_t + ϵ)
• Compute the update: w_t ← w_{t−1} − g′_t
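A minimal NumPy sketch of the Adam update as outlined above, using β₁ = 0.9 and β₂ = 0.999 from the slide; the learning rate, ϵ and the toy loss are illustrative assumptions.

```python
import numpy as np

def grad(w):
    # Same toy quadratic: one steep, one flat direction
    return np.array([10.0 * w[0], 0.1 * w[1]])

eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
w = np.array([1.0, 1.0])
v = np.zeros(2)                               # first moment (mean of gradients)
s = np.zeros(2)                               # second moment (uncentered variance of gradients)
for t in range(1, 1001):
    g = grad(w)
    v = beta1 * v + (1 - beta1) * g
    s = beta2 * s + (1 - beta2) * g ** 2
    v_hat = v / (1 - beta1 ** t)              # bias correction for the zero initialization
    s_hat = s / (1 - beta2 ** t)
    w = w - eta * v_hat / (np.sqrt(s_hat) + eps)

print(w)
```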
Optimization algorithm comparison

Gradient Descent: w_t ← w_{t−1} − η g_t

GD with Momentum (replaces gradients with a leaky average over past gradients):
v_t ← β v_{t−1} + g_{t,t−1}
w_t ← w_{t−1} − η v_t

AdaGrad (adaptive learning rate): an individual learning rate per parameter

RMSProp: exponentially decaying average of past squared gradients

Adam: exponentially decaying average of past gradients + exponentially decaying average of past squared gradients
ADAM VS SGD

• Adaptive methods (such as Adam) do not generalize as well as SGD with momentum when tested on a diverse set of deep learning tasks.
• In the research paper by Nitish Shirish Keskar and Richard Socher, "Improving Generalization Performance by Switching from Adam to SGD":
In the earlier stages of training Adam still outperforms SGD, but later the learning saturates. They propose a simple strategy in which training of the deep neural network starts with Adam and then switches to SGD when certain criteria are met.
Optimization algorithm comparison
References

Chapter 12, Dive into Deep Learning, https://d2l.ai/
Thank You All !

