
Deep Learning

BITS Pilani
Pilani Campus
Deep Neural Network

Disclaimer and Acknowledgement

• The content for these slides has been obtained from books and various other sources on the Internet
• I hereby acknowledge all the contributors for their material and inputs.
• I have provided source information wherever necessary
• I have added and modified the content to suit the requirements of the course



Session Agenda

• Optimization Algorithms
• Gradient Descent
• Stochastic Gradient Descent
• Mini batch Gradient Descent
• Gradient Descent with momentum
• Adagrad
• RMSProp
• Adam



Optimization Algorithm

• Optimization algorithms are used to train deep learning models.
• They are the tools that allow us to keep updating the model parameters so as to minimize the value of the loss function, as evaluated on the training set.
• In optimization, a loss function is often referred to as the objective function of the optimization problem.
• By tradition and convention, most optimization algorithms are concerned with minimization.
• The goal of optimization is to reduce the training error; the goal of deep learning is to reduce the generalization error, which also requires reducing overfitting.
Why Optimization Algorithm?

• Understanding the principles of different optimization algorithms and the role of their hyperparameters will enable us to tune the hyperparameters in a targeted manner to improve the performance of deep learning models.
Optimization
Finding Minimum of a Function
Derivatives and critical points

First-derivative test:
• f′(x) = 0 at x, f′(x) > 0 just before x, f′(x) < 0 just after x: x is a maximum
• f′(x) = 0 at x, f′(x) < 0 just before x, f′(x) > 0 just after x: x is a minimum
• f′(x) = 0 at x, and f′(x) does not change sign just before and just after x: x may be a saddle point

Second-derivative test:
• f′(x) = 0 and f″(x) < 0: x is a maximum
• f′(x) = 0 and f″(x) > 0: x is a minimum
• f′(x) = 0 and f″(x) = 0: x may be a saddle point
Reminder: The error surface for a linear neuron

• The error surface lies in a space with one horizontal axis for each weight (w1, w2, ...) and one vertical axis for the error E.
• For a linear neuron with a squared error, it is a quadratic bowl.
• For multi-layer, non-linear nets the error surface is much more complicated.
• But locally, a piece of a quadratic bowl is usually a very good approximation.
• So gradient descent on the local quadratic approximation usually works well.
Gradient Descent

One-Dimensional Gradient Descent


• In one-dimensional gradient descent, the algorithm
adjusts a single parameter to minimize a function by
following the negative direction of the gradient.
Multivariate Gradient Descent
• For multivariate functions, gradient descent adjusts
multiple parameters simultaneously by iteratively
updating them in the direction of the negative gradient.



Gradient descent
• The objective function is usually the average of the loss functions over the examples in the training dataset.
  n = number of examples in the training dataset
  f_i(x) = loss function with respect to the training example of index i
  x = parameter vector
• Objective function: f(x) = (1/n) Σ_{i=1..n} f_i(x)
• Gradient of the objective function at x: ∇f(x) = (1/n) Σ_{i=1..n} ∇f_i(x)
• Update rule: x ← x − η ∇f(x)
• In vanilla GD, the computational cost of each iteration is O(n), which grows linearly with n.
• The larger the training dataset, the higher the cost of each GD iteration (see the sketch below).
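Below is a minimal NumPy sketch of vanilla (full-batch) gradient descent on a toy least-squares objective. The synthetic data, the number of steps and the learning rate η = 0.1 are illustrative assumptions, not values from the slides; the point is that every iteration touches all n examples, i.e. the O(n) cost noted above.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))              # n = 100 examples, 3 parameters
b = A @ np.array([1.0, -2.0, 0.5])         # targets from a known parameter vector

def grad(x):
    # Gradient of the average squared error f(x) = (1/n) * sum_i (a_i . x - b_i)^2
    return 2.0 / len(b) * A.T @ (A @ x - b)

eta = 0.1                                  # learning rate (illustrative)
x = np.zeros(3)
for t in range(200):                       # each iteration uses all n examples: O(n) cost
    x = x - eta * grad(x)

print(x)                                   # approaches [1.0, -2.0, 0.5]
```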
Gradient descent

• Loss functions of most deep learning models are non-convex.
• Non-convex functions have multiple minima and are usually harder to optimize than convex functions.
Gradient descent
Learning rate
• The role of the learning rate is to moderate the degree to which weights are changed at each step.
• The learning rate η is set by the algorithm designer.
• Slow learning: if we pick η too small, we make little progress.
• Oscillations: if we pick η too large, the solution oscillates and in the worst case it might diverge.
• Optimal η: GD converges; a suitable η is often found only after multiple experiments.
Learning rate
Local minima vs global minimum for
DNN
• Tweak the learning rate:
• Gradually reduce the learning rate, then increase it and slowly reduce it again, repeating this several times.
• Increasing the learning rate reduces the stability of the algorithm, but gives it the ability to jump out of a local optimum.
• Another takeaway:
• Finding the global minimum is probably not the best outcome, as it would likely represent extreme overfitting on the training set.
• Empirical evidence shows that generalization performance is similar for local-minimum and global-minimum solutions.
Stochastic gradient descent

• SGD reduces the computational cost of each iteration.
• At each iteration of stochastic gradient descent, we uniformly sample an index i ∈ {1, ..., n} of a training example at random.
• We compute the gradient of that example's loss to update x: x ← x − η ∇f_i(x)
• The computational cost of each iteration drops from the O(n) of GD to a constant O(1), hence SGD is computationally efficient.
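A minimal sketch of SGD on the same kind of toy least-squares problem, again with an illustrative dataset and learning rate: each update samples one index uniformly and uses only that example's gradient, so the per-iteration cost is O(1).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))
b = A @ np.array([1.0, -2.0, 0.5])

eta = 0.05                                       # learning rate (illustrative)
x = np.zeros(3)
for t in range(5000):
    i = rng.integers(len(b))                     # uniformly sample an index i
    g = 2.0 * (A[i] @ x - b[i]) * A[i]           # gradient of the i-th example's loss only
    x = x - eta * g

print(x)                                         # close to [1.0, -2.0, 0.5]
```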
Stochastic gradient descent
• With a fixed learning rate, SGD keeps fluctuating and may fail to settle near the minimum.
• This leaves us with the only alternative: change the learning rate η.
• However, if we pick it too small, we will not make any meaningful progress initially.
• On the other hand, if we pick it too large, we will not get a good solution.
• The only way to resolve these conflicting goals is to reduce the learning rate dynamically as optimization progresses.
Dynamic learning rate

• Replace η with a time-dependent learning rate η(t).
• This adds to the complexity of controlling the convergence of an optimization algorithm.

1. Piecewise constant: η(t) = η_i for t_i ≤ t < t_{i+1}
   • Decrease the learning rate, e.g., whenever progress in optimization stalls.
   • This is a common strategy for training deep networks.
2. Exponential decay: η(t) = η_0 · e^(−λt)
   • Can lead to premature stopping before the algorithm has converged.
3. Polynomial decay: η(t) = η_0 · (βt + 1)^(−α), here with α = 0.5
Dynamic learning rate

Exponential decay
• The variance in the parameters is significantly reduced, but the algorithm fails to converge to the optimal solution.

Polynomial decay
• Convergence gets better after only 50 steps.
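The sketch below illustrates the three schedules; the constants (η₀, λ, α, and the boundaries and factor of the piecewise-constant schedule) are illustrative assumptions rather than values from the slides.

```python
import math

def piecewise_constant(t, eta0=0.5, boundaries=(100, 200), factor=0.1):
    # Drop the rate by a constant factor at fixed boundaries (e.g. when progress stalls).
    eta = eta0
    for boundary in boundaries:
        if t >= boundary:
            eta *= factor
    return eta

def exponential_decay(t, eta0=0.5, lam=0.01):
    # eta(t) = eta0 * exp(-lambda * t): decays quickly, may stall before convergence.
    return eta0 * math.exp(-lam * t)

def polynomial_decay(t, eta0=0.5, beta=1.0, alpha=0.5):
    # eta(t) = eta0 * (beta * t + 1)^(-alpha), with alpha = 0.5 as on the slide.
    return eta0 * (beta * t + 1) ** (-alpha)

for t in (0, 50, 200):
    print(t, piecewise_constant(t), exponential_decay(t), polynomial_decay(t))
```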
Mini batch GD

• GD is not particularly data efficient.
• SGD is not particularly computationally efficient, since CPUs and GPUs cannot exploit the full power of vectorization.
• Mini batch GD: read mini batches of data rather than single observations to update parameters.
• Weight update: w ← w − (η / |B_t|) Σ_{i ∈ B_t} ∇f_i(w), where B_t is the mini batch sampled at step t.
• Pick a mini batch size (a hyperparameter) that is large enough to offer good computational efficiency while still fitting into the memory of a GPU.
• For small training sets (e.g., fewer than 2000 examples), use GD.
• For larger sets, sizes between 64 and 512 (preferably powers of 2) are typical.
Steps for mini batch GD

1. Split the data into mini-batches, e.g., X(1) through X(1000), X(1001) through X(2000), and so on.
2. For each mini-batch, perform forward propagation using only the data in that mini-batch.
3. Compute the cost function for that mini-batch.
4. Implement backpropagation to compute gradients.
5. Update the weights and biases using the gradients (see the sketch below).

Notes:
• This enables progress in gradient descent even when only partially through the training set.
• With GD, the cost should decrease every iteration.
• With mini-batch GD, the cost might not decrease every iteration, because each iteration uses a different training batch.
• The cost function J should generally trend downwards but may oscillate due to the varying difficulty of mini-batches.
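A minimal sketch of one epoch of mini-batch gradient descent following the steps above, on a toy least-squares problem. The batch size of 64, the learning rate and the synthetic data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 3))
y = X @ np.array([1.0, -2.0, 0.5])

eta, batch_size = 0.1, 64
w = np.zeros(3)
costs = []

perm = rng.permutation(len(y))                       # step 1: shuffle, then split into mini-batches
for start in range(0, len(y), batch_size):
    idx = perm[start:start + batch_size]
    Xb, yb = X[idx], y[idx]
    err = Xb @ w - yb                                # step 2: forward pass on this mini-batch only
    costs.append(np.mean(err ** 2))                  # step 3: cost for this mini-batch (may oscillate)
    g = 2.0 / len(idx) * Xb.T @ err                  # step 4: gradient from this mini-batch
    w = w - eta * g                                  # step 5: parameter update

print(w, costs[-1])
```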
GD vs SGD vs Mini batch GD

Stochastic gradient descent: noisy, oscillates around the minimum but may not converge to it.

Gradient descent: takes consistent steps toward the minimum.

Mini batch gradient descent: less noisy than SGD; it heads toward the minimum more consistently, though it may still oscillate in a small region around it.
GD vs SGD vs Mini batch GD

GD is computationally heavy; it converges to a "flat minimum" and performs well on the test data.
Drawback of gradient based methods
• The most critical challenge in optimizing deep networks is finding the correct trajectory to move along.
• The gradient isn't usually a very good indicator of that trajectory.
• When the contours are perfectly circular, the gradient always points in the direction of the local minimum.
• However, if the contours are extremely elliptical (as is usually the case for the error surfaces of deep networks), the gradient can point as much as 90 degrees away from the correct direction!
Error surface of DNN

• A critical point can only be a local minimum if it appears as a local minimum in every single one of the d one-dimensional subspaces.
• As the dimensionality of our parameter space increases, local minima become exponentially rarer.
• As our function gains more and more dimensions (i.e., our model has more and more parameters), saddle points become exponentially more likely than local minima.
Exponentially Weighted Averages
(EWA)
• A technique used to calculate a moving average of a time series, where more recent data points are given higher weights and older data points are given lower weights.
• The weights decline exponentially as the data points get older, hence the name "exponentially weighted".
• Applied in various optimization algorithms in deep learning, such as Gradient Descent with Momentum, RMSProp, and Adam.
• The EWA is calculated using the following formula:

Vt = β ∗ Vt−1 + (1 − β) ∗ NewSample

Vt = weighted average at time t
β = coefficient of momentum, a parameter that determines the weight given to previous values (between 0 and 1)
NewSample = the new data point at time t
EWA

Vt = β ∗ Vt−1 + (1 − β) ∗ NewSample
• Roughly n ≈ 1 / (1 − β) can be interpreted as the effective number of recent observations used to compute the EWA.
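A small sketch of computing an EWA over a noisy series; the signal itself is an illustrative assumption, and β = 0.9 averages over roughly 1 / (1 − β) = 10 recent samples.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = np.sin(np.linspace(0, 6, 200)) + 0.3 * rng.normal(size=200)   # noisy series

beta = 0.9
v = 0.0
smoothed = []
for x in samples:
    v = beta * v + (1 - beta) * x     # V_t = beta * V_{t-1} + (1 - beta) * NewSample
    smoothed.append(v)

print(smoothed[-5:])                  # the EWA tracks the underlying sine with less noise
```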
Gradient Descent with Momentum
Vt = β ∗ Vt−1 + (1 − β) ∗ NewSample

• Adds a mechanism to GD for aggregating a history of past gradients in order to accelerate convergence.
• It takes knowledge from previous steps about where we should be heading, and introduces a new hyperparameter β.
• The term v_t is called momentum.
• This term accumulates the gradients from the past several steps.
• This method updates the parameters w in the direction of the weighted average of the past gradients:

v_t ← β v_{t−1} + g_{t,t−1}
w_t ← w_{t−1} − η v_t
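A minimal sketch of these two update equations on a toy quadratic loss; the loss function, η = 0.05 and β = 0.9 are illustrative assumptions, not values from the slides.

```python
import numpy as np

def grad(w):
    # Gradient of an elongated quadratic bowl 0.5 * (10 * w0^2 + w1^2)
    return np.array([10.0 * w[0], w[1]])

eta, beta = 0.05, 0.9
w = np.array([1.0, 1.0])
v = np.zeros(2)
for t in range(300):
    g = grad(w)
    v = beta * v + g        # accumulate a leaky history of past gradients
    w = w - eta * v         # step along the accumulated direction

print(w)                    # approaches the minimum at [0, 0]
```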
Gradient Descent with Momentum
Gradient Descent with Momentum
Momentum Update
Plain gradient update vs. update with momentum:

• Momentum averages out the oscillations.
• In the vertical direction, oscillations are reduced because positive and negative values average out; momentum reduces updates for dimensions whose gradients change direction.
• In the horizontal direction, the movement remains aggressive, which speeds up the descent; momentum increases updates for dimensions whose gradients keep pointing in the same direction.
Gradient Descent with Momentum

v_t ← β v_{t−1} + g_{t,t−1}
w_t ← w_{t−1} − η v_t
Gradient Descent with Momentum

• Momentum replaces gradients with a leaky average over past gradients. This accelerates convergence significantly.
• Momentum prevents stalling of the optimization process, which is much more likely to occur for stochastic gradient descent.
• The effective number of gradients is given by 1/(1 − β), due to the exponential down-weighting of past data.
• Implementation is quite straightforward, but it requires us to store an additional state vector (the momentum v).
Gradient Descent with Momentum

• Gradient descent with momentum uses the momentum of the gradient for parameter optimization.
• Movement = Negative of Gradient + Momentum.
• Even where the gradient of the cost ℒ(θ) is 0 (e.g., at a plateau or local minimum), the accumulated momentum can keep the parameters moving.

Slide credit: Hung-yi Lee, Deep Learning Tutorial
Solved problem: SGD and SGD + momentum

Consider the following loss function, with initial value w(0) = −2.8, learning rate η = 0.05 and β = 0.7. Use SGD and SGD + momentum to find the updated value w(1) after the first iteration.

L(w) = 0.3w⁴ − 0.1w³ − 2w² − 0.8w

Answer

SGD:
Iteration | g_i      | w_i
1         | −18.2943 | −1.88527
2         | −2.36614 | −1.76697

SGD + momentum:
Iteration | g_i      | v_i     | w_i
1         | −18.294  | 0       | −1.88527
2         | −2.3661  | −15.172 | −1.126
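A short script to check these numbers; the loss function, the initial value, η and β are taken from the problem statement above, and the momentum update follows v_t = β v_{t−1} + g_t, w_t = w_{t−1} − η v_t.

```python
def grad(w):
    # dL/dw = 1.2*w^3 - 0.3*w^2 - 4*w - 0.8
    return 1.2 * w**3 - 0.3 * w**2 - 4.0 * w - 0.8

eta, beta = 0.05, 0.7

# Plain SGD
w = -2.8
for _ in range(2):
    w = w - eta * grad(w)
    print("SGD w:", round(w, 4))          # -1.8853, then -1.767

# SGD with momentum
w, v = -2.8, 0.0
for _ in range(2):
    v = beta * v + grad(w)
    w = w - eta * v
    print("momentum w:", round(w, 4))     # -1.8853, then about -1.1267
```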


Adagrad - motivation
• In GD, the same learning rate is used for all parameters and all iterations.
• What happens to features that occur infrequently (sparse features)? E.g., w is associated with a sparse feature, while b is not (as x0 is always 1).
• Parameters associated with infrequent features only receive meaningful updates whenever those features occur.
Adagrad

• Decay the learning rate for each parameter in proportion to its update history.
• An individual learning rate per parameter (per feature).
• Accumulate past squared gradients in s_t (accumulate the history):

s_t = s_{t−1} + g_t²
w_t ← w_{t−1} − (η / √(s_t + ϵ)) · g_t

• ϵ is a smoothing term that avoids division by zero.
• Initialize s_0 = 0.
• One of Adagrad's main benefits is that it eliminates the need to manually tune the learning rate.
• Most implementations use a default value of 0.01 and leave it at that.
• Applications: natural language processing and image recognition (see the sketch below).
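A minimal NumPy sketch of the Adagrad update on a toy quadratic with one steep and one flat direction. The loss and ϵ are illustrative assumptions; η = 0.01 follows the common default mentioned above.

```python
import numpy as np

def grad(w):
    # Gradient of 0.5 * (10 * w0^2 + 0.1 * w1^2): a steep and a flat coordinate
    return np.array([10.0 * w[0], 0.1 * w[1]])

eta, eps = 0.01, 1e-8
w = np.array([1.0, 1.0])
s = np.zeros(2)
for t in range(1000):
    g = grad(w)
    s = s + g ** 2                        # accumulate the full history of squared gradients
    w = w - eta * g / np.sqrt(s + eps)    # per-coordinate (per-parameter) learning rate

print(w, s)                               # the steep coordinate accumulates a much larger s
```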
Adagrad

• It adapts the learning rate to the parameters:
• smaller updates (i.e., lower learning rates) for parameters associated with frequently occurring features;
• larger updates (i.e., higher learning rates) for parameters associated with infrequent features.
Drawback of AdaGrad

• AdaGrad decays the learning rate very aggressively (as the denominator keeps growing).
• As a result, after a while, the frequent parameters start receiving very small updates because of the decayed learning rate.
• The algorithm may therefore experience sluggish convergence or even premature stalling.
• To avoid this, why not decay the denominator and prevent its rapid growth?
RMSProp
• The issue:
• Adagrad accumulates the squares of the gradients g_t into a state vector s_t = s_{t−1} + g_t².
• As a result, s_t keeps growing without bound due to the lack of normalization.
• Fix: use a leaky average, in the same way we used it in the momentum method:

s_t = γ s_{t−1} + (1 − γ) g_t²
w_t ← w_{t−1} − (η / √(s_t + ϵ)) · g_t

• Parameter γ > 0 (in practice between 0 and 1); the constant ϵ > 0 is typically set to 10⁻⁶.
• Faster convergence compared to AdaGrad.
• Works well on big and redundant datasets.
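A minimal sketch of the RMSProp update on the same kind of toy loss; η, γ and the loss are illustrative assumptions, with ϵ = 10⁻⁶ as on the slide.

```python
import numpy as np

def grad(w):
    # Same elongated quadratic as before: one steep, one flat direction
    return np.array([10.0 * w[0], 0.1 * w[1]])

eta, gamma, eps = 0.01, 0.9, 1e-6
w = np.array([1.0, 1.0])
s = np.zeros(2)
for t in range(500):
    g = grad(w)
    s = gamma * s + (1 - gamma) * g ** 2      # leaky average instead of an unbounded running sum
    w = w - eta * g / np.sqrt(s + eps)

print(w)
```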
Review of techniques learned so far

1. Stochastic gradient descent
○ more computationally efficient than gradient descent when solving optimization problems
○ Mini batch stochastic gradient descent affords significant additional efficiency arising from vectorization, using larger sets of observations in one mini batch.
○ This is the key to efficient multi-machine, multi-GPU and overall parallel processing.
2. Momentum
○ added a mechanism for aggregating a history of past gradients to accelerate convergence.
3. Adagrad
○ used per-coordinate scaling to allow for a computationally efficient preconditioner.
4. RMSProp
○ leaky average + dynamic learning rate
Adam-Adaptive Moment Estimation

• Adam combines all of these techniques into one efficient learning algorithm.
• It computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients.
• It utilizes the momentum concept from "SGD with momentum" and the adaptive learning rate from "RMSProp":
• V for momentum
• S for RMSProp
• Adam can diverge due to poor variance control (a disadvantage).
Adam Algorithm

• First moment (the mean) of the gradients: v_t ← β₁ v_{t−1} + (1 − β₁) g_t
• Second moment (uncentered variance) of the gradients: s_t ← β₂ s_{t−1} + (1 − β₂) g_t²
• β₁ and β₂ are nonnegative weighting parameters. Common choices are β₁ = 0.9 and β₂ = 0.999.
• Initialize v₀ = s₀ = 0.
• Normalize the state variables (bias correction): v̂_t = v_t / (1 − β₁ᵗ), ŝ_t = s_t / (1 − β₂ᵗ). Bias correction helps Adam slightly outperform RMSProp.
• Rescale the gradient: g′_t = η v̂_t / (√ŝ_t + ϵ)
• Compute the update: w_t ← w_{t−1} − g′_t
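A minimal NumPy sketch of the Adam update as outlined above, using β₁ = 0.9 and β₂ = 0.999 from the slide; the learning rate, ϵ and the toy loss are illustrative assumptions.

```python
import numpy as np

def grad(w):
    # Same toy quadratic: one steep, one flat direction
    return np.array([10.0 * w[0], 0.1 * w[1]])

eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
w = np.array([1.0, 1.0])
v = np.zeros(2)                               # first moment (mean of gradients)
s = np.zeros(2)                               # second moment (uncentered variance of gradients)
for t in range(1, 1001):
    g = grad(w)
    v = beta1 * v + (1 - beta1) * g
    s = beta2 * s + (1 - beta2) * g ** 2
    v_hat = v / (1 - beta1 ** t)              # bias correction for the zero initialization
    s_hat = s / (1 - beta2 ** t)
    w = w - eta * v_hat / (np.sqrt(s_hat) + eps)

print(w)
```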
Optimization algorithm comparison

Gradient Descent: w_t ← w_{t−1} − η g_t

GD with Momentum (replaces gradients with a leaky average over past gradients):
v_t ← β v_{t−1} + g_{t,t−1}
w_t ← w_{t−1} − η v_t

AdaGrad (adaptive learning rate): an individual learning rate per parameter

RMSProp: exponentially decaying average of past squared gradients

Adam: exponentially decaying average of past gradients + exponentially decaying average of past squared gradients
ADAM VS SGD

• Adaptive methods (such as Adam) do not generalize as well as SGD with momentum when tested on a diverse set of deep learning tasks.
• In the research paper by Nitish Shirish Keskar and Richard Socher, "Improving Generalization Performance by Switching from Adam to SGD":
In the earlier stages of training Adam still outperforms SGD, but later the learning saturates. They propose a simple strategy in which training of the deep neural network starts with Adam and then switches to SGD when certain criteria are met.
Optimization algorithm comparison
References

Chapter 12, Dive into Deep Learning, https://d2l.ai/
Thank You All !

