AML_unit_1
Module - 1
Nayan Ranjan Paul
Assistant Professor
Department of Computer Science and Engineering
Silicon Institute of Technology
Outline
1 Deep Learning
2 Success Stories
3 History
4 Feedforward Deep Networks
5 Perceptron
Deep Learning
Success Stories
Describing photos
Translation
History - McCulloch Pitts Neuron (1943)
The first model of a neuron was invented by McCulloch (a physiologist) and Pitts (a logician).
The model had two inputs and a single output.
A neuron would not activate if only one of the inputs was active.
The weights for each input were equal, and the output was binary.
Until the inputs summed up to a certain threshold level, the output would remain zero.
The McCulloch and Pitts neuron has become known today as a logic circuit.
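A minimal Python sketch of the unit described above: binary inputs, equal (unit) weights, a binary output, and a fixed threshold. The threshold value of 2 is an illustrative assumption; with two inputs it reproduces the behaviour where a single active input is not enough to fire.

def mcculloch_pitts_neuron(inputs, threshold=2):
    # Binary inputs, equal (unit) weights: fire only when the number of
    # active inputs reaches the threshold.
    return 1 if sum(inputs) >= threshold else 0

print(mcculloch_pitts_neuron([1, 0]))  # 0 - one active input stays below the threshold
print(mcculloch_pitts_neuron([1, 1]))  # 1 - both inputs active reaches the threshold (logical AND)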
History - McCulloch Pitts Neuron (MPN) (1943)
History - Perceptron (1958)
History - Perceptron, The dream (1958)
“the embryo of an electronic computer that [the Navy] expects will be able to
walk, talk, see, write, reproduce itself and be conscious of its existence.”
History - MPN vs Perceptron
Apparently, McCulloch and Pitts' neuron is a better model for the electrochemical process inside the neuron than the perceptron.
But the perceptron is the basis and building block for modern neural networks.
History - Optimization (1960)
Widrow and Hoff proposed a method for adjusting the weights. They introduced a gradient search method based on minimizing the squared error (Least Mean Squares).
In the 1960s, there were many articles promising robots that could think.
It seems there was a general belief that perceptrons could solve any problem.
History - Shattered dream (1969)
Minsky and Papert's book Perceptrons (1969) showed that perceptrons could only solve linearly separable problems.
They showed that it is not possible for a perceptron to learn the XOR function.
History - Backpropagation (1974)
History
1980
The field of artificial neural network research experienced a resurgence.
2000
Neural networks fell out of favor, partly due to the limitations of backpropagation.
Backpropagation Limitations
It requires labeled training data.
It is very slow in networks with multiple layers (doesn’t scale well).
It can converge to poor local minima.
Backpropagation returns
Backpropagation returned in the 2010s, now able to train much larger networks using the huge computing power of modern hardware such as GPUs. For example, by 2013 the top speech recognizers used backpropagation-trained neural networks.
Feedforward Deep Networks
Perceptron
It was invented by Rosenblatt in 1957 at Cornell Labs, and was first mentioned in the paper 'The Perceptron - a perceiving and recognizing automaton'.
Perceptron
3) We set β* = β / ∥β∥ as the unit normal vector of the hyperplane L. For simplicity, we call β* the norm vector. The distance of a point x to L is given by
β*^T (x − x₀) = β*^T x − β*^T x₀ = β^T x / ∥β∥ + β₀ / ∥β∥ = (β^T x + β₀) / ∥β∥
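As a small illustration of the formula above, the following sketch (with illustrative names beta, beta0) computes the signed distance of a point x to the hyperplane {x : β^T x + β₀ = 0}.

import numpy as np

def distance_to_hyperplane(x, beta, beta0):
    # (beta^T x + beta0) / ||beta||, as derived above
    return (beta @ x + beta0) / np.linalg.norm(beta)

beta = np.array([3.0, 4.0])            # ||beta|| = 5
beta0 = -5.0
print(distance_to_hyperplane(np.array([2.0, 1.0]), beta, beta0))   # (6 + 4 - 5) / 5 = 1.0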
Perceptron
d_i = −y_i (β^T x_i + β₀)
Since we need the distance from the hyperplane to the misclassified data points, we add a negative sign in front: when a data point is misclassified, β^T x_i + β₀ has the opposite sign to y_i, so the negative sign makes the distance positive.
Learning Perceptron
That is, it finds the steepest slope in the D-dimensional space at a given point and descends in the direction of the negative slope.
Note that unless the error function is convex, it is possible to get stuck in a local minimum.
Learning Perceptron
β^new ← β^old − α ∂Err/∂β
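A minimal sketch of this update for the perceptron criterion Err = Σ over misclassified points of −y_i (β^T x_i + β₀), assuming the gradients ∂Err/∂β = −Σ y_i x_i and ∂Err/∂β₀ = −Σ y_i over the misclassified points; the function and variable names are illustrative.

import numpy as np

def perceptron_step(X, y, beta, beta0, alpha=0.1):
    # One gradient-descent step on the perceptron criterion.
    scores = X @ beta + beta0
    mis = y * scores <= 0                               # misclassified (or on the boundary)
    grad_beta = -(y[mis, None] * X[mis]).sum(axis=0)    # dErr/dbeta
    grad_beta0 = -y[mis].sum()                          # dErr/dbeta0
    return beta - alpha * grad_beta, beta0 - alpha * grad_beta0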
Learning Perceptron
In this situation, the number of iterations depends on the learning rate and the margin. However, if the data is not linearly separable, there is no guarantee that the algorithm converges.
Perceptron Features
The smaller the gap between the two classes, the longer the time of convergence.
Perceptron Features
When the data is not separable, the algorithm will not converge (it should be stopped after N steps).
A learning rate that is too high will make the perceptron oscillate periodically around the solution unless additional steps are taken.
The learning rate directly affects the accuracy of the solution and the number of iterations.
Neural Network
For regression, typically k = 1 (the number of nodes in the last layer); there is only one output unit y₁ at the end.
For c-class classification, there are typically c units at the end, with the c-th unit modelling the probability of class c; each y_c is coded as a 0-1 variable for the c-th class.
Backpropagation
a_i = Σ_l z_l u_il
z_i = σ(a_i)
σ(a) = 1 / (1 + e^(−a))
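A small sketch of these forward equations for a single layer, assuming the weight matrix U is laid out so that U[i, l] is u_il (from unit l of the previous layer to unit i of the current layer).

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward_layer(z_prev, U):
    # a_i = sum_l z_l * u_il ; z_i = sigma(a_i)
    a = U @ z_prev
    z = sigmoid(a)
    return a, z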
Backpropagation
∂|y − ŷ|² / ∂u_il = δ_i · z_l
where δ_i = ∂|y − ŷ|² / ∂a_i
Backpropagation
δ_i = ∂|y − ŷ|² / ∂a_i = Σ_j (∂|y − ŷ|² / ∂a_j) · (∂a_j / ∂a_i)
δ_i = Σ_j δ_j · (∂a_j / ∂z_i) · (∂z_i / ∂a_i)
δ_i = Σ_j δ_j · u_ji · σ′(a_i)
where δ_j = ∂|y − ŷ|² / ∂a_j
Backpropagation
where a_k = ŷ, assuming an activation function is not applied in the output layer:
δ_k = ∂(y − ŷ)² / ∂ŷ
δ_k = −2(y − ŷ)
Backpropagation
u_il ← u_il − α ∂(y − ŷ)² / ∂u_il
Backpropagation
Apply x to the FFNN's input layer and calculate the outputs of all input neurons.
Propagate the outputs of each hidden layer forward, one hidden layer at a time, and calculate the outputs of all hidden neurons.
Once x reaches the output layer, calculate the output(s) of all output neuron(s) given the outputs of the previous hidden layer.
At the output layer, compute δ_k = −2(y_k − ŷ_k) for each output neuron.
Backpropagation
Compute ∂(y − ŷ)² / ∂u_il = δ_i z_l for all weights u_il.
Then update u_il^new ← u_il^old − α · ∂(y − ŷ)² / ∂u_il for all weights u_il.
Continue with the next data point and iterate over the training set until the weights converge.
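The steps above can be put together in a short sketch for a network with one hidden layer, a linear output layer (no output activation, as assumed earlier), and squared error; the layer layout and learning rate are illustrative assumptions.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, y, U1, U2, alpha=0.01):
    # Forward pass
    a1 = U1 @ x                                        # hidden pre-activations a_i
    z1 = sigmoid(a1)                                   # hidden outputs z_i
    y_hat = U2 @ z1                                    # linear output layer

    # Backward pass
    delta_out = -2.0 * (y - y_hat)                     # delta_k = -2(y_k - y_hat_k)
    delta_hid = (U2.T @ delta_out) * z1 * (1.0 - z1)   # delta_i = sum_j delta_j u_ji sigma'(a_i)

    # Updates: u_il <- u_il - alpha * delta_i * z_l
    U2 = U2 - alpha * np.outer(delta_out, z1)
    U1 = U1 - alpha * np.outer(delta_hid, x)
    return U1, U2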
Epochs
It is common to cycle through all of the data points multiple times in order to reach convergence.
An epoch represents one cycle in which you feed all of your data points through the neural network.
It is good practice to randomize the order in which you feed the points to the neural network within each epoch; this can prevent your weights from changing in cycles.
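A sketch of such an epoch loop with per-epoch shuffling; update_step stands for any per-example update (for instance, the backpropagation step sketched earlier) and is an assumed callable.

import numpy as np

rng = np.random.default_rng(0)

def train(X, Y, params, update_step, n_epochs=100):
    # One epoch = one pass over all data points, in a freshly shuffled order.
    for epoch in range(n_epochs):
        order = rng.permutation(len(X))            # randomize the order within each epoch
        for idx in order:
            params = update_step(params, X[idx], Y[idx])
    return params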
Regularization for Deep Learning
Overview of Supervised Learning
The mapping function is often called the target function because it is the function that a given supervised machine learning algorithm aims to approximate.
Overview of Supervised Learning
The learned model will be used to predict the unseen test data.
However, if the learned model is not accurate, it can make prediction errors.
The main aim of ML analysts is to reduce these errors in order to get more accurate results.
Capacity, Overfitting and Underfitting
The training error is defined as the error computed on the training set by
the model during training.
The test error or also called generalization error is defined as the expected
value of the error on a new unseen input or test set. Here the expectation
is taken across different possible inputs, drawn from the distribution of
inputs we expect the system to encounter in practice.
Capacity, Overfitting and Underfitting
Our objective is not only to reduce the training error but also to reduce the generalization error or test error.
The factors determining how well a machine learning algorithm will perform are its ability to:
1 Make the training error small.
2 Make the gap between training and test error small.
Capacity, Overfitting and Underfitting
Underfitting occurs when the model is not able to obtain a sufficiently low
error value on the training set.
Overfitting occurs when the gap between the training error and test error
is too large.
Capacity, Overfitting and Underfitting
Models with low capacity may struggle to fit the training set.
Capacity, Overfitting and Underfitting
The hypothesis space specifies which family of functions the learning algorithm can choose from when varying the parameters to reduce a training objective; this is called the representational capacity of the model.
In many cases, finding the best function within this family is a very difficult optimization problem.
In practice, the learning algorithm does not actually find the best function, but merely one that significantly reduces the training error.
Capacity, Overfitting and Underfitting
Machine learning algorithms will generally perform best when their capacity is appropriate for the true complexity of the task they need to perform and the amount of training data they are provided with.
Models with insufficient capacity are unable to solve complex tasks, which may lead to underfitting.
Models with high capacity can solve complex tasks, but when their capacity is higher than needed for the present task they may overfit.
Introduction to Regularization
What is Regularization? An Intuitive view
What's my rule?
1 2 3 ⇐ Satisfies rule
4 5 6 ⇐ Satisfies rule
7 8 9 ⇐ Satisfies rule
9 2 32 ⇐ Satisfies rule
Plausible rules:
3 consecutive single digits
3 consecutive integers
3 numbers in ascending order
3 numbers whose sum is less than 25
1, 4, 7 in the first column
"yes" to the first 3 sequences and "no" to all others
Regularization corresponds to methods used in machine learning to improve
generalization performance (avoid overfitting the training data).
Parameter Norm Penalties
Supervised learning (Inference)
Supervised learning (Learning)
Norms (Definition)
Squared L2 Norm
x^T = [x₁, x₂]
L2 norm:
f(x) = √(x₁² + x₂²)
∂f/∂x₁ = x₁ (x₁² + x₂²)^(−1/2)
Squared L2 norm:
f(x) = x₁² + x₂²
∂f/∂x₁ = 2x₁
Norm penalties
Limiting the capacity of models by adding a norm penalty Ω(θ) to the objective function J.
Not modifying the model in the inference phase, but adding penalties to the objective function in the learning phase.
Also known as weight decay.
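A minimal sketch of this idea, assuming the penalty Ω(w) is the squared L2 norm (weight decay) with strength α; J and grad_J stand for the unregularized objective and its gradient and are assumed callables.

import numpy as np

def regularized_objective(J, w, alpha=0.01):
    # J_tilde(w) = J(w) + alpha * (1/2) * ||w||_2^2
    return J(w) + alpha * 0.5 * np.sum(w ** 2)

def regularized_gradient(grad_J, w, alpha=0.01):
    # grad J_tilde(w) = grad J(w) + alpha * w  (the weight decay term)
    return grad_J(w) + alpha * w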
L2 Regularization
L2 Regularization - Analysis
We see that the effect of weight decay is to rescale w* along the axes defined by the eigenvectors of H.
Specifically, the component of w* that is aligned with the i-th eigenvector of H is rescaled by a factor of λ_i / (λ_i + α).
Along the directions where the eigenvalues of H are relatively large, for example where λ_i ≫ α, the effect of regularization is relatively small.
However, components with λ_i ≪ α will be shrunk to have nearly zero magnitude.
Only directions along which the parameters contribute significantly to reducing the objective function are preserved relatively intact.
Summary: The weight vector w* is rotated to w̃ on using L2 regularization. All of its elements shrink, but some shrink more than others. This ensures that only important features are given high weights.
L2 Regularization - Analysis
The matrix X^T X is proportional to the covariance matrix (1/m) X^T X.
L1 Regularization
1 Let the objective function be J(w) and the norm penalty be Ω(w).
2 The regularized objective function is
J̃(w) = J(w) + αΩ(w)
3 Substituting the L1 norm for Ω(w):
J̃(w) = J(w) + α ∥w∥₁
4 The gradient of the objective function is
∇J̃(w) = ∇(J(w) + α ∥w∥₁) = ∇J(w) + α sign(w)
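A one-line sketch of this gradient; grad_J stands for the gradient of the unregularized objective and is an assumed callable.

import numpy as np

def l1_regularized_gradient(grad_J, w, alpha=0.01):
    # grad J_tilde(w) = grad J(w) + alpha * sign(w): a constant-magnitude
    # contribution whose sign matches each weight.
    return grad_J(w) + alpha * np.sign(w)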
L1 Regularization - Analysis
Effect of L1 regularization
1 The regularization contribution to the gradient no longer scales linearly with each w_i; instead, it is a constant factor with a sign equal to sign(w_i).
2 It has no algebraic (closed-form) solution.
3 An approximate solution can be obtained from the quadratic approximation of J, whose gradient is
∇_w Ĵ(w) = H(w − w*)
Comparison of L1 and L2 Norm Regularization
Norm Penalties as Constrained Optimization
Constrained optimization
Sometimes one may wish to find the maximal or minimal value of f(x) for values of x in some set S.
Expression of constrained function
Generalized Lagrange function
Norm Penalties with respect to Constrained Optimization
Regularization and Under-Constrained Problems
Dataset Augmentation
Noise Robustness
Early Stopping
Early Stopping
How to stop early?
1 Execute the algorithm iteratively.
2 Track the validation error.
3 Have a patience parameter p.
4 If you are at step k and there was no improvement in validation error in the previous p steps, then stop training and return the model stored at step k − p.
5 Basically, stop the training early, before it drives the training error to 0 and blows up the validation error.
How does Early Stopping act as a regularizer?
(Mathematical analysis)
Algorithm
Objective - To determine the best amount of time (number of training steps) to train the model.
1: n : the number of steps between evaluations
2: p : the number of times to observe a worsening validation set error before giving up (the patience)
3: θ₀ : initial parameter vector
4: θ ← θ₀
5: i ← 0, j ← 0, v ← ∞, θ* ← θ, i* ← i   # θ* : best parameter set
6: while j < p do
7:   Update θ by executing the training algorithm for n steps
8:   i ← i + n
9:   v′ ← ValidationSetError(θ)
10:  if v′ < v then
11:    j ← 0
12:    θ* ← θ
13:    i* ← i
14:    v ← v′
15:  else
16:    j ← j + 1
17:  end if
18: end while
19: Return the best parameters θ* and the best number of training steps i*
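A Python sketch of the same patience-based loop; train_n_steps and validation_error are assumed user-supplied callables, and the defaults for n and p are illustrative.

import copy

def early_stopping(theta, train_n_steps, validation_error, n=100, p=5):
    best_theta, best_i, best_v = copy.deepcopy(theta), 0, float("inf")
    i, j = 0, 0
    while j < p:
        theta = train_n_steps(theta, n)     # update theta for n steps
        i += n
        v = validation_error(theta)
        if v < best_v:                      # validation error improved: reset patience
            j, best_v = 0, v
            best_theta, best_i = copy.deepcopy(theta), i
        else:                               # no improvement: lose one unit of patience
            j += 1
    return best_theta, best_i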
Parameter Tying
1 Parameter dependency
1 L2 regularization or weight decay penalizes model parameters for deviating
from the fixed value of zero.
2 Sometimes we need other ways to express prior knowledge of parameters.
3 We may know from domain and model architecture that there should be
some dependencies between model parameters.
2 The goal of parameter tying
1 We want to express that certain parameters should be close to one another.
Parameter Tying
(A scenario of parameter tying)
1 Two models performing the same classification task with the same set of classes, but with somewhat different input distributions.
2 Model A with parameters W(A).
3 Model B with parameters W(B).
4 The two models will map the input to two different, but related, outputs:
ŷ(A) = f(W(A), x)
ŷ(B) = g(W(B), x′)
Parameter Tying
(L2 penalty for parameter tying)
1 If the tasks are similar enough (perhaps with similar input and output distributions), then we believe that the model parameters should be close to each other:
∀i, w_i(A) ≈ w_i(B)
2 We can leverage this information via regularization.
3 Use a parameter norm penalty (other choices are possible):
Ω(W(A), W(B)) = ∥W(A) − W(B)∥₂²
so J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)
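A small sketch of this penalty and the gradient contributions it adds to each model; the scaling α and the names are illustrative.

import numpy as np

def tying_penalty(W_A, W_B, alpha=0.1):
    # Omega(W_A, W_B) = ||W_A - W_B||_2^2, scaled by alpha
    diff = W_A - W_B
    penalty = alpha * np.sum(diff ** 2)
    grad_A = 2.0 * alpha * diff      # added to model A's gradient
    grad_B = -2.0 * alpha * diff     # added to model B's gradient
    return penalty, grad_A, grad_B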
Parameter Sharing
Sparse Representation
Parameter Sparsity
Representational Sparsity
Orthogonal matching pursuit
Bagging and other ensemble methods
Background
Bagging and other ensemble methods
Approach
1 In the case where the errors are perfectly correlated and c = v, this reduces to v, and model averaging does not help at all.
2 If the errors are perfectly uncorrelated and c = 0, the expected squared error of the ensemble is only (1/k) v.
3 On average, the ensemble will perform at least as well as any of its members.
4 If the members make independent errors, the ensemble will perform significantly better than its members.
5 Question - Is the expected error of the ensemble less than the average error of its members?
Bagging method
1 Bagging is a method that allows the same kind of model, training algorithm, and objective function to be reused several times.
2 At the outset, this method constructs k different datasets.
3 Each dataset has the same number of examples as the original dataset, but each dataset is built by sampling with replacement from the original dataset.
4 This means a dataset may contain several duplicate examples; however, the different datasets should overlap as little as possible, so that each model is trained differently enough for the averaged result to beat the individual models.
5 Model i is then trained on dataset i.
6 This method is powerful in reducing generalization error.
7 Model complexity and memory requirements are higher.
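A minimal sketch of this construction, assuming numpy arrays and an assumed fit_model callable that returns a fitted model with a predict method.

import numpy as np

rng = np.random.default_rng(0)

def bagging_fit(X, y, k, fit_model):
    # Train k models, each on a bootstrap sample (with replacement) of (X, y).
    models, n = [], len(X)
    for _ in range(k):
        idx = rng.integers(0, n, size=n)       # sample n indices with replacement
        models.append(fit_model(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # Average the predictions of the ensemble members.
    return np.mean([m.predict(X) for m in models], axis=0)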
Dropout
Background (Why/How)
1 Bagging and other ensemble techniques require large neural network models.
2 Training and evaluating such networks is costly in terms of runtime and memory.
3 In such cases, bagging and ensembling are impractical to apply.
4 Dropout therefore provides a computationally inexpensive but powerful method of regularizing a broad family of models.
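As a preview of the method, here is a minimal sketch of one common formulation (inverted dropout): each unit is kept with probability keep_prob during training and the kept activations are rescaled so that nothing changes at test time. The exact variant used on the following slides may differ.

import numpy as np

rng = np.random.default_rng(0)

def dropout(z, keep_prob=0.5, training=True):
    if not training:
        return z                               # use the full network at test time
    mask = rng.random(z.shape) < keep_prob     # keep each unit with probability keep_prob
    return (z * mask) / keep_prob              # rescale the surviving activations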
Dropout Method
Dropout - How does it select units?
Optimization for Training Deep Learning Models
Introduction
1 Of all the many optimization problems involved in deep learning, the most difficult is neural network training.
2 It is quite common to invest days to months of time on hundreds of machines in order to solve even a single instance of the neural network training problem.
3 Because this problem is so important and so expensive, a specialized set of optimization techniques has been developed for solving it.
4 The goal is to find the parameters θ of a neural network that significantly reduce a cost function J(θ).
How Learning Differs from Pure Optimization?
How Learning Differs from Pure Optimization?
1 In the context of deep learning, the loss function can be written as an average over the training set:
J(θ) = E_{(x,y)∼p̂_data} L(f(x; θ), y)
where L is the per-example loss function, f(x; θ) is the predicted output for x, y is the actual output for x, and p̂_data is the empirical distribution.
2 Here the expectation is over the training data; we would usually prefer to minimize the expectation over the data-generating distribution p_data rather than over the finite training set:
J*(θ) = E_{(x,y)∼p_data} L(f(x; θ), y)
Empirical Risk Minimization
What is the problem with empirical risk Minimization?
Surrogate Loss Function
Early Stopping
Batch and Minibatch Algorithms
1 Optimization algorithms that use the entire training set to compute the gradient are called batch or deterministic gradient methods.
2 Algorithms that use a single training example for that task are called stochastic or online gradient methods.
3 Algorithms that use b training examples are called minibatch methods, where 1 ≤ b ≤ m and m is the total number of training examples.
4 Larger batch sizes provide a more accurate estimate of the gradient, but with less than linear returns.
5 The maximum batch size will be limited by available memory.
6 Small batches can offer a regularizing effect.
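A small sketch of drawing minibatches of size b from m examples (b = 1 gives the stochastic/online case, b = m the full batch case); the names are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def minibatches(X, y, b):
    # Yield shuffled minibatches of size b (the last one may be smaller).
    m = len(X)
    order = rng.permutation(m)
    for start in range(0, m, b):
        idx = order[start:start + b]
        yield X[idx], y[idx]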
Challenges in Neural Network Optimization
Optimization is a Hard Problem
Ill Conditioning
Issue of Local Minima
1 When optimizing a convex function, any local minimum is guaranteed to be a global minimum.
2 For non-convex optimization functions (the general case for neural networks), there can be an extremely large number of local minima.
3 Functions involved in deep models are guaranteed to have an extremely large number of local minima, which is an issue.
4 Local minima are particularly problematic when they have high cost compared to the global minimum.
5 It is an open question, but researchers largely suspect that these local minima are not a major problem.
Issue of Saddle Point
1 For high-dimensional non-convex functions, local minima and maxima are in fact rare compared to saddle points.
2 A saddle point is a point where the Hessian matrix has both positive and negative eigenvalues.
3 We can think of a saddle point as being a local minimum along one cross-section of the cost function and a local maximum along another cross-section.
Cliffs and Exploding Gradients
1 Neural networks with many layers often have extremely steep regions resembling cliffs.
2 Cliffs result from the multiplication of several large weights together.
3 On the face of an extremely steep cliff structure, the gradient update step can move the parameters extremely far, usually jumping off the cliff structure altogether.
4 We can deal with these cliff structures using gradient clipping.
5 Cliffs are common in RNNs due to the multiplication of many factors.
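A minimal sketch of norm-based gradient clipping; the threshold value is an illustrative assumption.

import numpy as np

def clip_gradient(grad, threshold=1.0):
    # Rescale the gradient if its norm exceeds the threshold, so one step
    # near a cliff cannot move the parameters extremely far.
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad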
Long Term Dependency
1 It arises when the computational graph becomes very deep; the result of this problem is vanishing and exploding gradients.
Stochastic Gradient Descent (SGD)
1 Stochastic gradient descent (SGD) and its variants are probably the most
used optimization algorithms for machine learning in general and for deep
learning in particular.
SGD
Explanation
1 The dataset is D = {(x(i), y(i))}, i = 1 to N.
2 Select a minibatch of M from the N samples.
3 The error can be minimized by considering the minibatch.
SGD
Analysis
1 The convergence of SGD is guaranteed if
Σ_{k=1}^{∞} ε_k = ∞ and
Σ_{k=1}^{∞} ε_k² < ∞
SGD
Analysis
1 Select ε₀, execute the algorithm several times, and determine the best-performing step size. Then change ε₀ and check again whether the performance is better than before.
2 Further, increase the minibatch size from M to M+j during training so that ε_i can be selected properly.
3 Q - Can we increase the performance of the SGD method?
Ans - By adding momentum to the process.
Momentum
Momentum Definition
1 The momentum algorithm introduces a variable v that plays the role of velocity.
2 Velocity is the direction and speed at which the parameters move through parameter space.
3 The velocity is set to an exponentially decaying average of the negative gradient.
4 The name momentum derives from a physical analogy, in which the negative gradient is a force moving a particle through parameter space, according to Newton's laws of motion.
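A one-step sketch of the momentum update: the velocity v accumulates an exponentially decaying average of past negative gradients (decay α) and the parameters move along v; ε is the learning rate, and the default values are illustrative.

def sgd_momentum_step(theta, v, grad, alpha=0.9, epsilon=0.01):
    # v <- alpha * v - epsilon * grad ;  theta <- theta + v
    v = alpha * v - epsilon * grad
    theta = theta + v
    return theta, v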
SGD With Momentum Algorithm
Analysis of SGD and SGD-Momentum
Nesterov Momentum
where the parameters α and ε play similar roles as in the standard momentum method.
Nesterov Momentum Vs. Standard Momentum
SGD with Nesterov Momentum Algorithm
Algorithms with Adaptive Learning Rates
1 The learning rate is reliably one of the most difficult hyperparameters to set because it has a significant impact on model performance.
2 The cost is often highly sensitive to some directions in parameter space and insensitive to others.
3 The momentum algorithm can mitigate these issues somewhat, but it does so at the expense of introducing another hyperparameter.
4 Is there another way?
If we believe that the directions of sensitivity are somewhat axis-aligned, it can make sense to use a separate learning rate for each parameter and to automatically adapt these learning rates throughout the course of learning.
AdaGrad Algorithm
1 The AdaGrad algorithm individually adapts the learning rates of all model parameters by scaling them inversely proportional to the square root of the sum of all of their historical squared values.
2 The parameters with the largest partial derivative of the loss have a correspondingly rapid decrease in their learning rate, while parameters with small partial derivatives have a relatively small decrease in their learning rate.
3 Empirically it has been found that, for training deep neural network models, the accumulation of squared gradients from the beginning of training can result in a premature and excessive decrease in the effective learning rate.
4 AdaGrad performs well for some but not all deep learning models.
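A one-step sketch of the AdaGrad update, accumulating squared gradients in r and scaling the learning rate element-wise; the default constants are illustrative.

import numpy as np

def adagrad_step(theta, r, grad, epsilon=0.01, delta=1e-7):
    # r <- r + grad * grad ;  theta <- theta - epsilon * grad / (delta + sqrt(r))
    r = r + grad * grad
    theta = theta - epsilon * grad / (delta + np.sqrt(r))
    return theta, r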
RMSProp
1 SGD, SGD with momentum, and AdaGrad are suitable for convex functions, but RMSProp is good for non-convex functions.
2 It modifies AdaGrad for the non-convex setting.
3 It changes the gradient accumulation into an exponentially weighted moving average.
4 AdaGrad shrinks the learning rate according to the entire history of the squared gradient and may have made the learning rate too small before arriving at such a convex structure.
5 RMSProp uses an exponentially decaying average to discard history from the extreme past so that it can converge rapidly.
6 The use of the moving average introduces a new hyperparameter, ρ, that controls the length scale of the moving average.
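A one-step sketch of the RMSProp update, replacing AdaGrad's full accumulation by an exponentially weighted moving average with decay rate ρ; the default constants are illustrative.

import numpy as np

def rmsprop_step(theta, r, grad, epsilon=0.001, rho=0.9, delta=1e-6):
    # r <- rho * r + (1 - rho) * grad * grad ;  theta <- theta - epsilon * grad / sqrt(delta + r)
    r = rho * r + (1.0 - rho) * grad * grad
    theta = theta - epsilon * grad / np.sqrt(delta + r)
    return theta, r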
AdaGrad Vs. RMSProp
AdaGrad
1 No decay rate.
2 Numerical stability constant δ = 10⁻⁷.
3 Variation along the path is more.
4 Time to converge is more.
5 For a convex bowl, the result is not that fast.
RMSProp
1 Exponential decay rate ρ.
2 Numerical stability constant δ = 10⁻⁶.
3 Variation along the path is less.
4 Time to converge is less compared to AdaGrad.
5 For a convex bowl, RMSProp produces results faster.
RMSProp algorithm with Nesterov momentum
Characteristics of RMSProp with Nesterov momentum
Adam
Adam Algorithm
Thank You