
Advanced Machine Learning

Module - 1
Nayan Ranjan Paul

Assistant Professor
Department of Computer Science and Engineering
Silicon Institute of Technology

September 23, 2024

Outline

1 Deep Learning

2 Success Stories

3 History

4 Feedforward Deep Networks

5 Perceptron

Deep Learning

Deep learning attempts to learn representations of data with multiple levels of abstraction.

Deep learning usually refers to a set of algorithms and computational models that are composed of multiple processing layers.

These methods have significantly improved the state of the art in many domains, including speech recognition, classification, pattern recognition, drug discovery, and genomics.

Success Stories

Deep Learning Machine Teaches Itself Chess in 72 Hours, Plays at International Master Level.

An artificial intelligence machine plays chess by evaluating the board rather than using brute force to work out every possible move.

Success Stories

Captions generated by a recurrent neural network.


Success Stories

Describing photos

Success Stories

Translation

Success Stories

Create new images

History - McCulloch Pitts Neuron (1943)
The first model of a neuron was invented by McCulloch (a physiologist) and Pitts (a logician).
The model had two inputs and a single output.
A neuron would not activate if only one of the inputs was active.
The weights for each input were equal, and the output was binary.
Until the inputs summed up to a certain threshold level, the output would remain zero.
The McCulloch and Pitts neuron has become known today as a logic circuit.

History - McCulloch Pitts Neuron(MPN) (1943)

Logic functions can be modeled by a network of MP-neurons


History - Perceptron (1958)

The perceptron was developed by Rosenblatt (a psychologist).

History - Perceptron, The dream(1958)

Rosenblatt randomly connected the perceptrons and changed the weights in order to achieve “learning.”

Based on Rosenblatt's statements at a press conference in 1958, The New York Times reported the perceptron to be

“the embryo of an electronic computer that [the Navy] expects will be able to
walk, talk, see, write, reproduce itself and be conscious of its existence.”

History - MPN vs Perceptron

The McCulloch and Pitts neuron is apparently a better model of the electrochemical process inside a neuron than the perceptron.

But the perceptron is the basis and building block of modern neural networks.

History - Optimization (1960)

Widrow and Hoff proposed a method for adjusting the weights. They introduced a gradient search method based on minimizing the squared error (Least Mean Squares).

In the 1960s, there were many articles promising robots that could think.

There seems to have been a general belief that perceptrons could solve any problem.

History - shattered dream (1969)

Minsky and Papert published their book Perceptrons.

The book shows that perceptrons can only solve linearly separable problems.

They showed that it is not possible for a perceptron to learn the XOR function.

After Perceptrons was published, researchers lost interest in perceptrons and neural networks.

History - Backpropagation (1974)

Backpropagation was reinvented and applied in the context of neural networks by Paul Werbos, David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams.

Backpropagation allowed perceptrons to be trained in a multilayer configuration.

History
1980
The field of artificial neural network research experienced a resurgence.
2000
Neural networks fell out of favor, partly due to the limitations of backpropagation (BP).
Backpropagation Limitations
It requires labeled training data.
It is very slow in networks with multiple layers (it doesn't scale well).
It can converge to poor local minima.

Backpropagation returns
Backpropagation returned in the 2010s, now able to train much larger networks using huge modern computing power such as GPUs. For example, by 2013 top speech recognizers were using backpropagation-trained neural networks.

Feedforward Deep Networks

Feedforward deep networks, a.k.a. multilayer perceptrons (MLPs), are parametric functions composed of several parametric functions.

Each layer of the network defines one of these sub-functions.

Each layer (sub-function) has multiple inputs and multiple outputs.

Each layer is composed of many units (the scalar outputs of the layer).

We sometimes refer to each unit as a feature.

Each unit is usually a simple transformation of its input.

The entire network can be very complex.
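As a rough illustration (not from the slides), the composition of layer sub-functions can be sketched in a few lines of NumPy; the tanh nonlinearity and the layer sizes are arbitrary assumptions:

    import numpy as np

    def layer(x, W, b):
        # one sub-function: an affine map followed by a simple nonlinearity
        return np.tanh(W @ x + b)

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)                         # input with 3 features
    W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # layer 1: 3 inputs -> 4 units
    W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)  # layer 2: 4 units -> 2 outputs

    h = layer(x, W1, b1)   # each element of h is one unit (a learned feature)
    y = W2 @ h + b2        # the network is a composition of simple sub-functions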

Perceptron

The perceptron is the building block for neural networks.

It was invented by Rosenblatt in 1957 at Cornell Labs, and was first mentioned in the paper ‘The Perceptron – a perceiving and recognizing automaton’.

The perceptron computes a linear combination of the input features and returns the sign.

Perceptron

xi is the i-th feature of a sample and βi is the i-th weight.

β0 is defined as the bias. The bias alters the position of the decision boundary between the 2 classes.

From a geometrical point of view, the perceptron assigns the label “1” to elements on one side of the hyperplane β^T x + β0 = 0 and the label “-1” to elements on the other side.
Perceptron

Define a cost function, ϕ(β, β0), as the summation of the distances between all misclassified points and the hyperplane, i.e., the decision boundary.
To minimize this cost function, we need to estimate β, β0:
min_{β,β0} ϕ(β, β0) = sum of the distances of all misclassified points

Perceptron

Distance between a point and the decision-boundary hyperplane (black line).

Perceptron

1) A hyperplane L can be defined as
L = {x : f(x) = β^T x + β0 = 0}
For any two arbitrary points x1 and x2 on L, we have
β^T x1 + β0 = 0,
β^T x2 + β0 = 0,
such that
β^T (x1 − x2) = 0.
Therefore, β is orthogonal to the hyperplane; it is the normal vector.

Perceptron

2) For any point x0 on L,
β^T x0 + β0 = 0, which means β^T x0 = −β0.

Perceptron

3) We set β* = β / ∥β∥ as the unit normal vector of the hyperplane L. For simplicity, we call β* the norm vector. The distance of a point x to L is given by

β*^T (x − x0) = β*^T x − β*^T x0 = β^T x / ∥β∥ + β0 / ∥β∥ = (β^T x + β0) / ∥β∥

where x0 is any point on L. Hence, β^T x + β0 is proportional to the distance of the point x to the hyperplane L.

Perceptron

4) The distance from a misclassified data point xi to the hyperplane L is

di = −yi (β^T xi + β0),

where yi is the target value, such that yi = 1 if β^T xi + β0 < 0, and yi = −1 if β^T xi + β0 > 0.

Since we need the distance from the hyperplane to a misclassified data point to be positive, we add a negative sign in front: when a data point is misclassified, β^T xi + β0 has the opposite sign of yi, so yi (β^T xi + β0) is negative, and negating it gives a positive distance.

Learning Perceptron

Gradient descent is an optimization method that finds the minimum of an objective function by incrementally updating its parameters in the negative direction of the derivative of this function.

That is, it finds the steepest slope in the D-dimensional space at a given point and descends in the direction of the negative slope.

Note that unless the error function is convex, it is possible to get stuck in a local minimum.

In our case, the objective function to be minimized is the classification error, and the parameters of this function are the weights associated with the inputs, β.

Learning Perceptron

The gradient descent algorithm updates the weights as follows:

β^new ← β^old − α ∂Err/∂β

α is called the learning rate. The learning rate α is positively related to the step size of convergence of min ϕ(β, β0), i.e., the larger α is, the larger the step size is. Typically α ∈ [0.1, 0.3].

Learning Perceptron

The classification error is defined as the distance of misclassified observations to the decision boundary.
We minimize the cost function ϕ(β, β0) = −Σ_{i∈M} yi (β^T xi + β0), where M = {all points that are misclassified}:

∂ϕ/∂β = −Σ_{i∈M} yi xi   and   ∂ϕ/∂β0 = −Σ_{i∈M} yi

Learning Perceptron

Therefore, the gradient is

∇D(β, β0) = ( −Σ_{i∈M} yi xi ,  −Σ_{i∈M} yi )

Using the gradient descent algorithm to solve these two equations, for a misclassified point (xi, yi) we have

( β^new, β0^new ) = ( β^old, β0^old ) + α ( yi xi, yi )
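A minimal NumPy sketch of this update rule (a stochastic variant that updates on one misclassified point at a time; the data and learning rate are illustrative):

    import numpy as np

    def train_perceptron(X, y, alpha=0.1, epochs=100):
        # update from the slides: beta += alpha*yi*xi, beta0 += alpha*yi
        beta, beta0 = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):
            errors = 0
            for xi, yi in zip(X, y):
                if yi * (beta @ xi + beta0) <= 0:   # point is misclassified
                    beta += alpha * yi * xi
                    beta0 += alpha * yi
                    errors += 1
            if errors == 0:                         # separating hyperplane found
                break
        return beta, beta0

    X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
    y = np.array([1, 1, -1, -1])                    # labels are +1 / -1
    beta, beta0 = train_perceptron(X, y)
    print(np.sign(X @ beta + beta0))                # should match y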

Learning Perceptron

If the data is linearly separable, the solution is theoretically guaranteed to converge to a separating hyperplane in a finite number of iterations.

In this situation the number of iterations depends on the learning rate and the margin. However, if the data is not linearly separable, there is no guarantee that the algorithm converges.

Perceptron Features

A perceptron can only discriminate between two classes at a time.

When the data is linearly separable, there are an infinite number of solutions, depending on the starting point.

Even though convergence to a solution is guaranteed if a solution exists, the finite number of steps until convergence can be very large.

The smaller the gap between the two classes, the longer the time to convergence.

Perceptron Features

When the data is not separable, the algorithm will not converge (it should be stopped after N steps).

A learning rate that is too high will make the perceptron periodically oscillate around the solution unless additional steps are taken.

The learning rate directly affects the accuracy of the solution and the number of iterations.

Neural Network

A neural network is a multistage regression model which is typically represented by a network diagram.

Feed Forward Neural Network

Neural Network

For regression, typically k = 1 (the number of nodes in the last layer); there is only one output unit y1 at the end.

For c-class classification, there are typically c units at the end, with the c-th unit modeling the probability of class c; each yc is coded as a 0-1 variable for the c-th class.

Backpropagation

ai = Σ_l zl uil

zi = σ(ai)

σ(a) = 1 / (1 + e^(−a))
Backpropagation

Take the derivative with respect to weight uil:

∂|y−ŷ|² / ∂uil = (∂|y−ŷ|² / ∂aj) · (∂aj / ∂uil)

∂|y−ŷ|² / ∂uil = δj · zl

where δj = ∂|y−ŷ|² / ∂aj

Backpropagation

δi = ∂|y−ŷ|² / ∂ai = Σ_j (∂|y−ŷ|² / ∂aj) · (∂aj / ∂ai)

δi = Σ_j δj · (∂aj / ∂zi) · (∂zi / ∂ai)

δi = Σ_j δj · uji · σ′(ai)

where δj = ∂|y−ŷ|² / ∂aj

Backpropagation

Note that if σ(x) is the sigmoid function, then

σ′(x) = σ(x)(1 − σ(x))

The recursive definition of δi:

δi = σ′(ai) Σ_j δj · uji

Backpropagation

Now consider δk for the output layer:

δk = ∂(y−ŷ)² / ∂ak

where ak = ŷ (assume an activation function is not applied in the output layer).

δk = ∂(y−ŷ)² / ∂ŷ

δk = −2(y − ŷ)

Backpropagation

uil ← uil − α ∂(y−ŷ)² / ∂uil

The network weights are updated using the backpropagation algorithm as each training data point x is fed into the feedforward neural network (FFNN).

Backpropagation

The backpropagation procedure is done using the following steps:

First, arbitrarily choose some random weights (preferably close to zero) for your network.

Apply x to the FFNN's input layer and calculate the outputs of all input neurons.

Propagate the outputs of each hidden layer forward, one hidden layer at a time, and calculate the outputs of all hidden neurons.

Once x reaches the output layer, calculate the output(s) of all output neuron(s) given the outputs of the previous hidden layer.

At the output layer, compute δk = −2(yk − ŷk) for each output neuron.

Backpropagation

Compute each δi, starting from i = k − 1 all the way down to the first hidden layer, where δi = σ′(ai) Σ_j δj · uji.

Compute ∂(y−ŷ)² / ∂uil = δi zl for all weights uil.

Then update uil^new ← uil^old − α · ∂(y−ŷ)² / ∂uil for all weights uil.

Continue with the next data point and iterate over the training set until the weights converge.
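To make these steps concrete, here is a hedged NumPy sketch of one training update for a single-hidden-layer network with a linear output (biases omitted for brevity; shapes, data and learning rate are illustrative assumptions):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def backprop_step(x, y, U1, U2, alpha=0.05):
        # forward pass
        a1 = U1 @ x              # pre-activations of the hidden units
        z1 = sigmoid(a1)         # hidden outputs
        y_hat = U2 @ z1          # linear output layer (no activation)
        # backward pass, following the delta equations above
        delta_out = -2.0 * (y - y_hat)                  # delta_k = -2(y - y_hat)
        delta_hid = (U2.T @ delta_out) * z1 * (1 - z1)  # sigma'(a_i) * sum_j delta_j u_ji
        # gradient d(loss)/d(u_il) = delta_i * z_l, then a gradient step
        U2 -= alpha * np.outer(delta_out, z1)
        U1 -= alpha * np.outer(delta_hid, x)
        return U1, U2

    rng = np.random.default_rng(0)
    U1 = rng.normal(scale=0.1, size=(3, 2))   # small random initial weights
    U2 = rng.normal(scale=0.1, size=(1, 3))
    x, y = np.array([0.5, -1.0]), np.array([1.0])
    for _ in range(200):
        U1, U2 = backprop_step(x, y, U1, U2)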

Epochs

It is common to cycle through all of the data points multiple times in order to reach convergence.

An epoch represents one cycle in which you feed all of your data points through the neural network.

It is good practice to randomize the order in which you feed the points to the neural network within each epoch; this can prevent your weights from changing in cycles (see the sketch below).

The number of epochs required for convergence depends greatly on the learning rate and the convergence requirements used.
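A small sketch of per-epoch shuffling, reusing the hypothetical backprop_step from the previous sketch (X, Y are assumed training arrays and U1, U2 the current weights):

    import numpy as np

    rng = np.random.default_rng(0)
    for epoch in range(50):                  # one epoch = one pass over the data
        for i in rng.permutation(len(X)):    # a fresh random order every epoch
            U1, U2 = backprop_step(X[i], Y[i], U1, U2)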

Regularization for Deep Learning

Overview of Supervised Learning

In supervised machine learning, an algorithm learns a model from training data.

The goal of any supervised machine learning algorithm is to best estimate the mapping function (f) for the output variable (Y) given the input data (X).

The mapping function is often called the target function because it is the function that a given supervised machine learning algorithm aims to approximate.

Overview of Supervised Learning

The learned model is then used to predict on unseen test data.

However, if the learned model is not accurate, it can make prediction errors.

In machine learning, these errors will always be present, as there is always a slight difference between the model predictions and the actual values.

The main aim of ML analysts is to reduce these errors in order to get more accurate results.

Capacity, Overfitting and Underfitting

The central challenge in machine learning is that we must perform well on new, previously unseen inputs, not just those on which our model was trained.

The ability to perform well on previously unobserved inputs is called generalization.

The training error is defined as the error computed on the training set by the model during training.

The test error, also called the generalization error, is defined as the expected value of the error on a new unseen input or test set. Here the expectation is taken across different possible inputs, drawn from the distribution of inputs we expect the system to encounter in practice.

Capacity, Overfitting and Underfitting

Our objective is not only to reduce the training error but also to reduce the generalization error, or test error.

We typically estimate the generalization error of a machine learning model by measuring its performance on a test set of examples that were collected separately from the training set.

The factors determining how well a machine learning algorithm will perform are its ability to:
1 Make the training error small.
2 Make the gap between training and test error small.

Capacity, Overfitting and Underfitting

These two factors correspond to the two central challenges in machine learning: underfitting and overfitting.

Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set.

Overfitting occurs when the gap between the training error and test error is too large.

Capacity, Overfitting and Underfitting

We can control whether a model is more likely to overfit or underfit by altering its capacity.

A model's capacity is its ability to fit a wide variety of functions.

Models with low capacity may struggle to fit the training set.

Models with high capacity can overfit by memorizing properties of the training set that do not serve them well on the test set.

Capacity, Overfitting and Underfitting

A model's capacity can be changed in many different ways, for example:

Changing the number of input features it has.

Adding new parameters associated with those features.

Choosing which family of functions the learning algorithm can select from when varying the parameters to reduce a training objective. This is called the representational capacity of the model.

In many cases, finding the best function within this family is a very difficult optimization problem.

In practice, the learning algorithm does not actually find the best function, but merely one that significantly reduces the training error.

Capacity, Overfitting and Underfitting

Machine learning algorithms will generally perform best when their capacity is appropriate for the true complexity of the task they need to perform and the amount of training data they are provided with.

Models with insufficient capacity are unable to solve complex tasks, which may lead to underfitting.

Models with high capacity can solve complex tasks, but when their capacity is higher than needed to solve the present task, they may overfit.

Introduction to Regularization

Definition and objective

Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.
The goal of regularization is to prevent overfitting by imposing strategies such as:
Putting extra constraints on a machine learning model
Adding extra terms to the objective function
Imposing ensemble methods

Introduction to Regularization

Bias-variance trade-off

Regularization of an estimator works by trading increased bias for reduced variance.

What is Regularization? An Intuitive view

What's my rule?
1 2 3 ⇐ satisfies the rule
4 5 6 ⇐ satisfies the rule
7 8 9 ⇐ satisfies the rule
9 2 32 ⇐ does it satisfy the rule?
The answer is “yes” to the first 3 sequences and “no” to all others.
Plausible rules:
3 consecutive single digits
3 consecutive integers
3 numbers in ascending order
3 numbers whose sum is less than 25
1, 4, 7 in the first column
Regularization corresponds to methods used in machine learning to improve generalization performance (avoid overfitting the training data).

Parameter Norm Penalties

Supervised learning (Inference)

Predict the label (dependent variable) Y given data X.

Supervised learning (Learning)

Update θ through the gradient of the objective function ∇J

Norms (Definition)

A way to measure the size of a vector:

∥x∥p = (Σ_i |xi|^p)^(1/p)

Thus, the L1 norm and L2 norm are, respectively:

∥x∥1 = Σ_i |xi|   and   ∥x∥2 = (Σ_i |xi|²)^(1/2)
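A quick NumPy check of the two norms (the vector is chosen for illustration):

    import numpy as np

    x = np.array([3.0, -4.0])
    l1 = np.sum(np.abs(x))           # L1 norm: |3| + |-4| = 7
    l2 = np.sqrt(np.sum(x ** 2))     # L2 norm: sqrt(9 + 16) = 5
    assert np.isclose(l1, np.linalg.norm(x, 1))
    assert np.isclose(l2, np.linalg.norm(x, 2))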

Squared L2 Norm

The squared L2 norm is used instead of the original L2 norm for regularization in machine learning tasks.

All of the derivatives of the L2 norm depend on the entire vector.

The derivatives of the squared L2 norm with respect to each element of x depend only on the corresponding element of x.

Squared L2 Norm

x^T = [x1, x2]

L2 norm:

f(x) = (x1² + x2²)^(1/2)

∂f/∂x1 = x1 (x1² + x2²)^(−1/2)

Squared L2 norm:

f(x) = x1² + x2²

∂f/∂x1 = 2x1

Norm penalties

Limit the capacity of models by adding a norm penalty Ω(θ) to the objective function J.

The model is not modified in the inference phase; the penalty is added to the objective function only in the learning phase.
This is also known as weight decay.

L2 Regularization

A penalty on the L2 norm of the parameters is commonly known as L2 weight decay, or simply weight decay.
Let the original objective function (loss function) with weights w be denoted J(w).
After adding the squared L2-norm penalty, the regularized loss function is
J̃(w) = J(w) + (α/2) ∥w∥²
Gradient of the total objective function:
∇J̃(w) = ∇J(w) + αw
To take a single gradient step to update the weights, we perform this update:
w_{t+1} = w_t − ϵ(∇J(w_t) + αw_t)
        = w_t − ϵ∇J(w_t) − ϵαw_t
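A small sketch of this update with an illustrative quadratic loss (the loss and constants are assumptions, not from the slides):

    import numpy as np

    def step_with_weight_decay(w, grad_J, alpha=0.01, eps=0.1):
        # w <- w - eps * (grad J(w) + alpha * w)
        return w - eps * (grad_J(w) + alpha * w)

    # example: J(w) = ||w - 1||^2, so grad J(w) = 2(w - 1)
    w = np.zeros(3)
    for _ in range(100):
        w = step_with_weight_decay(w, lambda w: 2 * (w - 1.0))
    print(w)   # pulled toward 1 by J, shrunk toward 0 by the alpha*w term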

L2 Regularization - Analysis

Let w* be the optimal solution in the absence of regularization [i.e., ∇J(w*) = 0].
Consider u = w − w*. Using a Taylor series approximation up to 2nd order:
J(w* + u) = J(w*) + u^T ∇J(w*) + (1/2) u^T H u
J(w) = J(w*) + (w − w*)^T ∇J(w*) + (1/2) (w − w*)^T H (w − w*)
     = J(w*) + (1/2) (w − w*)^T H (w − w*)   [∵ ∇J(w*) = 0]
Taking the derivative:
∇J(w) = ∇J(w*) + H(w − w*)
      = H(w − w*)   [∵ ∇J(w*) = 0]
Gradient of the total objective function (in the presence of L2 regularization):
∇J̃(w) = ∇J(w) + αw
       = H(w − w*) + αw

L2 Regularization - Analysis

Let w̃ be the optimal solution in the presence of regularization, i.e., ∇J̃(w̃) = 0. Then:
H(w̃ − w*) + αw̃ = 0
⇒ (H + αI) w̃ = H w*
⇒ w̃ = (H + αI)^(−1) H w*
If α → 0 then w̃ → w*, i.e., there is no regularization.

L2 Regularization - Analysis

What happens when α ≠ 0?
Since H is symmetric positive semidefinite,
H = QΛQ^T, where Q is orthogonal, i.e., QQ^T = Q^T Q = I.
w̃ = (H + αI)^(−1) H w*
  = (QΛQ^T + αI)^(−1) QΛQ^T w*
  = (QΛQ^T + αQIQ^T)^(−1) QΛQ^T w*
  = [Q(Λ + αI)Q^T]^(−1) QΛQ^T w*
  = Q(Λ + αI)^(−1) Q^T QΛQ^T w*
  = Q(Λ + αI)^(−1) ΛQ^T w*   [∵ Q^(−1) = Q^T]

L2 Regularization - Analysis

We see that the effect of weight decay is to rescale w* along the axes defined by the eigenvectors of H.
Specifically, the component of w* that is aligned with the i-th eigenvector of H is rescaled by a factor of λi / (λi + α).
Along the directions where the eigenvalues of H are relatively large, for example where λi ≫ α, the effect of regularization is relatively small.
However, components with λi ≪ α will be shrunk to have nearly zero magnitude.
Only directions along which the parameters contribute significantly to reducing the objective function are preserved relatively intact.
Summary: the weight vector w* is moved to w̃ by L2 regularization. All of its components shrink along the eigenvector axes, but some shrink more than others. This ensures that only important directions keep large weights (a numerical illustration follows below).
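A tiny numerical check of the λi / (λi + α) rescaling, using a hypothetical diagonal Hessian with one large and one small eigenvalue:

    import numpy as np

    H = np.diag([10.0, 0.01])          # eigenvalues 10 and 0.01
    w_star = np.array([1.0, 1.0])
    alpha = 0.1

    w_tilde = np.linalg.solve(H + alpha * np.eye(2), H @ w_star)
    print(w_tilde)   # ~[0.990, 0.091]: a direction with lambda >> alpha barely
                     # shrinks; one with lambda << alpha shrinks to near zero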
L2 Regularization - Analysis

How do these effects relate to machine learning in particular? We can find out by studying linear regression.
Applying the analysis again, we obtain a special case of the same results.
For linear regression, the cost function is the sum of squared errors:
(Xw − y)^T (Xw − y)
When we add L2 regularization, the objective function changes to
(Xw − y)^T (Xw − y) + (1/2) α w^T w
This changes the normal equations for the solution from
w = (X^T X)^(−1) X^T y   to   w = (X^T X + αI)^(−1) X^T y
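The regularized normal equations are easy to sketch in NumPy (the data here are synthetic, for illustration only):

    import numpy as np

    def ridge_fit(X, y, alpha=1.0):
        # closed form: w = (X^T X + alpha*I)^(-1) X^T y
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
    print(ridge_fit(X, y, alpha=0.0))    # ordinary least squares
    print(ridge_fit(X, y, alpha=10.0))   # weights shrunk toward zero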

L2 Regularization - Analysis

1 The matrix X^T X is proportional to the covariance matrix (1/m) X^T X.
2 Using L2 regularization replaces this matrix with (X^T X + αI); the added αI ensures that the inverse exists.
3 L2 regularization causes the learning algorithm to “perceive” the input X as having higher variance, which makes it shrink the weights on features whose covariance with the output target is low compared to this added variance.

L1 Regularization

1 Let the objective function be J(w) and the norm penalty be Ω(w).
2 The regularized objective function is
J̃(w) = J(w) + Ω(w)
3 Substituting the L1 norm for Ω(w):
J̃(w) = J(w) + α ∥w∥1
4 The gradient of the objective function is
∇J̃(w) = ∇(J(w) + α ∥w∥1)
        = ∇J(w) + α sign(w)
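A sketch of the resulting subgradient step (taking sign(0) = 0; the loss and constants are illustrative assumptions):

    import numpy as np

    def l1_subgradient_step(w, grad_J, alpha=0.1, eps=0.05):
        # w <- w - eps * (grad J(w) + alpha * sign(w))
        return w - eps * (grad_J(w) + alpha * np.sign(w))

    # example: J(w) = ||w - t||^2 with target t = (1, -1, 0)
    w = np.array([0.5, -0.3, 0.01])
    for _ in range(100):
        w = l1_subgradient_step(w, lambda w: 2 * (w - np.array([1.0, -1.0, 0.0])))
    print(w)   # the coordinate whose optimum is 0 hovers near zero;
               # the others settle slightly below |1|, shrunk by the penalty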

L1 Regularization - Analysis

Effect of L1 regularization:
1 The regularization contribution to the gradient no longer scales linearly with each wi; instead, it is a constant factor with a sign equal to sign(wi).
2 It has no algebraic solution in general.
3 The approximate analysis again uses the quadratic approximation, whose gradient is
∇Ĵ(w) = H(w − w*)

Comparison L1 and L2 norm Regularization

The L1 norm is commonly used in machine learning when the difference between zero and nonzero elements is very important.
L2 regularization does not cause the parameters to become exactly zero.
L1 regularization may cause parameters to become zero for large enough α.

Comparison L1 and L2 norm Regularization

Usually, the bias is excluded from the penalty term:
J̃(θ) = J(θ) + αΩ(w)
The reasons are:
The biases typically require less data to fit than the weights.
Each weight specifies how two variables interact, while a bias specifies the interaction of only one variable.
Regularizing the bias parameters can cause underfitting.

Norm Penalties as Constrained Optimization

Constrained optimization

Sometimes one may wish to find the maximal or minimal value of f(x) for values of x in some set S.

Expression of constrained function

Expressing a function together with such a constraint condition directly is difficult.

Generalized Lagrange function

A possible approach is to design a different, unconstrained optimization problem whose solution can be converted into a solution to the original constrained problem.

This unconstrained optimization function is called the “generalized Lagrange function”.

Generalized Lagrange function

The generalized Lagrange function is defined as:
L(x, λ, α) = f(x) + Σ_i λi g^(i)(x) + Σ_j αj h^(j)(x)
where the constrained region is:
S = {x | ∀i, g^(i)(x) = 0 and ∀j, h^(j)(x) ≤ 0}
We can find an optimal x in region S by solving:
min_x max_λ max_{α, α≥0} L(x, λ, α)

Norm Penalties with respect to Constrained Optimization

Cost function regularized by a parameter norm penalty:
J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)
(the original cost function plus the norm penalty, which acts as the constrained term)
If we wanted to constrain Ω(θ) to be less than some constant k, we could construct a generalized Lagrange function
L(θ, α; X, y) = J(θ; X, y) + α(Ω(θ) − k).

Norm Penalties with respect to Constrained Optimization

Generalized Lagrange function:
L(θ, α; X, y) = J(θ; X, y) + α(Ω(θ) − k)
The solution to the constrained problem is given by
θ* = arg min_θ max_{α, α≥0} L(θ, α).
α must increase whenever Ω(θ) > k and decrease whenever Ω(θ) < k.
All positive α encourage Ω(θ) to shrink. The optimal value α* will encourage Ω(θ) to shrink, but not so strongly as to make Ω(θ) become less than k.

Regularization and Under-Constrained Problems

In some cases, regularization is necessary for machine learning problems to be properly defined.
Many linear models in machine learning, including linear regression and PCA, depend on inverting the matrix X^T X.
If X^T X is singular, then the matrix inverse does not exist.
When will the matrix be singular?
whenever the data-generating distribution truly has no variance in some direction, or
when no variance is observed in some direction because there are fewer examples (rows of X) than input features (columns of X).

Regularization and Under-Constrained Problems

To regularize, invert (X^T X + αI) instead. This regularized matrix is guaranteed to be invertible.
These linear problems have closed-form solutions when the relevant matrix is invertible.
It is also possible for a problem with no closed-form solution to be underdetermined.
An example is logistic regression applied to a problem where the classes are linearly separable:
if a weight vector w is able to achieve perfect classification, then 2w will also achieve perfect classification, with higher likelihood.

Dataset Augmentation

The best way to make a machine learning model generalize better is to train it on more data.
The amount of data we have is limited.
One option is to create fake data and add it to the training set.
Dataset augmentation is a technique that creates such fake data and adds it to the training set.
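For image data, a few label-preserving transformations can be sketched as follows (the specific transforms and magnitudes are illustrative assumptions, not from the slides):

    import numpy as np

    def augment(image, rng):
        # hypothetical augmentations for a 2-D image array
        return [image,
                np.fliplr(image),                        # horizontal flip
                np.roll(image, shift=2, axis=0),         # small translation
                image + rng.normal(scale=0.01, size=image.shape)]  # pixel noise

    rng = np.random.default_rng(0)
    image = rng.random((28, 28))
    augmented = augment(image, rng)   # 4 training examples from 1 original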

Dataset Augmentation

Typically, more data = better learning.
Works well for image classification / object recognition tasks.
Also shown to work well for speech.
For some tasks it may not be clear how to generate such data.

Noise Robustness

Noise applied at the inputs

Injecting noise into the input of a neural network can also be seen as a form of data augmentation.
Neural networks prove not to be very robust to noise, however. One way to improve the robustness of neural networks is simply to train them with random noise applied to their inputs.

Noise Robustness

Noise applied at the weights

Injecting random noise into the weights improves robustness.
This makes the model relatively insensitive to small variations in the weights.
This can be interpreted as a stochastic implementation of Bayesian inference over the weights.

Noise Robustness

Injecting noise at the output targets

Most datasets have some number of mistakes in the y labels.
To account for this, we can explicitly model the noise on the labels.

Early Stopping

Stop the execution of an algorithm early, before its performance deteriorates further.
Why does the performance of an algorithm deteriorate?
When a large, complex model is trained for many steps, its training error decreases continuously.
However, the validation error decreases only up to some step, then increases and never comes back down, i.e., the model's performance deteriorates.
So we stop the execution of the algorithm before the performance deteriorates further.

Early Stopping
How to stop early?
1 Execute the algorithm iteratively.
2 Track the validation error.
3 Have a patience parameter p.
4 If you are at step k and there was no improvement in validation error in the previous p steps, then stop training and return the model stored at step k − p.
5 Basically, stop the training early before it drives the training error to 0 and blows up the validation error.
Early Stopping

1 Very effective and the most widely used form of regularization.
2 Can be used even with other regularizers (such as L2).
3 How does it act as a regularizer?

How does Early Stopping act as a regularizer?
(Mathematical analysis)

1 Let J(w) be the objective function to be minimized.
2 Let w* be the optimal solution in the absence of regularization [i.e., ∇J(w*) = 0].
3 Consider u = w − w*. Using a Taylor series approximation up to 2nd order:
J(w* + u) = J(w*) + u^T ∇J(w*) + (1/2) u^T H u
J(w) = J(w*) + (w − w*)^T ∇J(w*) + (1/2) (w − w*)^T H (w − w*)
     = J(w*) + (1/2) (w − w*)^T H (w − w*)   [∵ ∇J(w*) = 0]
4 Taking the derivative:
∇J(w) = ∇J(w*) + H(w − w*)
      = H(w − w*)   [∵ ∇J(w*) = 0]

How does Early Stopping act as a regularizer?
(Mathematical analysis)

1 Now the SGD update rule is:
w_t = w_{t−1} − η ∇J(w_{t−1})
    = w_{t−1} − ηH(w_{t−1} − w*)
    = w_{t−1} − ηHw_{t−1} + ηHw*
    = (I − ηH) w_{t−1} + ηHw*
2 Subtracting w* from both sides, we get
w_t − w* = (I − ηH) w_{t−1} + ηHw* − w*
         = (I − ηH) w_{t−1} − (I − ηH) w*
         = (I − ηH)(w_{t−1} − w*)
3 Initializing the parameters as w_0 = 0, the first step gives
w_t − w* = −(I − ηH) w*

How does Early Stopping act as a regularizer?
(Mathematical analysis)

1 Using the EVD of H as H = QΛQ^T, we get
w_t − w* = −(I − ηQΛQ^T) w*
⇒ w_t = −(I − ηQΛQ^T) w* + w*
      = w* − (I − ηQΛQ^T) w*
      = [I − (I − ηQΛQ^T)] w*
      = [QQ^T − (QQ^T − ηQΛQ^T)] w*
      = Q[I − (I − ηΛ)] Q^T w*
2 After τ iterations, this becomes
w_τ = Q[I − (I − ηΛ)^τ] Q^T w*
⇒ Q^T w_τ = [I − (I − ηΛ)^τ] Q^T w*   (1)

How does Early Stopping act as a regularizer?
(Mathematical analysis)

1 Now the expression for Q^T w̃ under L2 regularization can be rearranged as:
Q^T w̃ = (Λ + αI)^(−1) Λ Q^T w*
       = [I − (Λ + αI)^(−1) α] Q^T w*   (2)
2 Comparing equations (1) and (2), we see that if the hyperparameters η, α and τ are chosen such that
(I − ηΛ)^τ = (Λ + αI)^(−1) α
3 then L2 regularization and early stopping can be seen as equivalent.

Algorithm
Objective - To determine the best amount of time to train the model
1: n : number of steps between evaluations
2: p : the number of times to observe a worsening validation set error before giving up (stands for patience)
3: θ0 : initial parameter vector
4: θ ← θ0
5: i ← 0, j ← 0, v ← ∞, θ* ← θ, i* ← i   # θ* : best parameter set
6: while j < p do
7:   Update θ by executing the training algorithm for n steps
8:   i ← i + n
9:   v′ ← ValidationSetError(θ)
10:  if v′ < v then
11:    j ← 0
12:    θ* ← θ
13:    i* ← i
14:    v ← v′
15:  else
16:    j ← j + 1
17:  end if
18: end while
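A minimal Python sketch of the same loop; train_n_steps and validation_error are assumed user-supplied callbacks, and model is any object holding the parameters θ:

    import copy

    def train_with_early_stopping(model, train_n_steps, validation_error, n=100, p=5):
        best_err, best_model, best_step = float("inf"), copy.deepcopy(model), 0
        i, j = 0, 0
        while j < p:
            train_n_steps(model, n)        # update theta for n steps
            i += n
            err = validation_error(model)
            if err < best_err:             # improvement: store model, reset patience
                best_err, best_step, j = err, i, 0
                best_model = copy.deepcopy(model)
            else:
                j += 1                     # no improvement: lose one unit of patience
        return best_model, best_step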

Parameter Tying

1 Parameter dependency
1 L2 regularization (weight decay) penalizes model parameters for deviating from the fixed value of zero.
2 Sometimes we need other ways to express prior knowledge about parameters.
3 We may know from the domain and model architecture that there should be some dependencies between model parameters.
2 The goal of parameter tying
1 We want to express that certain parameters should be close to one another.

Parameter Tying
(A scenario of parameter tying)

1 Consider two models performing the same classification task with the same set of classes, but with somewhat different input distributions:
2 model A with parameters W^(A),
3 model B with parameters W^(B).
4 The two models map the input to two different, but related, outputs:
ŷ^(A) = f(W^(A), x)
ŷ^(B) = g(W^(B), x)

Parameter Tying
(L2 penalty for parameter tying)

1 If the tasks are similar enough (perhaps with similar input and output distributions), then we believe that the model parameters should be close to each other:
∀i, wi^(A) ≈ wi^(B)
2 We can leverage this information via regularization.
3 Use a parameter norm penalty (other choices are possible), as sketched below:
Ω(W^(A), W^(B)) = ∥W^(A) − W^(B)∥2²
so J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)
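A small sketch of this penalty and its gradients (the function name and the value of α are illustrative):

    import numpy as np

    def tying_penalty(W_A, W_B, alpha=0.1):
        # Omega = alpha * ||W_A - W_B||_2^2, plus gradients w.r.t. each matrix
        diff = W_A - W_B
        return alpha * np.sum(diff ** 2), 2 * alpha * diff, -2 * alpha * diff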

Parameter Tying
(L2 penalty for parameter tying)

1 This approach regularizes the parameters of one model, trained as a classifier in a supervised paradigm, to be close to the parameters of another model, trained in an unsupervised paradigm.
2 The architectures were constructed such that many of the parameters in the classifier model could be paired, i.e., parameters of the supervised classifier can be paired with corresponding parameters in the unsupervised model.

Parameter Sharing

1 The parameter norm penalty is one way to regularize parameters to be close to one another, but a more popular way is to use constraints: to force sets of parameters to be equal. This method of regularization is referred to as parameter sharing.
2 Here we interpret the various models or model components as sharing a unique set of parameters.
3 Parameter sharing forces sets of parameters to be equal.

Parameter Sharing
Advantage

1 A significant advantage of parameter sharing over regularizing the parameters to be close is that only a subset of the parameters needs to be stored in memory.
2 Example - Parameter sharing is used extensively in CNNs applied to computer vision.
How?
Natural images have many statistical properties that are invariant to translation.
E.g. - Cat on a wall.
Cat on a tree.
Two cats are fighting.

Sparse Representation

1 Sparse means most values are zeros. Two flavors:
1 Parameter sparsity (L1 penalty on the parameters)
2 Representational sparsity (L1 penalty on the elements of the representation)

Parameter Sparsity

1 Weight decay places the penalty directly on the model parameters.
2 Alternatively, the penalty can be placed on the activations of the units in a neural network (here the activations become sparse).
3 An L1 penalty induces parameter sparsity.

Representational Sparsity

1 Representational sparsity describes a representation where many of the elements of the representation are zero or close to zero.
2 Consider a linear regression represented as
Bh = y
where h is a sparse representation of the data x. That is, h is a function of x that in some sense represents the information present in x, but does so with a sparse vector.
3 Regularized loss function: J̃(θ; x, y) = J(θ; x, y) + αΩ(h)
4 Representational sparsity: Ω(h) = ∥h∥1 = Σ_i |hi|

Orthogonal matching pursuit

1 Other approaches obtain representational sparsity with a hard constraint on the activation values, such as orthogonal matching pursuit (OMP).
2 OMP encodes an input x with the representation h that solves the constrained optimization problem
arg min_{h, ∥h∥0 < k} ∥x − Wh∥²
where ∥h∥0 is the number of non-zero entries of h.
3 This problem can be solved efficiently when W is constrained to be orthogonal. The method is often called OMP-k, with the value of k specifying the number of non-zero features allowed.
4 OMP-1 is a very effective feature extractor for deep architectures.

Bagging and other ensemble methods
Background

1 Bagging, also called bootstrap aggregating, is a technique for reducing generalization error by combining several models.
2 Train several different models separately, then have all of the models vote on the output for test examples.
3 This is an example of a general strategy in machine learning called model averaging.
4 Techniques that employ this strategy are known as ensemble methods.
5 The reason that model averaging works is that different models will usually not make all the same errors on the test set.

Bagging and other ensemble methods
Approach

1 Consider for example a set of k regression models.
2 Suppose that each model makes an error ϵi on each example.
3 Suppose the errors are drawn from a zero-mean multivariate normal distribution with
variance E(ϵi²) = v
and covariance E(ϵi ϵj) = c.
4 Then the error made by the average prediction of all ensemble models is
(1/k) Σ_i ϵi
5 The expected squared error of the ensemble predictor is
E[((1/k) Σ_i ϵi)²] = (1/k²) E[Σ_i (ϵi² + Σ_{j≠i} ϵi ϵj)]
                  = (1/k) v + ((k−1)/k) c
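A quick simulation of this formula with hypothetical values of k, v and c:

    import numpy as np

    rng = np.random.default_rng(0)
    k, v, c = 10, 1.0, 0.2
    # zero-mean errors with Var(eps_i) = v and Cov(eps_i, eps_j) = c
    cov = np.full((k, k), c) + np.eye(k) * (v - c)
    eps = rng.multivariate_normal(np.zeros(k), cov, size=100_000)

    empirical = np.mean(eps.mean(axis=1) ** 2)   # E[(1/k * sum_i eps_i)^2]
    theoretical = v / k + (k - 1) / k * c
    print(empirical, theoretical)                # both close to 0.28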
Bagging and other ensemble methods
Approach

1 In the case where the errors are perfectly correlated and c = v, this reduces to v, and model averaging does not help at all.
2 If the errors are perfectly uncorrelated and c = 0, then the expected squared error of the ensemble is only (1/k) v.
3 On average, the ensemble will perform at least as well as any of its members.
4 If the members make independent errors, the ensemble will perform significantly better than its members.
5 Question - is the error of the ensemble smaller than the average error of its members?

Bagging method

1 Bagging is a method that allows the same kind of model, training algorithm and objective function to be reused several times.
2 At the outset, this method constructs k different datasets.
3 Each dataset has the same number of examples as the original dataset, but each dataset is built by sampling with replacement from the original dataset.
4 This means each dataset may contain several duplicate examples, and the resampled datasets differ from one another; it is these differences that make the trained models differ, so that their average performs better than the individual models.
5 Model i is then trained on dataset i.
6 This method is powerful in reducing generalization error.
7 Model complexity and memory requirements are higher.
Dropout
Background(Why/How)

1 Bagging and other ensemble techniques require training multiple large neural network models.
2 Training and evaluating such networks is costly in terms of runtime and memory.
3 In this case, bagging and straightforward ensembling are impractical.
4 Dropout therefore provides a computationally inexpensive but powerful method of regularizing a broad family of models.

Dropout Method

1 It is a method to reduce generalization error by dropping units (excluding the output layer) in a neural network.
2 How?
It multiplies the output of the corresponding unit or units by 0.
3 Method
A random binary mask is sampled and multiplied with the unit outputs (see the sketch at the end of this section).

Dropout - How does it select units?

1 It selects units by following some sequence, or selects them at random.
2 Then the error that occurs after dropping is measured.
3 Example -

Dropout - How does it select units?

1 Dropout is an inexpensive technique that uses the radial basis function.
2 Dropout also aims to approximate the bagging process, but with an exponentially large number of subnetworks ($2^k$, for k droppable units).
3 For training the network, the full batch is not considered; rather, a minibatch-based learning algorithm is used in this process.

Optimization for Training Deep Learning Models
Introduction

1 Of all the many optimization problems involved in deep learning, the most
difficult is neural network training.
2 It is quite common to invest days to months of time on hundreds of machines in order to solve even a single instance of the neural network training problem.
3 Because this problem is so important and so expensive, a specialized set of optimization techniques has been developed for solving it.
4 The task is to find the parameters θ of a neural network that significantly reduce a cost function J(θ).

How Learning Differs from Pure Optimization?

1 The primary difference between machine learning and pure optimization is that machine learning often acts indirectly. Typically the ML setup is as follows:
1 Goal - Increase some performance metric P
2 Method - Minimize some J(θ) with hopes of optimizing P
2 This is in contrast to pure optimization where minimizing J(θ) is the goal
itself.

How Learning Differs from Pure Optimization?

1 In the context of deep learning, the loss function can be written as an average over the training set:
$J(\theta) = E_{(x,y) \sim \hat{p}_{data}} L(f(x; \theta), y)$
where $L$ is the per-example loss function, $f(x; \theta)$ is the predicted output for $x$, $y$ is the actual output for $x$, and $\hat{p}_{data}$ is the empirical distribution.
2 Here the expectation is over the training data; we would usually prefer to minimize the expectation over the data-generating distribution $p_{data}$ rather than over the finite training set:
$J^*(\theta) = E_{(x,y) \sim p_{data}} L(f(x; \theta), y)$

Empirical Risk Minimization

1 The quantity $J^*$ is referred to as the risk.
2 If we knew $p_{data}$, risk minimization would be reduced to a standard optimization task.
3 Since $p_{data}$ is not known, we minimize the empirical risk:
$E_{(x,y) \sim \hat{p}_{data}} [L(f(x; \theta), y)] = \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \theta), y^{(i)})$
4 Optimizing this is called empirical risk minimization.
5 Rather than optimizing the risk directly, we optimize the empirical risk
and hope that the risk decreases significantly as well.
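A minimal sketch of computing the empirical risk, assuming squared error as the per-example loss L (any per-example loss could be substituted):

    import numpy as np

    def empirical_risk(f, theta, X, Y):
        """(1/m) * sum_i L(f(x_i; theta), y_i), with squared error as the assumed L."""
        losses = [(f(x, theta) - y) ** 2 for x, y in zip(X, Y)]
        return float(np.mean(losses))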

What is the problem with empirical risk minimization?

1 Empirical risk minimization is prone to overfitting. Models with high enough capacity can simply memorize the training set.
2 Furthermore, the most effective optimization algorithms rely on gradient
descent. However, many useful loss functions have no useful gradient.
3 The above two problems mean that in the context of deep learning, we
cannot usually use empirical risk minimization.

Surrogate Loss Function

1 Often exactly minimizing a loss function is intractable.


2 Instead of minimizing the empirical risk, we minimize a surrogate loss function.
3 A surrogate loss function acts as a proxy for empirical risk: it is easier to minimize, it is differentiable, and it can improve robustness.
4 Example: the negative log-likelihood loss is a surrogate for the 0-1 loss in classification.
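To make the example concrete, a sketch of the two losses for one binary example (a label y in {0, 1} and predicted probability p are assumed): the 0-1 loss is piecewise constant, so it provides no useful gradient, while the log-likelihood surrogate is smooth:

    import numpy as np

    def zero_one_loss(p, y):
        return float((p >= 0.5) != y)   # flat almost everywhere: gradient 0 or undefined

    def nll_surrogate(p, y):
        # smooth and differentiable in p, so gradient-based training can make progress
        return -(y * np.log(p) + (1 - y) * np.log(1 - p))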

Early Stopping

1 Another difference between pure optimization and machine learning optimization is that we will often use early stopping based on the true underlying loss function.
2 This helps prevent overfitting: training may halt while the surrogate loss still has large gradients, whereas pure optimization typically stops only when the gradient has become small.

Batch and Minibatch Algorithms

1 Optimization algorithms that use the entire training set to compute the
gradient are called batch or deterministic gradient methods.
2 Algorithms that use a single training example for that task are called stochastic or online gradient methods.
3 Algorithms that use b training examples are called minibatch methods, where 1 < b < m and m is the size of the entire training set.
4 Larger batch sizes provide a more accurate estimate of the gradient, but
with less than linear returns.
5 Max batch size will be limited by available memory.
6 Small batches can offer a regularizing effect.
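A minimal sketch of the minibatch gradient estimator (grad_fn, the per-example gradient function, is an assumed helper): b = 1 gives the stochastic case and b = m the full-batch case:

    import numpy as np

    def minibatch_gradient(grad_fn, theta, X, Y, b, rng=None):
        """Average per-example gradients over a randomly drawn size-b minibatch."""
        rng = rng or np.random.default_rng(0)
        idx = rng.choice(len(X), size=b, replace=False)
        return np.mean([grad_fn(theta, X[i], Y[i]) for i in idx], axis=0)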

Challenges in Neural Network Optimization
Optimization is a Hard Problem

1 General optimization by itself is an extremely difficult task.


2 Traditionally ML has avoided this difficulty by carefully designing the
objective function and constraints to ensure the problem is convex.
3 Convex optimization has its own complications; moreover, the problems we usually face with deep models are non-convex, so that simplification is not available.
4 The following are the main challenges for neural network optimization.

Ill Conditioning

1 Ill conditioning of the Hessian matrix is a prominent problem in most numerical optimization problems, convex or otherwise.
2 Ill conditioning is manifested in SGD by causing the algorithm to get
“stuck” in the sense that even very small steps increase the cost function.
3 Even if the algorithm does not get stuck, learning will proceed very slowly when the Hessian matrix has a large condition number.
4 Recall the second-order Taylor approximation of the cost function: a gradient descent step to $x = x^{(0)} - \epsilon g$ changes the cost as
$f(x^{(0)} - \epsilon g) \approx f(x^{(0)}) - \epsilon g^T g + \frac{1}{2} \epsilon^2 g^T H g$
5 Ill-conditioning becomes a problem when $\frac{1}{2} \epsilon^2 g^T H g > \epsilon g^T g$: learning becomes very slow despite a strong gradient, because $\epsilon$ must be made small enough to compensate for the strong curvature.
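A toy numeric check of this criterion, with an assumed gradient g and an ill-conditioned Hessian H chosen purely for illustration:

    import numpy as np

    g = np.array([1.0, 0.1])                  # gradient
    H = np.diag([100.0, 1.0])                 # Hessian with a large condition number
    eps = 0.1
    curvature = 0.5 * eps**2 * g @ H @ g      # second-order term added back to the cost
    descent = eps * g @ g                     # first-order decrease from the gradient
    print(curvature > descent)                # True: this step would increase the cost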
Issue of Local Minima
1 When optimizing a convex function, any local minimum is guaranteed to be a global minimum.
2 For non-convex functions (the general case for neural networks), there can be an extremely large number of local minima.
3 The functions involved in deep models are guaranteed to have an extremely large number of local minima, which can be an issue.
4 Local minima are particularly problematic when they have high cost compared to the global minimum.
5 It is an open question, but researchers largely suspect that these local minima are not a serious problem in practice.

Issue of Saddle Point
1 For high-dimensional non-convex functions, local minima and maxima are in fact rare compared to saddle points.
2 A saddle point is a point where the Hessian matrix has both positive and
negative eigenvalues.
3 We can think of a saddle point as being a local minimum along one cross-
section of the cost function and a local maximum along another cross-
section.

Issue of Saddle Point

1 Many classes of random functions exhibit the following behaviour:


1 In lower dimensional spaces, local minima and maxima are common.
2 In higher dimensional spaces, local minima and maxima are rare, and saddle
points are much more common.
2 For a function f : Rn → R, the expected ratio of the number of saddle
points to local minima grows exponentially with n.

Issue of Saddle Point
Implications

1 Having many saddle points will make first-order methods struggle.


2 Second-order methods (e.g., Newton's method) seek a point with zero gradient, and are therefore liable to converge to saddle points.
3 Dauphin et al. [2014] describe a saddle-free variant of Newton's method that works much better in these situations, but it is not easy to scale.

Cliffs and Exploding Gradients
1 Neural networks with many layers often have extremely steep regions resembling cliffs.
2 Cliffs result from the multiplication of several large weights together.
3 On the face of an extremely steep cliff structure, the gradient update step can move the parameters extremely far, usually jumping off the cliff structure altogether.
4 We can deal with these cliff structures using gradient clipping.
5 Cliffs are common in RNNs due to the multiplication of many factors.

Long Term Dependency

1 This problem arises when the computational graph is very deep; the result is vanishing and exploding gradients.

Stochastic Gradient Descent(SGD)

1 Stochastic gradient descent (SGD) and its variants are probably the most
used optimization algorithms for machine learning in general and for deep
learning in particular.

SGD
Explanation

1 The dataset is $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$.
2 Select a minibatch of M examples from the N samples.
3 The error is minimized by computing the gradient on the minibatch alone.
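A minimal SGD loop along these lines; grad_fn (returning the average gradient over a minibatch) and the constant step-size schedule are assumptions for illustration:

    import numpy as np

    def sgd(grad_fn, theta, X, Y, M=32, steps=1000, lr=lambda k: 0.01, rng=None):
        """Minibatch SGD: at each step, descend along the gradient of a size-M batch."""
        rng = rng or np.random.default_rng(0)
        for k in range(steps):
            idx = rng.choice(len(X), size=M, replace=False)
            theta = theta - lr(k) * grad_fn(theta, X[idx], Y[idx])
        return theta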

SGD
Analysis

1 The convergence of SGD is guaranteed if
$\sum_{k=1}^{\infty} \epsilon_k = \infty$ and $\sum_{k=1}^{\infty} \epsilon_k^2 < \infty$
2 The learning rate can be decayed linearly until a certain iteration $\tau$:
$\epsilon_k = (1 - \alpha) \epsilon_0 + \alpha \epsilon_\tau$ with $\alpha = \frac{k}{\tau}$
After iteration $\tau$, it is common to leave $\epsilon$ constant (see the sketch after this list).
3 The learning rate of SGD can be obtained by trial and error. The user or programmer has to inspect the learning curve, plotted as a function of time.
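A minimal sketch of that schedule; the values of eps0, eps_tau, and tau below are assumed for illustration:

    def learning_rate(k, eps0=0.1, eps_tau=0.001, tau=1000):
        """Linear decay: eps_k = (1 - alpha) * eps0 + alpha * eps_tau, alpha = k / tau."""
        if k >= tau:
            return eps_tau          # after iteration tau, leave epsilon constant
        alpha = k / tau
        return (1 - alpha) * eps0 + alpha * eps_tau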

SGD
Analysis

1 Select an initial $\epsilon_0$, run the algorithm several times, and record the best-performing setting; then change $\epsilon_0$ and check whether the performance improves on the previous run.
2 The minibatch size can further be increased from M to M + j during training so that the $\epsilon_i$ can be selected properly.
3 Q: Can we improve the performance of the SGD method?
A: Yes, by adding momentum to the process.

Momentum

1 SGD is a popular optimization strategy but it can be slow.


2 The momentum method accelerates learning when
1 Facing high curvature
2 Small but consistent gradients
3 Noisy gradients
3 The momentum algorithm accumulates an exponentially decaying moving
average of past gradients and continues to move in their direction.

Momentum Definition
1 The momentum algorithm introduces a variable $v$ that plays the role of velocity.
2 Velocity is the direction and speed at which the parameters move through parameter space.
3 The velocity is set to an exponentially decaying average of the negative
gradient.
4 The name momentum derives from a physical analogy, in which the negative gradient is a force moving a particle through parameter space, according to Newton's laws of motion.

Momentum Definition

1 Momentum in physics is mass times velocity: momentum $= Mv$. In the momentum learning algorithm, we assume unit mass, so $M = 1$ and the momentum equals $v$.
2 A hyperparameter $\alpha \in [0, 1)$ determines how quickly the contributions of previous gradients exponentially decay.
3 The update rule is given by:
$v \leftarrow \alpha v - \epsilon \nabla_\theta \Big( \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \theta), y^{(i)}) \Big)$
$\theta \leftarrow \theta + v$
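A minimal sketch of one such update, assuming the minibatch gradient g of the average loss has already been computed; alpha and eps are illustrative defaults:

    def momentum_step(theta, v, g, alpha=0.9, eps=0.01):
        """Classical momentum: v accumulates a decaying average of past gradients."""
        v = alpha * v - eps * g
        theta = theta + v
        return theta, v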

SGD With Momentum Algorithm

Analysis of SGD and SGD-Momentum

1 Both methods reduce the number of steps needed to reach a good solution.
2 However, neither method adapts the learning rate; the schedule is fixed in advance, i.e. $\epsilon_0 > \epsilon_1 > \epsilon_2 > \dots > \epsilon_\tau$.
3 If $\epsilon_{i+1} > \epsilon_i$ is required for some $i$ in the range $(0, \tau)$, then neither SGD nor SGD-momentum is suitable.

Nesterov Momentum

1 The update rule in Nesterov momentum is
$v \leftarrow \alpha v - \epsilon \nabla_\theta \Big[ \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \theta + \alpha v), y^{(i)}) \Big]$
$\theta \leftarrow \theta + v$
where the parameters $\alpha$ and $\epsilon$ play roles similar to those in the standard momentum method.
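A minimal sketch of one Nesterov update; grad_fn, which evaluates the minibatch gradient at a given parameter value, is an assumed helper:

    def nesterov_step(theta, v, grad_fn, alpha=0.9, eps=0.01):
        g = grad_fn(theta + alpha * v)   # gradient evaluated AFTER applying the velocity
        v = alpha * v - eps * g
        theta = theta + v
        return theta, v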

Nesterov Momentum Vs. Standard Momentum

1 The difference between Nesterov momentum and standard momentum is where the gradient is evaluated.
2 With Nesterov momentum, the gradient is evaluated after the current velocity is applied, whereas standard momentum evaluates it at the current parameters.
3 One can interpret Nesterov momentum as an attempt to add a correction factor to the standard momentum method.

SGD with Nesterov Momentum Algorithm

Algorithms with Adaptive Learning Rates

1 The learning rate is reliably one of the most difficult hyperparameters to set, because it has a significant impact on model performance.
2 Cost is often highly sensitive to some directions in parameter space and
insensitive to others.
3 The momentum algorithm can mitigate these issues somewhat, but does
so at the expense of introducing another hyperparameter.
4 Is there another way?
If we believe that the directions of sensitivity are somewhat axis-aligned,
it can make sense to use a separate learning rate for each parameter, and
automatically adapt these learning rates throughout the course of learning.

AdaGrad Algorithm

AdaGrad Algorithm

1 The AdaGrad algorithm individually adapts the learning rates of all model parameters by scaling them inversely proportional to the square root of the sum of all of their historical squared values.
2 The parameters with the largest partial derivative of the loss have a correspondingly rapid decrease in their learning rate, while parameters with small partial derivatives have a relatively small decrease in their learning rate.
3 Empirically, it has been found that for training deep neural network models, the accumulation of squared gradients from the beginning of training can result in a premature and excessive decrease in the effective learning rate.
4 AdaGrad performs well for some but not all deep learning models.
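A minimal sketch of one AdaGrad update; the step size eps is an assumed value, while delta matches the stability constant quoted in the comparison below:

    import numpy as np

    def adagrad_step(theta, r, g, eps=0.01, delta=1e-7):
        r = r + g * g                                    # full history of squared gradients
        theta = theta - eps * g / (delta + np.sqrt(r))   # per-parameter learning rates
        return theta, r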

RMSProp

1 SGD, SGD with momentum, and AdaGrad are suitable for convex func-
tions but RMSProp is good for non-convex functions.
2 It modifies AdaGrad for a non-convex setting.
3 It changes the gradient accumulation into an exponentially weighted moving average.
4 AdaGrad shrinks the learning rate according to the entire history of the
squared gradient and may have made the learning rate too small before
arriving at such a convex structure.
5 RMSProp uses an exponentially decaying average to discard history from
the extreme past so that it can converge rapidly.
6 The use of the moving average introduces a new hyperparameter, ρ, that controls the length scale of the moving average.
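A minimal sketch of one RMSProp update; eps and rho are assumed values, while delta matches the stability constant in the comparison below:

    import numpy as np

    def rmsprop_step(theta, r, g, eps=0.001, rho=0.9, delta=1e-6):
        r = rho * r + (1 - rho) * g * g                # exponentially weighted moving average
        theta = theta - eps * g / np.sqrt(delta + r)
        return theta, r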

RMSProp Algorithm

AdaGrad Vs. RMSProp

Adagrad:
1 No decay rate.
2 Numerical stability constant $\delta = 10^{-7}$.
3 Variation along the path is more.
4 Time to converge is more.
5 For a convex bowl, the result is not that fast.

RMSProp:
1 Exponential decay rate.
2 Numerical stability constant $\delta = 10^{-6}$.
3 Variation along the path is less.
4 Time to converge is less compared to Adagrad.
5 For a convex bowl, RMSProp produces results faster.

RMSProp algorithm with Nesterov momentum

Characteristics of RMSProp with Nesterov momentum

1 It handles variation in the learning rate.
2 It converges faster than Adagrad.
3 The key takeaway of the algorithm is the selection of the momentum coefficient $\alpha$.
4 The selection of $\alpha$ is carried out through several simulations of a particular algorithm.

Adam

1 The name “Adam” derives from the phrase “adaptive moments.”


2 It is a combination of RMSProp and momentum, with a few important distinctions.
3 First, in Adam, momentum is incorporated directly as an estimate of the first-order moment of the gradient.
4 The most straightforward way to add momentum to RMSProp would be to apply momentum to the rescaled gradients.
5 Second, Adam includes bias corrections to the estimates of both the first-order moments and the second-order moments to account for their initialization at the origin.
6 Adam is generally regarded as being fairly robust to the choice of hyperparameters.
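A minimal sketch of one Adam update; the defaults for eps, rho1, rho2, and delta are the customary choices, assumed here rather than taken from the slides:

    import numpy as np

    def adam_step(theta, s, r, g, t, eps=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
        s = rho1 * s + (1 - rho1) * g            # first-moment (momentum) estimate
        r = rho2 * r + (1 - rho2) * g * g        # second-moment estimate
        s_hat = s / (1 - rho1 ** t)              # bias corrections; t counts from 1
        r_hat = r / (1 - rho2 ** t)
        theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)
        return theta, s, r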

Adam Algorithm

Thank You
