
Deep Learning Module 2

Different Parameter Initialization Methods

1. Zero Initialization:

Description: All weights and biases are set to zero initially.

Issue: This approach is not commonly used in deep learning because it leads to symmetry in the gradients during backpropagation. As a result, all neurons in the network learn the same features, limiting the model's capacity to represent diverse patterns and reducing the effectiveness of learning.

2. Random Initialization:

Description: Weights and biases are initialized randomly from a specified distribution, such as a uniform or normal distribution.

Advantages: Random initialization breaks symmetry in the network, allowing each neuron to learn different features independently. This randomness helps prevent neurons from getting stuck during training and promotes diverse feature learning.

Common Technique: Random initialization is the most widely used technique in deep learning, as it provides flexibility and encourages effective training.

3. Xavier Initialization (Glorot Initialization):

Description: Weights are initialized from a normal distribution with mean $0$ and variance $\frac{1}{n}$, where $n$ is the number of neurons in the previous layer.

Purpose: Xavier initialization aims to ensure that the variance of activations remains consistent across layers, preventing activations from exploding or vanishing during training.

Suitability: It is particularly suitable for activation functions like sigmoid or hyperbolic tangent (tanh), which are sensitive to the scale of the inputs.
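
Below is a minimal NumPy sketch of the three schemes for a single fully connected layer; the layer sizes and the 0.01 scale for plain random initialization are illustrative assumptions, not values from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 256, 128  # example layer sizes (assumed for illustration)

# 1. Zero initialization: every neuron starts identical, so gradients stay
#    symmetric and all neurons learn the same features.
W_zero = np.zeros((n_in, n_out))

# 2. Random initialization: values drawn from a normal distribution break
#    the symmetry (the 0.01 scale is an arbitrary but common choice).
W_random = rng.normal(0.0, 0.01, size=(n_in, n_out))

# 3. Xavier/Glorot initialization: variance 1/n_in keeps the activation
#    variance roughly constant from layer to layer.
W_xavier = rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

print(W_xavier.std())  # close to sqrt(1/256) ~= 0.0625
```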

| Aspect | L1 Regularization | L2 Regularization |
|---|---|---|
| Also Known As | Lasso Regression | Ridge Regression |
| Penalty Term | Absolute value of weights | Squared value of weights |
| Effect on Weights | Encourages sparse solutions (some weights become 0) | Shrinks weights towards 0 (but not exactly 0) |
| Sparsity | High sparsity | Low sparsity |
| Feature Selection | Can be used for feature selection | Does not perform feature selection |
| Effect on Model Complexity | Simpler models with fewer features | Smoother models with all features |
| Computational Efficiency | Computationally more expensive due to absolute values | Computationally less expensive |
| Robustness to Outliers | Less sensitive to outliers | More sensitive to outliers |
| Bias-Variance Tradeoff | Tends to have higher bias and lower variance | Tends to have lower bias and higher variance |
| Application | Suitable when feature selection is important | Suitable when all features are relevant |
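
To make the penalty terms in the table concrete, here is a small NumPy sketch; the weight vector, the strength `lam`, and the stand-in data loss are all made-up example values.

```python
import numpy as np

w = np.array([0.5, -1.2, 0.0, 3.0])   # example weight vector (assumed)
lam = 0.01                            # regularization strength (assumed)

def l1_penalty(w, lam):
    # Lasso-style term: lam * sum of absolute weights -> encourages sparsity
    return lam * np.sum(np.abs(w))

def l2_penalty(w, lam):
    # Ridge-style term: lam * sum of squared weights -> shrinks weights
    return lam * np.sum(w ** 2)

base_loss = 0.42  # stand-in for the data loss (e.g., MSE)
print(base_loss + l1_penalty(w, lam))  # L1-regularized loss
print(base_loss + l2_penalty(w, lam))  # L2-regularized loss
```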

Overfitting

Definition: Overfitting occurs when a machine learning model learns to perform well on the training data but fails to generalize to unseen data from the same distribution.

Cause: Overfitting typically happens when the model becomes too complex relative to the amount and variability of the training data. As a result, the model starts to memorize the training examples rather than learning general patterns or relationships in the data.

Symptoms:

High accuracy on the training data but poor performance on unseen data.

The model may exhibit excessive sensitivity to noise or irrelevant features in the training data.

In extreme cases, the model may produce wildly inaccurate predictions or classifications on unseen data.

Consequences: Overfitting can lead to models that are unreliable or unusable in real-world scenarios, as they fail to generalize beyond the specific examples seen during training.

Strategies to mitigate overfitting:

1. Increase Training Data:

Description: Increasing the amount of training data provides the model with more diverse examples, helping it to learn generalizable patterns rather than memorizing specific instances.

Advantages: More data can help the model capture the underlying distribution of the data more accurately, reducing the chances of overfitting.

Considerations: Acquiring additional data may not always be feasible or cost-effective, but techniques like data augmentation can help artificially increase the size of the training dataset by applying transformations such as rotations, flips, and scaling to existing data.

2. Regularization:

Description: Regularization techniques add constraints to the model's optimization process, discouraging it from fitting the training data too closely and preventing overfitting.

L1 Regularization: Adds a penalty term to the loss function proportional to the absolute values of the model's weights. This encourages sparsity in the weight matrix, effectively reducing the number of features used by the model.

L2 Regularization: Adds a penalty term to the loss function proportional to the squared magnitudes of the model's weights. This penalizes large weights and encourages smoother solutions.

Advantages: Regularization techniques help prevent overfitting by reducing the model's capacity and complexity, making it less prone to fitting noise in the training data.

Considerations: The choice between L1 and L2 regularization, as well as the strength of the regularization penalty, may require tuning based on the specific characteristics of the dataset and model.

3. Feature Selection:

Description: Selecting relevant features and eliminating irrelevant or noisy ones can improve the model's ability to generalize by focusing on the most informative aspects of the data.

Advantages: By reducing the dimensionality of the input space, feature selection can help prevent overfitting by reducing the model's complexity and the potential for fitting noise.

Considerations: Feature selection methods such as univariate feature selection, recursive feature elimination, or feature importance ranking based on model coefficients or tree-based algorithms can be used to identify and retain the most relevant features.

4. Cross-Validation:

Description: Cross-validation techniques involve partitioning the training data into multiple subsets, training the model on different subsets, and evaluating its performance on the remaining subsets. This provides a more accurate estimate of the model's performance on unseen data than training on a single fixed validation set.

Advantages: Cross-validation helps assess the generalization performance of the model more reliably, reducing the risk of overfitting to a specific validation set.

Considerations: Techniques such as k-fold cross-validation, leave-one-out cross-validation, or stratified cross-validation can be used to partition the data and evaluate the model's performance across multiple folds.

5. Early Stopping:

Description: Early stopping involves monitoring the model's performance on a validation set during training and halting the training process when the performance starts to degrade, indicating that the model is overfitting.

Advantages: Early stopping prevents the model from continuing to train beyond the point of optimal performance, reducing the risk of overfitting to the training data.

Considerations: The choice of when to stop training (e.g., based on the validation loss or validation accuracy) and the threshold for determining performance degradation may require experimentation and tuning based on the specific characteristics of the dataset and model.

Describe the effect on bias and variance when a neural network is modified with a larger number of hidden units followed by dropout regularization

Increasing the number of hidden units in a neural network and applying dropout regularization can have a significant impact on the bias and variance of the model. Let's break down the effect of each modification:

1. Increasing the Number of Hidden Units:

Bias: Increasing the number of hidden units generally reduces the bias of the model. With more hidden units, the network becomes more expressive and capable of capturing complex relationships in the data. This allows the model to better fit the training data and reduce bias, as it can learn more intricate patterns.

Variance: However, increasing the number of hidden units also tends to increase the variance of the model. This is because a larger network has more parameters, making it more susceptible to overfitting. The model may start memorizing noise or outliers in the training data, leading to high variance and poor generalization to unseen data.

2. Applying Dropout Regularization:

Bias: Dropout regularization typically has little to no effect on the bias of the model. Dropout randomly deactivates a fraction of neurons during training, which can prevent overfitting by reducing co-adaptation between neurons. However, since dropout is only applied during training and not during inference, it doesn't fundamentally change the model's bias.

Variance: Dropout regularization is primarily used to reduce variance in the model. By randomly dropping neurons, dropout introduces noise into the training process, effectively creating an ensemble of slightly different models. This ensemble averaging helps reduce overfitting and variance, leading to better generalization performance on unseen data.

Suppose a supervised learning problem is given to model a deep feedforward neural network. Suggest solutions for the following:

a) small sized dataset for training

b) dataset with unlabeled data

c) large dataset but data from a different distribution.

For each scenario provided, here are the suggested solutions:

a) Small Sized Dataset for Training:

Data Augmentation: If possible, augment the existing data by applying transformations such as rotation, scaling, cropping, or adding noise. This can artificially increase the size of the dataset and provide more diverse examples for the model to learn from.

Transfer Learning: Pretrain a neural network on a larger dataset or a related task and fine-tune it on the smaller dataset. This leverages the knowledge learned from the larger dataset and helps the model generalize better to the smaller dataset (see the sketch after this list).

Regularization: Apply regularization techniques such as weight decay, dropout, or early stopping to prevent overfitting and promote generalization to unseen data.

Use Simple Models: Instead of deep neural networks, consider using simpler models with fewer parameters, such as linear models or decision trees, which may generalize better with limited data.
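
A hedged PyTorch sketch of the transfer-learning option: freeze a pretrained backbone and train only a new output head. The `.fc` attribute name for the head is an assumption (it matches torchvision ResNets); adapt it to the actual backbone you use.

```python
import torch.nn as nn

def build_finetune_model(pretrained_model: nn.Module, num_classes: int) -> nn.Module:
    """Freeze a pretrained backbone and replace its head for a small dataset.

    Assumes the backbone exposes its classification head as `.fc`
    (as torchvision ResNets do); adapt the attribute name otherwise.
    """
    for param in pretrained_model.parameters():
        param.requires_grad = False          # freeze all backbone weights
    in_features = pretrained_model.fc.in_features
    pretrained_model.fc = nn.Linear(in_features, num_classes)  # new trainable head
    return pretrained_model
```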

b) Dataset with Unlabeled Data:

Semi-Supervised Learning: Utilize semi-supervised learning techniques, where the model is trained on both labeled and unlabeled data. Methods such as self-training, co-training, or pseudo-labeling can be effective in leveraging unlabeled data to improve model performance (a pseudo-labeling sketch follows this list).

Unsupervised Pretraining: Pretrain the neural network using unsupervised learning techniques such as autoencoders or generative adversarial networks (GANs) on the unlabeled data. Then, fine-tune the pretrained model on the labeled data for the specific task.

Use Pretrained Models: Utilize pretrained models trained on large-scale datasets, such as ImageNet for image classification or Word2Vec for natural language processing tasks. Fine-tune these pretrained models on the labeled data for the target task, which can often lead to better performance than training from scratch.
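
A minimal self-training (pseudo-labeling) sketch using scikit-learn; the logistic regression classifier is a stand-in for any model, and the 0.95 confidence threshold is an illustrative assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pseudo_label_round(X_lab, y_lab, X_unlab, threshold=0.95):
    """One round of self-training: fit on labeled data, adopt confident
    predictions on unlabeled data as pseudo-labels, and refit."""
    model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    proba = model.predict_proba(X_unlab)
    confident = proba.max(axis=1) >= threshold         # keep confident examples only
    pseudo = model.classes_[proba[confident].argmax(axis=1)]
    X_aug = np.vstack([X_lab, X_unlab[confident]])
    y_aug = np.concatenate([y_lab, pseudo])
    return LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
```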

c) Large Dataset but Data from a Different Distribution:

Domain Adaptation: Apply domain adaptation techniques to align the distributions of data from different sources or domains. Methods such as adversarial training, domain adversarial neural networks (DANN), or discrepancy-based approaches can help the model adapt to the differences in distribution between datasets.

Ensemble Learning: Train multiple models on different subsets of the data or using different architectures. Combine the predictions of these models using techniques such as averaging or stacking to improve robustness to distribution shifts.

Data Preprocessing: Perform data preprocessing techniques such as domain-specific normalization, feature scaling, or domain-specific feature engineering to reduce the discrepancies between datasets and make them more compatible for training.

Fine-Tuning: Start with a model pretrained on a large dataset that is more representative of the target distribution. Fine-tune this pretrained model on the dataset of interest, which can help the model adapt to the specific characteristics of the new data distribution.

Explain the following:

i) Early stopping
ii) Dropout
iii) Injecting noise at input
iv) Parameter sharing and tying

i) Early Stopping:

Early stopping is a regularization technique used to prevent overfitting during the training of neural networks.

It involves monitoring the performance of the model on a separate validation dataset during training.

The training process is halted when the performance on the validation set starts to degrade or stagnate, indicating that the model is overfitting.

By stopping the training early, it helps prevent the model from memorizing noise in the training data and encourages it to generalize better to unseen data.

Early stopping strikes a balance between training the model long enough to learn useful patterns and preventing it from overfitting by terminating training when performance on the validation set begins to worsen.

ii) Dropout:

Dropout is a regularization technique used during training of neural networks to prevent overfitting.

It works by randomly deactivating (dropping out) a fraction of neurons in a layer during each training iteration.

This forces the network to learn more robust and generalized features, as it cannot rely on any single neuron to always be present.

During inference (testing), all neurons are used, but their outputs are scaled by the keep probability to compensate for the neurons that were deactivated during training.

Dropout helps in preventing co-adaptation between neurons and encourages the network to learn more diverse representations of the data, improving its generalization performance.
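
A minimal NumPy sketch of a dropout layer. Note it uses the "inverted" variant, which rescales at training time so inference needs no change; the notes describe the classic variant that scales at test time instead, and the two are equivalent in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate=0.5, training=True):
    """Inverted dropout: zero each unit with probability `rate` during
    training and rescale the survivors; do nothing at inference."""
    if not training or rate == 0.0:
        return activations
    mask = rng.random(activations.shape) >= rate   # keep each unit with prob 1-rate
    return activations * mask / (1.0 - rate)       # rescale surviving activations

h = np.ones((2, 4))
print(dropout(h, rate=0.5))   # roughly half the entries zeroed, the rest scaled to 2.0
```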

iii) Injecting Noise at Input:

Injecting noise at the input is a form of data augmentation and a regularization technique used to improve the robustness of neural networks.

It involves adding random noise to the input data before feeding it into the network during training.

The noise can take various forms, such as Gaussian noise, random jitter, or dropout-like masking.

Injecting noise helps the network become more tolerant to variations and uncertainties in the input data, making it less sensitive to small perturbations and noise.

By exposing the network to a diverse range of input patterns with added noise, it learns to generalize better and becomes more robust to noise in the test data.
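
A short NumPy sketch of the Gaussian case; the noise standard deviation is an assumed hyperparameter that would normally be tuned.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(x, std=0.1):
    # Corrupt each input with zero-mean Gaussian noise before the forward pass;
    # std controls how strong the perturbation is (assumed value here).
    return x + rng.normal(0.0, std, size=x.shape)

batch = np.array([[0.2, 0.8], [0.5, 0.1]])
noisy_batch = add_gaussian_noise(batch)   # fed to the network instead of `batch`
```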

iv) Parameter Sharing and Tying:

Parameter sharing and tying are techniques used to reduce the number of parameters in a neural network model, thereby improving its efficiency and generalization.

In parameter sharing, the same set of parameters (weights) is reused across different parts of the network, typically in convolutional neural networks (CNNs).

For example, in CNNs, the same filter is applied to different locations of the input image, allowing the network to learn spatially invariant features.

Parameter tying involves constraining certain parameters in the model to be equal to each other.

For example, in natural language processing tasks, the input word embeddings and the output projection weights can be tied to reduce the number of parameters and improve generalization.

Parameter sharing and tying help in learning more compact and transferable representations, especially when the amount of training data is limited.
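
A hedged PyTorch sketch of the weight-tying example: the input embedding and the output projection of a language model share one parameter tensor. The vocabulary and embedding sizes are illustrative assumptions.

```python
import torch.nn as nn

vocab_size, embed_dim = 10000, 256        # assumed sizes for illustration

embedding = nn.Embedding(vocab_size, embed_dim)
decoder = nn.Linear(embed_dim, vocab_size, bias=False)

# Tie the output projection to the input embedding: both modules now share
# one (vocab_size x embed_dim) parameter tensor, halving these parameters.
decoder.weight = embedding.weight

assert decoder.weight.data_ptr() == embedding.weight.data_ptr()
```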

Suppose that a model does well on the training set, but only achieves an accuracy of 85% on the validation set. You conclude that the model is overfitting, and plan to use L1 or L2 regularization to fix the issue. However, you learn that some of the examples in the data may be incorrectly labeled. Which form of regularization would you prefer to use and why?

In this scenario, where there is a possibility of incorrectly labeled examples in the dataset, L1 regularization would be preferred over L2 regularization. The reason is that L1 regularization tends to result in sparsity in the model's parameters by pushing some of them to exactly zero. This sparsity property makes L1 regularization more robust to outliers or incorrectly labeled examples compared to L2 regularization.

Here's why L1 regularization is preferable in this situation:

1. Robustness to Outliers: L1 regularization penalizes the absolute values of the model parameters, which tends to result in sparsity. Parameters that are not contributing significantly to the model are pushed to zero. This makes the model less sensitive to outliers or incorrectly labeled examples because their influence on the overall loss function is reduced.

2. Feature Selection: L1 regularization has the additional benefit of performing feature selection by automatically setting the weights of irrelevant or redundant features to zero. This can help mitigate the impact of incorrectly labeled examples by effectively ignoring features that are not informative or noisy.

3. Limited Influence of Individual Weights: Because L1 regularization penalizes the magnitude of the parameters directly (rather than the squared magnitude, as L2 does), its gradient does not grow with the weight, and the sparsity it induces prevents mislabeled examples from inflating many weights at once. This limits the impact such examples can have on the model's performance.

Derivation of the Weight Updating Rule in Gradient Descent

To derive the weight updating rule in gradient descent, we need to compute the gradient of the error function with respect to the weights and then use this gradient to update the weights. We'll consider two common error functions: Mean Squared Error (MSE) and Cross Entropy.

General Gradient Descent Rule:

For a weight $w$, the weight update rule in gradient descent is:

$$w \leftarrow w - \eta \frac{\partial E}{\partial w}$$

where $\eta$ is the learning rate and $\frac{\partial E}{\partial w}$ is the gradient of the error function $E$ with respect to the weight $w$.

a) Mean Squared Error (MSE)

The Mean Squared Error for a single training example is defined as:

$$E = \frac{1}{2}(y - \hat{y})^2$$

where $y$ is the true label and $\hat{y}$ is the predicted output.

For a linear model with output $\hat{y} = wx + b$ (ignoring the bias for simplicity), we need to compute the gradient of the error function with respect to $w$.

1. Compute the derivative of $E$ with respect to $\hat{y}$:

$$\frac{\partial E}{\partial \hat{y}} = \frac{\partial}{\partial \hat{y}}\left(\frac{1}{2}(y - \hat{y})^2\right) = -(y - \hat{y})$$

2. Compute the derivative of $\hat{y}$ with respect to $w$:

$$\frac{\partial \hat{y}}{\partial w} = \frac{\partial}{\partial w}(wx) = x$$

3. Apply the chain rule to get the gradient of $E$ with respect to $w$:

$$\frac{\partial E}{\partial w} = \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w} = -(y - \hat{y})\,x$$

4. Substitute into the weight update rule:

$$w \leftarrow w - \eta \frac{\partial E}{\partial w} = w + \eta (y - \hat{y})\,x$$

So, the weight update rule for MSE is:

$$w \leftarrow w + \eta (y - \hat{y})\,x$$
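
A tiny numeric check of this rule on one assumed training example $(x, y) = (2, 3)$; the learning rate and initial weight are also assumed.

```python
eta, w = 0.1, 0.0                  # learning rate and initial weight (assumed)
x, y = 2.0, 3.0                    # single training example (assumed)

for _ in range(50):
    y_hat = w * x                  # linear model prediction
    w = w + eta * (y - y_hat) * x  # derived rule: w <- w + eta * (y - y_hat) * x

print(w)   # converges toward y / x = 1.5, where the squared error is zero
```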

b) Cross Entropy

The Cross Entropy Error for a single training example is defined as:

$$E = -\left[y \log(\hat{y}) + (1 - y)\log(1 - \hat{y})\right]$$

where $y$ is the true label (0 or 1) and $\hat{y}$ is the predicted probability.

For a logistic regression model, $\hat{y} = \sigma(z)$, where $z = wx$ and $\sigma(z) = \frac{1}{1 + e^{-z}}$.

1. Compute the derivative of $E$ with respect to $\hat{y}$:

$$\frac{\partial E}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}$$

2. Compute the derivative of $\hat{y}$ with respect to $z$ (where $z = wx$):

$$\frac{\partial \hat{y}}{\partial z} = \sigma(z)(1 - \sigma(z)) = \hat{y}(1 - \hat{y})$$

3. Compute the derivative of $z$ with respect to $w$:

$$\frac{\partial z}{\partial w} = x$$

4. Apply the chain rule to get the gradient of $E$ with respect to $w$:

$$\frac{\partial E}{\partial w} = \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w} = \left(-\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}\right)\hat{y}(1 - \hat{y})\,x = (\hat{y} - y)\,x$$

5. Substitute into the weight update rule:

$$w \leftarrow w - \eta \frac{\partial E}{\partial w} = w - \eta(\hat{y} - y)\,x$$

So, the weight update rule for cross entropy is:

$$w \leftarrow w - \eta(\hat{y} - y)\,x$$
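
The same rule run on a toy logistic regression problem; the four 1-D examples, learning rate, and epoch count are assumed for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

eta, w = 0.5, 0.0                       # assumed learning rate and initial weight
X = np.array([-2.0, -1.0, 1.0, 2.0])    # toy 1-D inputs (assumed)
Y = np.array([0, 0, 1, 1])              # binary labels

for _ in range(200):
    for x, y in zip(X, Y):
        y_hat = sigmoid(w * x)          # predicted probability
        w = w - eta * (y_hat - y) * x   # derived rule: w <- w - eta*(y_hat - y)*x

print(w, sigmoid(w * X))   # w > 0; predicted probabilities end up near the labels
```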

1. Gradient Descent (GD)

Weight Update Equation:

$$w \leftarrow w - \eta \nabla L(w)$$

Explanation:

$w$: Weights of the model.

$\eta$: Learning rate, a small positive value that controls the step size.

$\nabla L(w)$: Gradient of the loss function $L$ with respect to the weights.

How It Works:

Compute the gradient of the loss function with respect to all weights.

Update the weights by subtracting a fraction of the gradient (determined by the learning rate).

Repeat for all data points (the entire dataset) in each iteration (epoch).

Problems:

Computationally expensive for large datasets, as it requires computing the gradients using the entire dataset.

Can get stuck in local minima or saddle points.

Slow convergence for high-dimensional data.

2. Stochastic Gradient Descent (SGD)

Weight Update Equation:

$$w \leftarrow w - \eta \nabla L(w; x_i, y_i)$$

Explanation:

Uses one training example $(x_i, y_i)$ at a time to update the weights.

How It Works:

Compute the gradient of the loss function with respect to a single data point.

Update the weights immediately after computing the gradient for that single data point.

Repeat for each data point in the dataset.

Advantages:

Faster updates compared to GD, allowing for more frequent adjustments to the weights.

Introduces noise into the gradient computation, which can help escape local minima.

Problems:

Noisy updates can cause the loss function to fluctuate.

May never converge to the minimum but oscillate around it.

3. Mini-Batch Stochastic Gradient Descent (Mini-Batch SGD)

Weight Update Equation:

$$w \leftarrow w - \eta \nabla L(w; B)$$

Explanation:

Uses a small batch of training examples $B$ to update the weights.

How It Works:

Compute the gradient of the loss function with respect to a small batch of data points.

Update the weights after computing the gradient for the entire mini-batch.

Repeat for all mini-batches in the dataset.

Advantages:

Reduces the variance in the gradient updates compared to SGD.

Provides a compromise between the frequent updates of SGD and the stability of GD.

Can make better use of vectorized operations and hardware acceleration.

Problems:

Still might face issues with oscillations and slow convergence.
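
The three variants above differ only in how many examples feed each gradient. Here is a minimal NumPy sketch for linear regression with assumed toy data: setting `batch_size = 1` gives SGD, `batch_size = len(X)` gives full-batch GD, and anything in between is mini-batch SGD.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                              # toy inputs (assumed)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)

def grad(w, Xb, yb):
    # MSE gradient for a linear model, computed on one batch
    return Xb.T @ (Xb @ w - yb) / len(yb)

w, eta, batch_size = np.zeros(3), 0.1, 16   # hyperparameters (assumed)

for epoch in range(100):
    idx = rng.permutation(len(X))           # shuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]   # one mini-batch of indices
        w -= eta * grad(w, X[b], y[b])

print(w)   # close to the true weights [1.0, -2.0, 0.5]
```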

4. SGD with Momentum

Weight Update Equations:

$$v_t = \gamma v_{t-1} + \eta \nabla L(w)$$
$$w \leftarrow w - v_t$$

Explanation:

$v_t$: Velocity term that accumulates the gradient.

$\gamma$: Momentum term (typically between 0.5 and 0.9).

How It Works:

Compute the gradient of the loss function.

Update the velocity term by combining the previous velocity and the current gradient.

Update the weights using the velocity term.

Advantages:

Helps accelerate SGD in relevant directions and dampens oscillations.

Uses the past gradients to smooth out the updates.

Reduces the effects of small, inconsistent gradient directions.
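
A direct transcription of the two equations into NumPy; the constant stand-in gradient is assumed just to show the velocity accumulating.

```python
import numpy as np

def sgd_momentum_step(w, v, grad, eta=0.01, gamma=0.9):
    """One momentum update: v_t = gamma*v_{t-1} + eta*grad, then w <- w - v_t."""
    v = gamma * v + eta * grad
    return w - v, v

w, v = np.zeros(2), np.zeros(2)
g = np.array([1.0, -0.5])              # stand-in gradient (assumed constant)
for _ in range(3):
    w, v = sgd_momentum_step(w, v, g)  # steps grow as velocity accumulates
print(w, v)
```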

5. Nesterov Accelerated Gradient (NAG)

Weight Update Equations:

$$v_t = \gamma v_{t-1} + \eta \nabla L(w - \gamma v_{t-1})$$
$$w \leftarrow w - v_t$$

Explanation:

Computes the gradient at the predicted future position of the parameters.

How It Works:

Predict the future position of the weights using the current velocity.

Compute the gradient of the loss function at this future position.

Update the velocity term using this gradient.

Update the weights using the new velocity term.

Advantages:

More responsive to changes in the gradient direction compared to standard momentum.

Provides a look-ahead mechanism that improves convergence.
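
A sketch of the look-ahead step; the toy quadratic loss $\lVert w \rVert^2$ and its gradient $2w$ are assumed for illustration.

```python
import numpy as np

def nag_step(w, v, grad_fn, eta=0.01, gamma=0.9):
    """Nesterov update: evaluate the gradient at the look-ahead point
    w - gamma*v, then v_t = gamma*v + eta*grad(look-ahead), w <- w - v_t."""
    lookahead = w - gamma * v
    v = gamma * v + eta * grad_fn(lookahead)
    return w - v, v

grad_fn = lambda w: 2.0 * w           # gradient of the toy loss ||w||^2 (assumed)
w, v = np.array([5.0]), np.array([0.0])
for _ in range(100):
    w, v = nag_step(w, v, grad_fn)
print(w)                              # approaches the minimum at 0
```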

6. Adagrad

Weight Update Equations:

$$G_t = G_{t-1} + (\nabla L(w))^2$$
$$w \leftarrow w - \frac{\eta}{\sqrt{G_t + \epsilon}} \nabla L(w)$$

Explanation:

$G_t$: Sum of the squares of past gradients.

$\epsilon$: Small constant to prevent division by zero.

How It Works:

Compute the gradient of the loss function.

Accumulate the squared gradients.

Adjust the learning rate for each weight based on the accumulated gradients.

Update the weights using the adjusted learning rate.

Advantages:

Adapts the learning rate for each parameter based on historical gradient information.

Works well for sparse data.

Scales down the learning rate for parameters with large gradients.
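
The update in NumPy, run on an assumed toy loss with very uneven curvature to show the per-parameter scaling.

```python
import numpy as np

def adagrad_step(w, G, grad, eta=0.1, eps=1e-8):
    """Adagrad: accumulate squared gradients, then scale each weight's step."""
    G = G + grad ** 2
    w = w - eta * grad / np.sqrt(G + eps)
    return w, G

w, G = np.array([5.0, 5.0]), np.zeros(2)
for _ in range(200):
    grad = 2.0 * w * np.array([1.0, 100.0])  # toy loss w1^2 + 100*w2^2 (assumed)
    w, G = adagrad_step(w, G, grad)
print(w, G)   # the steep coordinate accumulates a large G, so its
              # effective learning rate eta/sqrt(G) becomes much smaller
```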

Summary

Each optimizer improves upon the previous one by addressing specific issues:

GD: Simple but slow and computationally expensive.

SGD: Faster updates but noisy and may not converge.

Mini-Batch SGD: Compromise between GD and SGD, reducing variance in updates.

SGD with Momentum: Smoother updates and faster convergence by using past gradients.

NAG: Further improved convergence by anticipating future gradients.

Adagrad: Adaptive learning rates based on past gradient information, useful for sparse data.

State how to apply early stopping in the context of learning using Gradient Descent. Why is it necessary to use a validation set (instead of simply using the test set) when using early stopping?

Early stopping is applied in the context of learning using Gradient Descent by monitoring the performance of the model on a validation set during training. The training process is stopped when the performance on the validation set starts to degrade or stagnate, indicating that the model is beginning to overfit.

Here's how early stopping is implemented (a sketch of the loop follows the list):

1. Training Phase: During the training phase, the model is trained on the training dataset using Gradient Descent or its variants (e.g., Stochastic Gradient Descent). The performance of the model is evaluated periodically (after each epoch or after a certain number of iterations) on the validation dataset.

2. Validation Phase: After evaluating the model on the validation dataset, the performance metric (e.g., validation loss or accuracy) is monitored. If the performance metric does not improve or starts to degrade over several consecutive evaluations, early stopping is triggered.

3. Early Stopping Criterion: The early stopping criterion is typically defined based on the behavior of the validation metric. For example, early stopping may be triggered if the validation loss does not decrease for a certain number of epochs or if it increases for a certain number of consecutive epochs.

4. Termination of Training: Once early stopping is triggered, the training process is terminated, and the model parameters from the epoch with the best validation performance are retained as the final model.
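
A hedged sketch of that loop with patience-based stopping; `train_one_epoch`, `validation_loss`, `snapshot`, and `restore` are hypothetical helpers standing in for whatever training framework is used.

```python
def train_with_early_stopping(model, patience=5, max_epochs=100):
    """Halt when validation loss fails to improve for `patience` epochs.

    `train_one_epoch`, `validation_loss`, `snapshot`, and `restore` are
    assumed helpers, not part of any specific library."""
    best_loss, best_weights, epochs_without_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                     # one pass of gradient descent
        val_loss = validation_loss(model)          # evaluate on the validation set
        if val_loss < best_loss:
            best_loss, best_weights = val_loss, snapshot(model)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                              # early stopping triggered
    restore(model, best_weights)                   # keep the best-epoch weights
    return model
```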

It is necessary to use a validation set (instead of simply using the test set) when using early stopping because:

Prevents Overfitting to the Test Set: If early stopping were based on test set performance, the model could potentially overfit to the test set, resulting in optimistic estimates of generalization performance. Using a separate validation set ensures that the model's performance is evaluated on unseen data that is not used for training or model selection.

Prevents Data Leakage: Using the test set for early stopping could introduce data leakage, where information from the test set influences the training process. This violates the principle of using the test set only for final evaluation and can lead to biased performance estimates.

Allows for Model Selection: By using a validation set, early stopping allows for model selection based on performance metrics independent of the test set. This ensures that the final model is selected based on its ability to generalize to unseen data.

Describe one advantage of using the Adam optimizer instead of basic gradient descent

One advantage of using the Adam optimizer instead of basic gradient descent is its ability to adaptively adjust the learning rate for each parameter in the neural network. This adaptiveness is achieved through the use of momentum and adaptive learning rate techniques.

Here's how the Adam optimizer offers this advantage:

1. Adaptive Learning Rate: Adam computes individual learning rates for different parameters based on estimates of the first and second moments of the gradients. It maintains separate learning rates for each parameter, allowing it to automatically adjust the learning rate for each parameter according to the magnitude of its gradients. This means that parameters with large gradients will have smaller learning rates, and parameters with small gradients will have larger learning rates. As a result, Adam can handle situations where the gradients of different parameters vary widely, allowing for more efficient and stable training.

2. Momentum: The Adam optimizer incorporates momentum, which helps accelerate convergence and dampen oscillations during training. By maintaining a moving average of past gradients, Adam effectively smooths the gradient updates, allowing for more stable progress towards the optimum. This momentum term helps in navigating complex loss surfaces, especially in high-dimensional parameter spaces, and can lead to faster convergence compared to basic gradient descent.
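
The notes don't give Adam's equations, so for reference here is a minimal sketch of the standard Adam update (Kingma & Ba, 2015) with its commonly used default hyperparameters; the toy loss $w^2$ and the larger demo learning rate are assumptions for illustration.

```python
import numpy as np

def adam_step(w, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum on the gradient (m) plus a per-parameter
    adaptive rate from the squared-gradient average (v), with bias correction."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 501):
    # eta=0.1 is larger than the usual 0.001 default, chosen so this tiny demo
    # on the toy loss w^2 (gradient 2w) visibly approaches the minimum at 0
    w, m, v = adam_step(w, m, v, grad=2.0 * w, t=t, eta=0.1)
print(w)
```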

Differentiate gradient descent with and without momentum. Give equations for weight updation in GD with and without momentum. Illustrate plateaus, saddle points and slowly varying gradients.

Gradient Descent without Momentum:

In standard Gradient Descent (GD), the update rule for the weights $w$ at each iteration is given by:

$$w \leftarrow w - \eta \nabla L(w)$$

Explanation:

$w$: Weights of the model.

$\eta$: Learning rate, a small positive value that controls the step size.

$\nabla L(w)$: Gradient of the loss function $L$ with respect to the weights.

Gradient Descent with Momentum:

In Gradient Descent with Momentum, a momentum term $\gamma$ is introduced to the update rule to accelerate convergence. The update rule becomes:

$$v_t = \gamma v_{t-1} + \eta \nabla L(w)$$
$$w \leftarrow w - v_t$$

Where:

$v_t$ is the velocity vector at iteration $t$.

$\gamma$ is the momentum parameter, typically between 0 and 1, controlling the contribution of the previous velocity.

The first equation updates the velocity vector by accumulating a fraction of the previous velocity and adding the current gradient.

The second equation updates the parameters using the velocity vector, similar to standard GD but with the velocity term.

Illustration of Plateaus, Saddle Points, and Slowly Varying Gradients:

Plateaus: Plateaus are flat regions in the loss landscape where the gradients are close to zero. In standard GD, convergence on plateaus can be slow as the updates are proportional to the gradient magnitude, which is small. In GD with Momentum, the momentum term helps the optimization process overcome plateaus more effectively by accumulating velocity over time, allowing the optimizer to escape the flat regions more efficiently.

Saddle Points: Saddle points are critical points where some dimensions have positive curvature and others have negative curvature. GD without Momentum can get stuck at saddle points due to the slow convergence caused by small gradients. In GD with Momentum, the accumulated momentum helps the optimization process move along the direction of negative curvature, enabling the optimizer to escape saddle points more quickly.

Slowly Varying Gradients: In regions with slowly varying gradients, GD without Momentum may converge slowly as it relies solely on the current gradient. GD with Momentum can accelerate convergence in such regions by accumulating momentum over time, allowing the optimizer to smooth out fluctuations in the gradient direction and move more quickly toward the minimum.
