
Deep Learning Module 2

Different Parameter Initialization Methods

1. Zero Initialization:

Description: All weights and biases are set to zero initially.

Issue: This approach is not commonly used in deep learning because it leads to symmetry in the gradients during backpropagation. As a result, all neurons in the network learn the same features, limiting the model's capacity to represent diverse patterns and reducing the effectiveness of learning.

2. Random Initialization:

Description: Weights and biases are initialized randomly from a specified distribution, such as a uniform or normal distribution.

Advantages: Random initialization breaks symmetry in the network, allowing each neuron to learn different features independently. This randomness helps prevent neurons from getting stuck during training and promotes diverse feature learning.

Common Technique: Random initialization is the most widely used technique in deep learning, as it provides flexibility and encourages effective training.

3. Xavier Initialization (Glorot Initialization):

Description: Weights are initialized from a normal distribution with mean $0$ and variance $\frac{1}{n}$, where $n$ is the number of neurons in the previous layer.

Purpose: Xavier initialization aims to ensure that the variance of activations remains consistent across layers, preventing activations from exploding or vanishing during training.

Suitability: It is particularly suitable for activation functions like sigmoid or hyperbolic tangent (tanh), which are sensitive to the scale of the inputs.
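
Below is a minimal NumPy sketch of the three schemes for a single fully connected layer; the layer sizes and the 0.01 scale for plain random initialization are illustrative assumptions, not values from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 256, 128  # example layer sizes (assumed for illustration)

# 1. Zero initialization: every neuron starts identical, so gradients stay
#    symmetric and all neurons learn the same features.
W_zero = np.zeros((n_in, n_out))

# 2. Random initialization: values drawn from a normal distribution break
#    the symmetry (the 0.01 scale is an arbitrary but common choice).
W_random = rng.normal(0.0, 0.01, size=(n_in, n_out))

# 3. Xavier/Glorot initialization: variance 1/n_in keeps the activation
#    variance roughly constant from layer to layer.
W_xavier = rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

print(W_xavier.std())  # close to sqrt(1/256) ~= 0.0625
```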

| Aspect | L1 Regularization | L2 Regularization |
|---|---|---|
| Also Known As | Lasso Regression | Ridge Regression |
| Penalty Term | Absolute value of weights | Squared value of weights |
| Effect on Weights | Encourages sparse solutions (some weights become 0) | Shrinks weights towards 0 (but not exactly 0) |
| Sparsity | High sparsity | Low sparsity |
| Feature Selection | Can be used for feature selection | Does not perform feature selection |
| Effect on Model Complexity | Simpler models with fewer features | Smoother models with all features |
| Computational Efficiency | Computationally more expensive due to absolute values | Computationally less expensive |
| Robustness to Outliers | Less sensitive to outliers | More sensitive to outliers |
| Bias-Variance Tradeoff | Tends to have higher bias and lower variance | Tends to have lower bias and higher variance |
| Application | Suitable when feature selection is important | Suitable when all features are relevant |
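
To make the penalty terms in the table concrete, here is a small NumPy sketch; the weight vector, the strength `lam`, and the stand-in data loss are all made-up example values.

```python
import numpy as np

w = np.array([0.5, -1.2, 0.0, 3.0])   # example weight vector (assumed)
lam = 0.01                            # regularization strength (assumed)

def l1_penalty(w, lam):
    # Lasso-style term: lam * sum of absolute weights -> encourages sparsity
    return lam * np.sum(np.abs(w))

def l2_penalty(w, lam):
    # Ridge-style term: lam * sum of squared weights -> shrinks weights
    return lam * np.sum(w ** 2)

base_loss = 0.42  # stand-in for the data loss (e.g., MSE)
print(base_loss + l1_penalty(w, lam))  # L1-regularized loss
print(base_loss + l2_penalty(w, lam))  # L2-regularized loss
```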

Overfitting

Definition: Overfitting occurs when a machine learning model learns to perform well on the training data but fails to generalize to unseen data from the same distribution.

Cause: Overfitting typically happens when the model becomes too complex relative to the amount and variability of the training data. As a result, the model starts to memorize the training examples rather than learning general patterns or relationships in the data.

Symptoms:

High accuracy on the training data but poor performance on unseen data.

The model may exhibit excessive sensitivity to noise or irrelevant features in the training data.

In extreme cases, the model may produce wildly inaccurate predictions or classifications on unseen data.

Consequences: Overfitting can lead to models that are unreliable or unusable in real-world scenarios, as they fail to generalize beyond the specific examples seen during training.

Strategies to mitigate overfitting:

1. Increase Training Data:

Description: Increasing the amount of training data provides the model with more diverse examples, helping it to learn generalizable patterns rather than memorizing specific instances.

Advantages: More data can help the model capture the underlying distribution of the data more accurately, reducing the chances of overfitting.

Considerations: Acquiring additional data may not always be feasible or cost-effective, but techniques like data augmentation can help artificially increase the size of the training dataset by applying transformations such as rotations, flips, and scaling to existing data.

2. Regularization:

Description: Regularization techniques add constraints to the model's optimization process, discouraging it from fitting the training data too closely and preventing overfitting.

L1 Regularization: Adds a penalty term to the loss function proportional to the absolute values of the model's weights. This encourages sparsity in the weight matrix, effectively reducing the number of features used by the model.

L2 Regularization: Adds a penalty term to the loss function proportional to the squared magnitudes of the model's weights. This penalizes large weights and encourages smoother solutions.

Advantages: Regularization techniques help prevent overfitting by reducing the model's capacity and complexity, making it less prone to fitting noise in the training data.

Considerations: The choice between L1 and L2 regularization, as well as the strength of the regularization penalty, may require tuning based on the specific characteristics of the dataset and model.

3. Feature Selection:

Description: Selecting relevant features and eliminating irrelevant or noisy ones can improve the model's ability to generalize by focusing on the most informative aspects of the data.

Advantages: By reducing the dimensionality of the input space, feature selection can help prevent overfitting by reducing the model's complexity and the potential for fitting noise.

Considerations: Feature selection methods such as univariate feature selection, recursive feature elimination, or feature importance ranking based on model coefficients or tree-based algorithms can be used to identify and retain the most relevant features.

4. Cross-Validation:

Description: Cross-validation techniques involve partitioning the training data into multiple subsets, training the model on different subsets, and evaluating its performance on the remaining subsets. This provides a more accurate estimate of the model's performance on unseen data than training on a single fixed validation set.

Advantages: Cross-validation helps assess the generalization performance of the model more reliably, reducing the risk of overfitting to a specific validation set.

Considerations: Techniques such as k-fold cross-validation, leave-one-out cross-validation, or stratified cross-validation can be used to partition the data and evaluate the model's performance across multiple folds.

5. Early Stopping:

Description: Early stopping involves monitoring the model's performance on a validation set during training and halting the training process when the performance starts to degrade, indicating that the model is overfitting.

Advantages: Early stopping prevents the model from continuing to train beyond the point of optimal performance, reducing the risk of overfitting to the training data.

Considerations: The choice of when to stop training (e.g., based on the validation loss or validation accuracy) and the threshold for determining performance degradation may require experimentation and tuning based on the specific characteristics of the dataset and model.

Describe the effect on bias and variance when a neural network is modified with a larger number of hidden units followed by dropout regularization

Increasing the number of hidden units in a neural network and applying dropout regularization can have a significant impact on the bias and variance of the model. Let's break down the effect of each modification:

1. Increasing the Number of Hidden Units:

Bias: Increasing the number of hidden units generally reduces the bias of the model. With more hidden units, the network becomes more expressive and capable of capturing complex relationships in the data. This allows the model to better fit the training data and reduce bias, as it can learn more intricate patterns.

Variance: However, increasing the number of hidden units also tends to increase the variance of the model. This is because a larger network has more parameters, making it more susceptible to overfitting. The model may start memorizing noise or outliers in the training data, leading to high variance and poor generalization to unseen data.

2. Applying Dropout Regularization:

Bias: Dropout regularization typically has little to no effect on the bias of the model. Dropout randomly deactivates a fraction of neurons during training, which can prevent overfitting by reducing co-adaptation between neurons. However, since dropout is only applied during training and not during inference, it doesn't fundamentally change the model's bias.

Variance: Dropout regularization is primarily used to reduce variance in the model. By randomly dropping neurons, dropout introduces noise into the training process, effectively creating an ensemble of slightly different models. This ensemble averaging helps reduce overfitting and variance, leading to better generalization performance on unseen data.

Suppose a supervised learning problem is given to model a deep feedforward neural network. Suggest solutions for the following:

a) small sized dataset for training

b) dataset with unlabeled data

c) large dataset but data from a different distribution.

For each scenario provided, here are the suggested solutions:

a) Small Sized Dataset for Training:

Data Augmentation: If possible, augment the existing data by applying transformations such as rotation, scaling, cropping, or adding noise. This can artificially increase the size of the dataset and provide more diverse examples for the model to learn from.

Transfer Learning: Pretrain a neural network on a larger dataset or a related task and fine-tune it on the smaller dataset. This leverages the knowledge learned from the larger dataset and helps the model generalize better to the smaller dataset (see the sketch after this list).

Regularization: Apply regularization techniques such as weight decay, dropout, or early stopping to prevent overfitting and promote generalization to unseen data.

Use Simple Models: Instead of deep neural networks, consider using simpler models with fewer parameters, such as linear models or decision trees, which may generalize better with limited data.
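
A hedged PyTorch sketch of the transfer-learning option: freeze a pretrained backbone and train only a new output head. The `.fc` attribute name for the head is an assumption (it matches torchvision ResNets); adapt it to the actual backbone you use.

```python
import torch.nn as nn

def build_finetune_model(pretrained_model: nn.Module, num_classes: int) -> nn.Module:
    """Freeze a pretrained backbone and replace its head for a small dataset.

    Assumes the backbone exposes its classification head as `.fc`
    (as torchvision ResNets do); adapt the attribute name otherwise.
    """
    for param in pretrained_model.parameters():
        param.requires_grad = False          # freeze all backbone weights
    in_features = pretrained_model.fc.in_features
    pretrained_model.fc = nn.Linear(in_features, num_classes)  # new trainable head
    return pretrained_model
```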

b) Dataset with Unlabeled Data:

Semi-Supervised Learning: Utilize semi-supervised learning techniques, where the model is trained on both labeled and unlabeled data. Methods such as self-training, co-training, or pseudo-labeling can be effective in leveraging unlabeled data to improve model performance (a pseudo-labeling sketch follows this list).

Unsupervised Pretraining: Pretrain the neural network using unsupervised learning techniques such as autoencoders or generative adversarial networks (GANs) on the unlabeled data. Then, fine-tune the pretrained model on the labeled data for the specific task.

Use Pretrained Models: Utilize pretrained models trained on large-scale datasets, such as ImageNet for image classification or Word2Vec for natural language processing tasks. Fine-tune these pretrained models on the labeled data for the target task, which can often lead to better performance than training from scratch.
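
A minimal self-training (pseudo-labeling) sketch using scikit-learn; the logistic regression classifier is a stand-in for any model, and the 0.95 confidence threshold is an illustrative assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pseudo_label_round(X_lab, y_lab, X_unlab, threshold=0.95):
    """One round of self-training: fit on labeled data, adopt confident
    predictions on unlabeled data as pseudo-labels, and refit."""
    model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    proba = model.predict_proba(X_unlab)
    confident = proba.max(axis=1) >= threshold         # keep confident examples only
    pseudo = model.classes_[proba[confident].argmax(axis=1)]
    X_aug = np.vstack([X_lab, X_unlab[confident]])
    y_aug = np.concatenate([y_lab, pseudo])
    return LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
```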

c) Large Dataset but Data from a Different Distribution:

Domain Adaptation: Apply domain adaptation techniques to align the distributions of data from different sources or domains. Methods such as adversarial training, domain adversarial neural networks (DANN), or discrepancy-based approaches can help the model adapt to the differences in distribution between datasets.

Ensemble Learning: Train multiple models on different subsets of the data or using different architectures. Combine the predictions of these models using techniques such as averaging or stacking to improve robustness to distribution shifts.

Data Preprocessing: Perform data preprocessing techniques such as domain-specific normalization, feature scaling, or domain-specific feature engineering to reduce the discrepancies between datasets and make them more compatible for training.

Fine-Tuning: Start with a model pretrained on a large dataset that is more representative of the target distribution. Fine-tune this pretrained model on the dataset of interest, which can help the model adapt to the specific characteristics of the new data distribution.

Explain the following:

i) Early stopping
ii) Dropout
iii) Injecting noise at input
iv) Parameter sharing and tying

i) Early Stopping:

Early stopping is a regularization technique used to prevent overfitting during the training of neural networks.

It involves monitoring the performance of the model on a separate validation dataset during training.

The training process is halted when the performance on the validation set starts to degrade or stagnate, indicating that the model is overfitting.

By stopping the training early, it helps prevent the model from memorizing noise in the training data and encourages it to generalize better to unseen data.

Early stopping strikes a balance between training the model long enough to learn useful patterns and preventing it from overfitting by terminating training when performance on the validation set begins to worsen.

ii) Dropout:

Dropout is a regularization technique used during training of neural networks to prevent overfitting.

It works by randomly deactivating (dropping out) a fraction of neurons in a layer during each training iteration.

This forces the network to learn more robust and generalized features, as it cannot rely on any single neuron to always be present.

During inference (testing), all neurons are used, but their outputs are scaled by the keep probability to compensate for the neurons that were deactivated during training.

Dropout helps in preventing co-adaptation between neurons and encourages the network to learn more diverse representations of the data, improving its generalization performance.
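
A minimal NumPy sketch of a dropout layer. Note it uses the "inverted" variant, which rescales at training time so inference needs no change; the notes describe the classic variant that scales at test time instead, and the two are equivalent in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate=0.5, training=True):
    """Inverted dropout: zero each unit with probability `rate` during
    training and rescale the survivors; do nothing at inference."""
    if not training or rate == 0.0:
        return activations
    mask = rng.random(activations.shape) >= rate   # keep each unit with prob 1-rate
    return activations * mask / (1.0 - rate)       # rescale surviving activations

h = np.ones((2, 4))
print(dropout(h, rate=0.5))   # roughly half the entries zeroed, the rest scaled to 2.0
```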

iii) Injecting Noise at Input:

Injecting noise at the input is a form of data augmentation and a regularization technique used to improve the robustness of neural networks.

It involves adding random noise to the input data before feeding it into the network during training.

The noise can take various forms, such as Gaussian noise, random jitter, or dropout-like masking.

Injecting noise helps the network become more tolerant to variations and uncertainties in the input data, making it less sensitive to small perturbations and noise.

By exposing the network to a diverse range of input patterns with added noise, it learns to generalize better and becomes more robust to noise in the test data.
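
A short NumPy sketch of the Gaussian case; the noise standard deviation is an assumed hyperparameter that would normally be tuned.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(x, std=0.1):
    # Corrupt each input with zero-mean Gaussian noise before the forward pass;
    # std controls how strong the perturbation is (assumed value here).
    return x + rng.normal(0.0, std, size=x.shape)

batch = np.array([[0.2, 0.8], [0.5, 0.1]])
noisy_batch = add_gaussian_noise(batch)   # fed to the network instead of `batch`
```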

iv) Parameter Sharing and Tying:

Parameter sharing and tying are techniques used to reduce the number of parameters in a neural network model, thereby improving its efficiency and generalization.

In parameter sharing, the same set of parameters (weights) is reused across different parts of the network, typically in convolutional neural networks (CNNs).

For example, in CNNs, the same filter is applied to different locations of the input image, allowing the network to learn spatially invariant features.

Parameter tying involves constraining certain parameters in the model to be equal to each other.

For example, in natural language processing tasks, the input word embeddings and the output projection weights can be tied to reduce the number of parameters and improve generalization.

Parameter sharing and tying help in learning more compact and transferable representations, especially when the amount of training data is limited.
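
A hedged PyTorch sketch of the weight-tying example: the input embedding and the output projection of a language model share one parameter tensor. The vocabulary and embedding sizes are illustrative assumptions.

```python
import torch.nn as nn

vocab_size, embed_dim = 10000, 256        # assumed sizes for illustration

embedding = nn.Embedding(vocab_size, embed_dim)
decoder = nn.Linear(embed_dim, vocab_size, bias=False)

# Tie the output projection to the input embedding: both modules now share
# one (vocab_size x embed_dim) parameter tensor, halving these parameters.
decoder.weight = embedding.weight

assert decoder.weight.data_ptr() == embedding.weight.data_ptr()
```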

Suppose that a model does well on the training set, but only achieves an accuracy of 85% on the validation set. You conclude that the model is overfitting, and plan to use L1 or L2 regularization to fix the issue. However, you learn that some of the examples in the data may be incorrectly labeled. Which form of regularization would you prefer to use and why?

In this scenario, where there is a possibility of incorrectly labeled examples in the dataset, L1 regularization would be preferred over L2 regularization. The reason is that L1 regularization tends to result in sparsity in the model's parameters by pushing some of them to exactly zero. This sparsity property makes L1 regularization more robust to outliers or incorrectly labeled examples compared to L2 regularization.

Here's why L1 regularization is preferable in this situation:

1. Robustness to Outliers: L1 regularization penalizes the absolute values of the model parameters, which tends to result in sparsity. Parameters that are not contributing significantly to the model are pushed to zero. This makes the model less sensitive to outliers or incorrectly labeled examples because their influence on the overall loss function is reduced.

2. Feature Selection: L1 regularization has the additional benefit of performing feature selection by automatically setting the weights of irrelevant or redundant features to zero. This can help mitigate the impact of incorrectly labeled examples by effectively ignoring features that are not informative or noisy.

3. Limited Influence of Individual Weights: Because L1 regularization penalizes the magnitude of the parameters directly (rather than the squared magnitude, as L2 does), its gradient does not grow with the weight, and the sparsity it induces prevents mislabeled examples from inflating many weights at once. This limits the impact such examples can have on the model's performance.

Derivation of the Weight Updating Rule in Gradient Descent

To derive the weight updating rule in gradient descent, we need to compute the gradient of the error function with respect to the weights and then use this gradient to update the weights. We'll consider two common error functions: Mean Squared Error (MSE) and Cross Entropy.

General Gradient Descent Rule:

For a weight $w$, the weight update rule in gradient descent is:

$$w \leftarrow w - \eta \frac{\partial E}{\partial w}$$

where $\eta$ is the learning rate and $\frac{\partial E}{\partial w}$ is the gradient of the error function $E$ with respect to the weight $w$.

a) Mean Squared Error (MSE)

The Mean Squared Error for a single training example is defined as:

$$E = \frac{1}{2}(y - \hat{y})^2$$

where $y$ is the true label and $\hat{y}$ is the predicted output.

For a linear model with output $\hat{y} = wx + b$ (ignoring the bias for simplicity), we need to compute the gradient of the error function with respect to $w$.

1. Compute the derivative of $E$ with respect to $\hat{y}$:

$$\frac{\partial E}{\partial \hat{y}} = \frac{\partial}{\partial \hat{y}}\left(\frac{1}{2}(y - \hat{y})^2\right) = -(y - \hat{y})$$

2. Compute the derivative of $\hat{y}$ with respect to $w$:

$$\frac{\partial \hat{y}}{\partial w} = \frac{\partial}{\partial w}(wx) = x$$

3. Apply the chain rule to get the gradient of $E$ with respect to $w$:

$$\frac{\partial E}{\partial w} = \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w} = -(y - \hat{y})\,x$$

4. Substitute into the weight update rule:

$$w \leftarrow w - \eta \frac{\partial E}{\partial w} = w + \eta (y - \hat{y})\,x$$

So, the weight update rule for MSE is:

$$w \leftarrow w + \eta (y - \hat{y})\,x$$
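
A tiny numeric check of this rule on one assumed training example $(x, y) = (2, 3)$; the learning rate and initial weight are also assumed.

```python
eta, w = 0.1, 0.0                  # learning rate and initial weight (assumed)
x, y = 2.0, 3.0                    # single training example (assumed)

for _ in range(50):
    y_hat = w * x                  # linear model prediction
    w = w + eta * (y - y_hat) * x  # derived rule: w <- w + eta * (y - y_hat) * x

print(w)   # converges toward y / x = 1.5, where the squared error is zero
```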

b) Cross Entropy

The Cross Entropy Error for a single training example is defined as:

$$E = -\left[y \log(\hat{y}) + (1 - y)\log(1 - \hat{y})\right]$$

where $y$ is the true label (0 or 1) and $\hat{y}$ is the predicted probability.

For a logistic regression model, $\hat{y} = \sigma(z)$, where $z = wx$ and $\sigma(z) = \frac{1}{1 + e^{-z}}$.

1. Compute the derivative of $E$ with respect to $\hat{y}$:

$$\frac{\partial E}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}$$

2. Compute the derivative of $\hat{y}$ with respect to $z$ (where $z = wx$):

$$\frac{\partial \hat{y}}{\partial z} = \sigma(z)(1 - \sigma(z)) = \hat{y}(1 - \hat{y})$$

3. Compute the derivative of $z$ with respect to $w$:

$$\frac{\partial z}{\partial w} = x$$

4. Apply the chain rule to get the gradient of $E$ with respect to $w$:

$$\frac{\partial E}{\partial w} = \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w} = \left(-\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}\right)\hat{y}(1 - \hat{y})\,x = (\hat{y} - y)\,x$$

5. Substitute into the weight update rule:

$$w \leftarrow w - \eta \frac{\partial E}{\partial w} = w - \eta(\hat{y} - y)\,x$$

So, the weight update rule for cross entropy is:

$$w \leftarrow w - \eta(\hat{y} - y)\,x$$
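
The same rule run on a toy logistic regression problem; the four 1-D examples, learning rate, and epoch count are assumed for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

eta, w = 0.5, 0.0                       # assumed learning rate and initial weight
X = np.array([-2.0, -1.0, 1.0, 2.0])    # toy 1-D inputs (assumed)
Y = np.array([0, 0, 1, 1])              # binary labels

for _ in range(200):
    for x, y in zip(X, Y):
        y_hat = sigmoid(w * x)          # predicted probability
        w = w - eta * (y_hat - y) * x   # derived rule: w <- w - eta*(y_hat - y)*x

print(w, sigmoid(w * X))   # w > 0; predicted probabilities end up near the labels
```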

1. Gradient Descent (GD)

Weight Update Equation:

$$w \leftarrow w - \eta \nabla L(w)$$

Explanation:

$w$: Weights of the model.

$\eta$: Learning rate, a small positive value that controls the step size.

$\nabla L(w)$: Gradient of the loss function $L$ with respect to the weights.

How It Works:

Compute the gradient of the loss function with respect to all weights.

Update the weights by subtracting a fraction of the gradient (determined by the learning rate).

Repeat for all data points (the entire dataset) in each iteration (epoch).

Problems:

Computationally expensive for large datasets, as it requires computing the gradients using the entire dataset.

Can get stuck in local minima or saddle points.

Slow convergence for high-dimensional data.

2. Stochastic Gradient Descent (SGD)

Weight Update Equation:

$$w \leftarrow w - \eta \nabla L(w; x_i, y_i)$$

Explanation:

Uses one training example $(x_i, y_i)$ at a time to update the weights.

How It Works:

Compute the gradient of the loss function with respect to a single data point.

Update the weights immediately after computing the gradient for that single data point.

Repeat for each data point in the dataset.

Advantages:

Faster updates compared to GD, allowing for more frequent adjustments to the weights.

Introduces noise into the gradient computation, which can help escape local minima.

Problems:

Noisy updates can cause the loss function to fluctuate.

May never converge to the minimum but oscillate around it.

3. Mini-Batch Stochastic Gradient Descent (Mini-Batch SGD)

Weight Update Equation:

$$w \leftarrow w - \eta \nabla L(w; B)$$

Explanation:

Uses a small batch of training examples $B$ to update the weights.

How It Works:

Compute the gradient of the loss function with respect to a small batch of data points.

Update the weights after computing the gradient for the entire mini-batch.

Repeat for all mini-batches in the dataset.

Advantages:

Reduces the variance in the gradient updates compared to SGD.

Provides a compromise between the frequent updates of SGD and the stability of GD.

Can make better use of vectorized operations and hardware acceleration.

Problems:

Still might face issues with oscillations and slow convergence.
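
The three variants above differ only in how many examples feed each gradient. Here is a minimal NumPy sketch for linear regression with assumed toy data: setting `batch_size = 1` gives SGD, `batch_size = len(X)` gives full-batch GD, and anything in between is mini-batch SGD.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                              # toy inputs (assumed)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)

def grad(w, Xb, yb):
    # MSE gradient for a linear model, computed on one batch
    return Xb.T @ (Xb @ w - yb) / len(yb)

w, eta, batch_size = np.zeros(3), 0.1, 16   # hyperparameters (assumed)

for epoch in range(100):
    idx = rng.permutation(len(X))           # shuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]   # one mini-batch of indices
        w -= eta * grad(w, X[b], y[b])

print(w)   # close to the true weights [1.0, -2.0, 0.5]
```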

4. SGD with Momentum

Weight Update Equations:

$$v_t = \gamma v_{t-1} + \eta \nabla L(w)$$
$$w \leftarrow w - v_t$$

Explanation:

$v_t$: Velocity term that accumulates the gradient.

$\gamma$: Momentum term (typically between 0.5 and 0.9).

How It Works:

Compute the gradient of the loss function.

Update the velocity term by combining the previous velocity and the current gradient.

Update the weights using the velocity term.

Advantages:

Helps accelerate SGD in relevant directions and dampens oscillations.

Uses the past gradients to smooth out the updates.

Reduces the effects of small, inconsistent gradient directions.
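
A direct transcription of the two equations into NumPy; the constant stand-in gradient is assumed just to show the velocity accumulating.

```python
import numpy as np

def sgd_momentum_step(w, v, grad, eta=0.01, gamma=0.9):
    """One momentum update: v_t = gamma*v_{t-1} + eta*grad, then w <- w - v_t."""
    v = gamma * v + eta * grad
    return w - v, v

w, v = np.zeros(2), np.zeros(2)
g = np.array([1.0, -0.5])              # stand-in gradient (assumed constant)
for _ in range(3):
    w, v = sgd_momentum_step(w, v, g)  # steps grow as velocity accumulates
print(w, v)
```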

5. Nesterov Accelerated Gradient (NAG)

Weight Update Equations:

$$v_t = \gamma v_{t-1} + \eta \nabla L(w - \gamma v_{t-1})$$
$$w \leftarrow w - v_t$$

Explanation:

Computes the gradient at the predicted future position of the parameters.

How It Works:

Predict the future position of the weights using the current velocity.

Compute the gradient of the loss function at this future position.

Update the velocity term using this gradient.

Update the weights using the new velocity term.

Advantages:

More responsive to changes in the gradient direction compared to standard momentum.

Provides a look-ahead mechanism that improves convergence.
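
A sketch of the look-ahead step; the toy quadratic loss $\lVert w \rVert^2$ and its gradient $2w$ are assumed for illustration.

```python
import numpy as np

def nag_step(w, v, grad_fn, eta=0.01, gamma=0.9):
    """Nesterov update: evaluate the gradient at the look-ahead point
    w - gamma*v, then v_t = gamma*v + eta*grad(look-ahead), w <- w - v_t."""
    lookahead = w - gamma * v
    v = gamma * v + eta * grad_fn(lookahead)
    return w - v, v

grad_fn = lambda w: 2.0 * w           # gradient of the toy loss ||w||^2 (assumed)
w, v = np.array([5.0]), np.array([0.0])
for _ in range(100):
    w, v = nag_step(w, v, grad_fn)
print(w)                              # approaches the minimum at 0
```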

6. Adagrad

Weight Update Equations:

$$G_t = G_{t-1} + (\nabla L(w))^2$$
$$w \leftarrow w - \frac{\eta}{\sqrt{G_t + \epsilon}} \nabla L(w)$$

Explanation:

$G_t$: Sum of the squares of past gradients.

$\epsilon$: Small constant to prevent division by zero.

How It Works:

Compute the gradient of the loss function.

Accumulate the squared gradients.

Adjust the learning rate for each weight based on the accumulated gradients.

Update the weights using the adjusted learning rate.

Advantages:

Adapts the learning rate for each parameter based on historical gradient information.

Works well for sparse data.

Scales down the learning rate for parameters with large gradients.
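
The update in NumPy, run on an assumed toy loss with very uneven curvature to show the per-parameter scaling.

```python
import numpy as np

def adagrad_step(w, G, grad, eta=0.1, eps=1e-8):
    """Adagrad: accumulate squared gradients, then scale each weight's step."""
    G = G + grad ** 2
    w = w - eta * grad / np.sqrt(G + eps)
    return w, G

w, G = np.array([5.0, 5.0]), np.zeros(2)
for _ in range(200):
    grad = 2.0 * w * np.array([1.0, 100.0])  # toy loss w1^2 + 100*w2^2 (assumed)
    w, G = adagrad_step(w, G, grad)
print(w, G)   # the steep coordinate accumulates a large G, so its
              # effective learning rate eta/sqrt(G) becomes much smaller
```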

Summary

Each optimizer improves upon the previous one by addressing specific issues:

GD: Simple but slow and computationally expensive.

SGD: Faster updates but noisy and may not converge.

Mini-Batch SGD: Compromise between GD and SGD, reducing variance in updates.

SGD with Momentum: Smoother updates and faster convergence by using past gradients.

NAG: Further improved convergence by anticipating future gradients.

Adagrad: Adaptive learning rates based on past gradient information, useful for sparse data.

State how to apply early stopping in the context of learning using Gradient Descent. Why is it necessary to use a validation set (instead of simply using the test set) when using early stopping?

Early stopping is applied in the context of learning using Gradient Descent by monitoring the performance of the model on a validation set during training. The training process is stopped when the performance on the validation set starts to degrade or stagnate, indicating that the model is beginning to overfit.

Here's how early stopping is implemented (a sketch of the loop follows the list):

1. Training Phase: During the training phase, the model is trained on the training dataset using Gradient Descent or its variants (e.g., Stochastic Gradient Descent). The performance of the model is evaluated periodically (after each epoch or after a certain number of iterations) on the validation dataset.

2. Validation Phase: After evaluating the model on the validation dataset, the performance metric (e.g., validation loss or accuracy) is monitored. If the performance metric does not improve or starts to degrade over several consecutive evaluations, early stopping is triggered.

3. Early Stopping Criterion: The early stopping criterion is typically defined based on the behavior of the validation metric. For example, early stopping may be triggered if the validation loss does not decrease for a certain number of epochs or if it increases for a certain number of consecutive epochs.

4. Termination of Training: Once early stopping is triggered, the training process is terminated, and the model parameters from the epoch with the best validation performance are retained as the final model.
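
A hedged sketch of that loop with patience-based stopping; `train_one_epoch`, `validation_loss`, `snapshot`, and `restore` are hypothetical helpers standing in for whatever training framework is used.

```python
def train_with_early_stopping(model, patience=5, max_epochs=100):
    """Halt when validation loss fails to improve for `patience` epochs.

    `train_one_epoch`, `validation_loss`, `snapshot`, and `restore` are
    assumed helpers, not part of any specific library."""
    best_loss, best_weights, epochs_without_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                     # one pass of gradient descent
        val_loss = validation_loss(model)          # evaluate on the validation set
        if val_loss < best_loss:
            best_loss, best_weights = val_loss, snapshot(model)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                              # early stopping triggered
    restore(model, best_weights)                   # keep the best-epoch weights
    return model
```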

It is necessary to use a validation set (instead of simply using the test set) when using early stopping because:

Prevents Overfitting to the Test Set: If early stopping were based on test set performance, the model could potentially overfit to the test set, resulting in optimistic estimates of generalization performance. Using a separate validation set ensures that the model's performance is evaluated on unseen data that is not used for training or model selection.

Prevents Data Leakage: Using the test set for early stopping could introduce data leakage, where information from the test set influences the training process. This violates the principle of using the test set only for final evaluation and can lead to biased performance estimates.

Allows for Model Selection: By using a validation set, early stopping allows for model selection based on performance metrics independent of the test set. This ensures that the final model is selected based on its ability to generalize to unseen data.

Describe one advantage of using the Adam optimizer instead of basic gradient descent

One advantage of using the Adam optimizer instead of basic gradient descent is its ability to adaptively adjust the learning rate for each parameter in the neural network. This adaptiveness is achieved through the use of momentum and adaptive learning rate techniques.

Here's how the Adam optimizer offers this advantage:

1. Adaptive Learning Rate: Adam computes individual learning rates for different parameters based on estimates of the first and second moments of the gradients. It maintains separate learning rates for each parameter, allowing it to automatically adjust the learning rate for each parameter according to the magnitude of its gradients. This means that parameters with large gradients will have smaller learning rates, and parameters with small gradients will have larger learning rates. As a result, Adam can handle situations where the gradients of different parameters vary widely, allowing for more efficient and stable training.

2. Momentum: The Adam optimizer incorporates momentum, which helps accelerate convergence and dampen oscillations during training. By maintaining a moving average of past gradients, Adam effectively smooths the gradient updates, allowing for more stable progress towards the optimum. This momentum term helps in navigating complex loss surfaces, especially in high-dimensional parameter spaces, and can lead to faster convergence compared to basic gradient descent.
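
The notes don't give Adam's equations, so for reference here is a minimal sketch of the standard Adam update (Kingma & Ba, 2015) with its commonly used default hyperparameters; the toy loss $w^2$ and the larger demo learning rate are assumptions for illustration.

```python
import numpy as np

def adam_step(w, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum on the gradient (m) plus a per-parameter
    adaptive rate from the squared-gradient average (v), with bias correction."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 501):
    # eta=0.1 is larger than the usual 0.001 default, chosen so this tiny demo
    # on the toy loss w^2 (gradient 2w) visibly approaches the minimum at 0
    w, m, v = adam_step(w, m, v, grad=2.0 * w, t=t, eta=0.1)
print(w)
```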

Differentiate gradient descent with and without momentum. Give equations for weight updation in GD with and without momentum. Illustrate plateaus, saddle points and slowly varying gradients.

Gradient Descent without Momentum:

In standard Gradient Descent (GD), the update rule for the weights $w$ at each iteration is given by:

$$w \leftarrow w - \eta \nabla L(w)$$

Explanation:

$w$: Weights of the model.

$\eta$: Learning rate, a small positive value that controls the step size.

$\nabla L(w)$: Gradient of the loss function $L$ with respect to the weights.

Gradient Descent with Momentum:

In Gradient Descent with Momentum, a momentum term $\gamma$ is introduced to the update rule to accelerate convergence. The update rule becomes:

$$v_t = \gamma v_{t-1} + \eta \nabla L(w)$$
$$w \leftarrow w - v_t$$

Where:

$v_t$ is the velocity vector at iteration $t$.

$\gamma$ is the momentum parameter, typically between 0 and 1, controlling the contribution of the previous velocity.

The first equation updates the velocity vector by accumulating a fraction of the previous velocity and adding the current gradient.

The second equation updates the parameters using the velocity vector, similar to standard GD but with the velocity term.

Illustration of Plateaus, Saddle Points, and Slowly Varying Gradients:

Plateaus: Plateaus are flat regions in the loss landscape where the gradients are close to zero. In standard GD, convergence on plateaus can be slow as the updates are proportional to the gradient magnitude, which is small. In GD with Momentum, the momentum term helps the optimization process overcome plateaus more effectively by accumulating velocity over time, allowing the optimizer to escape the flat regions more efficiently.

Saddle Points: Saddle points are critical points where some dimensions have positive curvature and others have negative curvature. GD without Momentum can get stuck at saddle points due to the slow convergence caused by small gradients. In GD with Momentum, the accumulated momentum helps the optimization process move along the direction of negative curvature, enabling the optimizer to escape saddle points more quickly.

Slowly Varying Gradients: In regions with slowly varying gradients, GD without Momentum may converge slowly as it relies solely on the current gradient. GD with Momentum can accelerate convergence in such regions by accumulating momentum over time, allowing the optimizer to smooth out fluctuations in the gradient direction and move more quickly toward the minimum.
