Module 2 - S8 CSE NOTES - KTU DEEP LEARNING NOTES - CST414
2. Random Initialization:
Description: Weights and biases are initialized randomly from a specified
distribution, such as a uniform or normal distribution. Random initialization
breaks the symmetry between neurons, so that different neurons in the same
layer can learn different features.
Aspect                 | L1 Regularization                             | L2 Regularization
Computational Cost     | More expensive due to absolute values         | Less expensive; computationally efficient
Robustness to Outliers | Less sensitive to outliers                    | More sensitive to outliers
Bias-Variance Tradeoff | Tends to have higher bias and lower variance  | Tends to have lower bias and higher variance
Overfitting
Definition: Overfitting occurs when a machine learning model learns to
perform well on the training data but fails to generalize to unseen data from
the same distribution.
Cause: Overfitting typically happens when the model becomes too complex
relative to the amount and variability of the training data. As a result, the
model starts to memorize the training examples rather than learning general
patterns or relationships in the data.
Symptoms: The model achieves low error on the training data but noticeably
higher error on validation or test data, and the gap between the two widens
as training continues.
Techniques to Reduce Overfitting:
1. Increasing the Training Data:
Description: Collecting more training examples exposes the model to greater
variability in the data.
Advantages: More data can help the model capture the underlying
distribution of the data more accurately, reducing the chances of
overfitting.
2. Regularization:
Description: Regularization techniques add constraints to the model's
optimization process, discouraging it from fitting the training data too
closely and preventing overfitting.
3. Feature Selection:
Description: Selecting relevant features and eliminating irrelevant or noisy
ones can improve the model's ability to generalize by focusing on the most
informative aspects of the data.
4. Cross-Validation:
Description: Cross-validation techniques involve partitioning the training
data into multiple subsets, training the model on different subsets, and
evaluating its performance on the remaining subsets. This provides a more
accurate estimate of the model's performance on unseen data than evaluation
on a single fixed validation set (a minimal k-fold sketch appears after this list).
5. Early Stopping:
Description: Early stopping involves monitoring the model's performance
on a validation set during training and halting the training process when the
performance starts to degrade, indicating that the model is overfitting.
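As a concrete illustration of cross-validation (item 4 above), here is a minimal
k-fold sketch in Python; it assumes scikit-learn is available, and the toy data
and model choice are illustrative only, not part of the notes:

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.linear_model import LogisticRegression

    # Toy data: 100 examples, 5 features (illustrative assumption).
    X = np.random.randn(100, 5)
    y = (X[:, 0] > 0).astype(int)

    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = []
    for train_idx, val_idx in kf.split(X):
        model = LogisticRegression()
        model.fit(X[train_idx], y[train_idx])               # train on k-1 folds
        scores.append(model.score(X[val_idx], y[val_idx]))  # evaluate on the held-out fold

    print("mean validation accuracy:", np.mean(scores))

Averaging over the five folds gives a more stable performance estimate than any
single train/validation split.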
Bias: Increasing the number of hidden units generally reduces the bias
of the model. With more hidden units, the network becomes more
expressive and capable of capturing complex relationships in the data.
This allows the model to better fit the training data and reduce bias, as it
can learn more intricate patterns.
Variance: At the same time, increasing the number of hidden units tends to
increase the variance of the model; with greater capacity it can fit noise in
the training data, raising the risk of overfitting unless regularization or
more data is used.
i) Early stopping
ii) Dropout
iii) Injecting noise at input
iv) Parameter sharing and tying.
i) Early Stopping:
The model's performance on a validation set is monitored during training,
and training is halted when the validation performance starts to degrade,
preventing the network from over-fitting the training data (see the detailed
description of early stopping above).
ii) Dropout:
This forces the network to learn more robust and generalized features,
as it cannot rely on any single neuron to always be present.
During inference (testing), all neurons are used, but their outputs are
scaled by the keep probability (1 − dropout rate) to compensate for the
neurons that were deactivated during training (see the sketch after this list).
iii) Injecting Noise at Input:
It involves adding random noise (for example, zero-mean Gaussian noise) to
the input data before feeding it into the network during training, which acts
as a form of data augmentation and discourages the network from relying on
exact input values (also illustrated in the sketch after this list).
iv) Parameter Sharing and Tying:
Parameter sharing and tying are techniques used to reduce the number
of parameters in a neural network model, thereby improving its
efficiency and generalization. For example, a convolutional layer shares the
same filter weights across all spatial positions, and an autoencoder may tie
its decoder weights to the transpose of its encoder weights.
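The following Python sketch illustrates dropout and input-noise injection with
plain numpy; note that it uses the inverted-dropout variant, which rescales
surviving activations during training so that no extra scaling is needed at
test time (the dropout rate, noise level, and array shapes are illustrative
assumptions):

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout_train(h, rate=0.5):
        # Inverted dropout: zero each unit with probability `rate`, and rescale
        # the survivors by 1/(1 - rate) so the expected activation is unchanged.
        mask = (rng.random(h.shape) >= rate) / (1.0 - rate)
        return h * mask

    def add_input_noise(x, sigma=0.1):
        # Inject zero-mean Gaussian noise into the inputs during training only.
        return x + rng.normal(0.0, sigma, size=x.shape)

    x = rng.normal(size=(4, 3))                    # toy mini-batch of inputs
    h = np.maximum(0.0, add_input_noise(x) @ rng.normal(size=(3, 5)))  # toy hidden layer
    h = dropout_train(h)                           # applied only in training mode
    print(h)

At test time, dropout_train and add_input_noise are simply skipped.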
Suppose that a model does well on the training set, but only
achieves an accuracy of 85% on the validation set. You
conclude that the model is overfitting, and plan to use L1 or
L2 regularization to fix the issue. However, you learn that
some of the examples in the data may be incorrectly labeled.
Which form of regularisation would you prefer to use and
why?
a) Sum of Squared Error
The Sum of Squared Error for a single training example is defined as:
E = ½ (y − ŷ)²
For a linear model with output ŷ = wx + b (ignoring the bias for simplicity),
we need to compute the gradient of the error function with respect to w.
1. Compute the derivative of E with respect to ŷ:
∂E/∂ŷ = ∂/∂ŷ [½ (y − ŷ)²] = −(y − ŷ)
2. Compute the derivative of ŷ with respect to w:
∂ŷ/∂w = ∂/∂w (wx) = x
3. Apply the chain rule to get the gradient of E with respect to w:
∂E/∂w = (∂E/∂ŷ) · (∂ŷ/∂w) = −(y − ŷ) · x
4. Substitute into the weight update rule:
w ← w − η ∂E/∂w = w + η (y − ŷ) x
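A tiny numeric check of this update rule, with illustrative values η = 0.1,
x = 2, y = 3, and initial w = 0.5:

    import numpy as np

    w, eta = 0.5, 0.1
    x, y = 2.0, 3.0

    y_hat = w * x                       # prediction: 1.0
    grad = -(y - y_hat) * x             # dE/dw = -(y - y_hat) * x = -4.0
    w_new = w - eta * grad              # 0.5 + 0.1 * (3 - 1) * 2 = 0.9
    print(w_new)

    # Finite-difference check that the analytic gradient is correct.
    E = lambda w_: 0.5 * (y - w_ * x) ** 2
    num_grad = (E(w + 1e-6) - E(w - 1e-6)) / 2e-6
    print(np.isclose(grad, num_grad))   # True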
b) Cross Entropy
The Cross Entropy Error for a single training example is defined as:
E = −[y log ŷ + (1 − y) log(1 − ŷ)]
For a logistic regression model, ŷ = σ(z), where z = wx and σ(z) = 1 / (1 + e^(−z)).
1. Compute the derivative of E with respect to ŷ:
∂E/∂ŷ = −y/ŷ + (1 − y)/(1 − ŷ)
2. Compute the derivative of ŷ with respect to z, using σ′(z) = σ(z)(1 − σ(z)):
∂ŷ/∂z = ŷ (1 − ŷ)
3. Compute the derivative of z with respect to w:
∂z/∂w = x
4. Apply the chain rule to get the gradient of E with respect to w:
∂E/∂w = (∂E/∂ŷ) · (∂ŷ/∂z) · (∂z/∂w)
∂E/∂w = (−y/ŷ + (1 − y)/(1 − ŷ)) · ŷ (1 − ŷ) · x
∂E/∂w = (ŷ − y) x
5. Substitute into the weight update rule:
w ← w − η (ŷ − y) x
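A short sketch verifying that the analytic gradient (ŷ − y)x matches a
finite-difference estimate; the values of w, x, and y are illustrative:

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    w, x, y = 0.3, 1.5, 1.0
    y_hat = sigmoid(w * x)
    analytic = (y_hat - y) * x           # dE/dw from the derivation above

    # Finite-difference check of the cross-entropy gradient.
    E = lambda w_: -(y * np.log(sigmoid(w_ * x)) + (1 - y) * np.log(1 - sigmoid(w_ * x)))
    numeric = (E(w + 1e-6) - E(w - 1e-6)) / 2e-6
    print(np.isclose(analytic, numeric))  # True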
1. Gradient Descent (Batch Gradient Descent)
Weight Update Equation:
w ← w − η∇L(w)
Explanation:
η: Learning rate, a small positive value that controls the step size.
How It Works:
Compute the gradient of the loss function with respect to all weights, using
the entire dataset.
Repeat this once per iteration (epoch), so every update requires a full pass
over the data.
Problems:
Computationally expensive and slow for large datasets, since each single
weight update requires processing every training example.
2. Stochastic Gradient Descent (SGD)
Weight Update Equation:
w ← w − η∇L(w; xi, yi)
Explanation:
Uses one training example (xi, yi) at a time to update the weights.
How It Works:
Compute the gradient of the loss function with respect to a single data
point.
Update the weights immediately after computing the gradient for that
single data point.
Advantages:
Introduces noise into the gradient computation, which can help escape
local minima.
Problems:
The noisy, high-variance updates cause the loss to fluctuate, so convergence
can oscillate around the minimum rather than settling smoothly.
3. Mini-Batch Gradient Descent
Weight Update Equation:
w ← w − η∇L(w; B)
Explanation:
Uses a small batch B of training examples at a time to update the weights.
How It Works:
Compute the gradient of the loss function with respect to a small batch
of data points.
Update the weights after computing the gradient for the entire mini-
batch.
Advantages:
Balances the stability of batch gradient descent with the speed of SGD, and
allows efficient vectorized computation on modern hardware.
Problems:
Introduces an additional hyperparameter (the batch size) that must be tuned,
and the updates are still somewhat noisy.
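A minimal sketch contrasting one update of each of the three variants above
on a linear-regression loss; numpy, the toy dataset, the batch size, and the
learning rate are all illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                  # 100 toy examples, 3 features
    y = X @ np.array([1.0, -2.0, 0.5])             # toy targets

    # Mean-squared-error gradient for a linear model on a batch (Xb, yb).
    grad = lambda w, Xb, yb: Xb.T @ (Xb @ w - yb) / len(yb)
    eta = 0.1
    w = np.zeros(3)

    # Batch GD: one update uses the entire dataset.
    w = w - eta * grad(w, X, y)

    # SGD: one update uses a single example (xi, yi).
    i = 0
    w = w - eta * grad(w, X[i:i+1], y[i:i+1])

    # Mini-batch GD: one update uses a small batch B.
    B = rng.choice(len(X), size=16, replace=False)
    w = w - eta * grad(w, X[B], y[B])
    print(w)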
4. Gradient Descent with Momentum
Weight Update Equations:
vt = γvt−1 + η∇L(w)
w ← w − vt
Explanation:
Update the velocity term by combining the previous velocity and the
current gradient.
Advantages:
The accumulated velocity (with momentum coefficient γ, typically around 0.9)
accelerates convergence along directions of consistent gradient and damps
oscillations.
5. Nesterov Accelerated Gradient (NAG)
Weight Update Equations:
vt = γvt−1 + η∇L(w − γvt−1)
w ← w − vt
Explanation:
The gradient is evaluated at the anticipated future position (w − γvt−1)
rather than at the current weights.
How It Works:
Predict the future position of the weights using the current velocity (w − γvt−1).
Compute the gradient at this look-ahead position and use it to update the
velocity and then the weights.
Advantages:
Looking ahead makes the updates more responsive to changes in the gradient,
reducing overshooting compared to plain momentum.
6. Adagrad
Weight Update Equations:
Gt = Gt−1 + (∇L(w))²
w ← w − (η / √(Gt + ϵ)) ∇L(w)
Explanation:
Gt accumulates the squared gradients for each weight over time, and ϵ is a
small constant that prevents division by zero.
How It Works:
Adjust the learning rate for each weight based on the accumulated
gradients.
Advantages:
Scales down the learning rate for parameters with large gradients.
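A compact numpy sketch of the momentum, NAG, and Adagrad updates on the toy
loss L(w) = ½w² (so ∇L(w) = w); the constants γ = 0.9, η = 0.1, and ϵ = 1e−8
are illustrative, not prescribed by the notes:

    import numpy as np

    grad = lambda w: w                 # gradient of the toy loss L(w) = 0.5 * w**2
    eta, gamma, eps = 0.1, 0.9, 1e-8

    # Momentum: accumulate velocity, then step.
    w, v = 5.0, 0.0
    for _ in range(100):
        v = gamma * v + eta * grad(w)
        w = w - v
    print("momentum:", w)

    # NAG: evaluate the gradient at the look-ahead point w - gamma*v.
    w, v = 5.0, 0.0
    for _ in range(100):
        v = gamma * v + eta * grad(w - gamma * v)
        w = w - v
    print("nag:", w)

    # Adagrad: per-weight learning rate scaled by accumulated squared gradients.
    w, G = 5.0, 0.0
    for _ in range(100):
        g = grad(w)
        G = G + g * g
        w = w - eta / np.sqrt(G + eps) * g
    print("adagrad:", w)

All three runs drive w toward the minimum at 0; on this toy problem Adagrad is
noticeably slower because its accumulated Gt keeps shrinking the effective
learning rate.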
Summary
Each optimizer improves upon the previous one by addressing specific issues:
SGD and mini-batch gradient descent make each update far cheaper than
full-batch GD; Momentum accelerates convergence and smooths noisy updates;
NAG looks ahead to reduce overshooting; and Adagrad adapts the learning rate
for each parameter individually.
Early Stopping with a Validation Set:
1. Training Phase: During the training phase, the model is trained on the
training dataset using Gradient Descent or its variants (e.g., Stochastic
Gradient Descent). The performance of the model is evaluated
periodically (after each epoch or after a certain number of iterations) on
the validation dataset.
2. Stopping Criterion: Training is halted when the validation performance
stops improving for a specified number of evaluations, and the model
parameters from the best validation point are retained.
It is necessary to use a validation set (instead of simply using the test set)
when using early stopping because:
Prevents Data Leakage: Using the test set for early stopping could
introduce data leakage, where information from the test set influences
the training process. This violates the principle of using the test set only
for final evaluation and can lead to biased performance estimates.
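A minimal early-stopping training loop in numpy, assuming a toy
linear-regression task, a patience of 5 epochs, and illustrative
hyperparameters throughout:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(120, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=120)
    X_tr, y_tr = X[:80], y[:80]          # training set
    X_val, y_val = X[80:], y[80:]        # validation set (the test set is never used)

    w = np.zeros(3)
    eta, patience, wait = 0.05, 5, 0
    best_loss, best_w = float("inf"), w.copy()

    for epoch in range(500):
        w = w - eta * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)   # one GD step on train data
        val_loss = np.mean((X_val @ w - y_val) ** 2)           # monitor validation loss only
        if val_loss < best_loss:
            best_loss, best_w, wait = val_loss, w.copy(), 0    # improvement: reset patience
        else:
            wait += 1
            if wait >= patience:                               # stop when validation stalls
                break

    w = best_w   # restore the weights from the best validation point
    print(epoch, best_loss)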
In standard Gradient Descent (GD), the update rule for the parameters w
at each iteration t is given by:
w ← w − η∇L(w)
η: Learning rate, a small positive value that controls the step size.
In GD with Momentum, the update rules are:
vt = γvt−1 + η∇L(w)
w ← w − vt
Where:
vt is the velocity vector and γ is the momentum coefficient.
The first equation accumulates an exponentially decaying average of past
gradients into the velocity; the second equation updates the parameters using
the velocity vector, similar to standard GD but with the velocity term.
Plateaus: Plateaus are flat regions in the loss landscape where the
gradients are close to zero. In standard GD, convergence on plateaus
can be slow as the updates are proportional to the gradient magnitude,
which is small. In GD with Momentum, the momentum term helps the
optimization process to overcome plateaus more effectively by
accumulating velocity over time, allowing the optimizer to escape the
flat regions more efficiently.
Saddle Points: Saddle points are critical points where some dimensions
have positive curvature and others have negative curvature. GD without
Momentum can get stuck at saddle points due to the slow convergence
caused by small gradients. In GD with Momentum, the accumulated
momentum helps the optimization process to move along the directions of
consistent gradient, allowing it to escape the saddle point and continue
making progress.