Module 2
BACKPROPAGATION
In neural networks, the backpropagation technique is used to enhance accuracy, building on many existing training techniques.
Backpropagation is the essence of neural network training. It is the
method of fine-tuning the weights of a neural network based on the error
rate obtained in the previous epoch (i.e., iteration).
Proper tuning of the weights allows you to reduce error rates and make
the model reliable by increasing its generalization.
Backpropagation is short for “backward propagation of errors” and is a standard method of training artificial neural networks.
This method helps calculate the gradient of a loss function with respect to
all the weights in the network.
Working principle:
The backpropagation algorithm computes the gradient of the loss function with respect to a single weight using the chain rule.
It efficiently computes gradients one layer at a time. It computes the gradient, but it does not define how the gradient is used. It generalizes the computation in the delta rule.
1. Inputs X arrive through the preconnected path.
2. The input is modelled using real weights W. The weights are usually randomly selected.
3. Calculate the output for every neuron from the input layer, through the hidden layers, to the output layer.
4. Calculate the error in the outputs.
5. Travel back from the output layer to the hidden layers to adjust the weights so that the error is decreased.
For a single neuron with pre-activation z = w·x + b and activation a = f(z), the chain rule gives
∂L/∂w = ∂L/∂a · ∂a/∂z · ∂z/∂w and ∂L/∂b = ∂L/∂a · ∂a/∂z · ∂z/∂b
where:
∂L/∂a is the gradient from the next layer.
∂a/∂z is the derivative of the activation function.
∂z/∂w and ∂z/∂b are just the inputs to the neuron (∂z/∂w = x and ∂z/∂b = 1).
Continue backward through the network: This process is repeated
layer by layer until reaching the input layer.
4. Update Weights and Biases:
Once the gradients are computed for all weights and biases, the network's
parameters are updated using an optimization algorithm like Gradient
Descent:
w_new = w − η · ∂L/∂w
where η is the learning rate and ∂L/∂w is the gradient for weight w.
Example:
Simple 2-Layer Neural Network
Consider a simple network with one hidden layer: Layer 1 maps the inputs to the hidden units (Input → Hidden) and Layer 2 maps the hidden units to the output (Hidden → Output).
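A minimal NumPy sketch of this example, forward pass, error, backward pass by the chain rule, and weight update; the layer sizes, data, and learning rate are illustrative choices, not values from a specific dataset:

import numpy as np

# Simple 2-layer network (one hidden layer) trained with backpropagation.
np.random.seed(0)
X = np.random.randn(4, 3)                   # 4 samples, 3 input features
y = np.array([[0.0], [1.0], [1.0], [0.0]])  # illustrative targets

W1, b1 = np.random.randn(3, 5), np.zeros((1, 5))   # Layer 1: input -> hidden
W2, b2 = np.random.randn(5, 1), np.zeros((1, 1))   # Layer 2: hidden -> output
eta = 0.1                                          # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(100):
    # Forward pass: input layer -> hidden layer -> output layer
    z1 = X @ W1 + b1
    a1 = sigmoid(z1)
    z2 = a1 @ W2 + b2
    a2 = sigmoid(z2)

    loss = np.mean((a2 - y) ** 2)          # error in the outputs (MSE)

    # Backward pass: chain rule, dL/dw = dL/da * da/dz * dz/dw
    dL_da2 = 2 * (a2 - y) / len(X)         # gradient of the loss w.r.t. the output
    delta2 = dL_da2 * a2 * (1 - a2)        # times the sigmoid derivative
    dW2 = a1.T @ delta2                    # dz2/dW2 is the input to the neuron (a1)
    db2 = delta2.sum(axis=0, keepdims=True)

    delta1 = (delta2 @ W2.T) * a1 * (1 - a1)   # propagate the error back to Layer 1
    dW1 = X.T @ delta1
    db1 = delta1.sum(axis=0, keepdims=True)

    # Update: w_new = w - eta * dL/dw
    W2 -= eta * dW2
    b2 -= eta * db2
    W1 -= eta * dW1
    b1 -= eta * db1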
Learning Rate Decay:
Initial Fast Learning: A larger learning rate at the beginning of training allows the optimizer to make substantial updates quickly, which speeds up learning and helps avoid shallow local minima.
Without decay, a high learning rate might cause the model to oscillate around the minimum, while starting with a low learning rate could make training excessively slow.
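A minimal sketch of one common schedule, exponential decay; the initial rate and decay factor are illustrative values:

# Learning-rate decay: start with a large step size for fast initial progress,
# then shrink it so training settles near the minimum instead of oscillating.
initial_lr, decay_rate = 0.1, 0.96        # illustrative values

def decayed_lr(epoch):
    return initial_lr * decay_rate ** epoch

for epoch in (0, 10, 50, 100):
    print(epoch, decayed_lr(epoch))       # the rate shrinks as training proceeds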
Momentum-based learning:
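Momentum accumulates an exponentially decaying average of past gradients and uses it, rather than the raw gradient alone, to update the weights, which damps oscillations. A minimal sketch; the momentum coefficient beta and learning rate eta are illustrative values:

import numpy as np

# SGD with momentum: the velocity carries information from earlier gradients.
def momentum_step(w, velocity, grad, eta=0.01, beta=0.9):
    velocity = beta * velocity - eta * grad     # accumulate past gradients
    return w + velocity, velocity               # step along the accumulated direction

w, v = np.array([0.5, -1.0]), np.zeros(2)
grad = np.array([0.2, -0.4])                    # gradient from the current batch
w, v = momentum_step(w, v, grad)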
The learning rate is a critical hyperparameter that controls the extent of weight adjustments
based on the gradient of the loss function. A smaller learning rate slows learning but can lead
to precise convergence, while a larger learning rate speeds up learning but may overshoot
optimal solutions.
Traditionally, a single global learning rate is used for all model parameters. However, different parts of a neural network may benefit from distinct learning rates. This is where Parameter-Specific Learning Rates (PSLR) come into play: different learning rates are assigned to different subsets of parameters, as in the sketch below.
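A minimal sketch of parameter-specific learning rates; the parameter names, shapes, and rates are illustrative:

import numpy as np

# Each parameter group gets its own step size.
params = {"hidden_w": np.random.randn(3, 5), "output_w": np.random.randn(5, 1)}
lrs    = {"hidden_w": 1e-2, "output_w": 1e-3}       # smaller rate for the output layer

def pslr_update(params, grads, lrs):
    return {name: params[name] - lrs[name] * grads[name] for name in params}

grads = {name: np.ones_like(p) for name, p in params.items()}   # placeholder gradients
params = pslr_update(params, grads, lrs)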
Regularization:
Combining parameter-specific learning rates with parameter-specific regularization can
provide optimized and controlled learning dynamics, particularly when different parts of
the network have varying levels of regularization.
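A minimal sketch of combining the two, per-parameter learning rates with per-parameter L2 weight decay; all names and values are illustrative:

import numpy as np

# L2 regularization adds decay * w to each parameter's gradient before the update.
params = {"hidden_w": np.random.randn(3, 5), "output_w": np.random.randn(5, 1)}
lrs    = {"hidden_w": 1e-2, "output_w": 1e-3}
decays = {"hidden_w": 1e-4, "output_w": 1e-5}       # per-parameter weight decay

def update(params, grads, lrs, decays):
    return {n: params[n] - lrs[n] * (grads[n] + decays[n] * params[n]) for n in params}

grads = {n: np.ones_like(p) for n, p in params.items()}   # placeholder gradients
params = update(params, grads, lrs, decays)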
Complex Architectures:
In complex architectures like ResNets, GANs, transformers, or multi-branch networks,
different branches or components may need specific learning rates. PSLR helps manage
the learning dynamics of these complex interactions more effectively.
Gradient Clipping
Gradient clipping is a technique used to address the problem of exploding gradients during
the training of deep neural networks. Exploding gradients occur when gradients become
excessively large during backpropagation, leading to instability in the training process.
This can result in excessively large updates to the model parameters, causing divergence
and training failure.
Purpose:
The goal of gradient clipping is to cap the gradients to a predefined range or threshold to
ensure they do not exceed a certain size. This stabilizes and controls the training process,
particularly in deep or recurrent neural networks (RNNs), where exploding gradients are
more common.
How it Works:
1. Gradient Clipping by Value:
Each component of the gradient is clipped independently so that it stays within a fixed range [clip_value_min, clip_value_max]:
g_i = max(clip_value_min, min(g_i, clip_value_max))
where:
Gradient Component (g_i): Each individual element of the gradient vector, which corresponds to the gradient with respect to a specific parameter in the model. Clipping is applied separately to each component.
clip_value_max: The upper bound for gradient values. If any gradient component
exceeds this maximum value, it is clipped to clip_value_max.
clip_value_min: The lower bound for gradient values. Any gradient component smaller than this minimum value is clipped to clip_value_min.
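As an example, a minimal sketch of clipping by value with NumPy; the gradient values and thresholds are illustrative:

import numpy as np

# Each gradient component is clipped independently to [clip_value_min, clip_value_max].
grad = np.array([0.3, -7.2, 12.5, -0.1])
clip_value_min, clip_value_max = -5.0, 5.0

clipped = np.clip(grad, clip_value_min, clip_value_max)
print(clipped)   # components outside [-5, 5] are capped: 0.3, -5.0, 5.0, -0.1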
2. Gradient Clipping by Norm:
Gradient Clipping by Norm is a technique used to stabilize training by rescaling the entire
gradient vector if its magnitude exceeds a predefined threshold.
How It Works:
1. Compute the Norm: Calculate the norm (typically the Euclidean or L2 norm) of the
gradient vector.
2. Compare with Threshold: If the norm of the gradient vector exceeds the threshold
(clip_norm), rescale the entire vector.
3. Rescaling: Adjust the gradient vector so that its norm equals clip_norm, while
maintaining its direction.
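A minimal sketch of clipping by norm; the gradient and threshold are illustrative:

import numpy as np

def clip_by_norm(grad, clip_norm):
    norm = np.linalg.norm(grad)              # 1. compute the L2 norm
    if norm > clip_norm:                     # 2. compare with the threshold
        grad = grad * (clip_norm / norm)     # 3. rescale, keeping the direction
    return grad

g = np.array([3.0, 4.0])                     # norm = 5.0
print(clip_by_norm(g, 1.0))                  # rescaled to norm 1.0: [0.6, 0.8]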
Advantages:
The direction of the gradient is preserved; only its magnitude is limited.
Disadvantages:
The threshold clip_norm is an extra hyperparameter to tune; if it is set too low, updates become very small and training slows down.
Polyak Averaging
1. Definition:
o Polyak averaging keeps a running average of the parameter values visited during training and uses the averaged parameters as the final solution, rather than the last iterate.
1. Improves Stability:
o Averaging smooths out the oscillations of the individual iterates around the minimum.
2. Reduces Variance:
o The averaged estimate is less sensitive to noise in the final few updates.
o Frequently employed with SGD to enhance the stability of the final outcome.
2. Neural Networks:
o In neural network training, a running (often exponentially weighted) average of the weights is commonly maintained and used at evaluation time.
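A minimal sketch of Polyak (running) averaging of the weights during training; the weight values are illustrative:

import numpy as np

# Keep a running average of the weights visited by SGD and use it as the final model.
def polyak_update(w, w_avg, t):
    t += 1
    w_avg = w.copy() if w_avg is None else w_avg + (w - w_avg) / t   # incremental mean
    return w_avg, t

w_avg, t = None, 0
w = np.array([1.0, 2.0])          # current weights after an SGD step
w_avg, t = polyak_update(w, w_avg, t)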
Local Minima
1. Definition:
o A local minimum is a point in the loss landscape where the loss function has
a lower value than at nearby points but is not necessarily the lowest value
across the entire landscape.
2. Characteristics:
o A local minimum can trap gradient-based optimizers, although in the high-dimensional loss landscapes of large neural networks many local minima give similarly good loss values.
Spurious Minima
1. Definition:
o A spurious minimum is a local minimum that does not accurately reflect the
global structure of the loss landscape, often arising from noise or
irregularities in the loss surface.
2. Characteristics:
o Non-Optimal: Spurious minima are generally not true solutions and may
not lead to models with good generalization.