
MODULE 2

BACKPROPAGATION
Backpropagation
Backpropagation is a technique used in neural networks to improve accuracy, and it is the essence of neural network training. It is the method of fine-tuning the weights of a neural network based on the error rate obtained in the previous epoch (i.e., iteration). Proper tuning of the weights reduces error rates and makes the model more reliable by improving its generalization.
"Backpropagation" is short for "backward propagation of errors." It is a standard method of training artificial neural networks, and it calculates the gradient of a loss function with respect to all the weights in the network.
Working principle:
The backpropagation algorithm computes the gradient of the loss function for a single weight by the chain rule. It efficiently computes one layer at a time. It computes the gradient, but it does not define how the gradient is used; it generalizes the computation in the delta rule.
1. Inputs X arrive through the preconnected path.
2. The input is modelled using real weights W, which are usually selected randomly.
3. Calculate the output of every neuron, from the input layer through the hidden layers to the output layer.
4. Calculate the error in the outputs.
5. Travel back from the output layer to the hidden layers, adjusting the weights so that the error decreases.

The backpropagation algorithm calculates the gradient of the error function and can be written as a function of the neural network. Backpropagation algorithms are a set of methods used to efficiently train artificial neural networks following a gradient-descent approach that exploits the chain rule.
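
As a minimal sketch of these steps, the following Python code trains a single sigmoid neuron with a squared-error loss (the data and starting values are illustrative assumptions, not taken from the text):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative data: one input/target pair.
x, t = 1.5, 0.2      # input and true target (assumed values)
w, b = 0.8, 0.1      # initial weight and bias (normally chosen randomly)
eta = 0.5            # learning rate

for epoch in range(100):
    # Forward pass: compute the neuron's output.
    z = w * x + b
    y = sigmoid(z)
    # Step 4: error in the output (squared-error loss).
    L = 0.5 * (y - t) ** 2
    # Step 5: backward pass, dL/dw = dL/dy * dy/dz * dz/dw (chain rule).
    dL_dy = y - t
    dy_dz = y * (1.0 - y)        # derivative of the sigmoid at z
    dL_dw = dL_dy * dy_dz * x
    dL_db = dL_dy * dy_dz
    # Adjust the weight and bias so that the error decreases.
    w -= eta * dL_dw
    b -= eta * dL_db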

Why Do We Need Backpropagation?

The most prominent advantages of backpropagation are:
 Backpropagation is fast, simple, and easy to program.
 It has no parameters to tune apart from the number of inputs.
 It is a flexible method, as it does not require prior knowledge about the network.
 It is a standard method that generally works well.
 It does not need any special mention of the features of the function to be learned.
Backpropagation with a computational graph abstraction:
Backpropagation with a computational graph abstraction provides a
structured way of understanding how to compute gradients in a neural
network.
The computational graph represents the flow of computations (such as
multiplication, addition, and activation functions) as nodes, where each
node computes a specific operation. The edges between the nodes
represent dependencies between operations.
Key Concepts of Computational Graph for Backpropagation
1. Nodes and Operations:
o Nodes: Represent variables or operations. For example, input
variables, weights, biases, intermediate values, activations,
and the final output.
o Edges: Represent the flow of data or computations. For
instance, an edge between two nodes indicates that the
output of one node is used as an input to another node.
2. Flow of Computation:
o The network performs a series of operations (like weighted
sums, activations) starting from the input layer and moving to
the output layer. This forms the forward pass in the graph.
o Once the output is computed, backpropagation (the backward
pass) starts to compute how each weight and bias affects the
loss function, using the chain rule to propagate gradients
backward through the graph.
Steps for Backpropagation Using Computational Graph
1. Forward Pass:
In the forward pass, data flows from the input layer to the output layer.
During this process, the network computes the activations for each
neuron. Let's break it down:
 Input x → weighted sum of inputs at a neuron: z=w⋅x+b
 Apply activation function a=σ(z)
This continues for each layer in the network. We construct the graph for
each layer's operation (multiplication, addition, activation), and the output
of each operation is stored.
2. Loss Function:
Once the forward pass is complete, the final output is compared to the
true target using a loss function L. This could be something like Mean
Squared Error (MSE) or Cross-Entropy Loss.
The loss function will also be a node in the graph that depends on the
output of the final layer.
3. Backward Pass (Backpropagation):
During backpropagation, we calculate the gradient of the loss function
with respect to each parameter (weights and biases) by following the
reverse path of the computational graph, applying the chain rule at each
step.
Step-by-Step Gradient Calculation:
 Start at the output layer: Compute the derivative of the loss with
respect to the output of the network, ∂L/∂y where y is the output of
the last neuron.
 Backtrack through each layer:
o For each layer, use the chain rule to compute the gradients of
the loss with respect to each neuron’s output. This involves
multiplying the local gradient (derivative of the activation
function) by the gradient of the loss with respect to the output
of the neuron.
o For weights w, biases b, and activations a, the gradients
are computed as:
∂L/∂w = ∂L/∂a ⋅ ∂a/∂z ⋅ ∂z/∂w
∂L/∂b = ∂L/∂a ⋅ ∂a/∂z ⋅ ∂z/∂b

where:
 ∂L/∂a is the gradient from the next layer.
 ∂a/∂z is the derivative of the activation function.
 ∂z/∂w is simply the input to the neuron (x), and ∂z/∂b = 1.
Continue backward through the network: This process is repeated
layer by layer until reaching the input layer.
4. Update Weights and Biases:
Once the gradients are computed for all weights and biases, the network's
parameters are updated using an optimization algorithm like Gradient
Descent:
w_new = w − η ⋅ ∂L/∂w
where η is the learning rate and ∂L/∂w is the gradient for weight w.
Example:
Simple 2-Layer Neural Network
Consider a simple network with one hidden layer:
 Layer 1 (Input → Hidden): computes z1 = w1⋅x + b1, then applies activation a1 = σ(z1).
 Layer 2 (Hidden → Output): computes z2 = w2⋅a1 + b2, then applies activation y = σ(z2).


 Loss Function: compares the output y with the true target t, e.g. the squared error L = ½ (y − t)².
(Example diagram: computational graph of the two-layer network.)
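
A sketch of this two-layer example in Python/NumPy follows (the shapes, data, and the squared-error loss are illustrative assumptions, not specified by the text):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.5, -0.2])                        # assumed 2-dimensional input
t = np.array([1.0])                              # assumed target
w1, b1 = rng.normal(size=(3, 2)), np.zeros(3)    # Layer 1: 2 inputs -> 3 hidden units
w2, b2 = rng.normal(size=(1, 3)), np.zeros(1)    # Layer 2: 3 hidden -> 1 output
eta = 0.1

# Forward pass.
z1 = w1 @ x + b1
a1 = sigmoid(z1)
z2 = w2 @ a1 + b2
y = sigmoid(z2)
L = 0.5 * np.sum((y - t) ** 2)                   # squared-error loss

# Backward pass (chain rule, layer by layer).
delta2 = (y - t) * y * (1 - y)                   # dL/dz2
dw2, db2 = np.outer(delta2, a1), delta2
delta1 = (w2.T @ delta2) * a1 * (1 - a1)         # dL/dz1
dw1, db1 = np.outer(delta1, x), delta1

# Gradient-descent update.
w2 -= eta * dw2; b2 -= eta * db2
w1 -= eta * dw1; b1 -= eta * db1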

Backpropagation with Pre-Activation and Post-Activation Variables:
In neural networks, backpropagation involves computing gradients to
update the model's parameters. The process can vary depending on
whether gradients are computed with respect to post-activation or pre-
activation variables. Understanding these concepts is crucial for the
efficient implementation and optimization of neural networks.
Backpropagation with Pre-Activation
Instead of propagating derivatives through the activated values a^l, we use the derivatives with respect to the pre-activation values z^l.
Backward Pass Equations:
Compute the error at the output layer:
δ^L = ∂J/∂a^L ⊙ σ′(z^L)
where δ^L is the error term at the output layer, J is the loss function, and ⊙ denotes element-wise multiplication.
Propagate the error backward:
δ^l = ((W^(l+1))^T δ^(l+1)) ⊙ σ′(z^l)
Compute weight and bias gradients:
∂J/∂W^l = δ^l (a^(l−1))^T and ∂J/∂b^l = δ^l
Backpropagation with Post-Activation

Here, derivatives are propagated through the activations a^l instead of z^l, using the post-activation error Δ^l = ∂J/∂a^l.
Backward Pass Equations:
Compute the error at the output layer:
Δ^L = ∂J/∂a^L
Propagate the error backward using post-activation values:
Δ^l = (W^(l+1))^T (Δ^(l+1) ⊙ σ′(z^(l+1)))
Compute weight and bias gradients:
∂J/∂W^l = (Δ^l ⊙ σ′(z^l)) (a^(l−1))^T and ∂J/∂b^l = Δ^l ⊙ σ′(z^l)
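
To make the distinction concrete, here is a small sketch of the two recursions, assuming sigmoid activations (the function names are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

# Pre-activation form: delta^l = ((W^(l+1))^T delta^(l+1)) * sigma'(z^l)
def backward_pre(W_next, delta_next, z_l):
    return (W_next.T @ delta_next) * sigmoid_prime(z_l)

# Post-activation form: Delta^l = (W^(l+1))^T (Delta^(l+1) * sigma'(z^(l+1)))
def backward_post(W_next, Delta_next, z_next):
    return W_next.T @ (Delta_next * sigmoid_prime(z_next))

# The two are related by delta^l = Delta^l * sigma'(z^l).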

Gradient-Based Strategies:


Learning Rate Decay:
Learning rate decay is a technique used to adjust the learning rate during neural network
training. It involves gradually reducing the learning rate as training progresses. The
strategy is to start with a relatively high learning rate to accelerate convergence and then
decrease it for more precise adjustments as the model nears the optimal solution.

Why Use Learning Rate Decay?

 Initial Fast Learning: A larger learning rate at the beginning of training allows the
optimizer to make substantial updates quickly, which speeds up learning and helps
avoid shallow local minima.

 Later Precise Adjustments: As the model approaches the minimum, a smaller learning rate enables finer adjustments, reducing the risk of overshooting and promoting smoother convergence.

Without decay, a high learning rate might cause the model to oscillate around the minimum, while starting with a low learning rate could make training excessively slow.
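
A minimal sketch of common decay schedules in Python (the decay constants are illustrative assumptions):

import math

# Step decay: drop the rate by a fixed factor every few epochs.
def step_decay(eta0, epoch, drop=0.5, every=10):
    return eta0 * (drop ** (epoch // every))

# Exponential decay: eta_t = eta0 * exp(-k * t).
def exp_decay(eta0, epoch, k=0.05):
    return eta0 * math.exp(-k * epoch)

# Inverse ("1/t") decay: eta_t = eta0 / (1 + k * t).
def inverse_decay(eta0, epoch, k=0.1):
    return eta0 / (1 + k * epoch)

# Example: start at 0.1 and decay as training progresses.
for epoch in (0, 10, 50):
    print(epoch, step_decay(0.1, epoch), exp_decay(0.1, epoch), inverse_decay(0.1, epoch))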

Momentum-based learning:

Momentum-based learning is an optimization technique used to enhance the performance and convergence of deep learning model training. It refines the gradient-descent algorithm by incorporating past gradient information into the current update.

Concept of Momentum-Based Learning

Momentum-based learning involves adding a fraction of the previous gradient update to the current gradient update. This approach smooths the optimization path, reduces oscillations, and accelerates convergence, particularly in areas with small gradients or narrow valleys.
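
A minimal sketch of one momentum update in Python (the η and β values are illustrative assumptions):

import numpy as np

def momentum_step(w, grad, velocity, eta=0.01, beta=0.9):
    # Keep a fraction beta of the previous update and add the new gradient step.
    velocity = beta * velocity - eta * grad
    return w + velocity, velocity

# Usage: the velocity is carried across iterations.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
g = np.array([0.3, -0.1])            # illustrative gradient
w, v = momentum_step(w, g, v)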
Parameter-Specific Learning Rates (PSLR)

The learning rate is a critical hyperparameter that controls the extent of weight adjustments
based on the gradient of the loss function. A smaller learning rate slows learning but can lead
to precise convergence, while a larger learning rate speeds up learning but may overshoot
optimal solutions.

Traditionally, a single global learning rate is used for all model parameters. However, different parts of a neural network may benefit from distinct learning rates. This is where Parameter-Specific Learning Rates (PSLR) come into play: different learning rates are assigned to different subsets of parameters.
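
A minimal sketch of PSLR in Python, assuming the model's parameters and gradients are kept in dictionaries (the names and rates are illustrative assumptions):

import numpy as np

# Hypothetical parameters and their gradients.
params = {"hidden_w": np.ones((3, 2)), "output_w": np.ones((1, 3))}
grads  = {"hidden_w": np.full((3, 2), 0.1), "output_w": np.full((1, 3), 0.1)}

# Assign a different learning rate to each subset of parameters.
lr = {"hidden_w": 0.01, "output_w": 0.1}

# Update each subset with its own rate.
for name in params:
    params[name] -= lr[name] * grads[name]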
Regularization:
Combining parameter-specific learning rates with parameter-specific regularization can
provide optimized and controlled learning dynamics, particularly when different parts of
the network have varying levels of regularization.

Complex Architectures:
In complex architectures like ResNets, GANs, transformers, or multi-branch networks,
different branches or components may need specific learning rates. PSLR helps manage
the learning dynamics of these complex interactions more effectively.

Comparison with Global Learning Rate:

Aspect                     | Global Learning Rate                                      | Parameter-Specific Learning Rate
---------------------------+-----------------------------------------------------------+---------------------------------------------------------
Learning Rate              | Single learning rate for all parameters                   | Different learning rates for different parameters
Flexibility                | Less flexible; same rate applied uniformly                | More flexible; adapts rates based on parameter needs
Convergence                | May be slower if parameters have different sensitivities  | Can converge faster by optimizing rates for each parameter
Complexity                 | Simple to implement                                       | More complex to manage and tune
Handling Different Scales  | May require manual adjustment                             | Automatically adapts to different parameter scales

Gradient Clipping

Gradient clipping is a technique used to address the problem of exploding gradients during
the training of deep neural networks. Exploding gradients occur when gradients become
excessively large during backpropagation, leading to instability in the training process.
This can result in excessively large updates to the model parameters, causing divergence
and training failure.

Purpose:
The goal of gradient clipping is to cap the gradients to a predefined range or threshold to
ensure they do not exceed a certain size. This stabilizes and controls the training process,
particularly in deep or recurrent neural networks (RNNs), where exploding gradients are
more common.

Types of Gradient Clipping:


1. Gradient Clipping by Value

In this approach, each component of the gradient vector is clipped if it exceeds a specified range, typically defined by a minimum and maximum value. The idea is to limit the magnitude of each gradient element to prevent any from becoming too large.

How it Works:

 Define a minimum threshold (clip_value_min) and a maximum threshold (clip_value_max).
 If any gradient component exceeds clip_value_max, it is set to clip_value_max.
 Conversely, if any gradient component is below clip_value_min, it is set to clip_value_min.

Example:

 Let's assume a gradient vector g = [0.5, -1.2, 3.0, -4.5, 2.0].
 If we clip the values to lie between -2 and 2, the clipped gradient vector becomes g = [0.5, -1.2, 2.0, -2.0, 2.0].

 Gradient Component (gi): Each individual element of the gradient vector, which corresponds to the gradient with respect to a specific parameter in the model. Clipping is applied separately to each component.
 clip_value_max: The upper bound for gradient values. Any gradient component exceeding this maximum is clipped to clip_value_max.
 clip_value_min: The lower bound for gradient values. Any gradient component smaller than this minimum is clipped to clip_value_min.
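
A minimal sketch in Python reproducing the example above (np.clip performs the element-wise clipping):

import numpy as np

def clip_by_value(grad, clip_value_min=-2.0, clip_value_max=2.0):
    # Clip each component of the gradient vector independently.
    return np.clip(grad, clip_value_min, clip_value_max)

g = np.array([0.5, -1.2, 3.0, -4.5, 2.0])
print(clip_by_value(g))              # [ 0.5 -1.2  2.  -2.   2. ]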
2. Gradient Clipping by Norm:

Gradient Clipping by Norm is a technique used to stabilize training by rescaling the entire
gradient vector if its magnitude exceeds a predefined threshold.

How It Works:

1. Compute the Norm: Calculate the norm (typically the Euclidean or L2 norm) of the
gradient vector.
2. Compare with Threshold: If the norm of the gradient vector exceeds the threshold
(clip_norm), rescale the entire vector.
3. Rescaling: Adjust the gradient vector so that its norm equals clip_norm, while
maintaining its direction.
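
A minimal sketch of these three steps in Python (the threshold value is an illustrative assumption):

import numpy as np

def clip_by_norm(grad, clip_norm=5.0):
    norm = np.linalg.norm(grad)              # 1. compute the L2 norm
    if norm > clip_norm:                     # 2. compare with the threshold
        grad = grad * (clip_norm / norm)     # 3. rescale, preserving direction
    return grad

g = np.array([3.0, 4.0, 12.0])               # norm = 13
print(clip_by_norm(g))                       # rescaled so its norm equals 5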

Advantages:

 Preserves Direction: The gradient's direction is maintained, which helps preserve the gradient's relative scales and can lead to more stable convergence.
 Effective for Deep Models: Particularly useful for models where gradients can vary greatly in magnitude, such as RNNs or very deep networks.

Disadvantages:

 Complexity: Slightly more complex to implement than clipping by value, as it involves computing norms and rescaling vectors.
 Potential for Inefficiency: If norms are frequently close to or exceed the threshold, rescaling happens more often, which could affect training dynamics.

Polyak Averaging

Polyak Averaging is a method employed to enhance the convergence and stability of optimization algorithms, especially in iterative methods such as gradient descent. This technique involves computing and using the average of model parameters over multiple iterations to obtain a more stable and potentially improved solution.

Concept of Polyak Averaging:

1. Definition:

o Polyak Averaging entails calculating an average of the model's parameters across various iterations. Instead of relying solely on the parameters from the final iteration, this average is used for making predictions or as the final model.
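
A minimal sketch of Polyak averaging on a toy objective (the objective and constants are illustrative assumptions):

import numpy as np

# Toy quadratic objective L(w) = 0.5 * ||w||^2, whose gradient is w itself.
def grad(w):
    return w

w = np.array([5.0, -3.0])    # initial parameters
avg_w = np.copy(w)           # running average of the iterates
eta, T = 0.1, 100

for t in range(1, T + 1):
    w -= eta * grad(w)               # ordinary gradient-descent step
    avg_w += (w - avg_w) / (t + 1)   # incremental uniform average

# Use avg_w, rather than the final w, as the model parameters.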

Benefits of Polyak Averaging

1. Improves Stability:

o By averaging parameters over multiple iterations, the effects of noisy updates or fluctuations are mitigated, leading to more stable solutions.

2. Reduces Variance:

o This technique helps smooth out noise in the optimization process, potentially resulting in a more accurate and reliable final model.

Applications of Polyak Averaging

1. Stochastic Gradient Descent (SGD):

o Frequently employed with SGD to enhance the stability of the final outcome.

2. Neural Networks:

o Utilized to improve the performance and stability of models trained through iterative optimization techniques.

Local and Spurious Minima:


Local Minima

1. Definition:

o A local minimum is a point in the loss landscape where the loss function has
a lower value than at nearby points but is not necessarily the lowest value
across the entire landscape.

2. Characteristics:

o Surrounding Area: The loss value is lower compared to neighboring points, though it may not be the global minimum.

o Convergence: Optimization algorithms may become trapped in local minima without mechanisms to escape.

Spurious Minima

1. Definition:
o A spurious minimum is a local minimum that does not accurately reflect the
global structure of the loss landscape, often arising from noise or
irregularities in the loss surface.

2. Characteristics:

o Non-Optimal: Spurious minima are generally not true solutions and may not lead to models with good generalization.

o Irregular Landscapes: Common in complex loss landscapes with significant irregularities.
