Module 2
BACKPROPAGATION
In neural networks, the backpropagation technique is used to enhance accuracy, building on many existing training techniques.
Backpropagation is the essence of neural network training. It is the
method of fine-tuning the weights of a neural network based on the error
rate obtained in the previous epoch (i.e., iteration).
Proper tuning of the weights allows you to reduce error rates and make
the model reliable by increasing its generalization.
Backpropagation is short for “backward propagation of errors” and is a standard method of training artificial neural networks.
This method helps calculate the gradient of a loss function with respect to
all the weights in the network.
Working principle:
The backpropagation algorithm computes the gradient of the loss function with respect to a single weight using the chain rule.
It efficiently computes gradients one layer at a time. It computes the gradient, but it does not define how the gradient is used. It generalizes the computation in the delta rule.
1. Inputs X arrive through the preconnected path.
2. The input is modelled using real weights W. The weights are usually randomly selected.
3. Calculate the output for every neuron from the input layer, through the hidden layers, to the output layer.
4. Calculate the error in the outputs.
5. Travel back from the output layer to the hidden layers to adjust the weights so that the error is decreased.
For a single neuron with pre-activation z = w·x + b and activation a = f(z), the chain rule gives
∂L/∂w = ∂L/∂a · ∂a/∂z · ∂z/∂w and ∂L/∂b = ∂L/∂a · ∂a/∂z · ∂z/∂b
where:
∂L/∂a is the gradient from the next layer.
∂a/∂z is the derivative of the activation function.
∂z/∂w and ∂z/∂b are just the inputs to the neuron (∂z/∂w = x and ∂z/∂b = 1).
Continue backward through the network: This process is repeated
layer by layer until reaching the input layer.
4. Update Weights and Biases:
Once the gradients are computed for all weights and biases, the network's
parameters are updated using an optimization algorithm like Gradient
Descent:
w_new = w − η · ∂L/∂w
where η is the learning rate and ∂L/∂w is the gradient for weight w.
Example:
Simple 2-Layer Neural Network
Consider a simple network with one hidden layer: Layer 1 maps the inputs to the hidden units (Input → Hidden) and Layer 2 maps the hidden units to the output (Hidden → Output).
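A minimal NumPy sketch of this example, forward pass, error, backward pass by the chain rule, and weight update; the layer sizes, data, and learning rate are illustrative choices, not values from a specific dataset:

import numpy as np

# Simple 2-layer network (one hidden layer) trained with backpropagation.
np.random.seed(0)
X = np.random.randn(4, 3)                   # 4 samples, 3 input features
y = np.array([[0.0], [1.0], [1.0], [0.0]])  # illustrative targets

W1, b1 = np.random.randn(3, 5), np.zeros((1, 5))   # Layer 1: input -> hidden
W2, b2 = np.random.randn(5, 1), np.zeros((1, 1))   # Layer 2: hidden -> output
eta = 0.1                                          # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(100):
    # Forward pass: input layer -> hidden layer -> output layer
    z1 = X @ W1 + b1
    a1 = sigmoid(z1)
    z2 = a1 @ W2 + b2
    a2 = sigmoid(z2)

    loss = np.mean((a2 - y) ** 2)          # error in the outputs (MSE)

    # Backward pass: chain rule, dL/dw = dL/da * da/dz * dz/dw
    dL_da2 = 2 * (a2 - y) / len(X)         # gradient of the loss w.r.t. the output
    delta2 = dL_da2 * a2 * (1 - a2)        # times the sigmoid derivative
    dW2 = a1.T @ delta2                    # dz2/dW2 is the input to the neuron (a1)
    db2 = delta2.sum(axis=0, keepdims=True)

    delta1 = (delta2 @ W2.T) * a1 * (1 - a1)   # propagate the error back to Layer 1
    dW1 = X.T @ delta1
    db1 = delta1.sum(axis=0, keepdims=True)

    # Update: w_new = w - eta * dL/dw
    W2 -= eta * dW2
    b2 -= eta * db2
    W1 -= eta * dW1
    b1 -= eta * db1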
Learning Rate Decay:
Initial Fast Learning: A larger learning rate at the beginning of training allows the optimizer to make substantial updates quickly, which speeds up learning and helps avoid shallow local minima.
Without decay, a high learning rate might cause the model to oscillate around the minimum, while starting with a low learning rate could make training excessively slow.
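A minimal sketch of one common schedule, exponential decay; the initial rate and decay factor are illustrative values:

# Learning-rate decay: start with a large step size for fast initial progress,
# then shrink it so training settles near the minimum instead of oscillating.
initial_lr, decay_rate = 0.1, 0.96        # illustrative values

def decayed_lr(epoch):
    return initial_lr * decay_rate ** epoch

for epoch in (0, 10, 50, 100):
    print(epoch, decayed_lr(epoch))       # the rate shrinks as training proceeds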
Momentum-based learning:
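Momentum accumulates an exponentially decaying average of past gradients and uses it, rather than the raw gradient alone, to update the weights, which damps oscillations. A minimal sketch; the momentum coefficient beta and learning rate eta are illustrative values:

import numpy as np

# SGD with momentum: the velocity carries information from earlier gradients.
def momentum_step(w, velocity, grad, eta=0.01, beta=0.9):
    velocity = beta * velocity - eta * grad     # accumulate past gradients
    return w + velocity, velocity               # step along the accumulated direction

w, v = np.array([0.5, -1.0]), np.zeros(2)
grad = np.array([0.2, -0.4])                    # gradient from the current batch
w, v = momentum_step(w, v, grad)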
The learning rate is a critical hyperparameter that controls the extent of weight adjustments
based on the gradient of the loss function. A smaller learning rate slows learning but can lead
to precise convergence, while a larger learning rate speeds up learning but may overshoot
optimal solutions.
Traditionally, a single global learning rate is used for all model parameters. However, different parts of a neural network may benefit from distinct learning rates. This is where Parameter-Specific Learning Rates (PSLR) come into play: different learning rates are assigned to different subsets of parameters, as in the sketch below.
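A minimal sketch of parameter-specific learning rates; the parameter names, shapes, and rates are illustrative:

import numpy as np

# Each parameter group gets its own step size.
params = {"hidden_w": np.random.randn(3, 5), "output_w": np.random.randn(5, 1)}
lrs    = {"hidden_w": 1e-2, "output_w": 1e-3}       # smaller rate for the output layer

def pslr_update(params, grads, lrs):
    return {name: params[name] - lrs[name] * grads[name] for name in params}

grads = {name: np.ones_like(p) for name, p in params.items()}   # placeholder gradients
params = pslr_update(params, grads, lrs)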
Regularization:
Combining parameter-specific learning rates with parameter-specific regularization can
provide optimized and controlled learning dynamics, particularly when different parts of
the network have varying levels of regularization.
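A minimal sketch of combining the two, per-parameter learning rates with per-parameter L2 weight decay; all names and values are illustrative:

import numpy as np

# L2 regularization adds decay * w to each parameter's gradient before the update.
params = {"hidden_w": np.random.randn(3, 5), "output_w": np.random.randn(5, 1)}
lrs    = {"hidden_w": 1e-2, "output_w": 1e-3}
decays = {"hidden_w": 1e-4, "output_w": 1e-5}       # per-parameter weight decay

def update(params, grads, lrs, decays):
    return {n: params[n] - lrs[n] * (grads[n] + decays[n] * params[n]) for n in params}

grads = {n: np.ones_like(p) for n, p in params.items()}   # placeholder gradients
params = update(params, grads, lrs, decays)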
Complex Architectures:
In complex architectures like ResNets, GANs, transformers, or multi-branch networks,
different branches or components may need specific learning rates. PSLR helps manage
the learning dynamics of these complex interactions more effectively.
Gradient Clipping
Gradient clipping is a technique used to address the problem of exploding gradients during
the training of deep neural networks. Exploding gradients occur when gradients become
excessively large during backpropagation, leading to instability in the training process.
This can result in excessively large updates to the model parameters, causing divergence
and training failure.
Purpose:
The goal of gradient clipping is to cap the gradients to a predefined range or threshold to
ensure they do not exceed a certain size. This stabilizes and controls the training process,
particularly in deep or recurrent neural networks (RNNs), where exploding gradients are
more common.
How it Works:
1. Gradient Clipping by Value:
Each component of the gradient is clipped independently so that it stays within a fixed range [clip_value_min, clip_value_max]:
g_i = max(clip_value_min, min(g_i, clip_value_max))
where:
Gradient Component (g_i): Each individual element of the gradient vector, which corresponds to the gradient with respect to a specific parameter in the model. Clipping is applied separately to each component.
clip_value_max: The upper bound for gradient values. If any gradient component
exceeds this maximum value, it is clipped to clip_value_max.
clip_value_min: The lower bound for gradient values. Any gradient component smaller than this minimum value is clipped to clip_value_min.
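As an example, a minimal sketch of clipping by value with NumPy; the gradient values and thresholds are illustrative:

import numpy as np

# Each gradient component is clipped independently to [clip_value_min, clip_value_max].
grad = np.array([0.3, -7.2, 12.5, -0.1])
clip_value_min, clip_value_max = -5.0, 5.0

clipped = np.clip(grad, clip_value_min, clip_value_max)
print(clipped)   # components outside [-5, 5] are capped: 0.3, -5.0, 5.0, -0.1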
2. Gradient Clipping by Norm:
Gradient Clipping by Norm is a technique used to stabilize training by rescaling the entire
gradient vector if its magnitude exceeds a predefined threshold.
How It Works:
1. Compute the Norm: Calculate the norm (typically the Euclidean or L2 norm) of the
gradient vector.
2. Compare with Threshold: If the norm of the gradient vector exceeds the threshold
(clip_norm), rescale the entire vector.
3. Rescaling: Adjust the gradient vector so that its norm equals clip_norm, while
maintaining its direction.
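A minimal sketch of clipping by norm; the gradient and threshold are illustrative:

import numpy as np

def clip_by_norm(grad, clip_norm):
    norm = np.linalg.norm(grad)              # 1. compute the L2 norm
    if norm > clip_norm:                     # 2. compare with the threshold
        grad = grad * (clip_norm / norm)     # 3. rescale, keeping the direction
    return grad

g = np.array([3.0, 4.0])                     # norm = 5.0
print(clip_by_norm(g, 1.0))                  # rescaled to norm 1.0: [0.6, 0.8]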
Advantages:
The direction of the gradient is preserved; only its magnitude is limited.
Disadvantages:
The threshold clip_norm is an extra hyperparameter to tune; if it is set too low, updates become very small and training slows down.
Polyak Averaging
1. Definition:
o Polyak averaging keeps a running average of the parameter values visited during training and uses the averaged parameters as the final solution, rather than the last iterate.
1. Improves Stability:
o Averaging smooths out the oscillations of the individual iterates around the minimum.
2. Reduces Variance:
o The averaged estimate is less sensitive to noise in the final few updates.
o Frequently employed with SGD to enhance the stability of the final outcome.
2. Neural Networks:
o In neural network training, a running (often exponentially weighted) average of the weights is commonly maintained and used at evaluation time.
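A minimal sketch of Polyak (running) averaging of the weights during training; the weight values are illustrative:

import numpy as np

# Keep a running average of the weights visited by SGD and use it as the final model.
def polyak_update(w, w_avg, t):
    t += 1
    w_avg = w.copy() if w_avg is None else w_avg + (w - w_avg) / t   # incremental mean
    return w_avg, t

w_avg, t = None, 0
w = np.array([1.0, 2.0])          # current weights after an SGD step
w_avg, t = polyak_update(w, w_avg, t)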
Local Minima
1. Definition:
o A local minimum is a point in the loss landscape where the loss function has
a lower value than at nearby points but is not necessarily the lowest value
across the entire landscape.
2. Characteristics:
o A local minimum can trap gradient-based optimizers, although in the high-dimensional loss landscapes of large neural networks many local minima give similarly good loss values.
Spurious Minima
1. Definition:
o A spurious minimum is a local minimum that does not accurately reflect the
global structure of the loss landscape, often arising from noise or
irregularities in the loss surface.
2. Characteristics:
o Non-Optimal: Spurious minima are generally not true solutions and may
not lead to models with good generalization.