Short Notes On Vanishing & Exploding Gradients
• Imagine the gradient as a signal carrying information about how to adjust the weights. In deep
networks, this signal is repeatedly multiplied, layer by layer, during backpropagation.
• If the gradient involves values less than 1 (common with activation functions like sigmoid), repeated
multiplication shrinks the signal exponentially, eventually making it negligible for earlier layers.
• This "vanishing" information makes it difficult to update weights in these layers and hinders learning
long-term dependencies in sequences (a problem in RNNs).
Vanishing gradient problem
• In general, when using backpropagation and gradient-based learning techniques with ANNs, mainly
in the training stage, a problem called the vanishing gradient problem arises.
• More specifically, in each training iteration, every weight of the neural network is updated based on the
current weight and proportionally to the partial derivative of the error function. However, this
weight update may not occur in some cases due to a vanishingly small gradient, which in the worst case
means that no further training is possible and the neural network stops learning entirely.
• Like some other activation functions, the sigmoid function squashes a large input space into a
tiny output range. Consequently, the derivative of the sigmoid function is small: a large variation at the
input produces only a small variation at the output.
• In a shallow network, only a few layers use these activations, so this is not a significant issue. Using
many more layers, however, causes the gradient to become very small during the training stage, and in
this case the network no longer trains efficiently.
• The back-propagation technique is used to determine the gradients of the neural network. First, it
computes the derivatives of each layer in the reverse direction, starting from the last layer and
progressing back to the first.
• Next, the derivatives of each layer are multiplied together down the network. When the network has
N hidden layers using an activation function such as the sigmoid, this means multiplying N small
derivatives together.
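As an illustrative sketch (not from the original notes): the sigmoid derivative never exceeds 0.25, so even in the best case a chain of N such factors shrinks geometrically.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # maximum value is 0.25, attained at z = 0

# Best case: every layer contributes the maximum derivative 0.25.
for n_layers in (5, 10, 20):
    grad = 0.25 ** n_layers
    print(f"{n_layers} layers -> gradient factor {grad:.2e}")
```

With 20 layers the gradient factor is already below 1e-12, which is why the first layers barely receive any learning signal.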
• Hence, the gradient declines exponentially while propagating back to the first layer. More specifically, the
biases and weights of the first layers cannot be updated efficiently during the training stage because the
gradient is small. Moreover, this condition decreases the overall network accuracy, as these first layers are
frequently critical to recognizing the essential elements of the input data.
• However, this problem can be avoided by employing activation functions that lack the
squashing property, i.e., that do not squash the input space into a small range. The ReLU, which maps
x to max(0, x), is the most popular choice, as it does not yield a small derivative for positive inputs.
• Another solution is to employ a batch normalization layer. As mentioned earlier, the problem
occurs once a large input space is squashed into a small one, causing the derivative to vanish.
Batch normalization mitigates this issue by simply normalizing the input, so that |x| does not
reach the outer, saturated regions of the sigmoid function.
• The normalization process keeps most of the input within the sigmoid's high-derivative region,
which ensures that the gradient is large enough for further learning. Furthermore, faster hardware,
e.g. GPUs, can mitigate the issue in practice by making standard back-propagation feasible for many
more layers before the vanishing gradient problem becomes limiting.
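A small illustrative sketch of the batch-normalization argument (simplified: the learnable scale and shift parameters are omitted): normalizing a batch keeps |x| small, so the sigmoid derivative stays well away from its saturated regions.

```python
import math

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def batch_norm(batch, eps=1e-5):
    # Normalize to zero mean and unit variance (scale/shift omitted).
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / math.sqrt(var + eps) for x in batch]

raw = [12.0, 15.0, 18.0, 21.0]   # large inputs: the sigmoid saturates
normed = batch_norm(raw)

print([round(sigmoid_derivative(x), 6) for x in raw])     # near zero
print([round(sigmoid_derivative(x), 6) for x in normed])  # near the 0.25 peak
```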
Exploding Gradients:
• Conversely, some situations (like large initial weights or activation functions with unbounded gradients)
can cause the signal to explode exponentially as it propagates back.
• This results in massive weight updates, potentially pushing the network far away from the optimal
solution and causing instability.
• Opposite to the vanishing problem is the exploding gradient problem. Specifically, large error
gradients accumulate during back-propagation, leading to extremely large updates to the
weights of the network, which makes the system unstable. Thus, the model loses its
ability to learn effectively. Broadly speaking, moving backward through the network during
back-propagation, the gradient grows exponentially as gradients are repeatedly multiplied. The
weight values can thus become incredibly large and may overflow to a not-a-number (NaN) value.
Some potential solutions include:
1. Using different weight regularization techniques.
2. Redesigning the architecture of the network model.
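An illustrative sketch of the exploding case: per-layer gradient factors above 1 compound just as fast as small ones shrink. Gradient clipping, shown below, is a further common remedy beyond the two listed (the helper name is my own):

```python
# Gradient factors larger than 1 compound exponentially:
factor = 1.5   # e.g. from large initial weights
grad = 1.0
for layer in range(50):
    grad *= factor
print(f"after 50 layers: {grad:.2e}")  # hundreds of millions

# Gradient clipping rescales the gradient vector when its norm
# exceeds a threshold, preventing one huge update:
def clip_by_norm(grads, max_norm=5.0):
    norm = sum(g * g for g in grads) ** 0.5
    if norm > max_norm:
        grads = [g * max_norm / norm for g in grads]
    return grads

print(clip_by_norm([3e8, 4e8]))  # rescaled to norm 5.0
```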
LSTM: A Remedy for Both Issues
• The first part chooses whether the information coming from the previous timestamp is to be
remembered or is irrelevant and can be forgotten.
• In the second part, the cell tries to learn new information from the input to this cell.
• At last, in the third part, the cell passes the updated information from the current timestamp to the
next timestamp.
• This one cycle of the LSTM is considered a single time step.
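The three parts above can be sketched as a single LSTM time step. This is a minimal scalar sketch under assumed notation (U weights multiply the input, W weights multiply the previous hidden state; biases omitted), not the notes' exact formulation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, w):
    """One LSTM time step for scalar inputs/states (vectors in practice)."""
    f = sigmoid(x_t * w['Uf'] + h_prev * w['Wf'])    # part 1: forget gate
    i = sigmoid(x_t * w['Ui'] + h_prev * w['Wi'])    # part 2: input gate
    n = math.tanh(x_t * w['Uc'] + h_prev * w['Wc'])  #         candidate info
    c = f * c_prev + i * n                           # updated cell state
    o = sigmoid(x_t * w['Uo'] + h_prev * w['Wo'])    # part 3: output gate
    h = o * math.tanh(c)                             # new hidden state
    return h, c

# Toy weights, purely for illustration:
weights = {k: 0.5 for k in ('Uf', 'Wf', 'Ui', 'Wi', 'Uc', 'Wc', 'Uo', 'Wo')}
h, c = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.0, w=weights)
print(h, c)
```

Because the cell state is updated additively (old state plus gated new information) rather than by repeated multiplication alone, gradients flow back through many time steps far more easily than in a simple RNN.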
• They control the flow of information into and out of the memory cell (the LSTM cell). The first gate is
called the Forget gate, the second is known as the Input gate, and the last one is the Output gate.
• An LSTM unit consisting of these three gates and a memory cell can be thought of as analogous to
a layer of neurons in a traditional feedforward neural network, with each unit carrying a hidden
state and a cell state.
• Just like a simple RNN, an LSTM also has a hidden state, where Ht-1
represents the hidden state of the previous timestamp and Ht is the
hidden state of the current timestamp.
• The first sentence is “Bob is a nice person,” and the second sentence
is “Dan, on the other hand, is evil.”
• It is very clear that in the first sentence we are talking about Bob, and as
soon as we encounter the full stop (.), we start talking about Dan.
Work of Forget Gate
As we move from the first sentence to the second, our network should
realize that we are no longer talking about Bob; our subject is now Dan. Here, the
Forget gate of the network allows it to forget about Bob.
Let’s understand the roles played by these gates in LSTM architecture.
Forget Gate
• In a cell of the LSTM neural network, the first step is to decide whether we
should keep the information from the previous time step or forget it. Here
is the equation for the forget gate:
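The equation itself did not survive in these notes; a standard formulation, using the same U/W naming convention as the input-gate variables listed later (the subscript f is my labeling), is:

```latex
f_t = \sigma\left( x_t \cdot U_f + H_{t-1} \cdot W_f \right)
```

Since σ is the sigmoid, every element of f_t lies between 0 and 1; multiplying it element-wise with the previous cell state decides how much old information to keep.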
• So, in both these sentences, we are talking about Bob. However, both
give different kinds of information about Bob.
• In the first sentence, we get the information that he knows
swimming, whereas the second sentence tells us that he uses the phone and
served in the navy for four years.
• Now just think about it: based on the context given in the first
sentence, which information in the second sentence is critical?
• Is it that he used the phone, or that he served in the navy?
• The fact that he was in the navy is important information, and this is
something we want our model to remember for future computation.
This is the task of the Input gate.
• The input gate is used to quantify the importance of the new information
carried by the input. Here is the equation of the input gate:
• Here,
• Xt: Input at the current timestamp t
• Ui: weight matrix of input
• Ht-1: A hidden state at the previous timestamp
• Wi: Weight matrix associated with the hidden state (for the input gate)
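Assuming the standard LSTM formulation, the input-gate equation with these variables reads:

```latex
i_t = \sigma\left( x_t \cdot U_i + H_{t-1} \cdot W_i \right)
```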
New Information
• Again, we have applied the sigmoid function over it. As a result, the value of
i at timestamp t will be between 0 and 1.
• Now, the new information that needs to be passed to the cell state is a
function of the hidden state at the previous timestamp t-1 and the input x at
timestamp t. The activation function here is tanh.
• Due to the tanh function, the value of new information will be between -1
and 1. If the value of Nt is negative, the information is subtracted from the
cell state, and if the value is positive, the information is added to the cell
state at the current timestamp.
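Assuming the standard formulation (Uc and Wc are my labels for the candidate-information weights, by analogy with the gate variables above), the new-information equation is:

```latex
N_t = \tanh\left( x_t \cdot U_c + H_{t-1} \cdot W_c \right)
```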
• However, Nt won't be added directly to the cell state. Here comes
the updated cell-state equation:
• Here, Ct-1 is the cell state at the previous timestamp, and the others
are the values we have calculated previously.
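In the standard LSTM formulation, that cell-state update reads (⊙ denotes element-wise multiplication):

```latex
C_t = f_t \odot C_{t-1} + i_t \odot N_t
```

The forget gate f_t scales down the old state, while the input gate i_t scales how much of the new information Nt enters the cell.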
Output Gate
• In this task, we have to complete the second sentence. The
minute we see the word brave, we know that we are talking about a
person. In the sentence, only Bob can be brave; we cannot say the enemy
is brave or the country is brave. So, based on the current expectation,
we have to give a relevant word to fill in the blank. That word is our
output, and this is the function of our Output gate.
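In the standard formulation, the output gate and the resulting hidden state are computed as follows (Uo and Wo follow the same naming convention as the other gates):

```latex
o_t = \sigma\left( x_t \cdot U_o + H_{t-1} \cdot W_o \right), \qquad
H_t = o_t \odot \tanh\left( C_t \right)
```

The output gate thus decides how much of the (squashed) cell state is exposed as the hidden state Ht, which is what the next layer or timestamp actually sees.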