Short Notes On Vanishing & Exploding Gradients

What is LSTM?

LSTM (Long Short-Term Memory) is a recurrent neural network (RNN) architecture widely used in Deep Learning. It excels at capturing long-term dependencies, making it ideal for sequence prediction tasks.
Backpropagation
• Backpropagation, the heart of training neural networks, is susceptible
to two potential roadblocks: exploding and vanishing gradients. These
issues can significantly hinder your network's learning ability, so
understanding and addressing them is crucial.

• Backpropagation propagates the error from the prediction back to the weights and biases. In recurrent networks such as the RNN and the LSTM, this procedure is also called Backpropagation Through Time (BPTT), because it propagates through all the time steps even though the weight and bias matrices are the same at every step.
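As a rough illustration of why repeating the same matrices through time matters, here is a small NumPy sketch (not part of the original notes; the matrix sizes and scales are arbitrary assumptions). The gradient that reaches the earliest time step is a product of the same recurrent Jacobian applied once per step, so it shrinks or grows exponentially depending on how large that matrix is:

```python
import numpy as np

# Sketch of BPTT: the gradient reaching the earliest time step contains a
# product of T copies of the same recurrent Jacobian (simplified here to the
# recurrent weight matrix W itself).
rng = np.random.default_rng(0)

def bptt_gradient_norm(scale, T=50, size=8):
    W = scale * rng.standard_normal((size, size)) / np.sqrt(size)
    grad = np.ones(size)              # gradient arriving at the last time step
    for _ in range(T):                # propagate back through T identical steps
        grad = W.T @ grad
    return np.linalg.norm(grad)

print("small recurrent weights:", bptt_gradient_norm(scale=0.5))  # shrinks toward 0
print("large recurrent weights:", bptt_gradient_norm(scale=1.5))  # grows enormously
```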
Vanishing Gradients:

• Imagine the gradient as a signal carrying information about how to adjust the weights. In deep
networks, this signal is multiplied during backpropagation.
• If the gradient involves values less than 1 (common with activation functions like sigmoid), repeated
multiplication shrinks the signal exponentially, eventually making it negligible for earlier layers.
• This "vanishing" information makes it difficult to update weights in these layers and hinders learning
long-term dependencies in sequences (a problem in RNNs).
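A minimal numeric sketch of this shrinkage (illustrative only): the sigmoid derivative is at most 0.25, so even in the best case the backpropagated signal falls off exponentially with the number of layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)              # peaks at 0.25 when z == 0

# Backpropagation multiplies one such factor per layer; even the best-case
# factor of 0.25 per layer makes the signal vanish exponentially with depth.
for depth in (5, 10, 20, 40):
    print(f"{depth:2d} layers -> gradient factor {sigmoid_derivative(0.0) ** depth:.2e}")
```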


Vanishing gradient problem
• In general, when using backpropagation and gradient-based learning techniques along with ANNs, largely
in the training stage, a problem called the vanishing gradient problem arises.
• More specifically, in each training iteration, every weight of the neural network is updated from its current value by an amount proportional to the partial derivative of the error function with respect to that weight. In some cases, however, this gradient is vanishingly small, so the update barely occurs; in the worst case, the weights stop changing and the neural network stops learning altogether.
• The sigmoid function, like other saturating activation functions, squashes a large input range into a tiny output range. A large change in the input therefore produces only a small change in the output, so the derivative of the sigmoid is small.
• In a shallow network with only a few layers that use these activations, this is not a significant problem. However, when many more layers are used, the gradient becomes too small for training to work effectively.
• The back-propagation technique is used to determine the gradients of the neural network: it computes the derivatives of each layer in the reverse direction, starting from the last layer and progressing back to the first layer.
• By the chain rule, the derivatives of each layer are then multiplied down the network. When there are N hidden layers that employ an activation function such as the sigmoid, N small derivatives are multiplied together.

• Hence, the gradient declines exponentially while propagating back to the first layer. Because the gradient there is small, the biases and weights of the first layers cannot be updated efficiently during the training stage. This also decreases the overall network accuracy, as these first layers are frequently critical for recognizing the essential elements of the input data.
• Such a problem can be avoided by employing activation functions that lack the squashing property, i.e., that do not squeeze the input space into a small output range. The most popular choice is the ReLU, which maps x to max(0, x) and therefore does not yield a small derivative for positive inputs.
• Another solution involves employing a batch normalization layer. As mentioned earlier, the problem occurs once a large input space is squashed into a small space, causing the derivative to vanish. Batch normalization mitigates this issue by normalizing the input so that |x| does not reach the outer, saturating regions of the sigmoid function.
• Normalization keeps most of the inputs in the central region of the sigmoid, where the derivative remains large enough for learning to proceed. Furthermore, faster hardware such as GPUs helps, since it makes standard back-propagation feasible for much deeper networks within a reasonable training time.
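To see both the problem and the ReLU remedy in a concrete setting, here is an illustrative sketch (PyTorch is assumed here; the notes do not name a framework, and the depth, width, and data are arbitrary). It builds a deep multilayer perceptron, runs one backward pass, and reports the gradient norm that reaches the first layer:

```python
import torch
import torch.nn as nn

def first_layer_grad_norm(activation, depth=20, width=64, seed=0):
    """Build a deep MLP, run one backward pass, and report the gradient
    norm at the first layer, where vanishing gradients hit hardest."""
    torch.manual_seed(seed)
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), activation()]
        # Inserting nn.BatchNorm1d(width) here would also help, as described above.
    layers.append(nn.Linear(width, 1))
    model = nn.Sequential(*layers)

    x = torch.randn(32, width)
    loss = model(x).pow(2).mean()
    loss.backward()
    return model[0].weight.grad.norm().item()

print("sigmoid first-layer gradient norm:", first_layer_grad_norm(nn.Sigmoid))  # tiny
print("relu    first-layer gradient norm:", first_layer_grad_norm(nn.ReLU))     # far larger
```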
Exploding Gradients:
• Conversely, some situations (like large initial weights or activation functions with unbounded gradients)
can cause the signal to explode exponentially as it propagates back.
• This results in massive weight updates, potentially pushing the network far away from the optimal
solution and causing instability.
• The exploding gradient problem is the opposite of the vanishing one: large error gradients accumulate during back-propagation. This leads to extremely large updates to the network weights, so the system becomes unstable and the model loses its ability to learn effectively. Roughly speaking, as we move backward through the network during back-propagation, the gradient grows exponentially because gradients are multiplied repeatedly. The weight values can thus become extremely large and may overflow to a not-a-number (NaN) value. Some potential solutions include:
1. Using different weight regularization techniques.
2. Redesigning the architecture of the network model.
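A small NumPy sketch of the overflow behaviour described above and of gradient clipping by norm, one of the most common fixes (the factor 3.0 and the clipping threshold 1.0 are arbitrary illustrative choices):

```python
import numpy as np

# Repeatedly multiplying by a factor greater than 1 makes the value overflow.
grad = np.float32(1.0)
for _ in range(200):
    grad = grad * np.float32(3.0)     # stand-in for a large per-layer gradient factor
print("unclipped gradient after 200 steps:", grad)   # overflows to inf

def clip_by_norm(gradient, max_norm=1.0):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(gradient)
    if norm > max_norm:
        gradient = gradient * (max_norm / norm)
    return gradient

g = np.array([30.0, -40.0])           # an 'exploding' gradient with norm 50
print("clipped gradient:", clip_by_norm(g))   # same direction, norm capped at 1.0
```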
Consequences of both issues:

• Slow or stalled learning: The network struggles to improve or even diverges from the desired outcome.
• Difficulty finding optimal weights: The update process becomes
erratic, hindering efficient optimization.
• Wasting computational resources: Training takes longer without
achieving optimal results.
Strategies to tackle these issues:
• Use activation functions with non-saturating gradients: Consider
alternatives like ReLU or leaky ReLU that avoid diminishing the signal
significantly.
• Weight initialization: Initialize weights carefully to prevent large initial
gradients. Techniques like Xavier initialization can help.
• Gradient clipping: Set a threshold to prevent gradients from becoming too
large, stabilizing the training process (see the combined sketch after this list).
• RNN-specific approaches: Utilize LSTM or GRU cells, designed to combat
vanishing gradients in recurrent networks.
• Adaptive learning rate methods: Algorithms like RMSprop or Adam adjust
learning rates dynamically, mitigating both vanishing and exploding
gradients.
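The sketch below ties several of these strategies together in one place (illustrative only; PyTorch, the layer sizes, and the hyperparameters are all assumptions rather than something the notes prescribe): a non-saturating activation, Xavier initialization, gradient clipping, and the Adam optimizer.

```python
import torch
import torch.nn as nn

# A small model using a non-saturating activation (ReLU).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))

# Weight initialization: Xavier (Glorot) for the linear layers.
for module in model.modules():
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)

# Adaptive learning rate method.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(8, 16), torch.randn(8, 1)
for _ in range(5):                     # a few illustrative training steps
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    # Gradient clipping: cap the global gradient norm before the update.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```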
Why LSTM?
Traditional ANN vs LSTM
• Unlike traditional neural networks, LSTM incorporates feedback
connections, allowing it to process entire sequences of data, not just
individual data points. This makes it highly effective in understanding
and predicting patterns in sequential data like time series, text, and
speech.

• LSTM has become a powerful tool in artificial intelligence and deep learning, enabling breakthroughs in various fields by uncovering valuable insights from sequential data.
One look at LSTM
https://youtu.be/Z03f7Wu5a6A
LSTM Architecture

• In the introduction to long short-term memory, we learned that it resolves the vanishing gradient problem faced by RNNs. In this section, we will see how it does so by studying the architecture of the LSTM.

• At a high level, LSTM works very much like an RNN cell.


• The following slides walk through the internal functioning of the LSTM network.
• The LSTM network architecture consists of three parts, described below, and each part performs an individual function.
The Logic Behind LSTM

• The first part chooses whether the information coming from the previous timestamp is to be
remembered or is irrelevant and can be forgotten.
• In the second part, the cell tries to learn new information from the input to this cell.
• At last, in the third part, the cell passes the updated information from the current timestamp to the
next timestamp.
• This one cycle of the LSTM is considered a single time step.

• These three parts of an LSTM unit are known as gates.

• They control the flow of information in and out of the memory cell, or LSTM cell. The first gate is called the Forget gate, the second gate is known as the Input gate, and the last one is the Output gate.
• An LSTM unit consisting of these three gates and a memory cell can be thought of as analogous to a layer of neurons in a traditional feedforward neural network, with each unit maintaining a hidden state and a current cell state.
• Just like a simple RNN, an LSTM also has a hidden state, where H(t-1) represents the hidden state of the previous timestamp and H(t) is the hidden state of the current timestamp.
• In addition to that, an LSTM also has a cell state, represented by C(t-1) and C(t) for the previous and current timestamps, respectively.
• Here the hidden state is known as short-term memory, and the cell state is known as long-term memory.
• It is interesting to note that the cell state carries its information along through all the timestamps.
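A minimal sketch (assuming PyTorch) showing where both memories appear in practice: an LSTM layer returns the hidden state for every time step plus the final hidden state H (short-term memory) and final cell state C (long-term memory).

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)

x = torch.randn(4, 7, 10)             # batch of 4 sequences, 7 time steps each
outputs, (h_n, c_n) = lstm(x)

print(outputs.shape)  # (4, 7, 20): hidden state H(t) at every time step
print(h_n.shape)      # (1, 4, 20): final hidden state -> short-term memory
print(c_n.shape)      # (1, 4, 20): final cell state   -> long-term memory
```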
Example of LSTM Working

• Let’s take an example to understand how LSTM works.


• Here we have two sentences separated by a full stop.

• The first sentence is “Bob is a nice person,” and the second sentence is “Dan, on the other hand, is evil”.

• It is very clear that in the first sentence we are talking about Bob, and as soon as we encounter the full stop (.), we start talking about Dan.
Work of Forget Gate
As we move from the first sentence to the second, our network should realize that we are no longer talking about Bob; the subject is now Dan. The Forget gate is what allows the network to forget Bob.
Let’s understand the roles played by these gates in LSTM architecture.
Forget Gate

• In a cell of the LSTM network, the first step is to decide whether we should keep the information from the previous time step or forget it. Here is the equation for the forget gate:

  ft = σ( Xt · Uf + Ht-1 · Wf )

• Let’s try to understand the equation. Here,
• Xt: input to the current timestamp
• Uf: weight matrix associated with the input
• Ht-1: the hidden state of the previous timestamp
• Wf: the weight matrix associated with the hidden state
Forget Everything / Nothing
• A sigmoid function is applied to this sum, which makes ft a number between 0 and 1.
• This ft is then multiplied element-wise with the cell state of the previous timestamp, Ct-1. If ft is 0, the network forgets everything from the previous cell state; if ft is 1, it forgets nothing.
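A minimal sketch of the forget-gate computation just described, reusing the symbol names from the notes (Uf, Wf); the sizes and random values are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
input_size, hidden_size = 10, 20

# Illustrative parameters, named as in the notes.
U_f = torch.randn(input_size, hidden_size)    # weights applied to the input Xt
W_f = torch.randn(hidden_size, hidden_size)   # weights applied to the hidden state Ht-1

x_t    = torch.randn(1, input_size)           # Xt: input at the current timestamp
h_prev = torch.randn(1, hidden_size)          # Ht-1: hidden state of the previous timestamp
c_prev = torch.randn(1, hidden_size)          # Ct-1: cell state of the previous timestamp

# ft = sigmoid(Xt·Uf + Ht-1·Wf): every entry ends up between 0 and 1.
f_t = torch.sigmoid(x_t @ U_f + h_prev @ W_f)

# Element-wise multiplication with the previous cell state:
# an entry of ft near 0 forgets that part of Ct-1, an entry near 1 keeps it.
c_prev_filtered = f_t * c_prev
```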
Input Gate

• Let’s take another example.


• “Bob knows swimming. He told me over the phone that he had
served the navy for four long years.”

• So, in both these sentences, we are talking about Bob. However, both
give different kinds of information about Bob.
• In the first sentence, we learn that he knows swimming, whereas the second sentence tells us that he used the phone and served in the navy for four years.
• Now just think about it: based on the context given in the first sentence, which piece of information in the second sentence is critical?
• That he used the phone to pass on the message, or that he served in the navy?

• In this context, it doesn’t matter whether he used the phone or any other medium of communication to pass on the information.

• The fact that he was in the navy is important information, and this is
something we want our model to remember for future computation.
This is the task of the Input gate.
• The input gate is used to quantify the importance of the new information carried by the input. Here is the equation of the input gate:

  it = σ( Xt · Ui + Ht-1 · Wi )

• Here,
• Xt: Input at the current timestamp t
• Ui: weight matrix of input
• Ht-1: A hidden state at the previous timestamp
• Wi: Weight matrix of input associated with hidden state
New Information

• Again, a sigmoid function is applied to this sum. As a result, the value of the input gate it at timestamp t will be between 0 and 1.

• The new information Nt that needs to be passed to the cell state is a function of the hidden state at the previous timestamp t-1 and the input x at timestamp t. The activation function here is tanh.
• Due to the tanh function, the value of the new information will be between -1 and 1. If the value of Nt is negative, the information is subtracted from the cell state, and if the value is positive, the information is added to the cell state at the current timestamp.
• However, Nt won’t be added directly to the cell state. Here comes the updated cell-state equation:

  Ct = ft · Ct-1 + it · Nt

• Here, Ct-1 is the cell state at the previous timestamp, and the others are the values we have calculated previously.
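Continuing the same sketch (the candidate weights Uc and Wc are names introduced here for illustration; the notes do not define them), the input gate and the cell-state update look like this:

```python
# Continues the forget-gate sketch above (same x_t, h_prev, c_prev, f_t).
U_i, W_i = torch.randn(input_size, hidden_size), torch.randn(hidden_size, hidden_size)
U_c, W_c = torch.randn(input_size, hidden_size), torch.randn(hidden_size, hidden_size)

# it = sigmoid(Xt·Ui + Ht-1·Wi): how much of the new information to let in (0 to 1).
i_t = torch.sigmoid(x_t @ U_i + h_prev @ W_i)

# Nt = tanh(Xt·Uc + Ht-1·Wc): the candidate new information, between -1 and 1.
n_t = torch.tanh(x_t @ U_c + h_prev @ W_c)

# Ct = ft·Ct-1 + it·Nt: keep what the forget gate allows, add what the input gate allows.
c_t = f_t * c_prev + i_t * n_t
```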
Output Gate

• Now consider this sentence.


• “Bob single-handedly fought the enemy and died for his country. For
his contributions, brave______.”

• In this task, we have to complete the second sentence. The minute we see the word “brave”, we know that we are talking about a person. In the sentence, only Bob is brave; we cannot say the enemy is brave or the country is brave. So, based on the current expectation, we have to give a relevant word to fill in the blank. That word is our output, and this is the function of the Output gate.
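The notes stop before writing out the output-gate formula, so the final piece of the sketch below follows the standard LSTM equations (a hedged reconstruction, not something stated in the notes): the output gate decides which parts of the cell state become the new hidden state.

```python
# Continues the sketch: the output gate produces the new hidden state.
U_o, W_o = torch.randn(input_size, hidden_size), torch.randn(hidden_size, hidden_size)

# ot = sigmoid(Xt·Uo + Ht-1·Wo): which parts of the cell state to expose.
o_t = torch.sigmoid(x_t @ U_o + h_prev @ W_o)

# Ht = ot · tanh(Ct): the new hidden state (short-term memory) passed to the next
# timestamp, e.g. the representation used to predict the word that fills the blank.
h_t = o_t * torch.tanh(c_t)
```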
