Short Notes On Vanishing & Exploding Gradients
• Imagine the gradient as a signal carrying information about how to adjust the weights. In deep
networks, this signal is repeatedly multiplied, layer by layer, during backpropagation.
• If the gradient involves values less than 1 (common with activation functions like sigmoid), repeated
multiplication shrinks the signal exponentially, eventually making it negligible for earlier layers.
• This "vanishing" information makes it difficult to update weights in these layers and hinders learning
long-term dependencies in sequences (a problem in RNNs).
Vanishing gradient problem
• In general, when using backpropagation and gradient-based learning techniques with ANNs, mainly
in the training stage, a problem called the vanishing gradient problem arises.
• More specifically, in each training iteration, every weight of the neural network is updated based on the
current weight and proportionally to the partial derivative of the error function. However, this
weight update may not occur in some cases due to a vanishingly small gradient, which in the worst case
means that no further training is possible and the neural network stops learning entirely.
• Like some other activation functions, the sigmoid function squashes a large input space into a
tiny output range. Consequently, the derivative of the sigmoid function is small: a large variation at the
input produces only a small variation at the output.
• In a shallow network, only a few layers use these activations, so this is not a significant issue. Using
many more layers, however, causes the gradient to become very small during the training stage, and in
this case the network no longer trains efficiently.
• The back-propagation technique is used to determine the gradients of the neural network. First, it
computes the derivatives of each layer in the reverse direction, starting from the last layer and
progressing back to the first.
• Next, the derivatives of each layer are multiplied together down the network. When the network has
N hidden layers using an activation function such as the sigmoid, this means multiplying N small
derivatives together.
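As an illustrative sketch (not from the original notes): the sigmoid derivative never exceeds 0.25, so even in the best case a chain of N such factors shrinks geometrically.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # maximum value is 0.25, attained at z = 0

# Best case: every layer contributes the maximum derivative 0.25.
for n_layers in (5, 10, 20):
    grad = 0.25 ** n_layers
    print(f"{n_layers} layers -> gradient factor {grad:.2e}")
```

With 20 layers the gradient factor is already below 1e-12, which is why the first layers barely receive any learning signal.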
• Hence, the gradient declines exponentially while propagating back to the first layer. More specifically, the
biases and weights of the first layers cannot be updated efficiently during the training stage because the
gradient is small. Moreover, this condition decreases the overall network accuracy, as these first layers are
frequently critical to recognizing the essential elements of the input data.
• However, this problem can be avoided by employing activation functions that lack the
squashing property, i.e., that do not squash the input space into a small range. The ReLU, which maps
x to max(0, x), is the most popular choice, as it does not yield a small derivative for positive inputs.
• Another solution is to employ a batch normalization layer. As mentioned earlier, the problem
occurs once a large input space is squashed into a small one, causing the derivative to vanish.
Batch normalization mitigates this issue by simply normalizing the input, so that |x| does not
reach the outer, saturated regions of the sigmoid function.
• The normalization process keeps most of the input within the sigmoid's high-derivative region,
which ensures that the gradient is large enough for further learning. Furthermore, faster hardware,
e.g. GPUs, can mitigate the issue in practice by making standard back-propagation feasible for many
more layers before the vanishing gradient problem becomes limiting.
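A small illustrative sketch of the batch-normalization argument (simplified: the learnable scale and shift parameters are omitted): normalizing a batch keeps |x| small, so the sigmoid derivative stays well away from its saturated regions.

```python
import math

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def batch_norm(batch, eps=1e-5):
    # Normalize to zero mean and unit variance (scale/shift omitted).
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / math.sqrt(var + eps) for x in batch]

raw = [12.0, 15.0, 18.0, 21.0]   # large inputs: the sigmoid saturates
normed = batch_norm(raw)

print([round(sigmoid_derivative(x), 6) for x in raw])     # near zero
print([round(sigmoid_derivative(x), 6) for x in normed])  # near the 0.25 peak
```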
Exploding Gradients:
• Conversely, some situations (like large initial weights or activation functions with unbounded gradients)
can cause the signal to explode exponentially as it propagates back.
• This results in massive weight updates, potentially pushing the network far away from the optimal
solution and causing instability.
• Opposite to the vanishing problem is the exploding gradient problem. Specifically, large error
gradients accumulate during back-propagation, leading to extremely large updates to the
weights of the network, which makes the system unstable. Thus, the model loses its
ability to learn effectively. Broadly speaking, moving backward through the network during
back-propagation, the gradient grows exponentially as gradients are repeatedly multiplied. The
weight values can thus become incredibly large and may overflow to a not-a-number (NaN) value.
Some potential solutions include:
1. Using different weight regularization techniques.
2. Redesigning the architecture of the network model.
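An illustrative sketch of the exploding case: per-layer gradient factors above 1 compound just as fast as small ones shrink. Gradient clipping, shown below, is a further common remedy beyond the two listed (the helper name is my own):

```python
# Gradient factors larger than 1 compound exponentially:
factor = 1.5   # e.g. from large initial weights
grad = 1.0
for layer in range(50):
    grad *= factor
print(f"after 50 layers: {grad:.2e}")  # hundreds of millions

# Gradient clipping rescales the gradient vector when its norm
# exceeds a threshold, preventing one huge update:
def clip_by_norm(grads, max_norm=5.0):
    norm = sum(g * g for g in grads) ** 0.5
    if norm > max_norm:
        grads = [g * max_norm / norm for g in grads]
    return grads

print(clip_by_norm([3e8, 4e8]))  # rescaled to norm 5.0
```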
LSTM: A Remedy for Both Issues
• The first part chooses whether the information coming from the previous timestamp is to be
remembered or is irrelevant and can be forgotten.
• In the second part, the cell tries to learn new information from the input to this cell.
• At last, in the third part, the cell passes the updated information from the current timestamp to the
next timestamp.
• This one cycle of the LSTM is considered a single time step.
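The three parts above can be sketched as a single LSTM time step. This is a minimal scalar sketch under assumed notation (U weights multiply the input, W weights multiply the previous hidden state; biases omitted), not the notes' exact formulation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, w):
    """One LSTM time step for scalar inputs/states (vectors in practice)."""
    f = sigmoid(x_t * w['Uf'] + h_prev * w['Wf'])    # part 1: forget gate
    i = sigmoid(x_t * w['Ui'] + h_prev * w['Wi'])    # part 2: input gate
    n = math.tanh(x_t * w['Uc'] + h_prev * w['Wc'])  #         candidate info
    c = f * c_prev + i * n                           # updated cell state
    o = sigmoid(x_t * w['Uo'] + h_prev * w['Wo'])    # part 3: output gate
    h = o * math.tanh(c)                             # new hidden state
    return h, c

# Toy weights, purely for illustration:
weights = {k: 0.5 for k in ('Uf', 'Wf', 'Ui', 'Wi', 'Uc', 'Wc', 'Uo', 'Wo')}
h, c = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.0, w=weights)
print(h, c)
```

Because the cell state is updated additively (old state plus gated new information) rather than by repeated multiplication alone, gradients flow back through many time steps far more easily than in a simple RNN.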
• They control the flow of information into and out of the memory cell (the LSTM cell). The first gate is
called the Forget gate, the second is known as the Input gate, and the last one is the Output gate.
• An LSTM unit consisting of these three gates and a memory cell can be thought of as analogous to
a layer of neurons in a traditional feedforward neural network, with each unit carrying a hidden
state and a cell state.
• Just like a simple RNN, an LSTM also has a hidden state, where Ht-1
represents the hidden state of the previous timestamp and Ht is the
hidden state of the current timestamp.
• The first sentence is “Bob is a nice person,” and the second sentence
is “Dan, on the other hand, is evil.”
• It is very clear that in the first sentence we are talking about Bob, and as
soon as we encounter the full stop (.), we start talking about Dan.
Work of Forget Gate
As we move from the first sentence to the second, our network should
realize that we are no longer talking about Bob; our subject is now Dan. Here, the
Forget gate of the network allows it to forget about Bob.
Let’s understand the roles played by these gates in LSTM architecture.
Forget Gate
• In a cell of the LSTM neural network, the first step is to decide whether we
should keep the information from the previous time step or forget it. Here
is the equation for the forget gate:
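The equation itself did not survive in these notes; a standard formulation, using the same U/W naming convention as the input-gate variables listed later (the subscript f is my labeling), is:

```latex
f_t = \sigma\left( x_t \cdot U_f + H_{t-1} \cdot W_f \right)
```

Since σ is the sigmoid, every element of f_t lies between 0 and 1; multiplying it element-wise with the previous cell state decides how much old information to keep.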
• So, in both these sentences, we are talking about Bob. However, both
give different kinds of information about Bob.
• In the first sentence, we get the information that he knows
swimming, whereas the second sentence tells us that he uses the phone and
served in the navy for four years.
• Now just think about it: based on the context given in the first
sentence, which information in the second sentence is critical?
• Is it that he used the phone, or that he served in the navy?
• The fact that he was in the navy is important information, and this is
something we want our model to remember for future computation.
This is the task of the Input gate.
• The input gate is used to quantify the importance of the new information
carried by the input. Here is the equation of the input gate:
• Here,
• Xt: Input at the current timestamp t
• Ui: weight matrix of input
• Ht-1: A hidden state at the previous timestamp
• Wi: Weight matrix associated with the hidden state (for the input gate)
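Assuming the standard LSTM formulation, the input-gate equation with these variables reads:

```latex
i_t = \sigma\left( x_t \cdot U_i + H_{t-1} \cdot W_i \right)
```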
New Information
• Again, we have applied the sigmoid function over it. As a result, the value of
i at timestamp t will be between 0 and 1.
• Now, the new information that needs to be passed to the cell state is a
function of the hidden state at the previous timestamp t-1 and the input x at
timestamp t. The activation function here is tanh.
• Due to the tanh function, the value of new information will be between -1
and 1. If the value of Nt is negative, the information is subtracted from the
cell state, and if the value is positive, the information is added to the cell
state at the current timestamp.
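Assuming the standard formulation (Uc and Wc are my labels for the candidate-information weights, by analogy with the gate variables above), the new-information equation is:

```latex
N_t = \tanh\left( x_t \cdot U_c + H_{t-1} \cdot W_c \right)
```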
• However, Nt won't be added directly to the cell state. Here comes
the updated cell-state equation:
• Here, Ct-1 is the cell state at the previous timestamp, and the others
are the values we have calculated previously.
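In the standard LSTM formulation, that cell-state update reads (⊙ denotes element-wise multiplication):

```latex
C_t = f_t \odot C_{t-1} + i_t \odot N_t
```

The forget gate f_t scales down the old state, while the input gate i_t scales how much of the new information Nt enters the cell.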
Output Gate
• In this task, we have to complete the second sentence. The
minute we see the word brave, we know that we are talking about a
person. In the sentence, only Bob can be brave; we cannot say the enemy
is brave or the country is brave. So, based on the current expectation,
we have to give a relevant word to fill in the blank. That word is our
output, and this is the function of our Output gate.
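In the standard formulation, the output gate and the resulting hidden state are computed as follows (Uo and Wo follow the same naming convention as the other gates):

```latex
o_t = \sigma\left( x_t \cdot U_o + H_{t-1} \cdot W_o \right), \qquad
H_t = o_t \odot \tanh\left( C_t \right)
```

The output gate thus decides how much of the (squashed) cell state is exposed as the hidden state Ht, which is what the next layer or timestamp actually sees.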