Recurrent Neural Networks
RNN
• A recurrent neural network (RNN) is a special type of artificial neural network adapted to work with time series data or data that involves sequences.
• RNNs have the concept of 'memory' that helps them store the states or information of previous inputs to generate the next output of the sequence.
• The main difference between a CNN and an RNN is the ability to process
temporal information — data that comes in sequences, such as a sentence.
• Recurrent neural networks are designed for this very purpose, while
convolutional neural networks are incapable of effectively interpreting
temporal information.
Applications
• Speech recognition involves converting a sequence of audio signals to
a sequence of words.
• Video captioning involves converting a sequence of video frames to a
sequence of words.
• Natural language processing tasks such as question answering involve mapping a question (a sequence of words) to an answer (another sequence of words).
• Prediction problems, e.g. classifying whether a sentence is positive or negative.
• Text summarization.
RNN VS. FEED-FORWARD NEURAL NETWORKS
• In a feed-forward neural network, the information only moves in one direction —
from the input layer, through the hidden layers, to the output layer.
• The information moves straight through the network and never touches a node twice.
• Feed-forward neural networks have no memory of the input they receive and are bad
at predicting what’s coming next.
• Because a feed-forward network only considers the current input, it has no notion of
order in time.
• It simply can't remember anything about what happened in the past except what it learned during training.
• While feedforward networks have different weights across each node, recurrent
neural networks share the same weight parameter within each layer of the network.
RNN
• In an RNN, the information cycles through a loop.
• When it makes a decision, it considers the current input and also what it
has learned from the inputs it received previously.
• Another good way to illustrate the concept of a recurrent neural
network's memory is to explain it with an example:
• Imagine you have a normal feed-forward neural network and give it the
word "neuron" as an input and it processes the word character by
character. By the time it reaches the character "r," it has already
forgotten about "n," "e" and "u," which makes it almost impossible for
this type of neural network to predict which character would come next.
RNN
• S1, S2, S3 are the hidden states or memory (hidden layer) units at time t1, t2, t3 respectively, and Ws is the weight matrix associated with them.
• X1, X2, X3 are the inputs at time t1, t2, t3 respectively, and Wx is the weight matrix associated with them.
• Y1, Y2, Y3 are the outputs at time t1, t2, t3 respectively, and Wy is the weight matrix associated with them.
• S1 = g1(Wx·X1 + Ws·S0), where S0 is the initial state (typically zero)
• S2 = g1(Wx·X2 + Ws·S1)
• S3 = g1(Wx·X3 + Ws·S2)
• General Representation:
• St = g1(Wx·Xt + Ws·St-1)
• Yt = g2(Wy·St)
• Where g1 and g2 are activation functions.
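• As a concrete illustration of these equations (a minimal sketch, not from the slides), the NumPy code below runs the forward pass St = g1(Wx·Xt + Ws·St-1), Yt = g2(Wy·St), assuming g1 = tanh, g2 = identity, and made-up sizes and random weights:

import numpy as np

def rnn_forward(X, Wx, Ws, Wy, s0=None):
    # Unrolled forward pass: S_t = tanh(Wx @ x_t + Ws @ S_{t-1}), Y_t = Wy @ S_t
    s_prev = np.zeros(Ws.shape[0]) if s0 is None else s0
    states, outputs = [], []
    for x_t in X:                               # X holds one input vector per time step
        s_t = np.tanh(Wx @ x_t + Ws @ s_prev)   # g1 = tanh (assumed)
        y_t = Wy @ s_t                          # g2 = identity (assumed)
        states.append(s_t)
        outputs.append(y_t)
        s_prev = s_t
    return states, outputs

# Toy usage: 3 time steps, 4-dim inputs, 5 hidden units, 2 outputs (all sizes assumed)
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
Wx, Ws, Wy = rng.normal(size=(5, 4)), rng.normal(size=(5, 5)), rng.normal(size=(2, 5))
states, outputs = rnn_forward(X, 0.1 * Wx, 0.1 * Ws, 0.1 * Wy)
print(outputs[-1])   # Y3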
Back Propagation through time
(BPTT)
• Let us now perform back propagation at time t = 3.
• Let the error function be:
Et = (Y − Yt)², where Y is the actual (target) output
• We are using the squared error here.
• To perform back propagation, we have to adjust the weights associated with the inputs (Wx), the memory units (Ws) and the outputs (Wy).
BPTT
• Adjusting Wy
• For better understanding, let us consider the following representation:
Formula:
∂E3/∂Wy = ∂E3/∂Y3 · ∂Y3/∂Wy   (chain rule)
Explanation:
• E3 depends on Wy only through the output Y3, so the chain rule splits the gradient into the error term ∂E3/∂Y3 and the output term ∂Y3/∂Wy.
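• A scalar worked example of the chain rule above (a sketch, not from the slides), taking g2 as the identity so that Y3 = Wy·S3, with made-up numbers and a finite-difference check:

import numpy as np

# Assumed toy scalar values (illustration only)
Y  = 1.0      # target output
S3 = 0.8      # hidden state at t = 3
Wy = 0.5      # output weight

Y3 = Wy * S3                       # g2 assumed to be the identity
E3 = (Y - Y3) ** 2                 # squared error, as defined above

# Chain rule: dE3/dWy = dE3/dY3 * dY3/dWy
dE3_dY3 = -2.0 * (Y - Y3)
dY3_dWy = S3
grad_analytic = dE3_dY3 * dY3_dWy

# Finite-difference check of the same derivative
eps = 1e-6
grad_numeric = ((Y - (Wy + eps) * S3) ** 2 - E3) / eps

print(grad_analytic, grad_numeric)   # both approximately -0.96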
• There are two major obstacles RNNs have had to deal with, but to understand them, you first need to know what a gradient is.
• A gradient is a partial derivative with respect to its inputs. If you don’t know
what that means, just think of it like this: a gradient measures how much
the output of a function changes if you change the inputs a little bit.
• You can also think of a gradient as the slope of a function.
• The higher the gradient, the steeper the slope and the faster a model can
learn.
• But if the slope is zero, the model stops learning. A gradient simply
measures the change in all weights with regard to the change in error.
1. VANISHING GRADIENTS
• Vanishing gradients occur when the values of a gradient are too small and the model stops learning, or takes far too long to learn, as a result (it may fail to reach the global minimum).
• It was solved through the concept of LSTM by Sepp Hochreiter and
Juergen Schmidhuber.
• When backpropagating through time, the error is computed from the difference between the actual and the predicted output.
• Now, what if the partial derivative of the error with respect to a weight is much less than 1?
• If the partial derivative of the error is less than 1, it is then multiplied by the learning rate, which is also very small.
• The resulting weight update is therefore barely a change compared with the previous iteration.
• For example, suppose the values decrease like 0.863 → 0.532 → 0.356 → 0.192 → 0.117 → 0.086 → 0.023 → 0.019...
• You can see that there is not much change in the last few iterations. This shrinking of the gradient is called the vanishing gradient problem.
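• A minimal numeric sketch (an assumption, not from the slides) of why this happens: during BPTT the gradient at an early time step is a product of per-step factors, so repeated factors below 1 shrink it toward zero, while factors above 1 make it blow up (the problem discussed next); the factors 0.8 and 1.2 are arbitrary:

import numpy as np

def gradient_through_time(per_step_factor, num_steps):
    # Gradient magnitude after being multiplied by the same per-step factor at every step
    return per_step_factor ** np.arange(1, num_steps + 1)

print(gradient_through_time(0.8, 20)[-5:])   # shrinks toward 0 -> vanishing gradient
print(gradient_through_time(1.2, 20)[-5:])   # grows without bound -> exploding gradient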
2. EXPLODING GRADIENTS
• Exploding gradients occur when the gradient values become too large, so the weight updates grow uncontrollably and training becomes unstable.
GRU (GATED RECURRENT UNIT)
• Introduced in 2014, GRU (Gated Recurrent Unit) aims to solve the vanishing
gradient problem
• These networks are designed to handle the vanishing gradient problem.
• They have a reset and update gate.
• These gates determine which information is to be retained for future
predictions.
• GRU can also be considered as a variation on the LSTM because both are
designed similarly and, in some cases, produce equally excellent results.
• GRUs use fewer training parameters and therefore need less memory and execute faster than LSTMs, whereas LSTMs can be more accurate on larger datasets.
LSTM Vs GRU
• Another Interesting thing about GRU is that, unlike LSTM, it does not
have a separate cell state (Ct).
• It only has a hidden state (Ht).
• The information which is stored in the Internal Cell State in an LSTM
recurrent unit is incorporated into the hidden state of the Gated
Recurrent Unit.
• Due to the simpler architecture, GRUs are faster to train.
• Update Gate
• The update gate acts similarly to the input gate of an LSTM. It decides what information to throw away and what new information to add.
• The update gate determines how much of the new input should be used to update the hidden state.
• Reset Gate
• The reset gate is another gate that is used to decide how much past information to forget.
• To solve the vanishing gradient problem of a standard RNN, the GRU uses two so-called gates: an update gate and a reset gate.
• Basically, these are two vectors which decide what information should be passed to the output.
• The special thing about them is that they can be trained to keep information from long ago without washing it out through time, and to remove information which is irrelevant to the prediction.
GRU three cell state view
GRU unit single cell state
1. Update gate
• We start by calculating the update gate zt for time step t using the formula:
• zt = σ(Wz·xt + Uz·h(t-1))
• Here xt and h(t-1) are multiplied by the gate's own weights Wz and Uz, the results are summed, and the sigmoid function σ squashes the result between 0 and 1.
2. Reset gate
• Essentially, this gate is used by the model to decide how much of the past information to forget.
• To calculate it, we use:
• rt = σ(Wr·xt + Ur·h(t-1))
• This formula has the same form as the one for the update gate.
• The difference comes in the weights and in how the gate is used.
• As before, we plug in h(t-1) (blue line in the figure) and xt (purple line), multiply them by their corresponding weights, sum the results, and apply the sigmoid function.
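• A minimal NumPy sketch of the two gate computations described above (the function and weight names, shapes, and structure are assumptions, not from the slides):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_gates(x_t, h_prev, Wz, Uz, Wr, Ur):
    # zt and rt, each computed from the current input and the previous hidden state
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)   # update gate: zt = sigmoid(Wz·xt + Uz·h(t-1))
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)   # reset gate:  rt = sigmoid(Wr·xt + Ur·h(t-1))
    return z_t, r_t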
3. Current memory content
• Let's see how exactly the gates will affect the final output.
• First, we start with the usage of the reset gate.
• We introduce a new memory content h't which will use the reset gate to store the relevant information from the past.
• It is calculated as follows:
• h't = tanh(W·xt + rt ⊙ U·h(t-1))
• Step 1: Multiply the input xt by a weight W and h(t-1) by a weight U.
• Step 2: Calculate the Hadamard (element-wise) product between the reset gate rt and U·h(t-1). That will determine what to remove from the previous time steps.
• Consider an example: let's say we have a sentiment analysis problem for determining one's opinion about a book from a review he wrote. The text starts with "This is a fantasy book which illustrates..." and after a couple of paragraphs ends with "I didn't quite enjoy the book because I think it captures too many details." To determine the overall level of satisfaction with the book, we only need the last part of the review. In that case, as the neural network approaches the end of the text, it will learn to assign an rt vector close to 0, washing out the past and focusing only on the last sentences.
• Step 3: Sum up the results of steps 1 and 2.
• Step 4: Apply the nonlinear activation function tanh.
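• A small sketch of steps 1-4 above (names and shapes are assumptions, not from the slides):

import numpy as np

def current_memory_content(x_t, h_prev, r_t, W, U):
    # h't = tanh(W·xt + rt ⊙ U·h(t-1)); the reset gate decides what to drop from the past
    return np.tanh(W @ x_t + r_t * (U @ h_prev))   # * is the Hadamard (element-wise) product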
Final memory at current time
step
• As the last step, the network needs to calculate the ht vector, which holds information for the current unit and passes it down the network.
• In order to do that, the update gate is needed.
• It determines what to collect from the current memory content h't and what from the previous steps h(t-1). That is done as follows:
• ht = zt ⊙ h(t-1) + (1 − zt) ⊙ h't
Final output
• Step 1: Apply element-wise multiplication to the update gate zt and h(t-1).
• Step 2: Apply element-wise multiplication to (1 − zt) and h't.
• Step 3: Sum the results from steps 1 and 2.
• Let's bring up the example about the book review. This time, the most relevant information is positioned at the beginning of the text. The model can learn to set the vector zt close to 1 and keep a majority of the previous information. Since zt will be close to 1 at this time step, 1 − zt will be close to 0, which will ignore a big portion of the current content (in this case the last part of the review, which explains the book plot) that is irrelevant for our prediction.
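• Putting the pieces together, a minimal single-step GRU sketch following the convention used above, ht = zt ⊙ h(t-1) + (1 − zt) ⊙ h't (weight names, sizes, and the random toy data are assumptions, not from the slides):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)                 # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)                 # reset gate
    h_candidate = np.tanh(W @ x_t + r_t * (U @ h_prev))   # current memory content h't
    return z_t * h_prev + (1.0 - z_t) * h_candidate       # final memory at this time step

# Toy usage: 4-dim input, 3 hidden units (sizes assumed)
rng = np.random.default_rng(1)
x_t, h_prev = rng.normal(size=4), np.zeros(3)
Wz, Wr, W = (0.1 * rng.normal(size=(3, 4)) for _ in range(3))
Uz, Ur, U = (0.1 * rng.normal(size=(3, 3)) for _ in range(3))
print(gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U))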
GRU Network
• Now, you can see how GRUs are able to store and filter the
information using their update and reset gates.
• That helps overcome the vanishing gradient problem, since the model does not wash out the new input every single time but keeps the relevant information and passes it down to the next time steps of the network.
• If carefully trained, they can perform extremely well even in
complex scenarios.
• Those students who did not attend my class can refer to the YouTube videos below:
• RNN:
• https://www.youtube.com/watch?v=6EXP2-d_xQA
• https://www.youtube.com/watch?v=mDaEfPgwtgo
• LSTM:
• https://www.youtube.com/watch?v=XsFkGGlocc4
• https://www.youtube.com/watch?v=rdkIOM78ZPk
• GRU:
• https://www.youtube.com/watch?v=xLKSMaYp2oQ
• https://www.youtube.com/watch?v=tOuXgORsXJ4