
Recurrent Neural Networks
RNN
• A recurrent neural network (RNN) is a special type of artificial neural network adapted to work with time-series data or data that involves sequences.
• RNNs have the concept of "memory", which helps them store the states or information of previous inputs in order to generate the next output of the sequence.
• The main difference between a CNN and an RNN is the ability to process temporal information, i.e., data that comes in sequences, such as a sentence.
• Recurrent neural networks are designed for this very purpose, while convolutional neural networks are incapable of effectively interpreting temporal information.
Applications
• Speech recognition involves converting a sequence of audio signals to
a sequence of words.
• Video captioning involves converting a sequence of video frames to a
sequence of words.
• Natural language processing tasks such as question answering involve mapping a question (a sequence of words) to an answer (another sequence of words).
• Prediction problems, e.g., whether a sentence is positive or negative.
• Text summarization.
RNN VS. FEED-FORWARD NEURAL NETWORKS
• In a feed-forward neural network, the information only moves in one direction —
from the input layer, through the hidden layers, to the output layer.
• The information moves straight through the network and never touches a node twice.
• Feed-forward neural networks have no memory of the input they receive and are bad
at predicting what’s coming next.
• Because a feed-forward network only considers the current input, it has no notion of order in time.
• It simply cannot remember anything about what happened in the past except what it absorbed during training.
• While feed-forward networks have different weights across each node, recurrent neural networks share the same weight parameters within each layer of the network.
RNN
• In RNN the information cycles through a loop.
• When it makes a decision, it considers the current input and also what it
has learned from the inputs it received previously.
• Another good way to illustrate the concept of a recurrent neural
network's memory is to explain it with an example:
• Imagine you have a normal feed-forward neural network and give it the
word "neuron" as an input and it processes the word character by
character. By the time it reaches the character "r," it has already
forgotten about "n," "e" and "u," which makes it almost impossible for
this type of neural network to predict which character would come next.
RNN
• S1, S2, S3 are the hidden states or memory (hidden-layer) units at time t1, t2, t3 respectively, and Ws is the weight matrix associated with them.
• X1, X2, X3 are the inputs at time t1, t2, t3 respectively, and Wx is the weight matrix associated with them.
• Y1, Y2, Y3 are the outputs at time t1, t2, t3 respectively, and Wy is the weight matrix associated with them.
• S1 = g1(Wx·X1 + Ws·S0)
• S2 = g1(Wx·X2 + Ws·S1)
• S3 = g1(Wx·X3 + Ws·S2)
• General representation:
• St = g1(Wx·Xt + Ws·St-1)
• Yt = g2(Wy·St)
• where g1 and g2 are activation functions and S0 is the initial hidden state (usually zero).
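The recurrence above can be sketched in a few lines of plain Python. This is a minimal scalar illustration: the weight values are made-up assumptions, g1 is taken as tanh and g2 as the identity; a real RNN would use weight matrices and vectors.

```python
import math

# Minimal sketch of St = g1(Wx*Xt + Ws*St-1), Yt = g2(Wy*St)
# with scalar weights. Weight values are illustrative assumptions.
Wx, Ws, Wy = 0.5, 0.8, 1.2

def rnn_step(x_t, s_prev):
    s_t = math.tanh(Wx * x_t + Ws * s_prev)  # g1 = tanh: new hidden state
    y_t = Wy * s_t                           # g2 = identity: output
    return s_t, y_t

s = 0.0                      # S0: initial hidden state
outputs = []
for x in [1.0, 0.5, -0.25]:  # X1, X2, X3
    s, y = rnn_step(x, s)
    outputs.append(y)
```

Note that the same three weights are reused at every time step, which is the weight sharing mentioned earlier.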
Back Propagation through time
(BPTT)
• Let us now perform back-propagation at time t = 3.
• Let the error function be:
Et = (Y − Yt)², where Y is the target output.
We are using the squared error here.
To perform back-propagation, we have to adjust the weights associated with the inputs (Wx), the memory units (Ws) and the outputs (Wy).
BPTT
• Adjusting Wy
• For better understanding, let us consider the following
representation:
BPTT
• Adjusting Wy
Formula:
∂E3 / ∂Wy = ∂E3 / ∂Y3 · ∂Y3 / ∂Wy (chain rule)
Explanation:

• E3 is a function of Y3. Hence, we differentiate E3 w.r.t. Y3.
• Y3 is a function of Wy. Hence, we differentiate Y3 w.r.t. Wy.
BPTT
• Adjusting Ws
For better understanding, let us consider the following
representation:
BPTT
• Adjusting Ws
Formula:
∂E3 / ∂Ws = (∂E3 / ∂Y3 · ∂Y3 / ∂S3 · ∂S3 / ∂Ws) +
(∂E3 / ∂Y3 · ∂Y3 / ∂S3 · ∂S3 / ∂S2 · ∂S2 / ∂Ws) +
(∂E3 / ∂Y3 · ∂Y3 / ∂S3 · ∂S3 / ∂S2 · ∂S2 / ∂S1 · ∂S1 / ∂Ws)
BPTT
• Explanation:
E3 is a function of Y3. Hence, we differentiate E3 w.r.t Y3.
Y3 is a function of S3. Hence, we differentiate Y3 w.r.t S3.
S3 is a function of WS. Hence, we differentiate S3 w.r.t WS.
But we cannot stop there; we also have to take the previous time steps into consideration. So, we differentiate (partially) the error function with respect to the memory units S2 and S1 as well, taking the weight matrix Ws into consideration.
We have to keep in mind that a memory unit St is a function of its previous memory unit St-1.
Hence, we differentiate S3 with respect to S2 and S2 with respect to S1.
BPTT
• Generally, we can express this formula as:

∂En / ∂Ws = ∑ i=1..n (∂En / ∂Yn · ∂Yn / ∂Si · ∂Si / ∂Ws)


BPTT
• Adjusting WX:
BPTT
• Adjusting Wx
Formula:
∂E3 / ∂Wx = (∂E3 / ∂Y3 · ∂Y3 / ∂S3 · ∂S3 / ∂Wx) +
(∂E3 / ∂Y3 · ∂Y3 / ∂S3 · ∂S3 / ∂S2 · ∂S2 / ∂Wx) +
(∂E3 / ∂Y3 · ∂Y3 / ∂S3 · ∂S3 / ∂S2 · ∂S2 / ∂S1 · ∂S1 / ∂Wx)
• Explanation:
E3 is a function of Y3. Hence, we differentiate E3 w.r.t. Y3.
Y3 is a function of S3. Hence, we differentiate Y3 w.r.t. S3.
S3 is a function of Wx. Hence, we differentiate S3 w.r.t. Wx.
Again, we cannot stop there; we also have to take the previous time steps into consideration. So, we differentiate (partially) the error function with respect to the memory units S2 and S1 as well, taking the weight matrix Wx into consideration.
BPTT
• Generally, we can express this formula as:

∂En / ∂Wx = ∑ i=1..n (∂En / ∂Yn · ∂Yn / ∂Si · ∂Si / ∂Wx)
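The summed chain rule can be checked numerically. The sketch below uses a 3-step scalar RNN with identity activations (so ∂Y3/∂Si collapses to Wy·Ws^(3-i)) and compares the BPTT sum for ∂E3/∂Ws against a finite-difference estimate; all weights, inputs and the target are made-up illustrative values.

```python
# BPTT sanity check for a 3-step scalar RNN with identity activations:
# St = Wx*Xt + Ws*St-1, Y3 = Wy*S3, E3 = (Y - Y3)^2.
# All numeric values are illustrative assumptions.
Wx, Ws, Wy = 0.5, 0.8, 1.2
X, target = [1.0, 0.5, -0.25], 0.3

def e3(ws):
    s = 0.0
    for x in X:
        s = Wx * x + ws * s
    return (target - Wy * s) ** 2

# Forward pass, keeping the state history S0..S3
s_hist = [0.0]
for x in X:
    s_hist.append(Wx * x + Ws * s_hist[-1])
dE_dY3 = -2.0 * (target - Wy * s_hist[-1])

# dE3/dWs = sum over i of dE3/dY3 * dY3/dSi * dSi/dWs
grad = 0.0
for i in range(1, 4):
    dY3_dSi = Wy * Ws ** (3 - i)   # chain S3 <- S2 <- S1
    dSi_dWs = s_hist[i - 1]        # direct dependence of Si on Ws
    grad += dE_dY3 * dY3_dSi * dSi_dWs

# Finite-difference estimate of the same derivative
eps = 1e-6
fd = (e3(Ws + eps) - e3(Ws - eps)) / (2 * eps)
```

The two values agree, confirming that the three-term sum accounts for every path through which Ws influences E3.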


RNN (continued)
• An RNN can be viewed as a sequence of neural networks that you train one after another with backpropagation.
• RNNs learn similarly to regular networks during training but, in addition, they remember things learnt from prior input(s) while generating output(s).
• RNNs can take one or more input vectors and produce one or more output vectors and
the output(s) are influenced not just by weights applied on inputs like a regular NN,
but also by a “hidden” state vector representing the context based on prior
input(s)/output(s).
• Recurrent neural networks apply the same weights for each element of the
sequence, significantly reducing the number of parameters and allowing the model
to generalize to variable length sequences.
• So, the same input could produce a different output depending on previous inputs in
the series.
RNN
• A recurrent neural network, however, is able to remember those
characters because of its internal memory. It produces output, copies
that output and loops it back into the network.
• Recurrent neural networks add the immediate past to the present.
• Therefore, an RNN has two inputs: the present and the recent past. This is important because the sequence of data contains crucial information about what is coming next, which is why an RNN can do things other algorithms can't.
Different types of RNN
• One-to-one:
• It deals with a fixed size of input to a fixed size of output, where the output is independent of previous information.
• Ex: Image classification.
• One-to-Many:
• It deals with a fixed size of information as input and gives a sequence of data as output.
• Ex: Image captioning takes an image as input and outputs a sentence of words.
Different types of RNN
• Many-to-One:
• It takes a sequence of information as input and outputs a fixed size of output.
• Ex: Sentiment analysis, where a given sentence is classified as expressing positive or negative sentiment.
• Many-to-Many:
• It takes a sequence of information as input, processes it recurrently and outputs a sequence of data.
• Ex: Machine translation, where an RNN reads a sentence in English and then outputs a sentence in French.
Limitations: Two issues of standard RNNs
• This method of Back Propagation through time (BPTT) can be used up
to a limited number of time steps like 8 or 10.
• If we back propagate further, the gradient becomes too small.
• This problem is called the “Vanishing gradient” problem.
• The problem is that the contribution of information decays
geometrically over time.
• Another problem is the exploding gradient, where the gradient grows uncontrollably large.
What is a gradient / slope?

• There are two major obstacles RNN’s have had to deal with, but to
understand them, you first need to know what a gradient is.
• A gradient is a partial derivative with respect to its inputs. If you don’t know
what that means, just think of it like this: a gradient measures how much
the output of a function changes if you change the inputs a little bit.
• You can also think of a gradient as the slope of a function.
• The higher the gradient, the steeper the slope and the faster a model can
learn.
• But if the slope is zero, the model stops learning. A gradient simply
measures the change in all weights with regard to the change in error.
1. VANISHING GRADIENTS

• Vanishing gradients occur when the values of a gradient are too small and the model stops learning, or takes far too long, as a result (it cannot reach the global minimum).
• This was solved through the concept of the LSTM by Sepp Hochreiter and Juergen Schmidhuber.
• When you backpropagate through time, the error is the difference between the actual and the predicted output of the model.
• Now, what if the partial derivative of the error with respect to a weight is much less than 1?
• If the partial derivative of the error is less than 1, it gets multiplied by the learning rate, which is itself very small.
• Then multiplying the learning rate by the partial derivative of the error produces hardly any change compared with the previous iteration.
• For example, say the value decreased like 0.863 → 0.532 → 0.356 → 0.192 → 0.117 → 0.086 → 0.023 → 0.019...
• You can see that there is not much change in the last three iterations. This vanishing of the gradient is called the vanishing gradient problem.
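The decay above can be reproduced with a toy loop: multiplying a gradient repeatedly by a per-step factor smaller than 1 (the factor 0.8 here is an arbitrary assumption) makes it shrink geometrically, just as happens across many time steps of BPTT.

```python
# Toy illustration of geometric gradient decay; 0.8 is an assumed
# per-time-step derivative magnitude smaller than 1.
factor = 0.8
grad = 1.0
history = []
for step in range(10):
    grad *= factor
    history.append(round(grad, 4))
# After 10 steps the gradient has shrunk to about 0.8**10, roughly 0.107,
# so weight updates attributable to the earliest time steps become negligible.
```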
2. EXPLODING GRADIENTS
• This is where the gradient grows uncontrollably large.
• Fortunately, this problem can be easily solved by truncating or squashing the gradients.
• Exploding gradients occur when the algorithm, without much reason, assigns an absurdly high importance to the weights.
• A popular method called gradient clipping can be used: at each time step, check whether the gradient exceeds a threshold; if yes, normalize (rescale) it.
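A minimal sketch of gradient clipping by global norm, assuming the gradients are given as a flat list of floats: if the L2 norm exceeds the threshold, every component is rescaled so the norm equals the threshold; otherwise the gradients pass through unchanged.

```python
import math

def clip_gradients(grads, threshold):
    # Rescale the gradient vector if its L2 norm exceeds the threshold.
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grads]
    return grads

clipped = clip_gradients([3.0, 4.0], threshold=1.0)  # norm 5.0 -> rescaled
small = clip_gradients([0.1, 0.2], threshold=1.0)    # under threshold: unchanged
```

Rescaling preserves the gradient's direction while capping its magnitude, which is why clipping does not change which way the weights move.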
Long Short-Term Memory (LSTM)
• The units of an LSTM are used as building units for the layers of a RNN,
often called an LSTM network.
• LSTMs enable RNNs to remember inputs over a long period of time.
• This is because LSTMs contain information in a memory, much like the
memory of a computer.
• The function of the memory is to remember and forget information based on the context of the information.
• The LSTM can read, write and delete information from its memory.
• LSTM is well-suited to classify, process and predict time series given
time lags of unknown duration.
• This memory can be seen as a gated cell, with gated meaning the cell
decides whether or not to store or delete information (i.e., if it opens
the gates or not), based on the importance it assigns to the
information.
• The assigning of importance happens through weights, which are also
learned by the algorithm.
• This simply means that it learns over time what information is
important and what is not.
• In an LSTM you have three gates: input, forget and output gate.
• These gates determine whether or not to let new input in (input
gate), delete the information because it isn’t important (forget gate),
or let it impact the output at the current timestep (output gate).
• Below is an illustration of an LSTM cell with its three gates:
• These three parts of an LSTM cell are known as gates. The first part is called the Forget gate, the second part is known as the Input gate and the last one is the Output gate.
• The LSTM consists of three parts, as shown in the image below and each
part performs an individual function.
• The first part chooses whether the information coming from the previous
timestamp is to be remembered or is irrelevant and can be forgotten.
• In the second part, the cell tries to learn new information from the input to
this cell.
• At last, in the third part, the cell passes the updated information from the
current timestamp to the next timestamp.
• Just like a simple RNN, an LSTM also has a hidden state where H(t-1)
represents the hidden state of the previous timestamp and Ht is the
hidden state of the current timestamp.
• In addition to that LSTM also have a cell state represented by C(t-1)
and C(t) for previous and current timestamp respectively.
• Here the hidden state is known as Short term memory and the cell
state is known as Long term memory. Refer to the above image.
• Let’s take an example to understand how LSTM works. Here we have
two sentences separated by a full stop. The first sentence is “Bob is a
nice person” and the second sentence is “Dan, on the Other hand, is
evil”. It is very clear that in the first sentence we are talking about Bob, and as soon as we encounter the full stop (.) we start talking about Dan.
• As we move from the first sentence to the second sentence, our network should realize that we are no longer talking about Bob; our subject is now Dan. The Forget gate of the network allows it to forget about Bob. Let's understand the roles played by these gates in the LSTM architecture (context change).

An LSTM has a three-step process:
Different view of LSTM
Forget Gate
• In a cell of the LSTM network, the first step is to decide whether we should keep the information from the previous timestamp or forget it. The forget gate is computed as:
ft = σ(Wf·Ht-1 + Uf·Xt)
• Xt: input at the current timestamp.
• Uf: weight matrix associated with the input.
• Ht-1: the hidden state of the previous timestamp.
• Wf: the weight matrix associated with the hidden state.
• A sigmoid function is applied over the sum, which makes ft a number between 0 and 1.
• This ft is later multiplied element-wise with the cell state of the previous timestamp.
• If ft is 0, the network will forget everything; if ft is 1, it will forget nothing.
• Let's get back to our example. The first sentence was talking about Bob, and after the full stop the network will encounter Dan; in an ideal case the network should forget about Bob.
• The output of the forget gate tells the cell state which information to forget, by multiplying the corresponding position in the matrix by 0.
• If the output of the forget gate is 1, the information is kept in the cell state. As the equation shows, the sigmoid function is applied to the weighted input and the previous hidden state.
• It decides how much of the past you should remember.
• This gate decides which information should be omitted from the cell at that particular time stamp.
• It is decided by the sigmoid function.
• It looks at the previous state (Ht-1) and the current input (Xt) and outputs a number between 0 (omit this) and 1 (keep this) for each number in the cell state Ct-1.
Input Gate
• Let's take another example:
• “Bob knows swimming. He told me over the phone that he had served the navy for four long years.”
• So, in both these sentences, we are talking about Bob. However, both give different
kinds of information about Bob. In the first sentence, we get the information that he
knows swimming. Whereas the second sentence tells he uses the phone and served
in the navy for four years.
• Now just think about it: based on the context given in the first sentence, which information in the second sentence is critical? In this context, it doesn't matter whether he used the phone or any other medium of communication to pass on the information. The fact that he was in the navy is the important information, and this is something we want our model to remember. This is the task of the Input gate.
The input gate is used to quantify the importance of the new information carried by the input. It is computed as:
it = σ(Wi·Ht-1 + Ui·Xt)
• Here,
• Xt: input at the current timestamp t.
• Ui: weight matrix of the input.
• Wi: weight matrix associated with the hidden state.
• Ht-1: the hidden state at the previous timestamp.
• Again, a sigmoid function is applied, so the value of it at timestamp t will be between 0 and 1.
New information
• The new information that needs to be passed to the cell state is a function of the hidden state at the previous timestamp t-1 and the input x at timestamp t. With weight matrices Wc and Uc (named here following the same pattern as the gates):
Nt = tanh(Wc·Ht-1 + Uc·Xt)
• The activation function here is tanh, so the value of the new information will be between -1 and 1.
• If the value of Nt is negative, the information is subtracted from the cell state; if it is positive, the information is added to the cell state at the current timestamp.
• Only meaningful information is added to the cell.
• However, Nt is not added directly to the cell state. The update equation is:
Ct = ft ⊙ Ct-1 + it ⊙ Nt
where Ct-1 is the cell state at the previous timestamp and the other terms are the values we have just calculated.
Output Gate
• Now consider this sentence:
• “Bob single-handedly fought the enemy and died for his country. For his contributions, brave ________.”
• During this task, we have to complete the second sentence. Now, the
minute we see the word brave, we know that we are talking about a
person. In the sentence only Bob is brave, we can not say the enemy
is brave or the country is brave. So based on the current expectation
we have to give a relevant word to fill in the blank. That word is our
output and this is the function of our Output gate.
With weights Wo and Uo (named following the same pattern as the other gates), the output gate is:
Ot = σ(Wo·Ht-1 + Uo·Xt)
Its value lies between 0 and 1 because of the sigmoid function.
Now, to calculate the current hidden state, we use Ot and the tanh of the updated cell state:
Ht = Ot ⊙ tanh(Ct)
It turns out that the hidden state is a function of the long-term memory (Ct) and the current output. If you need the output of the current timestamp, just apply the SoftMax activation to the hidden state Ht.
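The three gates and the two state updates described above can be sketched end-to-end in plain Python. This is a scalar toy version: the weight values are illustrative assumptions (a real LSTM uses a weight matrix per gate), but the structure follows the forget gate, input gate, candidate information, cell-state update, output gate and hidden-state equations from this section.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Scalar stand-ins for the W (hidden-state) and U (input) weights of
# each gate; the numeric values are illustrative assumptions.
W = {"f": 0.7, "i": 0.6, "c": 0.5, "o": 0.9}
U = {"f": 0.4, "i": 0.3, "c": 0.8, "o": 0.2}

def lstm_step(x_t, h_prev, c_prev):
    f_t = sigmoid(W["f"] * h_prev + U["f"] * x_t)    # forget gate
    i_t = sigmoid(W["i"] * h_prev + U["i"] * x_t)    # input gate
    n_t = math.tanh(W["c"] * h_prev + U["c"] * x_t)  # candidate new info Nt
    c_t = f_t * c_prev + i_t * n_t                   # cell state (long-term memory)
    o_t = sigmoid(W["o"] * h_prev + U["o"] * x_t)    # output gate
    h_t = o_t * math.tanh(c_t)                       # hidden state (short-term memory)
    return h_t, c_t

h, c = 0.0, 0.0  # initial short-term and long-term memory
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c)
```

Because ft multiplies the old cell state, the network can scale past memory anywhere between "keep all of it" (ft near 1) and "drop all of it" (ft near 0) at every step.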
Different view of LSTM
Another view of LSTM cell.
• The first sigmoid activation function is the forget gate.
• Which information should be forgotten from the previous cell state
(Ct-1).
• The second sigmoid and first tanh activation function is our input
gate.
• Which information should be saved to the cell state or should be
forgotten?
• The last sigmoid is the output gate and highlights which information
should be going to the next hidden state.
Reduce vanishing Gradient
• The problematic issue of vanishing gradients is solved through the LSTM because it keeps the gradients steep enough, which keeps training relatively short and accuracy high.
• During forward propagation, gates control the flow of the
information. They prevent any irrelevant information from being
written to the state.
• Similarly, during backward propagation, they control the flow of the
gradients. It is easy to see that during the backward pass, gradients
will get multiplied by the gate.
Gated Recurrent Units (GRU)

• Introduced in 2014, the GRU (Gated Recurrent Unit) is designed to solve the vanishing gradient problem.
• They have a reset and update gate.
• These gates determine which information is to be retained for future
predictions.
• GRU can also be considered as a variation on the LSTM because both are
designed similarly and, in some cases, produce equally excellent results.
• The GRU uses fewer training parameters and therefore uses less memory and executes faster than the LSTM, whereas the LSTM is more accurate on larger datasets.
LSTM Vs GRU
• Another Interesting thing about GRU is that, unlike LSTM, it does not
have a separate cell state (Ct).
• It only has a hidden state(Ht).
• The information which is stored in the Internal Cell State in an LSTM
recurrent unit is incorporated into the hidden state of the Gated
Recurrent Unit.
• Due to the simpler architecture, GRUs are faster to train.
• Update Gate
• The update gate acts similarly to the input gate of an LSTM. It decides what information to throw away and what new information to add.
• The update gate determines how much of the new input should be used to update the hidden state.
• Reset Gate
• The reset gate is another gate, used to decide how much past information to forget.
• To solve the vanishing gradient problem of a standard RNN, the GRU uses the so-called update gate and reset gate.
• Basically, these are two vectors which decide what information should be passed to the output.
• The special thing about them is that they can be trained to keep information from long ago without washing it out through time, and to remove information which is irrelevant to the prediction.
GRU three cell state view
GRU unit single cell state
1. Update gate
• We start by calculating the update gate zt for time step t. With the gate's own weights Wz and Uz:
zt = σ(Wz·xt + Uz·h(t-1))
2. Reset gate
• Essentially, this gate is used by the model to decide how much of the past information to forget.
• To calculate it, we use the gate's own weights Wr and Ur:
rt = σ(Wr·xt + Ur·h(t-1))
• This formula is the same as the one for the update gate; the difference comes in the weights and in the gate's usage.
• As before, we plug in h(t-1) (blue line) and xt (purple line), multiply them by their corresponding weights, sum the results and apply the sigmoid function.
Current memory content
• Let's see how exactly the gates affect the final output.
• First, we start with the usage of the reset gate.
• We introduce a new memory content h't, which uses the reset gate to store the relevant information from the past. It is calculated as:
h't = tanh(W·xt + rt ⊙ (U·h(t-1)))
3. Current memory content
• Step 1: Multiply the input xt by a weight W and h(t-1) by a weight U.
• Step 2: Calculate the Hadamard (element-wise) product between the reset gate rt and U·h(t-1). This determines what to remove from the previous time steps.
• Consider an example: say we have a sentiment analysis problem of determining someone's opinion about a book from a review they wrote. The text starts with “This is a fantasy book which illustrates…” and after a couple of paragraphs ends with “I didn't quite enjoy the book because I think it captures too many details.” To determine the overall level of satisfaction with the book, we only need the last part of the review. In that case, as the neural network approaches the end of the text, it will learn to assign an rt vector close to 0, washing out the past and focusing only on the last sentences.
• Step 3: Sum up the results of steps 1 and 2.
• Step 4: Apply the nonlinear activation function tanh.
Final memory at current time
step
• As the last step, the network needs to calculate the ht vector, which holds information for the current unit and passes it down to the rest of the network.
• In order to do that, the update gate is needed.
• It determines what to collect from the current memory content h't and what from the previous steps h(t-1). That is done as follows:
ht = zt ⊙ h(t-1) + (1 − zt) ⊙ h't
Final output
• Step 1: Apply element-wise multiplication to the update gate zt and h(t-1).
• Step 2: Apply element-wise multiplication to (1 − zt) and h't.
• Step 3: Sum the results of steps 1 and 2.
• Let's bring back the example about the book review. This time, the most relevant information is positioned at the beginning of the text. The model can learn to set the vector zt close to 1 and keep the majority of the previous information. Since zt is close to 1 at this time step, 1 − zt will be close to 0, which ignores a big portion of the current content (in this case, the last part of the review, which explains the book plot) that is irrelevant for our prediction.
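Putting the update gate, reset gate, current memory content and final memory together, a GRU step can be sketched in plain Python. This is a scalar toy version following the equations in this section; the weight values (Wz, Uz, Wr, Ur, Wh, Uh) are illustrative assumptions, and a real GRU uses weight matrices and vectors.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative scalar weights for the two gates and the candidate memory.
Wz, Uz = 0.5, 0.4   # update gate
Wr, Ur = 0.6, 0.3   # reset gate
Wh, Uh = 0.8, 0.7   # current memory content

def gru_step(x_t, h_prev):
    z_t = sigmoid(Wz * x_t + Uz * h_prev)             # update gate zt
    r_t = sigmoid(Wr * x_t + Ur * h_prev)             # reset gate rt
    h_cand = math.tanh(Wh * x_t + r_t * Uh * h_prev)  # current memory content h't
    h_t = z_t * h_prev + (1.0 - z_t) * h_cand         # final memory ht
    return h_t

h = 0.0  # initial hidden state
for x in [1.0, 0.5, -0.25]:
    h = gru_step(x, h)
```

Note that ht is a convex combination of the old state and the candidate: zt near 1 keeps the past, while zt near 0 overwrites it with the new content.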
GRU Network
• Now, you can see how GRUs are able to store and filter the
information using their update and reset gates.
• That eliminates the vanishing gradient problem since the model is
not washing out the new input every single time but keeps the
relevant information and passes it down to the next time steps of the
network.
• If carefully trained, they can perform extremely well even in
complex scenarios.
• Students who could not attend my class can refer to these YouTube videos:
• RNN:
• https://www.youtube.com/watch?v=6EXP2-d_xQA
• https://www.youtube.com/watch?v=mDaEfPgwtgo
• LSTM:
• https://www.youtube.com/watch?v=XsFkGGlocc4
• https://www.youtube.com/watch?v=rdkIOM78ZPk
• GRU:
• https://www.youtube.com/watch?v=xLKSMaYp2oQ
• https://www.youtube.com/watch?v=tOuXgORsXJ4
