ANN Text and Sequence Processing
Lecture # 9: Text and Sequence Processing
Why Sequence Models?
• Examples of Sequence Data
Conversion of an FFNN into an RNN
• An RNN works on the principle of saving the output of a particular layer and feeding it back to
the input in order to predict the layer's output.
• Nodes in the different layers of the neural network are compressed to form a single recurrent
layer.
• A, B, and C are the parameters of the network used to improve the output of the model.
Fully Connected Recurrent Neural Network
• At any given time t, the current input is a combination of the input at x(t) and x(t-1).
• The output at any given time is fed back into the network to improve that output.
How Does RNN Work?
Motivating Example
• X : Harry and Hermione invented a new spell.
Why Sequence Models?
How do we represent an individual word in a sequence?
• This is where we lean on a vocabulary, or a dictionary.
• This is a list of words that we use in our representations.
• A vocabulary might look like this:
• Size of the vocabulary might vary depending on the application.
• One potential way of making a vocabulary is by picking up the most frequently occurring
words from the training set.
• Now, suppose we want to represent the word ‘harry’, which is in the 4075th position in our
vocabulary.
• We one-hot encode against this vocabulary to represent ‘harry’: we put a 1 in the 4075th
position and a 0 in every remaining position (see the sketch after this list).
• To generalize, each x<t> is a one-hot encoded vector.
• If a word is not in our vocabulary, we create an unknown <UNK> tag and add it to the
vocabulary.
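A minimal sketch of this one-hot representation; the vocabulary entries and the zero-based index 4075 below are illustrative assumptions, not the lecture's actual dictionary:

```python
import numpy as np

vocab_size = 10000
# Illustrative vocabulary entries; the lecture places 'harry' at position 4075.
word_to_index = {"a": 0, "aaron": 1, "harry": 4075, "<UNK>": 9999}

def one_hot(word):
    """Return a vocab_size-dimensional vector with a 1 at the word's index, 0 elsewhere."""
    x = np.zeros(vocab_size)
    x[word_to_index.get(word, word_to_index["<UNK>"])] = 1.0
    return x

x_t = one_hot("harry")
print(int(x_t[4075]))   # 1 -- every other entry is 0
```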
Why not Use a Standard Neural Network?
• We use Recurrent Neural Networks to learn a mapping from X to Y when either X or Y, or both,
are sequences.
• But why can’t we just use a standard neural network for these sequence problems?
• Example: Suppose we build the below neural network:
• Problems:
1. Inputs and outputs do not have a fixed length, i.e., some input sentences are 10 words long
while others are longer or shorter than 10 words. The same is true for the eventual output.
2. With a standard neural network, we will not be able to share features learned across
different positions of the text.
Recurrent Neural Network (RNN) Model
• We need a representation that will help us to parse through different sentence lengths as
well as reduce the number of parameters in the model.
• This is where we use a recurrent neural network. This is what a typical RNN looks like:
• An RNN takes the first word (x<1>) and feeds it into a neural network layer, which predicts an
output (y’<1>).
• This process is repeated until the last time step x<Tx>, which generates the last output y’<Ty>.
• This network assumes that the number of words in the input and the output is the same.
Recurrent Neural Network (RNN) Model
• An RNN scans through the data in a left-to-right sequence.
• Note that the parameters the RNN uses for each time step are shared.
• We will have parameters shared between each input and the hidden layer (Wax), between
timesteps (Waa), and between the hidden layer and the output (Wya).
• So if we are making predictions for x<3>, we will also have information about x<1> and x<2>.
• A potential weakness of an RNN is that it only takes information from the previous timesteps
and not from the ones that come later.
• This problem can be solved using bi-directional RNNs. For now, let’s look at the forward
propagation steps in an RNN model:
• a<0> is a vector of all zeros and we calculate the further activations similar to that of a
standard neural network:
• a<0> = 0
• a<1> = g(Waa * a<0> + Wax * x<1> + ba)
• y<1> = g’(Wya * a<1> + by)
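A minimal numpy sketch of these two forward-propagation equations; the dimensions and the tanh/softmax choices for g and g’ are assumptions, since the slide does not specify them:

```python
import numpy as np

n_a, n_x, n_y = 64, 10000, 10000          # hidden, input, and output sizes (illustrative)
rng = np.random.default_rng(0)
Waa = rng.standard_normal((n_a, n_a)) * 0.01
Wax = rng.standard_normal((n_a, n_x)) * 0.01
Wya = rng.standard_normal((n_y, n_a)) * 0.01
ba, by = np.zeros(n_a), np.zeros(n_y)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

a0 = np.zeros(n_a)                         # a<0> is a vector of all zeros
x1 = np.zeros(n_x); x1[4075] = 1.0         # one-hot x<1> ('harry' from the earlier example)

a1 = np.tanh(Waa @ a0 + Wax @ x1 + ba)     # a<1> = g(Waa * a<0> + Wax * x<1> + ba)
y1 = softmax(Wya @ a1 + by)                # y<1> = g'(Wya * a<1> + by)
```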
Recurrent Neural Network (RNN) Model
• Similarly, we can calculate the output at each time step. The generalized form of these
formulae can be written as:
• We horizontally stack Waa and Wax to get Wa, while a<t-1> and x<t> are stacked vertically (see
the sketch after this slide’s bullets).
• Rather than carrying around two parameter matrices, we now have just one matrix.
• And that, in a nutshell, is how forward propagation works for recurrent neural networks.
• Recurrent neural nets use the backpropagation algorithm, but it is applied at every timestep. It
is commonly known as Backpropagation Through Time (BPTT).
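A small check of the stacked-matrix claim above: multiplying the horizontally stacked Wa by the vertically stacked [a<t-1>; x<t>] gives the same result as the two separate products (shapes here are illustrative).

```python
import numpy as np

n_a, n_x = 5, 8
rng = np.random.default_rng(1)
Waa = rng.standard_normal((n_a, n_a))
Wax = rng.standard_normal((n_a, n_x))
a_prev = rng.standard_normal(n_a)
x_t = rng.standard_normal(n_x)

Wa = np.hstack([Waa, Wax])                 # horizontally stack Waa and Wax
stacked = np.concatenate([a_prev, x_t])    # stack a<t-1> and x<t> vertically

# The single-matrix form matches the two-matrix form.
assert np.allclose(Wa @ stacked, Waa @ a_prev + Wax @ x_t)
```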
Backpropagation Through Time (BPTT)
• Backpropagation steps work in the opposite direction to forward propagation.
• We have a loss function which we need to minimize in order to generate accurate
predictions. The loss function is given by:
• We calculate the loss at every timestep and finally sum all these losses to calculate the final
loss for a sequence:
• In forward propagation, we move from left to right, i.e., increasing the indices of time t.
• In backpropagation, we are going from right to left, i.e., going backward in time (hence the
name backpropagation through time).
• So far, we have seen scenarios where the length of input and output sequences was equal.
• But what if the length differs?
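The slide's loss formula is not reproduced in this text; a common choice (assumed below) is a cross-entropy loss per timestep, summed over the sequence:

```python
import numpy as np

def timestep_loss(y_hat, y):
    """Per-timestep loss L<t>(y'<t>, y<t>): cross-entropy between predicted and true one-hot word."""
    return -np.sum(y * np.log(y_hat + 1e-12))

def sequence_loss(y_hats, ys):
    """Total loss for the sequence: the sum of the per-timestep losses."""
    return sum(timestep_loss(y_hat, y) for y_hat, y in zip(y_hats, ys))
```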
Application of RNNs
• We can have different types of RNNs to deal with use cases where sequence length differs.
• One to One: given some scores of a championship, you can predict the winner.
Application of RNNs
• One to Many: given an image, you can predict what the caption is going to be.
Application of RNNs
• Many to One: given a tweet, you can predict the sentiment of that tweet.
• We pass a sentence to the model and it returns the sentiment or rating corresponding to that
sentence.
Application of RNNs
• Many to Many: given an English sentence, you can translate it to its German equivalent.
Application of RNNs
• Many to Many: consider the machine translation application where we take an input
sentence in one language and translate it into another language.
• It is a many-to-many problem, but the length of the input sequence might or might not be
equal to the length of the output sequence.
Language Model and Sequence Generation
• Suppose we are building a speech recognition system and we hear the sentence “the apple
and pear salad was delicious”.
• What will the model predict – “the apple and pair salad was delicious” or “the apple and pear
salad was delicious”?
• The speech recognition system picks a sentence by using a language model, which predicts the
probability of each sentence.
• But how do we build a language model?
• Suppose we have an input sentence: Cats average 15 hours of sleep a day.
• Steps to build a language model will be (steps 1–2 are sketched in code after this list):
• Step 1 – Tokenize the input, i.e., create a dictionary
• Step 2 – Map these words to one-hot encoded vectors. We can add an <EOS> tag, which
represents the End Of Sentence
• Step 3 – Build an RNN model
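A minimal sketch of steps 1–2 for the example sentence; the dictionary indices produced here are illustrative, not the lecture's:

```python
sentence = "Cats average 15 hours of sleep a day"

# Step 1 - tokenize and append the End Of Sentence tag
tokens = sentence.lower().split() + ["<EOS>"]

# Step 2 - map each token to a dictionary index (then one-hot encode as shown earlier)
vocab = {w: i for i, w in enumerate(sorted(set(tokens)) + ["<UNK>"])}
indices = [vocab.get(t, vocab["<UNK>"]) for t in tokens]
print(tokens)
print(indices)
```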
Language Model and Sequence Generation
• We take the first input word and make a prediction for it.
• The first output tells us the probability of each word in the dictionary being the first word.
• The second output tells us the probability of the predicted word given the first input word:
• Each step in our RNN model looks at some set of preceding words to predict the next word.
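Illustratively, the language model scores a whole sentence as the product of these conditional word probabilities; the numbers below are made up, not model outputs:

```python
p_y1 = 0.002           # P("cats") from the first output
p_y2_given_y1 = 0.03   # P("average" | "cats") from the second output
p_y3_given_prev = 0.1  # P("15" | "cats average") from the third output

p_prefix = p_y1 * p_y2_given_y1 * p_y3_given_prev
print(p_prefix)        # probability the model assigns to "cats average 15"
```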
Challenges in Training a RNN Model
Issues with backpropagation
1. Vanishing Gradient
• Consider the two sentences:
• The cat, which already ate a bunch of food, was full.
• The cat, which already ate a bunch of food, were full.
• Which of the above two sentences is grammatically correct?
• It’s the first one.
Challenges in Training a RNN Model
1. Vanishing Gradient
• When using backpropagation, the goal is to calculate the error, which is found by taking the
difference between the actual output and the model output and squaring it.
• As this error is propagated backward through many timesteps, the gradient is multiplied by the
recurrent weights again and again; when those factors are small, it shrinks toward zero, so the
network struggles to learn long-range dependencies like the ‘cat … was’ agreement above.
Challenges in Training a RNN Model
Issues with backpropagation
2. Exploding Gradient
• The exploding gradient works in a similar way, but in reverse: the repeated multiplications make
the gradient grow, so the weights change drastically instead of negligibly.
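A toy scalar illustration of both problems (a simplification, since the real gradient involves weight matrices): repeatedly multiplying by a recurrent weight below 1 shrinks the gradient toward zero, while a weight above 1 blows it up.

```python
def backprop_scale(weight, timesteps=50):
    """Scale picked up by a gradient flowing back through `timesteps` steps (scalar toy model)."""
    grad = 1.0
    for _ in range(timesteps):
        grad *= weight
    return grad

print(backprop_scale(0.9))   # ~0.005 -> vanishing gradient
print(backprop_scale(1.1))   # ~117   -> exploding gradient
```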
Challenges in Training a RNN Model
How to overcome these Challenges?
Long Short Term Memory (LSTM) Networks
• LSTM networks are a special kind of RNN, capable of learning long-term dependencies.
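The gate mechanics behind this can be sketched as follows; this is the standard LSTM formulation, and the parameter names and shapes below are illustrative rather than taken from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, a_prev, c_prev, params):
    """One LSTM timestep using the standard gate equations."""
    Wf, bf, Wi, bi, Wc, bc, Wo, bo = params
    concat = np.concatenate([a_prev, x_t])

    f = sigmoid(Wf @ concat + bf)         # forget gate: what to drop from the cell state
    i = sigmoid(Wi @ concat + bi)         # input gate: what new information to write
    c_tilde = np.tanh(Wc @ concat + bc)   # candidate values for the cell state
    c_next = f * c_prev + i * c_tilde     # cell state carries the long-term memory
    o = sigmoid(Wo @ concat + bo)         # output gate
    a_next = o * np.tanh(c_next)          # new hidden state
    return a_next, c_next

# Tiny usage with random parameters.
n_a, n_x = 4, 3
rng = np.random.default_rng(0)
params = []
for _ in range(4):
    params += [rng.standard_normal((n_a, n_a + n_x)) * 0.1, np.zeros(n_a)]
a, c = lstm_step(rng.standard_normal(n_x), np.zeros(n_a), np.zeros(n_a), params)
```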
LSTM Network – Use case
• We will feed an LSTM with correct sequences from the text: three symbols as inputs and one
labeled symbol as the output.
• Eventually the neural network will learn to predict the next symbol correctly.
LSTM Network – Use case
• A unique integer value is assigned to each symbol because an LSTM can only process
real-numbered inputs.
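A minimal sketch of this preprocessing; the text and vocabulary here are illustrative placeholders, not the story used on the slides:

```python
text = "the quick brown fox jumps over the lazy dog".split()   # illustrative symbol sequence

# Assign a unique integer to each symbol, since the LSTM consumes real numbers.
symbol_to_int = {s: i for i, s in enumerate(sorted(set(text)))}

# Build training pairs: three input symbols -> one labeled (next) symbol.
window = 3
pairs = [(text[i:i + window], text[i + window]) for i in range(len(text) - window)]
encoded = [([symbol_to_int[s] for s in xs], symbol_to_int[y]) for xs, y in pairs]
print(pairs[0], "->", encoded[0])
```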