RNN and LSTM - Explanation by Example
Explanation using examples
We will attempt to explain the functionality of
• RNNs
• LSTMs
RNN – Guess what we have for Dinner tonight?
• Every night for dinner, we have either:
₋ Pizza, or
₋ Sushi, or
₋ Waffles
Guess the dinner tonight?
The prediction works like a voting process.
Outputs: the 3 choices
• pizza,
• sushi,
• waffles
Inputs: whatever can affect what we have for dinner, for example,
• day of the week,
• month,
• a late meeting
Pizza, Sushi, Waffles, & repeat - Re-examine the data
Let’s simplify our assumptions.
Assume that the choice of dinner does not depend on the day of the week, the month, or late meetings.
Instead, assume that the data follows a simple pattern:
• pizza,
• sushi,
• waffles, and
• repeat.
Therefore, we just need to know what we had last night.
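As a rough illustration of this simplified pattern, here is a minimal sketch in Python (the dictionary and function names are ours, purely for illustration):

NEXT_DINNER = {"pizza": "sushi", "sushi": "waffles", "waffles": "pizza"}

def predict_tonight(last_night):
    # Under the simplified pattern, tonight's dinner is fully determined
    # by what we had last night.
    return NEXT_DINNER[last_night]

print(predict_tonight("sushi"))  # -> waffles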
What happens if we do not know what we had last night?
What do we need to know to predict tonight's dinner?
Generally, we need to know either:
• a prediction of what we might have had last night, or
• information about the dinner last night.
Side note - Vectors
The native language of neural networks (NNs) is vectors.
Side note - Vectors as statements
ONE-HOT ENCODING
For example, the statement “It is Tuesday” can be encoded as a vector with a 1 in the Tuesday position and 0s everywhere else.
Side note - One Hot Vector for our example
A vector is a list of values.
We have 3 choices for dinner:
₋ Pizza,
₋ Sushi,
₋ Waffles
For example, [0 1 0]' encodes Sushi (a 1 in the second position, 0s elsewhere).
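A minimal sketch of this one-hot encoding in Python, assuming the ordering pizza, sushi, waffles (the helper name is ours):

import numpy as np

DINNERS = ["pizza", "sushi", "waffles"]  # assumed ordering

def one_hot(choice):
    # Return a vector with a 1 in the position of the chosen dinner and 0s elsewhere.
    vec = np.zeros(len(DINNERS))
    vec[DINNERS.index(choice)] = 1.0
    return vec

print(one_hot("sushi"))  # -> [0. 1. 0.]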
Input/Output vector
- Input: two vectors
1. A vector for the prediction of dinner for yesterday
2. A vector of the new information, i.e., what we actually had for dinner last night
Recurrent Neural Networks
RNN - Create a feedback loop from the output to the input
Dinner example - Unwrapped recurrent network
Now we can go back as far (in time) as we want.
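A minimal sketch of this feedback idea in Python, using a hand-made weight matrix that simply rotates the one-hot dinner vector one step forward in the pizza -> sushi -> waffles cycle (the matrix is an illustrative stand-in for a trained network):

import numpy as np

# Each column says which dinner follows the dinner in that position
# (pizza -> sushi, sushi -> waffles, waffles -> pizza).
W = np.array([[0., 0., 1.],
              [1., 0., 0.],
              [0., 1., 0.]])

prediction = np.array([0., 1., 0.])   # last known dinner: sushi
for night in range(3):
    prediction = W @ prediction       # feed the output back in as the next input
    print(prediction)                 # waffles, then pizza, then sushi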
Example: A network to write a children’s book
The collection (dictionary) of words that we have for writing this book is rather small:
₋ Doug
₋ Jane
₋ Spot
₋ saw
₋ .
The input/output vectors for this example:
1. A vector of the new information (it): it indicates the current word, e.g., if the current word is Doug then the vector is [0 1 0 0 0 0]'
2. A vector of the prediction of the words (Pt)
3. A vector of the words that may come next (Pt-1)
Trained RNN – new information vector (it)
Let’s try to work out this RNN.
Similarly, if the new word is a name, we expect that the trained net would point to:
₋ saw, or
₋ .
Working out our RNN
If the present word is
₋ saw, or
₋ .
then the trained net should point to a name, as a name should appear after “saw” or “.”.
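A minimal sketch of the rule the trained RNN is expected to capture for this vocabulary; the function below is a hand-written illustration of that rule, not actual learned weights:

NAMES = ["Doug", "Jane", "Spot"]

def expected_next(word):
    # Words the trained net should point to after the current word.
    if word in NAMES:
        return ["saw", "."]
    return NAMES  # after "saw" or "." a name should appear

print(expected_next("Doug"))  # -> ['saw', '.']
print(expected_next("saw"))   # -> ['Doug', 'Jane', 'Spot']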
A representation for our RNN
The input is a collection (concatenation) of the new information and the predicted values.
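A minimal sketch of that concatenation for the book example, assuming a 6-element word vector as in the slides; the weights are random placeholders and tanh is just a common squashing choice:

import numpy as np

rng = np.random.default_rng(0)
vocab_size = 6

new_info = np.zeros(vocab_size)
new_info[1] = 1.0                       # e.g., the current word is Doug: [0 1 0 0 0 0]'
prev_prediction = np.zeros(vocab_size)  # prediction carried over from the previous step

x = np.concatenate([new_info, prev_prediction])    # the network's actual input
W = rng.normal(size=(vocab_size, 2 * vocab_size))  # untrained weights, shapes only
prediction = np.tanh(W @ x)                        # one recurrent step
print(prediction.shape)                            # -> (6,)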
What may not work so far?
Problem: the network only looks back one time step. With such a short memory it can make mistakes, for example predicting “Doug” again right after “Doug” has just appeared.
RNN
A simple architecture of an RNN: the output is fed back to the input through a delay.
The input is a combination of the new information and the previous prediction.
How do we fix this?
We need to modify the existing architecture.
How do we add a memory component?
Introduction of the memory component
Adding a memory component.
Side note - Element-by-Element Addition/Plus Junction
Side note - Element-by-Element Multiplication/Times Junction
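A minimal sketch of the two junctions, using element-by-element operations on numpy arrays:

import numpy as np

a = np.array([0.2, 0.5, 0.9])
b = np.array([0.1, 0.5, 0.1])

print(a + b)  # plus junction: element-by-element addition        -> [0.3 1.  1. ]
print(a * b)  # times junction: element-by-element multiplication -> [0.02 0.25 0.09]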
Gating
We can use the times junction to control what percentage of an input (a signal) goes through, i.e., gating (see the sketch after the sigmoid side note below).
Side note - Sigmoid Function
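A minimal sketch tying the sigmoid to gating: the sigmoid squashes any value into the range 0 to 1, and multiplying a signal by that value element by element lets through the corresponding fraction of each component:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

signal = np.array([0.8, 0.6, 0.4])
gate = sigmoid(np.array([10.0, 0.0, -10.0]))  # roughly 1.0, 0.5, 0.0
print(gate * signal)                          # roughly [0.8, 0.3, 0.0]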
Memory Component: forget & keep
The memory component works on the prediction from the last round:
• to forget some of the previous prediction, and
• to keep the rest.
How does the forget gate work?
1. A combination of the previous prediction and the new information goes through net1, and a prediction is made accordingly (net1: what to predict).
2. A copy of the prediction from the last round is given to the forget gate (net2: what to forget).
Note: net2 is different from net1; its task is to learn what to forget and when to forget it.
We do not necessarily need to send the entire prediction to the input/output; that is handled by the selection gate (net3: what to select).
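A minimal sketch of the forget gate under the description above: a small net (net2 here) looks at the previous prediction and the new information and produces 0-to-1 values that are multiplied, element by element, into what is remembered. The weights are random placeholders, not trained values:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
vocab_size = 6
W_forget = rng.normal(size=(vocab_size, 2 * vocab_size))  # net2: what to forget (placeholder)

new_info = np.zeros(vocab_size)
new_info[1] = 1.0                         # current word, e.g., Doug
prev_prediction = rng.random(vocab_size)  # prediction from the last round
memory = rng.random(vocab_size)           # what is currently remembered

forget_gate = sigmoid(W_forget @ np.concatenate([new_info, prev_prediction]))
memory = forget_gate * memory             # keep only what the gate lets through
print(memory)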
How does the selection gate work?
In the previous layer (forget/keep) we combined our memory with our prediction. The selection gate (net3) decides which part of that combination is actually sent out as the prediction.
Add an ignore/attention layer – net4
To ignore some of the possible predictions.
How does the ignore layer work?
Some of the possible predictions that are not immediately relevant are ignored, so that the predictions held in memory do not become unnecessarily complicated (by having too many of them) going forward.
Where does learning happen?
LSTM Structure
(figure: the LSTM structure, with its four nets labeled ① to ④)
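A minimal sketch of one full LSTM step built from the pieces above (a prediction net, a forget gate, a selection gate, and an ignore/attention gate). This follows a standard LSTM formulation, which may differ in detail from the structure in the figure, and the weights are random placeholders:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(new_info, prev_prediction, memory, W_pred, W_forget, W_select, W_ignore):
    # Concatenate the new information with the previous prediction, as before.
    x = np.concatenate([new_info, prev_prediction])
    candidate = np.tanh(W_pred @ x)   # what to predict (candidate values)
    forget = sigmoid(W_forget @ x)    # what to forget from memory
    ignore = sigmoid(W_ignore @ x)    # what to ignore (attention on the candidates)
    select = sigmoid(W_select @ x)    # what to select for the output

    memory = forget * memory + ignore * candidate  # update the memory component
    prediction = select * np.tanh(memory)          # expose only the selected part
    return prediction, memory

n = 6  # dictionary size assumed from the slides
rng = np.random.default_rng(0)
W_pred, W_forget, W_select, W_ignore = (rng.normal(size=(n, 2 * n)) for _ in range(4))
prediction, memory = lstm_step(np.eye(n)[1], np.zeros(n), np.zeros(n),
                               W_pred, W_forget, W_select, W_ignore)
print(prediction.shape, memory.shape)  # -> (6,) (6,)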
Side note
• A multiplicative input gate unit learns to protect the constant error flow within the memory cell from perturbation by irrelevant inputs.
Running a simple example
Assume this LSTM is already trained.
Information going through
① So far we have “Jane saw Spot.” and the new word is “Doug”.
② We also know from the previous prediction that the next word can be “Doug”, “Jane”, or “Spot”.
③ We pass this information through nets 1, 2, 3, and 4 to:
1. Predict
2. Ignore
3. Forget
4. Select
net1 - Prediction Step
④ The new word is “Doug”, so net1 should predict that the next word is “saw”.
Also, net1 should know that since the new word is “Doug”, it should not see the word “Doug” again very soon.
net2 - Ignore Step
This example is simple; we do not need to focus on ignoring anything here.
⑤ This prediction of
₋ “saw”
₋ “not Doug”
is passed forward.
net3 - Forget Step
net4 - Selection Step
The selection mechanism (net4) has learned that when the most recent word was a name, the next word is either
• “saw”, or
• “.”
net1 - Prediction Step
net3 - Forget Step
Now the other thing that we need to consider is our previous set of possibilities.
net3 - Forget Step
At the forgetting gate we know that the last word that occurred was “saw”, so the network can forget it, but it should keep any predictions about names.
For net3:
• forget “saw”
• keep “not Doug”
Now, at the plus junction, we have:
• a positive vote for “Doug”
• a positive vote for “not Doug” (or, equivalently, a negative vote for “Doug”)
They cancel each other out, so after this point the network has only “Jane” and “Spot”. Those get passed forward.
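A minimal numeric sketch of that cancellation at the plus junction, with the five dictionary words in an assumed order; the votes are illustrative values, not network outputs:

import numpy as np

words = ["Doug", "Jane", "Spot", "saw", "."]
new_prediction = np.array([1.0, 1.0, 1.0, 0.0, 0.0])   # a name should come next
kept_memory    = np.array([-1.0, 0.0, 0.0, 0.0, 0.0])  # "not Doug", kept by the forget gate

combined = new_prediction + kept_memory                # plus junction
print(dict(zip(words, combined)))                      # Doug cancels out; Jane and Spot remain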
net4 - Selection Step
The selection gate knows that after “saw” a name should come next, and the surviving candidates are “Jane” and “Spot”, not “Doug”.
That is because an LSTM can look back two, three, many time steps and use that information to make good predictions about what is going to happen next.
Note: vanilla recurrent neural networks can actually look back several time steps as well, but not very many.
LSTM Applications
• Translation of text from one language to another language
Even though translation is not a word-to-word process (it is a phrase-to-phrase, or in some cases a sentence-to-sentence, process), LSTMs are able to represent the grammar structures that are specific to each language. In effect, they find the higher-level idea and translate it from one mode of expression to another, using just the bits and pieces that we walked through.
LSTM Applications
• Translation of speech to text
Speech is just a set of signals that vary in time. The LSTM takes them and uses them to predict what text (what word) is being spoken, and it can use the recent history of words to make a better guess about what is coming next.
LSTM Applications
• LSTMs are a great fit for any information that is embedded in time, like audio and video.
• Such information is inherently sequential, and actions taken now can influence what is sensed and what should be done many time steps down the line.
Some interesting applications