RNN and LSTM - Explanation by Example

The document explains the workings of Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks using examples related to predicting dinner choices and writing a children's book. It discusses the limitations of RNNs in handling long-term dependencies and introduces LSTMs as a solution that incorporates memory components to improve predictions. Additionally, it highlights various applications of LSTMs, including language translation and speech recognition.


How Recurrent Neural Networks and Long Short-Term Memory Work – By Example
2017-2021
Based on the notes from Brandon Rohrer

1
Explanation using examples
We will attempt to explain the functionality of

• RNNs

• LSTMs

by using a few examples

2
RNN – Guess what we have for Dinner tonight?
• Every night for dinner, we have either:

₋ Pizza, or
₋ Sushi, or
₋ Waffles

• and then repeat

3
Guess the dinner tonight?
Voting process → prediction

Outputs: the 3 dinner choices
• pizza,
• sushi,
• waffles

Inputs: whatever can affect what
we have for dinner, for
example,
• day of the week,
• month,
• late meeting
4
Pizza, Sushi, Waffles, & repeat - Re-examine the data
Let’s simplify our assumptions
Assume that the choice of
dinner does not depend on the
day of the week, month, or late
meetings
Let’s assume that the data
follows a simple pattern of
• pizza,
• sushi,
• waffles, and
• repeat
Therefore, we just need to
know what we had last night
5
What happens if we do not know what we had last night?

• e.g., I was not home last night, or I cannot remember

• Then, it would be helpful to have:

• a prediction of what we might have had last night

6
What do we need to know to make a prediction about dinner tonight?
• Generally, we need
to know:

• A prediction of
what we might
have had last
night
or
• Information
about the dinner
last night

7
Side note - Vectors

Neural networks understand
vectors best

The native language of
neural networks is vectors

8
Side note - Vectors as statements
ONE HOT ENCODING

The list (vector) includes


all possibilities for the
days of the week

All of them are ZERO,
except the one that is
true; here, Tuesday is
ONE

“It is Tuesday”

9
Side note - One Hot Vector for our example
A vector: a list of values
We have 3 choices for
dinner
-Pizza,
-Sushi,
-Waffles

“we have Sushi”


The one hot vector
representing this
statement is:

0
1
0
10
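As an illustrative sketch (not from the original slides), this one-hot encoding could be built as follows in Python with NumPy; the ordering [pizza, sushi, waffles] and the helper name are assumptions for illustration.

```python
import numpy as np

# Assumed ordering of the dinner choices
DINNERS = ["pizza", "sushi", "waffles"]

def one_hot(choice, vocabulary):
    """Return a vector that is all zeros except a 1 at the chosen item."""
    vec = np.zeros(len(vocabulary))
    vec[vocabulary.index(choice)] = 1.0
    return vec

print(one_hot("sushi", DINNERS))  # [0. 1. 0.]  -> "we have Sushi"
```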
Input/Output vector
- Input: two vectors

1. A vector for the
prediction of dinner
for yesterday

2. A vector for the actual
dinner yesterday

- Output: one vector

1. A vector for the dinner
prediction for today

11
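A minimal sketch of building this input, assuming the one_hot helper and DINNERS list from the earlier sketch:

```python
import numpy as np

prediction_yesterday = one_hot("pizza", DINNERS)  # what we predicted for yesterday
actual_yesterday     = one_hot("sushi", DINNERS)  # what we actually had yesterday

# The network's input is the two vectors joined together
network_input = np.concatenate([prediction_yesterday, actual_yesterday])
print(network_input)  # [1. 0. 0. 0. 1. 0.]
```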
Recurrent Neural Networks

12
RNN - Create a feedback from output to the input

We can now connect
the output to the input
to feed back the predicted
vector with a delay (Pt-1)

The dotted line signifies
the delay

If the output vector
denotes time t (Pt), then the
feedback line denotes
time t-1 (Pt-1)
13
Dinner example - Unwrapped recurrent network
Now we can go as
far back as we want

Let’s say we have
the dinner
information going
back two weeks, for
example

14
Example: A network to write a children’s book
The vocabulary of
words that we have to
write this book is
rather small:

₋ Doug
₋ Jane
₋ Spot
₋ saw
₋ .

Objective: to put these
words together in the right
order to write a book
15
RNN to write a book
- 3 vectors:

1. A vector of the
words that we
have now (it)

2. A vector of the
prediction of the
words (Pt)

3. A vector of the
prediction from the
previous step, fed back (Pt-1)

The new information (it) indicates which word is current,
e.g., if the current word is Doug, the vector has a 1 in the
position for Doug and 0 everywhere else
16
Trained RNN – new information vector (it)
Let’s try to work out
this RNN

After the training is
done, when the new
information is
₋ Jane,
₋ Doug, or
₋ Spot

we expect that the
trained RNN would
point to
₋ saw or
₋ .
17
Working out our RNN – prediction vector (Pt-1)
If the predicted
word is
- Jane, or
- Doug, or
- Spot

Similarly, we expect
that the trained net
would point to
- saw, or
- .

18
Working out our RNN
If the present word is
- saw, or
- .

the trained net would
point to
- Jane,
- Doug, or
- Spot

since a name should
appear after “saw” or “.”
19
A representation for our RNN
The input is a concatenation
of the new information and the
predicted values

The activation function used
here is tanh,

which makes the output behave
well
20
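As a rough sketch (not from the slides), one step of this simple recurrent network could be written as follows in Python/NumPy; the weight matrix W and bias b are assumed to come from training:

```python
import numpy as np

def rnn_step(new_info, prev_prediction, W, b):
    """One step of the simple recurrent network described above.

    new_info        : one-hot vector for the current input (i_t)
    prev_prediction : prediction vector fed back from the last step (P_{t-1})
    W, b            : trained weight matrix and bias (assumed given)
    """
    x = np.concatenate([new_info, prev_prediction])  # concatenate the two input vectors
    return np.tanh(W @ x + b)                        # squash the result into (-1, +1)
```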
Side note – how does tanh work
Tanh is a squashing function:
regardless of the input, the
output will always be
between -1 & +1 (very
important)

For input values close to
zero, the output value is
very close to the
original input

For large positive input values,
the output value approaches +1

For large negative input values,
the output value approaches -1
21
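A quick numeric check of this squashing behaviour (illustrative only):

```python
import numpy as np

# tanh maps any input into the range (-1, +1)
for x in [-5.0, -1.0, -0.1, 0.0, 0.1, 1.0, 5.0]:
    print(f"tanh({x:+.1f}) = {np.tanh(x):+.4f}")
# tanh(+0.1) ≈ +0.0997 (close to the input), tanh(+5.0) ≈ +0.9999 (saturated)
```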
Why an RNN may not work
Doug saw Doug.
(after “saw” we expect a
name; that name could be
Doug)

Jane saw Spot saw …
(after “saw” we expect a
name, and after a name we
can expect “saw” …)

Spot. Doug. Jane.
(after a name we can
expect “.”)

22
What may not work so far?

Problem:

We have only short-term
memory

We only look back one
time step & do not use
the information from
further back

23
RNN
A simple architecture of
an RNN, with a feedback
delay

The input is a
combination of:

- the new information
&
- what you predicted in
the previous time step

24
How do we fix this?
We need to modify
the existing
architecture

One solution is to add
memory capabilities

How do we add a
memory component?

25
Introduction of the memory component
Adding a memory
component

to enable the network
to remember what
happened many steps
ago (from further
back)

26
Side note - Element-by-Element Addition/Plus Junction

27
Side note - Element-by-Element Multiplication/Times Junction

28
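These two side notes are shown as figures in the slides; a tiny illustrative sketch of both junctions:

```python
import numpy as np

a = np.array([0.5, -0.3, 0.8])
b = np.array([0.1,  0.4, 0.2])

print(a + b)  # plus junction: element-by-element addition        -> roughly 0.6, 0.1, 1.0
print(a * b)  # times junction: element-by-element multiplication -> roughly 0.05, -0.12, 0.16
```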
Gating
We can use the times junction to
control what percentage of
an input (a signal) goes
through, i.e., gating

In this example, the 1st
element of the signal goes
through completely,
whereas the 3rd element is
completely masked

29
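A small sketch of gating with the times junction (the gate values here are made up for illustration):

```python
import numpy as np

signal = np.array([0.7, -0.5, 0.9])
gate   = np.array([1.0,  0.5, 0.0])  # 1 = pass fully, 0 = block completely

print(signal * gate)  # -> 0.7, -0.25, 0.0 : the 1st element passes, the 3rd is masked
```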
Side note - Sigmoid Function

30
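The sigmoid slide is a figure; for reference, the standard logistic sigmoid squashes any input into the range (0, 1), which is what makes it suitable for producing gate values:

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

for x in [-5.0, 0.0, 5.0]:
    print(f"sigmoid({x:+.1f}) = {sigmoid(x):.4f}")
# sigmoid(-5.0) ≈ 0.0067, sigmoid(0.0) = 0.5, sigmoid(+5.0) ≈ 0.9933
```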
Memory Component: forget & keep
Memory
component (fed the
prediction from the last
round):

• to forget some
of the previous
prediction, and

• to keep the rest

31
How does the forget gate work?
1. A combination of the previous
prediction & new information
goes through net1 (what to
predict) & a prediction
is made accordingly

2. A copy of the prediction from
the last round is passed
to the forget gate (net2:
what to forget)

Note:
net2 is different from net1 & its
task is to learn what to forget &
when to forget

3. A part of this will be forgotten &
the remaining will be added to
the prediction
32
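A rough sketch of this forget/keep step using the naming from these slides (the weights for net2 are assumed to come from training):

```python
import numpy as np

def forget_step(memory, new_info, prev_prediction, W2, b2):
    """Keep or forget each element of the memory.

    net2 (W2, b2) looks at the previous prediction & the new information and
    outputs a value between 0 (forget) and 1 (keep) for each memory element.
    """
    x = np.concatenate([new_info, prev_prediction])
    keep = 1.0 / (1.0 + np.exp(-(W2 @ x + b2)))  # sigmoid gate values in (0, 1)
    return memory * keep                         # times junction: gate the memory
```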
Add a selection layer – net3

We do not necessarily
need to send the entire
prediction to the
input/output

net3 (what to select)
decides which part of
the prediction goes
back to the
input/output

33
How does the selection gate work?
In the previous layer
(forget/keep) we combined our
memory with our prediction

1. We need to have a filter to
select which part of the
combined memory +
prediction goes out

2. We also need to add a new
tanh after the elementwise
add to make sure everything
is still between -1 & +1 (the
addition might have pushed
values beyond -1/+1)
34
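A sketch of this selection step in the same style (net3's weights assumed trained; sigmoid and the times junction as above):

```python
import numpy as np

def selection_step(memory_plus_prediction, new_info, prev_prediction, W3, b3):
    """Choose which part of the combined memory + prediction goes out.

    The combined signal is re-squashed with tanh (the elementwise add may have
    pushed values outside -1..+1), then gated by net3's selection values.
    """
    x = np.concatenate([new_info, prev_prediction])
    select = 1.0 / (1.0 + np.exp(-(W3 @ x + b3)))    # sigmoid: what to select
    return np.tanh(memory_plus_prediction) * select  # squash, then gate
```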
Where does learning happen so far?

• net1: to learn to PREDICT

• net2: to learn what to FORGET/KEEP

• net3: to learn what to SELECT

35
Add an ignore/attention layer – net4
net4 (what to ignore)
learns to ignore some of
the possible
predictions

36
How does the ignore layer work?
Some of the possible
predictions that are not
immediately relevant
get ignored,

so that we do not unnecessarily
complicate the predictions
(by keeping too many of
them) in the memory
going forward

37
Where does learning happen?

• net1: to learn to predict

• net2: to learn what to forget/keep

• net3: to learn what to select

• net4: to learn what to ignore

38
LSTM Structure

[Figure: the full LSTM structure, combining net1 (predict), net2 (forget/keep), net3 (select), and net4 (ignore)]

39
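Putting the pieces together, here is a minimal end-to-end sketch of one step of the LSTM-style cell described in these slides, using the net1-net4 naming above (all weights are assumed to come from training; this follows the slides' simplified structure rather than a textbook LSTM equation-for-equation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(new_info, prev_prediction, memory, params):
    """One step of the LSTM-style cell from these slides.

    params holds the trained (W, b) pairs for net1..net4 (assumed given).
    Returns the new outgoing prediction and the updated memory.
    """
    x = np.concatenate([new_info, prev_prediction])

    candidate = np.tanh(params["W1"] @ x + params["b1"])  # net1: what to predict
    keep      = sigmoid(params["W2"] @ x + params["b2"])  # net2: what to forget/keep
    select    = sigmoid(params["W3"] @ x + params["b3"])  # net3: what to select
    attend    = sigmoid(params["W4"] @ x + params["b4"])  # net4: what to ignore (attention)

    memory = memory * keep + candidate * attend  # forget old content, add new (non-ignored) predictions
    prediction = np.tanh(memory) * select        # re-squash, then select what goes out
    return prediction, memory
```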
Side note
• A multiplicative input gate unit learns to protect the constant
error flow within the memory cell from perturbation by
irrelevant inputs

• Likewise, a multiplicative output gate unit learns to protect
other units from perturbation by currently irrelevant
memory contents stored in the memory cell

40
Running a simple example
Assume this LSTM is
already trained

net1, net2, net3, net4 are
known

41
Information going through
① So far we have …
“Jane saw Spot.”
and the new word is “Doug”
② We also know from
previous prediction that the
next word can be “Doug,

Jane, Spot”
③ We pass this info through
net 1, 2, 3, 4 to
1. Predict
2. Ignore
3. Forget
4. Select
42
net1 - Prediction Step
The new word is “Doug”; net1 should predict that the next word is “saw”
Also, net1 should know that since the new word is “Doug”, it should not see the word
“Doug” again very soon

net1 makes 2 predictions:

1. A positive prediction for
“saw”
2. A negative prediction for
“Doug” (do not expect to
see “Doug” in the near
future)

43
net4 - Ignore Step
This example is simple;
we do not need to focus on
ignoring anything

These predictions of
₋ “saw”
₋ “not Doug”

are passed forward

44
net2 - Forget Step

For the sake of
simplicity, assume
there is no memory at
the moment

Therefore,
• “saw”
• “not Doug”

go forward

45
net3 - Selection Step
The selection mechanism
(net3) has learned that when
the most recent word was a
name, then the next word is either
• “saw” or
• “.”

net3 blocks any other words
from coming out, so
₋ “not Doug” gets blocked
₋ “saw” goes out
as the prediction for the next
time step
46
Next Prediction Process
So we take a step forward in
time; now the word “saw” is
both our most recent word and
our most recent prediction

They get passed forward to


all of these neural networks
(net 1, 2, 3, 4) and we get
a new set of predictions

47
net1 - Prediction Step

Because the word “saw” just


occurred we now predict that
the words
• “Doug”,
• “Jane”, or
• “Spot”
might come next

we will pass over ignoring


and attention in this example
again & we will take those
predictions forward

48
net2 - Forget Step
Now the other thing that we
need to consider is our
previous set of possibilities

Remember that we already
had the words
• saw
• not Doug
that we maintain internally
from the previous step

They get passed to the
forgetting gate

49
net2 - Forget Step
At the forgetting gate we know:
the last word that occurred was
the word “saw”, so the
network can forget it, but the
network should keep any
predictions about names

For net2:
• forget “saw”
• keep “not Doug”

Now we have:
• a positive vote for “Doug”
• a positive vote for “not Doug”
(i.e., a negative vote for “Doug”)

They cancel each other, so after this
point the network has only “Jane”
& “Spot”; those get passed forward
50
net3 - Selection Step
The selection gate knows that

• the word “saw” just
occurred, and
• a name should happen
next

• so it passes through these
predictions for names, and
for the next time step
we get predictions of
• “Jane”
• “Spot”
51
Some mistakes may not happen
This network can avoid:
• Doug saw Doug.
• Jane saw Spot saw …
• Spot. Doug. Jane.

That is because an LSTM can look back two, three, or many time steps and
use that information to make good predictions about what's going to
happen next.

Note: vanilla recurrent neural networks can actually look back
several time steps as well, but not very many.
52
LSTM Applications
• Translation of text from one language to another language

Translation is not a word-to-word process; it's a phrase-to-phrase, or in some
cases even a sentence-to-sentence, process. LSTMs are able to represent the
grammar structures that are specific to each language, and what it looks like is
that they find the higher-level idea and translate it from one mode of expression
to another, using just the bits and pieces that we walked through.

53
LSTM Applications
• Translation of speech to text

Speech is just a signal that varies in time. An LSTM takes it and uses it to predict
what text (what word) is being spoken, and it can use the recent history of words
to make a better guess at what's going to come next.

54
LSTM Applications
• LSTMs are a great fit for any information that is embedded in time,
like audio and video

• Another example: an agent taking in information from a set of sensors and then,
based on that information, making a decision and carrying out an action.

• This is inherently sequential, and actions taken now can influence what is
sensed and what should be done many time steps down the line.

55
Some interesting applications

56
