Unit-5-updated
Recurrent Connections
[Figure: an RNN cell. The input x_t enters through weights U, the previous state h_{t−1} feeds back through recurrent weights W, and the output ŷ_t is read out through V.]
h_t = ψ(U x_t + W h_{t−1})
ŷ_t = φ(V h_t)
Unrolling the Recurrence
[Figure: the recurrence unrolled over time. Each input x_1, x_2, x_3, . . . , x_τ enters its hidden state through U; the hidden states h_1, h_2, h_3, . . . , h_τ are chained through the shared recurrent weights W; each h_t produces an output ŷ_t through V.]
Feedforward Propagation
This is an RNN where the input and output sequences have the same length.
Feedforward operation proceeds from left to right.
Update equations:
a_t = b + W h_{t−1} + U x_t
h_t = tanh(a_t)
o_t = c + V h_t
ŷ_t = softmax(o_t)
Feedforward Propagation
The loss is simply the sum of the per-time-step losses.
If L_t is the negative log-likelihood of y_t given x_1, . . . , x_t, then
L = Σ_t L_t = − Σ_t log p_model(y_t | x_1, . . . , x_t)
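A minimal NumPy sketch of these update equations and of the summed negative log-likelihood loss (the dimensions, initialization and toy data below are illustrative assumptions, not from the slides):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def rnn_forward(x_seq, y_seq, U, W, V, b, c, h0):
    """Forward propagation through a simple RNN, plus the summed
    negative log-likelihood loss over all time steps."""
    h, loss, y_hats = h0, 0.0, []
    for x_t, y_t in zip(x_seq, y_seq):          # left to right in time
        a_t = b + W @ h + U @ x_t               # a_t = b + W h_{t-1} + U x_t
        h = np.tanh(a_t)                        # h_t = tanh(a_t)
        o_t = c + V @ h                         # o_t = c + V h_t
        y_hat = softmax(o_t)                    # yhat_t = softmax(o_t)
        loss += -np.log(y_hat[y_t])             # L_t = -log p(y_t | x_1..x_t)
        y_hats.append(y_hat)
    return y_hats, loss

# toy dimensions (hypothetical): 4-dim inputs, 8-dim state, 3 output classes
rng = np.random.default_rng(0)
n_in, n_h, n_out, T = 4, 8, 3, 5
U = rng.normal(0, 0.1, (n_h, n_in))
W = rng.normal(0, 0.1, (n_h, n_h))
V = rng.normal(0, 0.1, (n_out, n_h))
b, c, h0 = np.zeros(n_h), np.zeros(n_out), np.zeros(n_h)
xs = [rng.normal(size=n_in) for _ in range(T)]
ys = [rng.integers(n_out) for _ in range(T)]
print(rnn_forward(xs, ys, U, W, V, b, c, h0)[1])
```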
Backward Propagation
[Figure: the unrolled graph for a length-5 sequence: inputs x_1, . . . , x_5 enter through U, hidden states h_1, . . . , h_5 are chained through W, and outputs ŷ_1, . . . , ŷ_5 are produced through V.]
BPTT
[Figure: back-propagation through time on the same unrolled graph. The gradients ∂L/∂V, ∂L/∂W and ∂L/∂U are accumulated by summing the contributions flowing back from every time step.]
Gradient Computation
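A hedged sketch of how these gradients can be accumulated by back-propagation through time for the simple RNN above (a softmax output with negative log-likelihood is assumed, so ∂L/∂o_t = ŷ_t − onehot(y_t)):

```python
import numpy as np

def rnn_bptt(x_seq, y_seq, U, W, V, b, c, h0):
    """Back-propagation through time for the simple tanh RNN:
    gradients of the summed NLL loss w.r.t. U, W, V, b, c."""
    # forward pass, caching states
    hs, y_hats, h = [h0], [], h0
    for x_t in x_seq:
        h = np.tanh(b + W @ h + U @ x_t)
        hs.append(h)
        o = c + V @ h
        e = np.exp(o - o.max())
        y_hats.append(e / e.sum())
    # backward pass
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    db, dc = np.zeros_like(b), np.zeros_like(c)
    dh_next = np.zeros_like(h0)                      # gradient flowing in from t+1
    for t in reversed(range(len(x_seq))):
        do = y_hats[t].copy()
        do[y_seq[t]] -= 1.0                          # dL/do_t = yhat_t - onehot(y_t)
        dV += np.outer(do, hs[t + 1]); dc += do
        dh = V.T @ do + dh_next                      # from output and from h_{t+1}
        da = (1.0 - hs[t + 1] ** 2) * dh             # back through tanh
        dW += np.outer(da, hs[t]); dU += np.outer(da, x_seq[t]); db += da
        dh_next = W.T @ da                           # pass gradient back to h_{t-1}
    return dU, dW, dV, db, dc
```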
Bidirectional RNNs
RNNs considered till now all have a causal structure: the state at time t captures information only from the past inputs x_1, . . . , x_{t−1} and the present input x_t
Sometimes we are interested in an output y(t) which may
depend on the whole input sequence
Example: Interpretation of a current sound as a phoneme may
depend on the next few due to co-articulation
Basically, in many cases we are interested in looking into the
future as well as the past to disambiguate interpretations
Bidirectional RNNs were introduced to address this need
(Schuster and Paliwal, 1997), and have been used in
handwriting recognition (Graves 2012, Graves and
Schmidhuber 2009), speech recognition (Graves and
Schmidhuber 2005) and bioinformatics (Baldi 1999)
Bidirectional RNNs
Bidirectional RNNs combine an RNN that moves forward through time, beginning from the start of the sequence, with another RNN that moves backward through time, beginning from the end of the sequence.
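A minimal sketch of this idea, assuming simple tanh RNNs for both directions and illustrative parameter names (fwd = (U, W, b) for the forward RNN, bwd for the backward one, V and c for the output):

```python
import numpy as np

def birnn_forward(x_seq, fwd, bwd, V, c):
    """A minimal bidirectional RNN: one tanh RNN runs forward in time,
    another runs backward, and each output uses both states."""
    def run(params, seq):
        U, W, b = params
        h, states = np.zeros(W.shape[0]), []
        for x_t in seq:
            h = np.tanh(b + W @ h + U @ x_t)
            states.append(h)
        return states
    h_fwd = run(fwd, x_seq)                  # forward state at t summarizes x_1 .. x_t
    h_bwd = run(bwd, x_seq[::-1])[::-1]      # backward state at t summarizes x_t .. x_τ
    # the output at time t therefore depends on the whole input sequence
    return [V @ np.concatenate([hf, hb]) + c for hf, hb in zip(h_fwd, h_bwd)]
```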
RNN with Fixed vector as input
We have considered RNNs in the context of a sequence of
vectors x(t) with t = 1, . . . , τ as input
Sometimes we are interested in taking only a single, fixed-size vector x as input, which generates the y sequence
Some common ways to provide an extra input to an RNN are:
As an extra input at each time step
As the initial state h_0
RNN with Fixed vector as input
The first option (an extra input at each time step) is the most common.
[Figure: panel (b) shows that adding depth lengthens the shortest paths linking different time steps; panel (c) mitigates this by introducing skip connections.]
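A minimal sketch of the first option, where a single fixed vector x conditions every step of the generated sequence (the weight matrix name R and the shapes are illustrative):

```python
import numpy as np

def generate_from_vector(x, R, W, V, b, c, h0, steps):
    """Sketch: one fixed vector x conditions the whole output sequence
    by entering the state update at every time step."""
    h, outputs = h0, []
    for _ in range(steps):
        h = np.tanh(b + W @ h + R @ x)   # R x acts like an extra, input-dependent bias
        o = c + V @ h
        outputs.append(o)                # e.g. softmax(o) would give yhat_t
    return outputs
```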
Recursive Neural Networks
The computational graph is structured as a deep tree rather than as a chain, as in an RNN.
Recursive Neural Networks
Successfully used to process data structures as input to neural networks (Frasconi et al
1997), Natural Language Processing (Socher et al 2011) and Computer vision (Socher
et al 2011)
Advantage: for sequences of length τ, the number of compositions of nonlinear operations can be reduced from O(τ) to O(log τ)
Choice of tree structure is not very clear
A balanced binary tree, that does not depend on the structure of the data has been used in many
applications
Sometimes domain knowledge can be used: Parse trees given by a parser in NLP (Socher et al 2011)
The computation performed by each node need not be the usual neuron computation - it could instead
be tensor operations etc (Socher et al 2013)
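A minimal sketch of a recursive net over a balanced binary tree, assuming the common composition h = tanh(W [h_left; h_right] + b) (all names and shapes below are illustrative):

```python
import numpy as np

def recursive_net(node, U_leaf, W, b):
    """Recursive net over a binary tree: leaves are input vectors, internal
    nodes combine their two children with shared weights."""
    if isinstance(node, np.ndarray):          # leaf: a word/input vector
        return np.tanh(U_leaf @ node)
    left, right = node                        # internal node: (left_subtree, right_subtree)
    h_l = recursive_net(left, U_leaf, W, b)
    h_r = recursive_net(right, U_leaf, W, b)
    return np.tanh(W @ np.concatenate([h_l, h_r]) + b)

# balanced binary tree over four leaves: ((x1, x2), (x3, x4))
rng = np.random.default_rng(0)
d, d_in = 6, 4
U_leaf = rng.normal(0, 0.1, (d, d_in))
W, b = rng.normal(0, 0.1, (d, 2 * d)), np.zeros(d)
leaves = [rng.normal(size=d_in) for _ in range(4)]
root = recursive_net(((leaves[0], leaves[1]), (leaves[2], leaves[3])), U_leaf, W, b)
print(root.shape)   # the root summarizes the whole sequence in O(log τ) compositions
```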
Long-Term Dependencies
Challenge of Long-Term Dependencies
Basic problem: Gradients propagated over many stages tend
to vanish (most of the time) or explode (relatively rarely).
Difficulty with long-term interactions (involving the multiplication of many Jacobians) arises from the exponentially smaller weights they receive compared to short-term interactions
Why do gradients explode or vanish?
Back-propagating through the recurrence multiplies the gradient by the Jacobian of the state transition at every time step, so over k steps the gradient norm is scaled by roughly the k-th power of the gain of that Jacobian.
This tells us that the gradient norm can shrink to zero or blow up exponentially fast, depending on the gain.
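A small numerical illustration of this effect for a purely linear recurrence (the matrix size, number of steps and gains below are arbitrary choices):

```python
import numpy as np

# Repeatedly multiplying a gradient vector by the (transposed) recurrent
# Jacobian scales its norm by roughly (gain)^k after k steps.
rng = np.random.default_rng(0)
n, steps = 32, 50
for scale in (0.8, 1.0, 1.2):                                # gain below, at, above 1
    W = scale * np.linalg.qr(rng.normal(size=(n, n)))[0]     # orthogonal matrix times a gain
    g = rng.normal(size=n)
    for _ in range(steps):
        g = W.T @ g                                          # one backward step through time
    print(f"gain {scale}: |grad| after {steps} steps = {np.linalg.norm(g):.3e}")
```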
Challenge of Long-Term Dependencies
Recurrent networks involve the composition of the same function multiple times, once per time step.
The function composition in RNNs somewhat resembles matrix multiplication.
Consider the recurrence relationship:
h_t = W^T h_{t−1}
This could be thought of as a very simple recurrent neural network without a nonlinear activation and lacking the input x.
This recurrence essentially describes the power method and can be written as:
h_t = (W^t)^T h_0
If W admits an eigendecomposition W = Q Λ Q^T with orthogonal Q, this becomes h_t = Q^T Λ^t Q h_0: eigenvalues with magnitude less than one decay to zero, and those with magnitude greater than one explode.
Solution 1: Echo State Networks
Idea: Set the recurrent weights such that they do a good job of
capturing past history and learn only the output weights
Methods: Echo State Networks, Liquid State Machines
The general methodology is called reservoir computing
How to choose the recurrent weights?
Echo State Networks
The usual recipe is to fix the recurrent weights at random and rescale them so that the spectral radius of the recurrent weight matrix is near 1, so that past inputs "echo" in the state without dying out or exploding; only the output weights are then trained, typically by linear regression.
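A minimal reservoir-computing sketch along these lines (the reservoir size, spectral radius, input scaling and ridge penalty are illustrative choices):

```python
import numpy as np

def echo_state_network(x_seq, y_targets, n_reservoir=100, spectral_radius=0.95,
                       ridge=1e-6, seed=0):
    """Sketch of reservoir computing: random, fixed recurrent weights rescaled
    to a chosen spectral radius; only the readout is learned (ridge regression)."""
    rng = np.random.default_rng(seed)
    n_in = x_seq[0].shape[0]
    W_in = rng.uniform(-0.5, 0.5, (n_reservoir, n_in))
    W = rng.normal(size=(n_reservoir, n_reservoir))
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))   # rescale the gain
    h, states = np.zeros(n_reservoir), []
    for x_t in x_seq:                          # the reservoir is run, never trained
        h = np.tanh(W @ h + W_in @ x_t)
        states.append(h)
    H, Y = np.stack(states), np.stack(y_targets)
    # learn only the output weights: argmin ||H W_out - Y||^2 + ridge * ||W_out||^2
    W_out = np.linalg.solve(H.T @ H + ridge * np.eye(n_reservoir), H.T @ Y)
    return W_out
```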
Idea 1: Skip Connections - add direct connections from states in the distant past to states in the present, so gradients flow across fewer intermediate steps
Idea 2: Leaky Units - hidden units with linear self-connections whose weight is close to one, so they accumulate a slowly changing running average and remember information over longer durations
A Popular Solution: Gated Architectures
Long Short Term Memory
LSTM: Further Intuition
Gated Recurrent Unit
Lecture 6 Smaller Network: RNN
This is our fully connected network. If the input x_1, . . . , x_n is very large and growing, this network would become too large. Instead, we input one x_i at a time and re-use the same edge weights.
Recurrent Neural Network
How does an RNN reduce complexity?
[Figure: h_0 → f → h_1 → f → h_2 → f → h_3 → . . . , where each application of f consumes x_t and emits y_t, i.e. (h_t, y_t) = f(h_{t−1}, x_t).]
No matter how long the input/output sequence is, we only need one function f. If the f's were different at each step, it would become a feedforward NN. This may be treated as another compression relative to the fully connected network.
Deep RNN: h′, y = f1(h, x); g′, z = f2(g, y)
[Figure: two stacked recurrences. The lower layer f1 maps (h_{t−1}, x_t) to (h_t, y_t); the upper layer f2 maps (g_{t−1}, y_t) to (g_t, z_t), producing the outputs z_1, z_2, z_3, . . .]
Bidirectional RNN: y, h = f1(x, h); z, g = f2(g, x); p = f3(y, z)
[Figure: f1 runs forward over x_1, x_2, x_3 producing y_1, y_2, y_3 and states h; f2 runs backward producing z_1, z_2, z_3 and states g; at each step f3 combines y_t and z_t into the output p_t.]
Pyramid RNN
Higher layers operate on fewer time steps (neighbouring states are combined), which significantly speeds up training.
Naive RNN cell
[Figure: the basic cell f maps (h, x) to (h′, y).]
h′ = σ(W_h h + W_i x)
y = softmax(W_o h′)    (note that y is computed from h′)
LSTM cell state
[Figure: the cell state C_t runs horizontally through the cell, from C_{t−1} and h_{t−1}, between the forget gate and the input gate.]
The core idea is the cell state C_t: it is changed slowly, with only minor linear interactions, so it is very easy for information to flow along it unchanged.
The forget gate decides what to discard from C_{t−1}; the input gate decides which components are to be updated, and C̃_t provides the candidate change contents.
Why sigmoid or tanh? The sigmoid's 0-1 output acts as a gating switch. The vanishing gradient problem is already handled inside the LSTM, so would replacing tanh with ReLU be OK?
Naive RNN vs LSTM
[Figure: a naive RNN cell maps (h_{t−1}, x_t) to (h_t, y_t); an LSTM cell additionally carries the cell state, mapping (c_{t−1}, h_{t−1}, x_t) to (c_t, h_t, y_t).]
The four internal signals are computed from the current input and the previous hidden state:
z = tanh(W [x_t; h_{t−1}])    (candidate update)
z_i = σ(W_i [x_t; h_{t−1}])    (controls the input gate)
z_f = σ(W_f [x_t; h_{t−1}])    (controls the forget gate)
z_o = σ(W_o [x_t; h_{t−1}])    (controls the output gate)
In the "peephole" variant, c_{t−1} is appended to the gate inputs as well, with the corresponding weight matrices kept diagonal; the remaining signals are obtained in the same way.
Information flow of LSTM
(⊙ denotes element-wise multiplication)
c_t = z_f ⊙ c_{t−1} + z_i ⊙ z
h_t = z_o ⊙ tanh(c_t)
y_t = σ(W′ h_t)
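A minimal sketch of one LSTM step in this notation (bias terms are omitted and the parameter shapes are assumptions: each weight matrix maps the concatenated [x_t; h_{t−1}] to the state dimension):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, h_prev, c_prev, W, Wi, Wf, Wo, W_out):
    """One LSTM step in the slide's notation: z is the candidate,
    z_i / z_f / z_o are the input, forget and output gates."""
    v = np.concatenate([x_t, h_prev])        # [x_t ; h_{t-1}]
    z  = np.tanh(W  @ v)                     # candidate update
    zi = sigmoid(Wi @ v)                     # input gate
    zf = sigmoid(Wf @ v)                     # forget gate
    zo = sigmoid(Wo @ v)                     # output gate
    c_t = zf * c_prev + zi * z               # c_t = z_f ⊙ c_{t-1} + z_i ⊙ z
    h_t = zo * np.tanh(c_t)                  # h_t = z_o ⊙ tanh(c_t)
    y_t = sigmoid(W_out @ h_t)               # y_t = σ(W' h_t)
    return y_t, h_t, c_t
```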
[Figure: LSTM cells chained over time. The pair (c_t, h_t) produced at step t, together with x_{t+1}, drives the gates z_f, z_i, z, z_o of the next step, producing y_{t+1}, c_{t+1} and h_{t+1}.]
Feedforward network vs RNN
Feedforward: t indexes the layer, and each layer has its own function, a_t = f_t(a_{t−1}) = σ(W_t a_{t−1} + b_t).
RNN: t indexes the time step, and the same f is applied at every step, consuming x_t and emitting y_t.
Applying a GRU-style cell to depth rather than time gives the Highway Network:
There is no input x_t at each step and no output y_t at each step; a_{t−1} is the output of the (t−1)-th layer and a_t is the output of the t-th layer.
There is no reset gate, only an update gate z:
h′ = σ(W a_{t−1})
z = σ(W′ a_{t−1})
Highway Network: a_t = z ⊙ a_{t−1} + (1 − z) ⊙ h′
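A minimal sketch of one such layer in the slide's notation (the layer size and the stacking loop below are illustrative):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def highway_layer(a_prev, W, W_gate):
    """One highway layer: a candidate h' and an update gate z decide how
    much of the previous layer's output to carry through unchanged."""
    h_cand = sigmoid(W @ a_prev)              # h' = σ(W a_{t-1})
    z = sigmoid(W_gate @ a_prev)              # z  = σ(W' a_{t-1})
    return z * a_prev + (1.0 - z) * h_cand    # a_t = z ⊙ a_{t-1} + (1-z) ⊙ h'

# stacking many such layers lets information skip layers via the gate z
rng = np.random.default_rng(0)
d = 16
a = rng.normal(size=d)
for _ in range(10):
    a = highway_layer(a, rng.normal(0, 0.1, (d, d)), rng.normal(0, 0.1, (d, d)))
print(a.shape)
```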
Grid LSTM
[Figure: a standard LSTM maps (c, h) and the input x to (c′, h′) along the time axis. A Grid LSTM carries LSTM-style memory along both time and depth, mapping (c, h, a, b) to (c′, h′, a′, b′) with the same gate structure z_f, z_i, z, z_o.]
You can generalize this to 3D, and more.
Applications of LSTM / RNN
Neural machine translation
[Figure: an LSTM-based sequence-to-sequence translation model.]
Sequence to sequence chat model
Chat with context
[Figure: example dialogue turns ("M: Hi", "M: Hello", "U: Hi") illustrating that the machine's reply to the same user utterance depends on the preceding context.]
Serban, Iulian V., Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau, 2015. "Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models."
Baidu’s speech recognition using RNN
Attention
Image caption generation using attention (from CY Lee's lecture)
[Figure: a CNN produces one feature vector for each image region (its filter responses). z_0 is an initial parameter, which is also learned; it is matched against each region vector to produce attention weights (e.g. 0.7, or a map such as 0.0 / 0.8 / 0.2), and the weighted sum of the region vectors is used to generate the caption.]
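A minimal sketch of this attention step, using a plain dot product followed by softmax as the match function (an illustrative choice; the exact match function is an assumption here):

```python
import numpy as np

def attend(z, region_vectors):
    """Match a query z against each region vector, normalize the scores,
    and return the weighted sum of the regions."""
    scores = np.array([z @ r for r in region_vectors])     # match(z, region)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                # e.g. [0.0, 0.8, 0.2, ...]
    context = sum(w * r for w, r in zip(weights, region_vectors))  # weighted sum
    return context, weights

# z0 is an initial query that is also learned; here it is random for illustration
rng = np.random.default_rng(0)
regions = [rng.normal(size=32) for _ in range(6)]   # one CNN feature vector per region
z0 = rng.normal(size=32)
context, weights = attend(z0, regions)
print(weights.round(2))
```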
Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo
Larochelle, Aaron Courville, “Describing Videos by Exploiting Temporal Structure”, ICCV,
2015
Thank You