
Deep Learning

Welcome to the Course AI62


Unit V

Sequence Modelling: Recurrent and Recursive Nets:


Unfolding Computational Graphs, Recurrent Neural Networks,
Bidirectional RNNs, Encoder-Decoder Sequence-to-
Sequence Architectures, Deep Recurrent Networks,
Recursive Neural Networks, The Long Short-Term Memory
and Other Gated RNNs
Sequence Modeling:
● Recurrent networks can scale to much longer sequences than would be practical for networks without sequence-based specialization.
● Most recurrent networks can also process sequences of variable length.
● RNNs may also be applied in two dimensions across spatial data such as images. Even when applied to data involving time, the network may have connections that go backwards in time, provided that the entire sequence is observed before it is provided to the network.
Sequence Modeling: Recurrent and Recursive Nets
Recurrent neural networks or RNNs
● RNNs are a family of neural networks for processing sequential data.
● Early ideas are found in machine learning and statistical models of the 1980s: sharing parameters across different parts of a model.
● Recurrent networks share parameters in a different way: each member of the output is a function of the previous members of the output.
● Each member of the output is produced using the same update rule applied to the previous outputs.
● This recurrent formulation results in the sharing of parameters through a very deep computational graph.
Today: Recurrent Neural Networks

[Figure: input/output configurations of recurrent networks]
● Vanilla neural networks: one-to-one, fixed-size input to fixed-size output
● One-to-many, e.g. image captioning: image -> sequence of words
● Many-to-one, e.g. sentiment classification: sequence of words -> sentiment
● Many-to-many, e.g. machine translation: sequence of words -> sequence of words
● Many-to-many, e.g. video classification on the frame level
Computational Graphs
● A computational graph is a way to formalize the structure of a set of computations, such as those involved in mapping inputs and parameters to outputs and loss.
● A typical dynamical system: s(t) = f(s(t−1); θ). It is recurrent because the definition of s at time t refers back to the same definition at time t − 1.
● A system driven by external data: s(t) = f(s(t−1), x(t); θ).
RNN
● A recurrent network with no outputs.
● This recurrent network just processes information from the input x by incorporating it into the state h that is passed forward through time.
RNN
● We can represent the unfolded recurrence after t steps with a function g(t): h(t) = g(t)(x(t), x(t−1), . . . , x(1)) = f(h(t−1), x(t); θ).
● The unfolding process thus introduces two major advantages:
● Regardless of the sequence length, the learned model always has the same input size, because it is specified in terms of a transition from one state to another state.
● It is possible to use the same transition function f with the same parameters at every time step.
Unfolding Computational Graphs
Different ways of unfolding computational graphs
RNN: Computational Graph (many-to-many, many-to-one, and one-to-many configurations)
Variety of recurrent neural networks
Recurrent networks that produce an output at each time step and have recurrent
connections between hidden units
Recurrent networks that produce an output at each time step and have
recurrent connections only from the output at one time step to the hidden
units at the next time step
Recurrent networks with recurrent connections between hidden units, that read an entire
sequence and then produce a single output
Simple RNN Implementation
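The code on this slide is not reproduced in the text, so here is a minimal NumPy sketch of the vanilla recurrence used in the following slides, h_t = tanh(U x_t + W h_{t−1}) and ŷ_t = softmax(V h_t); the function and variable names (rnn_step, rnn_forward, U, W, V, b, c) are illustrative, not from the original slide.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def rnn_step(x_t, h_prev, U, W, V, b, c):
    """One step of a plain (vanilla) RNN; U, W, V are shared by every step."""
    h_t = np.tanh(U @ x_t + W @ h_prev + b)   # new hidden state
    y_t = softmax(V @ h_t + c)                # output distribution
    return h_t, y_t

def rnn_forward(xs, h0, U, W, V, b, c):
    """Unroll the same transition function over a sequence of any length."""
    h, outputs = h0, []
    for x_t in xs:                            # same parameters reused at each time step
        h, y_t = rnn_step(x_t, h, U, W, V, b, c)
        outputs.append(y_t)
    return outputs, h
```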
Unrolling the recurrence in RNN
Plain Vanilla Recurrent Network

[Figure: a single recurrent cell with input x_t, hidden state h_t, and output y_t.]

Recurrent Connections
● Hidden-state update: h_t = ψ(U x_t + W h_{t−1})
● Output: ŷ_t = φ(V h_t)
● ψ can be tanh and φ can be softmax.


Unrolling the Recurrence

[Figure, built up over several slides: the recurrence unrolled over time steps t = 1, . . . , τ. Each input x_t feeds the hidden state h_t through U, each hidden state feeds the next through W, and each output ŷ_t is produced from h_t through V; the same parameters U, W, V are shared across all time steps.]
Feedforward Propagation
● This is an RNN where the input and output sequences are of the same length.
● Feedforward operation proceeds from left to right.
● Update equations:
  a_t = b + W h_{t−1} + U x_t
  h_t = tanh(a_t)
  o_t = c + V h_t
  ŷ_t = softmax(o_t)
Feedforward Propagation
● The loss is just the sum of the losses over the time steps: if L_t is the negative log-likelihood of y_t given x_1, . . . , x_t, then L = Σ_t L_t = −Σ_t log p_model(y_t | x_1, . . . , x_t).
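As a concrete illustration (a sketch under the same naming assumptions as the earlier code, where y_hats are the per-step softmax outputs and targets are integer class labels), the total loss is just the sum of per-step negative log-likelihoods:

```python
import numpy as np

def sequence_nll(y_hats, targets):
    """L = sum_t L_t, with L_t = -log p(y_t | x_1, ..., x_t)."""
    return -sum(np.log(y_hat[y]) for y_hat, y in zip(y_hats, targets))
```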
Backward Propagation
● Need to find: ∇_V L, ∇_W L, ∇_U L
● And the gradients w.r.t. the biases: ∇_c L and ∇_b L
● Treat the recurrent network as a usual multilayer network and apply backpropagation on the unrolled network.
● We move from right to left: this is called backpropagation through time (BPTT).


BPTT

[Figure, built up over several slides: the network unrolled for five time steps. Backpropagation proceeds from right to left through the unrolled graph, and because V, W and U are shared across time steps, the per-step contributions ∂L/∂V, ∂L/∂W and ∂L/∂U are summed to form the total gradients.]
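A minimal sketch of BPTT for the update equations above (tanh hidden units, softmax outputs, integer targets; all names such as bptt, hs, y_hats are illustrative, and the initial hidden state is assumed to sit at hs[0]):

```python
import numpy as np

def bptt(xs, ys, hs, y_hats, U, W, V):
    """Backpropagation through time for the vanilla RNN sketched earlier.
    xs, ys: input vectors and integer targets; hs: hidden states with hs[0] = h_0
    and hs[t+1] the state after reading xs[t]; y_hats: per-step softmax outputs."""
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    db, dc = np.zeros(W.shape[0]), np.zeros(V.shape[0])
    dh_next = np.zeros(W.shape[0])
    for t in reversed(range(len(xs))):            # move from right to left
        do = y_hats[t].copy()
        do[ys[t]] -= 1.0                          # dL_t/do_t for softmax + NLL
        dV += np.outer(do, hs[t + 1]); dc += do
        dh = V.T @ do + dh_next                   # gradient flowing into h_t
        da = (1.0 - hs[t + 1] ** 2) * dh          # back through tanh
        db += da
        dU += np.outer(da, xs[t])
        dW += np.outer(da, hs[t])                 # hs[t] is h_{t-1}
        dh_next = W.T @ da                        # pass gradient to the previous step
    return dU, dW, dV, db, dc
```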
Gradient Computation
Bidirectional RNNs
● The RNNs considered till now all have a causal structure: the state at time t only captures information from the past x(1), . . . , x(t−1).
● Sometimes we are interested in an output y(t) which may depend on the whole input sequence.
● Example: the interpretation of a current sound as a phoneme may depend on the next few sounds due to co-articulation.
● Basically, in many cases we are interested in looking into the future as well as the past to disambiguate interpretations.
● Bidirectional RNNs were introduced to address this need (Schuster and Paliwal, 1997), and have been used in handwriting recognition (Graves, 2012; Graves and Schmidhuber, 2009), speech recognition (Graves and Schmidhuber, 2005) and bioinformatics (Baldi, 1999).
Bidirectional RNNs
● Bidirectional RNNs combine an RNN that moves forward through time, beginning from the start of the sequence, with another RNN that moves backward through time, beginning from the end of the sequence, as sketched below.
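A minimal sketch of the idea, assuming plain tanh recurrences and concatenation of the two hidden states as the per-position representation (a common choice, not something the slide specifies); all names are illustrative:

```python
import numpy as np

def bidirectional_rnn(xs, h0_f, h0_b, Uf, Wf, Ub, Wb):
    """Run a forward-in-time RNN and a backward-in-time RNN over the same
    sequence and concatenate their states at each position."""
    step = lambda x, h, U, W: np.tanh(U @ x + W @ h)   # plain recurrence
    h_f, fwd = h0_f, []
    for x in xs:                                       # left to right
        h_f = step(x, h_f, Uf, Wf)
        fwd.append(h_f)
    h_b, bwd = h0_b, []
    for x in reversed(xs):                             # right to left
        h_b = step(x, h_b, Ub, Wb)
        bwd.append(h_b)
    bwd.reverse()                                      # align with the forward states
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```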
RNN with Fixed vector as input
● We have considered RNNs in the context of a sequence of vectors x(t), with t = 1, . . . , τ, as input.
● Sometimes we are interested in taking only a single, fixed-size vector x as input that generates the y sequence.
● Some common ways to provide this extra input to an RNN are:
  ● as an extra input at each time step
  ● as the initial state h(0)
RNN with Fixed vector as input
● The first option (extra input at each time step) is the most common: it maps a fixed vector x into a distribution over sequences Y (x^T R effectively becomes a new bias parameter for each hidden unit).
RNN with Fixed vector as input: Application
● Caption generation
Encoder-Decoder Sequence-to-Sequence Architectures
● How do we map input sequences to output sequences that are not necessarily of the same length?
● Other example applications: speech recognition, question answering, etc.
● The input to this RNN is called the context; we want to find a representation of the context C.
● C could be a vector or a sequence that summarizes X = {x(1), . . . , x(n_x)}.
Encoder-Decoder Sequence-to-Sequence Architectures
● Far more complicated mappings become possible, as in the sketch below.
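A minimal sketch of the encoder-decoder idea, assuming plain tanh cells, the final encoder state used directly as the context C, a fixed number of decoding steps, and greedy argmax decoding; none of these choices, nor the names encode/decode, come from the slide:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def encode(xs, h0, U_enc, W_enc):
    """Read the whole input sequence and summarize it as the context C."""
    h = h0
    for x in xs:
        h = np.tanh(U_enc @ x + W_enc @ h)
    return h                                    # C = final encoder state

def decode(C, n_steps, E, U_dec, W_dec, V_dec):
    """Generate an output sequence whose length need not match the input's."""
    h, prev = C, np.zeros(E.shape[1])           # E: output-token embedding matrix
    tokens = []
    for _ in range(n_steps):
        h = np.tanh(U_dec @ prev + W_dec @ h)   # decoder state update
        probs = softmax(V_dec @ h)              # distribution over output tokens
        tok = int(np.argmax(probs))             # greedy choice (an assumption)
        tokens.append(tok)
        prev = E[tok]                           # feed the chosen token back in
    return tokens
```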


Deep Recurrent Networks
● The computations in RNNs can be decomposed into three blocks of parameters and associated transformations:
  ● input to hidden state
  ● previous hidden state to the next hidden state
  ● hidden state to the output
● Each of these transformations has so far been a learned affine transformation followed by a fixed nonlinearity.
● The intuition for why depth should be more useful is quite similar to that in deep feed-forward networks.
● Optimization can become much harder, but this can be mitigated by tricks such as introducing skip connections.
Deep Recurrent Networks

(b) lengthens the shortest paths linking different time steps; (c) mitigates this by introducing skip connections.
Recursive Neural Networks
● The computational graph is structured as a deep tree, rather than as a chain as in an RNN.
Recursive Neural Networks
● Successfully used to process data structures as input to neural networks (Frasconi et al., 1997), in natural language processing (Socher et al., 2011) and in computer vision (Socher et al., 2011).
● Advantage: for sequences of length T, the number of compositions of nonlinear operations can be reduced from T to O(log T).
● The choice of tree structure is not very clear:
  ● A balanced binary tree, which does not depend on the structure of the data, has been used in many applications.
  ● Sometimes domain knowledge can be used: parse trees given by a parser in NLP (Socher et al., 2011).
  ● The computation performed by each node need not be the usual neuron computation; it could instead be, e.g., tensor operations (Socher et al., 2013). A minimal sketch of a balanced-tree composition follows.
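A minimal sketch of composing a balanced binary tree over a sequence of leaf vectors with a single shared composition p = tanh(W [left; right] + b); the balanced-tree choice and the names compose/recursive_encode are assumptions for illustration:

```python
import numpy as np

def compose(left, right, W, b):
    """Shared composition applied at every internal tree node (W: d x 2d, b: d)."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

def recursive_encode(leaves, W, b):
    """Reduce T leaf vectors with a balanced binary tree, so only O(log T)
    compositions separate any leaf from the root."""
    nodes = list(leaves)
    while len(nodes) > 1:
        nxt = [compose(nodes[i], nodes[i + 1], W, b)
               for i in range(0, len(nodes) - 1, 2)]
        if len(nodes) % 2 == 1:
            nxt.append(nodes[-1])         # odd node carried up unchanged
        nodes = nxt
    return nodes[0]                       # root representation
```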
Long-Term Dependencies
Challenge of Long-Term Dependencies
● Basic problem: gradients propagated over many stages tend to vanish (most of the time) or explode (relatively rarely).
● The difficulty with long-term interactions (involving the multiplication of many Jacobians) arises because they receive exponentially smaller weights compared to short-term interactions.
Why do gradients explode or vanish?

● The gradient norm can shrink to zero or blow up exponentially fast, depending on the gain.
Challenge of Long-Term Dependencies
● Recurrent networks involve the composition of the same function multiple times, once per time step.
● The function composition in RNNs somewhat resembles matrix multiplication.
● Consider the recurrence relationship h(t) = W^T h(t−1).
● This could be thought of as a very simple recurrent neural network without a nonlinear activation function and lacking inputs x.
● This recurrence essentially describes the power method and can be written as h(t) = (W^t)^T h(0). If W admits an eigendecomposition W = Q Λ Q^T with orthogonal Q, this becomes h(t) = Q^T Λ^t Q h(0), so eigenvalues with magnitude less than one decay towards zero while eigenvalues with magnitude greater than one explode; a small numerical illustration follows.
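A tiny numerical illustration of this effect (the 2×2 matrix, its eigenvalues and the step counts are arbitrary choices for illustration, not from the slide):

```python
import numpy as np

# Symmetric W with eigenvalues 1.1 (explodes) and 0.9 (vanishes).
Q, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(2, 2)))
W = Q @ np.diag([1.1, 0.9]) @ Q.T

h0 = np.ones(2)
for t in (1, 10, 50, 100):
    h_t = np.linalg.matrix_power(W.T, t) @ h0   # h(t) = (W^t)^T h(0)
    print(t, np.linalg.norm(h_t))
# The norm grows roughly like 1.1**t, while the component along the
# 0.9-eigenvector shrinks like 0.9**t.
```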
Solution 1: Echo State Networks
● Idea: set the recurrent weights such that they do a good job of capturing past history, and learn only the output weights.
● Methods: echo state machines, liquid state machines.
● The general methodology is called reservoir computing.
● How to choose the recurrent weights?
Echo State Networks
Idea 1: Skip Connections
Idea 2: Leaky Units
A Popular Solution: Gated Architectures
Long Short Term Memory
LSTM: Further Intuition
Gated Recurrent Unit
Lecture 6 Smaller Network: RNN

This is our fully connected network. If the input is x1 .... xn and n is very large and growing, this network would become too large. We now input one xi at a time and re-use the same edge weights.
Recurrent Neural Network
How does RNN reduce complexity?
● Given a function f: h', y = f(h, x), where h and h' are vectors with the same dimension.

[Figure: the same cell f applied at every step, h0 → h1 → h2 → h3 → . . ., consuming x1, x2, x3 and emitting y1, y2, y3.]

No matter how long the input/output sequence is, we only need one function f. If the f's are different, then it becomes a feedforward NN. This may be treated as another compression relative to a fully connected network.
Deep RNN: h', y = f1(h, x); g', z = f2(g, y)

[Figure: two stacked recurrent layers; the first layer maps x1, x2, x3, . . . to y1, y2, y3, . . . through states h, and the second layer maps y1, y2, y3, . . . to z1, z2, z3, . . . through states g.]
Bidirectional RNN: y, h = f1(x, h); z, g = f2(g, x); p = f3(y, z)

[Figure: a forward RNN f1 produces y1, y2, y3 from x1, x2, x3 read left to right, a backward RNN f2 produces z1, z2, z3 from the same inputs read right to left, and f3 combines yt and zt into the output pt at each position.]
Pyramid RNN
● Reducing the number of time steps between stacked bidirectional RNN layers significantly speeds up training.

W. Chan, N. Jaitly, Q. Le and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” ICASSP, 2016.
Naïve RNN

[Figure: a single cell f maps (h, x) to (h', y).]
● h' = σ(W_h h + W_i x)
● y = softmax(W_o h'); note that y is computed from h'.
● We have ignored the bias.
Problems with naive RNN
● When dealing with a time series, it tends to forget old information. When there is a distant relationship of unknown length, we wish to have a “memory” for it.
● Vanishing gradient problem.

The sigmoid layers output numbers between 0 and 1 that determine how much of each component should be let through; the pink × gate is point-wise multiplication.
LSTM

[Figure: the standard LSTM cell, with the cell state C_{t−1} → C_t along the top and the hidden state h_{t−1} → h_t along the bottom.]
● Forget gate: a sigmoid that controls how much information goes through from the previous cell state.
● Input gate: decides which components are to be updated, while the candidate C'_t provides the change contents; together they decide what information is added to the cell state.
● Output gate: a sigmoid that determines how much goes into the output.
● The core idea is the cell state C_t: it is changed slowly, with only minor linear interactions, so it is very easy for information to flow along it unchanged.
● Why sigmoid or tanh? Sigmoid gives 0-1 gating, acting as a switch. The vanishing gradient problem is handled in the LSTM already; is it OK for ReLU to replace tanh?
● The remaining steps: updating the cell state, then deciding what part of the cell state to output.
RNN vs LSTM
Peephole LSTM
● Allows “peeping into the memory”.


Naïve RNN vs LSTM

[Figure: the naïve RNN cell maps (h_{t−1}, x_t) to (h_t, y_t); the LSTM cell additionally carries a cell state, mapping (c_{t−1}, h_{t−1}, x_t) to (c_t, h_t, y_t).]
● c changes slowly: c_t is c_{t−1} with something added to it.
● h changes faster: h_t and h_{t−1} can be very different.
● The four matrix computations below should be done concurrently.
[Figure: computing the four gate vectors from x_t and h_{t−1} (concatenated).]
● z = tanh( W [x_t ; h_{t−1}] ) (the updating information)
● z_i = σ( W_i [x_t ; h_{t−1}] ) (controls the input gate)
● z_f = σ( W_f [x_t ; h_{t−1}] ) (controls the forget gate)
● z_o = σ( W_o [x_t ; h_{t−1}] ) (controls the output gate)

Information flow of LSTM
● With a diagonal “peephole” connection, c_{t−1} is also fed to the gates; z_o, z_f and z_i are obtained in the same way.
Information flow of LSTM (⊙ denotes element-wise multiplication)
● c_t = z_f ⊙ c_{t−1} + z_i ⊙ z
● h_t = z_o ⊙ tanh(c_t)
● y_t = σ( W' h_t )

[Figure: the same cell repeated at time steps t and t+1, with c_{t−1} → c_t → c_{t+1} and h_{t−1} → h_t → h_{t+1} flowing between consecutive steps. A minimal code sketch follows.]
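A minimal NumPy sketch of one LSTM step exactly as written above (gates computed from the concatenation [x_t ; h_{t−1}], no peephole; the function and weight names are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, Wi, Wf, Wo, W_out):
    """One LSTM step: gates from [x_t ; h_{t-1}], additive cell-state update."""
    v = np.concatenate([x_t, h_prev])
    z  = np.tanh(W  @ v)           # candidate update
    zi = sigmoid(Wi @ v)           # input gate
    zf = sigmoid(Wf @ v)           # forget gate
    zo = sigmoid(Wo @ v)           # output gate
    c_t = zf * c_prev + zi * z     # cell state changes slowly (additive)
    h_t = zo * np.tanh(c_t)        # hidden state can change quickly
    y_t = sigmoid(W_out @ h_t)     # output, as on the slide: y_t = sigma(W' h_t)
    return h_t, c_t, y_t
```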

GRU – gated recurrent unit (more compression)

[Figure: the GRU cell, with a reset gate and an update gate; ×/* denote element-wise multiplication.]
● It combines the forget and input gates into a single update gate.
● It also merges the cell state and hidden state. This is simpler than the LSTM, and there are many other variants too.
● GRUs also take x_t and h_{t−1} as inputs, perform some calculations, and then pass along h_t. What makes them different from LSTMs is that GRUs do not need a separate cell state to pass values along. The calculations within each iteration ensure that the h_t values being passed along either retain a high amount of old information or are jump-started with a high amount of new information, as in the sketch below.
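A minimal NumPy sketch of one GRU step in the standard formulation with a reset gate r and an update gate z (the weight names and the convention that z interpolates towards the new candidate are assumptions, since the slide's own equations are not reproduced in the text):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wr, Wz, Wh):
    """One GRU step: no separate cell state, only the hidden state h."""
    v = np.concatenate([x_t, h_prev])
    r = sigmoid(Wr @ v)                                        # reset gate
    z = sigmoid(Wz @ v)                                        # update gate
    h_tilde = np.tanh(Wh @ np.concatenate([x_t, r * h_prev]))  # candidate state
    return (1 - z) * h_prev + z * h_tilde                      # keep old vs. take new
```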
Feed-forward vs Recurrent Network
1. A feedforward network does not have an input at each step.
2. A feedforward network has different parameters for each layer.

[Figure: feedforward network x → f1 → a1 → f2 → a2 → f3 → a3 → f4 → y, where t indexes the layer]
a_t = f_t(a_{t−1}) = σ(W_t a_{t−1} + b_t)

[Figure: recurrent network h0 → f → h1 → f → h2 → f → h3 → f → h4 → g → y, with inputs x1, x2, x3, x4, where t indexes the time step]
a_t = f(a_{t−1}, x_t) = σ(W_h a_{t−1} + W_i x_t + b_i)
We will turn the recurrent network 90 degrees.

[Figure: the recurrence drawn vertically, so that time steps become layers.]
● No input x_t at each step; no output y_t at each step.
● a_{t−1} is the output of the (t−1)-th layer; a_t is the output of the t-th layer.
● No reset gate (compare the GRU's reset gate r, update gate z and candidate h').
● h' = σ(W a_{t−1}), z = σ(W' a_{t−1})
● Highway Network: a_t = z ⊙ a_{t−1} + (1 − z) ⊙ h', as sketched below.
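A minimal sketch of one highway layer as written above, with the transform h' and the gate z both computed from the previous layer's output (tanh is assumed for the transform nonlinearity, and the names are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def highway_layer(a_prev, W, W_gate):
    """a_t = z * a_{t-1} + (1 - z) * h', so a layer can simply copy its input."""
    h = np.tanh(W @ a_prev)          # candidate transform (the slide writes sigma)
    z = sigmoid(W_gate @ a_prev)     # carry gate
    return z * a_prev + (1 - z) * h
```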

• Highway Network vs. Residual Network

[Figure: in a highway layer, a gate controller z decides how much of a_{t−1} is copied and how much of the transformed h' is used; in a residual layer, a_{t−1} is simply added to h'.]

Training Very Deep Networks: https://arxiv.org/pdf/1507.06228v2.pdf
Deep Residual Learning for Image Recognition: http://arxiv.org/abs/1512.03385

[Figure: stacks of layers from input to output layer.] The Highway Network automatically determines the layers needed!

Highway Network Experiments
Grid LSTM: memory for both time and depth

[Figure: a standard LSTM block maps (c, h) to (c', h') along the time direction given input x; a Grid LSTM block also carries a memory pair (a, b) along the depth direction, mapping (c, h, a, b) to (c', h', a', b') with the same gated update (z_f, z_i, z, z_o and tanh) in both directions.]

You can generalize this to 3D, and more.
Applications of LSTM / RNN
● Neural machine translation (LSTM)
● Sequence-to-sequence chat model
● Chat with context: the reply to the same user turn (e.g. "Hi") can differ depending on the earlier turns of the dialogue.
  Serban, Iulian V., Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau, 2015, "Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models".
● Baidu's speech recognition using RNN
Attention
Image caption generation using attention (from CY Lee's lecture)

[Figure, built up over several slides: a CNN produces a vector for each region of the image. The query z0 is an initial parameter that is also learned; it is matched against every region vector to give attention weights (e.g. 0.7, 0.1, 0.1, 0.1, 0.0, 0.0). The weighted sum of the region vectors is fed to the RNN, which emits Word 1 and produces the next query z1; z1 attends to the regions with new weights (e.g. 0.0, 0.8, 0.2, 0.0, 0.0, 0.0) to produce Word 2, and so on. A small sketch of the weighted-sum step follows.]
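A minimal sketch of one attention step as described above, using dot-product matching followed by a softmax-weighted sum (the matching function is an assumption, since the slides only show the resulting weights):

```python
import numpy as np

def attend(z, region_vectors):
    """Match the query z against each region vector and take a weighted sum."""
    scores = region_vectors @ z                 # one match score per region
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # e.g. [0.7, 0.1, 0.1, 0.1, 0.0, 0.0]
    context = weights @ region_vectors          # input to the next RNN step
    return context, weights
```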
Image Caption Generation

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio, "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML, 2015
* Possible project?

Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo
Larochelle, Aaron Courville, “Describing Videos by Exploiting Temporal Structure”, ICCV,
2015
Thank You
