{michael.wand,vincent.herrmann}@idsia.ch
Dalle Molle Institute for Artificial Intelligence Studies (IDSIA), USI-SUPSI
y_t = f(x_t, . . . , x_1)
s_{t+1} = g(s_t, x_t)
x = (x_1, . . . , x_T) with x_t ∈ R^D
o = (o_1, . . . , o_T) with o_t ∈ R^K.
(On this slide, we use the symbol o for the target, in order to
distinguish it from the time index t.)
The length of the sequences can vary between samples.
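As a concrete (hypothetical) illustration of this setup, the sketch below builds a toy dataset of sequences with varying lengths T, inputs x_t ∈ R^D and targets o_t ∈ R^K; all sizes are made up.

```python
import numpy as np

D, K = 3, 2                      # input and target dimensions (made up)
lengths = [5, 8, 4]              # T may differ between samples

dataset = [
    (np.random.randn(T, D),      # x = (x_1, ..., x_T), x_t in R^D
     np.random.randn(T, K))      # o = (o_1, ..., o_T), o_t in R^K
    for T in lengths
]

for x, o in dataset:
    print(x.shape, o.shape)      # (5, 3) (5, 2), (8, 3) (8, 2), (4, 3) (4, 2)
```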
The recurrent layer computes z(ℓ)(t) = f(W(ℓ) z(ℓ−1)(t) + V(ℓ) z(ℓ)(t − 1) + b(ℓ)), with suitable weight matrices W(ℓ) and V(ℓ), bias b(ℓ), and nonlinearity f.
Usually, the state z(ℓ)(0), ℓ ∈ {1, . . . , L}, is initialized to zeros.
Storage requirement of the recurrent layer (for forward propagation):
one extra set of layer activations; temporal requirement: as many
computation steps as the sequence has elements.
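A minimal NumPy sketch of this recurrent layer, assuming tanh as the nonlinearity f; the function and variable names are illustrative. Note that the loop keeps only one extra set of activations (z_prev) and runs once per sequence element, matching the storage and time requirements stated above.

```python
import numpy as np

def rnn_layer_forward(xs, W, V, b, f=np.tanh):
    """Forward pass of one recurrent layer.

    xs: sequence of layer inputs, shape (T, D_in)
    W:  input weights, shape (D_out, D_in)
    V:  recurrent weights, shape (D_out, D_out)
    b:  bias, shape (D_out,)
    """
    T = xs.shape[0]
    z_prev = np.zeros(W.shape[0])          # z(0) initialized to zeros
    zs = np.zeros((T, W.shape[0]))
    for t in range(T):
        # one extra set of activations (z_prev), one step per element
        z_prev = f(W @ xs[t] + V @ z_prev + b)
        zs[t] = z_prev
    return zs
```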
Recurrent Network Setup
[Figure: fully connected network with inputs x_1, . . . , x_D, a feedforward hidden layer, a recurrent hidden layer, and outputs y_1, . . . , y_K; the layers are connected by weight matrices W(1), W(2), W(3), with recurrent weights V(2) (blue connections).]
The figure shows a neural network with one hidden feedforward layer and one
hidden recurrent layer, followed by a feedforward output layer. Remember that
the blue connections incur a delay of one time step; this makes the
computation well-defined.
Visual: Recurrent Neural Network
All the models we have talked about before today can only
implement static input-output mappings.
Still, feedforward neural networks are very powerful: A sufficiently
wide net can approximate any continuous function!
Recurrent nets are dynamical systems: a recurrent network can implement any algorithm (given enough storage).
In practice, some algorithms are easier to learn than others...
We will now see how to extend the backpropagation algorithm from the last lecture to deal with recurrence.
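As a sketch of the idea (not the lecture's derivation): if the recurrent computation is unrolled over the sequence, ordinary backpropagation through the unrolled graph yields backpropagation through time. The PyTorch snippet below is purely illustrative; the layer sizes and the squared-error loss are arbitrary choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
D, H, K, T = 4, 8, 2, 6                    # illustrative sizes
cell = nn.RNNCell(D, H)                    # one recurrent layer
readout = nn.Linear(H, K)                  # feedforward output layer

x = torch.randn(T, 1, D)                   # one input sequence (batch of 1)
targets = torch.randn(T, 1, K)

h = torch.zeros(1, H)                      # z(0) initialized to zeros
loss = 0.0
for t in range(T):                         # unroll the recurrence over time
    h = cell(x[t], h)
    loss = loss + ((readout(h) - targets[t]) ** 2).mean()

loss.backward()                            # gradients flow through every time step
print(cell.weight_hh.grad.shape)           # the recurrent weights receive gradients too
```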
[Figure: a recurrent network (input x, hidden layers z(1), z(2), output y) and the same network unrolled over time, with copies z(2)(1), z(2)(2), . . . , z(2)(t).]
Assume z(ℓ−1) (t) is the output of the preceding layer at the current
timestep, and z(ℓ) (t − 1) is the output of the LSTM layer at the
previous time step.
These are the input for the LSTM cell, whose precise behavior is as
follows.
Each gate yields a value between 0 and 1 (using sigmoid nonlinearities). The gates are trainable; each has its own set of weights which connect to the input vectors:
G_X = σ(W_X z(ℓ−1)(t) + V_X z(ℓ)(t − 1) + b_X),
where X stands for any of the gates ‘I’, ‘O’, and ‘F’.
Occasionally, gates are allowed to access the cell state (peephole
connections).
Finally, the output z(ℓ)(t) is computed from the cell state c(t) by passing it through another nonlinearity (usually tanh) and multiplying it with the output gate:
z(ℓ)(t) = G_O · tanh(c(t)).
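A minimal NumPy sketch of one LSTM step following the gate equation above. The cell-state update uses the standard formulation without peephole connections, and the parameter names (W_*, V_*, b_*) are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, z_prev, c_prev, params):
    """One LSTM time step.

    x:      z(l-1)(t), output of the preceding layer
    z_prev: z(l)(t-1), output of this layer at the previous step
    c_prev: cell state at the previous step
    params: dict with weights W_*, V_* and biases b_* for
            the gates I, O, F and the cell input C
    """
    p = params
    G_I = sigmoid(p["W_I"] @ x + p["V_I"] @ z_prev + p["b_I"])      # input gate
    G_F = sigmoid(p["W_F"] @ x + p["V_F"] @ z_prev + p["b_F"])      # forget gate
    G_O = sigmoid(p["W_O"] @ x + p["V_O"] @ z_prev + p["b_O"])      # output gate
    c_tilde = np.tanh(p["W_C"] @ x + p["V_C"] @ z_prev + p["b_C"])  # candidate cell input
    c = G_F * c_prev + G_I * c_tilde        # gated cell-state update
    z = G_O * np.tanh(c)                    # output: squashed state times output gate
    return z, c
```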
An LSTM for a control task, with aligned input and output. Input is
processed by a stack of feedforward layers and a single LSTM layer.
Assume that we have one output sample for each input step, but dependencies between inputs and outputs occur in arbitrary order (an output may also depend on future inputs).
Thus each output sample depends on the entire sequence, not just on the past frames.
One solution: The bidirectional LSTM (layer).
The input is fed into two (identical) parallel LSTMs, once in forward order, once in backward order.
The output is created by stepwise concatenation of the outputs of the two LSTMs.
Thus at each step, the output reflects the entire sequence.
Easy-to-implement, standard architecture.
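A minimal sketch of this setup, assuming PyTorch's nn.LSTM with bidirectional=True (sizes are made up); at every step the forward and backward outputs are concatenated.

```python
import torch
import torch.nn as nn

D, H, T = 4, 8, 10                          # illustrative sizes
x = torch.randn(T, 1, D)                    # (time, batch, features)

bilstm = nn.LSTM(D, H, bidirectional=True)  # forward and backward LSTM in parallel
out, _ = bilstm(x)

# per step, the forward and backward outputs are concatenated,
# so every output already reflects the entire sequence
print(out.shape)                            # torch.Size([10, 1, 16])
```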
LSTM Setups
Time runs from left to right. The model reads the input tokens “ABC” and produces the output tokens “WXYZ”, which are re-used as
inputs in the next step. Image modified from Sutskever et al., Sequence to Sequence Learning with Neural Networks, NIPS 2014.
The encoder and the decoder are usually two separate LSTMs.
Information is transferred by passing the state of the LSTM from the encoder to the decoder (and the error is backpropagated along the same path).
The entire input sequence is converted to a fixed-size state vector (between encoder and decoder).
Architectural variations (e.g. multiple LSTM layers) are possible and are used in practice.
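A minimal sketch of the state transfer, using two separate PyTorch LSTMs with made-up sizes; the encoder's final state (h, c) initializes the decoder, and errors backpropagate through it into the encoder.

```python
import torch
import torch.nn as nn

D_in, D_out, H = 5, 7, 16                   # illustrative sizes
encoder = nn.LSTM(D_in, H)
decoder = nn.LSTM(D_out, H)                 # a separate LSTM with its own weights

src = torch.randn(3, 1, D_in)               # e.g. "ABC": 3 input steps, batch of 1
tgt = torch.randn(4, 1, D_out)              # e.g. "WXYZ": 4 output steps

_, (h, c) = encoder(src)                    # entire input compressed into a fixed-size state
out, _ = decoder(tgt, (h, c))               # decoder starts from the encoder state
print(out.shape)                            # torch.Size([4, 1, 16])
```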
In the decoding part, the outputs of the previous step are fed back
to the network.
Among other advantages, this autoregressive setup makes it possible to explore several hypotheses.
During (supervised) training, one usually feeds the target symbol, even if the previous step produced a different hypothesized symbol (teacher forcing). One can occasionally feed hypothesized symbols to improve generalization. Training errors are backpropagated from all steps of the decoder part.
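A sketch of a decoder training step with (probabilistic) teacher forcing. Everything here is illustrative: the vocabulary size, the BOS token id, and the choice of an LSTMCell decoder are assumptions, not the lecture's exact setup.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

V, E, H = 20, 8, 16                          # vocabulary, embedding, hidden sizes (made up)
BOS = 0                                      # assumed start-of-sequence token id
embed = nn.Embedding(V, E)
cell = nn.LSTMCell(E, H)
readout = nn.Linear(H, V)

def decoder_training_step(h, c, targets, teacher_forcing_p=0.9):
    """h, c: (1, H) state from the encoder; targets: 1D LongTensor of token ids."""
    loss = 0.0
    prev = torch.tensor([BOS])
    for gold in targets:
        h, c = cell(embed(prev), (h, c))
        logits = readout(h)                              # (1, V)
        loss = loss + F.cross_entropy(logits, gold.view(1))
        if random.random() < teacher_forcing_p:
            prev = gold.view(1)                          # feed the target symbol (teacher forcing)
        else:
            prev = logits.argmax(dim=-1)                 # occasionally feed the own hypothesis
    return loss                                          # backpropagated through all decoder steps
```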
During testing, one uses the hypothesis from the previous step.
In the simplest case, we compute exactly one hypothesis sequence (greedy decoding). Since each hypothesis is fed back into the network, one small error can cause a whole chain of subsequent errors.
One can partially mitigate this problem by using Beam Search, where several hypotheses are kept in each step.
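A generic beam search sketch. The function step_log_probs is a placeholder for the model: given the tokens decoded so far, it is assumed to return a mapping from each possible next token to its log-probability.

```python
def beam_search(step_log_probs, beam_width, max_len, bos, eos):
    beams = [([bos], 0.0)]                        # (prefix, accumulated log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos:                 # finished hypotheses are carried over unchanged
                candidates.append((prefix, score))
                continue
            for token, logp in step_log_probs(prefix).items():
                candidates.append((prefix + [token], score + logp))
        # keep only the beam_width best hypotheses;
        # beam_width = 1 degenerates to greedy decoding
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams
```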
We use speech recognition as an example. Our goal is to get the most probable sequence of words, given a dictionary of possible pronunciations.
We construct a prefix tree of phone sequences corresponding to all
possible words.
Beam Search is a time-synchronous search method: We completely
process one time frame before moving to the next one.
Accumulated probabilities of prefixes of words (hypotheses) are
saved in the nodes of the prefix tree.
A part of a prefix tree. Green nodes are final; they correspond to words. Can you see which ones?
Initialization with a probability of 1.0, since there is just one start node.
(You may also skip the start node and have several disconnected trees, one for
each possible phone which may start a word. It is an implementation detail.)
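A minimal sketch of such a prefix tree; the node layout (children keyed by phone, a word label on final nodes, an accumulated probability per node) is one plausible implementation, not the lecture's exact code.

```python
class PrefixTreeNode:
    def __init__(self):
        self.children = {}        # phone -> next node
        self.word = None          # set on final nodes: the word this prefix spells
        self.prob = 0.0           # accumulated probability of this hypothesis

def build_prefix_tree(lexicon):
    """lexicon maps a word to its pronunciation, e.g. {"dog": ["d", "o", "g"]}."""
    root = PrefixTreeNode()
    root.prob = 1.0               # initialization: just one start node
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node.children.setdefault(phone, PrefixTreeNode())
        node.word = word          # final node corresponds to a complete word
    return root
```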
Problem: Beam search may miss optimal results. For example, your speech recognizer might output the hypothesis DOCK BUYS MAN.
Only at the end of the phrase may it become clear (perhaps even from the external language model) that DOG BITES MAN was much more probable.
Thus you should not be too restrictive when running this algorithm
(keep the “beam width” wide enough).
Still, in practice beam search works very well; it is used in many contexts and with many recognition backends.
Idea: Consider the encoded input (a fixed-size vector for each input
sample).
Compute a weighted sum of the encoded input vectors. The weights
depend on the input data and on the decoding process (we will soon
see how).
This annotation vector is recomputed and fed into the decoder in
each timestep.
The decoder attends to specific areas of the input, namely to those
areas with large weights.
How to compute the attention weights? Note that they are different
from the weights which connect neurons in the NN, because they
depend on the data!
Original idea (“Bahdanau Attention”): predict the weights from the
state of the decoder, using a feedforward neural network.
For decoding step i, estimate the alignment of the decoder state s_{i−1} and each annotation h_j with a feedforward NN:
e_{ij} = a(s_{i−1}, h_j)
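A NumPy sketch of this computation for one decoding step: alignment scores from a small feedforward net, softmax normalization into attention weights, and the weighted sum of the annotations. The additive form v·tanh(W_s s + W_h h) follows Bahdanau et al.; the parameter names are illustrative.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def bahdanau_attention(s_prev, H, W_s, W_h, v):
    """s_prev: decoder state s_{i-1}, shape (S,); H: annotations h_1..h_T, shape (T, A)."""
    # e_ij = a(s_{i-1}, h_j): one alignment score per input position j
    scores = np.array([v @ np.tanh(W_s @ s_prev + W_h @ h_j) for h_j in H])
    alphas = softmax(scores)                  # attention weights (sum to 1)
    context = alphas @ H                      # annotation vector: weighted sum of the h_j
    return context, alphas
```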
Source: Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate
From Xu et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
e_{i,2k} = sin(i / 10000^{2k/D})
e_{i,2k+1} = cos(i / 10000^{2k/D})
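A direct NumPy transcription of these formulas (assuming an even model dimension D); this is only a sketch to make the indexing explicit.

```python
import numpy as np

def positional_encoding(num_positions, D):
    """Sinusoidal positional encodings e_i in R^D, D assumed even."""
    E = np.zeros((num_positions, D))
    for i in range(num_positions):
        for k in range(D // 2):
            angle = i / 10000 ** (2 * k / D)
            E[i, 2 * k] = np.sin(angle)       # e_{i,2k}
            E[i, 2 * k + 1] = np.cos(angle)   # e_{i,2k+1}
    return E
```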