
Neural Networks - Sequence Modeling and Advanced Architectures


Machine Learning
Michael Wand
TA: Vincent Herrmann

{michael.wand,vincent.herrmann}@idsia.ch

Dalle Molle Institute for Artificial Intelligence Studies (IDSIA) USI - SUPSI

Fall Semester 2024


Contents

Recurrent Neural Networks


The LSTM
Recurrent Network Setups: Modeling Aligned and Unaligned
Sequences
Attention Mechanisms
Advanced Topic: The Transformer Architecture

Neural Networks - Sequence Modeling and Advanced Architectures 2


Recurrent Neural Networks
Introduction (1)

So far, we have looked at mapping samples independently.


Now we look at sequential inputs where the output y can depend on
more than just the immediate input:

yt = f (xt , . . . , x1 )

where t is a time parameter.


In order to achieve this, the neural network holds a state, which
changes when a new input arrives:

st+1 = g (st , xt )

State provides memory, which in RNNs is implemented by feedback or recurrent connections: The state at time step t is the output at time step t − 1.

Neural Networks - Sequence Modeling and Advanced Architectures 4


Introduction (2)

Another way to introduce recurrent neural networks:


So far, we have looked at mappings with fixed-size inputs.
There are many situations where the input has varying size.
We concentrate on the case of sequential input.¹
A neural network cannot process such an input sequence all at once,
but it can process it sequentially; in order to do so, it needs to retain
state between inputs.
State provides memory, which in RNNs is implemented by feedback
or recurrent connections: The state at time step t is the output at
time step t − 1.

¹ If you want to apply a recurrent neural network to images, have a look at Stollenga et al., Parallel Multi-Dimensional LSTM, With Application to Fast Biomedical Volumetric Image Segmentation, Proc. NIPS 2015
Neural Networks - Sequence Modeling and Advanced Architectures 5
Sequence Modeling

Sequential data occurs in a variety of situations:


: Speech recognition and natural language processing
: Video analysis
: Text generation
: DNA analysis
: In reinforcement learning, short-term memory can be essential for
determining the state of the world
For now assume sequential data with one target value at each input
timestep, i.e.

x = (x1 , . . . , xT ) with xt ∈ R^D
o = (o1 , . . . , oT ) with ot ∈ R^K .

(On this slide, we use the symbol o for the target, in order to
distinguish it from the time index t.)
The length of the sequences can vary between samples.

Neural Networks - Sequence Modeling and Advanced Architectures 6


Recurrent Layer

Remember the standard feedforward fully-connected layer:

z(ℓ) = f(W(ℓ) z(ℓ−1) + b(ℓ)).

Now define a Recurrent Layer as a building block of a neural network.
The output at time step t depends on the input from the previous
layer at time t and on the layer state at time t − 1, i.e.

z(ℓ)(t) = f(W(ℓ) z(ℓ−1)(t) + V(ℓ) z(ℓ)(t − 1) + b(ℓ))

with suitable weight matrices W(ℓ) and V(ℓ) , bias b(ℓ) , and
nonlinearity f .
Usually, the state z(ℓ)(0), ℓ ∈ {1, . . . , L}, is initialized to zeros.
Storage requirement of the recurrent layer (for forward propagation):
one extra set of layer activations; temporal requirement: as many
computation steps as the sequence has elements.
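
As an illustration (not code from the lecture), a minimal sketch of the forward pass of one recurrent layer over a whole sequence; the shapes and names are assumptions.

import numpy as np

def recurrent_layer_forward(x_seq, W, V, b, f=np.tanh):
    """x_seq: (T, D_in); W: (D_out, D_in); V: (D_out, D_out); b: (D_out,)."""
    z = np.zeros(V.shape[0])              # state z(0) initialized to zeros
    outputs = []
    for x_t in x_seq:
        z = f(W @ x_t + V @ z + b)        # z(t) = f(W x(t) + V z(t-1) + b)
        outputs.append(z)
    return np.stack(outputs)              # one activation vector per time step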
Neural Networks - Sequence Modeling and Advanced Architectures 7
Recurrent Network Setup

A typical setup is to have a stack of feedforward layers (fully connected or convolutional), after which one or several recurrent layers are placed.
The final output is usually computed by a feedforward step on the
output of the last recurrent layer, followed by a standard loss
function (MSE, cross-entropy).
Depending on the task, we might consider all the outputs, or only
the output at the last timestep. (Later on we will get to know more
complex input/output setups.)

Neural Networks - Sequence Modeling and Advanced Architectures 8


Visual: Recurrent Neural Network

[Figure: a network with input x = (x1 , . . . , xD ) forming layer z(0), a hidden feedforward layer z(1) (weights W(1)), a hidden recurrent layer z(2) (weights W(2), recurrent weights V(2), drawn in blue), and output y = (y1 , . . . , yK ) (weights W(3)).]
The figure shows a neural network with one hidden feedforward layer and one
hidden recurrent layer, followed by a feedforward output layer. Remember that
the blue connections incur a delay of one time step; this makes the
computation well-defined.
Neural Networks - Sequence Modeling and Advanced Architectures 9
Visual: Recurrent Neural Network

[Figure: layer-level view x → z(0) → z(1) → z(2) → y, with weight matrices W(1), W(2)/V(2), W(3).]

As always, we usually reason about the network by grouping neurons to layers.

Neural Networks - Sequence Modeling and Advanced Architectures 10


RNNs as Universal Approximators

All the models we have talked about before today can only
implement static input-output mappings.
Still, feedforward neural networks are very powerful: A sufficiently
wide net can approximate any continuous function!
Recurrent nets are dynamical systems: A recurrent network can
implement any algorithm (given enough storage size).
In practice, it is easier to learn some algorithms than others...
We will now see how to extend the backpropagation algorithm which
we got to know in the last lecture to deal with recurrence.

Neural Networks - Sequence Modeling and Advanced Architectures 11


Backpropagation Through Time

Remember the way we decomposed the gradient of the error w.r.t. a weight matrix as a product of gradients across the network layers?

∂E/∂w_ij(ℓ) = ∂E/∂z(L) · ∂z(L)/∂z(L−1) · · · ∂z(ℓ+1)/∂z(ℓ) · ∂z(ℓ)/∂w_ij(ℓ)

The easiest way to explain BPTT is to imagine the recurrent network as a very deep feedforward network by unrolling it in time.

[Figure: the network x → z(1) → z(2) → y with the recurrent layer z(2) unrolled over time steps 1, 2, . . . , t; the unrolled copies z(2)(1), z(2)(2), . . . , z(2)(t) share the same weights.]

Neural Networks - Sequence Modeling and Advanced Architectures 12




Backpropagation Through Time

We now derive the weight updates as in the feedforward case, taking into account that weights are shared between the time steps, and that there are multiple paths through the network.
This means that the gradients must be added over timesteps, for
both recurrent and feedforward layers: The error at time step t
causes gradients at time steps 1, . . . , t.
Also remember that (in the general case) an error signal is backpropagated from the output layer at each time step. That means that, simultaneously,
: output at time step 1 causes gradients at time step 1
: output at time step 2 causes gradients at time steps 1 and 2
: ...
: output at time step t causes gradients at time steps 1, . . . , t.
: That sounds complicated, but it is really just a sum of relevant
gradients. We have already done all required work.
The error at time step t is backpropagated through time.
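
To make the summation of gradients over time steps concrete, here is a minimal BPTT sketch (not from the slides) for a scalar RNN s(t) = tanh(w·x(t) + v·s(t−1)) with a squared-error loss only at the last step; all choices are illustrative.

import numpy as np

def bptt_scalar(x, target, w, v):
    """Returns (dL/dw, dL/dv) for L = 0.5 * (s(T) - target)**2."""
    s, pre = [0.0], []                    # s[0] is the initial state
    for x_t in x:                         # forward pass, store activations
        pre.append(w * x_t + v * s[-1])
        s.append(np.tanh(pre[-1]))
    dL_ds = s[-1] - target                # dL/ds(T)
    dw = dv = 0.0
    for t in reversed(range(len(x))):     # backward pass through time
        da = dL_ds * (1.0 - np.tanh(pre[t]) ** 2)
        dw += da * x[t]                   # gradients are summed over time steps
        dv += da * s[t]                   # s[t] is the state at step t-1 here
        dL_ds = da * v                    # propagate the error one step back
    return dw, dv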

Neural Networks - Sequence Modeling and Advanced Architectures 15


The Vanishing Gradient Problem

Although RNNs can represent arbitrary sequential behavior, training suffers:
: once the output depends on some input more than around 10 time-steps in the past, the network becomes very difficult to train.
Why? We apply the chain rule by multiplying partial gradients over
time steps.
The absolute value of these gradients shrinks exponentially (or it
explodes, which is not good either).
Thus the error gradient becomes very small: weights cannot be adjusted to respond to events far in the past. This is the Vanishing Gradient Problem.
Even simple tasks cannot properly be solved by the RNN if the
relevant information covers a long time (example: check whether
parentheses in a string are balanced).

Neural Networks - Sequence Modeling and Advanced Architectures 16


Long Short Term Memory (LSTM)
LSTM Cell

An LSTM cell can be imagined as a memory cell with a state S that is controlled by 3 gates:
: the input gate Gi controls whether
the state is updated with external
input
: the output gate Go controls whether
the state is visible to the outside
: the forget gate Gf allows the state to be reset.
The inputs (for the state update and for all three gates) are the output of the LSTM layer at the previous timestep and the output of the preceding layer at the current timestep.
References:
Hochreiter & Schmidhuber: Long Short-Term Memory. Neural Computation 9, 1997.
Gers, Schmidhuber, Cummins: Learning to Forget: Continual Prediction with LSTM. Neural
Computation 12, 2000.

Neural Networks - Sequence Modeling and Advanced Architectures 18


LSTM Cell

Assume z(ℓ−1) (t) is the output of the preceding layer at the current
timestep, and z(ℓ) (t − 1) is the output of the LSTM layer at the
previous time step.
These are the input for the LSTM cell, whose precise behavior is as
follows.
Each gate yields a value between 0 and 1 (using sigmoid nonlinearities). The gates are trainable; each has its own set of weights which connect to the input vectors:

GX = σ(WX z(ℓ−1)(t) + VX z(ℓ)(t − 1) + bX),

where X stands for any of the gates ‘I’, ‘O’, and ‘F’.
Occasionally, gates are allowed to access the cell state (peephole
connections).

Neural Networks - Sequence Modeling and Advanced Architectures 19


LSTM Cell

The input vector is likewise processed:

s̃(t) = f(W z(ℓ−1)(t) + V z(ℓ)(t − 1) + b),

where f is normally a tanh nonlinearity.


At time t, the new state becomes

s(t) = GF (t) · s(t − 1) + GI (t) · s̃(t).

Finally, the output z (ℓ) (t) is computed from the cell state by passing
it through another nonlinearity (usually tanh), and multiplying it
with the output gate:

z (ℓ) (t) = GO (t) · f (s(t)).
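
As a concrete illustration (not code from the lecture), here is a minimal sketch of one LSTM forward step implementing the gate, state, and output equations above; the parameter names and shapes are assumptions.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, z_prev, s_prev, p):
    """One LSTM time step; p holds weight matrices W*, V* and biases b*
    for the three gates (I, O, F) and for the candidate state (names assumed)."""
    G_i = sigmoid(p["W_i"] @ x_t + p["V_i"] @ z_prev + p["b_i"])   # input gate
    G_o = sigmoid(p["W_o"] @ x_t + p["V_o"] @ z_prev + p["b_o"])   # output gate
    G_f = sigmoid(p["W_f"] @ x_t + p["V_f"] @ z_prev + p["b_f"])   # forget gate
    s_tilde = np.tanh(p["W"] @ x_t + p["V"] @ z_prev + p["b"])     # candidate state
    s_t = G_f * s_prev + G_i * s_tilde                             # state update
    z_t = G_o * np.tanh(s_t)                                       # layer output
    return z_t, s_t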

Neural Networks - Sequence Modeling and Advanced Architectures 20


LSTM Cell

Compare the dynamics of the LSTM and the RNN.


: Notation: yellow blocks are fully connected layers, red are operations.
: Time flows from left to right.
Left panel - RNN: the output from the previous step z(ℓ)(t − 1) and the input from the current step z(ℓ−1)(t) are concatenated, and the new output z(ℓ)(t) is computed.
Right panel - LSTM: The upper line is the state; can you see how it can remain unchanged over many time steps?
f , i, o stand for the forget, input, and output gates.

Images modified after colah.github.io/posts/2015-08-Understanding-LSTMs

Neural Networks - Sequence Modeling and Advanced Architectures 21


LSTM Layer

An LSTM layer consists of a number of LSTM cells (which are all connected, i.e. the behavior of a cell zm(ℓ) at time t depends on all the other cells at time t − 1).
It is a neural network building block very much like a recurrent layer
(just replace each recurrent neuron with an LSTM cell).
But the behavior is very different:
: The trainable gates control the flow of information, and also make it
easy to store information over longer periods of time!
: During backpropagation, this means that the error signal is likewise
propagated over many time steps.
: By contrast, in standard RNNs, the state is completely recomputed
in every time step, which also causes the error signal to suffer
degradation during backpropagation.
The whole architecture can be trained end-to-end.
This remarkably powerful architecture solves the vanishing gradient
problem and makes it possible to solve many challenging
sequence-related tasks!
Neural Networks - Sequence Modeling and Advanced Architectures 22
Modeling Aligned and Unaligned Sequences
LSTM Setups

(Everything which is covered in this section technically also works for standard RNNs, but will usually not work in practice due to the vanishing gradient problem.)
So far, we have assumed that the training data is aligned: There is
one output target for each input step.
Thus, the LSTM receives one error signal per input step, which is
backpropagated through time.
Sometimes, one can force the output to have the desired length (e.g.
by padding).

Neural Networks - Sequence Modeling and Advanced Architectures 24


LSTM Setups

Example – Control task: At each input, one wants to output a new control signal, which takes the current input and an estimate of the system state into account.

An LSTM for a control task, with aligned input and output. Input is
processed by a stack of feedforward layers and a single LSTM layer.

Neural Networks - Sequence Modeling and Advanced Architectures 25


LSTM Setups

Another simple setup: There is a single target per sequence.


In such a case, one only considers the LSTM output at the end of
each sequence, and likewise, errors are backpropagated only from
the last element.
: (A practical implementation would use some kind of mask to control
error backpropagation; modern frameworks have this functionality
built in.)
Example: Word-based speech recognition (see image).

Michael’s work: Word-based Lipreading with LSTMs. Only the output at the last step is valid. Images are processed by a stack of feedforward layers and a single LSTM layer.

Neural Networks - Sequence Modeling and Advanced Architectures 26


Bidirectional LSTM

Assume that we have one output sample for each input step, but the dependencies between inputs and outputs can point in either temporal direction.
Thus each output sample depends on the entire sequence, not just on the past frames.
One solution: The bidirectional LSTM (layer).
The input is fed into two (identical)
parallel LSTMs, once in forward order,
once in backward order.
The output is created by stepwise
concatenation of the outputs of the
two LSTMs.
Thus at each step, the output reflects
the entire sequence.
Easy-to-implement, standard
architecture.
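
A minimal sketch of the stepwise concatenation (run_lstm_fw and run_lstm_bw are hypothetical functions, each returning one output vector per input step as a NumPy array):

import numpy as np

def bidirectional(run_lstm_fw, run_lstm_bw, x_seq):
    out_fw = run_lstm_fw(x_seq)                      # (T, H), forward in time
    out_bw = run_lstm_bw(x_seq[::-1])[::-1]          # (T, H), backward, re-reversed
    return np.concatenate([out_fw, out_bw], axis=1)  # (T, 2H): stepwise concatenation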
Neural Networks - Sequence Modeling and Advanced Architectures 27
LSTM Setups

The class of sequence-to-sequence tasks is however much larger.


There are many cases in which the lengths of the input and output sequences do not match.
Even if they do, inputs and outputs may not be aligned, e.g. a
relevant output may appear before a relevant input (think about
translating English to German).
Occasionally, you may even wish to generate sequential output from
a single input sample (e.g. image description).

taken from karpathy.github.io

Neural Networks - Sequence Modeling and Advanced Architectures 28


Sequence Modeling with CTC

What if input and output sequences have different length?


Connectionist Temporal Classification (CTC, Graves et al. 2006)
was the first method to tackle this problem.
Task Setup: A set of input sequences {xn }n and target sequences
{tn }n , where for each n, len(tn ) < len(xn ).
Assume an orderedness property: Features which correspond to
targets appear in the same order as the targets.
Example: In speech recognition, we have an input sequence of
frequency vectors (spectrogram), which must be converted into a
sequence of phones (speech sounds).

The spectrogram usually has 100 samples per second, while a phone lasts between 30 and 100 ms ⇒ far fewer phones than input samples.
Neural Networks - Sequence Modeling and Advanced Architectures 29
Sequence Modeling with CTC - Training

The CTC solution is as follows:


: Define a blank symbol (–) which is added to the set of possible
output phones (as in other classification tasks, use a one-hot
encoding).
: During training, consider all paths which correspond to the given label. A path is converted into a label by first merging repeated phones and then removing all blanks (see the sketch below).
: For example, the following paths:
H EH EH L OU, H – EH – – – – L L – – OU –, – – H – EH – – – L OU – –
all correspond to the label “H EH L OU” (hello).
: Note that blanks are necessary to output the same symbol repeatedly.
: Obviously, this leads to a huge number of possible outputs whose
errors must be computed. Fortunately, the computation can be
efficiently performed by a Dynamic Programming approach.
: The Dynamic Programming is integrated into the CTC loss function, so the whole network can be trained end-to-end.
If you do not know Dynamic Programming, we will cover a more general version (Beam Search) in
a few slides.
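
A minimal sketch of the collapse rule described above; the phone strings and the "-" blank symbol are illustrative.

from itertools import groupby

def collapse_ctc_path(path, blank="-"):
    deduped = [sym for sym, _ in groupby(path)]      # merge repeated symbols
    return [sym for sym in deduped if sym != blank]  # then drop the blanks

# The three example paths from the slide all map to the same label:
p1 = "H EH EH L OU".split()
p2 = "H - EH - - - - L L - - OU -".split()
p3 = "- - H - EH - - - L OU - -".split()
assert collapse_ctc_path(p1) == collapse_ctc_path(p2) == collapse_ctc_path(p3)
print(collapse_ctc_path(p1))                         # ['H', 'EH', 'L', 'OU']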

Neural Networks - Sequence Modeling and Advanced Architectures 30


Sequence Modeling with CTC - Decoding

The figure shows the output of CTC decoding on an audio clip which contains the words “THE SOUND OF”.
Note that the output is probabilistic, as for other classifiers: In each
time step, one gets a vector of probabilities for the possible output
symbols (phones + blank).
The result is convincing: The LSTM learns to output a series of
single phones, padded by blank symbols (dashed line).
For recognition, we can simply take the symbol with maximal
probability at each frame (greedy decoding).
Or we can incorporate constraints (e.g. a dictionary which contains
possible pronunciations) by using Beam Search.

CTC output, from Graves et al. 2006.


Neural Networks - Sequence Modeling and Advanced Architectures 31
Encoder-Decoder Networks

CTC makes several assumptions: not more targets than inputs, orderedness, outputs are statistically independent (given the internal state of the network).
We look at a simple neural network setup to deal with arbitrary
sequence-to-sequence tasks. We make no assumptions about
sequence lengths or orderedness.
Idea: First read the entire source sequence (encoder part), then
output the entire target sequence (decoder part).
Special tokens (SOS, EOS) indicate the sequence start and end.

Time runs from left to right. The model reads the input tokens “ABC” and produces the output tokens “WXYZ”, which are re-used as
inputs in the next step. Image modified from Sutskever et al., Sequence to Sequence Learning with Neural Networks, NIPS 2014.

Neural Networks - Sequence Modeling and Advanced Architectures 32


Encoder-Decoder Networks

The encoder and the decoder are usually two separate LSTMs.
Information is transferred by passing the state of the LSTM from the encoder to the decoder (and the error is backpropagated along the same path).
The entire input sequence is converted to a fixed-size state vector
(between Encoder and Decoder).
Architectural variations (e.g. multiple LSTM layers) are possible and are used in practice.


Neural Networks - Sequence Modeling and Advanced Architectures 33


Encoder-Decoder Networks

In the decoding part, the outputs of the previous step are fed back
to the network.
: Among other advantages, this autoregressive setup makes it possible to explore several hypotheses.
During (supervised) training, one usually feeds the target symbol,
even if the previous step output another hypothesized symbol
(teacher forcing). One can occasionally feed hypothesized symbols
to improve generalization. Training errors are backpropagated from
all steps of the decoder part.
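
A minimal sketch of the two feedback modes; decoder_step is a hypothetical function returning the new state and a probability vector over output symbols, and the SOS/EOS token indices are illustrative.

import numpy as np

def decode_teacher_forced(decoder_step, state, target_tokens, sos=0):
    """Training mode: feed the ground-truth previous token at every step."""
    loss, prev = 0.0, sos
    for y in target_tokens:
        state, probs = decoder_step(state, prev)
        loss += -np.log(probs[y])          # cross-entropy at this step
        prev = y                           # teacher forcing: feed the target
    return loss

def decode_greedy(decoder_step, state, max_len, sos=0, eos=1):
    """Test mode: feed back the model's own previous hypothesis."""
    out, prev = [], sos
    for _ in range(max_len):
        state, probs = decoder_step(state, prev)
        prev = int(np.argmax(probs))       # greedy choice
        if prev == eos:
            break
        out.append(prev)
    return out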


Neural Networks - Sequence Modeling and Advanced Architectures 34


Encoder-Decoder Networks

During testing, one uses the hypothesis from the previous step.
: In the simplest case, we compute exactly one hypothesis sequence
(greedy decoding). Since each hypothesis is fed back into the
network, this means that one small error can cause a whole chain of
subsequent errors.
: One can partially mitigate this problem by using Beam Search, where
several hypotheses are kept in each step.


Neural Networks - Sequence Modeling and Advanced Architectures 35


Encoder-Decoder Networks

As the name indicates, the entire source sequence must be encoded in a fixed-size vector (the state of the underlying LSTM at the EOS).
It is not obvious that such a representation is very robust. Indeed, the authors suggest reversing the input sentences in order to have fewer long-term dependencies (even though the LSTM should handle them. . . )
Still, this paves the way to embeddings, a very important concept in
contemporary neural network modeling.

Neural Networks - Sequence Modeling and Advanced Architectures 36


Embeddings

Embedding: Arbitrary-sized, possibly categorical inputs (like words, or sequences of words) are mapped to real-valued vectors.
The embedding is part of a neural network which is trained
end-to-end (or it can be pretrained for later use in a variety of ways).
It turns out² that these representations have amazing regularities; for example, the relation

ϕ(king) − ϕ(man) + ϕ(woman) ≈ ϕ(queen)

holds in the usual vector-space sense.


The encoder part of an encoder-decoder network computes an
embedding of the input sentence.

² Tomas Mikolov et al., Linguistic Regularities in Continuous Space Word Representations. Proc. NAACL, 2013


Neural Networks - Sequence Modeling and Advanced Architectures 37
Beam Search

Greedy decoding may not be optimal when the elements of the output sequence are not conditionally independent.
This is the case for Encoder-Decoder networks, since output hypotheses are fed back into the network.
Greedy decoding is also suboptimal when the output is constrained: For example, in
(conventional) speech recognition, one has a dictionary of possible
words (and their pronunciation) and a language model which gives
probabilities for word sequences. One would ideally like to
incorporate these extra probabilistic constraints into the recognition
result.
This problem occurs in a variety of tasks, and for many underlying
models: Encoder-Decoder, CTC, HMM (Hidden Markov Model), . . . !

Neural Networks - Sequence Modeling and Advanced Architectures 38


Beam Search

Purely theoretically, one could consider all possible output sequences (up to some maximal length), compute their probabilities, and then take the one with maximum probability.
But there is no way to compute all these probabilities; the number of computations would be enormous:
: assume M output symbols, T time steps ⇒ M^T possible paths
: Speech recognition example: 26 letters, max sequence length 100 ⇒ 26^100 possible combinations (there are about 10^80 atoms in the universe)
Standard method to solve this problem: Beam Search.
: Applicable in many cases (not only for neural networks)!
: Occasionally the setup is a bit different; we cover the specific case of encoder-decoder network output.

Neural Networks - Sequence Modeling and Advanced Architectures 39


Beam Search

We use speech recognition as an example. Our goal is to get the most probable sequence of words, given a dictionary of possible pronunciations.
We construct a prefix tree of phone sequences corresponding to all
possible words.
Beam Search is a time-synchronous search method: We completely
process one time frame before moving to the next one.
Accumulated probabilities of prefixes of words (hypotheses) are
saved in the nodes of the prefix tree.

Neural Networks - Sequence Modeling and Advanced Architectures 40


Beam Search

Assume you have a sequence of softmax outputs. Derive the most probable sequence of words. This is the step-by-step algorithm:
Initialize a single hypothesis with probability 1.0 at the start node of the prefix tree.
At each step:
: Take all current hypotheses
: Propagate each of them to all possible successor nodes by
multiplying with corresponding probability from softmax outputs at
current timestep
: This requires performing the decoding several (many) times, feeding
different hypotheses.
: Repeat for each timestep
: Result: the hypothesis with maximal probability among all final nodes of the tree.
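
A minimal beam search sketch (simplified: hypotheses are plain symbol sequences rather than prefix-tree nodes); probs[t][s] is the softmax output for symbol s at step t, and the beam width is illustrative.

import numpy as np

def beam_search(probs, beam_width=3):
    beams = [((), 0.0)]                              # (prefix, log-probability)
    for step_probs in probs:
        candidates = []
        for prefix, logp in beams:
            for sym, p in enumerate(step_probs):     # propagate every hypothesis
                candidates.append((prefix + (sym,), logp + np.log(p + 1e-12)))
        # keep only the beam_width most probable hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]                                  # best hypothesis and its score

# Example with 3 time steps and 2 symbols:
probs = np.array([[0.6, 0.4], [0.5, 0.5], [0.1, 0.9]])
print(beam_search(probs, beam_width=2))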

Neural Networks - Sequence Modeling and Advanced Architectures 41


Beam Search

A part of a prefix tree. Green nodes are final; they correspond to words. Can you see which ones?

ARE, ART, ARTIST, ANT, AN, AND, . . .

Neural Networks - Sequence Modeling and Advanced Architectures 42


Beam Search

Initialization with a probability of 1.0, since there is just one start node.
(You may also skip the start node and have several disconnected trees, one for
each possible phone which may start a word. It is an implementation detail.)

Neural Networks - Sequence Modeling and Advanced Architectures 43


Beam Search

First propagation step! We need to consider the output phone probabilities; if we have an encoder-decoder network, they are computed by the final softmax layer of the decoder.
Phone A AE B ...
Prob. 0.4 0.06 0.34 ...

Neural Networks - Sequence Modeling and Advanced Architectures 44


Beam Search

Next propagation step!


History A A AE ...
Phone R N N ...
Prob. 0.2 0.4 0.5 ...
Probabilities are accumulated along the tree.

Neural Networks - Sequence Modeling and Advanced Architectures 45


Beam Search

In practical implementations, one usually computes in the logarithmic domain.
Note that in the case of an encoder-decoder network, we always
have a transition to the next node (exception: final node).
In other setups, you may have transitions to the same node (for
example, if you have CTC output, a blank symbol causes transition
to the same node).
In this case, if we have several ways to get to a node (e.g. you can
get to the first ‘A’ by having blank + A or A + blank), you usually
retain the hypothesis with maximum probability.

Neural Networks - Sequence Modeling and Advanced Architectures 46


Beam Search

A hypothesis in a final node can be propagated to the same node, using the probability of the EOS symbol (once). Then this hypothesis is not changed any more.
The search is finished when all frames of the softmax output sequence are exhausted; then take the hypothesis with maximum probability among all final nodes as the result!

Neural Networks - Sequence Modeling and Advanced Architectures 47


Beam Search

We have skipped the following details:


In the specific case of speech recognition, you want to recognize a
sequence of words, so you propagate from each final node to the
start node of the tree. You may have to incorporate linguistic
probabilities, and you get several hypotheses per node (with different
word histories).
This means that you also have to propagate several hypotheses from
a node.
Occasionally, one merges tails of words in the tree (then it is not a
tree any more), mostly due to speed considerations.
In general, this style of search is easy to implement, but if you want
it to be fast, there are a lot of details to consider. . .

Neural Networks - Sequence Modeling and Advanced Architectures 48


Beam Search

In practical cases, it is still necessary to limit the number of hypotheses during the search.
This can be done by simply
: retaining a fixed number of most probable ones
: or by retaining a fixed number of hypotheses per node
: or by other heuristic/adaptive criteria
: or by all of the above.
Imagine that you search for solutions with a flashlight: You look in
the limited space which is lit by the beam of your lamp. The
position of the beam depends on where you currently are; it could also become wider or smaller depending on the circumstances.

Neural Networks - Sequence Modeling and Advanced Architectures 49


Beam Search

Problem: Beam search may miss the optimal result. For example, your speech recognizer might output the hypothesis DOCK BUYS MAN.
Only at the end of the phrase may it become clear (maybe even from the external language model) that DOG BITES MAN was much more probable.
Thus you should not be too restrictive when running this algorithm
(keep the “beam width” wide enough).
Still, in practice beam search works very well, it is used in many
contexts and with many recognition backends.

Neural Networks - Sequence Modeling and Advanced Architectures 50


Attention Mechanisms
Introducing Attention

Remember one of the reasons why we introduced recurrent networks?
: A feedforward network cannot process variable-length input.
One can, however, process parts of the input at a time.
Attention: The neural network selectively focuses on parts of the
input (often after some processing) which are important for the
current output step.
The current focus can be computed in a variety of ways.
The idea dates back to Jürgen Schmidhuber’s work in the 1990s³; the modern formulation was presented by Bahdanau and colleagues⁴.

³ Schmidhuber/Huber, Learning to Generate Artificial Fovea Trajectories for Target Detection, International Journal of Neural Systems 2(1&2), 1991
⁴ Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015


Neural Networks - Sequence Modeling and Advanced Architectures 52


Basic Attention Mechanism

Reconsider the encoder/decoder architecture. In the example, the input consists of the tokens A, B, C.
How can we feed information from the encoder to the decoder?
We assume that the relevance of each input sample varies: for each
created output token, we would like to focus on different input
tokens.
The alignment between input and output samples need not be
monotonic.

Neural Networks - Sequence Modeling and Advanced Architectures 55




Basic Attention Mechanism

Idea: Consider the encoded input, which is represented by a fixed-size vector for each input sample. The encoded input is also called the annotation.
Compute a weighted sum of the encoded input vectors. The weights
depend on the input data and on the decoding process (we will soon
see how).
This context vector is recomputed and fed into the decoder in each
timestep.
The decoder attends to specific areas of the input, namely to those
areas with large weights.

Neural Networks - Sequence Modeling and Advanced Architectures 57




Bahdanau Attention

How to compute the attention weights? Note that they are different
from the weights which connect neurons in the NN, because they
depend on the data!
Original idea (“Bahdanau Attention”): predict the weights from the
state of the decoder, using a feedforward neural network.
: For decoding step i, estimate the alignment of the decoder state si−1
and each annotation hj with a feedforward NN:

eij = a(si−1 , hj )

where a is a neural network.


: Compute the attention weights αij from the alignments with a softmax function, i.e. α⃗i = softmax(⃗ei).
: Compute the context vector as a weighted sum of the annotations: ci = Σj αij hj.
The network a is trained jointly with the rest of the architecture.
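
A minimal sketch of one such attention step; the additive scorer a(s, h) = w · tanh(Ws s + Wh h) and all shapes are illustrative assumptions, not the exact parametrization from the paper.

import numpy as np

def bahdanau_context(s_prev, H, W_s, W_h, w):
    """s_prev: decoder state (d_s,); H: annotations (T, d_h)."""
    e = np.array([w @ np.tanh(W_s @ s_prev + W_h @ h) for h in H])  # alignments e_ij
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                   # softmax -> attention weights alpha_ij
    context = alpha @ H                    # weighted sum of the annotations
    return context, alpha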

Neural Networks - Sequence Modeling and Advanced Architectures 59


Bahdanau Attention

The first application of attention was to Machine Translation.


The figure compares the quality of translations for the
attention-based (“search”) system and for the classical
encoder-decoder (“enc”) system (BLEU score, higher is better), for
two different training setups.
In particular for long sentences, the attention system is substantially
better.

Source: Bahdanau et al, Neural Machine Translation by jointly learning to align and translate

Neural Networks - Sequence Modeling and Advanced Architectures 60


Bahdanau Attention

Source: Bahdanau et al, Neural Machine Translation by jointly learning to align and translate

Neural Networks - Sequence Modeling and Advanced Architectures 61


Image Captioning with Visual Attention

The Attention concept is powerful and can be used for many different tasks.
Often, some interpretation of the attention weights is possible.
Example: Create descriptions of images.
Several styles of attention (“soft” vs “hard” attention).
Note that the input is not sequential (but an image).

From Xu et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Neural Networks - Sequence Modeling and Advanced Architectures 62


Image Captioning with Visual Attention

Neural Networks - Sequence Modeling and Advanced Architectures 63


More Attention Mechanisms

The attention mechanism can be varied in many ways.


Major simplification: the scalar product between the decoder state and
the annotation can be used as similarity score (dot-attention). The extra
subnetwork which computes the attention is often not necessary.
A very flexible approach divides the annotation vectors into keys and
values, and derives queries from the decoder state.
Between each key vector and the query, a similarity score is computed,
and these scores (after normalization) are used to compute the weighted
average of the values.
In Bahdanau’s classical method, the annotation is used simultaneously as key and value, and the similarity score is computed by a neural network.
Multi-head attention: Carry out multiple attention operations using
separate parameters, and concatenate their results.
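
A minimal sketch of dot-product attention with separate keys and values, plus multi-head concatenation; the per-head projection matrices and shapes are illustrative, not a reference implementation.

import numpy as np

def dot_attention(query, keys, values):
    """query: (d,); keys: (T, d); values: (T, d_v)."""
    scores = keys @ query / np.sqrt(keys.shape[1])   # similarity score per key
    w = np.exp(scores - scores.max())
    w /= w.sum()                                     # normalized attention weights
    return w @ values                                # weighted average of the values

def multi_head(query, keys, values, heads):
    """heads: list of (Wq, Wk, Wv) projection triples, one per attention head."""
    outs = [dot_attention(Wq @ query, keys @ Wk.T, values @ Wv.T)
            for Wq, Wk, Wv in heads]
    return np.concatenate(outs)                      # concatenate the head results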

Neural Networks - Sequence Modeling and Advanced Architectures 64


Advanced topic: The Transformer Architecture
Self-Attention

Can we build a general purpose sequence processing layer based on attention?
Basic idea: Take an N-step input sequence.
: The input at each step is transformed in 3 ways, into key, value, and query vectors. This transformation is learned during training.
: For any step n:
: Match key and query over all steps (dot product style).
: Compute softmax to obtain normalized weights
: Compute output at step n as weighted sum over values, with
normalized softmax weights.
: This concept is known as self-attention.
: The keys and values form the memory of the layer.
: The memory grows with the size of the input sequence.
Note that no recurrence is involved: the layer can be computed entirely in parallel over the sequence.

Neural Networks - Sequence Modeling and Advanced Architectures 66


Self-Attention

Neural Networks - Sequence Modeling and Advanced Architectures 67


Self-attention in detail

Consider a sequence of vectors (x1 , . . . , xN ):

At step n, the input xn ∈ R^D is first transformed into three vectors:

qn, kn, vn = Q xn, K xn, V xn,    where Q, K ∈ R^(dkey×D), V ∈ R^(dvalue×D)

Q, K, V are the trainable parameters of the layer.

Compute alignments for position n by matching the query qn to all keys:

αnm = softmax_m( (qn · km) / √dkey )

(the rescaling avoids too high variance in the alignments).

The output yn is computed as the weighted sum of the values, where the weights are given by the alignment:

yn = Σm αnm vm
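
A minimal single-head sketch of these equations (not from the slides); X holds the N input vectors as rows, and Q, K, V are the trainable projection matrices.

import numpy as np

def self_attention(X, Q, K, V):
    q = X @ Q.T                                 # (N, d_key)   queries
    k = X @ K.T                                 # (N, d_key)   keys
    v = X @ V.T                                 # (N, d_value) values
    scores = q @ k.T / np.sqrt(k.shape[1])      # q_n · k_m / sqrt(d_key)
    scores -= scores.max(axis=1, keepdims=True)
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)   # softmax over m
    return alpha @ v                            # y_n = sum_m alpha_nm v_m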

Neural Networks - Sequence Modeling and Advanced Architectures 68


Transformer

The Transformer (Vaswani et al. 2017) is a model with multiple layers. Each Transformer layer contains:
: one self-attention layer
: one feed-forward layer.
with residual connection and layer normalization (methods to
facilitate training) between each component.
It made self-attention very popular and is the latest large advancement in neural network architecture (2017).
You will likely be using a Transformer layer rather than a self-attention layer alone.
Frequently uses multiple independent attention heads.

Neural Networks - Sequence Modeling and Advanced Architectures 69


Transformer

Graphical overview of the transformer with multiple transformer layers.
Each input frame is encoded into one output
frame.
One element of the system has not yet been
covered: positional encoding.
Required because unlike the RNN, in the
transformer architecture we have no implicit
information about the order of frames!

Neural Networks - Sequence Modeling and Advanced Architectures 70


Positional encoding

Positional information (frame indices) needs to be provided explicitly!
: Can we simply count positions (1,2,. . . ) and add these values as a
component to the input?
: Problems: bad normalization, does not generalize well
: Divide count by maximum value (e.g. 1,2,3,4 → 0.25,0.5,0.75,1.0)
: Problem: Does not work well for different sequence lengths
: Idea: count in binary numbers, then the numeric range is between 0
and 1 (-1 to 1 can be arranged)
: This means that we add multiple components to the input vector
(e.g. maximum sequence length 1024 → 10 components)
: Still problematic: the representation is not continuous, positions
cannot be interpolated (e.g. the middle between (0,0,0) and (1,0,0)
is (0,1,0)).

Neural Networks - Sequence Modeling and Advanced Architectures 71


Positional encoding

Idea: use the sine/cosine wave at different frequencies as a continuous way to encode positions!
Sinusoidal positional encoding: Position i is represented by a vector ei of dimension D where each component is computed, for 0 ≤ k < D/2:

e_i,2k = sin(i / 10000^(2k/D))
e_i,2k+1 = cos(i / 10000^(2k/D))

Image source: https://towardsdatascience.com/master-positional- encoding- part-i-63c05d90a0c3
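
A minimal sketch (assuming an even dimensionality D) that builds this encoding for positions 0, . . . , N−1:

import numpy as np

def sinusoidal_positional_encoding(N, D):
    pos = np.arange(N)[:, None]                 # positions i
    k = np.arange(D // 2)[None, :]              # frequency index k
    angles = pos / (10000 ** (2 * k / D))       # i / 10000^(2k/D)
    enc = np.zeros((N, D))
    enc[:, 0::2] = np.sin(angles)               # even components e_i,2k
    enc[:, 1::2] = np.cos(angles)               # odd components  e_i,2k+1
    return enc                                  # added to the input vectors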

Neural Networks - Sequence Modeling and Advanced Architectures 72


Positional encoding


Choice of 10000: a “large” number.


Standard approach: D is the input dimensionality, and the positional
encoding is added to the input
Studying and improving positional encoding is still a hot research
topic!

Neural Networks - Sequence Modeling and Advanced Architectures 73


Transformer for Nonaligned Sequences

What if the input samples and output targets are not aligned?
Same idea as with the LSTM: Use an
encoder, a decoder, and (multi-head)
attention to connect these two modules.
Decoder works in autoregressive mode.
Self-attention in the decoder attends
only to past tokens (i.e. tokens which
have already been output).

Image from Vaswani et al. (2017), Attention is all you need

Neural Networks - Sequence Modeling and Advanced Architectures 74


Summary and Conclusion

You have learned about two fundamental NN paradigms: recurrence and attention.
Recurrence is a straightforward way to deal with sequences, but
causes long paths between inputs and outputs, leading to Vanishing
Gradients.
The LSTM elegantly solves the vanishing gradient problem.
The encoder-attention-decoder architecture likewise elegantly solves
the problem of matching inputs and outputs, and creates short paths
between inputs and outputs.
We have also gotten to know the Transformer, as of now the most
recent groundbreaking development in NN architecture.
Finally, remember Beam Search as a very common algorithm for
searching through sequences; it appears in many forms and has
applications also beyond neural networks.

Neural Networks - Sequence Modeling and Advanced Architectures 75
