MLPS2021 03 CNN-RNN-LSTM
Stephan Sigg
Department of Communications and Networking
Aalto University, School of Electrical Engineering
stephan.sigg@aalto.fi
Lecture Recap
Outline
ANN brief
CNN
RNN
LSTM
Neural networks
For the input layer, we construct linear combinations of the input variables $x_1, \dots, x_{D_1}$ and weights $w_{11}^{(1)}, \dots, w_{D_1 D_2}^{(1)}$:

$$ z_j^{(2)} = \sum_{i=1}^{D_1} w_{ij}^{(1)} x_i + w_{0j}^{(1)} $$

Each value $a_j^{(l)}$ in the hidden and output layers $l$, $l \in \{2, \dots, L\}$, is computed from $z_j^{(l)}$ using a differentiable, non-linear activation function:

$$ a_j^{(l)} = f_{\text{act}}^{(l)}\left( z_j^{(l)} \right) $$
Neural networks
Input layer: linear combinations of $x_1, \dots, x_{D_1}$ and $w_{11}, \dots, w_{D_1 D_2}$:

$$ z_j^{(2)} = \sum_{i=1}^{D_1} w_{ij}^{(1)} x_i + w_{0j}^{(1)} $$

Activation function: differentiable, non-linear:

$$ a_j^{(2)} = f_{\text{act}}^{(2)}\left( z_j^{(2)} \right) $$

$f_{\text{act}}(\cdot)$ is usually a sigmoidal function or tanh.
Neural networks
Values $a_j^{(2)}$ are then linearly combined in hidden layers:

$$ z_k^{(3)} = \sum_{j=1}^{D_2} w_{jk}^{(2)} a_j^{(2)} + w_{0k}^{(2)} $$
Neural networks
Combine these stages to achieve the overall network function:

$$ h_k(\vec{x}, \vec{w}) = f_{\text{act}}^{(3)}\!\left( \sum_{j=1}^{D_2} w_{jk}^{(2)} \, f_{\text{act}}^{(2)}\!\left( \sum_{i=1}^{D_1} w_{ij}^{(1)} x_i + w_{0j}^{(1)} \right) + w_{0k}^{(2)} \right) $$
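As an illustration, a minimal NumPy sketch of this two-layer forward pass, assuming a sigmoid for $f_{\text{act}}^{(2)}$, an identity output activation and toy dimensions $D_1 = 4$, $D_2 = 3$, $K = 2$ (all assumptions for the example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Two-layer forward pass h(x, w).

    W1: (D1, D2) input-to-hidden weights, b1: (D2,) biases w_0j
    W2: (D2, K)  hidden-to-output weights, b2: (K,) biases w_0k
    """
    z2 = x @ W1 + b1      # z_j^(2) = sum_i w_ij^(1) x_i + w_0j^(1)
    a2 = sigmoid(z2)      # a_j^(2) = f_act(z_j^(2))
    z3 = a2 @ W2 + b2     # z_k^(3) = sum_j w_jk^(2) a_j^(2) + w_0k^(2)
    return z3             # identity output activation (assumed here)

# toy dimensions, assumed for illustration: D1 = 4, D2 = 3, K = 2
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 2)), np.zeros(2)
print(forward(x, W1, b1, W2, b2))
```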
CNN introduction
Successes of DNNs
In recent years, deep neural networks have led to breakthrough results for various pattern recognition problems such as computer vision or voice recognition.
Convolutional neural networks had an essential role in this success.
CNNs can be thought of as having many identical copies of the same neuron → lower number of parameters.
Imagenet
In 2012, Krizhevsky, Sutskever and Hinton introduced CNNs while proposing Imagenet, which largely improved on existing image classification results at that time (Krizhevsky, Sutskever, Hinton (2012). Imagenet classification with deep convolutional neural networks).
CNN introduction
Imagenet
Imagenet was trained to classify images into a thousand different categories.
Random guess: 0.1%
Imagenet: 63% (85% among top 5)
CNN introduction
What is convolution?
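For discrete signals, convolution can be written as sliding a kernel $w$ over an input $x$ and summing the weighted values at each position:

$$ (x * w)[n] = \sum_{m} x[m]\, w[n - m] $$

In a CNN, the kernel entries are learned weights, and the same kernel is reused at every position $n$.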
CNN introduction
How to use convolution with images?
Example: Detect edges
We can detect edges in images by taking the values -1 and 1 in two adjacent pixels and 0 everywhere else:
Similar adjacent pixels: y ≈ 0
Different adjacent pixels: |y| large
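A minimal NumPy sketch of this edge detector, assuming a 1-D row of pixel values (the values below are made up for illustration) and the kernel $[-1, 1]$ applied to adjacent pixels:

```python
import numpy as np

# a made-up row of pixel intensities: a flat region followed by a step edge
pixels = np.array([10, 10, 10, 10, 200, 200, 200], dtype=float)

kernel = np.array([-1, 1], dtype=float)   # -1 and 1 on two adjacent pixels

# y[k] = -1 * pixels[k] + 1 * pixels[k+1]
# (np.convolve flips its second argument, so pre-flip to get a plain sliding product)
y = np.convolve(pixels, kernel[::-1], mode="valid")
print(y)   # ~0 where neighbours are similar, large |y| at the edge
```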
CNN introduction
Convolution in Neural Networks
Convolution function
Deviate from the fully connected input layer:

$$ z_k^{(2)} = f_{\text{act}}\!\left( w_{0k}^{(2)} + \sum_{i=0}^{l} \sum_{j=1}^{m} w_j^{(2)}\, x_{k+i} \right) $$

In matrix notation: $z_k^{(2)} = f_{\text{act}}(WX)$

Here: $z_k^{(2)} = f_{\text{act}}\!\left( w_{0k}^{(2)} + w_{11}^{(2)} x_k + w_{12}^{(2)} x_{k+1} \right)$
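A small NumPy sketch of this layer, using a window of two samples with shared weights $w_{11}, w_{12}$ as in the example above; the bias, the choice of tanh as $f_{\text{act}}$ and the input values are assumptions for illustration:

```python
import numpy as np

def conv_layer(x, w, w0, f_act=np.tanh):
    """z_k = f_act(w0 + w[0]*x[k] + w[1]*x[k+1] + ...), the same w for every k."""
    K = len(w)
    windows = np.stack([x[k:k + K] for k in range(len(x) - K + 1)])  # sliding windows
    return f_act(w0 + windows @ w)

x = np.array([0.1, 0.4, 0.35, 0.9, 0.8, 0.2])   # made-up input samples
w = np.array([0.5, -0.5])                        # shared weights w11, w12
print(conv_layer(x, w, w0=0.0))
```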
CNN introduction
Convolution in Neural Networks
Traditional weight matrix:

$$ W = \begin{pmatrix} W_{1,1} & W_{1,2} & W_{1,3} & W_{1,4} & \cdots \\ W_{2,1} & W_{2,2} & W_{2,3} & W_{2,4} & \cdots \\ W_{3,1} & W_{3,2} & W_{3,3} & W_{3,4} & \cdots \\ W_{4,1} & W_{4,2} & W_{4,3} & W_{4,4} & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix} $$

Neurons are exclusively defined by their weights → same weights ≡ identical copies of a neuron
Multiplying the CNN weight matrix ≡ sliding a function $[\dots, 0, w_{11}, w_{12}, 0, \dots]$ over the $x_i$
Analogous to reuse of functions in programming: learn a neuron once and apply it in multiple places
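To make the equivalence concrete, a short NumPy check (weights and inputs made up for illustration): multiplying with a banded weight matrix whose rows are $[\dots, 0, w_{11}, w_{12}, 0, \dots]$ gives the same result as sliding the two weights over the $x_i$:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])   # made-up inputs
w11, w12 = 0.5, -0.25                       # the shared weights

# banded "CNN weight matrix": each row is [..., 0, w11, w12, 0, ...]
n = len(x) - 1
W = np.zeros((n, len(x)))
for k in range(n):
    W[k, k], W[k, k + 1] = w11, w12

matrix_product = W @ x                                              # W x
sliding = np.array([w11 * x[k] + w12 * x[k + 1] for k in range(n)])
print(np.allclose(matrix_product, sliding))                         # True
```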
CNN overview
Different types of layers in a CNN
CNN overview
Pooling layers – Pooled feature maps
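A minimal sketch of how a pooling layer turns a feature map into a pooled feature map, assuming max pooling with non-overlapping windows of size 2 (the feature-map values are made up):

```python
import numpy as np

def max_pool_1d(feature_map, pool_size=2):
    """Keep the maximum of each non-overlapping window of length pool_size."""
    n = len(feature_map) // pool_size * pool_size      # drop any remainder
    return feature_map[:n].reshape(-1, pool_size).max(axis=1)

fm = np.array([0.1, 0.7, 0.3, 0.2, 0.9, 0.4])   # made-up feature map
print(max_pool_1d(fm))                           # [0.7 0.3 0.9]
```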
CNN example
Speech prediction from audio samples
Input: evenly spaced samples
Symmetry: audio has local properties (frequency, pitch, ...) that are useful everywhere in the input → group neurons that look at small time segments to compute features
Activation: the output of each convolutional layer is fed into a fully-connected layer
Stacking: higher-level, abstract features are found by stacking convolutional layers
Pooling: pooling layers zoom out to allow later layers to operate on larger sections
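One way such a pipeline could look in code: a sketch using tf.keras, where the layer sizes, kernel widths and the 1000-sample input length are assumptions for illustration rather than values from the lecture:

```python
import tensorflow as tf
from tensorflow.keras import layers

# 1-D CNN over evenly spaced audio samples (all sizes are assumed)
inputs = tf.keras.Input(shape=(1000, 1))                     # evenly spaced samples
x = layers.Conv1D(16, 9, activation="relu")(inputs)          # local features over small time segments
x = layers.MaxPooling1D(4)(x)                                # zoom out for later layers
x = layers.Conv1D(32, 9, activation="relu")(x)               # higher-level features by stacking
x = layers.MaxPooling1D(4)(x)
x = layers.Flatten()(x)
x = layers.Dense(64, activation="relu")(x)                   # fully-connected stage
outputs = layers.Dense(1)(x)                                 # e.g. the predicted next sample
model = tf.keras.Model(inputs, outputs)
model.summary()
```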
CNN remark
Functions realized in convolutional layers
For each feature, a parallel convolutional layer is needed
→ May comprise complex functions or networks on their own (Lin, Chen, Yan (2013). Network in network. arXiv preprint arXiv:1312.4400).
RNN overview
RNN
RNN concept
→ RNNs combine state vectors with a fixed (but learned) function to produce a new state vector.
→ RNNs operate in rounds or loops, where the same function is applied multiple times.
→ This allows information to persist.
RNN: multiple copies of the same network, each passing information to its successor.
RNNs often follow a simple structure, for instance implementing a single tanh module.
RNNs are the natural architecture for sequence data and lists, as well as for inputs that exhibit temporal dependencies.
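A minimal NumPy sketch of this loop, assuming the simple single-tanh structure mentioned above; the weight shapes and the input sequence are made up for illustration:

```python
import numpy as np

def rnn(xs, W_h, W_x, b, h0=None):
    """Apply the same learned function at every step: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    h = np.zeros(W_h.shape[0]) if h0 is None else h0
    states = []
    for x_t in xs:                                  # one round/loop per input element
        h = np.tanh(W_h @ h + W_x @ x_t + b)        # same weights reused at every step
        states.append(h)                            # the state vector lets information persist
    return states

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))                        # made-up sequence: 5 steps, 3 features each
W_h, W_x, b = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4)
print(rnn(xs, W_h, W_x, b)[-1])                     # final state after the whole sequence
```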
RNN limitations
RNN capabilities
The sun rises in the → morning
(direct temporal dependencies)
RNN limitations
I have been practicing Finnish for 5 years ... I am fluent in → Finnish
(contextual dependency on more distant information)
With an increasing temporal gap, RNNs become unable to learn to connect the relevant information: the impact on the weights diminishes (Bengio et al., Learning long-term dependencies with gradient descent is difficult, IEEE TNN, 1994).
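A brief sketch of why the impact diminishes, assuming the simple tanh recurrence $h_k = \tanh(W_h h_{k-1} + W_x x_k + b)$ from above: the gradient linking step $t$ to a much later step $T$ is a product of one Jacobian per intermediate step,

$$ \frac{\partial h_T}{\partial h_t} = \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}} = \prod_{k=t+1}^{T} \operatorname{diag}\!\left(\tanh'(z_k)\right) W_h , $$

and since $\tanh'(\cdot) \le 1$, this product typically shrinks (or, for large weights, explodes) as the gap $T - t$ grows; this is the effect analysed by Bengio et al. (1994).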
LSTM overview
LSTMs are a special kind of RNN (Hochreiter, Schmidhuber, Long short-term memory. Neural computation, 1997)
capable of learning long-term dependencies
explicitly designed to remember information for long periods
LSTM construction
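For reference, the commonly used LSTM cell update (the variant with a forget gate; this is one standard way to write it, with forget gate $f_t$, input gate $i_t$, output gate $o_t$, cell state $C_t$ and hidden state $h_t$):

$$
\begin{aligned}
f_t &= \sigma\!\left(W_f [h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\!\left(W_i [h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\!\left(W_C [h_{t-1}, x_t] + b_C\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma\!\left(W_o [h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
$$

The cell state $C_t$ is the pathway that lets information persist over long periods; the gates decide what to forget, what to add and what to expose.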
LSTM alternative constructions
Questions?
Stephan Sigg
stephan.sigg@aalto.fi
Si Zuo
si.zuo@aalto.fi
Hossein Firooz
hossein.firooz@aalto.fi
Literature
C.M. Bishop: Pattern recognition and machine learning, Springer, 2007.
R.O. Duda, P.E. Hart, D.G. Stork: Pattern Classification, Wiley, 2001.