MLPS2021 03 CNN-RNN-LSTM
Stephan Sigg
Department of Communications and Networking
Aalto University, School of Electrical Engineering
stephan.sigg@aalto.fi
Lecture Recap
Outline
ANN brief
CNN
RNN
LSTM
Neural networks
For the input layer, we construct linear combinations of the input variables $x_1, \dots, x_{D_1}$ and weights $w_{11}^{(1)}, \dots, w_{D_1 D_2}^{(1)}$:

$$ z_j^{(2)} = \sum_{i=1}^{D_1} w_{ij}^{(1)} x_i + w_{0j}^{(1)} $$

Each value $a_j^{(l)}$ in the hidden and output layers $l$, $l \in \{2, \dots, L\}$, is computed from $z_j^{(l)}$ using a differentiable, non-linear activation function:

$$ a_j^{(l)} = f_{\text{act}}^{(l)}\left( z_j^{(l)} \right) $$
Neural networks
Input layer: linear combinations of $x_1, \dots, x_{D_1}$ and $w_{11}, \dots, w_{D_1 D_2}$:

$$ z_j^{(2)} = \sum_{i=1}^{D_1} w_{ij}^{(1)} x_i + w_{0j}^{(1)} $$

Activation function: differentiable, non-linear:

$$ a_j^{(2)} = f_{\text{act}}^{(2)}\left( z_j^{(2)} \right) $$

$f_{\text{act}}(\cdot)$ is usually a sigmoidal function or tanh.
Neural networks
Values $a_j^{(2)}$ are then linearly combined in hidden layers:

$$ z_k^{(3)} = \sum_{j=1}^{D_2} w_{jk}^{(2)} a_j^{(2)} + w_{0k}^{(2)} $$
Neural networks
Combine these stages to achieve the overall network function:

$$ h_k(\vec{x}, \vec{w}) = f_{\text{act}}^{(3)}\!\left( \sum_{j=1}^{D_2} w_{jk}^{(2)} \, f_{\text{act}}^{(2)}\!\left( \sum_{i=1}^{D_1} w_{ij}^{(1)} x_i + w_{0j}^{(1)} \right) + w_{0k}^{(2)} \right) $$
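As an illustration, a minimal NumPy sketch of this two-layer forward pass, assuming a sigmoid for $f_{\text{act}}^{(2)}$, an identity output activation and toy dimensions $D_1 = 4$, $D_2 = 3$, $K = 2$ (all assumptions for the example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Two-layer forward pass h(x, w).

    W1: (D1, D2) input-to-hidden weights, b1: (D2,) biases w_0j
    W2: (D2, K)  hidden-to-output weights, b2: (K,) biases w_0k
    """
    z2 = x @ W1 + b1      # z_j^(2) = sum_i w_ij^(1) x_i + w_0j^(1)
    a2 = sigmoid(z2)      # a_j^(2) = f_act(z_j^(2))
    z3 = a2 @ W2 + b2     # z_k^(3) = sum_j w_jk^(2) a_j^(2) + w_0k^(2)
    return z3             # identity output activation (assumed here)

# toy dimensions, assumed for illustration: D1 = 4, D2 = 3, K = 2
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 2)), np.zeros(2)
print(forward(x, W1, b1, W2, b2))
```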
CNN introduction
Successes of DNNs
In recent years, deep neural networks have led to breakthrough results for various pattern recognition problems such as computer vision or voice recognition.
Convolutional neural networks had an essential role in this success.
CNNs can be thought of as having many identical copies of the same neuron → lower number of parameters.
Imagenet
In 2012, Krizhevsky, Sutskever and Hinton introduced CNNs while proposing Imagenet, which largely improved on existing image classification results at that time (Krizhevsky, Sutskever, Hinton (2012). Imagenet classification with deep convolutional neural networks).
CNN introduction
Imagenet
Imagenet was trained to classify images into a thousand different categories.
Random guess: 0.1%
Imagenet: 63% (85% among top 5)
CNN introduction
What is convolution?
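For discrete signals, convolution can be written as sliding a kernel $w$ over an input $x$ and summing the weighted values at each position:

$$ (x * w)[n] = \sum_{m} x[m]\, w[n - m] $$

In a CNN, the kernel entries are learned weights, and the same kernel is reused at every position $n$.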
CNN introduction
How to use convolution with images?
Example: Detect edges
We can detect edges in images by taking the values -1 and 1 in two adjacent pixels and 0 everywhere else:
Similar adjacent pixels: y ≈ 0
Different adjacent pixels: |y| large
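A minimal NumPy sketch of this edge detector, assuming a 1-D row of pixel values (the values below are made up for illustration) and the kernel $[-1, 1]$ applied to adjacent pixels:

```python
import numpy as np

# a made-up row of pixel intensities: a flat region followed by a step edge
pixels = np.array([10, 10, 10, 10, 200, 200, 200], dtype=float)

kernel = np.array([-1, 1], dtype=float)   # -1 and 1 on two adjacent pixels

# y[k] = -1 * pixels[k] + 1 * pixels[k+1]
# (np.convolve flips its second argument, so pre-flip to get a plain sliding product)
y = np.convolve(pixels, kernel[::-1], mode="valid")
print(y)   # ~0 where neighbours are similar, large |y| at the edge
```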
CNN introduction
Convolution in Neural Networks
Convolution function
Deviate from the fully connected input layer:

$$ z_k^{(2)} = f_{\text{act}}\!\left( w_{0k}^{(2)} + \sum_{i=0}^{l} \sum_{j=1}^{m} w_j^{(2)}\, x_{k+i} \right) $$

In matrix notation: $z_k^{(2)} = f_{\text{act}}(WX)$

Here: $z_k^{(2)} = f_{\text{act}}\!\left( w_{0k}^{(2)} + w_{11}^{(2)} x_k + w_{12}^{(2)} x_{k+1} \right)$
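A small NumPy sketch of this layer, using a window of two samples with shared weights $w_{11}, w_{12}$ as in the example above; the bias, the choice of tanh as $f_{\text{act}}$ and the input values are assumptions for illustration:

```python
import numpy as np

def conv_layer(x, w, w0, f_act=np.tanh):
    """z_k = f_act(w0 + w[0]*x[k] + w[1]*x[k+1] + ...), the same w for every k."""
    K = len(w)
    windows = np.stack([x[k:k + K] for k in range(len(x) - K + 1)])  # sliding windows
    return f_act(w0 + windows @ w)

x = np.array([0.1, 0.4, 0.35, 0.9, 0.8, 0.2])   # made-up input samples
w = np.array([0.5, -0.5])                        # shared weights w11, w12
print(conv_layer(x, w, w0=0.0))
```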
CNN introduction
Convolution in Neural Networks
Traditional weight matrix:

$$ W = \begin{pmatrix} W_{1,1} & W_{1,2} & W_{1,3} & W_{1,4} & \cdots \\ W_{2,1} & W_{2,2} & W_{2,3} & W_{2,4} & \cdots \\ W_{3,1} & W_{3,2} & W_{3,3} & W_{3,4} & \cdots \\ W_{4,1} & W_{4,2} & W_{4,3} & W_{4,4} & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix} $$

Neurons are exclusively defined by their weights → same weights ≡ identical copies of a neuron
Multiplying the CNN weight matrix ≡ sliding a function $[\dots, 0, w_{11}, w_{12}, 0, \dots]$ over the $x_i$
Analogous to reuse of functions in programming: learn a neuron once and apply it in multiple places
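To make the equivalence concrete, a short NumPy check (weights and inputs made up for illustration): multiplying with a banded weight matrix whose rows are $[\dots, 0, w_{11}, w_{12}, 0, \dots]$ gives the same result as sliding the two weights over the $x_i$:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])   # made-up inputs
w11, w12 = 0.5, -0.25                       # the shared weights

# banded "CNN weight matrix": each row is [..., 0, w11, w12, 0, ...]
n = len(x) - 1
W = np.zeros((n, len(x)))
for k in range(n):
    W[k, k], W[k, k + 1] = w11, w12

matrix_product = W @ x                                              # W x
sliding = np.array([w11 * x[k] + w12 * x[k + 1] for k in range(n)])
print(np.allclose(matrix_product, sliding))                         # True
```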
CNN overview
Different types of layers in a CNN
CNN overview
Pooling layers – Pooled feature maps
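A minimal sketch of how a pooling layer turns a feature map into a pooled feature map, assuming max pooling with non-overlapping windows of size 2 (the feature-map values are made up):

```python
import numpy as np

def max_pool_1d(feature_map, pool_size=2):
    """Keep the maximum of each non-overlapping window of length pool_size."""
    n = len(feature_map) // pool_size * pool_size      # drop any remainder
    return feature_map[:n].reshape(-1, pool_size).max(axis=1)

fm = np.array([0.1, 0.7, 0.3, 0.2, 0.9, 0.4])   # made-up feature map
print(max_pool_1d(fm))                           # [0.7 0.3 0.9]
```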
CNN example
Speech prediction from audio samples
Input: evenly spaced samples
Symmetry: audio has local properties (frequency, pitch, ...) that are useful everywhere in the input → group neurons that look at small time segments to compute features
Activation: the output of each convolutional layer is fed into a fully-connected layer
Stacking: higher-level, abstract features are found by stacking convolutional layers
Pooling: pooling layers zoom out to allow later layers to operate on larger sections
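One way such a pipeline could look in code: a sketch using tf.keras, where the layer sizes, kernel widths and the 1000-sample input length are assumptions for illustration rather than values from the lecture:

```python
import tensorflow as tf
from tensorflow.keras import layers

# 1-D CNN over evenly spaced audio samples (all sizes are assumed)
inputs = tf.keras.Input(shape=(1000, 1))                     # evenly spaced samples
x = layers.Conv1D(16, 9, activation="relu")(inputs)          # local features over small time segments
x = layers.MaxPooling1D(4)(x)                                # zoom out for later layers
x = layers.Conv1D(32, 9, activation="relu")(x)               # higher-level features by stacking
x = layers.MaxPooling1D(4)(x)
x = layers.Flatten()(x)
x = layers.Dense(64, activation="relu")(x)                   # fully-connected stage
outputs = layers.Dense(1)(x)                                 # e.g. the predicted next sample
model = tf.keras.Model(inputs, outputs)
model.summary()
```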
CNN remark
Functions realized in convolutional layers
For each feature, a parallel convolutional layer is needed
→ May comprise complex functions or networks on their own (Lin, Chen, Yan (2013). Network in network. arXiv preprint arXiv:1312.4400).
RNN overview
RNN
RNN concept
→ RNNs combine state vectors with a fixed (but learned) function to produce a new state vector.
→ RNNs operate in rounds or loops, where the same function is applied multiple times.
→ This allows information to persist.
RNN: multiple copies of the same network, each passing information to its successor.
RNNs often follow a simple structure, for instance implementing a single tanh module.
RNNs are the natural architecture for sequence data and lists, as well as for inputs that exhibit temporal dependencies.
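A minimal NumPy sketch of this loop, assuming the simple single-tanh structure mentioned above; the weight shapes and the input sequence are made up for illustration:

```python
import numpy as np

def rnn(xs, W_h, W_x, b, h0=None):
    """Apply the same learned function at every step: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    h = np.zeros(W_h.shape[0]) if h0 is None else h0
    states = []
    for x_t in xs:                                  # one round/loop per input element
        h = np.tanh(W_h @ h + W_x @ x_t + b)        # same weights reused at every step
        states.append(h)                            # the state vector lets information persist
    return states

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))                        # made-up sequence: 5 steps, 3 features each
W_h, W_x, b = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4)
print(rnn(xs, W_h, W_x, b)[-1])                     # final state after the whole sequence
```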
RNN limitations
RNN capabilities
The sun rises in the → morning
(direct temporal dependencies)
RNN limitations
I have been practicing Finnish for 5 years ... I am fluent in → Finnish
(contextual dependency on more distant information)
With an increasing temporal gap, RNNs become unable to learn to connect the relevant information: the impact on the weights diminishes (Bengio et al., Learning long-term dependencies with gradient descent is difficult, IEEE TNN, 1994).
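A brief sketch of why the impact diminishes, assuming the simple tanh recurrence $h_k = \tanh(W_h h_{k-1} + W_x x_k + b)$ from above: the gradient linking step $t$ to a much later step $T$ is a product of one Jacobian per intermediate step,

$$ \frac{\partial h_T}{\partial h_t} = \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}} = \prod_{k=t+1}^{T} \operatorname{diag}\!\left(\tanh'(z_k)\right) W_h , $$

and since $\tanh'(\cdot) \le 1$, this product typically shrinks (or, for large weights, explodes) as the gap $T - t$ grows; this is the effect analysed by Bengio et al. (1994).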
LSTM overview
LSTMs are a special kind of RNN (Hochreiter, Schmidhuber, Long short-term memory. Neural computation, 1997)
capable of learning long-term dependencies
explicitly designed to remember information for long periods
LSTM construction
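For reference, the commonly used LSTM cell update (the variant with a forget gate; this is one standard way to write it, with forget gate $f_t$, input gate $i_t$, output gate $o_t$, cell state $C_t$ and hidden state $h_t$):

$$
\begin{aligned}
f_t &= \sigma\!\left(W_f [h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\!\left(W_i [h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\!\left(W_C [h_{t-1}, x_t] + b_C\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma\!\left(W_o [h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
$$

The cell state $C_t$ is the pathway that lets information persist over long periods; the gates decide what to forget, what to add and what to expose.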
LSTM alternative constructions
Questions?
Stephan Sigg
stephan.sigg@aalto.fi
Si Zuo
si.zuo@aalto.fi
Hossein Firooz
hossein.firooz@aalto.fi
Literature
C.M. Bishop: Pattern recognition and machine learning, Springer, 2007.
R.O. Duda, P.E. Hart, D.G. Stork: Pattern Classification, Wiley, 2001.