ML 13
Week 13
Deep Learning
Convolutional Neural Network
Umut ORHAN, PhD.
Why Is Deep Learning So Popular?
2. Increase in GPU and processing power
Deep Learning Approaches
Convolutional Neural Network
CNNs have applications in image and video
recognition, recommender systems, and related areas.
Why CNNs over Ordinary Neural Networks?
CNNs are used mainly to look for patterns in an image:
we don't need to hand-pick features, because the CNN
learns the right features by itself as it goes deeper.
This is one of the reasons why we need CNNs.
Another reason is that ordinary neural networks don't
scale well to full-sized images. Say the input image
size is 100 (width) × 100 (height) × 3 (RGB): then a
single fully connected neuron in the first hidden layer
already needs 100 × 100 × 3 = 30,000 weights, which is
very expensive across the whole network.
Design of CNN
A CNN consists of an input and an output layer,
as well as multiple hidden layers.
The hidden layers of a CNN typically consist of:
1. Convolution Step
2. Non-Linearity Step
3. Pooling Step
4. Fully Connected Layers
Convolution Step
The primary purpose of Convolution in case of a
CNN is to extract features from the input image.
Convolution preserves the spatial relationship
between pixels by learning image features using
small squares of input data.
(Figure: an image and a filter)
We slide the orange matrix over our original image
(green) by 1 pixel at a time (this step size is also
called the ‘stride’), and for every position:
we compute the element-wise multiplication
between the two matrices, and
we add the products to get a single integer, which
forms one element of the output matrix (pink).
Note that the 3×3 matrix “sees” only a part of the
input image at each position.
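The slide-and-sum procedure above can be sketched in a few lines of NumPy. This is not from the slides; the 5×5 image and 3×3 filter are the classic worked-example values often used to illustrate this step:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Naive 'valid' convolution (really cross-correlation, as used
    in CNNs): slide the kernel over the image and, at every
    position, sum the element-wise products."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out_h = (ih - kh) // stride + 1
    out_w = (iw - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# A 5x5 binary image and a 3x3 filter: the filter "sees" one
# 3x3 patch at each of the 3x3 output positions.
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
print(conv2d(image, kernel))  # the 3x3 feature map
```

Each output element is exactly the "multiply element-wise, then add" operation described above.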
In CNN terminology, the 3×3 matrix is called a
‘filter‘ or ‘kernel’ or ‘feature detector’ and the
matrix formed by sliding the filter over the image
and computing the dot product is called the
‘Convolved Feature’ or ‘Activation Map’ or
the ‘Feature Map‘.
Note that filters act as feature detectors on the
original input image.
In practice, a CNN learns the values of these
filters on its own during the training process
(although we still need to specify parameters
such as number of filters, filter size, architecture
of the network etc. before the training process).
The more filters we have, the more
image features get extracted and the better our
network becomes at recognizing patterns in
unseen images.
The size of the Feature Map (Convolved Feature)
is controlled by three parameters that we need
to decide before the convolution step is
performed:
Depth
Stride
Zero-padding
Depth
Depth corresponds to the number of filters we use for
the convolution operation.
Stride
Stride is the number of pixels by which we slide our
filter matrix over the input matrix. When the stride is 1
then we move the filters one pixel at a time. When the
stride is 2, then the filters jump 2 pixels at a time as we
slide them around. Having a larger stride will produce
smaller feature maps.
Zero-Padding
Sometimes, it is convenient to pad the input matrix
with zeros around the border, so that we can apply
the filter to bordering elements of our input image
matrix. A nice feature of zero padding is that it
allows us to control the size of the feature maps.
Using zero-padding is also called wide convolution,
and not using zero-padding is called narrow
convolution.
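The effect of filter size, stride, and zero-padding on the feature-map size follows the standard formula O = (W − F + 2P) / S + 1, where W is the input size, F the filter size, P the padding, and S the stride. The formula itself is not stated in the slides, but it is the conventional one; a small sketch:

```python
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """Spatial size of the feature map: O = (W - F + 2P) / S + 1."""
    return (input_size - filter_size + 2 * padding) // stride + 1

print(conv_output_size(5, 3))             # 5x5 image, 3x3 filter -> 3
print(conv_output_size(5, 3, padding=1))  # "wide" convolution keeps size -> 5
print(conv_output_size(7, 3, stride=2))   # larger stride, smaller map -> 3
```

This makes the slide's claims concrete: padding lets us control the output size, and a larger stride produces a smaller feature map.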
Non-Linearity Step
An additional operation called ReLU has been
used after every Convolution operation.
ReLU stands for Rectified Linear Unit and is a
non-linear operation.
Commonly used non-linear activation functions include:
tanh,
sigmoid,
ReLU (rectified linear unit).
ReLU is an element-wise operation (applied per
pixel) that replaces all negative pixel values in the
feature map with zero.
The purpose of ReLU is to introduce non-linearity into
our ConvNet, since most of the real-world data we
would want our ConvNet to learn is non-linear
(convolution is a linear operation: element-wise
matrix multiplication and addition, so we account for
non-linearity by introducing a non-linear function
like ReLU).
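A minimal sketch of the element-wise ReLU operation on a small feature map (the values here are illustrative, not taken from the slides):

```python
import numpy as np

# A small feature map with some negative values.
feature_map = np.array([[ 3, -5,  2],
                        [-1,  0,  4],
                        [ 6, -2, -7]])

# ReLU: max(0, x) applied per element; negatives become zero.
relu = np.maximum(feature_map, 0)
print(relu)  # [[3 0 2] [0 0 4] [6 0 0]]
```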
The figure shows the ReLU operation applied to one of
the feature maps obtained after convolution.
The Pooling Step
Spatial Pooling (also called subsampling or
downsampling) reduces the dimensionality of each
feature map but retains the most
important information.
Spatial Pooling can be of different types: Max,
Average, Sum etc.
An example of Max Pooling operation on a Rectified
Feature map (obtained after convolution + ReLU
operation) by using a 2×2 window.
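The 2×2 max pooling operation can be sketched as follows (the rectified feature map values are illustrative, not taken from the slide's figure):

```python
import numpy as np

def max_pool(fmap, size=2, stride=2):
    """Max pooling: keep only the largest value in each window,
    reducing dimensionality while retaining the strongest response."""
    h, w = fmap.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = fmap[i*stride:i*stride+size,
                             j*stride:j*stride+size].max()
    return out

rectified = np.array([[1, 1, 2, 4],
                      [5, 6, 7, 8],
                      [3, 2, 1, 0],
                      [1, 2, 3, 4]])
print(max_pool(rectified))  # 4x4 map -> 2x2 map
```

Swapping `.max()` for `.mean()` or `.sum()` gives the Average and Sum pooling variants mentioned above.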
Fully Connected Layer
The outputs from the convolutional and pooling
layers represent high-level features of the input
image.
The purpose of the Fully Connected layer is to
use these features for classifying the input image
into various classes based on the training
dataset.
Apart from classification, adding a fully-
connected layer is also a (usually) cheap way of
learning non-linear combinations of these
features. Most of the features from convolutional
and pooling layers may be good for the
classification task, but combinations of those
features might be even better.
Introduction to RNN
• The network contains at least one feed-back
connection, so the activations can flow round in a
loop.
• That enables the networks to do temporal
processing and learn sequences.
• Recurrent neural networks (RNNs) are often used for
handling sequential data.
Example: to predict the next word in a sentence, you
need to know which words came before it.
Feed forward networks
• Information only flows one way
• One input pattern produces (same) one output
• No memory of previous state
RNN Architecture
• In the diagram, a chunk of neural network, s, looks
at some input 𝑥𝑡 and outputs a value 𝑜𝑡 . A loop
allows information to be passed from one step of the
network to the next.
• 𝑥𝑡 = input at time step t.
• 𝑠𝑡 = hidden state at time step t, calculated from the
previous hidden state and the input at the current step:
𝑠𝑡 = f(U 𝑥𝑡 + W 𝑠𝑡−1 ).
• f = activation function (tanh, ReLU, etc.).
• 𝑠−1 , required to calculate the first hidden state, is
typically initialized to all zeroes.
• 𝑜𝑡 = output at step t: 𝑜𝑡 = 𝑊ℎ𝑦 ∗ 𝑠𝑡 .
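The recurrence above can be sketched directly in NumPy. The dimensions and random weights below are illustrative assumptions (not from the slides), and tanh is used as the activation f:

```python
import numpy as np

# Illustrative sizes: 4-dim inputs, 3-dim hidden state, 4-dim outputs.
input_dim, hidden_dim, output_dim = 4, 3, 4
rng = np.random.default_rng(0)
U = rng.standard_normal((hidden_dim, input_dim))      # input -> hidden
W = rng.standard_normal((hidden_dim, hidden_dim))     # hidden -> hidden
W_hy = rng.standard_normal((output_dim, hidden_dim))  # hidden -> output

def rnn_forward(xs):
    """Run the recurrence s_t = tanh(U x_t + W s_{t-1}), o_t = W_hy s_t."""
    s = np.zeros(hidden_dim)  # s_{-1}: all zeroes
    outputs = []
    for x in xs:
        s = np.tanh(U @ x + W @ s)   # new hidden state from input + memory
        outputs.append(W_hy @ s)     # output at this step
    return outputs

xs = [rng.standard_normal(input_dim) for _ in range(5)]
outs = rnn_forward(xs)
print(len(outs))  # one output per time step
```

The single hidden vector `s` carried through the loop is what gives the network its memory of previous inputs.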
This is in fact a type of
recurrent neural network:
a one-to-one recurrent net,
because it maps one input to
one output. A one-to-one
recurrent net is equivalent to
an ordinary artificial neural net.
The final type of recurrent net
is many-to-many, where both the
input and the output are sequential.
A use case would be machine
translation where a sequence
of words in one language
needs to be translated to a
sequence of words in
another.
• Another type of many-to-many
architecture exists where each
neuron has a state at every time
step, in a “synchronized” fashion.
Here, each output depends only on
the inputs that were fed in during
or before it. Because of this,
synchronized many-to-many probably
wouldn’t be suitable for translation.
RNN Variants
Simple RNN: Elman Network
• It was first used by Jeff Elman (1990).
• The SRN is a specific type of back-propagation
network.
• It assumes a feed-forward architecture, with units
in input, hidden, and output pools.
• It also allows for a special type of hidden layer
called a “context” layer.
Backpropagation Through Time
• The Backpropagation Through Time (BPTT) learning
algorithm is a natural extension of standard
backpropagation that performs gradient descent on
a complete unfolded network.
• The more context (copy layers) we maintain, the
more history we explicitly include in our gradient
computation.
• The downside of BPTT is that it requires a large
amount of resources:
Storage – the entire history needs to be stored.
Long Short Term Memory
• LSTM networks introduce a new structure called
a memory cell
• Each memory cell contains three gates:
Input gate
Forget gate
Output gate
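A single memory-cell update with the three gates can be sketched as follows. This uses the standard LSTM equations; the weight matrices and sizes are illustrative assumptions, not values from the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One memory-cell update with input, forget, and output gates."""
    Wi, Wf, Wo, Wc = params          # each maps [h_prev; x] -> hidden
    z = np.concatenate([h_prev, x])
    i = sigmoid(Wi @ z)              # input gate: how much new info to admit
    f = sigmoid(Wf @ z)              # forget gate: how much old memory to keep
    o = sigmoid(Wo @ z)              # output gate: how much memory to expose
    c_tilde = np.tanh(Wc @ z)        # candidate new cell content
    c = f * c_prev + i * c_tilde     # update the memory cell
    h = o * np.tanh(c)               # new hidden state
    return h, c

rng = np.random.default_rng(1)
hidden, inp = 3, 2
params = [rng.standard_normal((hidden, hidden + inp)) for _ in range(4)]
h, c = lstm_step(rng.standard_normal(inp),
                 np.zeros(hidden), np.zeros(hidden), params)
print(h.shape, c.shape)
```

The gated cell `c` is what lets LSTMs carry information across many time steps without the gradient vanishing as quickly as in a simple RNN.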
Convolutional RNN
• Convolutional neural networks (CNN) are
able to extract higher level features that are
invariant to local spectral and temporal
variations.
• Recurrent neural networks (RNNs) are
powerful in learning the longer term
temporal context.
• It can be described as a modified CNN in which the
last convolutional layers are replaced with an RNN.
• The key module of this RCNN is the
recurrent convolution layer (RCL).
• The network can evolve over time though
the input is static and each unit is
influenced by its neighboring units.
RNN Example
• The neural network has the vocabulary: h, e, l, o.
• That is, it only knows these four characters;
exactly enough to produce the word “hello”.
• We will input the first character, “h”, and from
there expect the output at the following time
steps to be: “e”, “l”, “l”, and “o” respectively, to
form: hello
• We can represent input and output via one hot
encoding, where each character is a vector with a
1 at the corresponding character position and
otherwise all 0s.
• For example, since our vocabulary is [h, e, l, o],
we can represent characters using a vector with
four values, where a 1 in the first, second, third,
and fourth position would represent “h”, “e”, “l”,
and “o” respectively.
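A minimal sketch of this one-hot encoding for the [h, e, l, o] vocabulary:

```python
vocab = ['h', 'e', 'l', 'o']

def one_hot(ch):
    """One-hot encode a character: 1 at its vocabulary
    position, 0 everywhere else."""
    vec = [0] * len(vocab)
    vec[vocab.index(ch)] = 1
    return vec

print(one_hot('h'))  # [1, 0, 0, 0]
print(one_hot('l'))  # [0, 0, 1, 0]
```

The input "h" and the target sequence "e", "l", "l", "o" would each be fed to the network as vectors of this form.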
As the figure shows, we input the first letter and the
word is completed.
One interesting technique would be to sample
the output at each time step and feed it into the
next as input:
• Each hidden state would contain a similar sort of vector,
though not necessarily something we could interpret like
we can for the output.
• The RNN is saying: given “h”, “e” is most likely to be the
next character. Given “he”, “l” is the next likely character.
With “hel”, “l” should be next, and with “hell”, the final
character should be “o”.
• But if the neural network wasn't trained on the word
“hello”, and thus didn't have optimal weights (i.e. just
randomly initialized weights), then we'd get gibberish
like “hleol” coming out.