Deep Learning: CNN
By
Dr. Ratnakar Dash
CSE Dept, NIT Rourkela
What we do is: (1) take some data, (2) train a model on that data, and (3) use the trained model to make predictions on new data.
2
Deep architecture in humans
The visual pathway forms a hierarchy of increasingly abstract representations:
Retina: pixels
V1: edges, corners
V2: simple shapes
V4: high-level abstractions
(signals pass from the retina via the Lateral Geniculate Nucleus to the visual cortex areas V1, V2 and V4)
3
Shallow learning
• SVM
• ...
4
Deep learning
We learn the patterns or structures from data using some sort of optimization.
The following slides address the WHAT, WHEN, WHERE and HOW of deep learning.
5
What – Deep learning definition
• Deep learning is a kind of representation learning in which there are
multiple levels of features. These features are automatically
discovered and they are composed together in the various levels to
produce the output.
• Each level represents abstract features that are discovered from the
features represented in the previous level. Hence, the level of
abstraction increases with each level.
6
What – Deep learning definition
• Deep learning is a class of machine learning algorithms that use a
cascade of multiple layers of nonlinear processing units for feature
extraction and transformation.
• But commonly it has become a buzzword for ANNs with more
than one hidden layer.
7
When - Deep learning timeline
Although the architectures have existed since the early 2000s, the revolution began in 2012 with the
success of GPU-based implementations of DL models on complex tasks. One of them was the
AlexNet model, which won the ImageNet competition that year.
8
Why – Deep learning needed
The fundamental question remains: is deep learning more powerful than simpler models?
The answer, as one might guess, is yes; the real question is why.
9
Why – Deep learning needed
Another argument in favor of DL comes from analyzing what the hidden layers have learned
during training. A general trend is that each layer combines features of its previous layer to
obtain semantically higher-level features.
For example, in a face recognition system, the first layer might extract edges and corners,
the second layer combines them into features like ears, noses and eyes, whereas the final
layer combines those to generate features of the overall face.
10
Where – Deep learning applications
Applications are practically limitless.
The figure illustrates the basic difference between situations where DL should be applied and those where other models suffice.
11
Deep algorithms
• An algorithm is deep if the input is passed through several non-linearities
before being output.
For intuition, imagine you are told to send a letter to your aunt living in some
other city.
12
Deep algorithms (contd.)
• A deep architecture would be like:
You drop it in a postbox
The postman in your city collects it and drops it at the distribution hub
The hub sends it via a transportation network to the other city
It is collected at the other city's distribution hub
The other postman collects it and gives it to your aunt
Wouldn't it be more compact and gracefully reuse previous efforts?
• Now imagine the work was shallow:
You had to travel all the way to your aunt to give the letter
Wouldn't it be too complex and involve a lot of duplication of effort?
13
How deep learning?
• Imagine that we train a neural network which:
– reproduces its input at the output (an auto-encoder)
– has a sparse hidden layer, i.e. the nodes usually output 0, and only sometimes output values > 0.05
14
How deep learning? (contd.)
• So then your final network might look like:
– input layer
– ...
– output layer
• Now, if we have a lot of labeled data we can then "fine-tune" this network.
– i.e. use the layers learned by the autoencoders as the first generation (initialization) of a big
neural network, and then
– run many generations of backpropagation
15
Deep neural networks
• There are mathematical proofs showing that, for some functions, shallow
circuits require exponentially more circuit elements to compute them than
deep circuits do.
16
Deep neural networks (contd.)
17
Deep neural networks (contd.)
• For instance, if we're doing visual pattern recognition, then the
neurons in the first layer might learn to recognize edges, the neurons
in the second layer could learn to recognize more complex shapes,
say triangles or rectangles, built up from edges. The third layer would
then recognize still more complex shapes, and so on.
18
Convolutional neural networks
• In 1995, Yann LeCun and Yoshua Bengio introduced the concept of
convolutional neural networks.
19
History
20
About CNN’s
• CNNs were neurobiologically motivated by the findings of locally
sensitive and orientation-selective nerve cells in the visual cortex.
• Like almost every other neural network, they are trained with a version of
the back-propagation algorithm.
21
How CNNs work
• Convolution layer
• ReLU Layer
• Pooling Layer
• Softmax Layer
Feature extraction layer
Convolution layer
24
Local receptive fields
25
Local receptive fields
• But we won't connect every input pixel to every hidden neuron.
Instead, we only make connections in small, localized regions of the
input image. Each neuron in the first hidden layer will be connected to
a small region of the input neurons
26
Local receptive fields
• That region in the input image is called the local receptive field for the
hidden neuron. It's a little window on the input pixels
• We then slide the local receptive field across the entire input image.
For each local receptive field, there is a different hidden neuron in the
first hidden layer.
27
Local receptive fields
28
Local receptive fields
• If we have a 28x28 input image, and 5x5 local receptive fields, then there
will be 24x24 neurons in the hidden layer. This is because we can only
move the local receptive field 23 neurons across (or 23 neurons down),
before colliding with the right hand side (or bottom) of the input image.
• In this case, the local receptive field being moved by one pixel at a time. In
fact, sometimes a different stride length is used. For instance, we might
move the local receptive field pixels to the right (or down), in which case
we'd say a stride length of 2 is used.
29
Definition convolution
• For complex-valued functions f, g defined on the set Z of integers, the discrete
convolution of f and g is given by
$(f * g)[n] = \sum_{m} f[m]\, g[n-m]$
• Worked example (valid convolution, 2x2 filter slid over the input with stride 1):
A (4x4) =
1 1 0 1
1 0 0 1
0 1 1 0
1 1 1 1
B (2x2) =
1 0
1 1
A * B = C (3x3) =
2 1 1
2 2 1
2 3 3
31
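As a quick check of the worked example above, here is a minimal NumPy sketch (an illustrative snippet, not from the slides) that slides the 2x2 filter B over A with stride 1 and no padding:

import numpy as np

A = np.array([[1, 1, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0],
              [1, 1, 1, 1]])
B = np.array([[1, 0],
              [1, 1]])

# Valid cross-correlation (what CNN libraries call "convolution"): no kernel flip.
H = A.shape[0] - B.shape[0] + 1
W = A.shape[1] - B.shape[1] + 1
C = np.zeros((H, W), dtype=int)
for i in range(H):
    for j in range(W):
        C[i, j] = np.sum(A[i:i+2, j:j+2] * B)

print(C)  # [[2 1 1]
          #  [2 2 1]
          #  [2 3 3]]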
Convolutional Layer
• The size of the input is H_I x W_I x C, where H_I is the height, W_I the width, and C the number of channels.
• Suppose the size of each filter is H_F x W_F x C, where H_F and W_F are the height and width of the filters, and there are N filters.
• Suppose the padding size at each margin is P and the stride size is S.
• The output is H_O x W_O x N, where
$H_O = \frac{H_I - H_F + 2P}{S} + 1, \qquad W_O = \frac{W_I - W_F + 2P}{S} + 1$
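A small helper (illustrative only; the variable names mirror the slide's symbols) that evaluates the output-size formula above, checked against the earlier 28x28 input with 5x5 receptive fields:

def conv_output_size(h_i, w_i, h_f, w_f, padding=0, stride=1):
    # H_O = (H_I - H_F + 2P) / S + 1, and likewise for the width.
    h_o = (h_i - h_f + 2 * padding) // stride + 1
    w_o = (w_i - w_f + 2 * padding) // stride + 1
    return h_o, w_o

print(conv_output_size(28, 28, 5, 5))            # (24, 24), as on the earlier slide
print(conv_output_size(28, 28, 5, 5, stride=2))  # (12, 12) with a stride length of 2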
Feature map generation
33
Feature extraction layer or
Convolution layer
• The shared weights allow the network to detect the same feature at different positions in the input image, producing feature maps.
34
Shared weights and biases
• Each hidden neuron has a bias and 5x5 weights connected to its local
receptive field.
• The same weights and bias are used for each of the hidden neurons, so the output of the (j, k)-th hidden neuron is
$\sigma\!\left(b + \sum_{l=0}^{4}\sum_{m=0}^{4} w_{l,m}\, a_{j+l,\,k+m}\right)$
• All the neurons in the first hidden layer therefore detect exactly the same feature
35
Shared weights and biases
• Suppose the weights and bias are such that the hidden neuron can pick out,
say, a vertical edge in a particular local receptive field. That ability is also
likely to be useful at other places in the image. And so it is useful to apply
the same feature detector everywhere in the image.
• The mapping from the input layer to the hidden layer is called as feature
map. The weights defining the feature map are called as shared weights.
And we call the bias defining the feature map in this way the shared bias
36
Shared weights and biases
• For image recognition we'll need more than one feature map. And so a
complete convolutional layer consists of several different feature maps as
shown below.
37
Advantages of shared weights and
biases:
• A big advantage of sharing weights and biases is that it greatly reduces the
number of parameters involved in a convolutional network.
• In the example, for each feature map we need 25 shared weights plus a
single shared bias, so each feature map requires 26 parameters. If we have
20 feature maps, that's a total of 20x26 = 520 parameters.
• That, in turn, will result in faster training for the convolutional model, and,
ultimately, will help us build deep networks using convolutional layers.
38
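For contrast, a minimal sketch (the fully connected comparison is a hypothetical addition, not from the slides) of the parameter count of the 20-feature-map convolutional layer versus a fully connected layer producing the same number of outputs from a 28x28 input:

kernel_h, kernel_w, n_maps = 5, 5, 20

conv_params = n_maps * (kernel_h * kernel_w + 1)    # 20 * (25 weights + 1 bias)
print(conv_params)                                   # 520, as stated above

# A fully connected layer from 28*28 inputs to 24*24*20 outputs, for contrast:
fc_params = (28 * 28) * (24 * 24 * 20) + (24 * 24 * 20)
print(fc_params)                                     # 9,043,200 parameters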
Activation function
• A mathematical function applied to the obtained feature map to introduce nonlinearity.
39
Rectified linear unit (ReLU)
• The neurons in the feature map after the convolution layer pass
through a nonlinear activation function.
• Use a rectified linear unit (ReLU) layer instead of the traditional
sigmoid or tanh functions:
$\mathrm{ReLU}(x) = \begin{cases} x, & x \ge 0 \\ 0, & x < 0 \end{cases}$
• Faster convergence than sigmoid or tanh.
Example (ReLU applied element-wise):
Input:
 1 -4  3 -2
-1  1 -2  1
-1 -2  3 -1
 2 -2  1  4
Output:
1 0 3 0
0 1 0 1
0 0 3 0
2 0 1 4
41
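The element-wise ReLU above can be reproduced with a one-line NumPy operation (illustrative sketch):

import numpy as np

x = np.array([[ 1, -4,  3, -2],
              [-1,  1, -2,  1],
              [-1, -2,  3, -1],
              [ 2, -2,  1,  4]])

relu = np.maximum(0, x)   # ReLU(x) = max(0, x), applied element-wise
print(relu)
# [[1 0 3 0]
#  [0 1 0 1]
#  [0 0 3 0]
#  [2 0 1 4]]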
ReLU
ReLU operation.
Source: http://mlss.tuebingen.mpg.de/2015/slides/fergus/Fergus_1.pdf
42
Pooling layers
• Pooling layers are usually used immediately after convolutional layers.
The pooling layers simplify the information in the output from the
convolutional layer.
• A pooling layer takes each feature map output from the convolutional
layer and prepares a condensed feature map. For instance, each unit in
the pooling layer may summarize a region of (say) 2x2 neurons in the
previous layer
43
Pooling layers
• Note that since we have 24x24 neurons output from the convolutional
layer, after 2x2 pooling we have 12x12 neurons.
44
Pooling method
• Let Q be a pooling region and A_Q the set of activations in Q.
• Max pooling: $A_P = \max(A_Q)$
• Average pooling: $A_P = \dfrac{1}{|A_Q|}\sum_{a_i \in A_Q} a_i$
• Stochastic pooling: build a probability map $p_i = \dfrac{a_i}{\sum_{a_j \in A_Q} a_j}$ and sample one activation according to it.
• Three advantages.
[Figure: a 3x3 pooling region with activations {0, 0.8, 0; 2.3, 0, 0; 0, 0.22, 0}; average pooling gives about 0.4, max pooling gives 2.3, and stochastic pooling returns an activation sampled from the probability map.]
48
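A minimal NumPy sketch of the three pooling variants on the 3x3 region shown above (illustrative; the stochastic result is random by design):

import numpy as np

region = np.array([[0.0, 0.8,  0.0],
                   [2.3, 0.0,  0.0],
                   [0.0, 0.22, 0.0]])

acts = region.ravel()
print(acts.max())    # max pooling     -> 2.3
print(acts.mean())   # average pooling -> ~0.37 (shown as 0.4 on the slide)

# Stochastic pooling: sample one activation according to the probability map p_i.
probs = acts / acts.sum()
rng = np.random.default_rng(0)
print(rng.choice(acts, p=probs))   # e.g. 2.3 or 0.8, chosen with probability p_i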
Fully connected layer and
Softmax layer
Fully connected layer:
• connects every neuron in one layer to every neuron in another layer.
• It is in principle the same as the traditional multi-layer perceptron neural network.
Softmax layer:
• J represents the probability function and f is a given class. The probability of sample x belonging to class f is
$J(f \mid x) = \dfrac{J(x \mid f)\, J(f)}{\sum_{k=1}^{F} J(x \mid k)\, J(k)}$
• Writing $A_f(x) = \ln\!\big(J(x \mid f)\, J(f)\big)$, we have
$J(f \mid x) = \dfrac{\exp A_f(x)}{\sum_{k=1}^{F} \exp A_k(x)}$
50
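A tiny sketch of the softmax computation in the second formula (illustrative; the example scores are the logits that appear in the later RNN example):

import numpy as np

def softmax(a):
    # Subtract the max for numerical stability; exp / sum(exp) as in the formula above.
    e = np.exp(a - a.max())
    return e / e.sum()

scores = np.array([1.8003, 1.0548, 0.8069, 1.0417])
print(softmax(scores))   # ~[0.432, 0.205, 0.160, 0.203], summing to 1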
Deep convolutional networks
• These networks use a special architecture which tries to take
advantage of the spatial structure and is particularly well adapted to
classify images.
51
Designing of small CNN example
Architecture: input -> Conv 7x7, 8 filters -> ReLU + Pool -> Conv 5x5, 16 filters -> ReLU + Pool -> Flatten -> Dense(10)
__________________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
Conv2D (None, 22, 22, 8) 400
__________________________________________________________________________
MaxPooling (None, 11, 11, 8) 0
__________________________________________________________________________
Conv2D (None, 7, 7, 16) 3216
__________________________________________________________________________
MaxPooling (None, 3, 3, 16) 0
__________________________________________________________________________
Flatten (None, 144) 0
__________________________________________________________________________
Dense (None, 10) 1450
=================================================================
Total params: 5,066
Trainable params: 5,066
Non-trainable params: 0
_________________________________________________________________
52
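A hedged Keras sketch that, assuming a 28x28x1 input (e.g. MNIST), reproduces the layer shapes and parameter counts in the summary above (400, 3,216 and 1,450 parameters; 5,066 in total):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(8, (7, 7), activation='relu'),    # (22, 22, 8):  7*7*1*8 + 8   = 400 params
    layers.MaxPooling2D((2, 2)),                    # (11, 11, 8)
    layers.Conv2D(16, (5, 5), activation='relu'),   # (7, 7, 16):   5*5*8*16 + 16 = 3,216 params
    layers.MaxPooling2D((2, 2)),                    # (3, 3, 16)
    layers.Flatten(),                               # 144
    layers.Dense(10, activation='softmax'),         # 144*10 + 10 = 1,450 params
])
model.summary()   # Total params: 5,066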
Visual results
53
LeNet5
Introduced by LeCun.
54
LeNet5
C1,C3,C5 : Convolutional layer.
5 × 5 Convolution matrix.
S2 , S4 : Subsampling layer.
Subsampling by factor 2.
F6 : Fully connected layer.
55
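A hedged Keras sketch of a LeNet-5-style network following the layer list above (a simplification: it uses standard average-pooling subsampling and a softmax output, and omits the original's partial C3 connectivity and RBF output layer, so it approximates rather than reproduces the 1998 model):

from tensorflow.keras import layers, models

lenet5 = models.Sequential([
    layers.Input(shape=(32, 32, 1)),
    layers.Conv2D(6, (5, 5), activation='tanh'),      # C1: 6 feature maps, 5x5 convolution
    layers.AveragePooling2D((2, 2)),                  # S2: subsampling by factor 2
    layers.Conv2D(16, (5, 5), activation='tanh'),     # C3
    layers.AveragePooling2D((2, 2)),                  # S4
    layers.Conv2D(120, (5, 5), activation='tanh'),    # C5
    layers.Flatten(),
    layers.Dense(84, activation='tanh'),              # F6: fully connected layer
    layers.Dense(10, activation='softmax'),           # output (in place of the RBF layer)
])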
LeNet5
All the units of the layers up to F6 have a sigmoidal (scaled tanh)
activation function of the type
$y_j = \varphi(v_j) = A \tanh(S v_j)$
56
LeNet5
The output layer computes, for each class j, the squared Euclidean distance between the F6 output vector and a class weight vector:
$Y_j = \sum_{i=1}^{84} (F_i - W_{ij})^2, \qquad j = 0, \dots, 9$
57
LeNet5
58
LeNet5
59
LeNet5
60
Disadvantages of CNNs
From a memory and capacity standpoint the CNN is not much bigger
than a regular two-layer network.
CNNs are about 3x slower than their fully connected equivalents
(size-wise).
61
Solution
ReLU (Rectified Linear Units) nonlinearity
  Standard way: f(x) = tanh(x) or f(x) = (1 + e^{-x})^{-1} (logistic function)
  ReLU: f(x) = max(0, x)
  Training converges about 6 times faster than with the hyperbolic tangent function
Multiple GPUs
  Larger memory
  Parallel computing
  Control of communication
Use a pre-trained network (transfer learning)
62
Transfer learning (inductive transfer)
• storing knowledge gained while solving one problem
• applying it to a different but related problem
• For example, knowledge gained while learning to recognize cars could
apply when trying to recognize trucks.
• Related to the psychological field of "transfer of learning".
65
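A hedged Keras sketch of the usual transfer-learning recipe (illustrative; it assumes ImageNet weights for VGG16 and a hypothetical 2-class target task such as cars vs. trucks):

from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Load a network pre-trained on ImageNet, without its classification head.
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False                      # freeze the transferred knowledge

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(2, activation='softmax'),  # new head for the related (target) problem
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])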
Other well-known networks
Year  | CNN             | Developed by                                     | No. of parameters
1998  | LeNet (8)       | Yann LeCun et al.                                | 60 thousand
2012  | AlexNet (7)     | Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton | 60 million
2013  | ZFNet           | Matthew Zeiler, Rob Fergus                       | -
2014  | GoogLeNet (19)  | Google                                           | 4 million
2014  | VGG Net (16)    | Simonyan, Zisserman                              | 138 million
2015  | ResNet (152)    | Kaiming He                                       | -
66
Auto-Encoders
• A basic auto-encoder is an unsupervised learning algorithm for neural networks,
trained with the backpropagation algorithm. It maps the input vector (say x) to a
hidden representation using the function z = f(Px + b), parameterized by {P, b},
where f is the activation function (linear, sigmoid, etc.). The hidden
representation is then mapped linearly to the output using x' = Wz + b'. The
parameters are optimized to minimize the mean squared error over all training points.
67
Auto-Encoders
An auto-encoder is trained, with an absolutely standard weight-adjustment
algorithm, to reproduce the input. It is a simple 3-layer neural network where the
output units are directly connected back to the input units:
X --f--> Y --g--> X
where Y is the hidden layer's output vector and X is the input vector.
68
Auto-Encoders
An auto-encoder consists of two parts, an encoder (φ) and a decoder (ψ), which
map the input space X to a code space F and back: φ: X → F, ψ: F → X.
We aim to find encoder and decoder parameters such that
φ, ψ = arg min_{φ,ψ} ||X − (ψ ∘ φ)(X)||².
The code space, also called the latent space, can be defined as a basic neural
network layer, z = σ(Wx + b), and x can be reconstructed from z as
x' = σ'(W'z + b'), where σ', W', b' may differ from σ, W, b depending upon the
design of the network.
69
Auto-Encoders
• The task of training is to minimize the reconstruction error, i.e. to find the
most efficient compact representation (encoding) of the input data.
70
Types of Auto-Encoders
1. De-noising auto-encoder: takes a partially corrupted input and is trained
to reconstruct the original input.
2. Sparse auto-encoder: the number of hidden nodes may exceed the number of
inputs, but only a small number of nodes are kept active at a time. The
k-sparse auto-encoder is based on an auto-encoder with linear activation
functions and tied weights. In the feedforward phase, after computing the
hidden code z = Wx + b, rather than reconstructing the input from all of the
hidden units, we identify the k largest hidden units and set the others to
zero.
71
Loss Function for Auto-Encoders
For binary input the reconstruction loss is the well-known cross-entropy loss,
$L(x, \hat{x}) = -\sum_i \big[x_i \log \hat{x}_i + (1 - x_i)\log(1 - \hat{x}_i)\big]$
In the case of binary data this loss function becomes a convex optimization problem.
72
Auto encoder example on MNIST
data-set
73
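A hedged Keras sketch of a basic MNIST auto-encoder of the kind shown above (illustrative choices: a 32-dimensional code, sigmoid outputs, and the binary cross-entropy loss from the previous slide):

from tensorflow.keras import layers, models, datasets

(x_train, _), (x_test, _) = datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
x_test = x_test.reshape(-1, 784).astype('float32') / 255.0

autoencoder = models.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(32, activation='relu'),       # encoder: code (latent) representation z
    layers.Dense(784, activation='sigmoid'),   # decoder: reconstruction x'
])
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
autoencoder.fit(x_train, x_train, epochs=5, batch_size=256,
                validation_data=(x_test, x_test))   # input is also the target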
Deep Stacked Auto-Encoder
Why we need deep (stacked) auto-encoder training:
• To initialize the network, hence reducing classification learning time
• To alleviate the memorization problem (overfitting)
74
Ladder-wise Training
75
End-to-end Training
76
De-noising Auto-Encoders
77
Applications of Auto-Encoders
1. Dimensionality reduction
2. Improving the performance of many machine learning applications
3. Learning the best features for a given number of hidden nodes
4. Information retrieval
5. Data visualization
6. Feature extraction
   i. Prediction accuracy
   ii. Speed of prediction
   iii. Memory requirements
7. Semantic hashing
8. Unsupervised pre-training
78
Recurrent neural networks
Motivation
Most real-world problems are sequential in nature, i.e. the order of events matters.
Examples:
• Speech - the order of words matters for a faithful translation of the sentence.
• Activity - the order of movements of a system may define different activities.
• Videos - these can be modeled as sequences of images.
• We need a model that extracts features across inputs.
• Models like DNNs and CNNs assume IID training data, so during learning the
activation of a neuron is affected only by the current input.
• We need to find a way for the model to pass information to its subsequent
'states'.
79
Sequence models
One thing that comes intuitively is that the model must learn some sort of
state-space representation to model sequences.
80
Recurrent neural network
The design of RNNs is quite simple. There is an extra self-loop input
along with the input from the previous layer, which can be unrolled along the
temporal axis to get a better understanding.
NOTATIONS
81
Forward propagation
Using the notation of the following slides, at each time step t:
$a^{<t>} = f(W_{ax} x^{<t>} + W_{aa} a^{<t-1>} + b), \qquad o^{<t>} = g(W_{ya} a^{<t>})$
where f is typically tanh and g is the softmax function.
83
Back-propagation through time
The back-propagation equations are:
$\nabla_{a^{<i>}} L = W_{aa}^{T}\, f'(a^{<i+1>})\, \nabla_{a^{<i+1>}} L + W_{ya}^{T}\, \nabla_{o^{<i>}} L$
$\nabla_{W_{ya}} L = \sum_i \nabla_{o^{<i>}} L \, (a^{<i>})^{T}$
$\nabla_{W_{aa}} L = \sum_i f'(a^{<i+1>})\, \nabla_{a^{<i+1>}} L \, (a^{<i>})^{T}$
$\nabla_{W_{ax}} L = \sum_i f'(a^{<i+1>})\, \nabla_{a^{<i+1>}} L \, (x^{<i+1>})^{T}$
Notes:
• $\nabla_{a^{<i>}} L$ for $i = T_{max}$ does not contain the first term.
• The derivation assumes $g(\cdot)$ is the softmax function and the cross-entropy loss is used.
• $f'(x)$ is a diagonal matrix containing the derivatives of $f(x)$ as elements.
84
A numerical example
Let's see how the above structure can be used to predict the fifth letter in the word
"hello". In this structure, the blue RNN block applies a recurrence formula to the
input vector and also to its previous state. In this case, the letter "h" has nothing
preceding it, so consider the letter "e". At the time the letter "e" is supplied to
the network, the recurrence formula is applied to the letter "e" and the previous
state, which comes from the letter "h". These are known as the different time steps
of the input. So if at time t the input is "e", at time t-1 the input was "h". The
recurrence formula is applied to both e and h, and we get a new state.
85
One-hot encoding of vocabulary
"hello" has 4 distinct characters, which are going to be used for prediction.
Our entire vocabulary is {h, e, l, o}, and hence we can easily one-hot
encode the inputs.
h e l o
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
86
Epoch 1
• Epoch 1: $a^{1} = f(W_{ax} x^{1} + W_{aa} a^{0} + b)$, with $f = \tanh$, $a^{0} = [0, 0, 0]^T$ and $x^{1}$ the one-hot encoding of "h" $= [1, 0, 0, 0]^T$.
$W_{ax}$ =
0.287027  0.84606   0.572392  0.486813
0.902874  0.871522  0.691079  0.18998
0.537524  0.09224   0.558159  0.491528
$W_{ax} x^{1} = [0.287027, 0.902874, 0.537524]^T$, giving $a^{1} = [0.693168, 0.899554, 0.802118]^T$.
87
Epoch 1
• $y^{1} = g(W_{ya} a^{1})$, with $g$ = softmax.
$W_{ya}$ =
0.37168  0.974829459  0.830034886
0.39141  0.282585823  0.659835709
0.64985  0.09821557   0.334287084
0.91266  0.32581642   0.144630018
$W_{ya} a^{1} = [1.8003, 1.0548, 0.8069, 1.0417]^T$
softmax gives $y^{1} = [0.4323, 0.2051, 0.1601, 0.2025]^T$
Target "e" $= [0, 1, 0, 0]^T$ → ERROR (cross-entropy between prediction and target).
88
Epoch 2
• $a^{2} = f(W_{ax} x^{2} + W_{aa} a^{1} + b)$, with $x^{2}$ the one-hot encoding of "e" $= [0, 1, 0, 0]^T$, $W_{aa} = 0.427043$, $b = 0.567$, $f = \tanh$.
$W_{ax} x^{2} = [0.8461, 0.8715, 0.0922]^T$, $a^{1} = [0.693168, 0.899554, 0.802118]^T$
$a^{2} = \tanh(W_{ax} x^{2} + W_{aa} a^{1} + b) = [0.93653, 0.94910, 0.76234]^T$
89
Epoch 2
• $y^{2} = g(W_{ya} a^{2})$
$W_{ya} a^{2} = [1.9061, 1.1378, 0.9567, 1.2742]^T$
softmax gives $y^{2} = [0.4197, 0.1947, 0.1624, 0.2231]^T$
Target "l" $= [0, 0, 1, 0]^T$ → ERROR (cross-entropy between prediction and target).
90
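The two time steps above can be reproduced with a short NumPy sketch (illustrative; the weights are taken from the slides, with $W_{aa}$ treated as the scalar 0.427043 applied element-wise, as in the example):

import numpy as np

W_ax = np.array([[0.287027, 0.84606,  0.572392, 0.486813],
                 [0.902874, 0.871522, 0.691079, 0.18998 ],
                 [0.537524, 0.09224,  0.558159, 0.491528]])
W_ya = np.array([[0.37168, 0.974829459, 0.830034886],
                 [0.39141, 0.282585823, 0.659835709],
                 [0.64985, 0.09821557,  0.334287084],
                 [0.91266, 0.32581642,  0.144630018]])
W_aa, b = 0.427043, 0.567

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

one_hot = {'h': [1, 0, 0, 0], 'e': [0, 1, 0, 0], 'l': [0, 0, 1, 0], 'o': [0, 0, 0, 1]}

a = np.zeros(3)                           # a^0
for ch in 'he':                           # the two steps worked out on the slides
    x = np.array(one_hot[ch], dtype=float)
    a = np.tanh(W_ax @ x + W_aa * a + b)  # a^{<t>} = f(W_ax x^{<t>} + W_aa a^{<t-1>} + b)
    y = softmax(W_ya @ a)                 # o^{<t>} = g(W_ya a^{<t>})
    print(np.round(y, 4))
# step "h": approximately [0.4323 0.2051 0.1601 0.2025]  -> target "e"
# step "e": approximately [0.4197 0.1947 0.1624 0.2231]  -> target "l"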
Vanishing Gradient Problem
• The vanishing gradient problem is not unique to RNNs; it occurs in any DL
model with too many layers for the gradient to propagate through. In other models
the number of layers is a hyperparameter and so can be adjusted as needed, but in an
RNN the unrolled depth is input-dependent and thus cannot be modified.
• The term vanishing gradient refers to the fact that at each layer the gradients
are multiplied by numbers less than 1, hence becoming smaller and smaller and
finally leading to computational underflow; as a result, layers are effectively
affected only by states close to them.
91
• This handicaps the ability of RNNs to learn features across very long sequences,
rendering them useless in several problems. Thus, there is a need for a
model that knows how to memorize data that can be useful in the future.
• Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM)
models are improvements over the vanilla RNN model that use the concept of a
memory cell to store information that might be useful in the future, mitigating
the issue.
92
Long-Short Term Memory (LSTM)
Two major characteristics of LSTM:
● Information persistence: achieved using cell states. These are like
conveyor belts that run across time and through which information flows.
Cell state: at each step the old cell state, C_{t-1}, is updated into the
new cell state C_t.
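A hedged Keras sketch of an LSTM applied to the earlier "hello" next-character task (illustrative; the 4-symbol vocabulary is from the slides, but the tiny hidden size, epoch count and training setup are assumptions):

import numpy as np
from tensorflow.keras import layers, models

# Vocabulary {h, e, l, o}; inputs are one-hot vectors, targets are the next character.
X = np.eye(4)[[0, 1, 2, 2]].reshape(1, 4, 4)   # "h e l l" as a sequence of one-hot rows
y = np.eye(4)[[1, 2, 2, 3]].reshape(1, 4, 4)   # "e l l o" (next character at each step)

model = models.Sequential([
    layers.Input(shape=(4, 4)),                # (time steps, vocabulary size)
    layers.LSTM(8, return_sequences=True),     # the cell state carries information across time
    layers.Dense(4, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.fit(X, y, epochs=300, verbose=0)
print(model.predict(X).argmax(axis=-1))        # ideally [[1 2 2 3]], i.e. "e l l o"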
Global Interpretability:
Understand the overall work of an AI model
Local Interpretability:
Focusing on explaining individual predictions
Example: In medical imaging, local interpretability might involve
identifying specific features in an image that led to a particular
classification.
○ Set function v(S) determining the gain for any subset S of players.
● Shapley values are a fair way to attribute the total gain to the
players based on their contributions
○ Concept: Marginal contribution of a player to a subset of other players (v(S U
{i}) - v(S))
References
• A. Gulli, S. Pal, Deep Learning with Keras (Packt Publishing, Birmingham, 2017)
• S. Axler, Linear Algebra Done Right (Springer, New York, 2015)
• R. Vidal, Y. Ma, S. Sastry, Generalized Principal Component Analysis (Springer, London, 2016)
• I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (MIT Press, Cambridge, 2016)
• D.H. Ballard, Modular learning in neural networks, in AAAI-87 Proceedings (AAAI, 1987), pp. 279–284
• Y. LeCun, Modeles connexionnistes de l'apprentissage (Connectionist Learning Models) (Université P. et M. Curie (Paris 6), 1987)
• P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.-A. Manzagol, Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11, 3371–3408 (2010)
• Q.V. Le, M.A. Ranzato, R. Monga, M. Devin, K. Chen, G.S. Corrado, J. Dean, A.Y. Ng, Building high-level features using large scale unsupervised learning, in Proceedings of the 29th International Conference on Machine Learning, ICML (2012)
Thank You