
Deep Learning

By
Dr. Ratnakar Dash
CSE Dept, NIT Rourkela

Neural Networks and Deep Learning


Machine learning
In machine learning, the agent is a computer.

What we do is:
(1) take some data,
(2) train a model on that data, and
(3) use the trained model to make predictions on new data.

Neural Networks and Deep Learning

2
Deep architecture in human
The human visual pathway is itself a deep architecture:
• Retina – pixels
• Lateral Geniculate Nucleus
• Visual cortex V1 – edges, corners
• V2 – simple shapes
• V4 – high-level abstractions
Neural Networks and Deep Learning

3
Shallow learning
• SVM

• Linear & Kernel Regression

• Hidden Markov Models (HMM)

• Gaussian Mixture Models (GMM)

• Single hidden layer MLP

• ...

 Limited modeling capability of concepts

 Cannot make use of unlabeled data

Neural Networks and Deep Learning

4
Deep learning
We learn the patterns or structures from data using some sort of optimization.

(Slide diagram: the WHAT, WHEN, WHY, WHERE and HOW of deep learning, covered in the following slides.)

Neural Networks and Deep Learning

5
What – Deep learning definition
• Deep learning is a kind of representation learning in which there are
multiple levels of features. These features are automatically
discovered and composed together across the various levels to
produce the output.

• Each level represents abstract features that are discovered from the
features represented in the previous level. Hence, the level of
abstraction increases with each level.

• This type of learning enables discovering and representing higher-
level abstractions. In neural networks, the multiple layers
correspond to multiple levels of features. These multiple layers
compose the features to produce the output.

Neural Networks and Deep Learning

6
What – Deep learning definition
• Deep learning is a class of machine learning algorithms that use a
cascade of multiple layers of nonlinear processing units for feature
extraction and transformation.

• But it has commonly become a buzzword for ANNs with more
than one hidden layer.

Neural Networks and Deep Learning

7
When - Deep learning timeline

Although the architectures have existed since the 2000s, the 2012 success of GPU-based
implementations of DL models on complex tasks led the revolution. One of them was
the AlexNet model, which won the ImageNet competition that year.

Neural Networks and Deep Learning

8
Why – Deep learning needed
The fundamental question remains: is deep learning more
powerful than simpler models?

The answer can obviously be guessed to be yes, but why is the real question.

A simple explanation comes from circuit theory, which states that the
expressibility of a logic network increases exponentially with the number of layers.
Hence, more layers means more ability. At the same time, this says that a
network with a single large layer can have the same power as multiple small
layers.

So why not use one large layer?

• A lot of weights to be trained.
• Infeasible matrix sizes.

Neural Networks and Deep Learning

9
Why – Deep learning needed
Another argument in favor of DL comes from analyzing what the hidden
layers have learned over training. The general trend observed is that each
layer tries to combine features of the previous layer to produce a semantically
higher-level feature.

For example, in a face recognition system, the first layer might extract edges and
corners, whereas the second layer combines them to get features like ears, noses,
and eyes, and the final layer combines those to generate features of the
overall face.

Neural Networks and Deep Learning

10
Where – Deep learning applications
APPLICATIONS ARE INFINITELY MANY
(The slide image illustrates the basic difference between situations where DL should be applied over other models.)

Neural Networks and Deep Learning

11
Deep algorithms
• An algorithm is deep if the input is passed through several non-linearities
before being output.

Now why do we need this so-called 'deep' thing?

We were fine with what we had.

For intuition, imagine you are told to send a letter to your aunt living in some
other city.

Neural Networks and Deep Learning

12
Deep algorithms (contd.)
• A deep architecture would be like:
 You drop it in a postbox
 The postman in your city collects it and drops it at the distribution hub
 The hub sends it via a transportation network
 It is collected at the other city's distribution hub
 The other postman collects it and gives it to your aunt
 Wouldn't it be more compact and gracefully reuse previous efforts?
• Now imagine the work was shallow:
 You had to travel all the way to your aunt to give the letter
 Wouldn't it be too complex and involve a lot of duplication of effort?

Neural Networks and Deep Learning

13
How deep learning?
• Imagine that we train a neural network which has:

– an input layer

– one hidden layer

– an output which is the same as the input

• And we require that the hidden layer:

– has fewer nodes than the input/output layers

– is sparse, i.e. the nodes usually output 0, and only sometimes output >0.05

• This is called an autoencoder (or autoassociator).

Neural Networks and Deep Learning

14
How deep learning? (contd.)
• So then your final network might look like:

– input layer

– layer from first autoencoder

– layer from second autoencoder

– ...

– layer from nth autoencoder

– output layer

• Now, if we have a lot of labeled data we can then "fine-tune" this network.

– i.e. use those layers from autoencoders as the first generation of a big
neural network and then

– run a lot of generations of backpropagation

Neural Networks and Deep Learning

15
Deep neural networks
• Mathematical proofs show that, for some functions,
shallow circuits require exponentially more circuit elements to
compute than deep circuits do.

• Deep circuits thus can be intrinsically much more powerful than
shallow circuits. It is expected that networks with many more hidden
layers are more powerful.

Neural Networks and Deep Learning

16
Deep neural networks (contd.)

These networks use the intermediate layers to build up multiple


layers of abstraction

Neural Networks and Deep Learning

17
Deep neural networks (contd.)
• For instance, if we're doing visual pattern recognition, then the
neurons in the first layer might learn to recognize edges, the neurons
in the second layer could learn to recognize more complex shapes,
say triangles or rectangles, built up from edges. The third layer would
then recognize still more complex shapes, and so on.

• These multiple layers of abstraction give deep networks an
advantage in learning to solve complex pattern recognition problems.

Neural Networks and Deep Learning

18
Convolutional neural networks
• In 1995, Yann LeCun and Yoshua Bengio introduced the concept of
convolutional neural networks.

• How to produce good internal representations of the visual world to


support recognition...

– detect and classify objects into categories, independent of pose,


scale, illumination, conformation, occlusion and clutter

• Previously in Computer Vision: Hand-crafted feature extractor

• Now in Computer Vision: Learn suitable representations of images

• Argument: Biological vision is hierarchically organized

Neural Networks and Deep Learning

88
History

Yann LeCun, Professor of Computer Science


The Courant Institute of Mathematical Sciences
New York University
Room 1220, 715 Broadway, New York, NY 10003, USA.
(212)998-3283 yann@cs.nyu.edu

In 1995, Yann LeCun and Yoshua Bengio introduced the concept of


convolutional neural networks.

Neural Networks and Deep Learning

20
About CNNs
• CNNs were neurobiologically motivated by the findings of locally
sensitive and orientation-selective nerve cells in the visual cortex.

• LeCun and Bengio designed a network structure that implicitly extracts relevant features.

• Convolutional Neural Networks are a special kind of multi-layer neural
network. A CNN is a feed-forward network that can extract topological
properties from an image.

• Like almost every other neural network, they are trained with a version of
the back-propagation algorithm.

• Convolutional Neural Networks are designed to recognize visual patterns
directly from pixel images with minimal preprocessing.

Neural Networks and Deep Learning

21
How CNNs work

• Example of a network with many convolutional layers.

• Filters are applied to each training image at different resolutions.
• The output of each convolved image is used as the input to the next layer.

Neural Networks and Deep Learning


Common layers in CNN

• Convolution layer

• ReLU Layer

• Pooling Layer

• Fully Connected Layer

• Softmax Layer

Neural Networks and Deep Learning


CNN’s topology
(Slide diagram: feature maps produced by alternating C layers — feature extraction / convolution layers — and S layers — subsampling layers that provide shift and distortion invariance.)

Neural Networks and Deep Learning

24
Local receptive fields

In a fully connected network the inputs are shown as a vertical line of neurons. In a convolutional net
the inputs are instead shown as a 28x28 square of neurons, whose values correspond
to the 28x28 pixel intensities.

Neural Networks and Deep Learning

25
Local receptive fields
• But we won't connect every input pixel to every hidden neuron.
Instead, we only make connections in small, localized regions of the
input image. Each neuron in the first hidden layer will be connected to
a small region of the input neurons

Neural Networks and Deep Learning

26
Local receptive fields
• That region in the input image is called the local receptive field for the
hidden neuron. It's a little window on the input pixels

• Each connection learns a weight. And the hidden neuron learns an


overall bias as well. You can think of that particular hidden neuron as
learning to analyze its particular local receptive field.

• We then slide the local receptive field across the entire input image.
For each local receptive field, there is a different hidden neuron in the
first hidden layer.

Neural Networks and Deep Learning

27
Local receptive fields

Neural Networks and Deep Learning

28
Local receptive fields
• If we have a 28x28 input image and 5x5 local receptive fields, then there
will be 24x24 neurons in the hidden layer. This is because we can only
move the local receptive field 23 neurons across (or 23 neurons down)
before colliding with the right hand side (or bottom) of the input image.

• In this case, the local receptive field is moved by one pixel at a time. In
fact, sometimes a different stride length is used. For instance, we might
move the local receptive field 2 pixels to the right (or down), in which case
we'd say a stride length of 2 is used.

• The relation of the local receptive field to the hidden neuron can be defined by the convolution
operation.

Neural Networks and Deep Learning

29
Definition: convolution

• For complex-valued functions f, g defined on the set Z of integers, the discrete
convolution of f and g is given by

(f * g)[n] = Σ_m f[m] g[n − m]

• Example (computed as in CNNs, sliding the 2x2 filter B over the 4x4 input A):

A (4x4):      B (2x2):    C = A * B (3x3):
1 1 0 1       1 0         2 1 1
1 0 0 1       1 1         2 2 1
0 1 1 0                   2 3 3
1 1 1 1

Neural Networks and Deep Learning


Convolution operation

(Slide figure: a 5x5 input image convolved with a 3x3 filter produces the 3x3 feature map
4 3 4
2 4 3
2 3 4)

Neural Networks and Deep Learning

31
Convolutional Layer
• The size of the input is HI x WI x C, where HI is the height, WI the width, and C the number of channels.
• Suppose the size of each filter is HF x WF x C, where HF and WF are the height and width of the filters, and there are N filters.
• Suppose the padding size at each margin is P and the stride size is S.
• The output is HO x WO x N, where

HO = (HI − HF + 2P)/S + 1
WO = (WI − WF + 2P)/S + 1
Neural Networks and Deep Learning
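A quick sketch (not from the slides) of these output-size formulas in Python; the parameter names are just illustrative:

def conv_output_shape(h_i, w_i, c, h_f, w_f, n_filters, padding=0, stride=1):
    """Output size of a convolution layer: (H_I - H_F + 2P)/S + 1, likewise for the width."""
    h_o = (h_i - h_f + 2 * padding) // stride + 1
    w_o = (w_i - w_f + 2 * padding) // stride + 1
    return h_o, w_o, n_filters

# e.g. a 28x28x1 input with a 5x5 filter, no padding, stride 1 -> a 24x24 feature map
print(conv_output_shape(28, 28, 1, 5, 5, n_filters=1))  # (24, 24, 1)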
Feature map generation

Neural Networks and Deep Learning

33
Feature extraction layer or Convolution layer
• The same filter detects the same feature at different positions in the input image, producing a map of features.

Neural Networks and Deep Learning

34
Shared weights and biases
• Each hidden neuron has a bias and 5x5 weights connected to its local
receptive field.

• Same weights and bias are used for each of the hidden neurons.

• For the j,k-th hidden neuron, the output is:

σ( b + Σ_{l=0}^{4} Σ_{m=0}^{4} w_{l,m} a_{j+l, k+m} )

• All the neurons in the first hidden layer detect exactly the same feature.

Neural Networks and Deep Learning

35
Shared weights and biases
• Suppose the weights and bias are such that the hidden neuron can pick out,
say, a vertical edge in a particular local receptive field. That ability is also
likely to be useful at other places in the image. And so it is useful to apply
the same feature detector everywhere in the image.

• The mapping from the input layer to the hidden layer is called a feature
map. The weights defining the feature map are called shared weights,
and the bias defining the feature map in this way is called the shared bias.

Neural Networks and Deep Learning

36
Shared weights and biases
• For image recognition we'll need more than one feature map. And so a
complete convolutional layer consists of several different feature maps as
shown below.

Neural Networks and Deep Learning

37
Advantages of shared weights and
biases:
• A big advantage of sharing weights and biases is that it greatly reduces the
number of parameters involved in a convolutional network.

• In the example, for each feature map we need 25 shared weights plus a
single shared bias, so each feature map requires 26 parameters. If we have
20 feature maps, that's a total of 20x26 = 520 parameters.

• In comparison, a fully connected first layer requires at least 40 times as many parameters
as the convolutional layer.

• That, in turn, will result in faster training for the convolutional model, and,
ultimately, will help us build deep networks using convolutional layers.

Neural Networks and Deep Learning

38
Activation function
• A mathematical transformation applied to the obtained feature map.

• Why do we need it?

– To limit the feature values.

– To have a differentiable feature space, so that it can be learned by the gradient
descent method.

Neural Networks and Deep Learning

39
Rectified linear unit (ReLU)
• The neurons in the feature map
after the convolution layer pass
through a nonlinear activation
function (e.g. sigmoid).
• A rectified linear unit (ReLU)
layer is used rather than the
traditional sigmoid function.

ReLU(x) = x  if x ≥ 0
          0  if x < 0

Neural Networks and Deep Learning


Rectified linear unit (ReLU)
• Most widely used activation function.

• Faster convergence than tanh.

Example: applying ReLU elementwise to a feature map

Input:              Output:
 1 -4  3 -2          1  0  3  0
-1  1 -2  1          0  1  0  1
-1 -2  3 -1          0  0  3  0
 2 -2  1  4          2  0  1  4

Faster convergence with ReLU was reported by Krizhevsky et al. in "ImageNet Classification
with Deep Convolutional Neural Networks".

Neural Networks and Deep Learning

41
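A one-line NumPy sketch (illustrative, not from the slides) reproducing the elementwise ReLU example above:

import numpy as np

def relu(x):
    """Elementwise ReLU: max(0, x)."""
    return np.maximum(0, x)

fmap = np.array([[ 1, -4,  3, -2],
                 [-1,  1, -2,  1],
                 [-1, -2,  3, -1],
                 [ 2, -2,  1,  4]])
print(relu(fmap))  # negatives clamped to 0, positives unchanged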
ReLU

ReLU operation.
Source: http://mlss.tuebingen.mpg.de/2015/slides/fergus/Fergus_1.pdf

Neural Networks and Deep Learning

42
Pooling layers
• Pooling layers are usually used immediately after convolutional layers.
The pooling layers simplify the information in the output from the
convolutional layer.

• A pooling layer takes each feature map output from the convolutional
layer and prepares a condensed feature map. For instance, each unit in
the pooling layer may summarize a region of (say) 2x2 neurons in the
previous layer

• one common procedure for pooling is known as maxpooling. In


maxpooling, a pooling unit simply outputs the maximum activation.
Neural Networks and Deep Learning

43
Pooling layers
• Note that since we have 24×24 neurons output from the convolutional
layer, after pooling we have 12x12 neurons.

Neural Networks and Deep Learning

44
Pooling method
• Let Q be a pooling region and A_Q the set of activations in Q.

• Max pooling:      MP = max(A_Q)

• Average pooling:  AP = ( Σ_{a ∈ A_Q} a ) / |A_Q|

• Stochastic pooling:

– Probability map:  p_i = a_i / Σ_{a ∈ A_Q} a

– Output:  SP = a_k, where k ~ P(p_1, ..., p_i, ...)

Neural Networks and Deep Learning


Advantages of pooling
• The pooling layer (PL) replaces the outputs from the ReLU layer with a
summary statistic of local outputs.

• Three advantages:

– reduces the computation burden;

– makes the activations in the pooled map less sensitive to the
precise locations of structures than in the unpooled map;

– makes the representation invariant to slight
translation of the input.

Neural Networks and Deep Learning


Pooling example

Activation map (3x3 pooling region):
0    0.8  0
2.3  0    0
0    0.5  0

• Average pooling: (0.8 + 2.3 + 0.5) / 9 = 0.4
• Max pooling: 2.3
• Stochastic pooling: probability map
0     0.22  0
0.64  0     0
0     0.14  0
sampling (here k = 8) selects the activation 0.5.

Neural Networks and Deep Learning


Pooling layers output
• The convolutional layer usually involves more than a single feature map.
So max pooling is applied to each feature map separately.

Neural Networks and Deep Learning

48
Fully connected layer and Softmax layer

Fully connected layer:
• Connects every neuron in one layer to every neuron in another layer.
• It is in principle the same as the traditional multi-layer perceptron neural network.
• Each connection provides a weight matrix (W) and a bias vector (b).

Softmax layer:
• J represents the probability function, f is a given class, and J(f) and J(x|f) are the class
prior probability and the conditional probability of class f, respectively.
• The probability of sample x belonging to class f is

J(f|x) = J(x|f) J(f) / Σ_{k=1}^{F} J(x|k) J(k)

• Define A_f as

A_f = ln( J(x|f) J(f) )

• Then we have

J(f|x) = exp( A_f(x) ) / Σ_{k=1}^{F} exp( A_k(x) )

Neural Networks and Deep Learning
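A small sketch (illustrative, not from the slides) showing that applying softmax to the log-scores A_f recovers the class posteriors:

import numpy as np

def softmax(a):
    """Numerically stable softmax over a vector of scores A_f(x)."""
    a = a - np.max(a)
    e = np.exp(a)
    return e / e.sum()

# Hypothetical log-scores A_f = ln(J(x|f) J(f)) for F = 3 classes
A = np.array([-1.2, -0.3, -2.0])
print(softmax(A))          # class posteriors J(f|x)
print(softmax(A).sum())    # 1.0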
Putting all together

Neural Networks and Deep Learning

50
Deep convolutional networks
• These networks use a special architecture which tries to take
advantage of the spatial structure and is particularly well adapted to
classify images.

• This architecture makes convolutional networks fast to train. This, in


turns, helps us train deep, many layer networks, which are very good
at classifying images

• Convolutional neural networks use three basic ideas: local receptive


fields, shared weights, and pooling.

Neural Networks and Deep Learning

51
Designing of small CNN example

Architecture: Conv 7x7x8 + ReLU + Pool, then Conv 5x5x16 + ReLU + Pool, then Dense(10).

_________________________________________________________________
Layer (type)         Output Shape         Param #
=================================================================
Conv2D               (None, 22, 22, 8)    400
_________________________________________________________________
MaxPooling           (None, 11, 11, 8)    0
_________________________________________________________________
Conv2D               (None, 7, 7, 16)     3216
_________________________________________________________________
MaxPooling           (None, 3, 3, 16)     0
_________________________________________________________________
Flatten              (None, 144)          0
_________________________________________________________________
Dense                (None, 10)           1450
=================================================================
Total params: 5,066
Trainable params: 5,066
Non-trainable params: 0
_________________________________________________________________

Neural Networks and Deep Learning

52
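The summary above can be reproduced with a Keras model like the following sketch (assuming a 28x28 single-channel input, which matches the 22x22 output of a 7x7 valid convolution):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(8, (7, 7), activation="relu"),    # -> 22x22x8, 7*7*1*8 + 8 = 400 params
    layers.MaxPooling2D((2, 2)),                    # -> 11x11x8
    layers.Conv2D(16, (5, 5), activation="relu"),   # -> 7x7x16, 5*5*8*16 + 16 = 3216 params
    layers.MaxPooling2D((2, 2)),                    # -> 3x3x16
    layers.Flatten(),                               # -> 144
    layers.Dense(10, activation="softmax"),         # 144*10 + 10 = 1450 params
])
model.summary()   # Total params: 5,066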
Visual results

Neural Networks and Deep Learning

53
LeNet5
 Introduced by LeCun.

 raw image of 32 × 32 pixels as input.

Neural Networks and Deep Learning

54
LeNet5
 C1,C3,C5 : Convolutional layer.
 5 × 5 Convolution matrix.
 S2 , S4 : Subsampling layer.
 Subsampling by factor 2.
 F6 : Fully connected layer.

Neural Networks and Deep Learning

55
LeNet5
 All the units of the layers up to F6 have a sigmoidal
activation function of the type:

y_j = φ(v_j) = A tanh(S v_j)

Neural Networks and Deep Learning

56
LeNet5

 The output layer is composed of Euclidean RBF units, one per class, each computing:

Y_j = Σ_{i=1}^{84} (F_i − W_ij)^2 ,   j = 0, ..., 9

where F_i are the 84 outputs of layer F6 and W_ij are the parameters of the output units.

Neural Networks and Deep Learning

57
LeNet5

 About 187,000 connections.

 About 14,000 trainable weights.

Neural Networks and Deep Learning

58
LeNet5

Neural Networks and Deep Learning

59
LeNet5

Neural Networks and Deep Learning

60
Disadvantages of CNNs
 From a memory and capacity standpoint the CNN is not much bigger
than a regular two-layer network.

 At runtime the convolution operations are computationally expensive
and take up about 67% of the time.

 CNNs are about 3x slower than their fully connected equivalents (size-
wise).

Neural Networks and Deep Learning

61
Solution
 ReLU (Rectified Linear Units) nonlinearity
 Standard way: f(x) = tanh(x) or f(x) = (1 + e^(-x))^(-1) (logistic function)
 ReLU: f(x) = max(0, x)
 About 6 times faster than the hyperbolic tangent

 Multiple GPUs
 Larger memory
 Parallel computing
 Control of communication
 Use a pre-trained network (transfer learning)

Neural Networks and Deep Learning

62
Transfer learning (inductive transfer)
• Storing knowledge gained while solving one problem
• Applying it to a different but related problem
• For example, knowledge gained while learning to recognize cars could
apply when trying to recognize trucks.
• Related to the psychological field of “transfer of learning”.

Neural Networks and Deep Learning


Advantage of transfer learning
• Commonly used in deep learning applications. Take a pretrained
network and use it as a starting point to learn a new task.
• Fine-tuning a network with transfer learning is much faster and easier
than constructing and training a new network.
• You can quickly transfer learning to a new task using a smaller number
of training images.
• The advantage of transfer learning is that the pretrained network has
already learned a rich set of features.
• These features can be applied to a wide range of other similar tasks.

Neural Networks and Deep Learning
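As a hedged illustration (not from the slides) of this workflow in Keras, one might freeze a pretrained backbone and retrain only a new classification head; the backbone choice, class count, and dataset names below are placeholders:

from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

NUM_CLASSES = 5  # placeholder for the new task

# Pretrained backbone (ImageNet weights) used as a fixed feature extractor
base = MobileNetV2(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained layers

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(NUM_CLASSES, activation="softmax"),  # new head trained on the new task
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(new_task_images, new_task_labels, epochs=5)  # small labeled dataset suffices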


AlexNet

 Introduced for the ImageNet challenge

 Used two GPUs in parallel
 Used ReLU instead of the sigmoid function, which reduced the training time
Neural Networks and Deep Learning

65
Other well-known networks

Year   CNN             Developed by                                       No. of parameters
1998   LeNet (8)       Yann LeCun et al.                                  60 thousand
2012   AlexNet (7)     Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton   60 million
2013   ZFNet           Matthew Zeiler, Rob Fergus                         –
2014   GoogLeNet (19)  Google                                             4 million
2014   VGG Net (16)    Simonyan, Zisserman                                138 million
2015   ResNet (152)    Kaiming He                                         –

Neural Networks and Deep Learning

66
Auto-Encoders
• A basic autoencoder is an unsupervised neural-network learning algorithm
based on the backpropagation algorithm. It maps the input vector (say x) to a
hidden representation using the function z = f(Px + b), parameterized by
{P, b}, where f is the activation function (linear, sigmoid, etc.). The
hidden representation is then mapped linearly to the output using X' = Wz +
b'. The parameters are optimized to minimize the mean squared error over
all training points.

Neural Networks and Deep Learning

67
Auto-Encoders
An auto-encoder is trained, with an absolutely standard weight-adjustment
algorithm to reproduce the input. A simple 3-layer neural network where output
units are directly connected back to input units.

When you pass data through such a network,


•It first compresses (encodes) input vector to "fit" in a smaller representation
•Then tries to reconstruct (decode) it back.

X --f--> Y --g--> X̂, where Y is the hidden layer's output vector and X is the input vector.
Neural Networks and Deep Learning

68
Auto-Encoders
An auto-encoder consists of two
parts, an encoder (φ) and a decoder (ψ),
which map the input space (X) to the code space
(F); mathematically,
φ: X → F,  ψ: F → X,
and we aim to find encoder and
decoder parameters such that

φ, ψ = arg min_{φ,ψ} || X − (ψ ∘ φ) X ||²

The code space, also called the latent
space, can be defined as in a basic neural
network: z = σ(Wx + b), and x
can be reconstructed from z as x' =
σ'(W'z + b'), where σ', W', b' differ
from σ, W, b depending on the
design of the network.

Neural Networks and Deep Learning

69
Auto-Encoders
•The task of training is to minimize an error of reconstruction, i.e. find the
most efficient compact representation (encoding) for input data.

Neural Networks and Deep Learning

70
Types of Auto-Encoders
1. De-noising Auto-encoder: Takes a partially corrupted input and is trained
to reconstruct the original input.
2. Sparse Auto-encoder: The number of hidden nodes is larger than the
input, but only a small number of nodes are kept active at a time. The sparse
autoencoder is based on an auto-encoder with linear activation functions
and tied weights. In the feedforward phase, after computing the hidden
code z = Wx + b, rather than reconstructing the input from all of the
hidden units, we identify the largest hidden units and set the others to
zero.

3. Variational Auto-encoder: Variational auto-encoder models inherit the
auto-encoder architecture, but make strong assumptions concerning the
distribution of the latent variables. This is controlled by an additional term
in the loss function.
Neural Networks and Deep Learning

71
Loss Function for Auto-Encoders
For binary input:

L(f(x)) = − Σ_k ( x_k log(x̂_k) + (1 − x_k) log(1 − x̂_k) )

This is a very well-known loss function, called the cross-entropy loss. In the case of
binary data this loss function yields a convex optimization problem.

For real-valued input:

L(f(x)) = (1/2) Σ_k ( x_k − x̂_k )²

• Sum of squared differences (squared Euclidean distance)
• We may not use an activation function at the output.

Neural Networks and Deep Learning

72
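A compact NumPy sketch (illustrative, not from the slides) of these two reconstruction losses; x is the input and x_hat the reconstruction:

import numpy as np

def cross_entropy_loss(x, x_hat, eps=1e-12):
    """Reconstruction loss for binary inputs: -sum(x log x_hat + (1-x) log(1-x_hat))."""
    x_hat = np.clip(x_hat, eps, 1 - eps)   # avoid log(0)
    return -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

def squared_error_loss(x, x_hat):
    """Reconstruction loss for real-valued inputs: 0.5 * sum((x - x_hat)^2)."""
    return 0.5 * np.sum((x - x_hat) ** 2)

x     = np.array([1.0, 0.0, 1.0, 1.0])
x_hat = np.array([0.9, 0.2, 0.8, 0.6])
print(cross_entropy_loss(x, x_hat))   # ~1.06
print(squared_error_loss(x, x_hat))   # 0.125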
Auto encoder example on MNIST
data-set

Example of an Auto-Encoder on the MNIST dataset:

(a) presents the original images
(b) presents the reconstructed images.

Neural Networks and Deep Learning

73
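A minimal Keras sketch (an assumption for illustration, not the slides' code) of a single-hidden-layer autoencoder of this kind for MNIST-sized inputs:

from tensorflow.keras import layers, models

# Encoder maps the 784-dim input to a 32-dim code; decoder reconstructs the 784 pixels
autoencoder = models.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(32, activation="relu", name="encoder"),
    layers.Dense(784, activation="sigmoid", name="decoder"),
])
# Cross-entropy reconstruction loss, as discussed for binary/normalized pixel inputs
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)  # target = input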
Deep Stacked Auto-Encoder
Why we need Deep Encoder training:
• Initializing the network, hence reducing classification learning time
• Solving the memorization problem (overfitting)

Types of Deep Encoder training:


1. Ladder wise pre-training
2. End-to-end pre-training

Types of Regularization in Deep Encoder training:


1. De-noising Auto-Encoder
2. Sparse Auto-Encoder
Neural Networks and Deep Learning

74
Ladder-wise Training

Neural Networks and Deep Learning

75
End-to-end Training

Neural Networks and Deep Learning

76
De-noising Auto-Encoders

Neural Networks and Deep Learning

77
Applications of Auto-Encoders
1. Dimensionality reduction
2. Improve performance of many machine learning application
3. Learn best feature for a given size of hidden nodes.
4. Information retrieval
5. Data visualization
6. Feature extraction
i. Prediction accuracy
ii. Speed of prediction
iii. Memory requirements
7. Semantic hashing
8. Unsupervised pre-training
Neural Networks and Deep Learning

78
Recurrent neural networks
Motivation

Most real-world problems are sequential in nature, i.e. the order of
events matters.

Examples:
• Speech – The order of words matters for faithful translation of a sentence.
• Activity – The order of movements of a system may define different
activities.
• Videos – These can be modeled as sequences of images.
• We need a model that extracts features across inputs.
• Models like DNNs and CNNs assume IID training data, so
during learning the activation of a neuron is affected only by the current
input.
• We need to find a way for the model to pass information to its subsequent
‘states’.

Neural Networks and Deep Learning

79
Sequence models

One thing that comes intuitively is that the model must learn some sort of state-
space representation to model sequences.

• Some of the earlier works focused on Probabilistic Graphical
Models, among which Hidden Markov Models were quite
successful, especially in speech recognition. Basically, an HMM is a
state machine with probabilistic transitions which are learned to
optimize the error on training data. The problem was that
a lot of manual tuning was needed, and as the number of states grows
the computation increases heavily.

• Recurrent Neural Networks became successful lately owing to
their more automated way of learning. The idea is to retain
the activation value from the last input and use it in the computation of
the activation value for the next input.
Neural Networks and Deep Learning

80
Recurrent neural network
The design of RNNs is quite simple. There is an extra self-loop input
along with the input from the previous layer, which can be unrolled along the temporal
axis to get a better understanding.

NOTATIONS

• x<i>: input vector of size (n x 1)

• y<i>: predicted output of size (o x 1)

• a<i>: activation vector of size (h x 1), where h
is the number of hidden units

• Waa: weight matrix of the recurrent connection,
of size (h x h)

• Wax: input-to-hidden-layer weights, of size (h x n)

• Wya: hidden-layer-to-output weights, of size (o x h),
assuming the output is an o-sized vector

Neural Networks and Deep Learning

81
Forward propagation

The equations of forward propagation are:

a<i> = f( Wax x<i> + Waa a<i-1> )

y<i> = g( Wya a<i> )

where f and g are activation functions and
i is the index on the temporal axis.

Neural Networks and Deep Learning


82
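These two equations translate directly into a few lines of NumPy; this is an illustrative sketch (tanh and softmax chosen as f and g, matching the worked example on the later slides):

import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def rnn_step(x_t, a_prev, Wax, Waa, Wya, b=0.0):
    """One forward step: a<t> = tanh(Wax x<t> + Waa a<t-1> + b), y<t> = softmax(Wya a<t>)."""
    a_t = np.tanh(Wax @ x_t + Waa @ a_prev + b)
    y_t = softmax(Wya @ a_t)
    return a_t, y_t

def rnn_forward(xs, a0, Wax, Waa, Wya):
    """Unroll the recurrence over a whole input sequence xs (list of 1-D input vectors)."""
    a, outputs = a0, []
    for x_t in xs:
        a, y = rnn_step(x_t, a, Wax, Waa, Wya)
        outputs.append(y)
    return outputs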
Back-propagation through time
The backpropagation algorithm used for RNNs is quite elegant. It
accumulates the loss over the time axis and computes the derivatives backwards in
time, and is thus aptly named “Backpropagation Through Time” (BPTT).

• 𝐿 and 𝐿<𝑖> are total loss and individual


loss respectively. Total loss is the sum
of individual loss.

• The solid lines are flow of data in


forward propagation where as dashed
lines show flow of derivatives during
BPTT.

• As can be seen, the derivative of a state is
affected by both the loss at that time step and the
derivatives of subsequent states.
Neural Networks and Deep Learning

83
Back-propagation through time
The equations of back-propagation are:

∇_{o<i>} L = ŷ<i> − 1_{o<i>}

∇_{a<i>} L = Waa^T f'(a<i+1>) ∇_{a<i+1>} L + Wya^T ∇_{o<i>} L

∇_{Wya} L = Σ_i ∇_{o<i>} L (a<i>)^T

∇_{Waa} L = Σ_i f'(a<i>) ∇_{a<i>} L (a<i-1>)^T

∇_{Wax} L = Σ_i f'(a<i>) ∇_{a<i>} L (x<i>)^T

Notes:
• i runs from T_max down to 1.
• ∇_{a<i>} L for i = T_max does not contain the first term.
• The derivation assumes g(.) is the softmax function and the cross-entropy loss is used.
• f'(a) denotes a diagonal matrix containing the elementwise derivatives of f.

Neural Networks and Deep Learning

84
A numerical example

Let’s see how the above structure can be used to predict the fifth letter in the word
“hello”. In the above structure, the blue RNN block applies something called
a recurrence formula to the input vector and also to its previous state. In this
case, the letter “h” has nothing preceding it, so let’s take the letter “e”. At the
time the letter “e” is supplied to the network, the recurrence formula is applied to
the letter “e” and the previous state, which is the letter “h”. These are known as
the various time steps of the input. So if at time t the input is “e”, then at time t-1 the
input was “h”. The recurrence formula is applied to e and h both, and we get a
new state.

Neural Networks and Deep Learning

85
One-hot encoding of vocabulary
“hello” has 4 distinct characters, which are going to be used for prediction.

Our entire vocabulary is {h, e, l, o}, and hence we can easily one-hot
encode the inputs:

h  e  l  o
1  0  0  0
0  1  0  0
0  0  1  0
0  0  0  1

Neural Networks and Deep Learning

86
Epoch 1
• a1 = f(Wax x1 + Waa a0 + b),   with f = tanh, Waa = 0.427043, b = 0.56700, a0 = 0

Wax (3x4):
0.287027  0.84606   0.572392  0.486813
0.902874  0.871522  0.691079  0.18998
0.537524  0.09224   0.558159  0.491528

x1 = one-hot("h") = [1, 0, 0, 0]^T, so Wax x1 = [0.287027, 0.902874, 0.537524]^T

a1 = tanh( [0.287027, 0.902874, 0.537524]^T + 0 + 0.56700 )
   = [0.693168, 0.899554, 0.802118]^T
Neural Networks and Deep Learning

87
Epoch 1
• y1 = g(Wya a1),   with g = softmax

Wya (4x3):
0.37168  0.974829459  0.830034886
0.39141  0.282585823  0.659835709
0.64985  0.09821557   0.334287084
0.91266  0.32581642   0.144630018

Wya a1 = [1.8003, 1.0548, 0.8069, 1.0417]^T

y1 = softmax([1.8003, 1.0548, 0.8069, 1.0417]) = [0.4323, 0.2051, 0.1601, 0.2025]^T

Target (letter "e") = [0, 1, 0, 0]^T  →  ERROR
Neural Networks and Deep Learning

88
Epoch 2
• a2 = f(Wax x2 + Waa a1 + b)

x2 = one-hot("e") = [0, 1, 0, 0]^T, so Wax x2 = [0.8461, 0.8715, 0.0922]^T

Waa a1 = 0.427043 * [0.693168, 0.899554, 0.802118]^T = [0.2960, 0.3841, 0.3425]^T

a2 = tanh( [0.8461, 0.8715, 0.0922]^T + [0.2960, 0.3841, 0.3425]^T + 0.56700 )
   = [0.93653372, 0.94910403, 0.76234056]^T
Neural Networks and Deep Learning

89
Epoch 2
• y2 = g(Wya a2)

Wya a2 = [1.9061, 1.1378, 0.9567, 1.2742]^T

y2 = softmax([1.9061, 1.1378, 0.9567, 1.2742]) = [0.4197, 0.1947, 0.1624, 0.2231]^T

Target (letter "l") = [0, 0, 1, 0]^T  →  ERROR
Neural Networks and Deep Learning

90
Vanishing Gradient Problem
• The vanishing gradient problem is not unique to RNNs; rather, it occurs in any DL
model with too many layers for the gradient to propagate through. In other models the number
of layers is a hyperparameter, so it can be adjusted as needed, but in an RNN it is
input dependent and thus cannot be modified.

• The term vanishing gradient refers to the fact that with each layer
the gradients are multiplied by numbers less than 1, hence becoming
smaller than before and finally leading to computational underflow; hence
layers are effectively affected only by states close to them.

Neural Networks and Deep Learning

91
• This handicap prevents RNNs from learning features across very long sequences, thus
rendering them useless in several problems. Thus, there is a need for a
model that understands how to memorize data that may be useful in the future.

• Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM)
models are improvements over the vanilla RNN model that use the concept of a
memory cell to store information that might be useful in the future, mitigating
the issue.

Neural Networks and Deep Learning

92
Long-Short Term Memory (LSTM)
Two major characteristics of LSTM:
● Information persistence: Done using cell states. These
are like conveyor belts that run across time, through which
information flows.

● Prioritizing information: Deciding which
information is useful for the future and which is useless and
can be erased. Done using gates similar to digital logic, but
controlled by neural networks.

Neural Networks and Deep Learning


Long-Short Term Memory (LSTM)

Neural Networks and Deep Learning


Different Modules in LSTM
Four Different modules are available in a
LSTM
1. Cell State
2. Forget Gate
3. Input Gate
4. Output Gate

Neural Networks and Deep Learning


Cell State
Key Idea:-
The horizontal line, the cell state
is kind of like a conveyor belt. It
runs straight down the entire
chain, with only some minor
linear interactions. It’s very easy
for information to just flow along
it unchanged.

Neural Networks and Deep Learning


Forget Gate
● Gates are a way to optionally let
information through. They are
composed out of a sigmoid neural net
layer and a pointwise multiplication
operation.
● The first step in the LSTM is to decide
what information we’re going to throw
away from the cell state.

Neural Networks and Deep Learning


Working of LSTM
Input Gate : This gate selects which of the new information is
useful.

Candidate vector which


can be added to cell
state

Neural Networks and Deep Learning


Working of LSTM

Cell State : It’s now time to update the old cell state, Ct-1, into the
new cell state Ct

Neural Networks and Deep Learning


Working of LSTM
Output Gate : Finally, we need to decide what we’re going to
output. This output will be based on our cell state, but will be a
filtered version.

Neural Networks and Deep Learning
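The gate computations described on the preceding slides are usually written as follows (standard LSTM formulation, e.g. Hochreiter & Schmidhuber; shown here as a summary, not taken from the slide figures):

f_t  = σ(W_f [h_{t-1}, x_t] + b_f)          (forget gate)
i_t  = σ(W_i [h_{t-1}, x_t] + b_i)          (input gate)
C̃_t  = tanh(W_C [h_{t-1}, x_t] + b_C)       (candidate cell state)
C_t  = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t            (cell state update)
o_t  = σ(W_o [h_{t-1}, x_t] + b_o)          (output gate)
h_t  = o_t ⊙ tanh(C_t)                      (hidden state / output)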


Introduction to Explainable AI (XAI)

As Artificial Intelligence (AI) is increasingly entrusted with critical decisions, the
need for transparency is louder than ever. XAI provides the solution, revealing the
“black box” of AI systems.
The goal of XAI is to provide a verifiable explanation of how an AI system makes
decisions, making it clear and easy to understand for various stakeholders.

Neural Networks and Deep Learning


Neural Networks and Deep Learning
Need for Explainability

AI systems are increasingly implemented in our everyday
lives to assist humans in making decisions.
These decisions can vary from trivial lifestyle
choices to more complex decisions in the military,
healthcare, and finance industries.
Many AI algorithms are black boxes that are not
transparent. This leads to trust concerns,
due to which humans want accountability and
explanation.

Neural Networks and Deep Learning


Neural Networks and Deep Learning
02
Techniques
of
Explainable AI

Neural Networks and Deep Learning


Taxonomy of XAI Concepts

Global Interpretability:
Understanding the overall working of an AI model.
Local Interpretability:
Focusing on explaining individual predictions.
Example: In medical imaging, local interpretability might involve
identifying specific features in an image that led to a particular
classification.

Neural Networks and Deep Learning


Diverse Techniques in Explainable AI

Model-Specific Techniques:
– Tailored to a single model or group of models.
– Rely on the unique architecture and capabilities of the specific model.

Model-Agnostic Techniques:
– Applicable to any machine learning model.
– Work by analysing input features and output pairs.
– Ex: LIME, SHAP.

Neural Networks and Deep Learning


XAI methods
● Example-Based Explanations: These methods utilize comparisons with instances from the
training distribution to explain predictions. Prototypes, criticisms, and comparisons
communicate the model's logic.
● Local Interpretable Model-Agnostic Explanations (LIME): LIME approximates
complex models via local sparse linear models to identify salient input regions and their
contributions to predictions. Model-agnostic and faithful explanations make LIME popular.
● Shapley Values: This game-theoretic approach assigns payouts to the features of a model to
indicate their contribution to outputs. The computational expense scales exponentially, but
integration into model pipelines enables responsiveness.
● Simpler Models: Linear models and decision trees have an inherent level of transparency
and global interpretability due to more intuitive mechanisms and feature weighting, but
their performance lags behind complex models.

Neural Networks and Deep Learning


03
SHAP
(SHapley Additive
exPlanations)

Neural Networks and Deep Learning


SHAP (Shapley Additive exPlanations)

• It is a game-theoretic approach used to explain the output of a model. It


assigns an importance value to each feature, representing its contribution to
the model's prediction.
• The foundation of SHAP lies in Shapley values, a concept from game
theory. Imagine a group of players contributing to a game's outcome.
Shapley values determine each player's fair share of the winnings based on
their contribution.
• In the context of machine learning, features are the players and the model's
prediction is the winnings. SHAP calculates the contribution of each feature
to the final prediction by considering all possible combinations of features.

Neural Networks and Deep Learning


Shapley Value
Classic result in game theory on distributing gain in a coalition
game
● Coalition game

○ Players collaborate to generate some gain.

○ Set function v(S) determining the gain for any subset S of players.
● Shapley values are a fair way to attribute the total gain to the
players based on their contributions
○ Concept: Marginal contribution of a player to a subset of other players (v(S U
{i}) - v(S))

○ The Shapley value for a player is a specific weighted aggregation of its marginal
contributions over all possible subsets of the other players.
Neural Networks and Deep Learning

Shapley value for player i:  φ_i = Σ_{S ⊆ N \ {i}} W(S) · ( v(S ∪ {i}) − v(S) ),
where the weight is W(S) = |S|! (|N| − |S| − 1)! / |N|!
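As an illustrative sketch (not from the slides), exact Shapley values for a tiny coalition game can be computed directly from this definition; the set function v below is a made-up example:

from itertools import combinations
from math import factorial

def shapley_values(players, v):
    """Exact Shapley values: weighted marginal contributions over all subsets of other players."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += w * (v(set(S) | {i}) - v(set(S)))
        phi[i] = total
    return phi

# Toy game: the gain of a coalition is the sum of individual strengths plus a synergy bonus
strength = {"A": 3.0, "B": 1.0, "C": 2.0}
v = lambda S: sum(strength[p] for p in S) + (2.0 if {"A", "B"} <= S else 0.0)

print(shapley_values(["A", "B", "C"], v))  # {'A': 4.0, 'B': 2.0, 'C': 2.0}; the synergy is split between A and B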


Shapley Value Justification
Shapley values are unique under four simple axioms
● Dummy: If a player never contributes to the game then it must receive zero
attribution
● Efficiency: Attributions must add to the total gain
● Symmetry: Symmetric players must receive equal attribution
● Linearity: Attribution for the (weighted) sum of two games must be the
same as the (weighted) sum of the attributions for each of the games

Neural Networks and Deep Learning


Transparency in Deep learning

● Deep learning models are not transparent.

● Features extracted by deep learning models
give different levels of transparency.
● Many model-agnostic methods exist to
extract explanations from deep learning
models.
● Mimic models are also widely used for
creating interpretable models.

Neural Networks and Deep Learning


04
Examples of
Explainable AI
in
different domain

Neural Networks and Deep Learning


Conclusion

– As AI systems are increasingly deployed in real-world settings, explanations
to users regarding their logic and predictions become imperative for trust
and accountability.
– XAI is applied in various fields such as healthcare, autonomous vehicles,
finance, etc. for its explainability property.
– XAI is an active research area, with new, improved methods being
developed consistently.

Neural Networks and Deep Learning


References
• Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document
recognition. Proc. IEEE 86(11), 2278–2324 (1998)
• D.H. Hubel, T.N. Wiesel, Receptive fields and functional architecture of monkey striate cortex. J.
Physiol. 195(1), 215–243 (1968)
• X. Zhang, J. Zhao, Y. LeCun, Character-level convolutional networks for text classification, in
Advances in Neural Information Processing Systems 28, NIPS (2015)
• J.J. Hopfield, Neural networks and physical systems with emergent collective computational
abilities. Proc. Nat. Acad. Sci. U.S.A 79(8), 2554–2558 (1982)
• S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
• J.L. Elman, Finding structure in time. Cogn. Sci. 14, 179–211 (1990)
• M.I. Jordan, Attractor dynamics and parallelism in a connectionist sequential machine, in
Proceedings of the 26th Annual International Conference on Machine Learning, Erlbaum, NJ,
USA (Cognitive Science Society, 1986), pp. 531–546
• A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks (Springer, New York,
2012)

Neural Networks and Deep Learning

11
References
• A. Gulli, S. Pal, Deep Learning with Keras (Packt publishing, Birmingham, 2017)
• S. Axler, Linear Algebra Done Right (Springer, New York, 2015)
• R. Vidal, Y. Ma, S. Sastry, Generalized Principal Component Analysis (Springer, London, 2016)
• Goodfellow, Y. Bengio, A. Courville, Deep Learning (MIT Press, Cambridge, 2016)
• D.H. Ballard, Modular learning in neural networks, in AAAI-87 Proceedings (AAAI, 1987), pp. 279–
284
• Y. LeCun, Modeles connexionnistes de l’apprentissage (Connectionist Learning Models)
(Université P. et M. Curie (Paris 6), 1987)
• P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.-A. Manzagol, Stacked denoising autoencoders:
learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn.
Res. 11, 3371–3408 (2010)
• Q.V. Le, M.A. Ranzato, R. Monga, M. Devin, K. Chen, G.S. Corrado, J. Dean, A.Y. Ng, Building
high-level features using large scale unsupervised learning, in Proceedings of the 29th
International Conference on Machine Learning. ICML (2012)

Neural Networks and Deep Learning

11
Thank You

Neural Networks and Deep Learning

11
