Sarma CNN Vce Oct 2022

Convolutional networks, or CNNs, are specialized neural networks designed for processing grid-like data, utilizing a mathematical operation called convolution to improve efficiency. The talk aims to explain convolution, pooling, and the tools provided by CNNs, highlighting their applications in machine learning. Key concepts include the convolution operation, the use of kernels, and the benefits of sparse interactions and parameter sharing in enhancing model performance.


Convolutional Networks

By
Dr. T. Hitendra Sarma
Associate Professor
Department of IT
Vasavi College of Engineering
Hyderabad
Introduction

 Convolutional networks (LeCun, 1989), also known as convolutional neural
networks or CNNs, are a specialized kind of neural network for processing data that
has a known, grid-like topology.
 Convolutional networks have been tremendously successful in practical applications.
The name “convolutional neural network” indicates that the network employs a
mathematical operation called convolution.
 Convolution is a specialized kind of linear operation.
Plan of the talk

 What is convolution?
 Motivation behind using convolution in a neural network.
 What is pooling?
 How convolution may be applied to many kinds of data, with different numbers of
dimensions.
 Means of making convolution more efficient.
Objective

 The goal of this talk is to describe the kinds of tools that convolutional networks
provide.
 The general guidelines for choosing which tools to use in which circumstances will
be discussed in the next session.
The building blocks
 The Convolution Operation:
 Suppose we are tracking the location of a spaceship with a laser sensor. Our laser
sensor provides a single output x(t), the position of the spaceship at time t. Both x
and t are real-valued, i.e., we can get a different reading from the laser sensor at any
instant in time.
 Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy
estimate of the spaceship’s position, we would like to average together several
measurements.
 Of course, more recent measurements are more relevant, so we will want this to be
a weighted average that gives more weight to recent measurements. We can do this
with a weighting function w(a), where a is the age of a measurement.
Convolution…
 If we apply such a weighted average operation at every moment, we obtain a new
function s providing a smoothed estimate of the position of the spaceship:
$$s(t) = \int x(a)\, w(t - a)\, da$$
 This operation is called convolution.
 The convolution operation is typically denoted with an asterisk: s(t) = (x ∗ w)(t)
 In general, convolution is defined for any functions for which the above integral is
defined.
Some Terminology
 In convolutional network terminology, the first argument (in the previous example,
the function x) to the convolution is often referred to as the input and the second
argument (in this example, the function w) as the kernel.
 The output is sometimes referred to as the feature map.
 It might be more realistic to assume that our laser provides a measurement once per
second. The time index t can then take on only integer values. If we now assume that
x and w are defined only on integer t, we can define the discrete convolution:

$$s(t) = (x * w)(t) = \sum_{a=-\infty}^{\infty} x(a)\, w(t - a)$$
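As a minimal sketch of the discrete convolution above, the snippet below smooths a short series of position readings. The readings in x and the weights in w are made-up illustrative values, not from the talk; w is indexed by the age of a measurement, and both sequences are assumed to be zero outside the listed entries.

```python
import numpy as np

def discrete_convolution(x, w):
    """s(t) = sum over a of x(a) * w(t - a), for finite sequences indexed from 0."""
    s = np.zeros(len(x) + len(w) - 1)
    for t in range(len(s)):
        for a in range(len(x)):
            age = t - a
            if 0 <= age < len(w):      # w is treated as zero outside its support
                s[t] += x[a] * w[age]
    return s

# Hypothetical noisy position readings, one per second.
x = np.array([1.0, 2.0, 4.0, 3.0, 5.0])
# Hypothetical kernel: w[age], so the newest reading (age 0) gets the most weight.
w = np.array([0.5, 0.3, 0.2])

s = discrete_convolution(x, w)
print(s)
```

Here s[2] = x[0]*w[2] + x[1]*w[1] + x[2]*w[0], i.e. the reading at t = 2 has age 0 and the largest weight. The first and last entries of s are partial sums because fewer than three readings overlap the kernel there.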
Tensors!

 In machine learning applications, the input is usually a
multidimensional array of data and the kernel is usually a
multidimensional array of parameters that are adapted by the
learning algorithm. We will refer to these multidimensional
arrays as tensors.
 We often use convolutions over more than one axis at a time.
For example, if we use a two-dimensional image I as our
input, we probably also want to use a two-dimensional kernel K:
$$s(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} I(m, n)\, K(i - m, j - n)$$
 Note that convolution is commutative:
$$(K * I)(i, j) = (I * K)(i, j)$$
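The two-dimensional formula and its commutativity can be checked numerically. The sketch below is a direct, unoptimized implementation that assumes I and K are zero outside their stored entries; the arrays are arbitrary illustrative values.

```python
import numpy as np

def conv2d_full(I, K):
    """S(i, j) = sum_m sum_n I(m, n) * K(i - m, j - n), zeros assumed outside each array."""
    H = I.shape[0] + K.shape[0] - 1
    W = I.shape[1] + K.shape[1] - 1
    S = np.zeros((H, W))
    for m in range(I.shape[0]):
        for n in range(I.shape[1]):
            # I(m, n) contributes I(m, n) * K(i - m, j - n) to every S(i, j)
            # with (i - m, j - n) inside K.
            S[m:m + K.shape[0], n:n + K.shape[1]] += I[m, n] * K
    return S

I = np.arange(1.0, 13.0).reshape(3, 4)    # hypothetical 3 x 4 "image"
K = np.array([[1.0, 2.0], [3.0, 4.0]])    # hypothetical 2 x 2 kernel

S_ik = conv2d_full(I, K)
S_ki = conv2d_full(K, I)
print(np.allclose(S_ik, S_ki))
```

Swapping the roles of image and kernel gives the same output, which is the commutative property. Many neural network libraries implement the closely related cross-correlation (no kernel flip) and still call it convolution.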
Motivation

 Convolution leverages three important ideas that can help
improve a machine learning system:
 sparse interactions
 parameter sharing
 equivariant representations.
 Moreover, convolution provides a means for working with
inputs of variable size.
Motivation..
 We know it is good to learn a small model.
 From this fully connected model, do we really need all the edges?
 Can some of these be shared?
Consider learning an image:
 Some patterns are much smaller than the whole image.
 A small region can be represented with fewer parameters, e.g. a “beak” detector.
Same pattern appears in different places:
 They can be compressed!
 What about training a lot of such “small” detectors, where each detector must “move around”?
 An “upper-left beak” detector and a “middle beak” detector can be compressed
to the same parameters.
A convolutional layer
A CNN is a neural network with some convolutional layers (and some other
layers). A convolutional layer has a number of filters, each of which performs a
convolution operation. A filter can act, for example, as a beak detector.

Convolution
The filter values are the network parameters to be learned.
Each filter detects a small pattern (3 x 3).

6 x 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Filter 1:
 1 -1 -1
-1  1 -1
-1 -1  1

Filter 2:
-1  1 -1
-1  1 -1
-1  1 -1
Convolution (Filter 1, stride=1)
Slide Filter 1 over the 6 x 6 image and take the dot product of the filter with
each 3 x 3 patch. The first two positions in the top row give 3 and -1.
Convolutional kernel
[Animated figure: a convolutional kernel sliding over the input]

Convolutional kernel
Padding on the input volume with zeros in such a way that the conv layer does not
alter the spatial dimensions of the input.
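A small sketch of zero padding, assuming the 6 x 6 image and Filter 1 from the earlier slides and a padding of one pixel on every side (the amount needed for a 3 x 3 filter with stride 1). The filter is applied as on the slides, as a dot product with each patch.

```python
import numpy as np

image = np.array([
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
])
filter1 = np.array([[1, -1, -1], [-1, 1, -1], [-1, -1, 1]])

# Add one ring of zeros around the image: 6 x 6 becomes 8 x 8.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

# Slide the 3 x 3 filter over the padded image with stride 1.
same_out = np.zeros(image.shape, dtype=int)
for i in range(image.shape[0]):
    for j in range(image.shape[1]):
        same_out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * filter1)

print(padded.shape, same_out.shape)
```

The output has the same 6 x 6 spatial size as the input; its inner 4 x 4 block equals the unpadded result.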
Convolution (Filter 1, stride=2)
If stride=2, the filter moves two pixels at a time. The first two positions in the
top row of the 6 x 6 image give 3 and -3.
Convolution (Filter 1, stride=1)
Sliding Filter 1 over the whole 6 x 6 image gives a 4 x 4 output:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1
Convolution (Filter 2, stride=1)
Repeat this for each filter. Filter 2 gives another 4 x 4 feature map:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3
Two 4 x 4 images
Forming 2 x 4 x 4 matrix
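The feature maps on these slides can be reproduced with a few lines of NumPy. This is a sketch, not the talk's own code; the helper name slide_filter is made up, and it applies the filter without flipping it (cross-correlation), which is what the dot products on the slides compute.

```python
import numpy as np

image = np.array([
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
])
filter1 = np.array([[1, -1, -1], [-1, 1, -1], [-1, -1, 1]])
filter2 = np.array([[-1, 1, -1], [-1, 1, -1], [-1, 1, -1]])

def slide_filter(img, kernel, stride=1):
    """Dot product of the kernel with every patch, moving `stride` pixels at a time."""
    kh, kw = kernel.shape
    out_h = (img.shape[0] - kh) // stride + 1
    out_w = (img.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w), dtype=img.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(img[r:r + kh, c:c + kw] * kernel)
    return out

map1 = slide_filter(image, filter1)             # 4 x 4 map for Filter 1
map2 = slide_filter(image, filter2)             # 4 x 4 map for Filter 2
feature_maps = np.stack([map1, map2])           # 2 x 4 x 4
strided = slide_filter(image, filter1, stride=2)

print(feature_maps.shape)
print(strided)
```

With stride=2 the output shrinks to 2 x 2, and its top row is 3, -3 as on the stride=2 slide.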
Convolution vs. Fully Connected
 Convolution: the 6 x 6 image is processed by the 3 x 3 filters (Filter 1 and Filter 2).
 Fully connected: the same 6 x 6 image is flattened into 36 inputs x1, x2, …, x36,
and every input is connected to every unit of the next layer.
Fewer parameters!
Number the pixels of the 6 x 6 image from 1 to 36, row by row. The first output of
Filter 1 (value 3) only connects to 9 inputs (1, 2, 3, 7, 8, 9, 13, 14, 15), not
fully connected.


Even fewer parameters
The second output of Filter 1 (value -1) connects to inputs 2, 3, 4, 8, 9, 10, 14,
15, 16 and uses the same 9 weights as the first output (shared weights).
Parameter sharing refers to using the same parameter for more than one function in a
model.
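One way to see both savings is to write the convolution as a fully connected layer whose weight matrix is mostly zero and reuses the same 9 values in every row. The sketch below builds that 16 x 36 matrix for Filter 1; the variable names are illustrative.

```python
import numpy as np

image = np.array([
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
])
filter1 = np.array([[1, -1, -1], [-1, 1, -1], [-1, -1, 1]])

# A fully connected layer from 36 inputs to 16 outputs has one weight per edge.
dense_params = 36 * 16
# The convolutional layer reuses the 9 filter values for all 16 outputs.
conv_params = filter1.size

# Equivalent weight matrix: row = output position, column = input pixel.
W = np.zeros((16, 36), dtype=int)
for i in range(4):
    for j in range(4):
        for u in range(3):
            for v in range(3):
                W[i * 4 + j, (i + u) * 6 + (j + v)] = filter1[u, v]

out = (W @ image.reshape(36)).reshape(4, 4)
print(dense_params, conv_params)
print(out)
```

Each row of W has only 9 nonzero entries (sparse interactions), and every row contains the same 9 numbers (parameter sharing), so 576 potential weights reduce to 9.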


Pooling

 A typical layer of a convolutional network consists of three stages.


 In the first stage, the layer performs several convolutions in parallel to produce a set
of linear activations.
 In the second stage, each linear activation is run through a nonlinear activation
function, such as the rectified linear activation function. This stage is sometimes
called the detector stage.
 In the third stage, we use a pooling function to modify the output of the layer further.
 A pooling function replaces the output of the net at a certain
location with a summary statistic of the nearby outputs.
 For example, the max pooling (Zhou and Chellappa, 1988) operation
reports the maximum output within a rectangular neighborhood.
Max Pooling

Feature map from Filter 1:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1

Feature map from Filter 2:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3

Taking the maximum of each 2 x 2 block of the Filter 1 map gives:
3 0
3 1
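A minimal max pooling sketch, assuming non-overlapping 2 x 2 blocks and the two 4 x 4 feature maps from the slides; max_pool2d is an illustrative helper name.

```python
import numpy as np

map1 = np.array([[3, -1, -3, -1], [-3, 1, 0, -3], [-3, -3, 0, 1], [3, -2, -2, -1]])
map2 = np.array([[-1, -1, -1, -1], [-1, -1, -2, 1], [-1, -1, -2, 1], [-1, 0, -4, 3]])

def max_pool2d(fmap, size=2):
    """Maximum over non-overlapping size x size blocks (leftover rows/columns are dropped)."""
    h, w = fmap.shape
    trimmed = fmap[:h - h % size, :w - w % size]
    # blocks[a, b, c, d] == trimmed[a * size + b, c * size + d]
    blocks = trimmed.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

pooled1 = max_pool2d(map1)
pooled2 = max_pool2d(map2)
print(pooled1)
print(pooled2)
```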
The whole CNN
Input image -> Convolution -> Max Pooling -> Convolution -> Max Pooling -> Flattened
-> Fully Connected Feedforward network -> cat dog ……
Convolution and Max Pooling can repeat many times.
Why Pooling
 Subsampling pixels will not change the object: a subsampled image of a bird is
still a bird.
 We can subsample the pixels to make the image smaller, so fewer parameters are
needed to characterize the image.
A CNN compresses a fully connected network in
two ways:
 Reducing number of connections
 Shared weights on the edges
Max pooling further reduces the complexity.
Max Pooling
After Conv and Max Pooling, the 6 x 6 image becomes a new but smaller 2 x 2 image.
Each filter is a channel.

Channel from Filter 1:
 3  0
 3  1

Channel from Filter 2:
-1  1
 0  3
The whole CNN
Convolution followed by Max Pooling gives a new image, smaller than the original
image. The number of channels is the number of filters. Convolution and Max Pooling
can repeat many times.
The whole CNN
Each Convolution + Max Pooling pair produces a new image. The last new image is
Flattened and passed to the Fully Connected Feedforward network, which outputs
cat dog ……
Other Pooling Functions..
 Other popular pooling functions include
 the average of a rectangular neighborhood,
 the L2 norm of a rectangular neighborhood, or
 a weighted average based on the distance from the central pixel.
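Average and L2-norm pooling differ from max pooling only in the summary statistic. The sketch below assumes non-overlapping 2 x 2 neighborhoods and reuses the Filter 1 feature map from the earlier slides; pool2d is an illustrative helper, and the distance-weighted variant is not shown.

```python
import numpy as np

fmap = np.array([[3, -1, -3, -1], [-3, 1, 0, -3], [-3, -3, 0, 1], [3, -2, -2, -1]])

def pool2d(fmap, size, reduce_fn):
    """Apply reduce_fn to non-overlapping size x size blocks (shape must divide evenly)."""
    h, w = fmap.shape
    blocks = fmap.reshape(h // size, size, w // size, size).astype(float)
    return reduce_fn(blocks)

avg_pooled = pool2d(fmap, 2, lambda b: b.mean(axis=(1, 3)))
l2_pooled = pool2d(fmap, 2, lambda b: np.sqrt((b ** 2).sum(axis=(1, 3))))

print(avg_pooled)
print(l2_pooled)
```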


 In all cases, pooling helps to make the representation become approximately
invariant to small translations of the input. Invariance to translation means that if we
translate the input by a small amount, the values of most of the pooled outputs do not
change.
 Invariance to local translation can be a very useful property if we care more about
whether some feature is present than exactly where it is.
 E.g., we just need to know that there is an eye on the left side of the face and an eye on the right side
of the face.
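A small numerical illustration of approximate translation invariance, using made-up detector outputs and one-dimensional max pooling over windows of width 3 with stride 1. The input is shifted right by one position, and the snippet counts how many values change before and after pooling.

```python
import numpy as np

def max_pool1d(x, width=3):
    """Maximum over every window of `width` consecutive values (stride 1)."""
    return np.array([x[i:i + width].max() for i in range(len(x) - width + 1)])

# Hypothetical detector-stage outputs.
x = np.array([0.0, 0.1, 1.0, 0.2, 0.1, 0.0])
# Shift right by one position; the wrapped-around last value is 0.0 here.
x_shifted = np.roll(x, 1)

pooled = max_pool1d(x)
pooled_shifted = max_pool1d(x_shifted)

inputs_changed = int(np.sum(x != x_shifted))
outputs_changed = int(np.sum(pooled != pooled_shifted))
print(inputs_changed, "of", len(x), "inputs changed")
print(outputs_changed, "of", len(pooled), "pooled outputs changed")
```

In this example 5 of the 6 inputs change, but only 2 of the 4 pooled outputs do, because the pooling units only respond to the largest value in their neighborhood.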
Note
 Pooling over spatial regions produces invariance to
translation.
 This improves the computational efficiency of the network
because the next layer has roughly k times fewer inputs to
process (if the pooling regions are spaced k pixels apart rather than 1 pixel apart).
 This reduction in the input size can also result in improved
statistical efficiency and reduced memory requirements for
storing the parameters.
Flattening
The pooled channels

 3  0        -1  1
 3  1         0  3

are flattened into a single vector, which is the input to the Fully Connected
Feedforward network.
CNN in Keras
Only modified the network structure and input format (vector -> 3-D tensor).
 Input_shape = (28, 28, 1): 28 x 28 pixels; 1: black/white, 3: RGB
 Convolution: there are 25 3x3 filters.
 Max Pooling
 Convolution
 Max Pooling
CNN in Keras
Only modified the network structure and input format (vector -> 3-D array).

Layer          Output size     How many parameters for each filter?
Input          1 x 28 x 28
Convolution    25 x 26 x 26    9
Max Pooling    25 x 13 x 13
Convolution    50 x 11 x 11    225 = 25 x 9
Max Pooling    50 x 5 x 5
CNN in Keras
Only modified the network structure and input format (vector -> 3-D array).
Input (1 x 28 x 28) -> Convolution (25 x 26 x 26) -> Max Pooling (25 x 13 x 13)
-> Convolution (50 x 11 x 11) -> Max Pooling (50 x 5 x 5) -> Flattened (1250)
-> Fully connected feedforward network -> Output
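The sizes and per-filter parameter counts on these slides follow from simple arithmetic. The sketch below assumes 3 x 3 filters with stride 1 and no padding, 2 x 2 max pooling, and counts weights only (no bias), which matches the 9 and 225 on the slide; the helper names are illustrative and this is plain Python, not Keras code.

```python
def conv_output(shape, num_filters, k=3):
    """Shape after a k x k convolution (stride 1, no padding) and weights per filter."""
    channels, h, w = shape
    weights_per_filter = channels * k * k      # each filter spans all input channels
    return (num_filters, h - k + 1, w - k + 1), weights_per_filter

def pool_output(shape, size=2):
    """Shape after non-overlapping size x size max pooling."""
    channels, h, w = shape
    return (channels, h // size, w // size)

shape = (1, 28, 28)
conv1_shape, conv1_weights = conv_output(shape, num_filters=25)
pool1_shape = pool_output(conv1_shape)
conv2_shape, conv2_weights = conv_output(pool1_shape, num_filters=50)
pool2_shape = pool_output(conv2_shape)
flattened = pool2_shape[0] * pool2_shape[1] * pool2_shape[2]

print(conv1_shape, conv1_weights)
print(pool1_shape)
print(conv2_shape, conv2_weights)
print(pool2_shape, flattened)
```

The second convolution needs 225 weights per filter because each of its filters covers all 25 input channels. Keras convolution layers also add one bias per filter by default, which this count leaves out.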
Convolution and Pooling as an Infinitely Strong
Prior
 Priors can be considered weak or strong depending on how concentrated the
probability density in the prior is.
 A weak prior is a prior distribution with high entropy, such as a Gaussian distribution
with high variance. Such a prior allows the data to move the parameters more or less
freely.
 A strong prior has very low entropy, such as a Gaussian distribution with low
variance. Such a prior plays a more active role in determining where the parameters
end up.
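 One can think of a convolutional layer as a fully connected layer with an infinitely
strong prior probability distribution over its parameters: the weights of one hidden
unit must be identical to those of its neighbor, only shifted in space, and must be
zero outside a small receptive field.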
Variants of the Basic Convolution Function

 Multi-channel convolution
 Strided convolution
 Padding:
 Without zero padding (valid convolution): the output is smaller than the input
 With zero padding (same convolution): the output keeps the spatial size of the input
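The effect of stride and padding on the output size along one axis can be sketched with the usual formula floor((n + 2p - k) / s) + 1, where n is the input size, k the kernel size, s the stride and p the number of zeros added on each side. The function names below are illustrative.

```python
def conv_output_size(n, k, stride=1, pad=0):
    """Output length along one axis: floor((n + 2*pad - k) / stride) + 1."""
    return (n + 2 * pad - k) // stride + 1

def same_pad(k):
    """Zeros per side so that a stride-1 convolution with odd k keeps the input size."""
    return (k - 1) // 2

valid_size = conv_output_size(6, 3)                      # no padding: output shrinks
same_size = conv_output_size(6, 3, pad=same_pad(3))      # zero padding: size is kept
strided_size = conv_output_size(6, 3, stride=2)          # stride 2: fewer positions

print(valid_size, same_size, strided_size)
```

For the 6 x 6 image and 3 x 3 filters used earlier, this gives 4 without padding, 6 with one pixel of zero padding, and 2 with stride 2.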
Note
 A convolutional layer generally does not use only a linear operation to transform
the inputs to the outputs: we also add a bias term to each output before applying
the nonlinearity.
 For locally connected layers it is natural to give each unit its own bias, and for tiled
convolution, it is natural to share the biases with the same tiling pattern as the
kernels.
 For convolutional layers, it is typical to have one bias per channel of the output and
share it across all locations within each convolution map.
 However, if the input is of known, fixed size, it is also possible to learn a separate
bias at each location of the output map.
 Separating the biases may slightly reduce the statistical efficiency of the model, but
also allows the model to correct for differences in the image statistics at different
locations.
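The two bias schemes for convolutional layers can be illustrated with NumPy broadcasting. The activations and bias values below are placeholders chosen only to make the shapes visible.

```python
import numpy as np

# Hypothetical linear activations: 2 channels, each a 4 x 4 map.
feature_maps = np.ones((2, 4, 4))

# Typical choice: one bias per output channel, shared across all locations.
shared_bias = np.array([0.5, -1.0])
out_shared = feature_maps + shared_bias[:, None, None]

# Alternative for fixed-size inputs: a separate bias at each location of the output map.
untied_bias = np.arange(32.0).reshape(2, 4, 4)
out_untied = feature_maps + untied_bias

print(shared_bias.size, "shared bias parameters")
print(untied_bias.size, "per-location bias parameters")
```

The shared scheme needs 2 parameters here and the per-location scheme 32, which is the trade-off between statistical efficiency and flexibility mentioned above.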
Convolutional neural network for Image recognition
Dense neural network and Convolutional neural network
A simple CNN structure

CONV: Convolutional kernel layer
RELU: Activation function
POOL: Dimension reduction layer
FC: Fully connected layer
Demo – CNN

 https://poloclub.github.io/cnn-explainer/
 Video Tutorial
MNIST dataset

The MNIST database of handwritten digits has a training set of 60,000
examples, and a test set of 10,000 examples.
It is a subset of a larger set available from NIST.
The digits have been size-normalized and centered in a
fixed-size image.
CIFAR10 dataset and state of the art
The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes,
with 6000 images per class. There are 50000 training images and 10000 test images.
ImageNet

 The ImageNet project is a large visual database designed for use
in visual object recognition software research. As of 2016, over
ten million URLs of images have been hand-annotated by
ImageNet to indicate what objects are pictured; in at least one
million of the images, bounding boxes are also provided. The
database of annotations of third-party image URLs is freely
available directly from ImageNet; however, the actual images are
not owned by ImageNet. Since 2010, the ImageNet project has run
an annual software contest, the ImageNet Large Scale Visual
Recognition Challenge (ILSVRC), where software programs
compete to correctly classify and detect objects and scenes.
Case studies

 LeNet. The first successful applications of Convolutional Networks
were developed by Yann LeCun in the 1990s. Of these, the best known is
the LeNet architecture that was used to read zip codes, digits, etc.
 AlexNet. The first work that popularized Convolutional Networks in
Computer Vision was the AlexNet, developed by Alex Krizhevsky, Ilya
Sutskever and Geoff Hinton. The AlexNet was submitted to
the ImageNet ILSVRC challenge in 2012 and significantly
outperformed the runner-up (top-5 error of 16% compared to the
runner-up with 26% error). The network had a very similar architecture
to LeNet, but was deeper, bigger, and featured Convolutional Layers
stacked on top of each other (previously it was common to only have a
single CONV layer always immediately followed by a POOL layer).
LeNet-5 for MNIST

http://yann.lecun.com/exdb/lenet/multiples.html
Case studies
 GoogLeNet. The ILSVRC 2014 winner was a Convolutional
Network from Szegedy et al. from Google. Its main
contribution was the development of an Inception Module that
dramatically reduced the number of parameters in the
network (4M, compared to AlexNet with 60M). Additionally,
this paper uses Average Pooling instead of Fully Connected
layers at the top of the ConvNet, eliminating a large amount
of parameters that do not seem to matter much. There are
also several followup versions to the GoogLeNet, most
recently Inception-v4.
Case studies

 VGGNet. The runner-up in ILSVRC 2014 was the network from Karen
Simonyan and Andrew Zisserman that became known as the VGGNet.
Its main contribution was in showing that the depth of the network is a
critical component for good performance. Their final best network
contains 16 CONV/FC layers and, appealingly, features an extremely
homogeneous architecture that only performs 3x3 convolutions and
2x2 pooling from the beginning to the end. Their pretrained model is
available for plug and play use in Caffe. A downside of the VGGNet is
that it is more expensive to evaluate and uses a lot more memory and
parameters (140M). Most of these parameters are in the first fully
connected layer, and it was since found that these FC layers can be
removed with no performance downgrade, significantly reducing the
number of necessary parameters.
Case studies

 ResNet. Residual Network developed by Kaiming He et al. was
the winner of ILSVRC 2015. It features special skip
connections and a heavy use of batch normalization. The
architecture is also missing fully connected layers at the end of
the network. The reader is also referred to Kaiming’s presentation
(video, slides), and some recent experiments that reproduce
these networks in Torch. ResNets are currently by far state of the
art Convolutional Neural Network models and are the default
choice for using ConvNets in practice (as of May 10, 2016). In
particular, also see more recent developments that tweak the
original architecture from Kaiming He et al.: Identity Mappings in
Deep Residual Networks (published March 2016).
VGG-16 GoogleNet ResNet
AlexNet Architecture
 AlexNet, which employed an 8-layer CNN, won the ImageNet Large Scale Visual Recognition
Challenge 2012 by a phenomenally large margin. This network showed, for the first time, that the
features obtained by learning can transcend manually-designed features, breaking the previous
paradigm in computer vision.
GoogLeNet

 GoogLeNet was based on a deep convolutional neural network architecture
codenamed “Inception”, which was responsible for setting the new state of the art for
classification and detection in the ImageNet Large-Scale Visual Recognition
Challenge 2014 (ILSVRC 2014).
Understanding and Calculating the number of
Parameters in CNNs
Acknowledgements
 I thank Prof. Andrew Ng for making his teaching content publicly available for reuse
for effective teaching.
 I thank Dr Swagatam Das, Indian Statistical Institute, Kolkata for sharing some of the
slides.
 Lecture content is prepared based on the course “Neural Networks and Deep
Learning” by DeepLearning.AI, delivered by Andrew Ng
 https://d2l.ai
 https://cs231n.github.io
Recent work from my team

 Following is the link for the Jupyter notebook which is
developed by my student (available @ Kaggle.com)
Mr. Syam (Systems Engineer, TCS, Bangalore)

 https://www.kaggle.com/syamkakarla/traffic-sign-classification-using-resnet
Thank You
