DL_Unit_I_&_Unit_II
Deep Learning
UNIT I: Fundamentals of Deep Learning
and
UNIT II: Convolutional Neural Network
Prepared By:
Dr. Bhandare T. V.
(Dept. IT)
Outline
• What is neural network?
• History
– ANN : motivation from biological neuron
• McCULLOCH-PITTS NEURON
– Boolean Functions Using M-P Neuron
– Geometric representation of M-P Neuron
– Limitations Of M-P Neuron
• PERCEPTRON
– Visualization of the equation
– Perceptron model
– How it operates?
– Perceptron vs McCulloch-Pitts Neuron
– Boolean Functions Using Perceptron
– Perceptron Learning Algorithm
– Minsky's "And / Or" Theorem: A Single Perceptron's Limitations
– The And / Or Theorem
– How does a neural network overcome the limitations that a single perceptron has in
learning XOR?
Outline
• Multilayer perceptron
– Activation function
– Gradient Descent
– Gradient Descent and the Delta Rule
– Gradient Descent Algorithm for Training a Linear Unit
– Stochastic Gradient Descent Algorithm for Training a Linear Unit
– Difference between Standard Gradient Descent and Stochastic Gradient Descent
– Multilayer Networks and the Backpropagation Algorithm
– Convergence and Local Minima
– Representational Power of Feedforward Networks
– Hidden layer representation
– Generalization, overfitting and stopping criterion
• Convolutional Neural Networks
What is neural network?
• Neural network or Artificial Neural Network (ANN)
is a massively parallel distributed processor made up
of simple processing units, which has a natural
propensity for storing experiential knowledge and
making it available for use.
• A neural network contains a large number of
simple neuron-like processing elements and a large
number of weighted connections that encode
the knowledge of the network. Though biologically
inspired, many of the neural network models
developed do not duplicate the operation of the
human brain.
History
1871-1873
• Reticular Theory
Joseph von Gerlach proposed that the nervous
system is a single continuous network, as opposed
to a network of many discrete cells.
• Staining Technique
Camillo Golgi discovered a chemical reaction that
allowed him to examine a nervous tissue in much
greater detail than ever before.
He was a proponent of Reticular Theory.
History (Continue…)
1888-1891
• Neuron Doctrine
Santiago Ramón y Cajal used Golgi’s technique to study the
nervous system and proposed that it is actually made up of
discrete individual cells forming a network (as opposed to a single
continuous network).
• The Term Neuron
The term neuron was coined by Heinrich Wilhelm Gottfried
von Waldeyer-Hartz around 1891.
He further consolidated the Neuron Doctrine.
• 1950s: The Final Word
In the 1950s, electron microscopy finally confirmed the neuron
doctrine by unambiguously demonstrating that nerve cells
were individual cells interconnected through synapses (a
network of many individual neurons).
History (Continue…)
• McCulloch-Pitts Neuron
The first computational model of a neuron was
proposed by Warren McCulloch (neuroscientist) and
Walter Pitts (logician) in 1943.
History (Continue…)
• Perceptron
• Rosenblatt (1958) proposed that the perceptron may eventually
be able to learn, make decisions, and translate languages. He
introduced the perceptron as the first model for learning with a teacher
(i.e., supervised learning).
• “The embryo of an electronic computer that [the Navy]
expects will be able to walk, talk, see, write, reproduce itself
and be conscious of its existence.”- New York Times
History (Continue…)
• Further refined and carefully analyzed by Minsky and
Papert (1969) — their model is referred to as
the perceptron model.
• Backpropagation
– Discovered and rediscovered several times throughout the 1960s
and 1970s
– Werbos (1982) first used it in the context of artificial neural
networks
– Eventually popularized by the work of Rumelhart et al. in 1986.
• Unsupervised pre-learning
• Success in handwriting recognition
• Success in speech recognition
• First superhuman pattern recognition
• So on…
ANN : motivation from biological neuron
• The most fundamental unit of
deep neural network is called
artificial neuron
• Why is it called neuron?
• The idea comes from biology
such as:
• In biology:
– Biological neurons= neural
cells= neural processing units
• The basic question that arises is:
What is a biological
neuron and how does it work?
ANN : motivation from biological neuron
(Continue…)
• OR Function
• An OR function neuron would fire if ANY of the inputs is ON
i.e., g(x) ≥ 1 here.
Boolean Functions Using M-P Neuron Continue…
• A Function With An Inhibitory Input
• Here, we have an inhibitory input i.e., x_3 so
whenever x_3 is 1, the output will be 0. Keeping that in
mind, we know that x_1 AND !x_3 would output 1 only
when x_1 is 1 and x_3 is 0 so it is obvious that the threshold
parameter should be 1.
Boolean Functions Using M-P Neuron Continue…
• NOR Function
• For a NOR neuron to fire, we want ALL the inputs to be 0 so the
thresholding parameter should also be 0 and we take them all as
inhibitory input.
• NOT Function
• For a NOT neuron, 1 outputs 0 and 0 outputs 1. So we take the input
as an inhibitory input and set the thresholding parameter to 0.
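The Boolean functions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper name mp_neuron and its argument layout (separate lists for excitatory and inhibitory inputs) are assumptions made for this example.

```python
def mp_neuron(excitatory, inhibitory, theta):
    """McCulloch-Pitts neuron: binary inputs, fixed threshold, no weights."""
    # Any active inhibitory input forces the output to 0.
    if any(inhibitory):
        return 0
    # Otherwise fire when the sum of excitatory inputs reaches the threshold.
    return 1 if sum(excitatory) >= theta else 0


def or_gate(x1, x2, x3):
    return mp_neuron([x1, x2, x3], [], theta=1)


def x1_and_not_x3(x1, x3):
    return mp_neuron([x1], [x3], theta=1)


def nor_gate(x1, x2):
    # All inputs are inhibitory and the threshold is 0.
    return mp_neuron([], [x1, x2], theta=0)


def not_gate(x):
    return mp_neuron([], [x], theta=0)


for x in (0, 1):
    print("NOT", x, "=", not_gate(x))
```

Note that the thresholds are chosen by hand for every function, which is exactly the limitation raised later for the M-P neuron.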
Geometric representation of M-P Neuron
Continue…
• OR Function
• The plane that satisfies the decision boundary equation x_1 + x_2 + x_3 = 1 is
shown below:
Limitations Of M-P Neuron
• What about non-boolean (say, real) inputs?
• Do we always need to hand code the threshold?
• Are all inputs equal? What if we want to assign
more importance to some inputs?
• What about functions which are not linearly
separable? Say XOR function.
PERCEPTRON
• To overcome the limitations of the M-P neuron, Frank Rosenblatt, an
American psychologist, proposed the classical perceptron model, the
mighty artificial neuron, in 1958. It is a more generalized
computational model than the McCulloch-Pitts neuron, in which weights
and thresholds can be learnt over time.
• A perceptron is a single-layer neural network, which looks like this:
PERCEPTRON
• Now, let’s see what each of these nodes and connections means.
• INPUT NODE:
• The blue nodes are the input nodes. Each such node represents a feature of the input
problem. In the above example, we have m input features. The i-th input feature is
denoted by x_{i}. This layer is called the input layer. In neural networks, we can
only have real numbers as inputs.
• WEIGHT LAYER
• Then, we have the red arrows which connect the input nodes to the orange node.
These are called the synapses. Each one of these synapses has a weight attached to
it, which is denoted by w_{i} i.e. the i-th synapse has the weight w_{i}. We can
simply call this the weights layer.
PERCEPTRON Continue…
• Output Node
• Then, we have the orange node, which says y on it. It is the output
node. It calculates a score based on the inputs and the weights and
then uses an activation function to get a prediction of
either 0 or 1 based on that score.
• The red part of the equation simply goes over m input features and
sums up the products between the input features ( x_{i} ) and the
weights ( w_{i} ). And, this is how we calculate the scores for the
given input features.
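As a rough sketch of the score and the prediction, the snippet below sums the products w_{i} x_{i} and applies a step activation. The function name, the threshold argument and the example weights are hypothetical choices for illustration.

```python
def perceptron_predict(x, w, threshold=0.0):
    # Score: sum of products between the input features and their weights.
    score = sum(w_i * x_i for w_i, x_i in zip(w, x))
    # Step activation: predict 1 when the score reaches the threshold, else 0.
    return 1 if score >= threshold else 0


# Hypothetical weights and threshold that realise a two-input AND.
w_and = [1.0, 1.0]
outputs = [perceptron_predict([a, b], w_and, threshold=1.5)
           for a in (0, 1) for b in (0, 1)]
print(outputs)  # [0, 0, 0, 1]
```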
Visualization of the equation
• Scores from Inputs
• The w vector should make an angle of less than 90 degrees with the positive
example data vectors (x∈P) and an angle of more than 90 degrees with
the negative example data vectors (x∈N). It should look something
like this:
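In code, an angle below 90 degrees means a positive dot product w·x, and an angle above 90 degrees means a negative one. The sketch below is one common form of the perceptron learning algorithm built on that idea: add a misclassified positive example to w, subtract a misclassified negative one. The function names, the constant bias feature and the max_epochs cap are assumptions for this example.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_perceptron(P, N, max_epochs=1000):
    w = [0.0] * len((P + N)[0])
    for _ in range(max_epochs):
        mistakes = 0
        for x in P:
            if dot(w, x) <= 0:      # angle with a positive example is >= 90 degrees
                w = [wi + xi for wi, xi in zip(w, x)]
                mistakes += 1
        for x in N:
            if dot(w, x) >= 0:      # angle with a negative example is <= 90 degrees
                w = [wi - xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:           # every example is on the correct side
            break
    return w


# OR function; the leading 1 in every vector is a constant bias feature (an assumption).
P = [[1, 0, 1], [1, 1, 0], [1, 1, 1]]
N = [[1, 0, 0]]
w = train_perceptron(P, N)
print(w)
```

The loop only terminates early when the data are linearly separable, which is why the cap on epochs is there.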
Perceptron Learning Algorithm Continue…
• Next, the resultant matrix of the linear model is put into a non-linear
function (in deep learning, this non-linear function is known as the
activation function and is most commonly the ReLU function).
• In total, the effect of the linear and activation functions is to generate a new
perspective on the data. Data scientists call this layer-by-layer process of
matrix multiplication followed by non-linear activation
functions "transforming the feature space".
How does a neural network overcome the limitations that a
single perceptron has in learning XOR? Continue…
• In deep learning, we continue transforming the
feature space until a linear regression is able to
model the data (one of the reasons neural
networks are known as universal function
approximators). In the case of complex imagery
analysis, this may take dozens of layers. In the
case of the XOR feature space, transforming it until
it can be modeled by a regression takes 2
layers.
• With the correct set of weights generated
via backpropagation, the above network can
effectively remap the feature space and recreate
the XOR function.
[Figure: The minimum neural network required to learn the XOR function]
• The illustration on next slide shows the
transformation of the input feature space into a
feature space that can then produce the correct
XOR output via a linear function.
How does a neural network overcome the limitations that a
single perceptron has in learning XOR? Continue…
The math of a 2-layer neural net learning XOR. Adapted from example in “Deep Learning”
by Goodfellow, Bengio,
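The same calculation can be checked with a short sketch. The parameter values below are the hand-picked ones from the XOR example in the “Deep Learning” book (one ReLU hidden layer followed by a linear output); the variable and function names are assumptions for this example.

```python
def relu(v):
    return [max(0.0, a) for a in v]


W = [[1.0, 1.0], [1.0, 1.0]]  # input-to-hidden weights
c = [0.0, -1.0]               # hidden biases
w = [1.0, -2.0]               # hidden-to-output weights
b = 0.0                       # output bias


def hidden(x):
    # Transformed feature space: h = ReLU(W^T x + c)
    return relu([sum(W[i][j] * x[i] for i in range(2)) + c[j] for j in range(2)])


def xor_net(x):
    # The output is a purely linear function of the hidden features.
    return sum(wj * hj for wj, hj in zip(w, hidden(x))) + b


for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, hidden(x), xor_net(x))
```

Both [0, 1] and [1, 0] map to the same hidden point, which is what makes the transformed space linearly separable.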
Multilayer perceptron
• Multilayer perceptrons, also known as deep
feedforward networks, are the foundation of most deep
learning models.
• CNNs are a specialized kind of feedforward network, and RNNs
extend feedforward networks with feedback connections. These
networks are mostly used for supervised machine learning tasks.
Multilayer perceptron continue…
• A Multi Layer Perceptron (MLP) contains one or more hidden layers
(apart from one input and one output layer). While a single layer
perceptron can only learn linear functions, a multi layer perceptron
can also learn non-linear functions.
Multilayer perceptron continue…
• Figure shows a multi layer perceptron with a single hidden layer. Note that all
connections have weights associated with them, but only three weights (w0, w1, w2)
are shown in the figure.
• Input Layer: The Input layer has three nodes. The Bias node has a value of 1. The
other two nodes take X1 and X2 as external inputs (which are numerical values
depending upon the input dataset). As discussed above, no computation is performed
in the Input layer, so the outputs from nodes in the Input layer are 1, X1 and X2
respectively, which are fed into the Hidden Layer.
• Hidden Layer: The Hidden layer also has three nodes with the Bias node
having an output of 1. The output of the other two nodes in the Hidden layer
depends on the outputs from the Input layer (1, X1, X2) as well as the weights
associated with the connections (edges). Figure shows the output calculation for one
of the hidden nodes (highlighted). Similarly, the output from other hidden node can
be calculated. Remember that f refers to the activation function. These outputs are
then fed to the nodes in the Output layer.
• Output Layer: The Output layer has two nodes which take inputs from the Hidden
layer and perform similar computations as shown for the highlighted hidden node.
The values calculated (Y1 and Y2) as a result of these computations act as outputs
of the Multi Layer Perceptron.
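A forward pass through the network described above can be sketched as follows. The weight values are made up for illustration, and the sigmoid is assumed as the activation function f; only the layer sizes (two inputs, two hidden nodes, two outputs, plus bias nodes) come from the description.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, f):
    # Prepend the bias node's constant output of 1, then apply f to each weighted sum.
    with_bias = [1.0] + inputs
    return [f(sum(w * v for w, v in zip(row, with_bias))) for row in weights]


def mlp_forward(x1, x2, hidden_w, output_w):
    hidden = layer([x1, x2], hidden_w, sigmoid)   # outputs of the two hidden nodes
    return layer(hidden, output_w, sigmoid)       # Y1 and Y2


# Hypothetical weights: each row is (w0 for the bias, w1, w2) of one node.
hidden_w = [[0.1, 0.5, -0.4], [-0.3, 0.2, 0.7]]
output_w = [[0.0, 1.0, -1.0], [0.2, -0.5, 0.5]]
y1, y2 = mlp_forward(1.0, 0.0, hidden_w, output_w)
print(y1, y2)
```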
Multilayer perceptron continue…
• The output Y from the neuron is computed as shown in the
Figure. The function f is non-linear and is called the Activation Function. The
purpose of the activation function is to introduce non-linearity into the output of
a neuron. This is important because most real world data is non linear and
we want neurons to learn these non linear representations.
• Every activation function (or non-linearity) takes a single number and performs
a certain fixed mathematical operation on it. There are several activation
functions you may encounter in practice:
– Sigmoid: takes a real-valued input and squashes it to range between 0 and 1
σ(x) = 1 / (1 + exp(−x))
– tanh: takes a real-valued input and squashes it to the range [-1, 1]
tanh(x) = 2σ(2x) − 1
– ReLU: ReLU stands for Rectified Linear Unit. It takes a real-valued input
and thresholds it at zero (replaces negative values with zero)
f(x) = max(0, x)
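The three formulas translate directly into Python. This is a small sketch for scalar inputs; tanh is deliberately written through the sigmoid identity given above so the relation can be checked against math.tanh.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    # Uses the identity tanh(x) = 2 * sigmoid(2x) - 1.
    return 2.0 * sigmoid(2.0 * x) - 1.0


def relu(x):
    return max(0.0, x)


for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x))
```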
Activation function
• What is an Activation Function?
It is simply a function that you apply to get the output of a node.
It is also known as a Transfer Function.
• Why do we use Activation Functions with Neural Networks?
It determines the output of a neural network, such as yes or
no. It maps the resulting values into a range such as 0 to 1 or -1 to 1
(depending upon the function).
• The Activation Functions can be basically divided into 2
types-
– Linear Activation Function
– Non-linear Activation Functions
Activation function continue…
• Linear or Identity Activation Function
• As you can see, the function is a straight line (linear). Therefore, the output of
the function is not confined to any range.
• As you can see, the ReLU is half rectified (from bottom). f(z) is zero when z is less
than zero and f(z) is equal to z when z is above or equal to zero.
• Range: [ 0 to infinity)
• The function and its derivative both are monotonic.
• But the issue is that all negative values become zero immediately, which
decreases the ability of the model to fit or train from the data properly. Any
negative input to the ReLU activation function is mapped to zero, so the
function carries no information about how negative the input was and passes
back no gradient for it.
Activation function continue…
4. Leaky ReLU
• It is an attempt to solve the dying ReLU problem
• The leak helps to increase the range of the ReLU function. Usually,
the value of a is 0.01 or so.
• When a is not fixed but sampled at random during training, it is called Randomized ReLU.
• Therefore the range of the Leaky ReLU is (-infinity to infinity).
• Both Leaky and Randomized ReLU functions are monotonic in
nature. Their derivatives are also monotonic.
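A minimal sketch of the Leaky ReLU and its derivative, assuming the usual definition f(x) = x for x ≥ 0 and f(x) = a·x otherwise. The function names are chosen for this example.

```python
def leaky_relu(x, a=0.01):
    # Positive inputs pass through; negative inputs are scaled by the small slope a.
    return x if x >= 0 else a * x


def leaky_relu_grad(x, a=0.01):
    # The slope is 1 on the positive side and a (not 0) on the negative side,
    # so a negative input still passes back a small gradient.
    return 1.0 if x >= 0 else a


for x in (-5.0, 0.0, 5.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```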
Activation function continue…
• Why is the derivative (differentiation) used?
• When updating the curve, the slope tells us in which direction and by how much to
change it. That is why we use
differentiation in almost every part of Machine Learning and Deep Learning.
• The overall cost function (and therefore the gradient) depends on the
mean cost function calculated on all of the m training samples (x^(z)
and y^(z) refer to each training sample pair).
Gradient Descent Continue…
• Is this the best way of doing things? Batch gradient descent is good because the
training progress is nice and smooth – if you plot the average value of the cost
function over the number of iterations / epochs it will look something like this:
• The line is mostly smooth and predictable. However, a problem with batch gradient
descent in neural networks is that for every gradient descent update in the
weights, you have to cycle through every training sample. For big data sets i.e. >
50,000 training samples, this can be time prohibitive.
Gradient Descent Continue…
• The main advantages:
– We can use a fixed learning rate during training without worrying about learning
rate decay.
– It has a straight trajectory towards the minimum and it is guaranteed to
converge in theory to the global minimum if the loss function is convex and to
a local minimum if the loss function is not convex.
– It has an unbiased estimate of gradients. The more the examples, the lower the
standard error.
• The main disadvantages:
– It requires the loading of the whole dataset into memory, which can be
problematic for big data sets
– Batch gradient descent can’t be efficiently parallelized (compared to the
techniques about to be presented) – this is because each update in the weight
parameters requires a mean calculation of the cost function over all the training
samples.
– The smooth nature of the decreasing cost function makes it more likely that
neural network training gets stuck in a local minimum, and less likely
that a global minimum of the cost function will be found.
Gradient Descent Continue…
• Stochastic Gradient Descent
• In Batch Gradient Descent we were considering all the examples for every step of
Gradient Descent.
• But what if our dataset is very large? Deep learning models crave data: the
more data, the better the chances that a model is good. Suppose our dataset has 5
million examples; then just to take one step the model has to calculate the
gradients of all 5 million examples. This is not efficient. To
tackle this problem we have Stochastic Gradient Descent. In Stochastic Gradient
Descent (SGD), we consider just one example at a time to take a single step. We do
the following steps in one epoch for SGD:
1) Take an example
2) Feed it to Neural Network
3) Calculate its gradient
4) Use the gradient we calculated in step 3 to update the weights
5) Repeat steps 1–4 for all the examples in training dataset
• Since we are considering just one example at a time the cost will fluctuate over the
training examples and it will not necessarily decrease. But in the long run, you will
see the cost decreasing with fluctuations.
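The five steps can be sketched on a toy problem. To keep the example self-contained, the "network" here is assumed to be a one-dimensional linear model y = w·x + b trained with squared error. The data, learning rate and function name are illustrative choices, not part of the original material.

```python
import random


def sgd_epoch(data, w, b, lr):
    """One epoch of SGD on a 1-D linear model with squared error."""
    random.shuffle(data)                 # visit the examples in random order
    for x, y in data:                    # 1) take an example
        pred = w * x + b                 # 2) feed it to the model
        err = pred - y
        grad_w, grad_b = err * x, err    # 3) gradient of 0.5 * err**2
        w -= lr * grad_w                 # 4) update the weights
        b -= lr * grad_b
    return w, b                          # 5) the loop has covered all examples


random.seed(0)
# Noise-free samples of y = 2x + 1 on [-1, 1].
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
w, b = 0.0, 0.0
for _ in range(200):
    w, b = sgd_epoch(data, w, b, lr=0.1)
print(w, b)
```

Because the toy data contain no noise, the parameters settle close to w = 2 and b = 1. With noisy data they would keep fluctuating around the minimum.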
Gradient Descent Continue…
• Also, because the cost fluctuates so much, it will never settle exactly at the minimum but will
keep dancing around it.
• SGD can be used for larger datasets. It converges faster when the dataset is large
as it causes updates to the parameters more frequently.
• As you can see in the figure above, SGD is noisy. That is because it responds to
the effects of each and every sample, and the samples themselves will no doubt
contain an element of noisiness. While this can be a benefit in that it can act to
“kick” the gradient descent out of local minimum values of the cost function, it
can also hinder it settling down into a good minimum. This is why, in the figure,
batch gradient descent eventually outperforms SGD after 1000 iterations.
Gradient Descent Continue…
• Mini Batch Gradient Descent
• Batch Gradient Descent gives smoother cost curves and converges directly to the
minimum. SGD suits large datasets, where it converges faster. But since SGD uses
only one example at a time, we cannot use a vectorized implementation with it, which
can slow down the computations. To tackle this problem, a mixture of Batch Gradient
Descent and SGD is used.
• We use neither the whole dataset at once nor a single example at a time.
We use a batch of a fixed number of training examples, smaller than the
actual dataset, and call it a MINI-BATCH. Doing this helps us achieve the
advantages of both the former variants. So, after creating the mini-batches of
fixed size, we do the following steps in one epoch:
1. Pick a mini-batch
2. Feed it to Neural Network
3. Calculate the mean gradient of the mini-batch
4. Use the mean gradient we calculated in step 3 to update the weights
5. Repeat steps 1–4 for the mini-batches we created
• Just like SGD, the average cost over the epochs in mini-batch gradient descent
fluctuates because we are averaging a small number of examples at a time.
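A comparable sketch for mini-batch gradient descent, again assuming a one-dimensional linear model with squared error as a stand-in for the network. The batch size of 5, the learning rate and the function name are arbitrary illustrative choices.

```python
import random


def minibatch_epoch(data, w, b, lr, batch_size):
    random.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]          # 1) pick a mini-batch
        errs = [(w * x + b) - y for x, y in batch]      # 2) feed it forward
        # 3) mean gradient of 0.5 * err**2 over the mini-batch
        grad_w = sum(e * x for e, (x, _) in zip(errs, batch)) / len(batch)
        grad_b = sum(errs) / len(batch)
        w -= lr * grad_w                                # 4) update the weights
        b -= lr * grad_b
    return w, b                                         # 5) all mini-batches used


random.seed(0)
# Noise-free samples of y = 2x + 1 on [-1, 1].
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
w, b = 0.0, 0.0
for _ in range(500):
    w, b = minibatch_epoch(data, w, b, lr=0.1, batch_size=5)
print(w, b)
```

Setting batch_size to 1 recovers SGD, and setting it to len(data) recovers batch gradient descent.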
Gradient Descent Continue…
• Mini-batch gradient descent is a trade-off between stochastic gradient
descent and batch gradient descent. In mini-batch gradient descent, the
cost function (and therefore gradient) is averaged over a small number
of samples, from around 10-500. This is opposed to the SGD batch size
of 1 sample, and the BGD size of all the training samples. It looks like
this:
• What’s the benefit of doing it this way? First, it smooths out some of the
noise in SGD, but not all of it, thereby still allowing the “kick” out of
local minimums of the cost function. Second, the mini-batch size is still
small, thereby keeping the performance benefits of SGD.
Gradient Descent Continue…
• How it performs with a mini-batch size of 100 samples:
Figure 1 is the error of different hypotheses. For a linear unit with two weights, the
hypothesis space H is the w0, w1 plane. The vertical axis indicates the error of the
corresponding weight vector hypothesis relative to a fixed set of training examples. The
arrow shows the negated gradient at one particular point, indicating the direction in the
w0, w1 plane producing steepest descent along the error surface.
Gradient Descent and the Delta Rule Continue…
• The perceptron training rule is guaranteed to converge to a 0-error hypothesis
after a finite number of iterations if
– Training examples are linearly separable (i.e., no noise)
– Sufficiently small learning rate η
• The advantages of Delta Rule over Perceptron Training Rule
include:
– Guaranteed to always converge to a hypothesis with minimum
squared error (with a small learning rate)
– Allows for noise in the data
– Allows for non-separable functions
• The delta training rule is best understood as training an unthresholded
perceptron, which is a linear unit with output o given as follows:
Gradient Descent and the Delta Rule Continue…
• Training error of a hypothesis relative to the training
examples is as follows:
Equation ……1
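The sketch below assumes that the linear unit's output is o = w·x and that Equation 1 is the usual squared training error E(w) = ½ Σ_d (t_d − o_d)². Under those assumptions, one gradient descent step of the delta rule changes each weight by η Σ_d (t_d − o_d) x_id. The function names and toy data are illustrative.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def training_error(w, examples):
    # E(w) = 1/2 * sum over training examples of (t - o)^2, with o = w . x
    return 0.5 * sum((t - dot(w, x)) ** 2 for x, t in examples)


def delta_rule_step(w, examples, eta):
    delta = [0.0] * len(w)
    for x, t in examples:
        o = dot(w, x)                        # unthresholded (linear) output
        for i, xi in enumerate(x):
            delta[i] += eta * (t - o) * xi   # accumulate over all examples
    return [wi + di for wi, di in zip(w, delta)]


# Targets follow t = 1 + 2*x1; the leading 1.0 is a constant bias input (an assumption).
examples = [([1.0, float(x1)], 1.0 + 2.0 * x1) for x1 in range(4)]
w = [0.0, 0.0]
e_start = training_error(w, examples)
w_one = delta_rule_step(w, examples, eta=0.05)
for _ in range(500):
    w = delta_rule_step(w, examples, eta=0.05)
print(w, training_error(w, examples))
```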
• The figure shows a speech recognition task that involves distinguishing among 10
possible vowels spoken in the context of “h_d”
• Input: Parameters F1 and F2 obtained from spectral analysis of the sound
• Output: 10 network outputs corresponding to the 10 possible vowel sounds
Multilayer Networks and the Backpropagation
Algorithm Continue…
• Sigmoid threshold unit:
– Multiple layers of linear units still produce only linear functions
– The solution is the sigmoid unit, a unit similar to the perceptron but
based on a smooth, differentiable threshold function
– The output of a sigmoid unit is a nonlinear function of its inputs, but
it is still differentiable
Where,
» o is the output value
» t is the target value and y = Σ_i w_i x_i
• We know that y=f(w) and o=g(y)
– Where f is activation function and g is sigmoid function
• We can have the following function:
• The greater the number of training iterations, the lower the error on the training set
• Architecture within the fixed weight is:
Multilayer Networks and the Backpropagation
Algorithm Continue…
• Error Gradient for Network
• Let ,
• How does the error depend on y5 ?
❖ If an error occurs at the output, propagate it back until the input nodes are reached, updating the weights
❖ If the training result fits the data perfectly, it does not mean that the network generalizes well
❖ In a neural network we do not know how it will generalize
❖ If the data set is big enough it contains some noise, even when the network fits the training data very well
❖ It may not minimize the error on the whole distribution
❖ Gradient descent is a very general method; it is applied to neural networks
❖ It is also applied to methods such as decision trees
Multilayer Networks and the Backpropagation Algorithm
Continue…
• The backpropagation algorithm learns the weights for a multilayer
network given a fixed network of units and interconnections by
employing gradient descent to minimize the squared error between the
network output and target values for those outputs.
• Because these networks have multiple output units, E must be redefined
to sum errors over all network output units:
– Δwji(n): weight update performed during the nth iteration through the main
loop of the algorithm
– α: 0≤α<1, momentum
– Δwji(n-1): the weight update that occurred during the (n-1)th iteration
Multilayer Networks and the Backpropagation Algorithm
Continue…
• Learning in Arbitrary Acyclic Networks
• The backpropagation algorithm can be easily extended to feedforward
networks of an arbitrary depth.
• The change in the weight update rule is such that when computing δ
values, the δr value for a unit r in a layer m is computed from the δ at
the next deeper layer m+1 according to
• This step may be repeated for any number of hidden layers in the
network. Furthermore, it can be generalized to any acyclic network
regardless of how the network units are arranged. The general rule for
calculating δ for any internal (non-output) unit is
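Assuming sigmoid units and the squared error, this rule takes the standard form δ_r = o_r (1 − o_r) Σ_s w_sr δ_s, where s ranges over the units that r feeds into. The sketch below computes the δ values for a small network with one hidden layer. The network size, weights and function names are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W_h, W_o):
    # Each weight row starts with a bias weight for a constant input of 1.
    h = [sigmoid(sum(w * v for w, v in zip(row, [1.0] + x))) for row in W_h]
    o = [sigmoid(sum(w * v for w, v in zip(row, [1.0] + h))) for row in W_o]
    return h, o


def error(x, t, W_h, W_o):
    _, o = forward(x, W_h, W_o)
    return 0.5 * sum((tk - ok) ** 2 for tk, ok in zip(t, o))


def backprop_deltas(x, t, W_h, W_o):
    h, o = forward(x, W_h, W_o)
    # Output units: delta_k = o_k (1 - o_k) (t_k - o_k)
    d_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    # Hidden units: delta_r = o_r (1 - o_r) * sum over downstream s of w_sr * delta_s
    d_h = [hr * (1 - hr) * sum(W_o[s][r + 1] * d_o[s] for s in range(len(d_o)))
           for r, hr in enumerate(h)]
    return d_o, d_h


# Hypothetical 2-2-1 network and a single training example.
W_h = [[0.1, 0.4, -0.6], [-0.2, 0.3, 0.5]]
W_o = [[0.05, -0.7, 0.8]]
x, t = [1.0, 0.5], [1.0]
d_o, d_h = backprop_deltas(x, t, W_h, W_o)
print(d_o, d_h)
```

Each weight w_ji is then changed by η δ_j x_ji, where x_ji is the input that the weight carries into unit j.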
Multilayer Networks and the Backpropagation Algorithm
Continue…
• Comments on Backpropagation:
• Gradient descent over entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global, error minimum. (However, it
does tend to perform well in practice by running multiple times)
• Often include weight momentum α in Δwji(n) = ηδjxji + αΔwji(n-1)
• Minimizes error over training examples
• Training can take thousands of iterations → slow.
• Using network after training is very fast.
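The momentum rule Δwji(n) = ηδjxji + αΔwji(n-1) can be sketched for a single weight as below. The function name and the constant gradient term are assumptions chosen to make the effect visible.

```python
def momentum_update(w, grad_term, prev_delta, eta=0.1, alpha=0.9):
    """grad_term stands for delta_j * x_ji; prev_delta is the update from iteration n-1."""
    delta = eta * grad_term + alpha * prev_delta
    return w + delta, delta


# With a constant gradient term the step keeps growing, like a ball gaining speed.
w, delta = 0.0, 0.0
steps = []
for _ in range(3):
    w, delta = momentum_update(w, grad_term=1.0, prev_delta=delta)
    steps.append(delta)
print(steps)
```

With α = 0 the rule reduces to the plain update ηδjxji.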
Multilayer Networks and the Backpropagation Algorithm
Continue…
• Efficiency
– For training the network, the number of iterations is very important.
If it is too small, the errors will be high. If it is too large,
overfitting may occur and cause the errors to rise again.
– Learning: Intractable in general. Training can take thousands of
iterations, which will be very slow. Learning net with single
hidden unit is NP-hard.
– In practice, backpropagation is very useful.
– It will be very fast to classify examples using network after
training.
– Very general in terms of “patterns”
Multilayer Networks and the Backpropagation Algorithm
Continue…
• Efficiency
– Advantages of Backpropagation
• Robust to noise in the data
• Handles continuous value
• Very general
– Disadvantages of Backpropagation
• No Human Interpretation
• It is not known how the network generates its output
Convergence and Local Minima
• Backpropagation is only guaranteed to converge to a local, and not
a global, minimum
• Since each weight in a network corresponds to a different
dimension in the error space, a local minimum with respect to one
weight may not be a local minimum with respect to other weights
• This can provide an “escape route” from becoming trapped in local
minima
• If the weights are initialized to values close to zero, the sigmoid
threshold function is approximately linear and so they produce
linear outputs
• As the weights grow, though, the network is able to represent more
complex functions that are not linear in nature
• By the time the weights have grown enough to represent such functions, one
hopes they are already close enough to the global minimum that even becoming
stuck in a local minimum will be acceptable.
Convergence and Local Minima Continue…
• Common heuristic methods to reduce the problem of local
minima are:
– Add a momentum term to the weight-update rule
– Use stochastic gradient descent rather than true gradient
descent
– Train multiple networks using the same training data but-
• initialize the networks with different random weights
• If the different networks lead to different local minima,
choose the network that performs best on a validation
set of data or all networks can be kept and treated as a
committee whose output is the (possibly weighted)
average of individual network outputs
Representational Power of Feedforward Networks
• Functions that can be represented:
– Boolean function:
• Every Boolean function can be represented by some network with
exactly 2 layers of units, though the number of hidden units can grow
exponentially with the number of inputs.
– Continuous function:
• Every bounded continuous function can be approximated by a
network with 2 layers of units
• the hidden layer contains sigmoid units while the output layer
consists of linear units
• The number of hidden units will depend on the function to be
approximated.
– Arbitrary function:
• Any function can be approximated with a network of 3 layers of units
• The output uses linear units
• the hidden layers use sigmoid units
• The number of units at each layer is not known
Hypothesis Space Search and Inductive Bias
• Hypothesis Space Search:
– In backpropagation, every possible assignment of network
weights represents a distinct hypothesis.
– This space is continuous, unlike the discrete representations such
as decision trees.
– The hypothesis space is then an n-dimensional Euclidean space of
the n network weights.
• Inductive Bias:
– In backpropagation, the inductive bias can be described as “smooth
interpolation between data points.”
– In other words, given two positive data points with no negative
examples between them, backpropagation tends to label the points
in between as positive as well.
Hidden Layer Representation
• As training examples provide only inputs and target outputs, the
network is free to set the internal (hidden) weights in whatever way minimizes the
error.
• The network can capture properties that are not explicit in the input
representation
• This allows networks to create new features not explicitly introduced
by the designer, though these features must still be computable as
sigmoid function units.
• ANNs are used in general to approximate functions.
• However not every artificial neural network can approximate every
function. There are limitations due to the architecture chosen, the
number of layers and the hidden units per layer.
• For example, a network with just one hidden unit cannot learn the
8-input identity function. The network needs at least three hidden units to
learn this function since there are 8 inputs and 8 outputs, 2^3 = 8.
Overfitting, and Stopping Criterion
• Stopping Criterion
– A common stopping condition for backpropagation is to continue to
run until the error falls below some threshold.
– This has the tendency to overfit the training examples. This
occurs because the network tunes the weights to fit the
training data explicitly and not the general distribution of the
examples.
– Because the weights start out random, overfitting tends to occur in
later iterations
– As the network runs, the complexity of the learned decision
surface increases in response to weights being changed. This
results in a learned function that can include noise.
– If the training data contains a trend or tendency that doesn’t
exist in the actual data, the ANN is likely to learn that trend if
it is given a sufficiently long training period.
Overfitting, and Stopping Criterion Continue…
Smaller Filter Sizes | Larger Filter Sizes
1. Two 3x3 kernels result in an image size reduction by 4 | One 5x5 kernel results in the same reduction.
3. So, we get a lower no. of weights but more layers. | Higher number of weights but fewer layers.
5. With more layers, it learns complex, more non-linear features | With fewer layers, it learns simpler non-linear features.
6. With more layers, it necessitates the need for larger memory | And, it will use less memory for backpropagation.
Comparison Between Smaller and Larger Filter Continue…
Smaller Filter Sizes | Larger Filter Sizes
7. It has a smaller receptive field as it looks at very few pixels at once. | Larger receptive field per layer.
9. Highly local features extracted without much image overview. | Quite generic features extracted spread across the image.
10. Therefore captures smaller, complex features in the image | Therefore captures the basic components in the image.
11. Amount of information extracted will be vast, maybe useful in later layers | Amount of information extracted is considerably less
12. Slow reduction in the image dimension can make the network deep | Fast reduction in the image dimension makes the network shallow
13. Better weight sharing | Poorer weight sharing
In an extreme scenario, using a 1x1 convolution is like treating each pixel as a useful feature. | Using an image-sized filter is equivalent to a fully connected layer.
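The comparison of two stacked 3x3 kernels with one 5x5 kernel can be checked with simple arithmetic. The sketch assumes stride 1, no padding, a single channel and no bias terms. The function name and the 32-pixel input size are illustrative.

```python
def stacked_conv(size, kernel_sizes):
    """Return (output size, weight count) after stacking unpadded, stride-1 convolutions."""
    weights = 0
    for k in kernel_sizes:
        size -= k - 1        # each k x k kernel shrinks the image by k - 1 pixels
        weights += k * k     # single channel, no bias
    return size, weights


print(stacked_conv(32, [3, 3]))  # (28, 18)
print(stacked_conv(32, [5]))     # (28, 25)
```

Both options shrink a 32x32 image to 28x28, but the two 3x3 kernels use 18 weights where the single 5x5 kernel uses 25.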
Working of Convolution Layer Continue…
• Strides
• Stride is the number of pixels by which the filter shifts over the input matrix.
• When the stride is 1, we move the filter 1 pixel at a time.
When the stride is 2, we move the filter 2 pixels at a time, and
so on. The figure below shows how convolution works with a stride
of 2.
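The effect of the stride can also be sketched in plain Python. The example assumes no padding and uses a 5x5 input with a 3x3 filter of ones; these sizes and the function name are arbitrary choices for illustration.

```python
def conv2d(image, kernel, stride=1):
    """Unpadded 2-D convolution as used in CNNs (the kernel is not flipped)."""
    n_rows, n_cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = []
    # The filter's top-left corner moves `stride` pixels at a time.
    for r in range(0, n_rows - k_rows + 1, stride):
        row = []
        for c in range(0, n_cols - k_cols + 1, stride):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(k_rows) for j in range(k_cols)))
        out.append(row)
    return out


image = [[r * 5 + c for c in range(5)] for r in range(5)]   # 5x5 input, values 0..24
kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]                  # 3x3 filter of ones

out1 = conv2d(image, kernel, stride=1)   # 3x3 output
out2 = conv2d(image, kernel, stride=2)   # 2x2 output
print(out2)  # [[54, 72], [144, 162]]
```

A larger stride visits fewer positions, so the output shrinks from 3x3 to 2x2 here.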