DL_Unit_I_&_Unit_II

The presentation covers the fundamentals of deep learning, focusing on neural networks, their history, and key concepts such as the McCulloch-Pitts neuron and the perceptron model. It discusses the limitations of early models and introduces multilayer perceptrons and convolutional neural networks, emphasizing their ability to learn complex functions. The document also outlines the perceptron learning algorithm and its application in decision-making tasks.

A Presentation on

Deep Learning
UNIT I: Fundamentals of Deep Learning
and
UNIT II: Convolutional Neural Network

Prepared By:
Dr. Bhandare T. V.
(Dept. IT)
Outline
• What is neural network?
• History
– ANN : motivation from biological neuron
• McCULLOCH-PITTS NEURON
– Boolean Functions Using M-P Neuron
– Geometric representation of M-P Neuron
– Limitations Of M-P Neuron
• PERCEPTRON
– Visualization of the equation
– Perceptron model
– How it operates?
– Perceptron vs McCulloch-Pitts Neuron
– Boolean Functions Using Perceptron
– Perceptron Learning Algorithm
– Minsky's "And / Or" Theorem: A Single Perceptron's Limitations
– The And / Or Theorem
– How does a neural network overcome the limitations that a single perceptron has in
learning XOR?
Outline
• Multilayer perceptron
– Activation function
– Gradient Descent
– Gradient Descent and the Delta Rule
– Gradient Descent Algorithm for Training a Linear Unit
– Stochastic Gradient Descent Algorithm for Training a Linear Unit
– Difference between Standard Gradient Descent and Stochastic Gradient Descent
– Multilayer Networks and the Backpropagation Algorithm
– Convergence and Local Minima
– Representational Power of Feedforward Networks
– Hidden layer representation
– Generalization, overfitting and stopping criterion
• Convolutional Neural Networks
What is neural network?
• Neural network or Artificial Neural Network (ANN)
is a massively parallel distributed processor made up
of simple processing units, which has a natural
propensity for storing experiential knowledge and
making it available for use.
• A neural network contains a large number of
simple neuron-like processing elements and a large
number of weighted connections between them; the
weights encode the knowledge of the network. Though biologically
inspired, many neural network models
do not duplicate the operation of the
human brain.
History
1871-1873
• Reticular Theory
Joseph von Gerlach proposed that the
nervous system is a single
continuous network, as opposed to a network of
many discrete cells.
• Staining Technique
Camillo Golgi discovered a chemical reaction that
allowed him to examine a nervous tissue in much
greater detail than ever before.
He was a proponent of Reticular Theory.
History (Continue…)
1888-1891
• Neuron Doctrine
Santiago Ramón y Cajal used Golgi’s technique to study the
nervous system and proposed that it is actually made up of
discrete individual cells forming a network (as opposed to a single
continuous network).
• The Term Neuron
The term neuron was coined by Heinrich Wilhelm Gottfried
von Waldeyer-Hartz around 1891.
He further consolidated the Neuron Doctrine.
• 1950s: The Final Word
In the 1950s, electron microscopy finally confirmed the neuron
doctrine by unambiguously demonstrating that nerve cells
are individual cells interconnected through synapses (a
network of many individual neurons).
History (Continue…)
• McCulloch-Pitts Neuron
The first computational model of a neuron was
proposed by Warren McCulloch (neuroscientist) and
Walter Pitts (logician) in 1943.
History (Continue…)
• Perceptron
• Rosenblatt (1958) proposed the perceptron as the first model
for learning with a teacher (i.e., supervised learning). It was
suggested that the perceptron might eventually be able to learn,
make decisions and translate languages.
• “The embryo of an electronic computer that [the Navy]
expects will be able to walk, talk, see, write, reproduce itself
and be conscious of its existence.”- New York Times
History (Continue…)
• Further refined and carefully analyzed by Minsky and
Papert (1969) — their model is referred to as
the perceptron model.
• Backpropagation
– Discovered and rediscovered several times throughout the 1960s
and 1970s
– Werbos (1982) first used it in the context of artificial neural
networks
– Eventually popularized by the work of Rumelhart et al. in 1986.
• Unsupervised pre-learning
• Success in handwriting recognition
• Success in speech recognition
• First superhuman pattern recognition
• So on…
ANN : motivation from biological neuron
• The most fundamental unit of
deep neural network is called
artificial neuron
• Why is it called neuron?
• The idea comes from biology
such as:
• In biology:
– Biological neurons= neural
cells= neural processing units
• The basic question that arises is:
What is a biological
neuron and how does it work?
ANN : motivation from biological neuron
(Continue…)

Dendrites: Receive signals from other neurons
Synapse: Point of connection to other neurons
Soma: Processes the information
Axon: Transmits the output of this neuron
ANN : motivation from biological neuron
(Continue…)
McCULLOCH-PITTS NEURON
The first computational
model of a neuron was
proposed by Warren
McCulloch (neuroscientist)
and Walter Pitts (logician) in
1943.
It may be divided into 2
parts. The first part, g, takes
an input (much like a
dendrite) and performs an
aggregation. Based on
the aggregated value, the
second part, f, makes a
decision.
McCULLOCH-PITTS NEURON Continue…
• Let’s suppose that I want to predict my own decision, whether to watch a random football
game on TV or not. The inputs are all boolean i.e., {0,1} and my output variable is also
boolean {1: Will watch it, 0: Won’t watch it}.
So, x_1 could be isPremierLeagueOn (I like Premier League more)
x_2 could be isItAFriendlyGame (I tend to care less about the friendlies)
x_3 could be isNotHome (Can’t watch it when I’m running errands. Can I?)
x_4 could be isManUnitedPlaying (I am a big Man United fan. GGMU!) and so on.
• These inputs can either be excitatory or inhibitory.

• g(x) simply sums the inputs: a simple aggregation.
• f compares that sum with theta, which is called the thresholding parameter.
• This is called Thresholding Logic.
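The aggregation and thresholding steps described on this slide can be written as a short formula. This is a sketch reconstructed from the surrounding text (the sum g, the decision f and the threshold theta); the symbol n for the number of inputs is an assumption, and the inhibitory rule is stated separately because it overrides the sum.

```latex
g(x_1, \dots, x_n) = \sum_{i=1}^{n} x_i
\qquad
y = f\bigl(g(x)\bigr) =
\begin{cases}
1 & \text{if } g(x) \ge \theta \\
0 & \text{if } g(x) < \theta
\end{cases}
\qquad
\text{and } y = 0 \text{ whenever any inhibitory input is } 1 .
```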
Boolean Functions Using M-P Neuron
• Now let’s look at how this very neuron can be used to represent a few
boolean functions. Our inputs are all boolean and the output is also
boolean so essentially, the neuron is just representing a boolean
function.
• M-P Neuron: A Concise Representation
• This representation just denotes that, for the boolean inputs x_1, x_2 and x_3,
if g(x), i.e., the sum, is ≥ theta, the neuron will fire; otherwise, it won’t.
Boolean Functions Using M-P Neuron Continue…
• AND Function
• An AND function neuron would only fire when ALL the inputs are
ON i.e., g(x) ≥ 3 here.

• OR Function
• An OR function neuron would fire if ANY of the inputs is ON
i.e., g(x) ≥ 1 here.
Boolean Functions Using M-P Neuron Continue…
• A Function With An Inhibitory Input
• Here, we have an inhibitory input i.e., x_3 so
whenever x_3 is 1, the output will be 0. Keeping that in
mind, we know that x_1 AND !x_3 would output 1 only
when x_1 is 1 and x_3 is 0 so it is obvious that the threshold
parameter should be 1.
Boolean Functions Using M-P Neuron Continue…
• NOR Function
• For a NOR neuron to fire, we want ALL the inputs to be 0 so the
thresholding parameter should also be 0 and we take them all as
inhibitory input.

• NOT Function
• For a NOT neuron, 1 outputs 0 and 0 outputs 1. So we take the input
as an inhibitory input and set the thresholding parameter to 0.
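The boolean functions from the last few slides can be checked with a few lines of Python. This is a minimal sketch, not the presentation's own code: the function name `mp_neuron` and the `inhibitory` argument (positions of the inhibitory inputs) are hypothetical conventions chosen for illustration, while the thresholds are the ones given on the slides.

```python
def mp_neuron(inputs, theta, inhibitory=()):
    """McCulloch-Pitts neuron over boolean (0/1) inputs.

    `inhibitory` lists the positions of inhibitory inputs; this argument
    is a hypothetical convention chosen for this sketch.
    """
    # An active inhibitory input vetoes firing, whatever the other inputs are.
    if any(inputs[i] == 1 for i in inhibitory):
        return 0
    # g(x): aggregate by summing; f: fire when the sum reaches theta.
    return 1 if sum(inputs) >= theta else 0


def and3(x1, x2, x3):
    return mp_neuron((x1, x2, x3), theta=3)   # fires only if all inputs are ON


def or3(x1, x2, x3):
    return mp_neuron((x1, x2, x3), theta=1)   # fires if any input is ON


def x1_and_not_x3(x1, x3):
    return mp_neuron((x1, x3), theta=1, inhibitory=(1,))  # x3 is inhibitory


def nor2(x1, x2):
    return mp_neuron((x1, x2), theta=0, inhibitory=(0, 1))  # all inhibitory


def not1(x):
    return mp_neuron((x,), theta=0, inhibitory=(0,))
```

For example, `and3(1, 1, 0)` gives 0 because the sum 2 stays below the threshold 3, and `not1(1)` gives 0 because the single input is inhibitory.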
Geometric representation of M-P Neuron
Continue…
• OR Function

• The OR function’s thresholding parameter theta is 1.
• The inputs are boolean, so only 4 combinations are possible: (0,0), (0,1), (1,0) and
(1,1). Plotting them on a 2D graph and using the OR function’s
aggregation equation, x_1 + x_2 ≥ 1, we can draw the decision
boundary x_1 + x_2 = 1.
• All the inputs that lie ON or ABOVE that line output 1 when passed through the
OR function M-P neuron, and all the input points that lie BELOW that line
output 0.
Geometric representation of M-P Neuron
Continue…
• AND Function

• In this case, the decision boundary equation is x_1 + x_2 = 2.
Here, the only input point that lies ON or ABOVE the line,
(1,1), outputs 1 when passed through the AND function M-P
neuron. It fits! The decision boundary works!
Geometric representation of M-P Neuron
Continue…
• OR Function With 3 Inputs

• The plane that satisfies the decision boundary equation x_1 + x_2 + x_3 = 1 is
shown below:

Limitations Of M-P Neuron
• What about non-boolean (say, real) inputs?
• Do we always need to hand code the threshold?
• Are all inputs equal? What if we want to assign
more importance to some inputs?
• What about functions which are not linearly
separable? Say XOR function.
PERCEPTRON
• To overcome the limitations of the M-P neuron, Frank Rosenblatt, an
American psychologist, proposed the classical perceptron model
in 1958. It is a more general
computational model than the McCulloch-Pitts neuron, in which weights
and thresholds can be learnt over time.
• A perceptron is a single layer neural network, which looks like this:
PERCEPTRON
• Now, let’s see what each of these nodes and connections means.
• INPUT NODE:
• The blue nodes are the input nodes. Each such node represents a feature of the input
problem. In the above example, we have m input features. The i-th input feature is
denoted by x_{i}. This layer is called the input layer. In neural networks, we can
only have real numbers as inputs.

• WEIGHT LAYER
• Then, we have the red arrows which connect the input nodes to the orange node.
These are called the synapses. Each one of these synapses has a weight attached to
it, which is denoted by w_{i} i.e. the i-th synapse has the weight w_{i}. We can
simply call this the weights layer.
PERCEPTRON Continue…
• Output Node
• Then, we have the orange node, which says y on it. It is the output
node. It calculates a score based on the inputs and the weights and
then uses an activation function to get a prediction of
either 0 or 1 based on that score.

• Formula to make a prediction from the Inputs and Weights

• The red part of the equation simply goes over m input features and
sums up the products between the input features ( x_{i} ) and the
weights ( w_{i} ). And, this is how we calculate the scores for the
given input features.
Visualization of the equation
• Scores from Inputs

• To make a prediction from this calculated score, we
have to use an activation function.
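The score and the prediction step can be sketched in Python as follows. The function names and the example numbers are illustrative assumptions; the step activation with a "greater than or equal" comparison is also an assumption, chosen to match the thresholding logic used for the M-P neuron earlier in the slides.

```python
def score(x, w):
    """Weighted sum over the m input features: sum of w_i * x_i."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


def predict(x, w, threshold=0.0):
    """Step activation: 1 when the score reaches the threshold, else 0."""
    return 1 if score(x, w) >= threshold else 0


# Hypothetical example with m = 3 features.
x = [1.0, 0.0, 2.0]
w = [0.5, -1.0, 0.25]
print(score(x, w))                     # 0.5 + 0.0 + 0.5 = 1.0
print(predict(x, w, threshold=0.75))   # 1, because 1.0 >= 0.75
```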
Perceptron Model
• Frank Rosenblatt, an American psychologist, proposed the classical
perceptron model in 1958. It was further refined and carefully analyzed by Minsky and
Papert (1969); their model is referred to as the perceptron model.

• The perceptron model, as refined by Minsky and Papert, is a more general computational
model than the McCulloch-Pitts neuron. It overcomes some of the limitations of the M-P
neuron by introducing the concept of numerical weights (a measure of importance)
for inputs, and a mechanism for learning those weights. Inputs are no longer
limited to boolean values as in the case of an M-P neuron; real inputs are supported
as well, which makes it more useful and general.
How it operates?
• This is very similar to an M-P neuron, but we take a weighted sum of
the inputs and set the output to one only when the sum is at least an
arbitrary threshold (theta). However, by convention,
instead of hand coding the thresholding parameter theta, we add it as
one of the inputs, with the weight -theta, as shown below.
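The convention of folding the threshold into the inputs can be written out step by step. This is a sketch under the assumption that the neuron fires when the weighted sum is at least theta; the constant input x_0 = 1 and the weight w_0 = -theta are the extra input and weight the slide refers to.

```latex
\sum_{i=1}^{n} w_i x_i \ \ge\ \theta
\;\Longleftrightarrow\;
\sum_{i=1}^{n} w_i x_i - \theta \ \ge\ 0
\;\Longleftrightarrow\;
\sum_{i=0}^{n} w_i x_i \ \ge\ 0,
\qquad \text{where } x_0 = 1,\ w_0 = -\theta .
```

Written this way, the threshold becomes just another weight, so it can be learnt together with the other weights instead of being hand coded.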
How it operates? Continue…
• Consider the task of predicting whether I would
watch a random game of football on TV or not, using the
behavioral data available. Let's assume my
decision is solely dependent on 3 binary inputs
(binary for simplicity).
Perceptron Vs McCulloch-Pitts Neuron

• From the equations, it is clear that even a perceptron separates the
input space into two halves, positive and negative. All the inputs that
produce an output 1 lie on one side (positive half space) and all the
inputs that produce an output 0 lie on the other side (negative half
space).
Boolean Functions Using Perceptron
• OR Function
• It is clear that the solution separates the input space into two spaces,
negative and positive half spaces.
Boolean Functions Using Perceptron Continue…
• XOR Function
• Now let's look at a boolean function that is not linearly separable, i.e., you cannot draw a line
to separate the positive inputs from the negative ones.
• Notice that the fourth equation contradicts the second and the third equations.
The point is, there are no perceptron solutions for data that is not linearly
separable. So the key takeaway is that a single perceptron cannot learn to
separate data that is not linearly separable.
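The four conditions mentioned on this slide can be reconstructed as follows. This is a sketch that assumes the convention that the perceptron outputs 1 when w_0 + w_1 x_1 + w_2 x_2 ≥ 0 (with w_0 = -theta); the original equations were part of a figure, so the exact notation is an assumption.

```latex
\begin{aligned}
(x_1, x_2) = (0,0),\ \text{output } 0 &: \quad w_0 < 0 \\
(x_1, x_2) = (0,1),\ \text{output } 1 &: \quad w_0 + w_2 \ge 0 \\
(x_1, x_2) = (1,0),\ \text{output } 1 &: \quad w_0 + w_1 \ge 0 \\
(x_1, x_2) = (1,1),\ \text{output } 0 &: \quad w_0 + w_1 + w_2 < 0
\end{aligned}
```

Adding the second and third conditions gives 2w_0 + w_1 + w_2 ≥ 0, that is, w_0 + w_1 + w_2 ≥ -w_0. By the first condition -w_0 > 0, so w_0 + w_1 + w_2 > 0, which contradicts the fourth condition. No choice of weights satisfies all four.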
• A perceptron is a single layer neural network, and a
multi-layer perceptron is called a neural network.

• The perceptron consists of 4 parts:
– Input values or one input layer
– Weights and bias
– Net sum
– Activation function
Perceptron Learning Algorithm
• Let's consider the problem of deciding whether to watch a movie or not.
• Suppose we are given a list of m movies and a label (class)
associated with each movie indicating whether the user likes this movie or
not: a binary decision.
• Further suppose we represent each movie with n features (some boolean, some
real valued).
• We will assume that the data is linearly separable and we want a perceptron
to learn how to make this decision.
• In other words, we want the perceptron to find the equation of the
separating plane (or find the values of w0, w1, w2, w3, ..., wn).
Perceptron Learning Algorithm Continue…
• Initialize w with some random vector.
• Then iterate over all the examples in the data (P ∪ N), both
positive and negative examples.
• When x belongs to P, we want w.x > 0, the basic perceptron
rule.
• If x belongs to N, the dot product must be less than 0, i.e.,
w.x < 0.
• When x belongs to P, the angle between w and x should be
less than 90 degrees, because the cosine of the angle is proportional
to the dot product.
Perceptron Learning Algorithm Continue…

• The w vector should make an angle of less than 90 degrees with the positive
example data vectors (x∈P) and an angle of more than 90 degrees with
the negative example data vectors (x∈N). It should look something
like this:
Perceptron Learning Algorithm Continue…

• So when we add x to w, which we do when x belongs to P
and w.x < 0 (Case 1), we are essentially increasing
the cos(alpha) value, which means we are decreasing
alpha, the angle between w and x, which is what we
desire. A similar intuition works for the case when x belongs
to N and w.x ≥ 0 (Case 2), where we subtract x from w.
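The learning procedure described over the last few slides can be sketched in Python. This is a minimal illustration, not the presentation's own code: the function names, the epoch limit, the fixed random seed and the OR training data are assumptions made for the example. Each input vector starts with a constant 1 so that the first weight plays the role of -theta.

```python
import random


def dot(w, x):
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


def train_perceptron(P, N, max_epochs=1000, seed=0):
    """Perceptron learning on positive examples P and negative examples N.

    Assumes the data is linearly separable; `max_epochs` and `seed`
    are hypothetical parameters added for this sketch.
    """
    rng = random.Random(seed)
    w = [rng.uniform(-1.0, 1.0) for _ in range(len(P[0]))]  # random start
    for _ in range(max_epochs):
        mistakes = 0
        for x in P:
            if dot(w, x) < 0:          # Case 1: positive example on wrong side
                w = [w_i + x_i for w_i, x_i in zip(w, x)]
                mistakes += 1
        for x in N:
            if dot(w, x) >= 0:         # Case 2: negative example on wrong side
                w = [w_i - x_i for w_i, x_i in zip(w, x)]
                mistakes += 1
        if mistakes == 0:              # a full pass with no update: converged
            break
    return w


# OR function, with a leading 1 as the constant bias input x_0.
P = [(1, 0, 1), (1, 1, 0), (1, 1, 1)]
N = [(1, 0, 0)]
w = train_perceptron(P, N)
```

After training, every vector in P has a non-negative dot product with `w` and the vector in N has a negative one, which is the stopping condition used by the loop.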
Minsky's "And / Or" Theorem: A Single
Perceptron's Limitations
• A perceptron is an approximator of linear functions (with an attached threshold
function). Image source: "Perceptrons" Minsky, Papert.

• In 1969, Marvin Minsky and Seymour Papert published Perceptrons, a
historic text that would alter the course of artificial intelligence research for
decades.
• In the text, Minsky and Papert proved that a single perceptron (a grandparent
to the computational units which compose modern neural networks) was
incapable of learning the exclusive-or (aka XOR) function.
What is XOR?
• The exclusive-or, or XOR, function is one of 16 binary functions that
take a pair of binary values and return "1" for True or "0" for false
depending on some predicate (boolean function).
• For example, XOR will return "1" if and only if one of the two inputs
is a "0" and the other input is a "1". If the input is either "0, 0" or "1,
1", the XOR outputs "0".

The Exclusive–Or Boolean Function.


Why can't a linear model similar to that of a
perceptron learn the XOR function?
• By definition, a linear model is going to
produce some type of linear (a 2D line, 3D
plane or higher dimensional surface that is
straight and lacks curves) function.
• In the above graph, points x̄y and xȳ are the
inputs that should be singled out by the XOR
function as “True” or “1”.
• The challenge is that XOR is a non-linear
function, meaning that x̄y and xȳ are not
linearly separable from xy and x̄ȳ.
• Indeed, a single straight line cannot “fence in” x̄y and xȳ;
it takes a curved line (or more than one line).
Why can't a linear model similar to that of a
perceptron learn the XOR function? Continue…
• So if the XOR function effectively separates out
x̄y and xȳ, what does the function look like?
• As Fig 2 shows, XOR and its inverse, XNOR, are
unique among the 16 boolean functions because
they cannot be captured with a single line. For
reference, the figure below delineates each of
the 14 linearly separable boolean functions.

Fig 2 : x⊕y = (x∨y)∧(x̄∨ȳ)


The And / Or Theorem
• It now makes sense to examine Minsky and Papert’s theorem directly.
• The And / Or Theorem:
“There exist predicates Ψ1 and Ψ2 of order 1 such that Ψ1 ∧ Ψ2
and Ψ1 ∨ Ψ2 are not of finite order.”
• In the case of the XOR function, the conjunction Ψ1 ∧ Ψ2 isn’t of
infinite order; indeed, it is of second order.
• The ambiguity of this sentence stems from the tendency of formal
mathematics to state truths that hold for all n-dimensional spaces.
In the two dimensional space of the XOR function, a lay translation of
the above theorem might be:
• The And / Or Theorem (lay version, two dimensional space):
There are combinations of specific order 1 boolean functions that,
when combined with ∧ (“and”) or ∨ (“or”), are no longer
linear.
How does a neural network overcome the limitations
that a single perceptron has in learning XOR?
• If learning, in a general sense, is simply the extraction of a pattern or trend
from a set of data/inputs/experiences, then one could probably argue that
fundamentally, the act of learning is really just framing and reframing data
again and again across different contexts, until a verifiable pattern can be
observed.
• At each layer in a neural network, the dot product of the layer's input data
and a weight is taken.

image source: Juan Stocca, Think Big Analytics


How does a neural network overcome the limitations that a
single perceptron has in learning XOR? Continue…
• This process, along with the addition of a scalar, is the core of what makes
up the linear model:

a linear model commonly used in deep learning

• Next, the resultant matrix of the linear model is put into a non-linear
function (in deep learning, this non-linear function is known as the
activation function and is most commonly the ReLU function).

ReLU: a nonlinear function

• In total, the effect of the linear and activation functions is to generate a new
perspective on the data. Data scientists call the layer-by-layer process of
matrix multiplication followed by non-linear activation
functions transforming the feature space.
How does a neural network overcome the limitations that a
single perceptron has in learning XOR? Continue…
• In deep learning, we continue transforming the
feature space until a linear regression is able to
model the data (one of the reasons neural
networks are known as universal function
approximators). In the case of complex imagery
analysis, this may take dozens of layers. In the
case of transforming the XOR feature space until
it can be modeled by a regression, this takes 2
layers.
• With the correct set of weights generated
via backpropagation, the above network can
effectively remap the feature space and recreate
the XOR function.

Fig: The minimum neural network required to learn the XOR function

• The illustration on the next slide shows the
transformation of the input feature space into a
feature space that can then produce the correct
XOR output via a linear function.
How does a neural network overcome the limitations that a
single perceptron has in learning XOR? Continue…

The math of a 2-layer neural net learning XOR. Adapted from example in “Deep Learning”
by Goodfellow, Bengio,
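The two-layer computation can be traced in a few lines of Python. This sketch uses the hand-picked weights from the well-known XOR example in the "Deep Learning" textbook (hidden weights W, hidden bias c, output weights w, output bias b) rather than weights found by backpropagation; the function names are illustrative assumptions.

```python
def relu(z):
    return max(0.0, z)


# Hand-picked parameters (not learnt): one hidden layer with two ReLU units.
W = [[1, 1],
     [1, 1]]     # W[i][j]: weight from input i to hidden unit j
c = [0, -1]      # hidden biases
w = [1, -2]      # output weights
b = 0            # output bias


def xor_net(x1, x2):
    # Hidden layer: linear model followed by the ReLU activation.
    h = [relu(W[0][j] * x1 + W[1][j] * x2 + c[j]) for j in range(2)]
    # Output layer: a plain linear function of the transformed features.
    return w[0] * h[0] + w[1] * h[1] + b


for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", xor_net(x1, x2))
```

The hidden layer maps (0,1) and (1,0) to the same point h = (1, 0), while (0,0) goes to (0, 0) and (1,1) goes to (2, 1). In that transformed space a single linear function, h_1 - 2·h_2, produces the XOR outputs.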
Multilayer perceptron
• Multilayer perceptrons, also known as deep
feedforward networks, are the foundation of most deep
learning models.
• CNNs are a specialized kind of feedforward network, and RNNs extend
feedforward networks with feedback connections. These networks are mostly used
for supervised machine learning tasks.
Multilayer perceptron continue…
• A Multi Layer Perceptron (MLP) contains one or more hidden layers
(apart from one input and one output layer). While a single layer
perceptron can only learn linear functions, a multi layer perceptron
can also learn non-linear functions.
Multilayer perceptron continue…
• Figure shows a multi layer perceptron with a single hidden layer. Note that all
connections have weights associated with them, but only three weights (w0, w1, w2)
are shown in the figure.
• Input Layer: The Input layer has three nodes. The Bias node has a value of 1. The
other two nodes take X1 and X2 as external inputs (which are numerical values
depending upon the input dataset). As discussed above, no computation is performed
in the Input layer, so the outputs from nodes in the Input layer are 1, X1 and X2
respectively, which are fed into the Hidden Layer.
• Hidden Layer: The Hidden layer also has three nodes with the Bias node
having an output of 1. The output of the other two nodes in the Hidden layer
depends on the outputs from the Input layer (1, X1, X2) as well as the weights
associated with the connections (edges). Figure shows the output calculation for one
of the hidden nodes (highlighted). Similarly, the output from other hidden node can
be calculated. Remember that f refers to the activation function. These outputs are
then fed to the nodes in the Output layer.
• Output Layer: The Output layer has two nodes which take inputs from the Hidden
layer and perform similar computations as shown for the highlighted hidden node.
The values calculated (Y1 and Y2) as a result of these computations act as outputs
of the Multi Layer Perceptron.
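The layer-by-layer computation described above can be sketched in Python for the same shape of network: two inputs plus a bias node, two hidden nodes plus a bias node, and two output nodes. The weight values and the choice of the sigmoid as f are hypothetical, since the figure's numbers are not available in the text.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, f):
    """Compute one layer. weights[j] = [bias weight, w1, w2, ...] for node j."""
    extended = [1.0] + list(inputs)   # the bias node always outputs 1
    return [f(sum(w * v for w, v in zip(node_w, extended)))
            for node_w in weights]


def mlp_forward(x1, x2, hidden_w, output_w, f=sigmoid):
    h = layer([x1, x2], hidden_w, f)  # outputs of the two hidden nodes
    return layer(h, output_w, f)      # Y1 and Y2


# Hypothetical weights: each row is [w0 (bias), w1, w2] for one node.
hidden_w = [[0.1, 0.4, -0.6],
            [-0.3, 0.2, 0.5]]
output_w = [[0.0, 1.0, -1.0],
            [0.2, -0.5, 0.7]]
y1, y2 = mlp_forward(1.0, 2.0, hidden_w, output_w)
```

No computation happens in the input layer; the values 1, X1 and X2 are simply passed to each hidden node, which is why `layer` only prepends the constant 1.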
Multilayer perceptron continue…
• The output Y from the neuron is computed as shown in the
Figure. The function f is non-linear and is called the Activation Function. The
purpose of the activation function is to introduce non-linearity into the output of
a neuron. This is important because most real world data is non linear and
we want neurons to learn these non linear representations.
• Every activation function (or non-linearity) takes a single number and performs
a certain fixed mathematical operation on it. There are several activation
functions you may encounter in practice:
– Sigmoid: takes a real-valued input and squashes it to the range (0, 1)
σ(x) = 1 / (1 + exp(−x))
– tanh: takes a real-valued input and squashes it to the range (-1, 1)
tanh(x) = 2σ(2x) − 1
– ReLU: ReLU stands for Rectified Linear Unit. It takes a real-valued input
and thresholds it at zero (replaces negative values with zero)
f(x) = max(0, x)
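The three formulas above translate directly into Python. This is a minimal sketch using only the standard library; the function names are illustrative, and the plain `math.exp` form is not numerically safe for inputs of very large magnitude.

```python
import math


def sigmoid(x):
    # Squashes any real input into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    # Uses the identity from the slide: tanh(x) = 2*sigmoid(2x) - 1.
    return 2.0 * sigmoid(2.0 * x) - 1.0


def relu(x):
    # Thresholds at zero: negative values are replaced with zero.
    return max(0.0, x)


print(sigmoid(0.0))                 # 0.5
print(relu(-3.0), relu(2.5))        # 0.0 2.5
print(abs(tanh(0.7) - math.tanh(0.7)) < 1e-12)   # True: matches math.tanh
```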
Activation function
• What is an Activation Function?
It is simply a function that you use to get the output of a node.
It is also known as a Transfer Function.
• Why do we use Activation Functions with Neural Networks?
It is used to determine the output of a neural network, like yes or
no. It maps the resulting values into a range such as 0 to 1 or -1 to 1
(depending upon the function).
• The Activation Functions can be basically divided into 2
types-
– Linear Activation Function
– Non-linear Activation Functions
Activation function continue…
• Linear or Identity Activation Function
• As you can see, the function is a straight line (linear). Therefore, the output of
the function is not confined to any range.

Fig: Linear Activation Function


• Equation : f(x) = x
• Range : (-infinity to infinity)
• It doesn’t help the network capture complex, non-linear patterns in the data
that is fed to it, because a stack of linear layers is still a linear function.
Activation function continue…
• Non-linear Activation Function
• The Nonlinear Activation Functions are the most
used activation functions. Nonlinearity makes
the graph look something like this:

Fig: Non-linear Activation Function

• It makes it easy for the model to generalize or
adapt to a variety of data and to differentiate
between the outputs.
• The main terms needed to understand
nonlinear functions are:
– Derivative or Differential: Change in y-axis
w.r.t. change in x-axis. It is also known as the
slope.
– Monotonic function: A function which is
either entirely non-increasing or entirely non-decreasing.
• The Nonlinear Activation Functions are mainly
divided on the basis of their range or curves.
Activation function continue…
1. Sigmoid or Logistic Activation Function
• The Sigmoid Function curve is S-shaped.
• The main reason why we use the sigmoid function
is that its output lies between 0 and 1. Therefore,
it is especially used for models where we have
to predict a probability as an output. Since
the probability of anything exists only in the
range of 0 to 1, sigmoid is the right choice.
• The function is differentiable. That means we
can find the slope of the sigmoid curve at any
point.
• The function is monotonic but its
derivative is not.
• The logistic sigmoid function can cause a neural
network to get stuck during training, because its
gradient becomes very small for inputs of large magnitude.
• The softmax function is a more generalized
logistic activation function which is used for
multiclass classification.

Fig: Sigmoid Function
Activation function continue…
2. Tanh or hyperbolic tangent Activation
Function
• tanh is like the logistic sigmoid, but often works better.
The range of the tanh function is (-1 to
1). tanh is also sigmoidal (S-shaped).
• The advantage is that negative inputs
will be mapped strongly negative and
zero inputs will be mapped near zero in the
tanh graph.
• The function is differentiable.
• The function is monotonic while
its derivative is not monotonic.
• The tanh function is mainly used for
classification between two classes.
• Both tanh and logistic sigmoid activation
functions are used in feed-forward nets.

Fig: tanh v/s Logistic Sigmoid
Activation function continue…
3. ReLU (Rectified Linear Unit) Activation Function
• The ReLU is the most used activation function right now, since it is used
in almost all convolutional neural networks and deep learning models.

• As you can see, the ReLU is half rectified (from bottom). f(z) is zero when z is less
than zero and f(z) is equal to z when z is above or equal to zero.
• Range: [0 to infinity)
• The function and its derivative both are monotonic.
• But the issue is that all negative values become zero immediately, and the
gradient for those inputs is zero as well. This decreases the ability of the model
to fit or train from the data properly: a unit whose inputs stay negative stops
updating, which is known as the dying ReLU problem.
Activation function continue…
4. Leaky ReLU
• It is an attempt to solve the dying ReLU problem

Fig : ReLU v/s Leaky ReLU

• The leak helps to increase the range of the ReLU function: f(x) = x for x ≥ 0
and f(x) = a·x for x < 0. Usually, the value of a is 0.01 or so.
• When a is not fixed but chosen at random, it is called Randomized ReLU.
• Therefore the range of the Leaky ReLU is (-infinity to infinity).
• Both Leaky and Randomized ReLU functions are monotonic in
nature. Their derivatives are also monotonic in nature.
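The difference between ReLU and Leaky ReLU is easy to see in code. This is a minimal sketch; the function names are illustrative and the default slope a = 0.01 is the typical value mentioned above.

```python
def relu(z):
    # Negative inputs are clipped to zero.
    return max(0.0, z)


def leaky_relu(z, a=0.01):
    # Negative inputs are scaled by a small slope `a` instead of being zeroed,
    # so they still produce a small non-zero output (and gradient).
    return z if z >= 0 else a * z


for z in (-2.0, 0.0, 3.0):
    print(z, relu(z), leaky_relu(z))
```

For z = -2.0 the ReLU output is 0.0 while the Leaky ReLU output is about -0.02, which is the "leak" that keeps the unit from going completely silent on negative inputs.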
Activation function continue…
• Why is the derivative/differentiation used?
• When updating the weights, the derivative (slope) tells us in which direction and by how much to
change them. That is why we use
differentiation in almost every part of Machine Learning and Deep Learning.

Fig: Derivative of Activation Functions
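Since the figure itself is not reproduced in the text, the standard derivatives of the activation functions discussed so far are listed here as a reference sketch. The symbol a is the Leaky ReLU slope; the derivatives of ReLU and Leaky ReLU are not defined at x = 0, where implementations pick a value by convention.

```latex
\begin{aligned}
\sigma'(x) &= \sigma(x)\,\bigl(1 - \sigma(x)\bigr) \\
\tanh'(x) &= 1 - \tanh^{2}(x) \\
\mathrm{ReLU}'(x) &=
  \begin{cases} 1 & x > 0 \\ 0 & x < 0 \end{cases} \\
\mathrm{LeakyReLU}'(x) &=
  \begin{cases} 1 & x > 0 \\ a & x < 0 \end{cases}
\end{aligned}
```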


Activation function continue…

Fig: Activation Function Cheatsheet


Gradient Descent
• It is an optimization algorithm used in training a model. In
simple words, Gradient Descent finds the parameters that
minimize the cost function (error in prediction). Gradient
Descent does this by iteratively moving toward a set of
parameter values (parameters refer to coefficients in Linear
Regression and weights in neural networks) that
minimize the function, taking steps in the opposite
direction of the gradient.
• What is a Gradient?
– A gradient is a vector-valued function that collects the
partial derivatives (slopes) of a function with respect to each parameter,
pointing in the direction of the greatest rate of increase of
the function. It indicates the incline
or the slope of the cost function.
Gradient Descent
• What is the Learning rate?
– The gradient is a vector-valued function, and as a vector, it has
both a direction and a magnitude. The Gradient descent algorithm
multiplies the gradient by a number (Learning rate or Step size) to
determine the next point.
• For example:
– With a gradient of magnitude 4.2 and a learning rate of
0.01, the gradient descent algorithm will pick the next point
0.042 away from the previous point (4.2 × 0.01 = 0.042).

Fig. Learning rate


Gradient Descent
• Optimization refers to the task of minimizing/maximizing an objective
function f(x) parameterized by x. In machine/deep learning terminology, it’s the task
of minimizing the cost/loss function J(w) parameterized by the model’s
parameters w ∈ R^d. Optimization algorithms (in case of minimization) have one of
the following goals:
– Find the global minimum of the objective function. This is feasible if the objective
function is convex, i.e. any local minimum is a global minimum.
– Find the lowest possible value of the objective function within its neighborhood (a local minimum).
That’s usually the case if the objective function is not convex, as is the case in most
deep learning problems.
• There are three kinds of optimization algorithms:
– Optimization algorithm that is not iterative and simply solves for one point.
– Optimization algorithm that is iterative in nature and converges to acceptable
solution regardless of the parameters initialization such as gradient descent applied
to logistic regression.
– Optimization algorithm that is iterative in nature and applied to a set of problems
that have non-convex cost functions such as neural networks. Therefore,
parameters’ initialization plays a critical role in speeding up convergence and
achieving lower error rates.
Gradient Descent Continue…
• Gradient Descent is the most common optimization algorithm in machine
learning and deep learning. It is a first-order optimization algorithm. This means
it only takes into account the first derivative when performing the updates on the
parameters. On each iteration, we update the parameters in the opposite direction of
the gradient of the objective function J(w) w.r.t the parameters where the gradient
gives the direction of the steepest ascent. The size of the step we take on each
iteration to reach the local minimum is determined by the learning rate α. Therefore,
we follow the direction of the slope downhill until we reach a local minimum.
• How gradient descent works on logistic regression
• Let’s assume that the logistic regression model has only two parameters:
weight w and bias b.
1. Initialize weight w and bias b to random numbers.
2. Pick a value for the learning rate α. The learning rate determines how big the
step is on each iteration.
– If α is very small, it takes a long time to converge and becomes
computationally expensive.
– If α is too large, it may overshoot the minimum and fail to converge.
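A minimal sketch of gradient descent on a two-parameter logistic regression model. The cross-entropy loss, the toy data and the function names are assumptions made for illustration; w and b start at zero here (instead of random values) only so that the run is reproducible.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, alpha=0.5, steps=2000, w=0.0, b=0.0):
    """Gradient descent on the mean cross-entropy loss of a one-feature model."""
    m = len(xs)
    for _ in range(steps):
        # Gradient of the mean cross-entropy loss with respect to w and b
        dw = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / m
        db = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / m
        # Step against the gradient; alpha controls the step size
        w -= alpha * dw
        b -= alpha * db
    return w, b

# Toy data (assumed): negative inputs belong to class 0, positive inputs to class 1
w, b = train_logistic([-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1])
print(sigmoid(w * -2.0 + b), sigmoid(w * 2.0 + b))
```

Rerunning with a much smaller alpha shows the slow convergence described above.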
Gradient Descent Continue…
• Gradient descent is a first-order iterative optimization algorithm for finding the
minimum of a function
• This seems a little complicated, so let’s break it down.
• The goal of gradient descent is to minimize a given function which, in our case,
is the loss function of the neural network. To achieve this goal, it performs two
steps iteratively.
– Compute the slope (gradient), that is, the first-order derivative of the function at the
current point
– Move in the direction opposite to the slope (away from the direction of increase) from the current point by the
computed amount
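The two steps can be sketched on a one-dimensional function. The function f(x) = (x - 3)^2, its derivative and all names below are illustrative assumptions, not taken from the slides.

```python
def gradient_descent(grad, x0, alpha=0.1, steps=100):
    """Repeat: compute the slope at x, then move against it by alpha * slope."""
    x = x0
    for _ in range(steps):
        slope = grad(x)        # step 1: first-order derivative at the current point
        x -= alpha * slope     # step 2: move in the opposite direction
    return x

# f(x) = (x - 3)**2 has derivative 2 * (x - 3) and its minimum at x = 3
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)
```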

Fig. Gradient descent


Gradient Descent Continue…
• The idea is to pass the training set through the hidden layers of the neural
network and then update the parameters of the layers by computing the
gradients using the training samples from the training dataset.
• Think of it like this. Suppose a man is at the top of a hill and wants to get to
the bottom of the valley. So he goes down the slope. He decides his next
position based on his current position and stops when he gets to the bottom of
the valley, which was his goal.

Fig. 3-dimensional graph for gradient descent


Gradient Descent Continue…
• There are different ways in which that man (weights) can go down the slope.
– Batch Gradient Descent
– Stochastic Gradient Descent
– Mini Batch Gradient Descent
Gradient Descent Continue…
• Batch Gradient Descent
• In Batch Gradient Descent, all the training
data is taken into consideration to take a
single step. We take the average of the
gradients of all the training examples and
then use that mean gradient to update our
parameters. So that’s just one step of
gradient descent in one epoch.
• Batch Gradient Descent is great for convex
or relatively smooth error manifolds. In
this case, we move somewhat directly
towards an optimum solution.
• The graph of cost vs epochs is also quite
smooth because we are averaging over all
the gradients of training data for a single
step. The cost keeps on decreasing over
the epochs.

Fig. Cost vs Epochs
(Source: https://www.bogotobogo.com/python/scikit-learn/scikit-learn_batch-gradient-descent-versus-stochastic-gradient-descent.php)
Gradient Descent Continue…
• The figure below shows an example of gradient descent operating in a single
dimension:

Fig. Simple, one-dimensional gradient descent


Gradient Descent Continue…
• When training weights in a neural network, normal batch gradient
descent usually takes the mean squared error of all the training
samples when it is updating the weights of the network:

$$ W \leftarrow W - \alpha \nabla J(W, b) $$

• Where W are the weights, α is the learning rate and ∇J(W,b) is the
gradient of the cost function J(W,b) with respect to changes in the
weights. The cost function J was defined as:

$$ J(W, b) = \frac{1}{m} \sum_{z=1}^{m} J\left(W, b, x^{(z)}, y^{(z)}\right) $$

• The overall cost function (and therefore the gradient) depends on the
mean cost function calculated on all of the m training samples (x(z)
and y(z) refer to each training sample pair).
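A minimal sketch of the batch update, assuming a one-feature linear model with a squared-error cost. The model, the toy data and the function names are illustrative assumptions; the point is that every update averages the gradient over all m samples.

```python
def batch_gradient(w, b, samples):
    """Mean gradient of the cost 0.5 * (w*x + b - y)**2 over all m samples."""
    m = len(samples)
    grad_w = sum((w * x + b - y) * x for x, y in samples) / m
    grad_b = sum((w * x + b - y) for x, y in samples) / m
    return grad_w, grad_b

def batch_gd(samples, alpha=0.1, epochs=500):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w, grad_b = batch_gradient(w, b, samples)  # one pass over every sample
        w -= alpha * grad_w                             # exactly one update per epoch
        b -= alpha * grad_b
    return w, b

samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # generated from y = 2x + 1
w, b = batch_gd(samples)
print(w, b)
```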
Gradient Descent Continue…
• Is this the best way of doing things? Batch gradient descent is good because the
training progress is nice and smooth – if you plot the average value of the cost
function over the number of iterations / epochs it will look something like this:

Fig. Example batch gradient descent progress

• The line is mostly smooth and predictable. However, a problem with batch gradient
descent in neural networks is that for every gradient descent update in the
weights, you have to cycle through every training sample. For big data sets (e.g. >
50,000 training samples), this can be prohibitively slow.
Gradient Descent Continue…
• The main advantages:
– We can use a fixed learning rate during training without worrying about learning
rate decay.
– It has a straight trajectory towards the minimum and it is guaranteed to
converge in theory to the global minimum if the loss function is convex and to
a local minimum if the loss function is not convex.
– It has an unbiased estimate of the gradients. The more examples, the lower the
standard error.
• The main disadvantages:
– It requires the loading of the whole dataset into memory, which can be
problematic for big data sets
– Batch gradient descent can’t be efficiently parallelized (compared to the
techniques about to be presented) – this is because each update in the weight
parameters requires a mean calculation of the cost function over all the training
samples.
– The smooth descent of the cost function means that neural
network training can get stuck in local minima, which makes it less likely
that a global minimum of the cost function will be found.
Gradient Descent Continue…
• Stochastic Gradient Descent
• In Batch Gradient Descent we were considering all the examples for every step of
Gradient Descent.
• But what if our dataset is huge? Deep learning models crave data: the
more data, the better the chances that the model will be good. Suppose our dataset has 5
million examples; then just to take one step the model has to calculate the
gradients of all 5 million examples. This is not efficient. To
tackle this problem we have Stochastic Gradient Descent. In Stochastic Gradient
Descent (SGD), we consider just one example at a time to take a single step. We do
the following steps in one epoch for SGD:
1) Take an example
2) Feed it to the neural network
3) Calculate its gradient
4) Use the gradient calculated in step 3 to update the weights
5) Repeat steps 1–4 for all the examples in the training dataset
• Since we consider just one example at a time, the cost fluctuates over the
training examples and does not necessarily decrease at every step. But in the long run, you will
see the cost decreasing with fluctuations.
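The five steps of one SGD epoch can be sketched as follows. A one-feature linear model with squared-error cost stands in for the neural network; the model, the data, the shuffling and the names are assumptions made to keep the example short.

```python
import random

def sgd_epoch(w, b, samples, alpha, rng):
    order = samples[:]
    rng.shuffle(order)             # visit the examples in random order (an assumption)
    for x, y in order:             # 1) take an example
        error = (w * x + b) - y    # 2) feed it through the model
        grad_w = error * x         # 3) gradient of 0.5 * error**2 for this example only
        grad_b = error
        w -= alpha * grad_w        # 4) update the weights immediately
        b -= alpha * grad_b
    return w, b                    # 5) the loop has covered every example once

rng = random.Random(0)
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # generated from y = 2x + 1
w, b = 0.0, 0.0
for epoch in range(300):
    w, b = sgd_epoch(w, b, samples, alpha=0.05, rng=rng)
print(w, b)
```

With m examples, one epoch performs m weight updates, compared with a single update per epoch in batch gradient descent.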
Gradient Descent Continue…
• Also, because the cost fluctuates so much, it never settles exactly at the minimum but
keeps dancing around it.
• SGD can be used for larger datasets. It converges faster when the dataset is large
because it updates the parameters more frequently.

Cost vs Epochs in SGD


Gradient Descent Continue…
• Stochastic gradient descent updates the weight parameters after
evaluating the cost function on each sample. That is, rather than
summing up the cost function results for all the samples and then taking the
mean, stochastic gradient descent (or SGD) updates the weights
after every training sample is analyzed. Therefore, the updates look
like this:

$$ W \leftarrow W - \alpha \nabla J\left(W, b, x^{(z)}, y^{(z)}\right) $$
Gradient Descent Continue…
• So how does SGD perform? Let’s take a look. The plot below shows
the average cost versus the number of training epochs / iterations for
batch gradient descent and SGD on the scikit-learn MNIST dataset.

Fig. Batch gradient descent versus SGD


• First, SGD converges much more rapidly than batch gradient
descent. In fact, SGD converges on a minimum J after < 20
iterations.
Gradient Descent Continue…
• Secondly, despite what the average cost function plot says, batch gradient descent
after 1000 iterations outperforms SGD. On the MNIST test set, the SGD run has an
accuracy of 94% compared to a BGD accuracy of 96%.

Fig. Noisy SGD

• As you can see in the figure above, SGD is noisy. That is because it responds to
the effects of each and every sample, and the samples themselves will no doubt
contain an element of noisiness. While this can be a benefit in that it can act to
“kick” the gradient descent out of local minimum values of the cost function, it
can also hinder it settling down into a good minimum. This is why, eventually,
batch gradient descent has outperformed SGD after 1000 iterations.
Gradient Descent Continue…
• Mini Batch Gradient Descent
• Batch Gradient Descent can be used for smoother curves. SGD can be used when
the dataset is large. Batch Gradient Descent converges directly to minima. SGD
converges faster for larger datasets. But, since in SGD we use only one example
at a time, we cannot use a vectorized implementation. This can
slow down the computations. To tackle this problem, a mixture of Batch Gradient
Descent and SGD is used.
• We use neither the whole dataset at once nor a single example at a time.
We use a batch of a fixed number of training examples, smaller than the
full dataset, and call it a MINI-BATCH. Doing this helps us achieve the
advantages of both the former variants we saw. So, after creating the mini-batches of
fixed size, we do the following steps in one epoch:
1. Pick a mini-batch
2. Feed it to Neural Network
3. Calculate the mean gradient of the mini-batch
4. Use the mean gradient we calculated in step 3 to update the weights
5. Repeat steps 1–4 for the mini-batches we created
• Just like SGD, the average cost over the epochs in mini-batch gradient descent
fluctuates because we are averaging a small number of examples at a time.
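A minimal sketch of one mini-batch epoch, following the five steps above. The one-feature linear model, the toy data, the batch size and the helper names are assumptions for illustration.

```python
import random

def make_minibatches(samples, batch_size, rng):
    """Shuffle the data and cut it into batches of a fixed size."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]

def minibatch_epoch(w, b, samples, batch_size, alpha, rng):
    for batch in make_minibatches(samples, batch_size, rng):    # 1) pick a mini-batch
        n = len(batch)
        # 2) and 3) feed the batch through the model and average its gradients
        grad_w = sum((w * x + b - y) * x for x, y in batch) / n
        grad_b = sum((w * x + b - y) for x, y in batch) / n
        w -= alpha * grad_w                                      # 4) update the weights
        b -= alpha * grad_b
    return w, b                                                  # 5) all batches visited

rng = random.Random(0)
samples = [(0.5 * i, 2.0 * (0.5 * i) + 1.0) for i in range(8)]  # generated from y = 2x + 1
w, b = 0.0, 0.0
for epoch in range(500):
    w, b = minibatch_epoch(w, b, samples, batch_size=4, alpha=0.1, rng=rng)
print(w, b)
```

Setting batch_size to 1 gives SGD, and setting it to len(samples) gives batch gradient descent.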
Gradient Descent Continue…
• Mini-batch gradient descent is a trade-off between stochastic gradient
descent and batch gradient descent. In mini-batch gradient descent, the
cost function (and therefore gradient) is averaged over a small number
of samples, from around 10-500. This is opposed to the SGD batch size
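of 1 sample, and the BGD size of all the training samples. It looks like
this:

$$ W \leftarrow W - \alpha \nabla J\left(W, b, x^{(z:z+bs)}, y^{(z:z+bs)}\right) $$

• Where bs is the mini-batch size and the cost function is:

$$ J\left(W, b, x^{(z:z+bs)}, y^{(z:z+bs)}\right) = \frac{1}{bs} \sum_{k=z}^{z+bs-1} J\left(W, b, x^{(k)}, y^{(k)}\right) $$
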
• What’s the benefit of doing it this way? First, it smooths out some of the
noise in SGD, but not all of it, thereby still allowing the “kick” out of
local minimums of the cost function. Second, the mini-batch size is still
small, thereby keeping the performance benefits of SGD.
Gradient Descent Continue…
• Here is how it performs with a mini-batch size of 100 samples:

Fig. Mini-batch gradient descent versus the rest


• Mini-batch gradient descent appears to be the superior method of gradient descent to
be used in neural network training. The jagged decline in the average cost function
is evidence that mini-batch gradient descent is “kicking” the cost function out of
local minimum values to reach better minima.
• It is still able to find a good minimum and stick to it. This is confirmed in the test
data: the mini-batch method achieves an accuracy of 98% compared to the next
best, batch gradient descent, which has an accuracy of 96%. The great thing is that it
gets to these levels of accuracy after only 150 iterations or so.
Gradient Descent Continue…
• The main advantages:
– It can be performed in a distributed manner. That is, each mini-batch can be
computed in parallel by “workers” across multiple servers, CPUs and GPUs to
achieve significant improvements in training speeds. There are multiple
algorithms and architectures to perform this parallel operation.
– Faster than the batch version because each update goes through far fewer examples than
batch gradient descent (all examples).
– Randomly selecting examples helps avoid redundant examples or examples
that are very similar and don’t contribute much to the learning.
– With batch size < size of training set, it adds noise to the learning process that
helps reduce generalization error.
– Even though with more examples the estimate would have lower standard error,
the return is less than linear compared to the computational burden we incur.
• The main disadvantages:
– It may not fully converge. On each iteration, the learning step may go back and forth
due to the noise. Therefore, it wanders around the minimum region but may never
settle exactly at the minimum.
– Due to the noise, the learning steps have more oscillations, which requires adding
learning-rate decay to decrease the learning rate as we get closer to the
minimum.
Gradient Descent and the Delta Rule
• Overview
• The perceptron rule can find a successful weight vector when the training examples
are linearly separable. It can fail to converge if the examples are not linearly
separable. The delta rule is used to overcome this difficulty.

Figure 1 is the error of different hypotheses. For a linear unit with two weights, the
hypothesis space H is the w0, w1 plane. The vertical axis indicates the error of the
corresponding weight vector hypothesis relative to a fixed set of training examples. The
arrow shows the negated gradient at one particular point, indicating the direction in the
w0, w1 plane producing steepest descent along the error surface.
Gradient Descent and the Delta Rule Continue…
• The perceptron training rule is guaranteed to converge to a 0-error hypothesis
after a finite number of iterations if
– Training examples are linearly separable (i.e., no noise)
– The learning rate η is sufficiently small
• The advantages of the Delta Rule over the Perceptron Training Rule
include:
– It converges toward a hypothesis with minimum
squared error (given a sufficiently small learning rate)
– Allows for noise in the data
– Allows for non-separable functions
• The delta training rule is best understood as training an unthresholded
perceptron, which is a linear unit with output o given as follows:

$$ o(x) = w \cdot x $$
Gradient Descent and the Delta Rule Continue…
• Training error of a hypothesis (weight vector) relative to the training
examples is as follows:

$$ E(w) \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \qquad \text{(Equation 1)} $$

• D : set of training examples
• td : target output for training example d
• od : linear unit output for training example d, od = o(x(d)) = w · x(d)
• E(w) : half of the squared difference between target output td and
linear unit output od, summed over all training examples
Gradient Descent and the Delta Rule Continue…
• Derivation of the Gradient Descent Rule
• To calculate the direction of steepest descent along the error surface,
compute the derivative of E with respect to each component of the vector w
• This vector derivative is called the gradient of E with respect to w, written
∇E(w):

$$ \nabla E(w) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right] $$

• ∂E/∂w0 is the partial derivative of the error E with respect to the single component w0 that
is in w
• ∇E(w) itself is a vector
• The gradient specifies the direction of steepest increase of E, so the training rule
for gradient descent is: w ← w + Δw,
where
• Δw = −η∇E(w), η is the learning rate (a positive constant).
• We put a negative sign before the learning rate since we want to move the
weight vector in the direction that decreases E.
Gradient Descent and the Delta Rule Continue…
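• The component form of this training rule is: wi ← wi + Δwi,
where

$$ \Delta w_i = -\eta \frac{\partial E}{\partial w_i} $$

Since we can write

$$ E(w) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 $$

the gradient can be obtained by differentiating E as:

$$ \frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 = \sum_{d \in D} (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - w \cdot x_d) = \sum_{d \in D} (t_d - o_d)(-x_{i,d}) $$

Where x_{i,d} denotes the single input component x_i for training example d.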
Gradient Descent and the Delta Rule Continue…
• Weight update rule for gradient descent is:

$$ \Delta w_i = \eta \sum_{d \in D} (t_d - o_d)\, x_{i,d} $$

• The basis of the delta rule is to change each weight in the direction opposite
to the gradient, which is the direction in which the approximation error
decreases most steeply
• The rate of change is decided by the learning rate η
• If η is too large, the gradient descent search runs the risk of
overstepping the minimum in the error surface rather than settling
into it
• If you drive too fast (η is too large), you could drive past the house you
are looking for and have to turn around
• If you drive too slowly (η is too small), it might take you a very long
time to find your destination
Gradient descent algorithm for training a linear unit
Gradient descent algorithm for training a linear unit
Continue…
• In summary the algorithm is as follows:
– Pick an initial random weight for each wi in the weight vector
– Apply the linear unit to all training examples and compute Δwi for
each weight according to

$$ \Delta w_i = \eta \sum_{d \in D} (t_d - o_d)\, x_{i,d} $$

– Update each weight wi by adding Δwi
– Repeat this process.
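The steps above can be sketched as follows. This is an illustrative sketch only: the toy examples, the fixed number of epochs used as a stopping rule, the seed and the function name are assumptions, and the constant input x0 = 1 plays the role of a bias.

```python
import random

def train_linear_unit(examples, eta=0.05, epochs=500, seed=0):
    """Batch gradient descent (delta rule) for a linear unit o = w . x."""
    rng = random.Random(seed)
    n = len(examples[0][0])
    w = [rng.uniform(-0.05, 0.05) for _ in range(n)]    # small random initial weights
    for _ in range(epochs):
        delta = [0.0] * n                                # accumulated Delta w_i
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))     # linear unit output
            for i in range(n):
                delta[i] += eta * (t - o) * x[i]         # eta * (t_d - o_d) * x_id
        w = [wi + di for wi, di in zip(w, delta)]        # update after the full pass
    return w

# Each example is (inputs, target); the targets follow t = 1 + 2 * x1
examples = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)]
print(train_linear_unit(examples))
```

Because the updates are summed over all examples rather than averaged, eta has to shrink as the training set grows.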
Stochastic approximation to gradient descent
• The difficulties in applying gradient descent are:
– Converging to a local minimum can be quite slow and may require many thousands of
gradient descent steps.
– If there exist multiple local minima in the error surface then there is no
guarantee of finding the global minimum.
• To overcome these problems, use incremental gradient descent, also called stochastic gradient
descent
• Instead of updating the weights after summing over all training examples, stochastic
gradient descent updates the weights incrementally, following the calculation of the
error for each training example
• The training rule now becomes: Δwi = η(t-o)xi. where,
t: target value
o: unit output
xi: ith input for the training example
η : learning rate
• The algorithm for stochastic gradient descent is nearly identical to standard gradient
descent
Stochastic gradient descent algorithm for training a linear unit
Continue…
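The algorithm listing for this slide did not survive extraction. The following is a minimal sketch of the incremental rule Δwi = η(t − o)xi, not the slide's own listing; the toy examples, the fixed number of epochs, the seed and the function name are assumptions.

```python
import random

def train_linear_unit_sgd(examples, eta=0.05, epochs=500, seed=0):
    """Incremental (stochastic) delta rule for a linear unit o = w . x."""
    rng = random.Random(seed)
    n = len(examples[0][0])
    w = [rng.uniform(-0.05, 0.05) for _ in range(n)]    # small random initial weights
    for _ in range(epochs):
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))     # output for this example
            # Update right away, without summing over the other examples
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

# Each example is (inputs, target); x0 = 1 acts as a bias and t = 1 + 2 * x1
examples = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)]
print(train_linear_unit_sgd(examples))
```

The only change from the batch version is that the weight update sits inside the loop over examples.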
Difference between standard gradient descent
and stochastic gradient descent
• In Standard gradient descent, the error is summed
over all examples before updating weight. In
Stochastic gradient descent, weights are updated upon
examining each training example
• Standard gradient descent requires more computation
per weight update step for summing over multiple
examples. Standard gradient descent is often used with
a larger step size per weight update than stochastic
gradient descent since it uses the true gradient.
• In cases where there are multiple local minima with
respect to E(w), stochastic gradient descent can
sometimes avoid falling into these local minima
because it uses the gradients of the per-example errors,
∇Ed(w), rather than ∇E(w) to guide its search.
Multilayer Networks and the Backpropagation
Algorithm
• Multilayer Networks
• A single perceptron can only express a linear (straight) decision surface
• For nonlinear decision surfaces we use a multilayer network

• Fig. shows a speech recognition task that involves distinguishing among 10
possible vowels spoken in the context of “h_d”
• Input: parameters F1 and F2 obtained from spectral analysis of the sound
• Output: 10 network outputs corresponding to the 10 possible vowel sounds
Multilayer Networks and the Backpropagation
Algorithm Continue…
• Sigmoid threshold unit:
– Multiple layers of linear units still produce only linear functions
– The solution is the sigmoid unit, a unit similar to the perceptron but
based on a smooth, differentiable threshold function
– The output of a sigmoid unit is a nonlinear function of its inputs, but
it is still differentiable

Fig. Sigmoid Threshold Unit


Multilayer Networks and the Backpropagation
Algorithm Continue…
• Sigmoid threshold unit:
• The sigmoid unit computes the linear combination of its inputs, then
applies a threshold to the result, where the output is a continuous
function of its input
• Output of sigmoid unit is:
o = σ(w · x)
• Where,

$$ \sigma(y) = \frac{1}{1 + e^{-y}} $$

• σ is the sigmoid function; its output ranges from 0 to 1
• Derivative of sigmoid function is easily calculated:

$$ \frac{d\sigma(y)}{dy} = \sigma(y)\,(1 - \sigma(y)) $$
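A short sketch of the sigmoid and its derivative, using the standard identity σ'(y) = σ(y)(1 − σ(y)), with a finite-difference check. The function names and the test point are illustrative.

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_derivative(y):
    s = sigmoid(y)
    return s * (1.0 - s)    # the derivative reuses the already computed output

# Compare the closed form with a central finite difference at an arbitrary point
y, h = 0.7, 1e-6
numeric = (sigmoid(y + h) - sigmoid(y - h)) / (2 * h)
print(sigmoid_derivative(y), numeric)
```

This reuse of the output is what makes the derivative cheap to evaluate during backpropagation.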
Multilayer Networks and the Backpropagation
Algorithm Continue…
• Properties of Sigmoid function:
– It outputs real numbers between 0 and 1
– It maps a very large input domain to a small range of outputs
– It never loses information because it is a one to one function
– It increases monotonically
• We can derive gradient descent rules to train:
– One sigmoid unit
– Multilayer networks of sigmoid units → Backpropagation
Multilayer Networks and the Backpropagation Algorithm
Continue…
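• ERROR GRADIENT FOR SIGMOID UNIT
• The error of the output unit depends on a specific weight as:

$$ \frac{\partial E}{\partial w_i}, \qquad E = \frac{1}{2}(t - o)^2 $$

Where,
» o is the output value
» t is the target value and y = Σ wixi
• We know that y = f(w) and o = g(y)
– Where f is the weighted sum of the inputs and g is the sigmoid function
• We can have the following function (chain rule):

$$ \frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial o} \cdot \frac{\partial o}{\partial y} \cdot \frac{\partial y}{\partial w_i} = -(t - o)\, o\,(1 - o)\, x_i $$

• The greater the number of training iterations, the lower the error on the training set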
• Architecture within the fixed weight is:
Multilayer Networks and the Backpropagation
Algorithm Continue…
• Error Gradient for Network
• Let ,
• How does the error depend on y5?
❖ If an error occurs at the output, propagate it back until the input
nodes are reached, updating the weights along the way
❖ If the trained network fits the training data perfectly,
it doesn’t mean that it generalizes well
❖ In a neural network we don’t know in advance
how well it will generalize
❖ If the data set is big enough it contains some
noise, even when the network fits the training data
very well
❖ So it may not minimize the error
on the whole distribution
❖ Gradient descent is a very general
method; it is applied to neural networks
❖ It is also applied to other methods such as
decision trees
Multilayer Networks and the Backpropagation Algorithm
Continue…
• The backpropagation algorithm learns the weights for a multilayer
network given a fixed network of units and interconnections by
employing gradient descent to minimize the squared error between the
network output and target values for those outputs.
• Because these networks have multiple output units, E must be redefined
to sum errors over all network output units:

$$ E(w) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in \text{outputs}} (t_{kd} - o_{kd})^2 $$

• outputs: set of output units in the network
• tkd: target output value associated with the kth output unit and training
example d
• okd: output value associated with the kth output unit and training
example d
• In multilayer networks, there can be multiple local minima and
backpropagation is not guaranteed to converge to a global minimum
Multilayer Networks and the Backpropagation Algorithm
Continue…
• The algorithm applies to a neural network containing two layers of sigmoid
units, with units at each layer connected to all units from the previous layer
• This is the incremental gradient descent version of Backpropagation.
• It iterates over the training examples repeatedly, calculating the error and
updating the weights, often thousands of times, until the network performs
well.
• The gradient descent weight update rule: Δwji = η · δj · xji
• It is similar to the delta training rule except that the error (t − o) in the delta rule
is replaced by the error term δj. For output unit k:

$$ \delta_k = o_k (1 - o_k)(t_k - o_k) $$

• Calculating the error term for hidden unit h is similar:

$$ \delta_h = o_h (1 - o_h) \sum_{k \in \text{outputs}} w_{kh}\, \delta_k $$


Multilayer Networks and the Backpropagation Algorithm
Continue…
• Training examples only provide target output for the network, the error term for
hidden units is calculated by summing the error terms δk for each output
influenced by hidden unit h.
• Each of these error terms is multiplied by the weight from the hidden unit h to
output unit k which characterizes the degree to which the hidden unit h is
“responsible for” the error in k.
• This is essentially the credit assignment problem.
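A compact sketch of one incremental backpropagation step for a network with one hidden layer and one output layer of sigmoid units. The network size, the bias handled as an extra constant input of 1, the single training example and all names are assumptions for illustration.

```python
import math
import random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def forward(x, w_hidden, w_out):
    """Return (hidden outputs, network outputs); a bias input of 1 is appended at each layer."""
    xb = x + [1.0]
    hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hidden]
    hb = hidden + [1.0]
    out = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in w_out]
    return hidden, out

def backprop_step(x, t, w_hidden, w_out, eta):
    """One per-example weight update, applied in place."""
    hidden, out = forward(x, w_hidden, w_out)
    xb, hb = x + [1.0], hidden + [1.0]
    # Output error terms: delta_k = o_k (1 - o_k) (t_k - o_k)
    delta_out = [o * (1 - o) * (tk - o) for o, tk in zip(out, t)]
    # Hidden error terms: delta_h = o_h (1 - o_h) * sum_k w_kh * delta_k
    delta_hidden = [
        h * (1 - h) * sum(w_out[k][j] * delta_out[k] for k in range(len(w_out)))
        for j, h in enumerate(hidden)
    ]
    # Weight updates: w_ji <- w_ji + eta * delta_j * x_ji
    for k, row in enumerate(w_out):
        for j in range(len(row)):
            row[j] += eta * delta_out[k] * hb[j]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += eta * delta_hidden[j] * xb[i]

def squared_error(x, t, w_hidden, w_out):
    _, out = forward(x, w_hidden, w_out)
    return 0.5 * sum((tk - o) ** 2 for tk, o in zip(t, out))

rng = random.Random(0)
w_hidden = [[rng.uniform(-0.05, 0.05) for _ in range(3)] for _ in range(2)]  # 2 inputs + bias
w_out = [[rng.uniform(-0.05, 0.05) for _ in range(3)] for _ in range(1)]     # 2 hidden + bias
x, t = [1.0, 0.0], [1.0]
before = squared_error(x, t, w_hidden, w_out)
for _ in range(200):
    backprop_step(x, t, w_hidden, w_out, eta=0.5)
after = squared_error(x, t, w_hidden, w_out)
print(before, after)
```

The hidden error terms are computed before the output weights change, so each hidden unit is charged with its share of the error under the weights that actually produced the output.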

• Halting Backpropagation Options


– Stop after a certain number of iterations. If the number of iterations is too
low, the error may not be reduced sufficiently. Too many iterations can
lead to overfitting the training data.
– Stop when the error on the training examples falls below some threshold
– Stop when the error on a separate validation set of examples meets some criterion
Multilayer Networks and the Backpropagation Algorithm
Continue…
• Adding Momentum:
• The addition of momentum is a variation on the standard Backpropagation
algorithm which has several positive properties:
– It can help keep the algorithm from becoming trapped in small local minima
– It can help keep the algorithm from stalling on flat portions of the error surface with
no gradient
– In regions where the gradient is not changing, it can speed up convergence
by gradually increasing the step size
• The alteration of the algorithm makes the weight update partially dependent on
the update that occurred in the previous iteration. The weight update rule then
becomes:

$$ \Delta w_{ji}(n) = \eta\, \delta_j\, x_{ji} + \alpha\, \Delta w_{ji}(n-1) $$

– Δwji(n): weight update performed during the nth iteration through the main
loop of the algorithm
– α: momentum constant, 0 ≤ α < 1
– Δwji(n-1): the weight update that occurred during the (n-1)th iteration
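A small sketch of the momentum rule Δwji(n) = η δj xji + α Δwji(n−1). The function name and the numbers are illustrative; gradient_step stands for the plain term η δj xji.

```python
def momentum_update(gradient_step, previous_update, alpha=0.9):
    """Delta w(n) = gradient_step + alpha * Delta w(n-1)."""
    return gradient_step + alpha * previous_update

# With an unchanging gradient term the effective step keeps growing,
# approaching gradient_step / (1 - alpha)
update = 0.0
history = []
for n in range(3):
    update = momentum_update(gradient_step=1.0, previous_update=update, alpha=0.5)
    history.append(update)
print(history)
```

In a flat or steadily sloping region the previous update keeps pushing the weights in the same direction, which is the speed-up described above.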
Multilayer Networks and the Backpropagation Algorithm
Continue…
• Learning in Arbitrary Acyclic Networks
• The backpropagation algorithm can be easily extended to feedforward
networks of an arbitrary depth.
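• The change in the weight update rule is such that when computing δ
values, the δr value for a unit r in a layer m is computed from the δ values at
the next deeper layer m+1 according to

$$ \delta_r = o_r (1 - o_r) \sum_{s \in \text{layer } m+1} w_{sr}\, \delta_s $$

• This step may be repeated for any number of hidden layers in the
network. Furthermore, it can be generalized to any acyclic network
regardless of how the network units are arranged. The general rule for
calculating δ for any internal (non-output) unit is

$$ \delta_r = o_r (1 - o_r) \sum_{s \in \text{Downstream}(r)} w_{sr}\, \delta_s $$

where Downstream(r) is the set of units whose inputs include the output of unit r.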
Multilayer Networks and the Backpropagation Algorithm
Continue…
• Comments on Backpropagation:
• Gradient descent over entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global, error minimum. (However, it
does tend to perform well in practice by running multiple times)
• Often include weight momentum α in Δwji(n) = ηδjxji + αΔwji(n-1)
• Minimizes error over training examples
• Training can take thousands of iterations → slow.
• Using network after training is very fast.
Multilayer Networks and the Backpropagation Algorithm
Continue…
• Efficiency
– For training the network, the number of iterations is very important.
If it is too small, the error will be high. If it is too large,
overfitting may occur and cause the error on unseen data to rise.
– Learning: Intractable in general. Training can take thousands of
iterations, which will be very slow. Learning net with single
hidden unit is NP-hard.
– In practice, backpropagation is very useful.
– It will be very fast to classify examples using network after
training.
– Very general in terms of “patterns”
Multilayer Networks and the Backpropagation Algorithm
Continue…
• Efficiency
– Advantages of Backpropagation
• Robust to noise in the data
• Handles continuous values
• Very general
– Disadvantages of Backpropagation
• No human interpretation of the learned weights
• Hard to know how the network arrives at its output
Convergence and Local Minima
• Backpropagation is only guaranteed to converge to a local, and not
necessarily a global, minimum
• Since each weight in a network corresponds to a different
dimension in the error space, a local minimum with respect to one
weight may not be a local minimum with respect to other weights
• This can provide an “escape route” from becoming trapped in local
minima
• If the weights are initialized to values close to zero, the sigmoid
threshold function is approximately linear and so the units produce
nearly linear outputs
• As the weights grow, though, the network is able to represent more
complex functions that are not linear in nature
• The hope is that by the time the weights reach this point, they are already
close enough to the global minimum that even becoming stuck in a
local minimum in this region will be acceptable.
Convergence and Local Minima Continue…
• Common heuristic methods to reduce the problem of local
minima are:
– Add a momentum term to the weight-update rule
– Use stochastic gradient descent rather than true gradient
descent
– Train multiple networks using the same training data but-
• initialize the networks with different random weights
• If the different networks lead to different local minima,
choose the network that performs best on a validation
set of data or all networks can be kept and treated as a
committee whose output is the (possibly weighted)
average of individual network outputs
Representational Power of Feedforward Networks
• Functions that can be represented:
– Boolean function:
• Every Boolean function can be represented by some network with
exactly 2 layers of units, though the number of hidden units can grow
exponentially with the number of inputs.
– Continuous function:
• Every bounded continuous function can be approximated by a
network with 2 layers of units
• the hidden layer contains sigmoid units while the output layer
consists of linear units
• The number of hidden units will depend on the function to be
approximated.
– Arbitrary function:
• Any function can be approximated to arbitrary accuracy by a network with 3 layers of units
• The output layer uses linear units
• The two hidden layers use sigmoid units
• The number of units needed at each layer is not known in general
Hypothesis Space Search and Inductive Bias
• Hypothesis Space Search:
– In backpropagation, every possible assignment of network
weights represents a distinct hypothesis.
– This space is continuous, unlike the discrete representations such
as decision trees.
– The hypothesis space is then an n-dimensional Euclidean space of
the n network weights.
• Inductive Bias:
– In backpropagation, the inductive bias can be described as “smooth
interpolation between data points.”
– In other words, given two positive data points with no negative
examples between them, backpropagation tends to label the points
in between as positive as well.
Hidden Layer Representation
• As training examples provide only the inputs and target outputs, the
network is free to set the internal (hidden-unit) weights in whatever way minimizes the
error.
• The network can thereby capture properties that are not explicit in the input
representation
• This allows networks to create new features not explicitly introduced
by the designer, though these features must still be computable as
sigmoid function units.
• ANNs are used in general to approximate functions.
• However not every artificial neural network can approximate every
function. There are limitations due to the architecture chosen, the
number of layers and the hidden units per layer.
• For example, a network with just one hidden unit cannot learn the
identity function over 8 inputs. The network needs at least three hidden units to
learn this function since there are 8 inputs and 8 outputs, and 2^3 = 8.
Overfitting, and Stopping Criterion
• Stopping Criterion
– A common stopping condition for backpropagation is to continue to
run until the error on the training examples falls below some threshold.
– This has the tendency to overfit the training examples. This
occurs because the network tunes the weights to fit the
training data specifically and not the general distribution of the
examples.
– Because the weights start out small and random, overfitting tends to occur in later
iterations
– As the network runs, the complexity of the learned decision
surface increases in response to weights being changed. This
results in a learned function that can include noise.
– If the training data contains a trend or tendency that doesn’t
exist in the actual data, the ANN is likely to learn that trend if
it is given a sufficiently long training period.
Overfitting, and Stopping Criterion Continue…

Fig. Error versus Weight Update


Overfitting, and Stopping Criterion Continue…
• Figure illustrates a typical situation that can arise in ANN
learning
• In both cases:
• Training Set Error: Error E over the training examples decreases
monotonically, as gradient descent minimizes this measure of error
• Validation Set Error: Error over the separate “validation” set of
examples typically decreases at first, then may later increase due to
overfitting the training examples.
Overfitting, and Stopping Criterion Continue…
• Overfitting
• Overfitting tends to occur during later iterations, but not during earlier
iterations
• Reason is:
– Consider that the network weights are initialized to small random values. With
nearly identical weight values, only very smooth decision surfaces are
describable.
– As training proceeds, the weights begin to grow in order to reduce the
error over the training data, and the complexity of the learned decision surface
increases.
– Thus, the effective complexity of the hypotheses that can be reached by
Backpropagation increases with the number of weight-tuning iterations
– Given enough weight-tuning iterations, Backpropagation often creates
overly complex decision surfaces that fit the noise in the training data or
unrepresentative characteristics of the particular training sample.
Overfitting, and Stopping Criterion Continue…
• Two techniques are used to avoid overfitting:
– Weight Decay:
• Weights are decreased by a small factor at each iteration
• This keeps weight values small and avoids complex decision surfaces
– Validation Set:
• The validation set serves as “unseen data”
• Weights are updated according to the training data, while the set of
weights with the lowest validation error so far is kept separately
• Training then continues
• When the current validation error is greater than the “best so far”, training
is stopped and the best set of weights is accepted.
• Thus, when the error goes up on the validation set, training stops.
• This works well when there is enough data for training and validation.
• If the available data set is not large, a k-fold cross-validation approach is
used.
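The two techniques above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the learning rate `lr`, the `decay` factor and the toy training step are assumptions, not part of the slides.

```python
def step_with_weight_decay(weights, grads, lr=0.1, decay=0.01):
    """One gradient-descent step in which every weight is first shrunk
    by the small factor (1 - decay)."""
    return [(1.0 - decay) * w - lr * g for w, g in zip(weights, grads)]


def train_with_early_stopping(weights, train_step, validation_error, max_iters=100):
    """Keep the weights with the lowest validation error seen so far and
    stop as soon as the validation error rises above that best value."""
    best_weights = list(weights)
    best_error = validation_error(weights)
    for _ in range(max_iters):
        weights = train_step(weights)        # update using the training data
        error = validation_error(weights)    # check on the "unseen" data
        if error > best_error:               # validation error went up: stop
            break
        best_weights, best_error = list(weights), error
    return best_weights, best_error


# Toy stand-ins (hypothetical): each step moves the single weight by +1,
# and the validation error is lowest at w = 3.
best_w, best_err = train_with_early_stopping(
    [0.0],
    train_step=lambda w: [w[0] + 1.0],
    validation_error=lambda w: (w[0] - 3.0) ** 2,
)
print(best_w, best_err)  # [3.0] 0.0
```

In practice the stopping rule is often relaxed to tolerate a few bad iterations before stopping; the strict version here follows the rule as stated on the slide.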
CNN: Convolutional Neural
Network
Introduction
• Artificial intelligence is steadily bridging the gap between the
capabilities of humans and machines.
• One of the domains of artificial intelligence is Computer Vision
• The agenda for this field is to enable machines to view the world as
humans do, perceive it in a similar manner and even use the
knowledge for a multitude of tasks such as Image & Video
recognition, Image Analysis & Classification, Media Recreation,
Recommendation Systems, Natural Language Processing, etc.
• The advancements in Computer Vision with Deep Learning have been
constructed and perfected with time, primarily over one particular
algorithm: a Convolutional Neural Network
Introduction
Before diving into the Convolutional Neural Network, let us first revisit some
concepts of neural networks. In a regular neural network there are three
types of layers:
• Input Layer: This is the layer through which we give input to our model. The
number of neurons in this layer is equal to the total number of features in our
data (the number of pixels in the case of an image).
• Hidden Layer: The input from the input layer is then fed into the hidden layer.
There can be many hidden layers depending upon our model and data size.
Each hidden layer can have a different number of neurons, often
greater than the number of features. The output from each layer
is computed by matrix multiplication of the output of the previous layer with
the learnable weights of that layer, then by addition of learnable biases,
followed by an activation function which makes the network nonlinear.
• Output Layer: The output from the hidden layer is then fed into a logistic
function like sigmoid or softmax which converts the output for each class
into a probability score for that class.
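The hidden-layer computation described above (weights times previous output, plus bias, then activation) can be sketched as follows. The function name `dense_forward`, the layer sizes and all numeric values are illustrative assumptions; `math.tanh` is used only as one possible activation.

```python
import math


def dense_forward(inputs, weights, biases, activation):
    """Output of one fully connected layer.

    weights[j][i] is the weight from input i to neuron j, so each row of
    `weights` holds the incoming weights of one neuron.
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs


# Hypothetical layer: 2 input features feeding 3 hidden neurons.
x = [1.0, 2.0]
W = [[0.5, -0.25], [1.0, 1.0], [-1.0, 0.5]]
b = [0.0, -3.0, 0.5]
hidden = dense_forward(x, W, b, math.tanh)
print(hidden)
```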
Understanding Convolutional Neural Networks
• Among neural networks, the Convolutional Neural Network (ConvNet or
CNN) is one of the main categories used for image recognition and image
classification.
• Object detection, face recognition, etc., are some of the areas where
CNNs are widely used.
• How are these CNNs used for image classification?
• The Problem Space
• Image classification is the task of taking an input image and
outputting a class (a cat, dog, etc.) or a probability of classes that best
describes the image.
• For humans, this task of recognition is one of the first skills we learn
from the moment we are born and is one that comes naturally and
effortlessly as adults
Understanding Convolutional Neural Networks
Continue…
• Inputs and Outputs
• When a computer sees an image (takes an image as input), it will see
an array of pixel values.
• Depending on the resolution and size of the image, it might see, for example, a 32 x
32 x 3 array of numbers (the 3 refers to RGB values).
• For a color image in JPG form with size 480 x 480, the representative
array will be 480 x 480 x 3. Each of these numbers is given a value
from 0 to 255 which describes the pixel intensity at that point
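As a small illustration of what "an array of pixel values" means, the sketch below builds a 480 x 480 x 3 image as nested Python lists. The helper names `make_rgb_image` and `shape` are hypothetical; real code would normally load the image with an imaging library instead.

```python
def make_rgb_image(height, width, color=(0, 0, 0)):
    """Build a height x width x 3 nested list filled with one RGB color."""
    return [[list(color) for _ in range(width)] for _ in range(height)]


def shape(image):
    """Return (height, width, channels) of a nested-list image."""
    return (len(image), len(image[0]), len(image[0][0]))


image = make_rgb_image(480, 480)   # an all-black image
image[0][0] = [255, 0, 0]          # top-left pixel set to pure red
n_values = sum(len(pixel) for row in image for pixel in row)
print(shape(image), n_values)      # (480, 480, 3) 691200
```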
Understanding Convolutional Neural Networks
Continue…
• What We Want the Computer to Do
• Now the problem and the inputs/outputs are known.
• What we want the computer to do is to be able to differentiate
between all the images it’s given and figure out the unique features
that make a dog a dog or that make a cat a cat
• This is the process that goes on in our minds subconsciously. When we
look at a picture of a dog, we can classify it as such if the picture has
identifiable features such as paws or 4 legs
• In a similar way, the computer is able to perform image classification by
looking for low-level features such as edges and curves, and then building up to more
abstract concepts through a series of convolutional layers.
• This is a general overview of what a CNN does.
Understanding Convolutional Neural Networks
Continue…
• Biological Connection
• The term convolutional neural network suggests a connection to neuroscience or biology
• CNNs do take a biological inspiration from the visual cortex. The visual
cortex has small regions of cells that are sensitive to specific regions of the
visual field.
• This idea was expanded upon by a fascinating experiment by Hubel and
Wiesel in 1962 where they showed that some individual neuronal cells in
the brain responded (or fired) only in the presence of edges of a certain
orientation (https://www.youtube.com/watch?v=Cw5PKV9Rj3o)
• Hubel and Wiesel found out that all of these neurons were organized in a
columnar architecture and that together, they were able to produce visual
perception. This idea of specialized components inside of a system having
specific tasks (the neuronal cells in the visual cortex looking for specific
characteristics) is one that machines use as well, and is the basis behind
CNNs.
Why ConvNets over Feed-Forward Neural Nets?
• An image is nothing but a matrix of pixel values, so one option is to flatten the image (e.g. a 3x3
image matrix into a 9x1 vector) and feed it to a Multi-Layer Perceptron for
classification purposes

Fig. Flattening of a 3x3 image matrix into a 9x1 vector


• For extremely basic binary images, the method might show an average precision
score while performing prediction of classes but would have little to no
accuracy when it comes to complex images having pixel dependencies
throughout.
• A ConvNet is able to successfully capture the Spatial and Temporal
dependencies in an image through the application of relevant filters. The
architecture performs a better fitting to the image dataset due to the reduction in
the number of parameters involved and reusability of weights. In other words,
the network can be trained to understand the sophistication of the image better.
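To make the flattening step and the "reduction in the number of parameters" concrete, here is a rough sketch. The layer sizes (a 32 x 32 x 3 input, 100 dense neurons, 12 filters of size 3 x 3) are assumptions chosen for illustration only.

```python
def flatten(matrix):
    """Turn a 2-D image matrix into a 1-D vector, row by row."""
    return [value for row in matrix for value in row]


def dense_params(n_inputs, n_neurons):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_neurons + n_neurons


def conv_params(filter_size, in_channels, n_filters):
    """Weights plus biases of a convolution layer; the count does not
    depend on the image size because the filters are reused everywhere."""
    return filter_size * filter_size * in_channels * n_filters + n_filters


image = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
vector = flatten(image)                      # 9 values
dense = dense_params(32 * 32 * 3, 100)       # flattened 32x32x3 image -> 100 neurons
conv = conv_params(3, 3, 12)                 # twelve 3x3x3 filters
print(vector, dense, conv)  # [1, 0, 1, 0, 1, 0, 1, 0, 1] 307300 336
```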
What is Convolution
• Convolutional Neural Networks, or ConvNets, are
neural networks that share their parameters
• Consider an image as a cuboid having
length and width (the dimensions of the image) and
height (as images generally have red, green,
and blue channels).
• Now imagine taking a small patch of this
image and running a small neural network on
it, with k outputs, and representing them
vertically. Now slide that neural network
across the whole image. As a result, we will
get another image with different width,
height, and depth. Instead of just R, G and B
channels we now have more channels but
smaller width and height. This operation is
called Convolution.
• If the patch size is the same as that of the image, it
will be a regular neural network. Because of
this small patch, we have fewer weights.
What is Convolution Continue…
• The mathematics involved in the whole convolution
process:
• Convolution layers consist of a set of learnable filters (the patch in the above
image). Every filter has a small width and height and the same depth as that of the
input volume (3 if the input layer is an image input).
• For example, suppose we have to run convolution on an image with dimension
34x34x3. Possible filter sizes are a x a x 3, where ‘a’ can be 3, 5, 7, etc.,
but small compared to the image dimension.
• During the forward pass, we slide each filter across the whole input volume step
by step, where the size of each step is called the stride (which can have value 2 or 3 or even
4 for high-dimensional images), and compute the dot product between the
weights of the filter and the patch from the input volume.
• As we slide our filters we’ll get a 2-D output for each filter. We’ll stack
them together and, as a result, get an output volume having a depth equal
to the number of filters. The network will learn all the filters.
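The forward pass described above can be sketched in plain Python as below. This is a minimal, unoptimized sketch: the function name `conv_layer`, the `[row][column][channel]` layout and the constant filter values are assumptions made for illustration (in a real network the filter values are learned, and a bias is usually added).

```python
def conv_layer(volume, filters, stride=1):
    """Slide every filter over an H x W x C volume (no padding).

    volume[i][j] is the list of C channel values at row i, column j.
    Each filter is an f x f x C nested list. The result is an
    out_h x out_w x K volume, where K is the number of filters.
    """
    height, width = len(volume), len(volume[0])
    f = len(filters[0])
    out_h = (height - f) // stride + 1
    out_w = (width - f) // stride + 1
    output = []
    for i in range(out_h):
        out_row = []
        for j in range(out_w):
            responses = []
            for filt in filters:
                # dot product between the filter weights and the patch
                total = 0
                for di in range(f):
                    for dj in range(f):
                        pixel = volume[i * stride + di][j * stride + dj]
                        for c, w in enumerate(filt[di][dj]):
                            total += w * pixel[c]
                responses.append(total)
            out_row.append(responses)
        output.append(out_row)
    return output


# 34x34x3 input of ones and two hypothetical 3x3x3 filters.
volume = [[[1, 1, 1] for _ in range(34)] for _ in range(34)]
ones_filter = [[[1, 1, 1] for _ in range(3)] for _ in range(3)]
twos_filter = [[[2, 2, 2] for _ in range(3)] for _ in range(3)]
out = conv_layer(volume, [ones_filter, twos_filter])
print(len(out), len(out[0]), len(out[0][0]), out[0][0])  # 32 32 2 [27, 54]
```

The output depth (2) equals the number of filters, and with a 3 x 3 filter and no padding the 34 x 34 input shrinks to 32 x 32.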
Layers used to build ConvNets
• A ConvNet is a sequence of layers, and every layer transforms one volume to another
through a differentiable function.
• Types of layers:
Let’s take an example by running a ConvNet on an image of dimension 32 x 32 x 3.
1. Input Layer: This layer holds the raw input of the image with width 32, height 32 and
depth 3.
2. Convolution Layer: This layer computes the output volume by computing the dot
product between each filter and each image patch. Suppose we use a total of 12 filters for this
layer; with stride 1 and padding that preserves width and height, we’ll get an output volume of dimension 32 x 32 x 12.
3. Activation Function Layer: This layer applies an element-wise activation function
to the output of the convolution layer. Some common activation functions are ReLU:
max(0, x), Sigmoid: 1/(1+e^-x), Tanh, Leaky ReLU, etc. The volume remains
unchanged, hence the output volume will have dimension 32 x 32 x 12.
4. Pool Layer: This layer is periodically inserted in the ConvNet. Its main function is
to reduce the size of the volume, which makes computation faster, reduces memory use and
also helps prevent overfitting. Two common types of pooling layers are max
pooling and average pooling. If we use a max pool with 2 x 2 filters and stride 2, the
resultant volume will be of dimension 16x16x12 (e.g. Figure).
5. Fully-Connected Layer: This layer is a regular neural network layer which takes input
from the previous layer, computes the class scores and outputs a 1-D array of
size equal to the number of classes.
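The volume sizes in the example above can be traced with a short sketch. The 3 x 3 filter with padding 1 and the 10 output classes are assumptions, since the slide does not state them; they are chosen so that the convolution keeps the 32 x 32 size.

```python
def conv_shape(shape, n_filters, f=3, stride=1, pad=1):
    """Shape after a convolution layer; depth becomes the filter count."""
    h, w, _ = shape
    out_h = (h + 2 * pad - f) // stride + 1
    out_w = (w + 2 * pad - f) // stride + 1
    return (out_h, out_w, n_filters)


def pool_shape(shape, f=2, stride=2):
    """Shape after a pooling layer; depth is unchanged."""
    h, w, depth = shape
    return ((h - f) // stride + 1, (w - f) // stride + 1, depth)


shapes = {"input": (32, 32, 3)}
shapes["conv"] = conv_shape(shapes["input"], n_filters=12)
shapes["activation"] = shapes["conv"]          # element-wise, shape unchanged
shapes["pool"] = pool_shape(shapes["activation"])
shapes["fc"] = (10,)                           # assumed number of classes
for name, s in shapes.items():
    print(name, s)
```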
Input Image
• In the figure, we have an RGB image
which has been separated by its three
color planes — Red, Green, and Blue.
• There are a number of such color
spaces in which images exist —
Grayscale, RGB, HSV, CMYK, etc.
• The role of the ConvNet is to reduce
the images into a form which is easier
to process, without losing features
which are critical for getting a good
prediction.
Fig. 4x4x3 RGB Image
• This is important when we are to
design an architecture which is not
only good at learning features but also
is scalable to massive datasets.
Working of Convolution Layer

Fig. Neural network with many convolutional layers


Working of Convolution Layer Continue…
• Convolution is the first layer to extract features from an input image.
Convolution preserves the relationship between pixels by learning
image features using small squares of input data. It is a mathematical
operation that takes two inputs such as image matrix and a filter or
kernel.

Fig. Image matrix multiplies kernel or filter matrix


Working of Convolution Layer Continue…
• Consider a 5 x 5 image whose
pixel values are 0 or 1 and a filter
matrix of 3 x 3, as shown in Fig 1.
• Image Dimensions = 5 (Height) x 5
(Breadth) x 1 (Number of channels,
e.g. RGB)
• In the demonstration, the green
section resembles our 5x5x1 input
image, I.
• The element involved in carrying
out the convolution operation in
the first part of a Convolutional
Layer is called the Kernel/Filter,
K, represented in the color yellow.
We have selected K as a 3x3x1
matrix.
Fig 1. Image matrix multiplies kernel or filter matrix
Fig 2. Convoluting a 5x5x1 image with a 3x3x1 kernel to get a 3x3x1 convolved feature
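A worked numeric example of convolving a 5x5x1 image with a 3x3x1 kernel is sketched below. The pixel and kernel values are illustrative assumptions and are not read from the figures; the helper `convolve2d` assumes square inputs and a stride of 1.

```python
def convolve2d(image, kernel):
    """Slide a k x k kernel over a square single-channel image (stride 1,
    no padding) and sum the element-wise products at each position."""
    k = len(kernel)
    out_size = len(image) - k + 1
    feature = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(k) for dj in range(k)))
        feature.append(row)
    return feature


I = [[1, 1, 1, 0, 0],
     [0, 1, 1, 1, 0],
     [0, 0, 1, 1, 1],
     [0, 0, 1, 1, 0],
     [0, 1, 1, 0, 0]]
K = [[1, 0, 1],
     [0, 1, 0],
     [1, 0, 1]]

feature = convolve2d(I, K)
print(feature)  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

The kernel visits 3 x 3 = 9 positions, one for each value of the convolved feature.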
Working of Convolution Layer Continue…
• Convolving an image with
different filters can perform
operations such as edge
detection, blur and sharpen.
The example
shows the images that result
from applying different
types of filters (kernels).
Fig. Some common filters
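As a rough illustration of such filters, the sketch below uses three commonly cited 3 x 3 kernels (edge detection, sharpen, box blur). The kernel values are standard textbook choices and are an assumption here; they are not read from the figure. The helper `apply_filter` and the two tiny test images are hypothetical.

```python
def apply_filter(image, kernel):
    """Stride-1, no-padding convolution of a 2-D image with a k x k kernel."""
    k = len(kernel)
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(len(image[0]) - k + 1)]
            for i in range(len(image) - k + 1)]


EDGE = [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]
SHARPEN = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]
BLUR = [[1 / 9] * 3 for _ in range(3)]

flat = [[10] * 4 for _ in range(4)]            # uniform region, no edges
step = [[0, 0, 10, 10] for _ in range(4)]      # vertical edge in the middle

print(apply_filter(flat, EDGE))     # zero response: nothing to detect
print(apply_filter(step, EDGE))     # non-zero response around the edge
print(apply_filter(flat, SHARPEN))  # uniform region is left unchanged
print(apply_filter(step, BLUR))     # edge is smoothed into intermediate values
```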
Deciding optimal kernel size for CNN
• The linear combination of the pixels weighted by the
convolutional filter extracts some kind of feature from the
image.
• Need to consider following points:
– Most of the useful features in an image are usually local
and it makes sense to take few local pixels at a time to
apply convolutions.
– Most of these useful features may be found in more than
one place in an image.
– An added benefit of using a small kernel instead of a fully
connected network is weight sharing and a
reduction in computational cost.
Comparison Between Smaller and Larger Filter
No. | Smaller Filter Sizes | Larger Filter Sizes
1. | Two 3x3 kernels reduce the image size by 4. | One 5x5 kernel results in the same reduction.
2. | We have used (3x3 + 3x3) = 18 weights. | We have used (5x5) = 25 weights.
3. | So, we get a lower number of weights but more layers. | Higher number of weights but fewer layers.
4. | Therefore, computationally efficient. | And, this is computationally expensive.
5. | With more layers, it learns complex, more non-linear features. | With fewer layers, it learns simpler non-linear features.
6. | With more layers, it needs more memory for backpropagation. | And, it will use less memory for backpropagation.
Comparison Between Smaller and Larger Filter Continue…
No. | Smaller Filter Sizes | Larger Filter Sizes
7. | It has a smaller receptive field as it looks at very few pixels at once. | Larger receptive field per layer.
8. | Highly local features extracted without much image overview. | Quite generic features extracted, spread across the image.
9. | Therefore captures smaller, complex features in the image. | Therefore captures the basic components in the image.
10. | Amount of information extracted will be vast, maybe useful in later layers. | Amount of information extracted is considerably less.
11. | Slow reduction in the image dimension can make the network deep. | Fast reduction in the image dimension makes the network shallow.
12. | Better weight sharing. | Poorer weight sharing.
13. | In an extreme scenario, using a 1x1 convolution is like treating each pixel as a useful feature. | Using an image-sized filter is equivalent to a fully connected layer.
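The arithmetic behind the first rows of the comparison (two stacked 3x3 kernels versus one 5x5 kernel) can be checked with a short sketch. It assumes single-channel filters, stride 1 and no padding; the function names and the 32-pixel input size are illustrative.

```python
def weight_count(filter_sizes):
    """Total weights of a stack of single-channel square filters."""
    return sum(f * f for f in filter_sizes)


def receptive_field(filter_sizes):
    """Input pixels (per axis) seen by one output value after stacking
    stride-1 convolutions with the given filter sizes."""
    field = 1
    for f in filter_sizes:
        field += f - 1
    return field


def final_size(n, filter_sizes):
    """Image size after applying each filter with stride 1, no padding."""
    for f in filter_sizes:
        n = n - f + 1
    return n


print(weight_count([3, 3]), weight_count([5]))        # 18 25
print(receptive_field([3, 3]), receptive_field([5]))  # 5 5
print(final_size(32, [3, 3]), final_size(32, [5]))    # 28 28
```

Both options shrink the image by 4 pixels and cover the same 5 x 5 region, but the stacked 3x3 kernels do so with fewer weights.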
Working of Convolution Layer Continue…
• Strides
• Stride is the number of pixels by which the filter shifts over the input matrix.
• When the stride is 1, we move the filter 1 pixel at a time.
When the stride is 2, we move the filter 2 pixels at a time, and
so on. The figure below shows how convolution works with a stride
of 2.

Fig. Stride of 2 pixels


Working of Convolution Layer Continue…
• The Kernel shifts 9 times because
of Stride Length = 1 (Non-Strided),
every time performing
an element-wise multiplication and sum
between K and the portion P of the
image over which the kernel is
hovering.

Fig 1. Convoluting a 5x5x1 image with a 3x3x1 kernel to get a 3x3x1 convolved feature

Fig 2. Movement of the Kernel
Working of Convolution Layer Continue…
• The filter moves to the right with a certain Stride Value till it parses the complete
width. Moving on, it hops down to the beginning (left) of the image with the same
Stride Value and repeats the process until the entire image is traversed.

Fig. Convolution operation on a MxNx3 image matrix with a 3x3x3 Kernel


Working of Convolution Layer Continue…
• Padding
• Sometimes the filter does not fit the input image perfectly. We have
two options:
– Pad the picture with zeros (zero-padding) so that it fits
– Drop the part of the image where the filter did not fit. This is called
valid padding, which keeps only the valid part of the image.
• The objective of the Convolution Operation is to extract
features, such as edges, from the input image.
• There are two types of results of the operation:
– One in which the convolved feature is reduced in dimensionality
compared to the input. This is done by applying Valid Padding
– The other in which the dimensionality is either increased or
remains the same. This is done by applying Same Padding
Working of Convolution Layer Continue…
• Given:
– Input: n X n
– Padding: p
– Filter size: f X f
– Output: (n+2p-f+1) X (n+2p-f+1)
• Same Padding
– When we augment the 5x5x1 image into a
7x7x1 image (p = 1) and then apply the 3x3x1
kernel over it, we find that the convolved
matrix turns out to be of dimensions 5x5x1.
Hence the name: Same Padding.
– We apply padding so that the output size
is the same as the input size, i.e.,
n+2p-f+1 = n
So, p = (f-1)/2
• Valid Padding
– On the other hand, if we perform the same
operation without padding, we are presented
with a 3x3x1 matrix (5-3+1 = 3), which here has the dimensions of the
Kernel (3x3x1) itself: Valid Padding.
– It means no padding. If we are using valid
padding, the output will be
(n-f+1) X (n-f+1)
Fig. SAME padding: 5x5x1 image is padded with 0s to create a 7x7x1 image
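The padding formulas above can be checked on the 5x5 image with a 3x3 kernel. With p = (f-1)/2 = 1, one border of zeros is added on every side. The helper names `zero_pad` and `same_padding` are hypothetical, and `same_padding` assumes an odd filter size.

```python
def zero_pad(image, p):
    """Surround a 2-D image with p rows/columns of zeros on every side."""
    width = len(image[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)
    padded += [[0] * width for _ in range(p)]
    return padded


def same_padding(f):
    """p = (f - 1) / 2, exact for odd filter sizes f."""
    return (f - 1) // 2


image = [[1] * 5 for _ in range(5)]
f = 3
p = same_padding(f)
padded = zero_pad(image, p)
same_size = len(padded) - f + 1    # n + 2p - f + 1
valid_size = len(image) - f + 1    # n - f + 1
print(len(padded), same_size, valid_size)  # 7 5 3
```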
Working of Convolution Layer Continue…
• Strided Convolutions
• Suppose we choose a stride of 2. So, while convolving over the
image, we will take steps of two pixels, in both the horizontal and vertical
directions. The dimensions for stride s will be:
– Input: n X n
– Padding: p
– Stride: s
– Filter size: f X f
– Output: [floor((n+2p-f)/s)+1] X [floor((n+2p-f)/s)+1]
• Stride helps to reduce the size of the image, a particularly useful
feature.
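The strided output-size formula can be sketched as below, with integer division playing the role of the floor. The function names are illustrative; `window_starts` lists where the filter is placed along one axis of the (padded) input.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output size along one axis: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1


def window_starts(n, f, p=0, s=1):
    """Start index of every filter placement along one padded axis."""
    return [k * s for k in range(conv_output_size(n, f, p, s))]


print(conv_output_size(5, 3))             # 3  (stride 1, no padding)
print(conv_output_size(7, 3, s=2))        # 3
print(window_starts(7, 3, s=2))           # [0, 2, 4]
print(conv_output_size(6, 3, s=2))        # 2  (the last column is dropped)
```

When (n + 2p - f) is not divisible by s, the floor means the filter never reaches the last rows and columns, which is why the 6-wide input above yields only 2 outputs.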
Working of Activation Function Layer

Fig. Neural network with many activation function layers


Working of Activation Function Layer Continue…
• Activation Function Layer: This
layer applies an element-wise
activation function to the output of the
convolution layer. Some common
activation functions are ReLU:
max(0, x), Sigmoid: 1/(1+e^-x),
Tanh, Leaky ReLU, etc. The
volume remains unchanged, hence the
output volume will have dimension
32 x 32 x 12.
• Most data scientists use ReLU,
since performance-wise ReLU is often
better than sigmoid or tanh.
• Non Linearity (ReLU)
• ReLU stands for Rectified Linear
Unit, a non-linear operation. The
output is ƒ(x) = max(0,x).
• Why ReLU is important: ReLU’s
purpose is to introduce non-linearity
in our ConvNet, since most of the real-world
data we want our ConvNet to
learn is non-linear. ReLU also keeps the values
non-negative.
Fig : ReLU operation
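The activation functions named above can be written in a few lines and applied element-wise to a feature map. The leaky-ReLU slope of 0.01 and the sample feature map are assumptions for illustration.

```python
import math


def relu(x):
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    """Like ReLU, but negative inputs are scaled by a small slope."""
    return x if x > 0 else slope * x


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def apply_elementwise(fn, feature_map):
    """Apply fn to every value; the shape of the map is unchanged."""
    return [[fn(v) for v in row] for row in feature_map]


feature_map = [[-2.0, 0.0], [1.5, 3.0]]
rectified = apply_elementwise(relu, feature_map)
print(rectified)                                   # [[0.0, 0.0], [1.5, 3.0]]
print(apply_elementwise(leaky_relu, feature_map))
print(apply_elementwise(sigmoid, feature_map))
print(apply_elementwise(math.tanh, feature_map))
```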
Working of Pooling Layer

Fig. Neural network with many pooling layers


Working of Pooling Layer
• Pooling Layer
• The pooling layer reduces the number of parameters when the images are too
large. Spatial pooling, also called subsampling or downsampling, reduces the
dimensionality of each map but retains important information.
• The Pooling layer is responsible for reducing the spatial size of the Convolved Feature.
• This is to decrease the computational power required to process the data through
dimensionality reduction.
• It is useful for extracting dominant features which are rotationally and positionally
invariant, thus keeping the training of the model effective.
• Spatial pooling can be of different types:
– Max Pooling:
• Max pooling takes the largest element from the rectified feature map
• It returns the maximum value from the portion of the image covered by the Kernel.
– Average Pooling
• It returns the average of all the values from the portion of the image covered by the
Kernel.
– Sum Pooling
• The sum of all elements in the portion of the feature map covered by the Kernel is called sum pooling.
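The three pooling types can be sketched with one helper that takes the reduction as a parameter. The 4 x 4 feature map values, the 2 x 2 window and the stride of 2 are illustrative assumptions.

```python
def pool2d(feature_map, size=2, stride=2, reduce_fn=max):
    """Apply reduce_fn to every size x size window of the feature map."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled


def average(values):
    return sum(values) / len(values)


feature_map = [[1, 1, 2, 4],
               [5, 6, 7, 8],
               [3, 2, 1, 0],
               [1, 2, 3, 4]]

print(pool2d(feature_map, reduce_fn=max))      # [[6, 8], [3, 4]]
print(pool2d(feature_map, reduce_fn=average))  # [[3.25, 5.25], [2.0, 2.0]]
print(pool2d(feature_map, reduce_fn=sum))      # [[13, 21], [8, 8]]
```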
Working of Pooling Layer Continue…
• Max Pooling also performs as
a Noise Suppressant. It
discards the noisy
activations altogether and
also performs de-noising
along with dimensionality
reduction.
• On the other hand, Average
Pooling simply performs
dimensionality reduction
as a noise suppressing
mechanism.
• Hence, we can say that Max
Pooling often performs
better than Average Pooling
Fig. Types of Pooling
Working of Pooling Layer Continue…
• The Convolutional Layer
and the Pooling Layer
together form the i-th layer
of a Convolutional Neural
Network.
• Depending on the
complexities in the images,
the number of such layers
may be increased for
capturing low-level details
even further, but at the cost
of more computational
power.

Fig. 3x3 pooling over 5x5 convolved feature

Fig: Max Pooling


Working of Fully Connected Layer

Fig. Neural network with fully connected layers


Working of Fully Connected Layer Continue…
• In the layer we call the FC layer, we flatten our matrix into a vector and
feed it into a fully connected layer like a regular neural network.

Fig : After pooling layer, flattened as FC layer

• In the above diagram, the feature map matrix is converted into a
vector (x1, x2, x3, …). With the fully connected layers, we combine
these features together to create a model. Finally, we have an
activation function such as softmax or sigmoid to classify the outputs
as cat, dog, car, truck, etc.
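The flatten-then-classify step can be sketched as follows. The pooled feature maps and the class scores are hypothetical numbers; in a real network the scores would be produced by the fully connected layers from the flattened vector.

```python
import math


def flatten(feature_maps):
    """Turn a list of 2-D feature maps into one vector (x1, x2, x3, ...)."""
    return [v for fmap in feature_maps for row in fmap for v in row]


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    top = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


pooled = [[[6, 8], [3, 4]], [[1, 0], [2, 5]]]   # two 2x2 pooled maps
x = flatten(pooled)                              # [6, 8, 3, 4, 1, 0, 2, 5]

class_names = ["cat", "dog", "car", "truck"]
scores = [2.0, 1.0, 0.5, 0.1]                    # hypothetical FC layer outputs
probs = softmax(scores)
label = class_names[probs.index(max(probs))]
print(x, label)
```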
Working of Fully Connected Layer Continue…

Fig. Fully Connected Layer (FC Layer)


Working of Fully Connected Layer Continue…
• Adding a Fully-Connected layer is a (usually) cheap way of learning non-linear
combinations of the high-level features as represented by the output of the
convolutional layer.
• The Fully-Connected layer is learning a possibly non-linear function in that space.
• Now that we have converted our input image into a suitable form for our
Multi-Layer Perceptron, we shall flatten the image into a column vector.
• The flattened output is fed to a feed-forward neural network and backpropagation
is applied at every iteration of training.
• Over a series of epochs, the model is able to distinguish between dominating and
certain low-level features in images and classify them using the Softmax
Classification technique.
• Various architectures of CNNs have been listed below:
1. LeNet
2. AlexNet
3. VGGNet
4. GoogLeNet
5. ResNet
6. ZFNet
Complete CNN
CNN Summary
• Summary
• Provide the input image to the convolution layer
• Choose parameters, apply filters with strides, and padding if
required. Perform convolution on the image and apply
ReLU activation to the matrix.
• Perform pooling to reduce the dimensionality
• Add as many convolutional layers as needed
• Flatten the output and feed it into a fully connected layer (FC
Layer)
• Output the class using an activation function (Logistic
Regression with cost functions) to classify the images.
