ML-MODULE-4 - Part 2
Module-4 Learning with SVM, ANN
▪Learning with Support Vector Machines:
•Introduction,
•Linear Discriminant Functions for Binary Classification,
•Linear Maximal Margin Classifier.
•Linear Soft-Margin Classifier
▪Artificial Neural Network
•Introduction,
•NN Representation, Appropriate Problems,
•Perceptron,
•Back Propagation Algorithm,
•Remarks on Backpropagation Algorithm
Learning with Neural Networks:
▪What is Machine Learning?
▪Machine learning is a subset of artificial intelligence that
allows for optimization. When set up correctly, it helps you
make predictions that minimize the errors that arise from
merely guessing.
▪Machine learning accesses vast amounts of data (both
structured and unstructured) and learns from it to predict the
future.
▪It learns from the data by using multiple algorithms and
techniques.
▪What is Deep Learning?
▪Deep learning is a subset of machine learning that deals with
algorithms inspired by the structure and function of the
human brain.
▪Deep learning algorithms can work with an enormous amount
of both structured and unstructured data.
▪Deep learning’s core concept lies in artificial neural
networks, which enable machines to make decisions.
▪The primary difference between machine learning and deep
learning is how each algorithm learns and how much data
each type of algorithm uses.
▪What is a neural network?
▪Neural networks, also called artificial neural networks
(ANNs) or simulated neural networks (SNNs), are a subset of
machine learning and are the backbone of deep learning
algorithms.
▪They are called “neural” because they mimic how neurons in
the brain signal one another.
▪Neural networks are made up of node layers – an input
layer, one or more hidden layers, and an output layer.
▪Each node is an artificial neuron that connects to the next,
and each has a weight and threshold value.
▪When one node’s output is above the threshold value, that
node is activated and sends its data to the network’s next
layer. If it’s below the threshold, no data passes along.
▪What’s the difference between deep learning and neural
networks?
▪As mentioned in the explanation of neural networks above,
but worth noting more explicitly, the “deep” in deep learning
refers to the depth of layers in a neural network.
▪ A neural network of more than three layers, including the
inputs and the output, can be considered a deep-learning
algorithm.
▪That can be represented by the following diagram:
▪Most deep neural networks are feed-forward, meaning data flows in only one direction, from input to output.
▪However, you can also train your model through back-propagation, meaning moving in the opposite direction, from output to input.
▪Back-propagation allows us to calculate and attribute the error associated with each neuron, allowing us to adjust and fit the algorithm appropriately.
▪Why Deep Learning?
▪In the case of classical machine learning, increasing the volume of training data does not keep improving performance; the model reaches a saturation point.
▪In the case of deep learning, performance keeps improving as the volume of training data increases.
▪While neural networks have been around since the 1940s, it is only in the last several decades that they have become a major part of artificial intelligence.
▪This is due to the arrival of a technique called "backpropagation," which allows networks to adjust their hidden layers of neurons in situations where the outcome doesn't match what the creator is hoping for (for example, a network designed to recognize dogs that misidentifies a cat).
Biological Motivation
▪The basic computational unit of the brain is a neuron.
▪Neurons have specialized cell parts called dendrites and axons.
▪Dendrites bring electrical signals to the cell body, and axons take information away from the cell body.
▪The typical nerve cell of the human brain comprises four parts:
▪1. Dendrites - receive signals from other neurons.
▪2. Soma (cell body) - sums all the incoming signals to generate the input.
▪3. Axon - when the sum reaches a threshold value, the neuron fires and the signal travels down the axon to the other neurons.
▪4. Synapses - the points of interconnection of one neuron with other neurons. The amount of signal transmitted depends upon the strength (synaptic weights) of the connections.
▪A synapse serves as a junction where both the transmission and processing of information occur, making it a vital means of communication between neurons.
▪A synapse can increase or decrease the strength of the connection from neuron to neuron and cause excitation or inhibition of a subsequent neuron. This is where information is stored.
▪ANNs are only loosely motivated by biological neural systems; there are many complexities of biological neural systems that are not modeled by ANNs.
Biological Motivation
Basic Structure of ANNs
Neuron
▪Each ANN is composed of a collection of perceptrons grouped in layers.
▪A typical structure is shown in the figure below. Note the three layers: input, intermediate (called the hidden layer), and output.
▪Several hidden layers can be placed between the input and output layers.
▪The following illustration shows a simple ANN:
Example: Autonomous Driving
Example 2: Bank Credit Score
▪To make things clearer, let's understand ANN using a simple example: A bank wants to assess whether to approve a loan application from a customer, so it wants to predict whether the customer is likely to default on the loan. It has data like below:
Example 2: Bank Credit Score
Some Applications
▪Autonomous Driving
▪Speech Phoneme Recognition
▪Image Classification
▪Financial Prediction
Properties of NNs
▪Many neuron-like units: Neural networks consist of multiple
interconnected processing units (neurons), making them capable
of complex pattern recognition.
▪Weighted interconnections: Each connection between neurons
has a weight that determines its influence, and backpropagation
helps adjust these weights to minimize errors.
▪Highly parallel, distributed processing: Neural networks can
process data in parallel, making them efficient for large-scale
computations.
▪Automatic weight tuning: Backpropagation optimizes these weights through iterative learning using gradient descent.
▪Handling high-dimensional inputs: Neural networks excel at processing high-dimensional data, whether discrete (text, categorical data) or real-valued (image pixels, sensor readings).
Appropriate problems for ANN
The backpropagation algorithm is well-suited for problems with the following characteristics:
▪Long Training Times Are Acceptable: Deep networks require
significant computational resources, but advancements like
GPU acceleration make training manageable.
▪Fast Evaluation for Predictions: Once trained, neural
networks execute forward propagation quickly, making them
ideal for real-time applications like speech recognition or
autonomous driving.
▪Low Emphasis on Human Interpretability: Unlike decision
trees or rule-based models, neural networks function as black
boxes, focusing on performance rather than explainability.
Perceptron
▪The perceptron was introduced by Frank Rosenblatt in 1958 as a type of artificial neural network capable of learning and performing binary classification tasks.
▪We consider it a single-layer neural network with four main parameters: input values, weights and bias, net sum, and an activation function.
▪A perceptron is a single-layer neural network; a multi-layer arrangement of perceptrons is called a neural network.
▪Each perceptron comprises the following parts:
1.Input Values: A set of values or a dataset for predicting the output value. They are also described as the dataset's features.
2.Weights: The real value attached to each feature is known as its weight. It tells the importance of that feature in predicting the final value.
3.Bias: The activation function is shifted towards the left or right using bias. You may understand it simply as the y-intercept in the line equation.
4.Summation Function: The summation function binds the weights and inputs together. It is a function to find their sum.
5.Activation Function: It introduces non-linearity in the perceptron model.
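Putting these parts together, here is a minimal sketch of a perceptron's forward pass in Python (the input, weight, and bias values are hypothetical, chosen only for illustration):

    import numpy as np

    def perceptron(x, w, b):
        # Summation function: weighted sum of the inputs plus the bias
        net = np.dot(w, x) + b
        # Step activation function: fire (output 1) once net reaches the threshold 0
        return 1 if net >= 0 else 0

    x = np.array([1.0, 0.5])    # input values (features)
    w = np.array([0.4, -0.2])   # weights: importance of each feature
    b = 0.1                     # bias: shifts the activation threshold
    print(perceptron(x, w, b))  # -> 1, since net = 0.4 - 0.1 + 0.1 = 0.4 >= 0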
Neuron vs Perceptron
•Neuron: In the context of artificial neural networks, a neuron
is a conceptual unit that takes multiple inputs, performs a
weighted sum of these inputs, and applies an activation
function to produce an output. Neurons are inspired by the
biological neurons in the human brain.
•Perceptron: A perceptron is a specific type of artificial
neuron. It takes multiple binary inputs, multiplies each input
by a weight, computes the weighted sum, adds a bias term,
and passes the result through an activation function. The
output of a perceptron is typically binary (0 or 1).
•Neuron: Neurons in artificial neural networks can have
different types of outputs depending on the activation
function used. The output can be continuous or binary,
depending on the problem and the desired behavior.
•Perceptron: Perceptrons have a binary output (0 or 1) due to the step function traditionally used as the activation function.
Feature | Neuron | Perceptron
Definition | A fundamental processing unit in an artificial neural network. | A simple artificial neuron used in early machine learning models for binary classification.
Complexity | Forms the basis of deep learning models with multiple layers and neurons. | A basic model that cannot handle non-linearly separable problems like XOR.
Perceptron
Why do we Need Weight and Bias?
▪Weight and bias are two important aspects of the perceptron
model. These are learnable parameters and as the network
gets trained it adjusts both parameters to achieve the desired
values and the correct output.
▪Weights are used to measure the importance of each feature in predicting the output value.
▪Features whose weights are close to zero are said to have less significance in the prediction process than features whose weights are further from zero.
▪Besides high-weighted features having greater predictive power than low-weighted ones, a weight can also be positive or negative.
▪If the weight of a feature is positive, it has a direct relationship with the target value; if it is negative, it has an inverse relationship with the target value.
▪In contrast to weight in a neural network that increases the
speed of triggering an activation function, bias delays the
trigger of the activation function.
▪It acts like an intercept in a linear equation. Simply stated, bias is a constant used to adjust the output and help the model provide the best fit for the given data.
▪Weights control the signal (or the strength of the connection)
between two neurons. In other words, a weight decides how
much influence the input will have on the output.
▪Biases, which are constant, are an additional input into the
next layer that will always have the value of 1.
▪ Bias units are not influenced by the previous layer (they do
not have any incoming connections) but they do have
outgoing connections with their own weights.
▪The bias unit guarantees that even when all the inputs are
zeros there will still be an activation in the neuron.
Activation Function
▪What is a Neural Network Activation Function?
▪An Activation Function decides whether a neuron should be
activated or not. This means that it will decide whether the
neuron’s input to the network is important or not in the
process of prediction using simpler mathematical operations.
▪Depending on the nature and intensity of these input signals, the brain processes them and decides whether the neuron should be activated ("fired") or not.
▪In deep learning, this is also the role of the Activation Function, which is why it's often referred to as a Transfer Function in Artificial Neural Networks.
▪When learning about neural networks, you will come across
two essential terms describing the movement of
information—feedforward and backpropagation.
▪Feedforward Propagation - the flow of information occurs in
the forward direction. The input is used to calculate some
intermediate function in the hidden layer, which is then used
to calculate an output.
▪In the feedforward propagation, the Activation Function is a
mathematical “gate” in between the input feeding the current
neuron and its output going to the next layer.
▪Backpropagation - the weights of the network connections
are repeatedly adjusted to minimize the difference between
the actual output vector of the net and the desired output
vector.
Why do Neural Networks Need an Activation Function?
▪Activation functions introduce an additional step at each layer during forward propagation, but the computation is worth it.
▪Here is why: let's suppose we have a neural network working without activation functions. Then every layer performs only a linear transformation on its inputs, so no matter how many layers we stack, the whole network collapses into a single linear model.
▪Mathematically it can be represented as:
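This refers to the binary step function discussed next; its standard form is:

    f(x) = 0 for x < 0
    f(x) = 1 for x ≥ 0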
▪Here are some of the limitations of binary step
function:
•It cannot provide multi-value outputs—for example, it
cannot be used for multi-class classification
problems.
•The gradient of the step function is zero, which
causes a hindrance in the backpropagation process.
77
▪Linear Activation Function
▪The linear activation function, also known as "no activation" or the "identity function" (multiplied by 1.0), is where the activation is proportional to the input.
▪The function doesn't do anything to the weighted sum of the input; it simply spits out the value it was given.
▪Mathematically it can be represented as:
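Its standard form is simply:

    f(x) = x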
▪However, a linear activation function has two major problems: backpropagation cannot be used effectively, since the derivative of the function is a constant with no relation to the input; and all layers of the network collapse into one, since a linear combination of linear functions is still a linear function.
▪Non-linear activation functions solve the following limitations of linear activation functions:
•They allow backpropagation because now the derivative function would be related to the input, and it's possible to go back and understand which weights in the input neurons can provide a better prediction.
•They allow the stacking of multiple layers of neurons, as the output would now be a non-linear combination of input passed through multiple layers. Any output can be represented as a functional computation in a neural network.
▪Sigmoid / Logistic Activation Function
▪Mathematically it can be represented as:
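    σ(x) = 1 / (1 + e^(−x))

This squashes any real-valued input into the range (0, 1).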
▪Here's why the sigmoid/logistic activation function is one of the most widely used functions:
▪Step Function - Outputs either 0 or 1 based on a threshold.
▪Sign Function - Outputs either -1 or 1, providing symmetry
around zero.
Popular Activation functions
▪Sigmoid
▪Step
▪Sign
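A minimal sketch of these three functions in Python (NumPy is used so they apply element-wise to arrays):

    import numpy as np

    def sigmoid(x):
        # Smoothly squashes any real input into the range (0, 1)
        return 1.0 / (1.0 + np.exp(-x))

    def step(x):
        # Binary step: outputs 0 or 1 based on a threshold at 0
        return np.where(x >= 0, 1, 0)

    def sign(x):
        # Sign: outputs -1 or +1, symmetric around zero
        return np.where(x >= 0, 1, -1)

    z = np.array([-2.0, 0.0, 2.0])
    print(sigmoid(z))   # approx [0.119 0.5 0.881]
    print(step(z))      # [0 1 1]
    print(sign(z))      # [-1 1 1]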
Architectures
Single Layer
Multi Layer Perceptrons
Recurrent
Mesh
▪Linearly Separable 2D Data
▪We say a two-dimensional dataset is linearly separable if we
can separate the positive from the negative objects with a
straight line.
▪It doesn’t matter if more than one such line exists. For linear
separability, it’s sufficient to find only one:
▪Conversely, no line can separate linearly inseparable 2D data:
Representation power
Neural Representation of AND, OR, NOT
Logic Gates (Perceptron Algorithm)
AND gate / OR gate
▪Identify suitable weights to make following perceptron to
function as AND gate / OR gate
▪First, we need to know that the Perceptron algorithm states that:
▪Also, the steps in this method are very similar to how Neural Networks learn, which are as follows (a worked example is given after this list):
▪Initialize weight values and bias
▪Forward Propagate
▪Check the error
▪Backpropagate and adjust weights and bias
▪Repeat for all training examples
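For instance, with a step activation that fires when the weighted sum plus bias reaches 0, one of many valid choices is w1 = w2 = 1 with bias b = -1.5 for the AND gate, and the same weights with b = -0.5 for the OR gate. A quick check in Python:

    def gate(x1, x2, w1, w2, b):
        # Perceptron with a step activation: fire when the net input >= 0
        return 1 if w1 * x1 + w2 * x2 + b >= 0 else 0

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x1, x2,
              gate(x1, x2, 1, 1, -1.5),   # AND: 1 only when both inputs are 1
              gate(x1, x2, 1, 1, -0.5))   # OR: 1 when at least one input is 1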
AND gate
AND - Another representation
OR gate
NOT gate
NAND gate
XOR gate
XNOR
A single perceptron can learn any
function, as long as the instances
in the dataset are linearly
separable, like AND, OR, NAND,
and NOR!
Key Terms
▪Input Nodes (input layer)
•Just pass the information to the next layer.
•A block of nodes is also called a layer.
▪Hidden nodes (hidden layer)
•Hidden layers are where intermediate processing or computation is done;
•they perform computations and then transfer the weights (signals or information) from the input layer to the following layers.
•It is also possible to have a neural network without a hidden layer.
▪Output Nodes (output layer)
•Here we finally use an activation function that maps to the desired output format (e.g. softmax for classification).
Key Terms
▪Connections and weights
•The network consists of connections, each connection
transferring the output of a neuron i to the input of a
neuron j.
•In this sense, i is the predecessor of j and j is the successor of i. Each connection is assigned a weight Wij.
▪Activation function
•The activation function of a node defines the output of that node given an input or set of inputs.
•A standard computer chip circuit can be seen as a digital network of activation functions that can be "ON" (1) or "OFF" (0), depending on input.
•In artificial neural networks this function is also called the transfer function.
Key Terms
▪Learning rule
•The learning rule is a rule or an algorithm which modifies
the parameters of the neural network, in order for a given
input to the network to produce a favored output.
•This learning process typically amounts to modifying the
weights and thresholds.
The Perceptron Training Rule
The Case of Binary Classification
▪Below you can see the architecture of an actual perceptron, the simplest type of artificial neural network:
▪Imagine we have a binary classification problem at hand, and
we want to use a perceptron to learn this task.
▪Moreover, a perceptron can produce 2 values: +1 / -1, where +1 means that the input example belongs to the + class, and -1 means the input example belongs to the - class.
▪So, in summary:
▪What are we learning? The weights of the perceptron
▪We would want the perceptron to produce +1 for one class
and -1 for the other.
▪After observing the output for a given training example, we
will NOT modify the weights unless the produced output
was wrong!
▪For example, if we fed an instance of the – class and the
perceptron returned +1, then it means that we need to
modify the parameters of our network, i.e., the weights.
▪We will repeat this process, iterating through the training set as many times as necessary, until the perceptron classifies all the training examples correctly.
▪How do we update the weights? Is there some sort of
mathematical rule?
▪Yes, and the rule is called the perceptron training rule!
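The rule itself, applied to every weight wi of the perceptron, has the standard form:

    Δwi = η (t − o) xi,  and then  wi = wi + Δwi

where o is the output the perceptron actually produced; the remaining variables are described below.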
▪The variables in the rule are described as follows:
▪Δwi : This is how much we should change the value of the weight.
▪In other words, this is the amount that is added to the old
value of wi to update it. This can be positive or negative,
meaning we might increase or decrease wi
▪η : This is the learning rate, or the step size.
▪We tend to choose a small value for this: if it is too big we will never converge, and if it is too small, we will take forever to converge to the correct weight vector and have a decent classifier.
▪This step size simply moderates the weight updates, so that an update does not make an aggressive change to the old values of the weights.
▪t: This is the ground truth label that we have for every
training example in our training set.
Perceptron Training Rule
▪The training strategy of the perceptron is nicely
demonstrated here:
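As a complementary illustration, here is a minimal sketch of this training loop in Python (the dataset is hypothetical, chosen to be linearly separable):

    import numpy as np

    X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 1.0]])
    t = np.array([1, 1, -1, -1])        # ground-truth labels (+1 / -1)
    w, b, eta = np.zeros(2), 0.0, 0.1   # weights, bias, learning rate

    for _ in range(20):                 # keep iterating through the training set
        errors = 0
        for x, target in zip(X, t):
            o = 1 if np.dot(w, x) + b >= 0 else -1  # perceptron output
            if o != target:                         # modify weights ONLY when wrong
                w += eta * (target - o) * x         # perceptron training rule
                b += eta * (target - o)
                errors += 1
        if errors == 0:                 # all training examples classified correctly
            break

    print(w, b)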
The Intuition Behind the Perceptron Training Rule
Illustration – Perceptron learning
Illustration – Perceptron learning
Illustration – Perceptron learning
▪We do only ONE epoch/iteration
▪Consider training example-1
Illustration – Perceptron learning
▪Consider training example-2
Illustration – Perceptron learning
▪Consider training example-3
Limitation
▪Perceptron rule fails if data is not linearly separable
Delta Rule
▪The Delta Rule employs
•the error function for what is known as Gradient Descent
learning,
•which involves the ‘modification of weights along the most
direct path in weight-space to minimize error’
•so the change applied to a given weight is proportional to the negative of the derivative of the error with respect to that weight.
▪So consider the simple neural network below.
▪Let's say we would like to build a binary classifier that generates a value close to 1 if our data x belongs to the positive class, and 0 if it belongs to the negative class.
▪So, we will have to learn a set of weights in our weight space that forces this neural network to generate our desired outputs, given our input.
▪So, you can say that the weight space is indeed your
hypothesis space, where you have loads and loads of possible
values for these weights.
▪However, only a subset of these weights can turn your neural
network into a successful model with a high degree of
performance on your data. So, the question is how should we
change the weights, w0 and w1 , so that our error, E would
become suitably small.
▪The Delta Rule is an interesting mechanism for searching the
hypothesis space(here the weight space).
▪Actually, the Delta Rule uses one of the most popular, if not the most popular, search techniques in the hypothesis space, which is called Gradient Descent.
▪So we have to deal with two things:
For our Delta Rule to work, we need a measure to quantify the performance of our network, i.e., how far away the outputs are from the ground truth. This measure will be our error function.
It tells us, for every input and every choice of weight vector (from the hypothesis space), how far our output is from the ground truth.
One common error function that can be used here is the Sum of Squared Errors (SSE):
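Written out, with td the ground truth and od the generated output for training example d:

    E(w) = ½ Σd∈D (td − od)²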
▪This error function measures the difference between the generated output and the ground truth for every example d across the whole training set D.
▪Note that the error is a function of our weight vector.
▪The Delta Rule searches for these weight vectors and uses them to generate the output for a given training example.
▪Then, by measuring the error, it updates the previously chosen weights to new values in a way that the output for the subsequent training examples gets closer to the ground truth.
▪Example of Gradient Descent
▪Let's say you are playing a game where the players are at the top of a mountain and are asked to reach its lowest point. Additionally, they are blindfolded. So, what approach do you think would get you to the lowest point?
▪The best way is to observe the ground and find where the land descends.
▪From that position, take a step in the descending direction and iterate this process until we reach the lowest point.
Delta Rule
▪So, the longer the training goes on, the better the weights the Delta Rule finds, and the error for those weights becomes smaller and smaller. Eventually the network converges, and we say that the model has been trained.
▪When we draw the function in 2-dimensional space, this is what the local minima and global minima look like:
▪When we draw the function in 3-dimensional space, this is what the local minima and global minima look like:
▪Given the way we have defined the error function, for linear units the error surface is always parabolic, with a single global minimum.
▪ Gradient descent is an iterative optimization algorithm for finding
the local minimum of a function.
▪ To find the local minimum of a function using gradient descent,
we must take steps proportional to the negative of the gradient
(move away from the gradient) of the function at the current
point.
▪ If we take steps proportional to the positive of the gradient
(moving towards the gradient), we will approach a local maximum
of the function, and the procedure is called Gradient Ascent.
▪Gradient descent is an iterative optimization algorithm for
finding the minimum of a function; in our case we want to
minimize the error function.
▪The goal of the gradient descent algorithm is to minimize
the given function (say cost function). To achieve this goal, it
performs two steps iteratively:
1.Compute the gradient (slope), the first order derivative of the
function at that point
2.Make a step (move) in the direction opposite to the
gradient, opposite direction of slope increase from the
current point by alpha times the gradient at that point
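In symbols, one iteration therefore updates every weight against the gradient of the error function E defined earlier:

    wi = wi − η ∂E/∂wi

where the learning rate η (called alpha above) moderates the step size.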
▪The result is an optimal set of weights for the problem at hand.
▪Learning rate
Derivation of Gradient Descent
where xid denotes the single input component xi for training example d. Therefore, the weight update rule for gradient descent is:
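    Δwi = η Σd∈D (td − od) xid

This is the batch form of the rule: the contributions of all training examples are summed before the weights are changed.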
▪For a given learning rate η, the computed Δwi tells us how
much we need to change, wi in order to minimize the total
error, E.
▪Finally, we will add this to the current value of our weight in
order to update that weight, using the learning rule: wi = wi +
Δwi
Gradient Descent Algorithm
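A minimal sketch of this algorithm in Python, for a linear unit o = w · x, using the SSE error and the batch update rule derived above (the training data are hypothetical, with x0 = 1 supplying the bias input):

    import numpy as np

    def gradient_descent(X, t, eta=0.05, epochs=100):
        # X: one row per training example, first column all 1s (bias unit x0)
        # t: ground-truth targets; eta: learning rate
        w = np.zeros(X.shape[1])        # initialize weights
        for _ in range(epochs):
            o = X @ w                   # forward pass over the WHOLE training set
            w += eta * X.T @ (t - o)    # dwi = eta * sum_d (td - od) * xid
        return w

    X = np.array([[1.0, 0.5], [1.0, 2.0]])  # two 1-D examples plus bias input
    t = np.array([1.0, 0.0])
    print(gradient_descent(X, t))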
▪Consider a training set of 2 training examples, x1 and x2. Let x0 = 1. The neural network is the same as the one depicted in Fig. 1.
▪In this neural network, please note that x0 is called the bias unit and it is always equal to 1. As a result, the only input through which we will be able to feed our training data into the neural network is x.
▪This means that our training data are 1-dimensional.
▪Moreover, we will have to initialize our 2 weights randomly as
well and define a learning rate.
▪Finally, in our training set, for every training example we will also have a ground truth, so that we can actually compute the error and update the weights in the network.
▪So, all of these are defined as follows:
▪ Figure 4: The forward-pass for Computing the Total Error Across
the Training Set
▪So, we have computed the individual errors, summed them,
and divided them by 2, and computed the total error.
▪Now for each of the 2 weights in our network, we will have
to compute the gradient of the error according to our
derivative rule that we have derived, and then we will
multiply the gradient by the learning rate to learn the amount
we will have to change the current value of our weights.
▪This is called the Backward phase where we back-propagate
the gradients from the output side towards the input side of
the network, in order to learn the degree by which we will
have to increase/decrease every single weight in our neural
network.
▪Finally, we will add this value to the old value of our weights, to compute their new values. This is called LEARNING! We are learning the weights based on the errors that we make for every training example!
▪Stochastic gradient descent (SGD) processes one training example at a time: within each epoch, it updates the parameters after every single example rather than after the whole dataset.
▪Let's say we have 5 observations, each with three features, and the values taken here are completely random.
▪Now, if we use SGD, we will take the first observation, pass it through the neural network, calculate the error, and then update the parameters.
▪Then we will take the second observation and perform similar steps with it. This is repeated until all observations have been passed through the network and the parameters have been updated.
▪Each time the parameter is updated, it is known as an
Iteration. Here since we have 5 observations, the parameters
will be updated 5 times or we can say that there will be 5
iterations.
▪Had this been (batch) Gradient Descent, we would have passed all the observations together and the parameters would have been updated only once. In the case of SGD, there will be 'm' iterations per epoch, where 'm' is the number of observations in the dataset.
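A minimal sketch of one SGD epoch in Python, mirroring the walkthrough above (the 5 observations with three features are random, as in the text, and a linear unit is assumed):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((5, 3))      # 5 observations, 3 features each
    t = rng.random(5)           # targets
    w, eta = np.zeros(3), 0.1

    # One epoch of SGD: 5 observations -> 5 parameter updates (iterations)
    for x_d, t_d in zip(X, t):
        o_d = x_d @ w                   # forward pass for ONE observation
        w += eta * (t_d - o_d) * x_d    # update immediately after each example

    # Batch gradient descent would instead update once per epoch:
    # w += eta * X.T @ (t - X @ w)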
Mini-batch Gradient Descent
▪Another type of Gradient Descent is Mini-batch Gradient Descent.
▪Again, let's take the same example. Assume that the batch size is 2. So we'll take the first two observations, pass them through the neural network, calculate the error, and then update the parameters.
▪Then we will take the next two observations and perform similar steps, i.e., we will pass them through the network, calculate the error, and update the parameters.
▪Since we are left with a single observation for the final iteration, the last batch will contain only that one observation, and we will update the parameters using it.
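Continuing the sketch from the SGD example (same hypothetical data and linear unit), the mini-batch loop with a batch size of 2 looks like this:

    import numpy as np

    rng = np.random.default_rng(0)
    X, t = rng.random((5, 3)), rng.random(5)
    w, eta, batch_size = np.zeros(3), 0.1, 2

    for start in range(0, len(X), batch_size):
        xb, tb = X[start:start + batch_size], t[start:start + batch_size]
        w += eta * xb.T @ (tb - xb @ w)   # one update per mini-batch
    # With 5 observations and batch size 2: two batches of 2, then a final
    # batch holding the single leftover observation -> 3 updates per epoch.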
▪Comparison between Stochastic Gradient Descent
and Gradient Descent
1.Speed: SGD is faster than gradient descent because it
uses only one training example per iteration, whereas
gradient descent requires the entire dataset. This speed
advantage becomes more significant as the size of the
dataset increases.
2.Convergence: Gradient descent has a more
predictable convergence behaviour because it uses the
average gradient of the entire dataset. In contrast,
SGD’s convergence behaviour can be more erratic
due to its random sampling of individual training
examples.
3.Memory: Gradient descent requires more memory
than SGD because it must store the entire dataset for
each iteration. SGD only needs to store the current
training example, making it more memory-efficient.
▪Convergence rates
•Stochastic Gradient Descent: Faster convergence rate due to
the use of single training examples in each iteration.
•Gradient Descent: Slower convergence rate, as it uses the
entire dataset for each iteration.
▪Accuracy
•Stochastic Gradient Descent: Less accurate than gradient
descent, as it calculates the gradient on single examples,
which may not accurately represent the overall dataset.
•Gradient Descent: More accurate because it uses the average
gradient calculated over the entire dataset.
Perceptron vs Delta rule
▪Mainly there are two differences between the perceptron and
the delta rule.
Perceptron vs Delta rule
MULTILAYER NETWORKS AND THE
BACKPROPAGATION ALGORITHM
▪A perceptron can only learn a linear decision boundary in a
classification task, or fit a line to a bunch of data points in a
regression task.
▪In contrast to a simple perceptron, a multilayer neural network is capable of learning highly sophisticated (non-linear) decision boundaries for the classification task. It can also fit highly non-linear curves for the regression task. This is the same flexibility that we have with deep learning.
A Differentiable Threshold Unit
▪Each neuron in an ANN has an Activation Function, which we
denote with f().
▪We are interested in ANNs that can represent highly
sophisticated non-linear functions, which is the type of
problem we face in the real world!
▪So, non-linearity is a desirable feature of an activation function.
▪There are two conditions for choosing the activation function f():
1.We want f() to be non-linear. This makes the entire ANN a
collection of nested non-linear functions, capable of
representing some scary non-linear function.
2.We want f() to be continuous, and differentiable with
respect to its input. This makes the entire ANN trainable using
gradient descent.
Sigmoid function
▪The sigmoid function is a function that satisfies both of our
requirements, that is, both non-linearity and differentiablity.
▪A sigmoid threshold unit, is actually really similar to a
perceptron, but the difference is that it is based on a
smoothed and differentiable function, as opposed to the step
function that is used in a perceptron. Below, is a visual
representation of a sigmoid unit:
198
199
Multilayer Networks (ANN)
▪Gradient Descent vs. Backpropagation: What’s the
Difference?
▪In a feedforward neural network, the input moves forward
from the input layer to the output layer. Backpropagation
helps improve the neural network’s output. It does this by
propagating the error backward from the output layer to the
input layer.
▪To know in which direction to navigate, gradient descent uses
backpropagation.
▪More specifically, it uses the gradients calculated through
backpropagation. These gradients are used for determining
the direction to navigate to find the minimum point.
Specifically, we aim to find the negative gradient.
▪This is because a negative gradient indicates a decreasing
slope. A decreasing slope means that moving downward will
lead us to the minimum point. For example:
▪ As you can see, the output of sigmoid ranges between 0 and
1, and it increases monotonically with respect to its input.
▪Since sigmoid can map a large input domain into the small range [0,1], it is commonly referred to as the squashing function.
▪The derivative of the sigmoid function with respect to its input can be computed from its output:
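    σ′(x) = σ(x) (1 − σ(x))

This identity is what makes sigmoid units convenient for backpropagation: once the forward pass has produced σ(x), its derivative comes almost for free.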
The BACKPROPAGATION Algorithm
Because we are considering networks with multiple output units rather than single units as before, we begin by redefining E to sum the errors over all of the network output units:
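    E(w) = ½ Σd∈D Σk∈outputs (tkd − okd)²

where outputs is the set of output units in the network, and tkd and okd are the target and output values associated with the k-th output unit and training example d.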
▪This algorithm applies to feedforward networks
•containing two layers of sigmoid units,
•with units at each layer connected to all units
from the preceding layer.
▪This is the incremental, or stochastic, gradient descent version of Backpropagation.
▪The notation used here is the same as that used in earlier sections, with the following extensions:
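In Tom Mitchell's standard presentation, which these slides follow, the extensions and the resulting per-example update rules are:

    xji = the i-th input to unit j, and wji = the weight on that input
    δn  = the error term associated with unit n

    1. Propagate the training example forward and compute the output ou of every unit u.
    2. For each output unit k:  δk = ok (1 − ok) (tk − ok)
    3. For each hidden unit h:  δh = oh (1 − oh) Σk∈outputs wkh δk
    4. Update each weight:      wji = wji + η δj xji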
Termination Conditions
▪ Gradient Descent is widely used for optimization in machine
learning and deep learning.
3. Convexity (Preferable): Gradient Descent works best with
convex functions where a global minimum exists, as it
ensures convergence to the optimum. In non-convex
functions, it might get stuck in local minima.
4. Learning Rate: A proper learning rate must be defined. Too
high can lead to overshooting, while too low can make
convergence very slow.
5. Feature Scaling: Features in the dataset should be
normalized or standardized for smoother and faster
convergence.
Difficulties in Applying Gradient Descent
While Gradient Descent is powerful, there are several challenges:
1. Choice of Learning Rate:
• Selecting an optimal learning rate can be tricky.
• A poor choice can lead to divergence (too high) or slow convergence (too low).
2. Feature Scaling Dependency:
• If features are not scaled properly, the algorithm might take longer to converge.
3. Local Minima and Saddle Points: In non-convex optimization problems, it can get stuck in local minima or saddle points where gradients are zero but the point is not optimal.
4. Computational Cost: For very large datasets, computing the gradient over the entire dataset can be computationally expensive and time-consuming.
5. Exploding/Vanishing Gradients: In deep learning, gradients can either become excessively large (exploding) or diminish to near zero (vanishing), hampering learning.
Learning Algorithm: Backpropagation
The following slides describe the training process of a multi-layer neural network employing the backpropagation algorithm. To illustrate this process, a three-layer neural network with two inputs and one output, shown in the picture below, is used:
Learning Algorithm: Backpropagation
Each neuron is composed of two units. The first unit adds the products of the weight coefficients and the input signals. The second unit realises a nonlinear function, called the neuron transfer (activation) function. Signal e is the adder output signal, and y = f(e) is the output signal of the nonlinear element. Signal y is also the output signal of the neuron.
Learning Algorithm: Backpropagation
• To train the neural network we need a training data set. The training data set consists of input signals (x1 and x2) assigned with the corresponding target (desired output) z.
Illustration
▪We will build a neural network with three layers:
•Input layer with two inputs neurons
•One hidden layer with two neurons
•Output layer with a single neuron
Dataset
▪Our initial weights will be as follows: w1 = 0.11, w2 = 0.21, w3 = 0.12, w4 = 0.08, w5 = 0.14, and w6 = 0.15.
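As a minimal numeric sketch of one training step with these weights (the dataset slide above does not fix the values, so the inputs (2, 3), target 1, linear activations, and learning rate 0.05 below are assumptions chosen to keep the arithmetic simple):

    # Network: 2 inputs -> 2 hidden neurons (w1..w4) -> 1 output neuron (w5, w6)
    i1, i2, target = 2.0, 3.0, 1.0      # assumed training example
    w1, w2, w3, w4, w5, w6 = 0.11, 0.21, 0.12, 0.08, 0.14, 0.15
    eta = 0.05

    # Forward pass
    h1 = i1 * w1 + i2 * w2              # hidden neuron 1
    h2 = i1 * w3 + i2 * w4              # hidden neuron 2
    out = h1 * w5 + h2 * w6             # output neuron
    error = 0.5 * (target - out) ** 2
    print(out, error)                   # prediction and its squared error

    # Backward pass: compute all gradients first, then update together
    delta = out - target                # dE/d(out)
    g5, g6 = delta * h1, delta * h2     # gradients for hidden->output weights
    g1, g2 = delta * w5 * i1, delta * w5 * i2   # chain rule back through h1
    g3, g4 = delta * w6 * i1, delta * w6 * i2   # chain rule back through h2
    w5, w6 = w5 - eta * g5, w6 - eta * g6
    w1, w2, w3, w4 = w1 - eta * g1, w2 - eta * g2, w3 - eta * g3, w4 - eta * g4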
Forward Pass
Calculating Error
Reducing Error
Backpropagation (of errors)
▪Now, using the new weights we will repeat the forward pass
through another training example
▪Some of the main problems associated with the
backpropagation learning algorithm:
1.Vanishing and Exploding Gradients:
In deep neural networks, especially those with many layers,
the gradients of the loss function with respect to the
weights may become very small (vanishing gradients) or
very large (exploding gradients). This can make it difficult to
update the weights effectively and lead to slow
convergence or divergence during training.
2.Local Minima and Saddle Points:
Backpropagation is prone to getting stuck in local minima
or saddle points in the optimization landscape. While it is
often able to escape shallow local minima, it can struggle
with deeper and more complex optimization landscapes.
3. Sensitivity to Initial Conditions:
The performance of backpropagation can be sensitive to
the initial values of the weights. Choosing appropriate
initial conditions is crucial for achieving good convergence
and avoiding poor local minima.
4.Overfitting:
Backpropagation may lead to overfitting, especially when
dealing with small datasets or when the model is too
complex. Overfitting occurs when the model learns the
training data too well, including its noise and outliers, and
fails to generalize to new, unseen data.
5. Requires Large Amounts of Labeled Data:
Backpropagation relies on supervised learning and
requires a large amount of labeled training data to
effectively learn the underlying patterns. In some
domains, obtaining labeled data can be expensive or
impractical.
6.Computational Intensity:
Training deep neural networks using backpropagation
can be computationally intensive, especially for
large models and datasets. This can lead to long
training times and may require significant
computational resources.
7. Hyperparameter Sensitivity:
The performance of the backpropagation algorithm is
sensitive to the choice of hyperparameters, such
as learning rate, batch size, and regularization terms.
Finding the right set of hyperparameters can be a
challenging and time-consuming task.
8. Lack of Interpretability:
Neural networks trained with backpropagation are
often considered as black-box models, lacking
interpretability. Understanding the inner workings of
a neural network and explaining its decisions can be
challenging.
▪Here are several strategies to speed up the learning process in multi-layer neural networks:
1. Batch Normalization:
Batch normalization normalizes the inputs of each layer during training. It helps mitigate issues like vanishing or exploding gradients and allows for more stable and faster convergence.
2. Weight Initialization:
Proper weight initialization is crucial for faster convergence. Techniques such as He initialization or Xavier/Glorot initialization can help set the initial weights to values that promote efficient learning.
3. Use of Different Activation Functions:
Choosing appropriate activation functions can impact the
learning speed. Rectified Linear Unit (ReLU) is a popular
choice due to its simplicity and ability to mitigate the
vanishing gradient problem.
4. Learning Rate Scheduling:
Adaptive learning rate strategies, like learning rate
annealing or scheduling, can be employed to adjust the
learning rate during training. Techniques such as Cyclical
Learning Rates or using learning rate schedules like Step
Decay or Exponential Decay can help balance faster
convergence without overshooting.
5. Optimization Algorithms:
Advanced optimization algorithms, such as Adam,
RMSprop, or AdaGrad, can be more efficient than
traditional stochastic gradient descent (SGD). They adapt
the learning rates for each parameter, potentially leading to
faster convergence.
6. Data Augmentation:
Increasing the effective size of the training dataset through
data augmentation can help the model generalize better
and converge faster. Techniques like rotation, scaling, and
flipping can be applied to artificially increase the diversity
of the training set.
7. Dropout Regularization:
Dropout is a regularization technique that randomly drops
a subset of neurons during training. It helps prevent
overfitting and encourages the network to learn more
robust and general features, potentially accelerating
convergence.
8. Early Stopping:
Monitor the model's performance on a validation set
during training, and stop training when the performance
starts to degrade. This prevents overfitting and can lead to
faster convergence.
9. Parallelization and GPU Acceleration:
Training deep neural networks can be computationally
intensive. Utilizing parallel computing resources, such as
multiple GPUs, can significantly speed up the training
process.
10.Transfer Learning:
Transfer learning involves using pre-trained models on a
related task to initialize the weights of your network. This
can provide a good starting point, especially when you have
limited labeled data for your specific task.