ML-MODULE-4 - Part 2

Module 4 covers learning with Support Vector Machines (SVM) and Artificial Neural Networks (ANN), including their structures, algorithms, and applications. It explains the principles of AI, machine learning, and deep learning, highlighting the differences between these concepts and the role of neural networks. The module also discusses the perceptron model, activation functions, and the importance of weights and biases in neural networks.

MODULE 4

Learning with SVM, ANN (8 hours)

1
Module-4 Learning with SVM, ANN
▪Learning with Support Vector Machines:
•Introduction,
•Linear Discriminant Functions for Binary Classification,
•Linear Maximal Margin Classifier.
•Linear Soft-Margin Classifier
▪Artificial Neural Network
•Introduction,
•NN Representation, Appropriate Problems,
•Perceptron,
•Back Propagation Algorithm,
•Remarks on Backpropagation Algorithm

2
3
Learning with Neural Networks:

▪What is artificial intelligence (AI)?


▪Artificial intelligence, commonly referred to as AI, is the
process of imparting data, information, and human
intelligence to machines.
▪The main goal of Artificial Intelligence is to develop
self-reliant machines that can think and act like humans.
▪These machines can mimic human behavior and perform
tasks by learning and problem-solving.
▪AI uses predictions and automation to optimize and solve
complex tasks that humans have historically done, such as
facial and speech recognition, decision making and
translation.

4
▪What is Machine Learning?
▪Machine learning is a subset of artificial intelligence that
allows for optimization. When set up correctly, it helps you
make predictions that minimize the errors that arise from
merely guessing.
▪Machine learning accesses vast amounts of data (both
structured and unstructured) and learns from it to predict the
future.
▪It learns from the data by using multiple algorithms and
techniques.

5
▪What is Deep Learning?
▪Deep learning is a subset of machine learning that deals with
algorithms inspired by the structure and function of the
human brain.
▪Deep learning algorithms can work with an enormous amount
of both structured and unstructured data.
▪Deep learning’s core concept lies in artificial neural
networks, which enable machines to make decisions.

6
▪The primary difference between machine learning and deep
learning is how each algorithm learns and how much data
each type of algorithm uses.

▪Deep learning automates much of the feature extraction piece of the process, eliminating some of the manual human intervention required.
▪It also enables the use of large data sets, earning the title of
scalable machine learning.
▪That capability is exciting as we explore the use of
unstructured data further, particularly since over 80% of an
organization’s data is estimated to be unstructured.

7
8
9
▪What is a neural network?
▪Neural networks, also called artificial neural networks
(ANNs) or simulated neural networks (SNNs), are a subset of
machine learning and are the backbone of deep learning
algorithms.
▪They are called “neural” because they mimic how neurons in
the brain signal one another.

10
▪Neural networks are made up of node layers – an input
layer, one or more hidden layers, and an output layer.
▪Each node is an artificial neuron that connects to the next,
and each has a weight and threshold value.
▪When one node’s output is above the threshold value, that
node is activated and sends its data to the network’s next
layer. If it’s below the threshold, no data passes along.
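To make this concrete, here is a minimal Python sketch of one such node; the inputs, weights, and threshold value are illustrative assumptions, not values from the slides:

import numpy as np

def node_output(inputs, weights, threshold):
    # Weighted sum of the incoming signals (no bias term in this simple sketch).
    total = np.dot(inputs, weights)
    # The node "fires" and passes its data onward only above the threshold.
    return total if total > threshold else 0.0

# Illustrative values: three inputs, three weights, threshold 1.0.
print(node_output(np.array([0.5, 0.9, 0.2]), np.array([0.8, 1.1, 0.3]), 1.0))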

11
▪What’s the difference between deep learning and neural
networks?
▪As mentioned in the explanation of neural networks above,
but worth noting more explicitly, the “deep” in deep learning
refers to the depth of layers in a neural network.
▪ A neural network of more than three layers, including the
inputs and the output, can be considered a deep-learning
algorithm.
▪That can be represented by the following diagram:

12
13
▪Most deep neural networks are feed-forward, meaning data flows in only one direction, from input to output.
▪However, you can also train your model through
back-propagation, meaning moving in the opposite direction,
from output to input.
▪Back-propagation allows us to calculate and attribute the
error associated with each neuron, allowing us to adjust and
fit the algorithm appropriately.

14
▪Why Deep Learning ?
▪In the case of machine learning, increasing the volume of training data eventually stops improving performance, i.e., it reaches a saturation point.
▪In the case of deep learning, performance keeps improving as the volume of training data increases.

15
16
▪While neural networks (early versions of which were called "perceptrons") have been around since the 1940s, it is only in the last several decades that they have become a major part of artificial intelligence.
▪This is due to the arrival of a technique called "backpropagation," which allows networks to adjust their hidden layers of neurons in situations where the outcome doesn't match what the creator is hoping for: for example, a network designed to recognize dogs that misidentifies a cat.

▪Another important advance has been the arrival of deep learning neural networks, in which different layers of a multilayer network extract different features until it can recognize what it is looking for.
17
Artificial Neural network(ANN)
▪The study of artificial neural networks (ANN) has been inspired by the observation that biological learning systems are built of very complex, nonlinear, parallel interconnections of neurons.
▪The human body is made up of trillions of cells.
▪Cells of the nervous system, called nerve cells or neurons, are
specialized to carry "messages" through an electrochemical
process.

18
Biological Motivation
▪The basic computational unit of the brain is a neuron.

19
20
▪Neurons have specialized cell parts called dendrites and axons.
▪Dendrites bring electrical signals to the cell body, and axons take information away from the cell body.

21
▪The typical nerve cell of the human brain comprises four
parts:
▪1. Dendrite - It receives signals from other neurons.
▪2. Soma (cell body) - It sums all the incoming signals to generate input.
▪3. Axon - When the sum reaches a threshold value, the neuron fires and the signal travels down the axon to the other neurons.
▪4. Synapse - The point of interconnection of one neuron with other neurons. The amount of signal transmitted depends upon the strength (synaptic weights) of the connections.

22
▪A synapse serves as a junction where both the transmission and processing of information occur, making it a vital means of communication between neurons.
▪A synapse is able to increase or decrease the strength of the connection from neuron to neuron and cause excitation or inhibition of a subsequent neuron. This is where information is stored.
▪ANNs are only loosely motivated by biological neural systems; there are many complexities in biological neural systems that are not modeled by ANNs.

23
24
Biological Motivation

25
Basic Structure of ANNs

▪The development of ANNs is closely associated with Geoffrey Hinton.


▪ANNs are composed of multiple nodes, which imitate
biological neurons of the human brain.
▪The neurons are connected by links and they interact with
each other.
▪The nodes can take input data and perform simple operations
on the data.
▪The result of these operations is passed to other neurons.
▪The output at each node is called its activation or node
value.
▪Each link is associated with a weight. ANNs are capable of learning, which takes place by altering weight values.

26
Neuron

27
▪Each ANN is composed of a collection of perceptrons grouped in layers.
▪A typical structure is shown in below figure. Note the three
layers: input, intermediate (called the hidden layer) and
output.
▪Several hidden layers can be placed between the input and
output layers.

28
▪The following illustration shows a simple ANN –

Input Layer => Accepts all inputs that are given by the user.
Hidden Layer => Performs all the calculations to find hidden features and patterns.
Output Layer => Predicts the output based on the input and the weighted calculation at the hidden neurons.

29
30
Example: Autonomous Driving

31
Example: Autonomous Driving

32
Example: Autonomous Driving

33
Example: Autonomous Driving

34
Example: Autonomous Driving

35
Autonomous Driving

36
37
Example 2: Bank Credit Score

38
Example 2: Bank Credit Score
▪To make things clearer, let's understand ANN using a simple example: A bank wants to assess whether to approve a loan application from a customer, so it wants to predict whether the customer is likely to default on the loan. It has data like below:

39
Example 2: Bank Credit Score

40
Example 2: Bank Credit Score

41
Example 2: Bank Credit Score

42
Some Applications
▪Autonomous Driving
▪Speech Phenome Recognition
▪Image Classification
▪Financial Prediction

43
Properties of NNs
▪Many neuron-like units: Neural networks consist of multiple
interconnected processing units (neurons), making them capable
of complex pattern recognition.
▪Weighted interconnections: Each connection between neurons
has a weight that determines its influence, and backpropagation
helps adjust these weights to minimize errors.
▪Highly parallel, distributed processing: Neural networks can
process data in parallel, making them efficient for large-scale
computations.
▪Automatic weight tuning: Backpropagation optimizes these
weights through iterative learning using gradient descent
▪Handling high-dimensional inputs: Neural networks excel at
processing high-dimensional data, whether discrete (text,
categorical data) or real-valued (image pixels, sensor readings).
44
Appropriate problems for ANN

45
Appropriate problems for ANN
The backpropagation algorithm is well-suited for problems with the following characteristics:

▪Instances with Many Attribute-Value Pairs: Neural networks can model complex relationships, making them ideal for datasets with multiple features (e.g., image pixels, medical records, financial indicators).
▪Flexible Target Function Output: Whether classifying objects
(discrete values) or predicting continuous outcomes (real
values), backpropagation effectively learns patterns.
▪Tolerance to Noisy Data: Due to weighted connections and
non-linear activations, neural networks can generalize well
even when training examples contain errors.

46
▪Long Training Times Are Acceptable: Deep networks require
significant computational resources, but advancements like
GPU acceleration make training manageable.
▪Fast Evaluation for Predictions: Once trained, neural
networks execute forward propagation quickly, making them
ideal for real-time applications like speech recognition or
autonomous driving.
▪Low Emphasis on Human Interpretability: Unlike decision
trees or rule-based models, neural networks function as black
boxes, focusing on performance rather than explainability.

47
Perceptron
▪The perceptron was introduced by Frank Rosenblatt in 1958,
as a type of artificial neural network capable of learning and
performing binary classification tasks.
▪We consider it a single-layer neural network with four main parameters: input values, weights and bias, net sum, and an activation function.
▪A perceptron is a single-layer neural network; a multi-layer perceptron is called a neural network.

48
▪Each perceptron comprises five different parts:
1.Input Values: A set of values or a dataset for predicting the output value. They are also described as the dataset's features.
2.Weights: The real value associated with each feature is known as its weight. It tells the importance of that feature in predicting the final value.
3.Bias: The activation function is shifted towards the left or right using bias. You may understand it simply as the y-intercept in the line equation.
4.Summation Function: The summation function binds the weights and inputs together. It is a function to find their sum.
5.Activation Function: It introduces non-linearity in the perceptron model.

49
50
51
Neuron vs Perceptron
•Neuron: In the context of artificial neural networks, a neuron
is a conceptual unit that takes multiple inputs, performs a
weighted sum of these inputs, and applies an activation
function to produce an output. Neurons are inspired by the
biological neurons in the human brain.
•Perceptron: A perceptron is a specific type of artificial
neuron. It takes multiple binary inputs, multiplies each input
by a weight, computes the weighted sum, adds a bias term,
and passes the result through an activation function. The
output of a perceptron is typically binary (0 or 1).

52
•Neuron: Neurons in artificial neural networks can have
different types of outputs depending on the activation
function used. The output can be continuous or binary,
depending on the problem and the desired behavior.
•Perceptron: Perceptrons have a binary output (0 or 1) due to the step function traditionally used as the activation function.

53
Feature-by-feature comparison of a neuron and a perceptron:
▪Definition: A neuron is a fundamental processing unit in an artificial neural network. A perceptron is a simple artificial neuron used in early machine learning models for binary classification.
▪Functionality: A neuron takes weighted inputs, applies an activation function, and produces an output. A perceptron computes a weighted sum of inputs and applies a step function to decide the output (0 or 1).
▪Activation Function: A neuron uses various non-linear functions like ReLU, Sigmoid, and Tanh for complex learning. A perceptron uses a threshold-based step function, making it suitable for linearly separable problems.
▪Complexity: Neurons form the basis of deep learning models with multiple layers and neurons. The perceptron is a basic model that cannot handle non-linearly separable problems like XOR.
▪Learning Process: A neuron adjusts weights using backpropagation and gradient descent for efficient learning. A perceptron updates weights using the Perceptron Learning Rule, but lacks deep learning capabilities.
▪Application: Neurons are found in deep neural networks (MLPs, CNNs, RNNs) for tasks like image recognition and NLP. Perceptrons are used in simple classification tasks but have been replaced by advanced models in modern ML.
54
55
A Perceptron

56
Perceptron

57
58
59
60
61
Why do we Need Weight and Bias?
▪Weight and bias are two important aspects of the perceptron
model. These are learnable parameters and as the network
gets trained it adjusts both parameters to achieve the desired
values and the correct output.

62
▪Weights are used to measure the importance of each feature in predicting the output value.
▪Features whose weights are close to zero are said to have lesser significance.
▪They have less importance in the prediction process compared to features whose weights are further from zero, i.e., weights with a larger magnitude.
▪Besides high-weighted features having greater predictive power than low-weighted ones, a weight can also be positive or negative.
▪If the weight of a feature is positive then it has a direct relation with the target value, and if it is negative then it has an inverse relationship with the target value.

63
▪In contrast to weights, which scale how strongly the inputs drive the activation function, bias shifts the point at which the activation function triggers.
▪It acts like an intercept in a linear equation. Simply stated, bias is a constant used to adjust the output and help the model provide the best fit for the given data.

64
▪Weights control the signal (or the strength of the connection)
between two neurons. In other words, a weight decides how
much influence the input will have on the output.
▪Biases, which are constant, are an additional input into the
next layer that will always have the value of 1.
▪ Bias units are not influenced by the previous layer (they do
not have any incoming connections) but they do have
outgoing connections with their own weights.
▪The bias unit guarantees that even when all the inputs are
zeros there will still be an activation in the neuron.

65
Activation Function
▪What is a Neural Network Activation Function?
▪An Activation Function decides whether a neuron should be
activated or not. This means that it will decide whether the
neuron’s input to the network is important or not in the
process of prediction using simpler mathematical operations.

▪The role of the Activation Function is to derive output from a set of input values fed to a node (or a layer).
▪Let’s take a step back and clarify: What exactly is a node?

▪Well, if we compare the neural network to our brain, a node is a replica of a neuron that receives a set of input signals (external stimuli).

66
67
▪Depending on the nature and intensity of these input signals, the brain processes them and decides whether the neuron should be activated ("fired") or not.
▪In deep learning, this is also the role of the Activation Function; that's why it's often referred to as a Transfer Function in Artificial Neural Networks.

▪The primary role of the Activation Function is to transform the summed weighted input from the node into an output value to be fed to the next hidden layer or given as output.

68
69
▪When learning about neural networks, you will come across
two essential terms describing the movement of
information—feedforward and backpropagation.
▪Feedforward Propagation - the flow of information occurs in
the forward direction. The input is used to calculate some
intermediate function in the hidden layer, which is then used
to calculate an output.
▪In the feedforward propagation, the Activation Function is a
mathematical “gate” in between the input feeding the current
neuron and its output going to the next layer.

70
▪Backpropagation - the weights of the network connections
are repeatedly adjusted to minimize the difference between
the actual output vector of the net and the desired output
vector.

▪To put it simply, backpropagation aims to minimize the cost function by adjusting the network's weights and biases.
▪The cost function gradients determine the level of adjustment with respect to parameters like the activation function, weights, bias, etc.

71
Why do Neural Networks Need an
Activation Function?
▪ Why do Neural Networks need it?

▪Well, the purpose of an activation function is to add non-linearity to the neural network.

72
▪Activation functions introduce an additional step at each layer during the forward propagation, but their computation is worth it.
▪Here is why— Let’s suppose we have a neural network
working without the activation functions.

▪In that case, every neuron will only be performing a linear transformation on the inputs using the weights and biases.
▪It’s because it doesn’t matter how many hidden layers we
attach in the neural network; all layers will behave in the
same way because the composition of two linear functions is
a linear function itself.

▪Although the neural network becomes simpler, learning any complex task is impossible, and our model would be just a linear regression model.
73
Some types of Neural Networks Activation Functions
▪Binary Step Function: Binary step function depends on a
threshold value that decides whether a neuron should be
activated or not.

▪The input fed to the activation function is compared to a certain threshold; if the input is greater than it, then the neuron is activated, else it is deactivated, meaning that its output is not passed on to the next hidden layer.

74
75
▪Mathematically it can be represented as:
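With the threshold at zero, this is:

f(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 & \text{for } x \ge 0 \end{cases}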

76
▪Here are some of the limitations of binary step
function:
•It cannot provide multi-value outputs—for example, it
cannot be used for multi-class classification
problems.
•The gradient of the step function is zero, which
causes a hindrance in the backpropagation process.

77
▪Linear Activation Function
▪The linear activation function, also known as "no activation" or the "identity function" (multiplied by 1.0), is where the activation is proportional to the input.
▪The function doesn't do anything to the weighted sum of the
input, it simply spits out the value it was given.

78
79
▪Mathematically it can be represented as:
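The identity form is simply:

f(x) = x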

80
▪However, a linear activation function has two major problems:

•It's not possible to use backpropagation, as the derivative of the function is a constant and has no relation to the input x.
•All layers of the neural network will collapse into
one if a linear activation function is used.
•No matter the number of layers in the neural
network, the last layer will still be a linear function of
the first layer. So, essentially, a linear activation
function turns the neural network into just one
layer.

81
▪Non-linear activation functions solve the following
limitations of linear activation functions:
•They allow backpropagation because now the derivative function would be related to the input, and it's possible to go back and understand which weights in the input neurons can provide a better prediction.
•They allow the stacking of multiple layers of neurons
as the output would now be a non-linear combination
of input passed through multiple layers. Any output
can be represented as a functional computation in a
neural network.

82
▪Sigmoid / Logistic Activation Function

▪This function takes any real value as input and outputs values in the range of 0 to 1.
▪The larger the input (more positive), the closer the output value will be to 1.0, whereas the smaller the input (more negative), the closer the output will be to 0.0, as shown below.

83
84
▪Mathematically it can be represented as:
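The logistic form is:

\sigma(x) = \frac{1}{1 + e^{-x}}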

85
▪Here's why the sigmoid/logistic activation function is one of the most widely used functions:

•It is commonly used for models where we have to predict a probability as the output. Since the probability of anything exists only in the range of 0 to 1, sigmoid is the right choice because of its range.
•The function is differentiable and provides a smooth
gradient, i.e., preventing jumps in output values. This
is represented by an S-shape of the sigmoid
activation function.

86
Step Function - Outputs either 0 or 1 based on a threshold.
Commonly used in perceptrons for binary classification.
Limitation: It lacks a gradient, making it unsuitable for backpropagation-based learning.

87
▪Sign Function - Outputs either -1 or 1, providing symmetry around zero.
Used in some machine learning models where negative outputs are meaningful.
Limitation: Like the step function, it has no gradient, making it incompatible with gradient descent.

88
Popular Activation functions
▪Sigmoid

▪Step

▪Sign

89
90
Architectures

91
Single Layer

92
Multi Layer Perceptrons

93
Recurrent

94
Mesh

95
▪Linearly Separable 2D Data
▪We say a two-dimensional dataset is linearly separable if we
can separate the positive from the negative objects with a
straight line.
▪It doesn’t matter if more than one such line exists. For linear
separability, it’s sufficient to find only one:

96
▪Conversely, no line can separate linearly inseparable 2D data:

97
98
99
100
Representation power

101
Neural Representation of AND, OR, NOT
Logic Gates (Perceptron Algorithm)

102
AND gate / OR gate
▪Identify suitable weights to make the following perceptron function as an AND gate / OR gate

103
▪First, we need to know that the Perceptron algorithm states
that:

▪Prediction: y = 1 if Wx + b > 0, and y = 0 if Wx + b ≤ 0

▪Also, the steps in this method are very similar to how Neural Networks learn, which is as follows:
▪Initialize weight values and bias
▪Forward Propagate
▪Check the error
▪Backpropagate and Adjust weights and bias
▪Repeat for all training examples
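As a quick check of the prediction rule above, here is a minimal sketch of an AND gate, assuming one of many valid weight choices (w1 = w2 = 1 and b = -1.5; these values are illustrative, not from the slides):

def perceptron(x1, x2, w1=1.0, w2=1.0, b=-1.5):
    # Prediction rule: y = 1 if Wx + b > 0, else 0.
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, perceptron(x1, x2))  # Only (1, 1) outputs 1: an AND gate.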
104
AND gate

105
AND - Another representation

106
OR gate

107
NOT gate

108
NAND gate

109
XOR gate

110
XNOR

111
A single perceptron can learn any function whose instances are linearly separable, such as AND, OR, NAND, and NOR!

112
113
114
Key Terms
▪Input Nodes (input layer)
•Just pass the information to the next layer
•A block of nodes is also called a layer.
▪Hidden nodes (hidden layer)
•Hidden layers are where intermediate processing or computation is done;
•they perform computations and then transfer the weights (signals or information) from the input layer to the following layers.
•It is possible to have a neural network without a hidden layer.
▪Output Nodes (output layer)
•Here we finally use an activation function that maps to the
desired output format (e.g. softmax for classification).
115
Key Terms
▪Connections and weights
•The network consists of connections, each connection
transferring the output of a neuron i to the input of a
neuron j.
•In this sense i is the predecessor of j and j is the successor of i. Each connection is assigned a weight Wij.
▪Activation function
•the activation function of a node defines the output of
that node given an input or set of inputs.
•A standard computer chip circuit can be seen as a digital
network of activation functions that can be “ON” (1) or
“OFF” (0), depending on input.
•In artificial neural networks this function is also called the
transfer function.

116
Key Terms
▪Learning rule
•The learning rule is a rule or an algorithm which modifies
the parameters of the neural network, in order for a given
input to the network to produce a favored output.
•This learning process typically amounts to modifying the
weights and thresholds.

117
The Perceptron Training Rule

▪It is important to learn the training process of huge neural networks.
▪However, we need to simplify this by first understanding how
a simple perceptron is trained, and how its weights are
updated!
▪Only then, will we be able to understand the dynamics of
complicated and monstrous neural networks, such as CNNs,
RNNs, MLPs, ANNs, …

118
The Case of Binary Classification
▪Below you can see the architecture of an actual perceptron, as the
simplest type of artificial neural networks:

119
▪Imagine we have a binary classification problem at hand, and
we want to use a perceptron to learn this task.
▪Moreover, a perceptron can produce 2 values: +1 / -1 where
+1 means that the input example belongs to the + class, and
-1 means the input example belongs to the – class.

▪Obviously, as we have 2 classes, we would want to learn the weight vector of our perceptron in such a way that, for every training example (depending on whether it belongs to the + / – class), the perceptron would produce the correct +1 / -1.

120
▪So, in summary:
▪What are we learning? The weights of the perceptron

▪Is any weight acceptable? Absolutely not! We would want to find a weight vector that makes the perceptron produce +1 for the + class, and -1 for the negative class.

121
▪We would want the perceptron to produce +1 for one class
and -1 for the other.
▪After observing the output for a given training example, we
will NOT modify the weights unless the produced output
was wrong!
▪For example, if we fed an instance of the – class and the
perceptron returned +1, then it means that we need to
modify the parameters of our network, i.e., the weights.
▪We will keep this process, and we will keep iterating through
the training set as long as necessary until the perceptron
classifies all the training examples correctly.

122
▪How do we update the weights? Is there some sort of
mathematical rule?
▪Yes, and the rule is called the perceptron training rule!

▪At every step of feeding a training example, when the perceptron fails to produce the correct +1/-1, we revise every weight wi associated with every input xi, according to the following rule:
wi = wi + Δwi
where:
Δwi = η(t – o)xi
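A minimal sketch of this rule in code; the OR-gate data, the ±1 target encoding, and the learning rate are illustrative assumptions:

import numpy as np

# OR-gate training data, with x0 = 1 as a constant bias input; targets are +1/-1.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
t = np.array([-1, 1, 1, 1])

w = np.zeros(3)           # w[0] acts as the bias weight.
eta = 0.1                 # Learning rate (step size).

for epoch in range(50):   # Keep iterating until every example is correct.
    errors = 0
    for x, target in zip(X, t):
        o = 1 if np.dot(w, x) > 0 else -1    # Perceptron output.
        if o != target:
            w += eta * (target - o) * x      # wi = wi + eta * (t - o) * xi
            errors += 1
    if errors == 0:
        break

print(w)   # A weight vector that classifies the OR data correctly.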

123
▪The variables in here are described as follows:
▪ Δwi : This means how much should I change the value of
the weight.
▪In other words, this is the amount that is added to the old
value of wi to update it. This can be positive or negative,
meaning we might increase or decrease wi

124
▪η : This is the learning rate, or the step size.
▪We tend to choose a small value for this: if it is too big we will never converge, and if it is too small, it will take forever to converge to the correct weight vector and have a decent classifier.
▪This step size simply moderates the weight updates so that the updates do not make an aggressive change to the old values of the weights.

125
▪t: This is the ground truth label that we have for every
training example in our training set.

▪o : This is the output of our model, which in this case can be either +1 or -1.

▪xi : This is the ith dimension of our input training example x, which is connected to the weight wi.

126
Perceptron Training Rule

127
▪The training strategy of the perceptron is nicely
demonstrated here:

128
The Intuition Behind the Perceptron Training Rule

▪Suppose our perceptron correctly classified a training example!


▪Then clearly, we know that we will not need to change the
weights of our perceptron!
▪But does our learning rule confirm this as well? If the example
has been classified correctly, then it means that (t – o) is 0 !
▪ Because when an example is classified correctly, the output of
our perceptron is for sure equal to our ground truth, i.e., o = t!

129
130
Illustration – Perceptron learning

131
Illustration – Perceptron learning

132
Illustration – Perceptron learning
▪We do only ONE
epoch/iteration
▪Consider training
example-1

133
Illustration – Perceptron learning
▪Consider
training
example-2

134
Illustration –
Perceptron
learning

▪Consider
training
Example-3

135
Limitation
▪Perceptron rule fails if data is not linearly separable

136
Delta Rule

137
Delta Rule
▪The Delta Rule employs
•the error function for what is known as Gradient Descent
learning,
•which involves the ‘modification of weights along the most
direct path in weight-space to minimize error’
•so change applied to a given weight is proportional to the
negative of the derivative of the error with respect to that
weight
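In symbols, the change applied to weight w_i is:

\Delta w_i = -\eta \, \frac{\partial E}{\partial w_i}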

138
▪So consider the simple neural network down below.
▪Let’s say we would like to build a binary classifier that would
generate a value close to 1, if our data x belongs to the
positive class, and 0, if it belongs to the negative class.
▪So, we will have to learn a set of weights in our weight space,
that would force this neural network to generate our desired
outputs, given our input.

139
140
▪So, you can say that the weight space is indeed your
hypothesis space, where you have loads and loads of possible
values for these weights.
▪However, only a subset of these weights can turn your neural network into a successful model with a high degree of performance on your data. So, the question is how we should change the weights, w0 and w1, so that our error E becomes suitably small.
▪The Delta Rule is an interesting mechanism for searching the
hypothesis space(here the weight space).
▪ Actually, the Delta Rule uses one of the most, if not the most,
popular search technique in the hypothesis space that is
called Gradient Descent.

141
▪So we have to deal with two things:

▪Find out what the error is (i.e., the difference between the predicted and ground-truth label) --- the Sum Square Error (SSE) error function
▪Search for the most suitable values of the weights that minimize the error ---- Gradient Descent.

142
For our Delta Rule to work, we need to have a measure to quantify the performance of our network, i.e., how far the outputs are from the ground truth. This measure will be our error function.

So, for every input and the choice of weight vector (from the
hypothesis space), how far is our output from the ground truth.

One common error function that can be used here, is the Sum
Square Error (SSE) error function:
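For a single linear output unit, the standard form (with t_d the target and o_d the output for training example d) is:

E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2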

143
144
▪This error function measures the difference between the generated output and the ground truth for every example d across the whole training set D.
▪Note that the error is a function of our weight vector.
▪The Delta rule searches for these weight vectors and uses them to generate the output for a given training example.
▪Then, by measuring the error, it updates the previously chosen weights to new values in a way that the output for the subsequent training examples gets closer to the ground truth.

145
▪Example of Gradient Descent
▪Let's say you are playing a game where the players are at the top of a mountain, and they are asked to reach the lake at its lowest point. Additionally, they are blindfolded. So, what approach do you think would make you reach the lake?

▪The best way is to observe the ground and find where the
land descends.
▪ From that position, take a step in the descending direction
and iterate this process until we reach the lowest point.

146
Delta Rule

147
▪So, the longer the training goes on, the better the weights the Delta rule finds, and the error for those weights becomes smaller and smaller. Eventually the network converges, and we say that the model has been trained.

148
▪When you draw the function in 2-dimensional space, this is what the local minima and global minima look like:

149
▪When you draw the function in 3-dimensional space, this is what the local minima and global minima look like:

150
▪Given the way we have defined the error function, for linear units the error surface is always parabolic with a single global minimum.

151
▪ Gradient descent is an iterative optimization algorithm for finding
the local minimum of a function.
▪ To find the local minimum of a function using gradient descent,
we must take steps proportional to the negative of the gradient
(move away from the gradient) of the function at the current
point.
▪ If we take steps proportional to the positive of the gradient
(moving towards the gradient), we will approach a local maximum
of the function, and the procedure is called Gradient Ascent.
152
153
▪Gradient descent is an iterative optimization algorithm for
finding the minimum of a function; in our case we want to
minimize the error function.

▪To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point.

154
▪The goal of the gradient descent algorithm is to minimize
the given function (say cost function). To achieve this goal, it
performs two steps iteratively:
1.Compute the gradient (slope), the first order derivative of the
function at that point
2.Make a step (move) in the direction opposite to the gradient: move from the current point by alpha times the gradient at that point, against the direction of increasing slope
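A minimal sketch of these two steps on a one-dimensional cost function; the cost, starting point, and learning rate are illustrative assumptions:

def cost(w):
    return (w - 3.0) ** 2      # Convex cost with its minimum at w = 3.

def gradient(w):
    return 2.0 * (w - 3.0)     # First-order derivative of the cost.

w = 0.0                        # Arbitrary starting point.
alpha = 0.1                    # Learning rate.
for _ in range(100):
    g = gradient(w)            # Step 1: compute the gradient at the current point.
    w = w - alpha * g          # Step 2: move opposite to the gradient.

print(w, cost(w))              # w approaches 3.0, where the cost is minimal.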

155
156
▪ The results are optimal weights for the problem at hand.
157
158
159
▪Learning rate

160
161
162
163
Derivation of Gradient Descent

164
Derivation of Gradient Descent

Where xid denotes the single input component xi for training example d. Therefore the weight update rule for gradient descent is:
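For a linear unit this takes the standard delta-rule form:

\Delta w_i = \eta \sum_{d \in D} (t_d - o_d) \, x_{id}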

165
166
▪For a given learning rate η, the computed Δwi tells us how
much we need to change, wi in order to minimize the total
error, E.
▪Finally, we will add this to the current value of our weight in
order to update that weight, using the learning rule: wi = wi +
Δwi

167
Gradient Descent Algorithm

168
▪Consider a training set of 2 training examples x1 and x2. Let x0 = 1. The neural network is the same as that depicted in Fig.1.
▪In this neural network, please note that x0 is called the bias unit and it is always equal to 1. As a result, the only input through which we will be able to feed our training data into the neural network is x.

169
▪This means that our training data are 1-dimensional.
▪Moreover, we will have to initialize our 2 weights randomly as
well and define a learning rate.
▪Finally, in our training set, for every training data, we will also
have a ground-truth, so that we could actually compute the
error and update the weights in the network.
▪So, all of these are defined as follows:

170
171
▪ Figure 4: The forward-pass for Computing the Total Error Across
the Training Set
172
▪So, we have computed the individual errors, summed them,
and divided them by 2, and computed the total error.
▪Now for each of the 2 weights in our network, we will have
to compute the gradient of the error according to our
derivative rule that we have derived, and then we will
multiply the gradient by the learning rate to learn the amount
we will have to change the current value of our weights.
▪This is called the Backward phase where we back-propagate
the gradients from the output side towards the input side of
the network, in order to learn the degree by which we will
have to increase/decrease every single weight in our neural
network.

173
▪Finally, we will add this value to the old value of our weights,
to compute their new values. This is called LEARNING! We are
learning the weights based on the errors that we make for
every training data!

174
175
176
177
▪Stochastic gradient descent (SGD) uses one example from the dataset at a time and updates the model's parameters after each example.
▪Let’s say we have 5 observations and each observation has
three features and the values that I’ve taken are completely
random.

178
▪Now if we use SGD, we will take the first observation, pass it through the neural network, calculate the error, and then update the parameters.

179
▪Then we will take the second observation and perform similar steps with it. This step will be repeated until all observations have been passed through the network and the parameters have been updated.

180
▪Each time the parameters are updated, it is known as an Iteration. Since we have 5 observations here, the parameters will be updated 5 times, i.e., there will be 5 iterations.
▪Had this been Gradient Descent, we would have passed all the observations together and the parameters would have been updated only once. In the case of SGD, there will be 'm' iterations per epoch, where 'm' is the number of observations in the dataset.

181
Mini-batch Gradient Descent
▪Another type of Gradient Descent is the Mini-batch Gradient
Descent
Again let’s take the same example. Assume that the batch size
is 2. So we’ll take the first two observations, pass them
through the neural network, calculate the error and then
update the parameters.

182
▪Then we will take the next two observations and perform similar steps, i.e., pass them through the network, calculate the error, and update the parameters.

183
▪Since we're left with a single observation for the final iteration, we will update the parameters using this one observation.
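A sketch contrasting the three regimes on 5 observations; the random data and the batch sizes are illustrative (batch_size = 1 gives SGD, 2 the mini-batch case above, 5 full-batch gradient descent):

import numpy as np

X = np.random.rand(5, 3)   # 5 observations, 3 features (random stand-ins).

def updates_per_epoch(X, batch_size):
    count = 0
    for start in range(0, len(X), batch_size):
        batch = X[start:start + batch_size]   # The final batch may be smaller.
        # ... forward pass, error computation, and parameter update go here ...
        count += 1
    return count

print(updates_per_epoch(X, 1))   # 5 iterations per epoch: SGD.
print(updates_per_epoch(X, 2))   # 3 iterations: mini-batch (last batch has 1 row).
print(updates_per_epoch(X, 5))   # 1 iteration: full-batch gradient descent.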

184
185
186
▪Comparison between Stochastic Gradient Descent
and Gradient Descent
1.Speed: SGD is faster than gradient descent because it
uses only one training example per iteration, whereas
gradient descent requires the entire dataset. This speed
advantage becomes more significant as the size of the
dataset increases.
2.Convergence: Gradient descent has a more
predictable convergence behaviour because it uses the
average gradient of the entire dataset. In contrast,
SGD’s convergence behaviour can be more erratic
due to its random sampling of individual training
examples.
3.Memory: Gradient descent requires more memory
than SGD because it must store the entire dataset for
each iteration. SGD only needs to store the current
training example, making it more memory-efficient. 187
▪Convergence rates
•Stochastic Gradient Descent: Faster convergence rate due to
the use of single training examples in each iteration.
•Gradient Descent: Slower convergence rate, as it uses the
entire dataset for each iteration.
▪Accuracy
•Stochastic Gradient Descent: Less accurate than gradient
descent, as it calculates the gradient on single examples,
which may not accurately represent the overall dataset.
•Gradient Descent: More accurate because it uses the average
gradient calculated over the entire dataset.

188
Perceptron vs Delta rule
▪Mainly there are two differences between the perceptron and
the delta rule.

1. The perceptron is based on the output of a step function, whereas the delta rule uses the linear combination of inputs directly.
2. The perceptron is guaranteed to converge to a consistent hypothesis assuming the data is linearly separable. The delta rule converges in the limit but does not need the condition of linearly separable data.

189
Perceptron vs Delta rule

190
191
192
MULTILAYER NETWORKS AND THE
BACKPROPAGATION ALGORITHM
▪A perceptron can only learn a linear decision boundary in a
classification task, or fit a line to a bunch of data points in a
regression task.

193
▪In contrast to a simple perceptron, a multilayer neural network is capable of learning highly sophisticated (non-linear) decision boundaries for the classification task. It can also fit highly non-linear curves for the regression task. This is the same flexibility that we have with deep learning.

194
195
A Differentiable Threshold Unit
▪Each neuron in an ANN has an Activation Function, which we
denote with f().
▪We are interested in ANNs that can represent highly
sophisticated non-linear functions, which is the type of
problem we face in the real world!
▪ So, non-linearity is a desirable feature of an activation
function

196
▪The following two conditions for choosing activation function
f() :
1.We want f() to be non-linear. This makes the entire ANN a
collection of nested non-linear functions, capable of
representing some scary non-linear function.
2.We want f() to be continuous, and differentiable with
respect to its input. This makes the entire ANN trainable using
gradient descent.

197
Sigmoid function
▪The sigmoid function is a function that satisfies both of our requirements, that is, both non-linearity and differentiability.
▪A sigmoid threshold unit, is actually really similar to a
perceptron, but the difference is that it is based on a
smoothed and differentiable function, as opposed to the step
function that is used in a perceptron. Below, is a visual
representation of a sigmoid unit:

198
199
Multilayer Networks (ANN)

σ'(y) = σ(y) (1- σ(y))

200
▪Gradient Descent vs. Backpropagation: What’s the
Difference?
▪In a feedforward neural network, the input moves forward
from the input layer to the output layer. Backpropagation
helps improve the neural network’s output. It does this by
propagating the error backward from the output layer to the
input layer.

201
202
203
▪To know in which direction to navigate, gradient descent uses
backpropagation.
▪More specifically, it uses the gradients calculated through
backpropagation. These gradients are used for determining
the direction to navigate to find the minimum point.
Specifically, we aim to find the negative gradient.
▪This is because a negative gradient indicates a decreasing
slope. A decreasing slope means that moving downward will
lead us to the minimum point. For example:

204
205
206
▪ As you can see, the output of sigmoid ranges between 0 and
1, and it increases monotonically with respect to its input.
207
▪Since sigmoid can map a large input domain into a small
range of [0,1], it is commonly referred to as the squashing
function.
▪The derivative of the sigmoid function with respect to its input can be computed from its output:
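A short sketch of this property, computing the derivative from the output alone:

import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

y = np.array([-2.0, 0.0, 2.0])
s = sigmoid(y)
print(s * (1.0 - s))   # sigma'(y) = sigma(y) * (1 - sigma(y)); no re-derivation needed.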

208
The BACKPROPAGATION Algorithm

▪The BACKPROPAGATION Algorithm learns the weights for a multilayer network, given a network with a fixed set of units and interconnections.
▪It employs gradient descent to attempt to minimize the
squared error between the network output values and the
target values for these outputs.
▪This section presents the BACKPROPAGATION algorithm, and
the following section gives the derivation for the gradient
descent weight update rule used by BACKPROPAGATION

209
BACKPROPAGATION

210
Because we are considering networks with
multiple output units rather than single
units as before, we begin by redefining E to
sum the errors over all of the network
output units
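In symbols (Mitchell's formulation):

E(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2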

Here outputs is the set of output units in the network, and tkd and okd are the target and output values associated with the kth output unit and training example d.

One major difference in the case of multilayer networks is that the error surface can have multiple local minima.
Despite this obstacle, in practice the Backpropagation Algorithm has been found to produce excellent results in many real-world applications.
211
212
BACKPROPAGATION Algorithm

213
▪This algorithm applies to feedforward networks
•containing two layers of sigmoid units,
•with units at each layer connected to all units
from the preceding layer.
▪This is the incremental or stochastic, gradient
descent version of Backpropagation.
▪The notation used here is the same as that used in
earlier sections, with the following extensions:

214

215

216
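For reference, the per-example updates this stochastic version performs (in Mitchell's standard notation, with η the learning rate and x_{ji} the ith input to unit j) are:

For each output unit k:   \delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)
For each hidden unit h:   \delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{kh} \, \delta_k
For each weight:          w_{ji} \leftarrow w_{ji} + \eta \, \delta_j \, x_{ji}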
Termination Conditions

217
▪ Gradient Descent is widely used for optimization in machine
learning and deep learning.

The conditions for its application: a differentiable objective function from a continuous domain, preferably convex, a properly chosen learning rate, and scaled features, among others.

1. Differentiable Objective Function: The cost or loss function must be differentiable so that gradients can be calculated.
2. Continuous Domain: It is applied to problems with continuous input variables since it relies on gradient computation.

218
3. Convexity (Preferable): Gradient Descent works best with
convex functions where a global minimum exists, as it
ensures convergence to the optimum. In non-convex
functions, it might get stuck in local minima.
4. Learning Rate: A proper learning rate must be defined. Too
high can lead to overshooting, while too low can make
convergence very slow.
5. Feature Scaling: Features in the dataset should be
normalized or standardized for smoother and faster
convergence.

219
Difficulties in Applying Gradient Descent
While Gradient Descent is powerful, there are several challenges:
1. Choice of Learning Rate: Selecting an optimal learning rate can be tricky. A poor choice can lead to divergence (too high) or slow convergence (too low).
2. Local Minima and Saddle Points: In non-convex optimization problems, it can get stuck in local minima or saddle points where gradients are zero but the point is not optimal.
3. Computational Cost: For very large datasets, computing the gradient over the entire dataset can be computationally expensive and time-consuming.
4. Exploding/Vanishing Gradients: In deep learning, gradients can either become excessively large (exploding) or diminish to near zero (vanishing), hampering learning.
5. Feature Scaling Dependency: If features are not scaled properly, the algorithm might take longer to converge.

220
Learning Algorithm: Backpropagation
The following slides describe the teaching process of a multi-layer neural network employing the backpropagation algorithm. To illustrate this process, the three-layer neural network with two inputs and one output shown in the picture below is used:
Learning Algorithm: Backpropagation
Each neuron is composed of two units. The first unit adds the products of weight coefficients and input signals. The second unit realises a nonlinear function, called the neuron transfer (activation) function. Signal e is the adder's output signal, and y = f(e) is the output signal of the nonlinear element. Signal y is also the output signal of the neuron.
Learning Algorithm: Backpropagation
• To teach the neural network we need a training data set. The training data set consists of input signals (x1 and x2) assigned with the corresponding target (desired output) z.

• The network training is an iterative process. In each iteration the weight coefficients of the nodes are modified using new data from the training data set. The modification is calculated using the algorithm described below:

• Each teaching step starts with forcing both input signals from the training set. After this stage we can determine the output signal values for each neuron in each network layer.
Learning Algorithm: Backpropagation
The pictures below illustrate how the signal propagates through the network. Symbols w(xm)n represent the weights of the connections between network input xm and neuron n in the input layer. Symbol yn represents the output signal of neuron n.
Learning Algorithm: Backpropagation
Learning Algorithm: Backpropagation
Propagation of signals through the output layer.
Learning Algorithm: Backpropagation
In the next algorithm step, the output signal of the network y is compared with the desired output value (the target), which is found in the training data set. The difference is called the error signal d of the output layer neuron.
Learning Algorithm: Backpropagation
The idea is to propagate the error signal d (computed in a single teaching step) back to all neurons whose output signals were inputs for the neuron in question.
Learning Algorithm: Backpropagation
The weight coefficients wmn used to propagate errors back are equal to those used during computing the output value. Only the direction of data flow is changed (signals are propagated from outputs to inputs one after the other). This technique is used for all network layers. If propagated errors come from several neurons, they are added. The illustration is below:
Learning Algorithm: Backpropagation
When the error signal for each neuron is computed, the weight coefficients of each neuron's input node may be modified. In the formulas below, df(e)/de represents the derivative of the neuron activation function (whose weights are modified).
▪Backpropagation Convergence
•Backpropagation performs gradient descent, so it will converge to a local minimum (perhaps not the global one). To alleviate this you can
1.add momentum (see the momentum term below),
2.use stochastic gradient descent,
3.train multiple nets with different initial weights.
4.Convergence is slow at first, then fast.
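The momentum variant makes the weight update on iteration n depend partially on the update of iteration n-1, with momentum constant 0 ≤ α < 1:

\Delta w_{ji}(n) = \eta \, \delta_j \, x_{ji} + \alpha \, \Delta w_{ji}(n-1)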

237
238
239

240
241
242
243
244
Illustration
▪we will build a neural network with three layers:
•Input layer with two inputs neurons
•One hidden layer with two neurons
•Output layer with a single neuron

245
Dataset

246
▪Our initial weights will be as follows: w1 = 0.11, w2 = 0.21, w3 = 0.12, w4 = 0.08, w5 = 0.14 and w6 = 0.15
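A sketch of the forward pass for this 2-2-1 network with the weights above. The wiring (w1, w2 into hidden neuron h1; w3, w4 into h2; w5, w6 into the output), the example inputs and target, and the linear activations are all assumptions for illustration; the actual values are in the dataset figure:

# Assumed wiring and hypothetical example values (see the dataset figure).
w1, w2, w3, w4, w5, w6 = 0.11, 0.21, 0.12, 0.08, 0.14, 0.15
x1, x2, target = 2.0, 3.0, 1.0          # Illustrative training example.

h1 = x1 * w1 + x2 * w2                  # Hidden neuron 1 (linear activation assumed).
h2 = x1 * w3 + x2 * w4                  # Hidden neuron 2.
output = h1 * w5 + h2 * w6              # Output neuron.

error = 0.5 * (target - output) ** 2    # Squared error for this example.
print(output, error)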

247
Forward Pass

248
Calculating Error

249
Reducing Error

The question now is how to change/update the weight values so that the error is reduced?

The answer is Backpropagation!

250
Backpropagation (of errors)

251
▪Now, using the new weights we will repeat the forward pass
through another training example

252
253
▪Some of the main problems associated with the
backpropagation learning algorithm:
1.Vanishing and Exploding Gradients:
In deep neural networks, especially those with many layers,
the gradients of the loss function with respect to the
weights may become very small (vanishing gradients) or
very large (exploding gradients). This can make it difficult to
update the weights effectively and lead to slow
convergence or divergence during training.
2.Local Minima and Saddle Points:
Backpropagation is prone to getting stuck in local minima
or saddle points in the optimization landscape. While it is
often able to escape shallow local minima, it can struggle
with deeper and more complex optimization landscapes.

254
3. Sensitivity to Initial Conditions:
The performance of backpropagation can be sensitive to
the initial values of the weights. Choosing appropriate
initial conditions is crucial for achieving good convergence
and avoiding poor local minima.
4.Overfitting:
Backpropagation may lead to overfitting, especially when
dealing with small datasets or when the model is too
complex. Overfitting occurs when the model learns the
training data too well, including its noise and outliers, and
fails to generalize to new, unseen data.

255
5. Requires Large Amounts of Labeled Data:
Backpropagation relies on supervised learning and
requires a large amount of labeled training data to
effectively learn the underlying patterns. In some
domains, obtaining labeled data can be expensive or
impractical.
6.Computational Intensity:
Training deep neural networks using backpropagation
can be computationally intensive, especially for
large models and datasets. This can lead to long
training times and may require significant
computational resources.

256
7. Hyperparameter Sensitivity:
The performance of the backpropagation algorithm is
sensitive to the choice of hyperparameters, such
as learning rate, batch size, and regularization terms.
Finding the right set of hyperparameters can be a
challenging and time-consuming task.
8. Lack of Interpretability:
Neural networks trained with backpropagation are
often considered as black-box models, lacking
interpretability. Understanding the inner workings of
a neural network and explaining its decisions can be
challenging.

257
▪Here are several strategies to speed up the learning process in multi-layer neural networks:
1.Batch Normalization:
Batch normalization normalizes the inputs of each layer during training. It helps mitigate issues like vanishing or exploding gradients and allows for more stable and faster convergence.
2.Weight Initialization:
Proper weight initialization is crucial for faster convergence. Techniques such as He initialization or Xavier/Glorot initialization can help set the initial weights to values that promote efficient learning.

258
3. Use of Different Activation Functions:
Choosing appropriate activation functions can impact the
learning speed. Rectified Linear Unit (ReLU) is a popular
choice due to its simplicity and ability to mitigate the
vanishing gradient problem.
4. Learning Rate Scheduling:
Adaptive learning rate strategies, like learning rate
annealing or scheduling, can be employed to adjust the
learning rate during training. Techniques such as Cyclical
Learning Rates or using learning rate schedules like Step
Decay or Exponential Decay can help balance faster
convergence without overshooting.

259
5. Optimization Algorithms:
Advanced optimization algorithms, such as Adam,
RMSprop, or AdaGrad, can be more efficient than
traditional stochastic gradient descent (SGD). They adapt
the learning rates for each parameter, potentially leading to
faster convergence.
6. Data Augmentation:
Increasing the effective size of the training dataset through
data augmentation can help the model generalize better
and converge faster. Techniques like rotation, scaling, and
flipping can be applied to artificially increase the diversity
of the training set.

260
7. Dropout Regularization:
Dropout is a regularization technique that randomly drops
a subset of neurons during training. It helps prevent
overfitting and encourages the network to learn more
robust and general features, potentially accelerating
convergence.
8. Early Stopping:
Monitor the model's performance on a validation set
during training, and stop training when the performance
starts to degrade. This prevents overfitting and can lead to
faster convergence.

261
9. Parallelization and GPU Acceleration:
Training deep neural networks can be computationally
intensive. Utilizing parallel computing resources, such as
multiple GPUs, can significantly speed up the training
process.
10.Transfer Learning:
Transfer learning involves using pre-trained models on a
related task to initialize the weights of your network. This
can provide a good starting point, especially when you have
limited labeled data for your specific task.

262
