
AI and its Application

Dr. Sachin Kumar Jain


PDPM IIITDM Jabalpur
skjain@iiitdmj.ac.in

Teachers open the door but you must enter by yourself


---Chinese Proverb
Course Outline
Module-I
• What is AI? AI Concepts, Terminology, and Application Areas, AI: Issues,
Concerns and Ethical Considerations, The Future with AI, uninformed search,
Heuristic search.

Module II
• Uncertainty in AI, Uncertainty, Probability, Syntax and Semantics, Inference,
Independence and Bayes' Rule, Bayesian Network, Neural Networks, Support
Vector Machine.

Module III
• Classification & Regression, Supervised, Unsupervised and Reinforcement
Learning, Theory, concepts and applications.

Module IV
• Applications of AI.
Module-III
Learning & Adaptation
Knowledge Representation

➢ Training Data
➢ Basic Rules
➢ Building prior information into NN design
▪ Restricting network architecture
▪ Constraining the choice of synaptic weights

➢ Building invariances into NN design


▪ Invariance by structure
▪ Invariance by training
▪ Invariant feature space
Learning & Adaptation
➢ Learning is an iterative process of adjustments applied to a
network's synaptic weights and bias levels.
➢ Learning is a relatively permanent change in behavior
brought about by experience.

Learning Process
➢ The NN is stimulated by an environment.
➢ The NN undergoes changes in its free parameters as a result of this stimulation.
➢ The NN responds in a new way to the environment.
Learning Paradigms
➢ Supervised (Learning with teacher)
➢ Unsupervised (Learning without teacher)

➢ Batch Learning
➢ Instantaneous Learning

➢ Offline Learning
➢ Online Learning (Learning-on-the-fly)
Learning Paradigms
Supervised (Learning with teacher)
Supervised learning is the process of training an algorithm to
learn a mapping from an input to a particular output. This is
achieved using the labelled datasets that you have collected.
If the mapping is correct, the algorithm has learned
successfully.
The teacher acts as a supervisor, or an authoritative source of
information, that the student can rely on to guide their
learning. You can also think of the student's mind as a
computational engine.
Learning Paradigms
Supervised Learning: Algorithms
Popular supervised learning algorithms include:
• Error-correction learning rule
• Linear regression
• Support Vector Machine (SVM)
• Logistic regression
• Random forest
Learning Paradigms
Supervised (Learning with teacher)
Supervised learning also has challenges and disadvantages
that you may face while working with these algorithms. Some
of these include:
• The algorithm can easily be overfitted
• Good, representative examples are needed for training
• Computation time for training can be very large
• Unwanted (irrelevant) data can reduce the accuracy
• Pre-processing of data is always a challenge
• If the dataset is labelled incorrectly, the algorithm learns
incorrectly, which can lead to losses
Learning Paradigms
Unsupervised Learning
Unsupervised Learning can be thought of as self-learning where
the algorithm can find previously unknown patterns in datasets
that do not have any sort of labels. You do not interfere when
the algorithm learns.
It helps in modelling probability density functions, finding
anomalies in the data, and much more. For example, think of a
student who has textbooks and all the required material to study
but has no teacher to guide. Ultimately, the student will have to
learn by himself or herself to pass the exams.
Learning Paradigms
Unsupervised Learning
Unsupervised Learning can be classified under the following
two types:
• Clustering
• Association
Learning Paradigms
Unsupervised Learning
Although unsupervised learning is used in many well-known
applications and often works well, it still has several
disadvantages, some of which are:
• There is no way of knowing how the data has been grouped
or sorted, as the dataset is unlabeled.
• Results may be less accurate, since the input data is not
labelled by humans and the machine must organize it itself.
• The information obtained by the algorithm may not always
correspond to the output classes that we require.
• The user has to understand and map the obtained output to
the corresponding labels.
Learning Paradigms
Reinforcement Learning
In reinforcement learning, the learning of an input-output
mapping is performed through continued interaction with the
environment in order to minimize a scalar index of
performance.
Reinforcement learning is
closely related to dynamic
programming.
It solves a particular kind of
problem where decision
making is sequential, and the
goal is long-term.
Learning Paradigms
Supervised vs Unsupervised Learning

Feature       | Supervised Learning                      | Unsupervised Learning
Accuracy      | More accurate results                    | Less accurate results
Complexity    | Less complex and more easily understood  | More complex; requires more computation power due to ambiguity in the data
Input/Output  | Both input and output variables are given | Only input variables are given
Time          | Learning takes place offline             | Learning takes place online and in real time
Learning Rules
➢ Error Correction Learning
➢ Memory Based Learning
➢ Hebbian Learning
➢ Competitive Learning
➢ Boltzmann Learning
Learning Rules
Error Correction Learning
Error-correction learning is the technique of comparing the system
output with the desired output value and using that error to direct the
training.

Error-correction learning algorithms attempt to minimize this error
signal at each training iteration.

In the most direct route, the error values can be used to directly
adjust the tap weights, using an algorithm such as the
backpropagation algorithm.
Learning Rules
Error Correction Learning: Illustration
Let us consider a FFNN with only a single neuron k as the
computational node in the output layer. Neuron k is driven by a
signal vector x(n) produced by one or more layers of hidden
neurons.

An error signal, denoted by ek(n), is defined as

ek(n) = dk(n) − yk(n)

where dk(n) is the desired response and yk(n) is the output of
neuron k at time step n.
Learning Rules
Error Correction Learning: Illustration
The error signal ek(n) actuates a control mechanism, the purpose of
which is to apply a sequence of corrective adjustments to the
synaptic weights of neuron k. The corrective adjustments are
designed to make the output signal yk(n) come closer to the desired
response dk(n) in a step-by-step manner.
This objective is achieved by minimizing a cost function, or index of
performance, ℰ(n), defined in terms of the error signal ek(n) as:

ℰ(n) = (1/2) ek²(n)

In particular, minimization of the cost function ℰ(n) leads to a
learning rule commonly referred to as the delta rule or Widrow-Hoff
rule.
Learning Rules
Error Correction Learning: Illustration
Let ωkj(n) denote the value of synaptic weight ωkj of neuron k
excited by element xj(n) of the signal vector x(n) at time step n.
According to the delta rule, the adjustment Δωkj(n) applied to the
synaptic weight ωkj at time step n is defined by

Δωkj(n) = η ek(n) xj(n)

where η is a positive constant that determines the rate of learning
as we proceed from one step in the learning process to another.
Having computed the synaptic adjustment Δωkj(n), the updated value
of the synaptic weight ωkj is given by

ωkj(n + 1) = ωkj(n) + Δωkj(n)
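As an illustration (not part of the original slides), here is a minimal NumPy sketch of the delta-rule update for a single linear neuron; the function name delta_rule_step and the toy data are assumptions made for this example.

```python
import numpy as np

def delta_rule_step(w, x, d, eta=0.1):
    """One error-correction (delta rule) update for a single linear neuron.

    w   : current weight vector of neuron k
    x   : input signal vector x(n)
    d   : desired response d(n)
    eta : learning-rate parameter
    """
    y = np.dot(w, x)          # neuron output y_k(n)
    e = d - y                 # error signal e_k(n) = d_k(n) - y_k(n)
    w = w + eta * e * x       # delta rule: w_kj(n+1) = w_kj(n) + eta * e_k(n) * x_j(n)
    return w, e

# Example usage: drive the weights so that the neuron maps x to d.
w = np.zeros(3)
x = np.array([1.0, 0.5, -0.2])
d = 0.8
for n in range(50):
    w, e = delta_rule_step(w, x, d)
print("final error:", e)       # error shrinks towards zero
```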
Learning Rules
Memory Based Learning
In memory-based learning, all (or most) of the past experiences are
explicitly stored in a large memory of correctly classified input-
output examples {(xi, di)}, i = 1, …, N, where xi denotes an input
vector and di denotes the corresponding desired response.
All memory-based learning algorithms involve two essential
ingredients:
a. Criterion used for defining the local neighborhood of the test vector
xtest.
b. Learning rule applied to the training examples in the local
neighborhood of xtest.
Learning Rules
Memory Based Learning: Example
For example, in a binary pattern classification problem there are two
classes or hypotheses, denoted by ԑ1 and ԑ2, to be considered. In this
example, the desired response di takes the value 0 (or −1) for class ԑ1
and the value 1 for class ԑ2.
When classification of a test vector xtest (not seen before) is required,
the algorithm responds by retrieving and analysing the training data
in a "local neighborhood" of xtest.
Learning Rules
Memory Based Learning: Nearest Neighbor
In a simple yet effective type of memory-based learning known as
the nearest neighbor rule, the local neighborhood is defined as the
training example that lies in the immediate neighborhood of the
test vector xtest. In particular, the vector

x'N ∈ {x1, x2, …, xN}

is said to be the nearest neighbor of xtest if

min over i of d(xi, xtest) = d(x'N, xtest)

where d(xi, xtest) is the Euclidean distance between the vectors xi and
xtest.
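A small illustrative sketch (an addition, not from the slides) of the nearest neighbor rule using the Euclidean distance; the toy training set and labels are invented for the example.

```python
import numpy as np

def nearest_neighbor_classify(X_train, d_train, x_test):
    """Classify x_test with the nearest-neighbor rule using Euclidean distance."""
    dists = np.linalg.norm(X_train - x_test, axis=1)   # d(x_i, x_test) for all i
    i_min = np.argmin(dists)                           # index of the nearest neighbor
    return d_train[i_min]

# Toy data: class -1 around the origin, class +1 around (3, 3).
X_train = np.array([[0.0, 0.2], [0.3, -0.1], [3.1, 2.9], [2.8, 3.2]])
d_train = np.array([-1, -1, +1, +1])
print(nearest_neighbor_classify(X_train, d_train, np.array([2.5, 2.5])))  # -> +1
```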
Learning Rules
Hebbian Learning
Hebb's postulate of learning is the oldest and the most famous of all
learning rules; it is named in honor of the neuropsychologist Hebb.

"When an axon of cell A is near enough to excite a cell B and
repeatedly or persistently takes part in firing it, some growth process
or metabolic changes take place in one or both cells such that A's
efficiency, as one of the cells firing B, is increased."
Learning Rules
Hebbian Learning
➢ If two neurons on either side of a synapse are activated
simultaneously then the strength of that synapse is selectively
increased.
➢ If two neurons on either side of a synapse are activated
asynchronously then the strength of that synapse is
selectively weakened or eliminated.
➢ If there is no signal correlation, the weight does not change.
The sign of the weight between two nodes depends on the sign
of the input between those nodes.
Learning Rules
Hebbian Learning: Mathematical models

➢ Hebb's hypothesis

Δwkj(n) = γ yk(n) xj(n)

➢ Covariance hypothesis

Δwkj(n) = γ (yk(n) − ȳ)(xj(n) − x̄)
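The two hypotheses can be sketched in a few lines of NumPy; this is an illustrative addition (not from the slides), with the learning-rate value and the toy activities chosen arbitrarily.

```python
import numpy as np

def hebb_update(w_kj, x_j, y_k, gamma=0.01):
    """Hebb's hypothesis: strengthen w_kj in proportion to the product of
    presynaptic activity x_j and postsynaptic activity y_k."""
    return w_kj + gamma * y_k * x_j

def covariance_update(w_kj, x_j, y_k, x_bar, y_bar, gamma=0.01):
    """Covariance hypothesis: correlate deviations from the time-averaged
    activities x_bar and y_bar instead of the raw signals."""
    return w_kj + gamma * (y_k - y_bar) * (x_j - x_bar)

w = 0.5
w = hebb_update(w, x_j=1.0, y_k=0.8)                     # both active -> weight grows
w = covariance_update(w, 1.0, 0.8, x_bar=0.5, y_bar=0.5)  # above-average activity -> weight grows
print(w)
```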
Learning Rules
Competitive Learning
In competitive learning, as the name implies, the output neurons of a
neural network compete among themselves to become active.

Whereas in a neural network based on Hebbian learning several
output neurons may be active simultaneously, in competitive
learning only a single output neuron is active at any one time.
It is concerned with unsupervised training.
It is also known as the 'winner-takes-all'
rule: only one neuron remains active at
a time.
Learning Rules
Competitive Learning
For a neuron k to be the winning neuron, its induced local field vk for
a specified input pattern x must be the largest among all the neurons
in the network. The output signal yk of the winning neuron k is set equal
to one; the output signals of all the neurons that lose the
competition are set equal to zero:

yk = 1 if vk > vj for all j, j ≠ k
yk = 0 otherwise

where the induced local field vk represents the combined action of
all the forward and feedback inputs to neuron k.
Learning Rules
Competitive Learning: Process
Let ωkj denote the synaptic weight connecting input node j to neuron
k. Suppose that each neuron is allotted a fixed amount of synaptic
weight (i.e., all synaptic weights are positive), which is distributed
among its input nodes; that is,

Σj ωkj = 1 for all k.

A neuron then learns by shifting synaptic weights from its inactive
to its active input nodes. If a neuron does not respond to a particular
input pattern, no learning takes place in that neuron.
Learning Rules
Competitive Learning: Process
If a particular neuron wins the competition, each input node of that
neuron relinquishes some proportion of its synaptic weight, and the
weight relinquished is then distributed equally among the active
input nodes. According to the standard competitive learning rule, the
change Δωkj applied to synaptic weight ωkj is defined by

Δωkj = η (xj − ωkj)  if neuron k wins the competition
Δωkj = 0             if neuron k loses the competition

where η is the learning-rate parameter. This rule has the overall
effect of moving the synaptic weight vector ωk of the winning neuron k
towards the input pattern x.
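A minimal sketch (not from the slides) of this winner-takes-all update; the two-cluster toy data and the use of the largest induced local field to pick the winner are assumptions made for this example.

```python
import numpy as np

def competitive_step(W, x, eta=0.1):
    """One step of the standard competitive ('winner-takes-all') learning rule.

    W : weight matrix, one row of synaptic weights per output neuron
    x : input pattern
    """
    v = W @ x                      # induced local fields v_k
    k = np.argmax(v)               # winning neuron: largest local field
    W[k] += eta * (x - W[k])       # move w_k towards the input pattern x
    return W, k

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 3], 0.2, (20, 2)),     # two clusters of inputs
               rng.normal([3, 0], 0.2, (20, 2))])
W = rng.random((2, 2))                               # two competing neurons
for x in X:
    W, _ = competitive_step(W, x)
print(W)    # each weight vector drifts towards one cluster centre
```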
Learning Rules
Boltzmann Learning
The Boltzmann learning rule, named in honor of Ludwig Boltzmann,
is a stochastic learning algorithm derived from ideas rooted in
statistical mechanics. A neural network designed on the basis of the
Boltzmann learning rule is called a Boltzmann machine.
In a Boltzmann machine the neurons constitute a recurrent structure,
and they operate in a binary manner; for example, they are
either in an 'on' state denoted by +1 or in an 'off' state denoted by −1.
The machine is characterised by an energy function E, the value
of which is determined by the particular states occupied by the
individual neurons of the machine.
Learning Rules
Boltzmann Learning
The energy function is given as

E = −(1/2) Σj Σk, k≠j wkj xk xj

where xj is the state of neuron j, wkj is the synaptic weight from
neuron j to neuron k, and the condition k ≠ j means that no neuron
has self-feedback.
Learning Rules
Boltzmann Learning

Visible nodes are those nodes which we can and do
measure, and hidden nodes are those nodes which we
cannot or do not measure.
A Perceptron
Perceptron Convergence Algorithm

➢ Variables and parameters


– x(n) = [+1, x1(n),…, xp(n)]; w(n) = [b(n), w1(n),…,wp(n)]
– y(n) = actual response (output); d(n) = desired
response
– η = learning rate, a positive number less than 1

• Step 1: Initialization
– Set w(0) = 0, then do the following for n = 1, 2, 3, …

• Step 2: Activation
– Activate the perceptron by applying input vector x(n)
and desired output d(n)
Perceptron Convergence Algorithm

• Step 3: Computation of actual response


– y(n) = sgn[wT(n)x(n)]
– Where sgn(.) is the signum function

• Step 4: Adaptation of weight vector


w(n+1) = w(n) + η[d(n) – y(n)]x(n)

Where
d(n) = +1 if x(n) belongs to C1
d(n) = -1 if x(n) belongs to C2

• Step 5
– Increment n by 1, and go back to step 2
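A hedged NumPy sketch (an addition, not the slides' own code) of the perceptron convergence algorithm as listed in Steps 1-5; the toy classes C1 and C2 and the stopping check are assumptions for this example.

```python
import numpy as np

def perceptron_train(X, d, eta=0.1, max_epochs=100):
    """Perceptron convergence algorithm with an augmented input [+1, x1,...,xp],
    so that w = [b, w1, ..., wp]."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])     # prepend the fixed +1 input
    w = np.zeros(X_aug.shape[1])                         # Step 1: w(0) = 0
    for epoch in range(max_epochs):
        errors = 0
        for x_n, d_n in zip(X_aug, d):                   # Step 2: activation
            y_n = 1.0 if w @ x_n >= 0 else -1.0          # Step 3: y(n) = sgn[w^T(n) x(n)]
            w = w + eta * (d_n - y_n) * x_n              # Step 4: adaptation of weight vector
            errors += int(y_n != d_n)
        if errors == 0:                                  # separable data: stop when no mistakes
            break
    return w

# Toy linearly separable problem: class C1 (d = +1) vs class C2 (d = -1).
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
d = np.array([+1, +1, -1, -1])
print(perceptron_train(X, d))
```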
Optimization Techniques
➢ Method of Steepest Descent
➢ Newton’s Method
➢ Gauss-Newton Method
➢ Least Mean Square Algorithm
Method of Steepest Descent
➢ The necessary condition for optimality is that the gradient
of the cost function should be zero, i.e.:

∇£(w) = 0,  where  £(w) = (1/2) Σi=1..N (di − F(xi, w))²

➢ The gradient vector of the cost function can be expressed as:

∇£(w) = [∂£/∂w1, ∂£/∂w2, …, ∂£/∂wm]

➢ In steepest descent, the successive adjustments applied to the
weight vector w are in the direction of steepest descent, that is, in
a direction opposite to the gradient vector ∇£(w). Accordingly,
the weight update can be written as:

w(n + 1) = w(n) − η ∇£(w(n))

where ∇£(w(n)) is the gradient vector evaluated at w(n).
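An illustrative sketch (an addition, not from the slides) of the steepest-descent iteration w(n+1) = w(n) − η∇£(w(n)); the quadratic cost used here is a stand-in, not the cost function of the problem slide that follows.

```python
import numpy as np

def steepest_descent(grad, w0, eta=0.3, n_iter=100):
    """Generic steepest-descent iteration: w(n+1) = w(n) - eta * grad(w(n))."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iter):
        w = w - eta * grad(w)
    return w

# Stand-in quadratic cost E(w) = 0.5*(w1 - 1)^2 + (w2 + 2)^2, with gradient below.
grad = lambda w: np.array([w[0] - 1.0, 2.0 * (w[1] + 2.0)])
print(steepest_descent(grad, w0=[0.0, 0.0]))      # approaches the minimum [1, -2]
```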
Problem on Steepest Descent
➢ The cost function is:

where, σ2 is some constant, and

a) Find the optimum value of w for which the cost function


reaches its minimum value. Answer: w1=1.6, w2=-0.954
b) Using steepest descent method, compute w for learning
rate η=0.3 and 1. Consider, initial w(0)=[1, 0].
Least-Mean Square Algorithm
➢ It operates with a linear neuron only, and it is based on the
use of instantaneous values for the cost function, i.e.

£(w) = (1/2) e²(n)

➢ Differentiating it with respect to w yields

∂£(w)/∂w = e(n) ∂e(n)/∂w

➢ As the error signal is defined as e(n) = d(n) − xᵀ(n) w(n), hence

∂e(n)/∂w = −x(n)

Accordingly, the weight update using LMS can be expressed as:

w(n + 1) = w(n) + η x(n) e(n)
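A minimal sketch of this LMS update (an illustration, not part of the slides); the synthetic data, the target weights [2, −1] and the learning rate are assumptions.

```python
import numpy as np

def lms(X, d, eta=0.01, n_epochs=50):
    """Least-mean-square (Widrow-Hoff) update: w(n+1) = w(n) + eta * x(n) * e(n)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_n, d_n in zip(X, d):
            e_n = d_n - x_n @ w          # instantaneous error e(n) = d(n) - x^T(n) w(n)
            w = w + eta * x_n * e_n      # LMS weight update
    return w

# Recover a known linear mapping d = x . [2, -1] from noisy samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
d = X @ np.array([2.0, -1.0]) + 0.01 * rng.normal(size=200)
print(lms(X, d))                         # close to [2, -1]
```

Here eta is chosen well below the stability bound 2/tr[Rx] discussed on the next slide.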
Least-Mean Square Algorithm
The learning rate is very important, as it greatly affects the
convergence of LMS.
An estimated limit on the learning rate is formulated as:

0 < η < 2 / tr[Rx]

where the denominator is the trace of the input correlation matrix Rx,
which is defined as the sum of the diagonal elements of the square
matrix Rx.
MLFFNN: Key Terms
➢ Epoch: One complete presentation of the entire
training data set during the learning.
➢ Function signal vs Error signal (Forward vs Backward)
➢ Sequential vs Batch update (Speed vs Sturdy)
➢ Activation function (Differentiable)
➢ Learning rate and Momentum (Fast learning vs stability)
➢ Stopping criteria (Target error, Max. iterations, convergence)
➢ Input normalization (Mean removal, Scaling, Decorrelation)
➢ Parameters initialization (avoid extreme values)
Back-Propagation Algorithm
Overview:
➢ To compute the weights of a feedforward multilayer neural
network adaptively, given a set of labeled training examples.
➢ Backpropagation works by applying the gradient descent rule
to a feedforward network.
➢ The algorithm is composed of two parts that are repeated over
and over until either it settles down to an optimal solution, or
a pre-set maximum number of epochs is completed.
• Feedforward pass/sweep:
• Backpropagation pass/sweep:
Back-Propagation Algorithm
Overview:
➢ Feedforward pass/sweep: A training input pattern is
presented to the network input layer. The network propagates
the input pattern from layer to layer until the output pattern is
generated by the output layer.
➢ Backpropagation pass/sweep: An error is calculated and then
propagated backwards through the network from the output
layer to the input layer. The weights are modified as the error
is propagated, starting with the hidden-to-output weights, then
the input-to-hidden weights, with respect to the sum of squares
error and through a series of weight update rules called the
Delta Rule.
Back-Propagation Algorithm
[Figure: three-layer back-propagation neural network with an input layer (x1, …, xn), a hidden layer, and an output layer (y1, …, yl); weights wij connect input to hidden nodes and wjk connect hidden to output nodes. Function (input) signals propagate forward, while error signals propagate backward.]
Back-Propagation Algorithm
Derivation:
The error signal at the output of neuron j at the nth training cycle is given as:

ej(n) = dj(n) − yj(n)

The instantaneous value of the error energy for neuron j is:

(1/2) ej²(n)

The total error energy E(n) can be computed by summing the
instantaneous error energy over all the neurons in the output layer:

E(n) = (1/2) Σj∈C ej²(n)

where C is the set of all the neurons in the output layer. If N is the total
number of training patterns, the average squared error energy is:

Eav = (1/N) Σn=1..N E(n)
Back-Propagation Algorithm
Derivation:
The back-propagation algorithm follows error-correction learning based on the
gradient descent rule. Accordingly, the weight correction/update can be
computed as:

Δwji(n) = −η ∂E(n)/∂wji(n)

E(n) is a function of a function of a function of a function of wji(n):

E(n) is a function of ej(n)
ej(n) is a function of yj(n)
yj(n) is a function of vj(n)
vj(n) is a function of wji(n)
Back-Propagation Algorithm
Derivation:
The derivatives of the individual constituents can be computed as:

∂E(n)/∂ej(n) = ∂/∂ej [ (1/2) Σj∈C ej²(n) ] = ej(n)

∂ej(n)/∂yj(n) = ∂/∂yj [ dj(n) − yj(n) ] = −1

∂yj(n)/∂vj(n) = ∂/∂vj [ φ(vj(n)) ] = φ'(vj(n))

∂vj(n)/∂wji(n) = ∂/∂wji [ Σi wji(n) yi(n) ] = yi(n)
Back-Propagation Algorithm
Derivation:
The equation for the weight corrections can then be written as:

Δwji(n) = η δj(n) yi(n)

where δj(n) is defined as the local gradient and is given by:

δj(n) = −∂E(n)/∂vj(n) = −[∂E(n)/∂ej(n)] [∂ej(n)/∂yj(n)] [∂yj(n)/∂vj(n)] = ej(n) φj'(vj(n))

If j is an output neuron, we have a definition of ej(n), so δj(n) is given by:

δj(n) = (dj(n) − yj(n)) φj'(vj(n))

If j is a hidden neuron, then δj(n) is defined in terms of the local gradients of the
neurons k that it feeds:

δj(n) = −[∂E(n)/∂yj(n)] [∂yj(n)/∂vj(n)] = φj'(vj(n)) Σk δk(n) wkj(n)
Back-Propagation Algorithm
Advantages:
➢ Easy to use and implement due to low computation burden.
➢ Robust: Can tolerate small disturbances.
➢ Applicable to wide range of problems.

Disadvantages:
➢ Slow convergence, large training time.
➢ Chances of being trapped in Local minima.
➢ Works with only differentiable activation functions.
Support Vector Machine
Overview
Support Vector Machine (SVM) is a powerful supervised
machine learning algorithm used for linear or nonlinear
classification, regression, and even outlier detection tasks.
The main objective of the SVM algorithm is to find the
optimal hyperplane in an N-dimensional space that can
separate the data points of different classes in the feature
space.
The hyperplane is chosen so that the margin between the
closest points of different classes is as large as
possible.
Support Vector Machine
Terminology
Hyperplane: The hyperplane is the decision boundary that is
used to separate the data points of different classes in a
feature space. In the case of linear classification, it is a
linear equation, i.e., wx + b = 0.
Support Vector Machine
Terminology

Support Vectors: Support vectors are the data points closest to
the hyperplane; they play a critical role in deciding the
hyperplane and the margin.
Margin: The margin is the distance between the support vectors
and the hyperplane. The main objective of the support vector
machine algorithm is to maximize the margin. A wider margin
indicates better classification performance.
Support Vector Machine
Terminology
Kernel: A kernel is a mathematical function used in SVM to
map the original input data points into high-dimensional
feature spaces, so that the hyperplane can be found easily
even if the data points are not linearly separable in the
original input space. Some common kernel functions are the
linear, polynomial, radial basis function (RBF), and sigmoid
kernels.

For example, a new variable y may be created as a function of the
distance from the origin; a non-linear function that creates such a
new variable is referred to as a kernel.
Support Vector Machine
Terminology
Hard Margin: The maximum-margin hyperplane or the
hard margin hyperplane is a hyperplane that properly
separates the data points of different categories without any
misclassifications.
Soft Margin: When the data is not perfectly separable or
contains outliers, SVM permits a soft margin technique.
Each data point has a slack variable introduced by the soft-
margin SVM formulation, which softens the strict margin
requirement and permits certain misclassifications or
violations. It discovers a compromise between increasing the
margin and reducing violations.
Support Vector Machine
Philosophy
Suppose we have a dataset that has two tags (green and blue),
and the dataset has two features, x1 and x2. We want a classifier
that can classify a pair (x1, x2) of coordinates as either green
or blue.
Since this is a 2-D space, we can easily separate
these two classes with a straight line. However, there can be
multiple lines that separate these classes.
Support Vector Machine
Philosophy
The SVM algorithm helps to find the best line or decision boundary;
this best boundary or region is called a hyperplane. The SVM
algorithm finds the closest points of the lines from both classes.
These points are called support vectors. The distance between the
vectors and the hyperplane is called the margin, and the goal of SVM
is to maximize this margin. The hyperplane with the maximum margin is
called the optimal hyperplane.

Support Vector Machine
Types of SVM
Based on the nature of the decision boundary, Support Vector Machines
(SVM) can be divided into two main parts:
Linear SVM: Linear SVMs use a linear decision boundary to separate
the data points of different classes. When the data can be precisely
linearly separated, linear SVMs are very suitable. This means that a single
straight line can entirely divide the data points into their respective
classes.
Non-Linear SVM: Non-Linear SVM can be used to classify data when it
cannot be separated into two classes by a straight line (in the case of 2D).
By using kernel functions, nonlinear SVMs can handle nonlinearly
separable data. The original input data is transformed by these kernel
functions into a higher-dimensional feature space, where the data points
can be linearly separated. A linear SVM in this transformed space then
corresponds to a nonlinear decision boundary in the original input space.
Support Vector Machine
Mathematical Formulation
Consider a binary classification problem with two classes, labeled
+1 and −1. We have a training dataset {(xi, di)}, i = 1, …, N, consisting
of input feature vectors xi and their corresponding class labels di.
The equation of a decision surface (linear
hyperplane) can be written as:

ωᵀx + b = 0

where ω is an adjustable weight vector and b
is a bias. For a linear classifier:

y = +1 if ωᵀx + b ≥ 0, i.e., di = +1
y = −1 if ωᵀx + b < 0, i.e., di = −1

The separation between the hyperplane and the closest data point is
called the margin of separation.
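For illustration (an addition to the slides, assuming scikit-learn is available), a linear SVM fitted to two separable classes labelled ±1; the synthetic data and the large C value used to approximate a hard margin are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes labelled +1 / -1, as in the formulation above.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.4, (30, 2)),
               rng.normal([-2, -2], 0.4, (30, 2))])
d = np.hstack([np.ones(30), -np.ones(30)])

clf = SVC(kernel="linear", C=1e3)          # large C approximates a hard margin
clf.fit(X, d)

w, b = clf.coef_[0], clf.intercept_[0]     # hyperplane: w^T x + b = 0
margin = 2.0 / np.linalg.norm(w)           # width of the margin of separation
print("w =", w, "b =", b)
print("support vectors:", clf.support_vectors_.shape[0], "margin =", margin)
print("prediction for [1, 1]:", clf.predict([[1.0, 1.0]]))
```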
Classification & Regression
Classification
Classification is a process of finding a function which helps in
dividing the dataset into classes based on different parameters.
In Classification, a computer program is trained on the
training dataset and based on that training, it categorizes the
data into different classes.
Classification & Regression
Regression
Regression is a process of finding the correlations between
dependent and independent variables. It helps in predicting
the continuous variables such as prediction of Market Trends,
prediction of House prices, etc.
Classification & Regression
Regression vs Classification
Regression Algorithm | Classification Algorithm
The output variable must be of a continuous nature or a real value. | The output variable must be a discrete value.
The task is to map the input value (x) to a continuous output variable (y). | The task is to map the input value (x) to a discrete output variable (y).
We try to find the best-fit line, which can predict the output more accurately. | We try to find the decision boundary, which can divide the dataset into different classes.
Used to solve regression problems such as weather prediction, house price prediction, etc. | Used to solve classification problems such as identification of spam emails, speech recognition, identification of cancer cells, etc.
Can be further divided into Linear and Non-linear Regression. | Can be divided into Binary Classifiers and Multi-class Classifiers.
Classification & Regression
Classification
There are two types of classification:
• Binary classification: A binary classification problem
has only two possible outcomes.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT
SPAM, CAT or DOG, etc.

• Multi-class classification: If a classification problem has
more than two outcomes, then it is called multi-class
classification.
Examples: classification of types of crops, classification of
types of music.
Classification & Regression
Learners in Classification Problems
In the classification problems, there are two types of learners:
• Lazy Learners: A lazy learner first stores the training dataset
and waits until it receives the test dataset. In the lazy learner
case, classification is done on the basis of the most related data
stored in the training dataset.
It takes less time in training but more time for predictions.
Examples: K-NN algorithm, case-based reasoning.
• Eager Learners: Eager learners develop a classification model
based on a training dataset before receiving a test dataset.
Opposite to lazy learners, an eager learner takes more time in
learning and less time in prediction. Examples: decision trees,
ANN.
Classification & Regression
Evaluating a Classification model
Once a classification model is designed, it is necessary to
evaluate its performance. Although there are a number of ways,
e.g. cross-entropy loss and the AUC-ROC curve (AUC: Area Under
the Curve, ROC: Receiver Operating Characteristics Curve),
the confusion matrix and F1-score are popular.
Classification & Regression
Evaluating a Classification model: Confusion Matrix
The confusion matrix provides a matrix/table as output and
describes the performance of the model. It is also known as
the error matrix.
The matrix summarizes the prediction results, giving the total
number of correct predictions and incorrect predictions.
A confusion matrix looks like the table below:

                   | Actual Positive | Actual Negative
Predicted Positive | True Positive   | False Positive
Predicted Negative | False Negative  | True Negative
Classification & Regression
Evaluating a Classification model: F1-Score
The harmonic mean of precision and recall is known as the F1-score:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Precision measures the proportion of true positives over the total
number of predicted positives, while recall measures the
proportion of true positives over the total number of actual
positives.
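A small illustrative sketch (not from the slides) that computes the confusion-matrix counts and the derived precision, recall and F1-score for binary labels 1/0; the example labels are invented.

```python
import numpy as np

def classification_report(y_true, y_pred):
    """Confusion-matrix counts and the derived precision, recall and F1-score
    for a binary problem with labels 1 (positive) and 0 (negative)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    accuracy = (tp + tn) / len(y_true)
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn,
            "precision": precision, "recall": recall, "F1": f1, "accuracy": accuracy}

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
print(classification_report(y_true, y_pred))
```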
Classification & Regression
Classification
Various types of classification algorithms are:
• K-Nearest Neighbours
• Support Vector Machines
• Decision Tree Classification
• Random Forest Classification
K-Nearest Neighbor(KNN)
Distance Metrics Used in KNN Algorithm
The K-NN algorithm works by finding the K nearest
neighbors to a given data point based on a distance metric,
such as the Euclidean distance. The class or value of the data
point is then determined by the majority vote or the average of
the K neighbors.
• Euclidean distance: d(x, y) = sqrt( Σi (xi − yi)² )
• Manhattan distance: d(x, y) = Σi |xi − yi|
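An illustrative K-NN sketch (an addition to the slides) supporting both distance metrics; the toy training points, labels and the choice k = 3 are assumptions made for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3, metric="euclidean"):
    """K-NN classification: majority vote among the K nearest training points."""
    if metric == "euclidean":
        dists = np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))  # sqrt(sum (xi - yi)^2)
    else:  # "manhattan"
        dists = np.sum(np.abs(X_train - x_test), axis=1)          # sum |xi - yi|
    nearest = np.argsort(dists)[:k]                               # indices of the K nearest neighbours
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [4.0, 4.2], [4.1, 3.9], [3.8, 4.0]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_predict(X_train, y_train, np.array([3.5, 3.5]), k=3))                  # -> "B"
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), metric="manhattan"))   # -> "A"
```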
Classification & Regression
Regression
Regression is a process of finding the correlations between
dependent and independent variables. It helps in predicting
the continuous variables such as prediction of Market Trends,
prediction of House prices, etc.
Classification & Regression
Regression: Relationship type
[Figure: scatter plots of Y versus X illustrating strong and weak relationships.]
[Figure: scatter plot of Y versus X illustrating no relationship.]
Classification & Regression
Regression
Various types of regression algorithms are:
• Simple Linear Regression
• Multiple Linear Regression
• Polynomial Regression
• Support Vector Regression
• Decision Tree Regression
• Random Forest Regression
Classification & Regression
Linear Regression
Linear regression is a type of supervised machine learning
algorithm that computes the linear relationship between a
dependent variable and one or more independent features.
When the number of independent features is 1, it is known as
univariate (simple) linear regression; in the case of more than
one feature, it is known as multivariate linear regression.
A simple linear regression involves only one independent
variable and one dependent variable. The equation for simple
linear regression is:

Y = θ1 + θ2 X
Classification & Regression
Linear Regression
The goal of the algorithm is to find the best Fit Line equation
that can predict the values based on the independent variables.
The best-fit line equation provides a straight line that
represents the relationship between the dependent and
independent variables. The best-fit line implies that the error
between the predicted and actual values should be kept to a
minimum.
We utilize the cost function to compute the best values in
order to get the best fit line since different values for weights
or the coefficient of lines result in different regression lines.
Classification & Regression
Linear Regression
The regression line is a hypothetical function, as the actual data
distribution is not strictly linear. Hence, there exists some
difference between the actual output Y and the predicted output Ŷ.
Accordingly, the cost function is defined as the Mean Squared
Error (MSE), which calculates the average of the squared
errors between the predicted values Ŷ and the actual values Y.
The purpose is to determine the optimal values for the
intercept θ1 and the coefficient of the input feature θ2,
providing the best-fit line for the given data points.
Classification & Regression
Linear Regression: Gradient Descent optimization
The MSE cost function can be written as:

Cost function J = (1/n) Σi=1..n (ŷi − yi)²

Different values for the weights or coefficients of the line (θ1, θ2)
give different regression lines, so we need to calculate the best
values for θ1 and θ2 to find the best-fit line; to do this we
minimize the above cost function.
Utilizing the MSE function, the iterative process of gradient
descent is applied to update the values of θ1 and θ2.

Courtesy: https://www.geeksforgeeks.org/
Classification & Regression
Linear Regression: Gradient Descent optimization
The optimization algorithm gradient descent trains the linear
regression model by iteratively modifying the model's
parameters to reduce the mean squared error (MSE) of the
model on a training dataset. The update rules are given as:

θ1 := θ1 − η (∂J/∂θ1) = θ1 − η (2/n) Σi=1..n (ŷi − yi)

θ2 := θ2 − η (∂J/∂θ2) = θ2 − η (2/n) Σi=1..n (ŷi − yi) · xi

Courtesy: https://www.geeksforgeeks.org/
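A minimal sketch of these update rules (an addition, not from the slides); the synthetic data generated from Y = 2 + 3X and the learning-rate value are assumptions.

```python
import numpy as np

def gd_linear_regression(x, y, eta=0.01, n_iter=5000):
    """Fit Y = theta1 + theta2*X by gradient descent on the MSE cost,
    using the update rules shown above."""
    theta1, theta2 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        y_hat = theta1 + theta2 * x
        grad1 = (2.0 / n) * np.sum(y_hat - y)          # dJ/dtheta1
        grad2 = (2.0 / n) * np.sum((y_hat - y) * x)    # dJ/dtheta2
        theta1 -= eta * grad1
        theta2 -= eta * grad2
    return theta1, theta2

# Synthetic data generated from Y = 2 + 3X plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, 100)
print(gd_linear_regression(x, y))     # close to (2, 3)
```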
Classification & Regression
Linear Regression: Least Mean Square
Another way to find the values of the model parameters that
minimize the error is the Ordinary Least Squares (OLS) method.
The formulas for θ1 and θ2 in terms of the data points are:

θ2 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²

θ1 = ȳ − θ2 x̄
Courtesy: https://www.geeksforgeeks.org/
Classification & Regression
Evaluating a Regression model
A variety of evaluation measures can be used to determine the
strength of any linear regression model. These assessment
metrics often give an indication of how well the model is
producing the observed outputs. Some of these are:
• R-squared method
• Mean Square Error (MSE)
• Mean Absolute Error (MAE)
• Root Mean Squared Error (RMSE)

Courtesy: https://www.geeksforgeeks.org/
Classification & Regression
Evaluating a Regression model
• R-squared method: It is a statistical method that determines
the goodness of fit. It can be calculated from the below
formula:
R² = Explained variation / Total variation

R² = SSRegression / SStotal = 1 − SSres / SStotal

Courtesy: https://www.geeksforgeeks.org/
Classification & Regression
Graphical view
[Figure: comparison of the linear model Y = θ1 + θ2X with the mean model ȳ.]

SStotal = SSReg + SSres

Total variability in the y-values = variability accounted for by the regression + unexplained variability
Classification & Regression
Evaluating a Regression model
• Mean Square Error (MSE): It is an evaluation metric that
calculates the average of the squared differences between
the actual and predicted values for all the data points. It is
expressed as:

MSE = (1/n) Σi=1..n (yi − ŷi)²

• MSE is sensitive to outliers, as large errors contribute
significantly to the overall score.
Classification & Regression
Evaluating a Regression model
• Mean Absolute Error (MAE): It measures the average
absolute difference between the predicted values and the
actual values. It is expressed as:

MAE = (1/n) Σi=1..n |yi − ŷi|

• A lower MAE value indicates better model performance. It is
not sensitive to outliers, as we consider absolute differences.
Classification & Regression
Evaluating a Regression model
• Root Mean Square Error (RMSE): The square root of the
residuals' variance is the Root Mean Squared Error. It
describes how well the observed data points match the
expected values. It is expressed as:

RMSE = sqrt( Σi=1..n (yi − ŷi)² / n )
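For illustration (an addition, not from the slides), a small helper computing R-squared, MSE, MAE and RMSE for a set of predictions; the sample values are invented.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """R-squared, MSE, MAE and RMSE for a set of predictions."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)            # unexplained variation
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)   # total variation
    mse = np.mean((y_true - y_pred) ** 2)
    mae = np.mean(np.abs(y_true - y_pred))
    return {"R2": 1.0 - ss_res / ss_tot, "MSE": mse, "MAE": mae, "RMSE": np.sqrt(mse)}

y_true = [3.0, 5.0, 7.5, 9.0]
y_pred = [2.8, 5.3, 7.0, 9.4]
print(regression_metrics(y_true, y_pred))
```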
Classification & Regression
Example:
A real estate agent wishes to examine the relationship between the
selling price of a home and its size (measured in square feet). A
random sample of 10 houses is selected.
Use performance measures to quantify the relationship and predict
the selling price for a house sized 2000 sq. feet.

House Price in $1000s (Y) | Square Feet (X)
245 | 1400
312 | 1600
279 | 1700
308 | 1875
199 | 1100
219 | 1550
405 | 2350
324 | 2450
319 | 1425
255 | 1700
Classification & Regression
Example:
[Figure: scatter plot of house price ($1000s) versus square feet for the 10 sampled houses.]
Classification & Regression
Example:
Regression statistics: R Square = 0.58082, Observations = 10.

[Figure: house price model scatter plot with the prediction line; intercept = 98.248, slope = 0.10977.]

house price = 98.24833 + 0.10977 × (square feet)
Classification & Regression
Example:

Predict the price for a house with 2000 square feet:

house price = 98.25 + 0.1098 × (sq. ft.)
            = 98.25 + 0.1098 × (2000)
            = 317.85

The predicted price for a house with 2000 square
feet is 317.85 ($1,000s) = $317,850
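As a check on this worked example (code added for illustration, not part of the slides), the OLS formulas applied to the house-price data reproduce the slide's intercept, slope, R-squared and prediction.

```python
import numpy as np

# House-price data from the slide: Y = price in $1000s, X = square feet.
y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255], dtype=float)
x = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700], dtype=float)

x_bar, y_bar = x.mean(), y.mean()
theta2 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)   # slope
theta1 = y_bar - theta2 * x_bar                                         # intercept

y_hat = theta1 + theta2 * x
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y_bar) ** 2)

print(f"intercept = {theta1:.3f}, slope = {theta2:.5f}, R^2 = {r2:.5f}")
# intercept ~ 98.248, slope ~ 0.10977, R^2 ~ 0.58082, matching the slide.
print("predicted price for 2000 sq. ft.:", theta1 + theta2 * 2000)      # ~ 317.8 ($1000s)
```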
Classification & Regression
Example-2:
Data about the experience and salary of employees in a company
are provided in the table. Obtain a regression model using LMS,
and compute its performance measures. Predict the salary if the
total experience of an employee is 10 years.

X (years) | Y (salary, $1,000)
3  | 30
8  | 57
9  | 64
13 | 72
3  | 36
6  | 43
11 | 59
21 | 90
1  | 20

[Figure: scatter plot of salary versus years of experience with the fitted line Y = 3.5X + 23.2.]
Classification & Regression
Example-2:

θ2 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²

θ1 = ȳ − θ2 x̄

For this data, θ1 = 23.2 and θ2 = 3.5.
Hence, the regression model will be:

Y = 23.2 + 3.5 X
Classification & Regression
Example-3:
A company that repairs small computers needs to develop a better
way of providing customers typical repair time estimates. To begin
this process, they compiled data on repair times (in minutes) and
the number of components needing repair or replacement from the
previous week. The data, sorted by number of components, are as
follows.
Obtain a regression model and its performance measures. Predict
the repair time if the number of components is 12.

i  | Number of components (xi) | Repair time (yi)
1  | 1  | 23
2  | 2  | 29
3  | 4  | 64
4  | 4  | 72
5  | 4  | 80
6  | 5  | 87
7  | 6  | 96
8  | 6  | 105
9  | 8  | 127
10 | 8  | 119
11 | 9  | 145
12 | 9  | 149
13 | 10 | 165
14 | 10 | 154
Classification & Regression
Polynomial Regression
Polynomial regression is a regression algorithm that models
the relationship between a dependent variable (y) and an
independent variable (x) as an nth-degree polynomial. The
polynomial regression equation is given below:

y = b0 + b1x + b2x² + b3x³ + … + bnxⁿ

The dataset used for training in polynomial regression is of a
non-linear nature.
Classification & Regression
Polynomial Regression
The choice of the polynomial degree (n) is a crucial aspect of
polynomial regression. A higher degree allows the model to fit the
training data more closely, but it may also lead to overfitting,
especially if the degree is too high. Therefore, the degree should be
chosen based on the complexity of the underlying relationship in the
data.

The polynomial regression model is trained to find the coefficients that


minimize the difference between the predicted values and the actual
values in the training data.

Once the model is trained, it can be used to make predictions on new,


unseen data. The polynomial equation captures the non-linear patterns
observed in the training data, allowing the model to generalize to non-
linear relationships.
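An illustrative sketch (an addition, using NumPy's polyfit as a stand-in least-squares fitter); the quadratic generating function, noise level and chosen degrees are assumptions, meant only to show the degree-selection trade-off described above.

```python
import numpy as np

# Noisy samples of a quadratic relationship y = 1 + 2x + 0.5x^2.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(0, 0.3, x.size)

# Fit polynomials of increasing degree; the coefficients minimise the squared error.
for degree in (1, 2, 8):
    coeffs = np.polyfit(x, y, degree)          # least-squares polynomial fit
    y_hat = np.polyval(coeffs, x)
    mse = np.mean((y - y_hat) ** 2)
    print(f"degree {degree}: training MSE = {mse:.4f}")
# Degree 1 underfits, degree 2 matches the generating model,
# and a much higher degree mainly chases the noise (risk of overfitting).
```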
Classification & Regression
SVM Regression
Support Vector Machine when used for regression problems is
known as Support Vector Regression (SVR). It works for
continuous variables.
SVR tries to determine a hyperplane
with a maximum margin, so that
maximum number of datapoints are
covered in that margin. The main
goal of SVR is to consider the
maximum datapoints within the
boundary lines and the hyperplane
(best-fit line) must contain a
maximum number of datapoints.
Classification & Regression
Decision Tree Regression

Decision Tree is a supervised learning algorithm which can be


used for solving both classification and regression problems.
It can solve problems with both categorical and numerical data.
Decision tree regression builds a tree-like structure in which
each internal node represents a "test" on an attribute, each
branch represents the result of the test, and each leaf node
represents the final decision or result.
Classification & Regression
Decision Tree Regression

A decision tree is constructed
starting from the root node/parent
node (the full dataset), which splits into
left and right child nodes (subsets of
the dataset). These child nodes are
further divided into their own child
nodes, and themselves become the
parent nodes of those nodes.
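As an illustration only (assuming scikit-learn's DecisionTreeRegressor, which is not mentioned in the slides), a small regression tree fitted to a noisy one-dimensional problem; the data, depth limit and test points are invented for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# One-dimensional regression problem: a noisy sine curve.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, X.shape[0])

# Each internal node tests "x <= threshold"; each leaf predicts the mean
# target value of the training samples that reach it.
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X, y)

X_test = np.array([[1.0], [2.5], [4.0]])
print(tree.predict(X_test))
print("number of leaves:", tree.get_n_leaves())
```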
