CLASSIFICATION

Instructor
Dr. Prashant Srivastava
INTRODUCTION
 Classification is a type of supervised learning where a target feature, which is of categorical type, is predicted for the test data on the basis of information imparted by the training data.

 The target categorical feature is known as the class.
CLASSIFICATION LEARNING STEPS
COMMON CLASSIFICATION ALGORITHMS
 k-nearest neighbour
 Logistic regression
 Decision tree
 Random forest
 Support Vector Machine
 Naïve Bayes classifier
k-NEAREST NEIGHBOUR
 A simple but extremely powerful classification algorithm.
k-NEAREST NEIGHBOUR
 The unknown, unlabelled data which comes for a prediction problem is judged on the basis of the training dataset elements which are similar to the unknown element.

 The class label of the unknown element is assigned on the basis of the class labels of the similar training dataset elements.
k-NEAREST NEIGHBOUR
 Consider a Student dataset.
k-NEAREST NEIGHBOUR
1. Leader- students having good communication skills as well as a good level of aptitude.

2. Speaker- students having good communication skills but a not-so-good level of aptitude.

3. Intel- students having not-so-good communication skills but a good level of aptitude.
k-NEAREST NEIGHBOUR
 Performance of the classification model is measured by the number of correct classifications made by the model when applied to an unknown dataset.

 If the class value predicted for most of the test data elements matches the actual class value they have, then we say that the classification model has good accuracy.
k-NEAREST NEIGHBOUR
 Two challenges-

1. What is the basis of similarity?

2. How many similar elements should be considered for deciding the class label of each test data element?
k-NEAREST NEIGHBOUR
 Measures of similarity.

 The most common approach adopted by kNN to measure similarity between two data elements is Euclidean distance.
k-NEAREST NEIGHBOUR
 To find out the closest or nearest neighbours of the test data point, the Euclidean distance of the different training data points (the dots in the figure) from the test data point (the asterisk) needs to be calculated.

 Then, the class value of the closest neighbours helps in assigning the class value of the test data element.
k-NEAREST NEIGHBOUR
 How many similar elements should be considered?

 The answer lies in the value of ‘k’, which is a user-defined parameter given as input to the algorithm.

 In the kNN algorithm, the value of ‘k’ indicates the number of neighbours that need to be considered.
k-NEAREST NEIGHBOUR
 For example, if the value of k is 3, only the three nearest neighbours, i.e. the three training data elements closest to the test data element, are considered.

 Out of these three data elements, the class which is predominant is assigned as the class label of the test data.
k-NEAREST NEIGHBOUR
Choosing the value of ‘k’
 If the value of k is very large, the class label of the majority class of the training dataset will be assigned to the test data regardless of the class labels of the neighbours nearest to the test data.

 If the value of k is very small, the class value of a noisy data point or outlier in the training dataset which happens to be the nearest neighbour of the test data will be assigned to the test data.
k-NEAREST NEIGHBOUR
 Then, what should ‘k’ be?

 The best value of ‘k’ lies somewhere between these two extremes.
k-NEAREST NEIGHBOUR
1. One common practice- set k equal to the square root of the number of training records.

2. Test several ‘k’ values on a variety of test datasets and choose the one that delivers the best performance.
k-NEAREST NEIGHBOUR
Algorithm-
1. To predict a class value for new data, calculate distance(X, Xi) for i = 1, 2, 3, ..., n, where X is the new data point, Xi is a training data point, and distance is your chosen distance metric.
2. Sort these distances in increasing order, keeping track of the corresponding training data.
3. From this sorted list, select the top ‘K’ rows.
4. Find the most frequent class among these chosen ‘K’ rows. This will be your predicted class.
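A minimal, self-contained sketch of these four steps (not the instructor's code; the student data is invented to mirror the Leader/Speaker/Intel example):

from collections import Counter
import math

def knn_predict(X_train, y_train, x_new, k):
    # Step 1: distance from the new point to every training point
    distances = [(math.dist(x_new, xi), label)
                 for xi, label in zip(X_train, y_train)]
    # Step 2: sort the distances in increasing order
    distances.sort(key=lambda pair: pair[0])
    # Step 3: take the top K rows
    top_k = distances[:k]
    # Step 4: the most frequent class among them is the prediction
    return Counter(label for _, label in top_k).most_common(1)[0][0]

# Illustrative student data: (communication, aptitude) -> class
X_train = [(8, 8), (8, 4), (3, 8), (7, 9), (9, 3), (2, 9)]
y_train = ["Leader", "Speaker", "Intel", "Leader", "Speaker", "Intel"]
print(knn_predict(X_train, y_train, (7, 8), k=3))   # -> "Leader"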
k-NEAREST NEIGHBOUR
Strengths-
1. Simple and easy to understand.

2. Effective in certain situations.

3. Almost no time required for the training phase.
k-NEAREST NEIGHBOUR
Weaknesses-
1. Does not learn anything in the real sense.

2. The classification process is very slow.

3. A large amount of computational space is required.
k-NEAREST NEIGHBOUR
 k-nearest neighbour is a ‘lazy learner’.

 What is lazy learning?
k-NEAREST NEIGHBOUR
 Lazy learning is a type of machine learning that doesn't process training data until it needs to make a prediction.

 Instead of building models during training, lazy learning algorithms wait until they encounter a new query.
k-NEAREST NEIGHBOUR
 This method stores and compares training examples when making predictions.

 It's also called instance-based or memory-based learning.
k-NEAREST NEIGHBOUR
 Lazy learning algorithms work by memorizing the training data rather than constructing a general model.

 When a new query is received, lazy learning retrieves similar instances from the training set and uses them to generate a prediction.

 The similarity between instances is usually calculated using distance metrics.
LOGISTIC REGRESSION
 A type of regression analysis used for predicting the outcome of a categorical variable.

 In logistic regression, the dependent variable (Y) is binary (0, 1) and the independent variables (X) are continuous in nature.
LOGISTIC REGRESSION
 The goal of logistic regression is to predict the likelihood that Y = 1 given a certain value of X.

 If X and Y have a strong positive linear relationship, the probability that a person will have a score of Y = 1 will increase as the values of X increase.

 So, we are predicting probabilities rather than the scores of the dependent variable.
LOGISTIC REGRESSION
 It’s essential to emphasize that logistic regression is not just a classification algorithm; it’s a method for estimating probabilities.

 Logistic regression is a powerful classification technique which estimates the likelihood of an input belonging to a particular class.

 This estimate is inherently a probability, which must be converted into a binary value (0 or 1) to make class predictions.
HOW LOGISTIC REGRESSION WORKS?
 The Logistic Regression algorithm works by implementing a linear equation with independent or explanatory variables to predict a response value.
 For example, consider the number of hours studied and the probability of passing the exam.
 Here, the number of hours studied is the explanatory variable, denoted by x1.
 The probability of passing the exam is the response or target variable, denoted by z.
HOW LOGISTIC REGRESSION WORKS?
 If we have one explanatory variable (x) and one response variable (z), then the linear equation is given mathematically by the following equation-

z = a + bx
SIGMOID FUNCTION
 This predicted response value, denoted by z, is then converted into a probability value that lies between 0 and 1.

 We use the sigmoid function in order to map predicted values to probability values.

 The sigmoid function maps any real value into a probability value between 0 and 1.
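The slides reference but do not print the formula; the standard logistic sigmoid is sigmoid(z) = 1 / (1 + e^(-z)). A small sketch tying it to the linear equation z = a + bx above, with made-up coefficients:

import math

def sigmoid(z):
    # Maps any real value z into the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# z = a + bx with illustrative coefficients
a, b = -4.0, 1.5
hours_studied = 4
z = a + b * hours_studied   # z = 2.0
print(sigmoid(z))           # ~0.88: estimated probability of passing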
SIGMOID FUNCTION
 The sigmoid function is used to map predictions to probabilities. It has an S-shaped curve.

 It is also called the sigmoid curve.

 The sigmoid function is a special case of the logistic function.
DECISION BOUNDARY
 The sigmoid function returns a probability value between 0 and 1.

 This probability value is then mapped to a discrete class, which is either “0” or “1”.
DECISION BOUNDARY
 In order to map this probability value to a discrete class (pass/fail, yes/no, true/false), we select a threshold value.

 This threshold value is called the Decision boundary.

 Above this threshold value, we map the probability values into class 1, and below it we map them into class 0.
DECISION BOUNDARY
Mathematically, it can be expressed as follows-

p ≥ 0.5 => class = 1
p < 0.5 => class = 0
DECISION BOUNDARY
 Generally, the decision boundary is set to 0.5.

 So, if the probability value is 0.8 (> 0.5), we map the observation to class 1.

 Similarly, if the probability value is 0.2 (< 0.5), we map the observation to class 0.
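Putting the linear equation, sigmoid, and decision boundary together, a hedged end-to-end sketch using scikit-learn's LogisticRegression (the hours-studied dataset is invented for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented training data: hours studied -> passed (1) or failed (0)
hours = np.array([[0.5], [1], [1.5], [2], [2.5], [3], [3.5], [4], [4.5], [5]])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(hours, passed)

p = model.predict_proba([[3.2]])[0, 1]   # estimated probability of passing
label = int(p >= 0.5)                    # decision boundary at 0.5
print(f"P(pass) = {p:.2f} -> class {label}")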
SUPPORT VECTOR MACHINE
 Support Vector Machine (SVM) is a model which can do linear classification as well as regression.

 SVM is based on the concept of a surface, called a hyperplane, which draws a boundary between data instances plotted in the multidimensional feature space.
SUPPORT VECTOR MACHINE
 The output prediction of an SVM is one of the two conceivable classes which are already defined in the training data.

 The SVM algorithm builds an N-dimensional hyperplane model that assigns future instances into one of the two possible output classes.
SUPPORT VECTOR MACHINE
 The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes, so that we can easily put new data points in the correct category in the future.

 This best decision boundary is called a hyperplane.
SUPPORT VECTOR MACHINE
 There may be many possible hyperplanes.

 One of the challenges with the SVM model is to find the optimal hyperplane.
SUPPORT VECTOR MACHINE
 There can be multiple hyperplanes to segregate the classes in n-dimensional space, but we need to find out the best hyperplane that helps to classify the data points.

 We always create the hyperplane that has the maximum margin, i.e. the maximum distance between the hyperplane and the nearest data points of either class.
SUPPORT VECTOR MACHINE
 The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed Support Vectors.

 Since these vectors support the hyperplane, they are called support vectors.
SUPPORT VECTOR MACHINE
Hyperplane and Margin-
 Mathematically, in a two-dimensional space, a hyperplane can be defined by the equation

c0 + c1x1 + c2x2 = 0
SUPPORT VECTOR MACHINE
 Extending this concept to an N-dimensional space, a hyperplane can be defined by the equation

c0 + c1x1 + c2x2 + ... + cNxN = 0

 In short, it can be represented as follows-

c . x + c0 = 0, where c = (c1, ..., cN) is the weight vector and x = (x1, ..., xN) is the feature vector.
SUPPORT VECTOR MACHINE
 The farther the data points lie from the hyperplane, the more confident we are about correct categorization.

 The distance between the hyperplane and the data points is known as the margin.
SUPPORT VECTOR MACHINE
Maximum Margin Hyperplane-
 Refers to identifying the hyperplane which has the largest separation from the data instances of the two classes.

 Though many different hyperplanes can do the correct classification, why do we need to search for the hyperplane causing the largest separation?
SUPPORT VECTOR MACHINE
 The answer is that doing so helps us achieve better generalization and hence fewer issues in the classification of unknown data.
SUPPORT VECTOR MACHINE
Identifying the MMH for linearly separable data-
 An outer boundary needs to be drawn for the data instances belonging to each class.

 These outer boundaries are known as convex hulls.

 The MMH can be drawn as the perpendicular bisector of the shortest line connecting the two convex hulls.
SUPPORT VECTOR MACHINE
 Find a set of values for the vector c such that two hyperplanes, represented by the equations below, can be specified-

c . x + c0 = +1
c . x + c0 = -1
SUPPORT VECTOR MACHINE
 All the instances that belong to one class fall above one hyperplane, and all the data instances belonging to the other class fall below the other hyperplane.

 According to vector geometry, the distance between these planes is 2 / ||c||.
SUPPORT VECTOR MACHINE
 In order to maximize the distance between the hyperplanes, the norm ||c|| of the vector c should be minimised.

 The task of SVM is to solve the optimization problem-

minimize (1/2) ||c||^2 subject to yi (c . xi + c0) ≥ 1 for every training instance (xi, yi), with yi in {+1, -1}
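In practice this optimization is solved by a library rather than by hand. A hedged sketch using scikit-learn's SVC with a linear kernel (the toy data is invented); coef_ and intercept_ expose the learned c and c0, and support_vectors_ lists the points closest to the hyperplane:

import numpy as np
from sklearn.svm import SVC

# Invented, linearly separable toy data in a 2-D feature space
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [6, 8]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A linear kernel learns a maximum-margin hyperplane c.x + c0 = 0
clf = SVC(kernel="linear")
clf.fit(X, y)

print(clf.coef_, clf.intercept_)   # the vector c and the bias c0
print(clf.support_vectors_)        # the points lying closest to the hyperplane
print(clf.predict([[3, 2]]))       # -> [-1]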
SUPPORT VECTOR MACHINE
Strengths-
1. Can be used for both classification and regression.

2. Not much impacted by noisy data or outliers.

3. Prediction using this model is promising.
SUPPORT VECTOR MACHINE
Weaknesses-
1. Applicable only for binary classification

2. Slow for large datasets

3. Quite memory-intensive
DECISION TREE
 One of the most widely adopted algorithms for classification.

 It builds the model in the form of a tree structure.

 Exceptionally productive.
DECISION TREE
 Used for multidimensional analysis with multiple classes.

 The goal of decision tree learning is to create a model, based on past data, that predicts the value of the output variable based on the input variables in the feature vector.
DECISION TREE
 Each node of a decision tree corresponds to one of the features in the feature vector.

 From every node there are edges to its children, with an edge for each of the possible values (or range of values) of the feature associated with the node.
DECISION TREE
 The tree terminates at leaf nodes, where each leaf node represents a possible value of the output variable.

 The output variable is determined by following the path that starts at the root and is guided by the values of the input variables.
BUILDING A DECISION TREE
 It starts from the root node, which is nothing but the entire dataset.

 It selects the feature which predicts the target class most strongly.
BUILDING A DECISION TREE
 The decision tree splits the dataset into multiple partitions, with the data in each partition having a distinct value of the feature on which the partitioning happened.

 This is the first set of branches.
BUILDING A DECISION TREE
 Likewise, the algorithm continues splitting nodes on the basis of the feature which gives the best partition.

 This continues till a stopping criterion is reached.
BUILDING A DECISION TREE
 The usual stopping criteria, mirrored in the sketch below, are-

1. All or most of the examples at a particular node have the same class
2. All features have been used up in the partitioning
3. The tree has grown to a pre-defined threshold limit
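A hedged illustration of how such stopping criteria surface as hyperparameters in scikit-learn's DecisionTreeClassifier (the parameter values are arbitrary; the iris dataset is used only as a stand-in):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each argument corresponds to a stopping criterion:
#   max_depth            -> pre-defined threshold limit on tree growth
#   min_samples_split    -> stop when a node has too few examples to split
#   criterion="entropy"  -> split on information gain (entropy decrease)
tree = DecisionTreeClassifier(criterion="entropy",
                              max_depth=3,
                              min_samples_split=5)
tree.fit(X, y)
print(tree.get_depth(), tree.score(X, y))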
BUILDING A DECISION TREE
 There are many implementations of the decision tree.

 The biggest challenge of a decision tree algorithm is to find out which feature to split upon.
BUILDING A DECISION TREE
 Data should be split in such a way that the partitions created by the split contain examples belonging to a single class.

 If that happens, the partitions are considered pure.
BUILDING A DECISION TREE
 Entropy is the measure of impurity of an attribute or feature.

 The information gain is calculated on the basis of the decrease in entropy (S) after a dataset is split according to a particular attribute (A).
BUILDING A DECISION TREE
 Constructing a tree is all about finding an attribute that returns the highest information gain.
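The formulas themselves are not shown in the slides; the standard definitions are Entropy(S) = -Σ pi log2(pi) over the class proportions pi, and Gain(S, A) = Entropy(S) - Σ (|Sv|/|S|) Entropy(Sv) over the partitions Sv produced by attribute A. A minimal sketch with invented labels:

import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(parent, partitions):
    # Gain = Entropy(parent) - weighted sum of partition entropies
    total = len(parent)
    remainder = sum(len(part) / total * entropy(part) for part in partitions)
    return entropy(parent) - remainder

# Invented example: a split that separates the classes fairly well
parent = ["yes"] * 9 + ["no"] * 5
left = ["yes"] * 8 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4
print(information_gain(parent, [left, right]))   # ~0.36 bits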
