
MODULE 4: CLASSIFICATION

WHAT IS CLASSIFICATION?
Classification is a type of supervised machine learning that follows a two-step process: training a model on a labeled dataset and then testing it on a held-out batch of data to estimate how well it performs on previously unseen data (that is, how well it predicts the class of each instance). Each instance in the dataset carries a class (which we call its label), and we use a machine learning algorithm to train the model. Training is how the machine learns how the underlying patterns in the data are associated with the class an instance belongs to. After training a model, we test it on a small sample of labeled data that was not used for training; testing gives us a fair idea of how our model will perform.

BINARY AND MULTI-CLASS CLASSIFICATION


A classification problem can be either binary or multi-class. In binary classification there are only two output classes, and each instance in the dataset belongs to one of them. Multi-class classification is a bit more complicated because the set of classes to which an instance can belong contains more than two members. Many efficient algorithms such as Support Vector Machine (SVM), Logistic Regression (LR), and Perceptrons are designed for binary classification and can be extended to multi-class classification using heuristics. There are two common strategies for doing that:

One v/s Rest:


In Python, this heuristic method is usually referred to as OvR. The method splits the multi-class classification problem into a series of binary comparisons in which one class is set against the union of all remaining classes. Let us understand this using sentiment detection, which involves three classes: positive, negative, and neutral. In the OvR method, the classification process involves decisions that look like this:

[Positive] v/s [Negative, Neutral]? (or Is Positive?)

[Negative] v/s [Positive, Neutral]? (or Is Negative?)

[Neutral] v/s [Positive, Negative]? (or Is Neutral?)

So the multi-class classification problem is broken down into three binary classification problems. The disadvantage of this approach is that it requires a separate binary classification model for every class we want to predict, so it becomes slow when the number of instances in the dataset is very large or the number of possible classes is huge.

Each binary classification problem produces a probability score that denotes the likelihood of the instance belonging to that class. Finally, the argmax (maximum of all) of these scores is used to predict the class of the instance.
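In scikit-learn, this strategy can be applied to any binary estimator through the OneVsRestClassifier wrapper. Below is a minimal sketch, using a tiny made-up three-class dataset purely for illustration:

## a minimal OvR sketch using scikit-learn's OneVsRestClassifier
## (the tiny three-class dataset below is only illustrative)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0], [4.0, 5.0], [5.0, 4.0]])
y = np.array([0, 0, 1, 1, 2, 2])   ## e.g. 0 = positive, 1 = negative, 2 = neutral

## one logistic regression model is fitted per class (that class vs the rest)
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X, y)

## each instance gets the class whose "is it this class?" score is highest (argmax)
print(ovr.predict([[3.0, 3.0]]))

Note that LogisticRegression in scikit-learn can already handle multi-class data on its own; the explicit wrapper is used here only to make the OvR decomposition visible.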

One v/s One:


More popularly known as the OvO method. Like OvR, this method also splits the multi-class classification problem into multiple binary classification problems. However, in this method each binary problem compares exactly one class with exactly one other class. Let us extend the previous sentiment-detection example with the three classes Positive, Negative, and Neutral. The binary classification problems created by the OvO method look like this:

[Positive] v/s [Negative]?

[Positive] v/s [Neutral]?

[Negative] v/s [Neutral]?

So, as you can see, for three classes there are three binary classification models. There is a straightforward formula to calculate this:

(n * (n – 1)) / 2

(which is the number of all possible pairs of classes, given by nC2)

The downside of this method is that, compared to OvR, many more models have to be prepared as the number of classes grows, and hence it is slower. The final class label is the one that wins the largest number of the pairwise decisions made by the OvO models (majority voting).
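Scikit-learn also provides a ready-made wrapper for this strategy, OneVsOneClassifier. A minimal sketch, using the same illustrative three-class data as in the OvR example above:

## a minimal OvO sketch using scikit-learn's OneVsOneClassifier
import numpy as np
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0], [4.0, 5.0], [5.0, 4.0]])
y = np.array([0, 0, 1, 1, 2, 2])

## one SVC model is fitted for every pair of classes: n * (n - 1) / 2 models in total
ovo = OneVsOneClassifier(SVC())
ovo.fit(X, y)

## the predicted class is the one that wins the most pairwise comparisons (majority voting)
print(ovo.predict([[3.0, 3.0]]))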
CLASSIFICATION ALGORITHMS
There are multiple algorithms that can perform classification, and they differ from each other in their underlying mathematical approach; in the end, however, all of them work towards answering the same decision problem. Below we discuss some popular algorithms that can be used to classify real-world data. For each algorithm we first give a brief theoretical explanation and then a code snippet to implement the algorithm in Python.

The code snippet below loads the dataset and divides it into train and test sets. The dataset used here is about car evaluation 1 and has four output classes based on the predictor attributes. Since all the features are string variables, we need to convert them to valid integer values.

## library imports required
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import LabelEncoder

## loading data (file holds the path to the car evaluation CSV)
df = pd.read_csv(file)
## in y we store the class for each instance
## in X we take all the attributes/features for each instance
y = df['class']
X = df.iloc[:, :-1]

## splitting X and y into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

## converting the features from string values to integer values
feature_cols = ["buying", "maint", "doors", "persons", "lug_boot", "safety"]
ord_enc = OrdinalEncoder()
X_train[feature_cols] = ord_enc.fit_transform(X_train[feature_cols])
X_test[feature_cols] = ord_enc.transform(X_test[feature_cols])

## converting the labels to valid integer values
lab_enc = LabelEncoder()
y_train = lab_enc.fit_transform(y_train)
y_test = lab_enc.transform(y_test)

1 Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
KNN
KNN stands for K-Nearest Neighbours, and it is mathematically the simplest of these algorithms. In the KNN training phase, all data-points from the training data are plotted in an n-dimensional feature space together with their class labels. During the testing phase, a new data-point is plotted in the same feature space, and the Euclidean distance from the new data-point to each of the already labeled data-points is calculated. The new data-point is then assigned a class based on the k labeled data-points nearest to it (its k nearest neighbours).

Figure 1: Example of K-NN (source: Wikipedia.org)

Let us take a look at the image above for a better understanding. The new unlabelled data-point (which we encounter in the testing phase) is marked as the green circle. Based on the value of k, the green circle is assigned a class. If the value of k is three, the green circle is classified as a red triangle (two of the three nearest neighbours are red triangles), and if the value of k is five, the assigned class is blue square (three of the five nearest neighbours are blue squares).
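As a side note, in scikit-learn the value of k is controlled by the n_neighbors parameter of KNeighborsClassifier; the snippet further below simply keeps its default of 5. For example:

from sklearn.neighbors import KNeighborsClassifier

## k = 3: the class is decided by the three nearest labeled points
KNN_3 = KNeighborsClassifier(n_neighbors=3)
## k = 5: the class is decided by the five nearest labeled points
KNN_5 = KNeighborsClassifier(n_neighbors=5)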

The code snippet for KNN is given below:

## importing the required libraries


from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

## declaring our model


KNN = KNeighborsClassifier()

## training the model with the training dataset

KNN.fit(X_train, y_train)

## predicting the labels with our model


prediction_KNN = KNN.predict(X_test)

## printing the evaluation metrics for our model


print(metrics.classification_report(y_test, prediction_KNN))

The output of our print statement will look like this:

              precision    recall  f1-score   support

           0       0.81      0.76      0.78        79
           1       0.71      0.50      0.59        10
           2       0.92      0.98      0.95       247
           3       1.00      0.30      0.46        10

    accuracy                           0.90       346
   macro avg       0.86      0.63      0.70       346
weighted avg       0.89      0.90      0.89       346

Naïve Bayes
Naïve Bayes uses Bayes' theorem of conditional probability, with an assumption of conditional independence between every pair of features. Let us assume that we have a red fruit with a diameter of around 3 inches. Every feature of the fruit is treated as independent of the others; each one individually contributes to recognizing the fruit, and there is no need to relate the features to each other to find the type of fruit. Naïve Bayes models are fast and easy to build and work very well on large datasets with high-dimensional feature sets.
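With the independence assumption, the rule the classifier applies can be written compactly as a product of per-feature probabilities (the class name "apple" below is only an illustrative assumption for the fruit example):

P(class | x1, x2, …, xn) ∝ P(class) * P(x1 | class) * P(x2 | class) * … * P(xn | class)

So, for instance, P(apple | red, diameter ≈ 3 in) is proportional to P(apple) * P(red | apple) * P(diameter ≈ 3 in | apple), and the predicted class is the one with the highest such score.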

The code for classification using Naïve Bayes in Python is given below:

## importing libraries
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

## declaring our model


GNB = GaussianNB()

## training our model


GNB.fit(X_train, y_train)

## predicting labels with our model


prediction_GNB = GNB.predict(X_test)
## printing evaluation metrics for our model
print(metrics.classification_report(y_test, prediction_GNB))

The output of the last print statement will look like this:

              precision    recall  f1-score   support

           0       0.47      0.11      0.18        79
           1       0.00      0.00      0.00        10
           2       0.86      0.77      0.81       247
           3       0.10      1.00      0.17        10

    accuracy                           0.60       346
   macro avg       0.36      0.47      0.29       346
weighted avg       0.72      0.60      0.63       346

MLP
MLP stands for Multi-Layer Perceptron, and to understand how this algorithm works you first need to know what a perceptron is. A perceptron is a simplified model of a human neuron (a nerve cell) and acts as a simple binary classifier: the input features are multiplied by their weights and combined into a linear function, whose value is then used to make the classification decision. A multi-layer perceptron is a combination of layers of perceptrons (many perceptrons organized in multiple layers) with at least one hidden layer, and it can perform multi-class classification. The goal is never to create a structure as complex as the human brain, but rather to understand how the decision-making process works in the brain (natural neural networks) and to use the same logic for predictive analysis (using artificial neural networks).

The inputs to a perceptron are the feature values, the weights of the input features, and a bias. The weighted inputs are passed through an activation function to generate an output. The activation function is generally a sigmoid function, which outputs a value in the range 0 to 1, or a hyperbolic tangent function (called tanh), which outputs a value ranging between -1 and +1. The weights are generally initialized to small float values between 0 and 0.3, and the bias to 1.
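Putting the pieces of the previous paragraph together, a single perceptron can be sketched in a few lines of plain Python (an illustrative toy, not part of scikit-learn's MLPClassifier):

import math

## a single perceptron: a weighted sum of the inputs plus a bias,
## passed through a sigmoid activation (output between 0 and 1)
def perceptron(inputs, weights, bias):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

## example: two features, two small weights, and a bias of 1
print(perceptron([0.5, 0.2], [0.1, 0.3], 1.0))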

The code for classification using an MLP is given below:

## importing libraries
from sklearn.neural_network import MLPClassifier
from sklearn import metrics

## declaring our model


clf = MLPClassifier(random_state=1, max_iter=300)

## training our model


clf.fit(X_train, y_train)

## predicting labels with our model


prediction_MLP = clf.predict(X_test)

## printing evaluation metrics of our model


print(metrics.classification_report(y_test, prediction_MLP))

The output of the last print statement will look like this:

              precision    recall  f1-score   support

           0       0.89      0.96      0.93        79
           1       0.77      1.00      0.87        10
           2       0.99      0.96      0.97       247
           3       1.00      0.80      0.89        10

    accuracy                           0.96       346
   macro avg       0.91      0.93      0.91       346
weighted avg       0.96      0.96      0.96       346
SVM
SVM, or Support Vector Machine, is a fast and dependable algorithm that performs very well on datasets of limited size. SVM is a non-probabilistic classifier that plots the labeled data in an n-dimensional space and tries to separate the classes of data-points with a hyperplane. New data-points are plotted in the same space, and the class of a new data-point is assigned based on which side of the hyperplane the point falls on.

Figure 2: SVM hyperplanes (source: Wikipedia.org)

The image above shows how hyperplanes are drawn in a 2-dimensional XY-plane. H1 is not a good hyperplane because it does not separate the data-points properly. H2 separates the data-points but leaves only a very small gap to the nearest black and white points (hence the chance of error is very high). H3 is the best-suited hyperplane for separating the data-points, as it keeps the widest margin to both classes.
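One practical note: scikit-learn's SVC (used in the snippet below) does not fit a plain linear hyperplane by default; its kernel parameter defaults to 'rbf'. A hyperplane like H3 in the figure corresponds to a linear kernel, which can be requested explicitly:

from sklearn.svm import SVC

## a linear kernel looks for a separating hyperplane such as H3,
## choosing the one with the widest margin to the nearest points
linear_svm = SVC(kernel='linear')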

The Python code for classification using SVM is given below:

## importing libraries
from sklearn.svm import SVC
from sklearn import metrics

## declaring our model


SVM = SVC()

## training our model


SVM.fit(X_train, y_train)

## predicting labels with our model
prediction_SVM = SVM.predict(X_test)

## printing evaluation metric of our model


print(metrics.classification_report(y_test, prediction_SVM))

The output of the last print statement will look like this:

              precision    recall  f1-score   support

           0       0.85      0.77      0.81        79
           1       0.83      0.50      0.62        10
           2       0.93      0.98      0.95       247
           3       1.00      1.00      1.00        10

    accuracy                           0.92       346
   macro avg       0.90      0.81      0.85       346
weighted avg       0.91      0.92      0.91       346

EVALUATION METRICS
Classification is done on a labeled dataset that is split into train and test sets in a ratio of approximately 4:1 (80% training, 20% test). The labels come with the original dataset; in most cases they were assigned by human annotators, for example the sentiment of tweets or the rating of movies. These annotations are treated as the ground truth, and the labels generated by our machine learning models are compared against these human-generated truth values to compute various metrics. To understand evaluation metrics, you must first understand a few essential terms. Let us take the example of a binary classification problem where each instance has to be assigned to either a Positive or a Negative class, and look at the table given below:

                      Positive (predicted)   Negative (predicted)
Positive (actual)              48                      2
Negative (actual)               6                     44

This table is known as the Confusion Matrix. The rows represent the ground truth, and the columns represent the predictions of our classifier. From the table we can see that our model has correctly predicted 48 Positive instances as Positive and 44 Negative instances as Negative. Two Positive instances are wrongly classified as Negative, and six Negative instances are wrongly classified as Positive.

                      Positive (predicted)     Negative (predicted)
Positive (actual)     True Positive (TP)       False Negative (FN)
Negative (actual)     False Positive (FP)      True Negative (TN)
True Positives are the correctly classified Positive instances, and True Negatives are the correctly classified Negative instances.
False Positives are Negative instances that are wrongly classified as Positive, and False Negatives are Positive instances that are wrongly classified as Negative.
The first and most basic evaluation metric is Accuracy, the fraction of instances that the model predicts correctly. The formula for accuracy is:
Accuracy = Number of Correct Predictions / Total Number of Instances
For binary classification, we can also write the formula as:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision is another important metric: it measures what proportion of the instances labeled Positive by the model actually are Positive, i.e., how trustworthy the model's Positive predictions are. The formula for Precision is:
Precision = TP / (TP + FP)
Recall is the ratio of correctly classified Positive instances to the total number of actual Positive instances in the dataset; it tells us what proportion of the true Positive labels our classifier manages to find. The formula for Recall is:
Recall = TP / (TP + FN)
F-score is another important metric that acts as a balance between Precision and Recall. It is
mathematically expressed as the harmonic mean of Precision and Recall. The formula for F-score
can be given as follows:
F-score = 2 * (Precision * Recall) / (Precision + Recall)
This can also be expressed as:
F-score = TP / [TP + 0.5 (FP + FN)]
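Plugging the example confusion matrix from above (TP = 48, FN = 2, FP = 6, TN = 44) into these formulas gives:
Accuracy = (48 + 44) / (48 + 44 + 6 + 2) = 92 / 100 = 0.92
Precision = 48 / (48 + 6) = 48 / 54 ≈ 0.89
Recall = 48 / (48 + 2) = 48 / 50 = 0.96
F-score = 2 * (0.89 * 0.96) / (0.89 + 0.96) ≈ 0.92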
You can calculate any of these metrics from the confusion matrix of your model. In Python, you can obtain the matrix using the following code:

from sklearn.metrics import confusion_matrix

## y_true is the ground truth, and y_pred is the predicted labels


confusion_matrix(y_true, y_pred)
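Each of the metrics discussed above also has its own helper function in scikit-learn, in case you want the individual numbers rather than the full report. A short sketch, reusing the same y_true (ground truth) and y_pred (predicted labels) as above:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))   ## for multi-class labels pass e.g. average='macro'
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))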
