Module 4 - Classification (1)
WHAT IS CLASSIFICATION?
Classification is a type of supervised machine learning consisting of a two-step method: training a model on a labeled dataset, and then testing the model on a small batch of held-out data to estimate how well it performs on previously unseen data (that is, how well it predicts the class of each instance). The dataset contains a class for each instance (which we call its label), and we use a machine learning algorithm to train the model. Training is how the machine gains insight into how the underlying patterns in the data are associated with the class each instance belongs to. After training a model, we test it on a small sample of labeled data that was not used during training. Testing gives us a fair idea of how our model will perform.
One common strategy for handling more than two classes is One-vs-Rest (OvR), in which the multi-class classification problem is broken down into one binary classification problem per class; for three classes, that means three binary problems, each separating one class from all the others. The disadvantage of this approach is that it requires an individual binary classification model for each and every class we try to predict. This heuristic method will also be slow if the number of instances in the dataset is very large or the number of possible classes is huge.
The end product of each binary classification problem is a probability score, which denotes the likelihood that the instance belongs to that class. Finally, the argmax (maximum of all) of these scores is used to predict the class of the instance.
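As a minimal sketch of OvR in scikit-learn (assuming the X_train, y_train, and X_test variables prepared in the data-loading snippet later in this module; LogisticRegression is used purely as an example base classifier):
## One-vs-Rest: one binary model per class
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_train, y_train)

## predict() internally takes the argmax of the per-class scores
ovr_predictions = ovr.predict(X_test)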
The other common strategy is One-vs-One (OvO), in which one binary classification model is trained for every pair of classes. There is a straightforward formula to calculate how many models this requires for n classes:
(n * (n - 1)) / 2
So, for three classes, there will be three binary classification models. The downside of this method is that, compared to OvR, many more models have to be prepared as the number of classes grows, and hence it is slower. The final class label is predicted by counting, for each class, the number of binary decisions it wins in the OvO method and taking the majority vote.
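A corresponding sketch with scikit-learn's OneVsOneClassifier, under the same assumptions about the train and test variables:
## One-vs-One: one binary model per pair of classes,
## i.e. (n * (n - 1)) / 2 models in total
from sklearn.multiclass import OneVsOneClassifier
from sklearn.linear_model import LogisticRegression

ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000))
ovo.fit(X_train, y_train)

## the final label is the class that wins the most pairwise votes
ovo_predictions = ovo.predict(X_test)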
CLASSIFICATION ALGORITHMS
There are multiple algorithms that can perform classification, and they differ from each other in their underlying mathematical approach; ultimately, however, all of them work towards answering the same decision problem. Below we discuss some popular algorithms that can be used to classify real-world data. For each algorithm, we first introduce it, then give a brief theoretical explanation, and finally show a code snippet to implement it in Python.
The code snippet given below loads the dataset and divides it into the train and test sets. The dataset we use here is about car evaluation [1] and has four output classes based on the predictor attributes. Since all the features are string variables, we need to convert them to valid integer values.
## importing libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

## loading data (file holds the path to the car evaluation CSV)
df = pd.read_csv(file)

## in y we store the classes for each instance
## in X we take all the attributes/features for each instance
y = df['class']
X = df.iloc[:, :-1]

## the features are strings, so encode each column as integers
X = X.apply(LabelEncoder().fit_transform)

## splitting into train (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
[1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
KNN
KNN stands for k-nearest neighbours, and it is mathematically the simplest of these algorithms. In the KNN training phase, all data-points from the training data are simply stored (plotted) in n-dimensional feature space together with their respective class labels. During the testing phase, a new data-point is plotted in the feature space, and the Euclidean distance is calculated from the new data-point to each of the already labeled data-points. The new data-point is then assigned a class based on the k labeled data-points nearest to it (its k nearest neighbors).
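For reference, the Euclidean distance between a new point a and a labeled point b in n-dimensional feature space is:
d(a, b) = sqrt( (a1 - b1)^2 + (a2 - b2)^2 + ... + (an - bn)^2 )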
[Figure: KNN example, a new data-point (green circle) among labeled red triangles and blue squares. Source: Wikipedia.org]
Let us take a look at the image above for a better understanding. The new unlabeled data-point (of the kind we encounter in the testing phase) is marked as the green circle. Based on the value of k, the green circle will be assigned a class. If the value of k is three, the green circle will be classified as a red triangle (two of its three nearest neighbors are red triangles), and if the value of k is five, the class assigned will be blue square (three of its five nearest neighbors are blue squares).
The code for classification using KNN in Python is given below:
## importing libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

## training the model (k = 5, the scikit-learn default)
KNN = KNeighborsClassifier(n_neighbors=5)
KNN.fit(X_train, y_train)

## predicting labels with our model
prediction_KNN = KNN.predict(X_test)
print(metrics.classification_report(y_test, prediction_KNN))
Naïve Bayes
Naïve Bayes uses Bayes' theorem of conditional probability (with a "naïve" assumption of conditional independence between every pair of features). Let us assume that we have a red fruit with a diameter of around 3 inches. Under this assumption, every feature of the fruit is treated as independent of the others: each feature individually contributes to the recognition of the fruit, and there is no need to relate the features to each other to find the type of fruit. Naïve Bayes models are fast and easy to build and work very well for large datasets with high-dimensional feature sets.
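Concretely, for a class c and features f1, ..., fn, the independence assumption lets the probability be written as a simple product (this is the standard naïve Bayes rule, not something specific to our dataset):
P(c | f1, ..., fn) ∝ P(c) * P(f1 | c) * P(f2 | c) * ... * P(fn | c)
The class with the highest resulting score is chosen as the prediction.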
The code for classification using Naïve Bayes in Python is given below:
## importing libraries
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

## training the model
NB = GaussianNB()
NB.fit(X_train, y_train)

## predicting labels with our model
prediction_NB = NB.predict(X_test)
print(metrics.classification_report(y_test, prediction_NB))

The output of the last print statement will be a table with one row of these scores per class:
precision recall f1-score support
MLP
MLP stands for Multi-Layer Perceptron, and to understand how this algorithm works, you will need to know what a perceptron is. A perceptron is a simplified model of a human neuron (a nerve cell) and acts as a simple binary classifier: the input features are combined with weights in a linear function, and the result is used for classification. A multi-layer perceptron is a combination of layers of perceptrons (many perceptrons organized in multiple layers) with at least one hidden layer, created to perform multi-class classification. It is never about recreating a structure as complex as the human brain, but rather about understanding how the decision-making process works in the human brain (natural neural networks) and using the same logic to perform predictive analysis (using artificial neural networks).
The inputs to a perceptron are the feature values, the weights of the input features, and a bias. The weighted sum of the inputs is passed through an activation function to generate an output. The activation function is generally a sigmoid function, which outputs a value in the range 0 to 1, or a hyperbolic tangent function (called tanh), which outputs a value ranging between -1 and +1. The weights are generally initialized to small float values between 0 and 0.3, and the bias to 1.
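As a minimal sketch of a single perceptron step (the example feature values below are made up; the weights and bias follow the initialization conventions just mentioned):
## a single perceptron step: weighted sum plus bias, then activation
import numpy as np

def sigmoid(z):
    ## squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

features = np.array([0.5, 1.2, 3.0])    ## example feature inputs
weights = np.array([0.1, 0.25, 0.05])   ## small initial weights
bias = 1.0                              ## initial bias

output = sigmoid(np.dot(weights, features) + bias)
## an output close to 1 indicates one class, close to 0 the other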
The code for classification using an MLP in Python is given below:
## importing libraries
from sklearn.neural_network import MLPClassifier
from sklearn import metrics

## training the model (one hidden layer of 100 perceptrons,
## the scikit-learn default)
MLP = MLPClassifier(max_iter=500)
MLP.fit(X_train, y_train)

## predicting labels with our model
prediction_MLP = MLP.predict(X_test)
print(metrics.classification_report(y_test, prediction_MLP))

The output of the last print statement will be a table with one row of these scores per class:
precision recall f1-score support
SVM
SVM stands for Support Vector Machine, an algorithm that classifies data-points by finding the hyperplane that best separates the classes in the feature space.
[Figure: three candidate hyperplanes H1, H2, and H3 drawn through two classes of data-points. Source: Wikipedia.org]
The image above shows how hyperplanes are drawn in a 2-dimensional XY-plane. H1 is not a good example of a hyperplane, as it does not separate the data-points properly. H2 separates the data-points but leaves only a very small gap (margin) to the nearest black and white points, so the chances of error are very high. H3 is the best-suited hyperplane for separating the data-points, as it maximizes this margin.
The code for classification using SVM in Python is given below:
## importing libraries
from sklearn.svm import SVC
from sklearn import metrics

## training the model (with the default RBF kernel)
SVM = SVC()
SVM.fit(X_train, y_train)

## predicting labels with our model
prediction_SVM = SVM.predict(X_test)
print(metrics.classification_report(y_test, prediction_SVM))
The output of the last print statement will be a table with one row of these scores per class:
precision recall f1-score support
EVALUATION METRICS
Classification is done on a labeled dataset, where the dataset is split into train and test sets in a ratio of approximately 4:1 (80% training, 20% test). In some cases the output labels come as part of the original dataset; in most cases, however, all instances are assigned a class by human annotators, for example, the sentiment of tweets or the rating of movies. These annotations are considered the ground truth, and the labels generated by our machine learning models are compared to the human-generated truth values to compute various metrics. To understand evaluation metrics, you must first understand a few essential terms. Let us take an example of a binary classification problem where you have to assign each instance to either a Positive or a Negative class. Let us look at the table given below:

                   Predicted Positive   Predicted Negative
Actual Positive            48                    2
Actual Negative             6                   44
This table is known as the Confusion Matrix. The rows represent the ground truth, and the columns represent the predictions made by our classifier. From the table, we can see that our model has correctly predicted 48 Positive instances as Positive (true positives) and 44 Negative instances as Negative (true negatives). There are two instances of Positive which are wrongly classified as Negative (false negatives) and six instances of Negative which are wrongly classified as Positive (false positives).
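These four counts are exactly what the scores reported by classification_report are computed from. Using the numbers in this confusion matrix (100 instances in total):
accuracy = (TP + TN) / total = (48 + 44) / 100 = 0.92
precision = TP / (TP + FP) = 48 / (48 + 6) ≈ 0.89
recall = TP / (TP + FN) = 48 / (48 + 2) = 0.96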