Unit II
SUPERVISED LEARNING: Introduction, Linear methods for classification, Linear methods for
regression, Support Vector Machine, SVM- the dual formulation, SVM- the maximum margin
with noise, Decision trees, overfitting
Supervised learning is a type of machine learning in which machines are trained using well-labelled training data and, on the basis of that data, predict the output. Labelled data means that the input data is already tagged with the correct output.
In supervised learning, the training data provided to the machine works as a supervisor that teaches the machine to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.
Supervised learning is a process of providing input data as well as correct output data to the machine
learning model. The aim of a supervised learning algorithm is to find a mapping function to map the
input variable(x) with the output variable(y).
In the real-world, supervised learning can be used for Risk Assessment, Image classification, Fraud
Detection, spam filtering, etc.
In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on test data (a held-out portion of the dataset, kept separate from the training examples), and then it predicts the output.
The working of supervised learning can be easily understood from the above Figure 1.
Suppose we have a dataset of different types of shapes, which includes square, rectangle, triangle, and
Polygon. Now the first step is that we need to train the model for each shape.
If the given shape has four sides, and all the sides are equal, then it will be labelled as a Square.
If the given shape has three sides, then it will be labelled as a triangle.
If the given shape has six equal sides, then it will be labelled as a hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify the
shape.
The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of its number of sides and predicts the output.
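As a small illustrative sketch of this workflow in Python (the feature encoding, the training rows and the choice of classifier here are all assumptions made for illustration, not part of the original example):

from sklearn.tree import DecisionTreeClassifier

# Hypothetical encoding of each shape: [number_of_sides, all_sides_equal (1 = yes, 0 = no)]
X_train = [[4, 1], [4, 0], [3, 1], [3, 0], [6, 1]]
y_train = ["Square", "Rectangle", "Triangle", "Triangle", "Hexagon"]

model = DecisionTreeClassifier().fit(X_train, y_train)   # training phase
print(model.predict([[4, 1]]))                           # testing phase, expected ['Square']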
1.3.1 Classification
Classification algorithms are used when the output variable is categorical, meaning the output falls into discrete classes such as Yes-No, Male-Female, True-False, etc. Below are some popular classification algorithms which come under supervised learning:
➢ Logistic Regression
➢ Decision Trees
➢ Support vector Machines
➢ Random Forest
1.3.2 Regression
Regression algorithms are used if there is a relationship between the input variable and the output
variable. It is used for the prediction of continuous variables, such as Weather forecasting, Market
Trends, etc. Below are some popular Regression algorithms which come under supervised learning:
➢ Linear Regression
➢ Bayesian Linear Regression
➢ Regression Trees
➢ Non-Linear Regression
➢ Polynomial Regression
Advantages of Supervised Learning:
➢ With the help of supervised learning, the model can predict the output on the basis of prior experience.
➢ In supervised learning, we can have an exact idea about the classes of objects.
➢ Supervised learning model helps us to solve various real-world problems such as fraud
detection, spam filtering, etc.
Disadvantages of Supervised Learning:
➢ Supervised learning models are not suitable for handling complex tasks.
➢ Supervised learning cannot predict the correct output if the test data is different from the
training dataset.
➢ Training requires a lot of computation time.
➢ In supervised learning, we need enough knowledge about the classes of objects.
1.3.5 Classification Terminologies in Machine Learning
Classifier: It is an algorithm that is used to map the input data to a specific category.
Classification Model: The model draws conclusions from the input data given for training; it predicts the class or category for new data.
Feature: A feature is an individual measurable property of the phenomenon being observed.
Binary Classification: It is a type of classification with two outcomes, for eg – either true or false.
Multi-Class Classification: The classification with more than two classes, in multi-class classification
each sample is assigned to one and only one label or target.
Multi-label Classification: This is a type of classification where each sample is assigned to a set of
labels or targets.
Train the Classifier: Each classifier in scikit-learn uses the fit(X, y) method to fit the model to the training data X and the training labels y.
Predict the Target: For an unlabelled observation X, the predict(X) method returns the predicted label y.
Evaluate: This basically means the evaluation of the model i.e., classification report, accuracy score,
etc.
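These terminologies map directly onto code. A minimal sketch using scikit-learn (the iris dataset and logistic regression are chosen here purely for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

X, y = load_iris(return_X_y=True)                       # feature matrix and labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000)                 # the classifier
clf.fit(X_train, y_train)                               # train the classifier
y_pred = clf.predict(X_test)                            # predict the target
print(accuracy_score(y_test, y_pred))                   # evaluate: accuracy score
print(classification_report(y_test, y_pred))            # evaluate: classification report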
Classification algorithms can be further divided mainly into two categories:
Linear Models
➢ Logistic Regression
➢ Support Vector Machines
Non-linear Models
➢ K-Nearest Neighbours
➢ Kernel SVM
➢ Naïve Bayes
➢ Decision Tree Classification
➢ Random Forest Classification
Classification is the process of finding or discovering a model or function which helps in separating the data into multiple categorical classes, i.e. discrete values. In classification, data is categorized under different labels according to some input parameters, and then the labels are predicted for new data.
In machine learning, classification is a supervised learning concept which basically categorizes a set of
data into classes. The most common classification problems are – speech recognition, face detection,
handwriting recognition, document classification, etc. It can be either a binary classification problem
or a multi-class problem too. There are a bunch of machine learning algorithms for classification in
machine learning. Let us take a look at those classification algorithms in machine learning.
Figure 3: Binary Classification and Multiclass Classification
Types of Learners in Classification:
Lazy Learners: Lazy learners simply store the training data and wait until test data appears. The classification is done using the most related data in the stored training data. They take more prediction time compared to eager learners. Examples: k-nearest neighbour, case-based reasoning.
Eager Learners: Eager learners construct a classification model based on the given training data before
getting data for predictions. It must be able to commit to a single hypothesis that will work for the entire
space. Due to this, they take a lot of time in training and less time for a prediction. Ex – Decision Tree,
Naive Bayes, Artificial Neural Networks.
1.5 Linear Regression
In simple linear regression, the model fits a straight line, y = mx + b, to the data. By achieving the best-fit regression line, the model aims to predict the y value such that the error difference between the predicted value and the true value is minimum. So, it is very important to update the 'b' (intercept) and 'm' (slope) values to reach the values that minimize the error between the predicted y value (pred) and the true y value (y).
The cost function (J) of linear regression is the Root Mean Squared Error (RMSE) between the predicted y value (pred) and the true y value (y):
J = sqrt( (1/n) * Σ (predᵢ - yᵢ)² )
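A minimal gradient-descent sketch of these updates in Python (the toy data and learning rate below are assumptions for illustration only):

import numpy as np

# Toy data assumed for illustration: y is roughly 2x + 1 plus noise
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

m, b = 0.0, 0.0        # slope and intercept, initialised to zero
lr = 0.01              # learning rate (an assumption)
for _ in range(5000):
    pred = m * X + b
    error = pred - y
    # gradients of the mean squared error with respect to m and b
    m -= lr * (2 / len(X)) * np.dot(error, X)
    b -= lr * (2 / len(X)) * error.sum()

J = np.sqrt(np.mean((m * X + b - y) ** 2))   # RMSE cost after training
print(m, b, J)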
1.5.1 Multiple Linear Regression:
This same concept can be extended to cases where there are more than two variables. This is called
Multiple Linear Regression. For instance, consider a scenario where you have to predict the price of
the house based upon its area, number of bedrooms, the average income of the people in the area, the
age of the house, and so on. In this case, the dependent variable (target variable) is dependent upon
several independent variables. A regression model involving multiple variables can be represented as:
y = b0 + b1x1 + b2x2 + b3x3 + … + bnxn. This is the equation of a hyperplane.
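A brief sketch of the house-price scenario using scikit-learn's LinearRegression (the feature values and prices below are invented purely for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical houses: [area (sq. ft), bedrooms, average income (in thousands), age of house (years)]
X = np.array([[1100, 2, 55, 15],
              [1200, 2, 60, 10],
              [1500, 3, 65, 5],
              [1700, 3, 70, 8],
              [2000, 4, 80, 2]])
y = np.array([180000, 200000, 260000, 290000, 350000])   # invented prices

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)             # b0 and the coefficients b1..bn
print(model.predict([[1600, 3, 68, 6]]))         # price estimate for a new house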
1.5.2 Logistic Regression:
Logistic regression is a linear method, but the predictions are transformed using the logistic (sigmoid) function. The impact of this is that we can no longer interpret the predictions as a linear combination of the inputs as we can with linear regression. For example, continuing on from above, the model can be stated as:
p(X) = e^(b0 + b1*X) / (1 + e^(b0 + b1*X))
or, equivalently, in terms of the log-odds:
ln( p(X) / (1 - p(X)) ) = b0 + b1*X
Logistic regression is an estimation of the logit function and the logit function is simply a log of odds
in favor of the event.
Example for Logistic Regression:
Let’s say we have a model that can predict whether a person is male or female based on their height.
Given a height of 150cm is the person male or female.
If we are modeling people’s sex as male or female from their height, then the first class could be male
and the logistic regression model could be written as the probability of male given a person’s height,
or more formally:
P(sex=male|height)
Written another way, we are modeling the probability that an input (X) belongs to the default class
(Y=1), we can write this formally as:
P(X) = P(Y=1|X)
Note that the probability prediction must be transformed into a binary value (0 or 1) in order to actually make a class prediction.
We have learned the coefficients of b0 = -100 and b1 = 0.6. Using the equation above we can calculate
the probability of male given a height of 150cm or more formally P(male|height=150). We will use
exp() for e, because that is what you can use if you type this example into your spreadsheet: y = e^(b0
+ b1*X) / (1 + e^(b0 + b1*X))
y = exp(-100 + 0.6*150) / (1 + exp(-100 + 0.6*150)) = 0.0000453978687
or a probability of near zero that the person is male.
In practice we can use the probabilities directly. Because this is classification and we want a crisp
answer, we can snap the probabilities to a binary class value, for example:
predict 0 (female) if p(male) < 0.5
predict 1 (male) if p(male) >= 0.5
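Putting the worked example together as a short Python sketch (only the coefficients b0 = -100 and b1 = 0.6 come from the text; the rest is illustrative):

import math

b0, b1 = -100, 0.6        # coefficients from the worked example
height = 150

p_male = math.exp(b0 + b1 * height) / (1 + math.exp(b0 + b1 * height))
label = 1 if p_male >= 0.5 else 0      # snap the probability to a class value
print(p_male, label)                   # ~0.0000454, class 0 (female)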
1.5.3 Naïve Bayes:
Example: Consider a fictional dataset that describes the weather conditions for playing a game of golf. Given the weather conditions, each tuple classifies the conditions as fit ("Yes") or unfit ("No") for playing golf.
Outlook Temperature Humidity Windy Play Golf
Rainy Hot High False No
Rainy Hot High True No
Overcast Hot High False Yes
Sunny Mild High False Yes
Sunny Cool Normal False Yes
Sunny Cool Normal True No
Overcast Cool Normal True Yes
Rainy Mild High False No
Rainy Cool Normal False Yes
The dataset is divided into two parts, namely, the feature matrix and the response vector.
The feature matrix contains all the vectors (rows) of the dataset, in which each vector consists of the values of the independent features. In the above dataset, the features are ‘Outlook’, ‘Temperature’, ‘Humidity’ and ‘Windy’.
The response vector contains the value of the class variable (prediction or output) for each row of the feature matrix. In the above dataset, the class variable name is ‘Play golf’.
With relation to our dataset, this concept can be understood as:
We assume that no pair of features are dependent. For example, the temperature being ‘Hot’ has nothing
to do with the humidity or the outlook being ‘Rainy’ has no effect on the winds. Hence, the features
are assumed to be independent.
Secondly, each feature is given the same weight (or importance). For example, knowing the temperature and humidity alone cannot predict the outcome accurately. None of the attributes is irrelevant; each is assumed to contribute equally to the outcome.
So, using frequency tables, we calculate P(xi | yj) for each xi in X and yj in y manually. For example, the probability that the temperature is cool given that golf is played, i.e. P(temp. = cool | play golf = Yes), is 3/9.
Also, we need to find class probabilities (P(y)) which has been calculated in the table For example,
P(play golf = Yes) = 9/14.
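The same frequency-based reasoning can be sketched with scikit-learn's CategoricalNB (the ordinal encoding and the query row below are assumptions for illustration; only the nine data rows come from the table above):

from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import CategoricalNB

# The nine rows shown above: (Outlook, Temperature, Humidity, Windy) -> Play Golf
X = [["Rainy", "Hot", "High", "False"], ["Rainy", "Hot", "High", "True"],
     ["Overcast", "Hot", "High", "False"], ["Sunny", "Mild", "High", "False"],
     ["Sunny", "Cool", "Normal", "False"], ["Sunny", "Cool", "Normal", "True"],
     ["Overcast", "Cool", "Normal", "True"], ["Rainy", "Mild", "High", "False"],
     ["Rainy", "Cool", "Normal", "False"]]
y = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes"]

enc = OrdinalEncoder()
X_enc = enc.fit_transform(X).astype(int)      # categorical strings -> integer codes
clf = CategoricalNB().fit(X_enc, y)

query = enc.transform([["Sunny", "Cool", "High", "True"]]).astype(int)   # an assumed query day
print(clf.classes_, clf.predict_proba(query), clf.predict(query))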
1.5.4 K-Nearest Neighbor
It is a lazy learning algorithm that stores all the instances corresponding to the training data in n-dimensional space. It is called lazy because it does not focus on constructing a general internal model; instead, it simply stores the instances of the training data.
K nearest neighbors is a simple algorithm used for both classification and regression problems. It
basically stores all available cases and classifies a new case by a majority vote of its k nearest neighbours. The new case is assigned to the class that is most common amongst its K nearest neighbours, measured by a distance function. These distance functions can be the Euclidean, Manhattan, Minkowski or Hamming distance; the first three are used for continuous variables and the fourth one (Hamming) for categorical variables. If K = 1, then the case is simply assigned to the class of its nearest neighbour. At times, choosing K turns out to be a challenge while performing k-NN modelling.
As we can see the 3 nearest neighbors are from category A, hence this new data point must belong to
category A.
Classification is computed from a simple majority vote of the k nearest neighbors of each point. It is
supervised and takes a bunch of labelled points and uses them to label other points. To label a new
point, it looks at the labelled points closest to that new point also known as its nearest neighbors. It has
those neighbors vote, so whichever label the most of the neighbors have is the label for the new point.
The “k” is the number of neighbors it checks.
Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, but we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new data point that are most similar to the cat and dog images and, based on the most similar features, will put it in either the cat or the dog category.
Here all ‘cats’ belong to category A and all ‘dogs’ belong to category B; the ‘input value’ is the ‘New Data Point’.
Advantages of KNN Algorithm:
➢ It is simple to implement.
➢ It is robust to the noisy training data
➢ It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
➢ The value of K always needs to be determined, which may be complex at times.
➢ The computation cost is high because of calculating the distance between the data points for
all the training samples.
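A minimal KNN sketch in scikit-learn (the 2-D points and the value k = 3 below are assumptions chosen to mirror the category A / category B illustration):

from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2-D points: category A ("cats") vs category B ("dogs")
X = [[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]]
y = ["A", "A", "A", "B", "B", "B"]

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X, y)
print(knn.predict([[2, 2]]))    # its 3 nearest neighbours are all from A -> ['A']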
1.6 Support Vector Machine (SVM)
A Support Vector Machine is a supervised learning algorithm that can be used for both classification and regression challenges. However, it is mostly used in classification problems. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyper-plane that differentiates the two classes very well.
Identify the right hyper-plane:
The hyperplane that separates two classes in training data is
wx+b=0
It is also called the decision boundary (surface). There are many possible separating hyperplanes; which one should we choose? The distance between the hyperplane and the nearest data points of either class is called the margin. In the usual illustration with candidate hyperplanes A, B and C, the margin for hyper-plane C is high as compared to both A and B, so C is selected. Another compelling reason for selecting the hyper-plane with the higher margin is robustness: if we select a hyper-plane having a low margin, then there is a high chance of mis-classification. Sometimes a single point, such as one star at the other end, behaves like an outlier for the star class. SVM has a feature to ignore such outliers and find the hyper-plane that has the maximum margin. Hence, we can say SVM is robust to outliers.
SVM looks for the separating hyperplane with the largest margin. SVM is basically a binary classifier,
although it can be modified for multi-class classification as well as regression. Unlike logistic regression and neural network models, SVMs explicitly try to maximize the separation between the two classes of points.
A support-vector machine constructs a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space; this hyperplane separates the two classes in the best way possible.
Now to construct the OPTIMAL HYPERPLANE it takes the support of two other hyperplanes that are
parallel & equidistant from it on either side!
These two support hyperplanes lie on the most extreme points between the classes and are called
support-vectors.
Therefore, what we just need to do is to find the support hyperplanes (taken, for simplicity, to be w.x+b = +1 and w.x+b = -1) that have the maximum distance between them! From this, we can easily get the OPTIMAL HYPERPLANE.
This is simply called the Maximum Margin Hyperplane. The distance between the support hyperplanes
is called the Margin.
Hence, our goal is simply to find the Maximum Margin M. Using vector operations, we can find that (given the OPTIMAL HYPERPLANE (w.x+b=0)) the Margin is equal to:
M = 2 / ||w||
Observe here that our goal has been deduced to finding the optimal w from the equation w.x+b=0!
However, this also comes with a constraint: the points of a class must not lie between the two support hyperplanes! That can be mathematically represented as:
yᵢ(w.xᵢ + b) ≥ 1 for every training point (xᵢ, yᵢ), where yᵢ ∈ {+1, -1}
It means that there can be no points between the support hyperplanes, as shown in the figure, and this is known as the hard-margin formulation.
This is the big drawback of hard-margin SVMs: the two classes need to be fully separable, which is rarely the case in real-world datasets! This is where Soft Margin SVMs, which allow a few points to violate the margin in exchange for a penalty, come into play.
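A brief scikit-learn sketch of the soft-margin idea (the synthetic blobs and the two C values are assumptions for illustration; C is the penalty that trades margin width against violations):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping blobs, so the classes are not perfectly separable
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

# Small C tolerates more margin violations (softer margin);
# large C penalizes violations heavily and approaches hard-margin behaviour
soft = SVC(kernel="linear", C=0.1).fit(X, y)
hard_ish = SVC(kernel="linear", C=100).fit(X, y)
print(len(soft.support_), len(hard_ish.support_))   # number of support vectors in each case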
1.6.1 Deriving the Maximum Margin Equation
For deriving the objective function, we assume that the dataset is linearly separable. We know that the two support hyperplanes lie on the support vectors and there are no points between them. Hence, we can first consider the mid-hyperplane between these two support hyperplanes to be:
w.x + b = 0
And considering that the distance between this and the support hyperplanes is 1 (after scaling w and b), we get:
w.x + b = +1 and w.x + b = -1
Now let x₀ be a point on the support hyperplane w.x + b = -1 and z₀ the corresponding point on w.x + b = +1, where the vector k joining them is perpendicular to both hyperplanes (i.e. parallel to w) and has magnitude M. Writing z₀ = x₀ + M(w/||w||) and substituting into w.z₀ + b = +1 gives (w.x₀ + b) + M||w|| = +1, i.e. -1 + M||w|| = +1. Therefore,
M = 2 / ||w||
Maximizing the margin M is thus equivalent to minimizing
(1/2)||w||²   … (3)
subject to the constraint
yᵢ(w.xᵢ + b) ≥ 1 for all i   … (4)
To handle this constrained optimization, we form the Lagrangian by subtracting each constraint, weighted by a multiplier αᵢ ≥ 0, from the objective:
L(w, b, Λ) = (1/2)||w||² - Σᵢ αᵢ [ yᵢ(w.xᵢ + b) - 1 ]   … (5)
If you observe equation (5), it is just the objective function minus the weighted inequality constraints! The simple criterion αᵢ > 0 would suffice if the constraints were equality constraints. However, the above constraints are inequality constraints, so an additional set of conditions, called the Karush-Kuhn-Tucker (KKT) conditions, needs to be satisfied as well. We will go through those conditions later.
To find the optimum, just like before, we take the first-order derivatives and equate them to 0:
∂L/∂w = 0  ⟹  w = Σᵢ αᵢ yᵢ xᵢ
∂L/∂b = 0  ⟹  Σᵢ αᵢ yᵢ = 0
To find the optimal values, we can simply substitute these results back into (5), which gives:
W(Λ) = Σᵢ αᵢ - (1/2) Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ (xᵢ.xⱼ)
Here, W is called the objective functional, and it is a function of all the multipliers (α₁ … αₙ), represented collectively as Λ (capital Lambda).
This derivation was necessary to account for the inequality constraint [Eq. (4)]. Since it is now accounted for, the objective functional W is the new function that needs to be optimized instead of Eq. (3). This is called the DUAL FORMULATION because the initial objective function has been transformed: while Eq. (3) was minimized over w and b, W has to be maximized over the multipliers αᵢ (subject to αᵢ ≥ 0 and Σᵢ αᵢ yᵢ = 0).
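The dual view can be checked numerically with scikit-learn's SVC, whose dual_coef_ attribute stores the products αᵢyᵢ for the support vectors (the synthetic data below is an assumption for illustration):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, random_state=0)   # assumed toy data
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for each support vector, so the primal weight
# vector w = sum_i alpha_i * y_i * x_i can be recovered from the dual solution
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(w_from_dual)
print(clf.coef_)    # should match w_from_dual up to numerical precision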
1.7 Decision Trees
A decision tree is a tree-structured classifier in which internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome.
The tree is constructed in a top-down recursive divide and conquer approach. A decision node will have
two or more branches and a leaf represents a classification or decision. The topmost node in the decision
tree that corresponds to the best predictor is called the root node, and the best thing about a decision
tree is that it can handle both categorical and numerical data.
1.7.1 Decision Tree Terminologies:
Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into 2 or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after getting
a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to
the given conditions.
Branch/Sub-Tree: A sub-tree formed by splitting a node of the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are called its child nodes.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he should accept the offer or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, selected by an attribute selection measure, ASM). The root node splits further into the next decision node (distance from the office) and one leaf node based on the corresponding labels. The next decision node further splits into one decision node (cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes (Accepted offer and Declined offer). Consider the below diagram:
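A short sketch of this example with scikit-learn's DecisionTreeClassifier (the feature encoding, the rows and the labels below are invented for illustration; only the three attributes come from the example):

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical encoding: [salary (lakhs per annum), distance from office (km), cab facility (1/0)]
X = [[12, 5, 1], [12, 25, 0], [6, 5, 1], [15, 30, 1], [7, 20, 0], [14, 8, 0]]
y = ["Accept", "Decline", "Decline", "Accept", "Decline", "Accept"]

tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["salary", "distance", "cab"]))   # the learned rules
print(tree.predict([[13, 10, 1]]))                                      # decision for a new offer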
1.8.2 Overfitting:
A statistical model is said to be overfitted when, instead of learning only the underlying pattern, it starts learning from the noise and inaccurate data entries in the training data set. The model then does not categorize new data correctly, because of too much detail and noise. Overfitting is more likely with non-parametric and non-linear methods, because these types of machine learning algorithms have more freedom in building the model from the dataset and can therefore build unrealistic models. A solution to avoid overfitting is to use a linear algorithm if we have linear data, or to use parameters such as the maximal depth if we are using decision trees.
1.8.3 Techniques to Reduce Overfitting:
Cross-Validation: A standard way to estimate the out-of-sample prediction error is to use 5-fold cross-validation.
Early Stopping: Its rules provide guidance as to how many iterations can be run before the learner begins to over-fit.
Pruning: Pruning is used extensively while building tree-based models. It simply removes the nodes which add little predictive power for the problem at hand.
Regularization: It introduces a penalty term into the objective function for bringing in more features. Hence, it tries to push the coefficients of many variables towards zero and thus reduces model complexity.
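Two of these ideas, pruning (via a depth limit) and 5-fold cross-validation, can be sketched together in scikit-learn (the iris dataset and the depth values below are assumptions for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# An unconstrained tree can memorise the training data; limiting max_depth acts like
# pruning, and 5-fold cross-validation estimates the out-of-sample error for each setting
for depth in [None, 3, 2]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(tree, X, y, cv=5)
    print(depth, scores.mean())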
1.8.4 Good Fit in a Statistical Model:
Ideally, the case when the model makes predictions with 0 error is said to have a good fit on the data. This situation is achievable at a spot between overfitting and underfitting. In order to understand it, we have to look at the performance of our model over time, while it is learning from the training dataset.
With the passage of time, our model will keep on learning, and thus the error for the model on the training and testing data will keep on decreasing. If it learns for too long, the model will become more prone to overfitting due to the presence of noise and less useful details, and the performance of our model on the testing data will decrease. In order to get a good fit, we stop at a point just before the testing error starts increasing. At this point, the model is said to have good skill on the training dataset as well as on our unseen testing dataset.
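This monitoring idea (watching training and test error as learning proceeds, and stopping before the test error turns upward) can be sketched as follows; the synthetic dataset, the incremental SGD classifier and the 30-epoch budget are all assumptions for illustration:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier

# A noisy synthetic problem so that overfitting is possible
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SGDClassifier(random_state=0)
classes = np.unique(y_train)
for epoch in range(1, 31):
    clf.partial_fit(X_train, y_train, classes=classes)   # one more pass over the training data
    # watch both curves: stop at the point just before the test score starts to fall
    print(epoch, clf.score(X_train, y_train), clf.score(X_test, y_test))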