Chapter 3 Notes
Supervised learning is the type of machine learning in which machines are trained using well-"labelled" training data, and on the basis of that data, machines predict the output. In supervised learning, the training data provided to the machine works as a supervisor that teaches the machine to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.
Supervised learning is a process of providing input data as well as correct output data to the machine learning model. The aim of a supervised learning algorithm is to find a mapping function that maps the input variable (x) to the output variable (y).
In the real world, supervised learning can be used for risk assessment, image classification, fraud detection, spam filtering, etc.
In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on the basis of test data (a held-out subset of the dataset), and then it predicts the output.
The working of Supervised learning can be easily understood by the below example and
diagram:
Suppose we have a dataset of different types of shapes, which includes squares, rectangles, triangles, and polygons. Now the first step is that we need to train the model for each shape.
o If the given shape has four sides, and all the sides are equal, then it will be labelled as a Square.
o If the given shape has three sides, then it will be labelled as a Triangle.
o If the given shape has six equal sides, then it will be labelled as a Hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify the shape.
The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of its number of sides and predicts the output.
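The idea can be made concrete with a few lines of code. Below is a minimal sketch of the shape example, assuming scikit-learn is available; the single feature (number of sides) and the label strings are illustrative choices, not part of the original example.

```python
# A minimal supervised-learning sketch for the shape example,
# assuming scikit-learn is installed. The feature encoding
# (number of sides) is an illustrative assumption.
from sklearn.tree import DecisionTreeClassifier

# Labelled training data: each sample is [number_of_sides].
X_train = [[4], [4], [3], [3], [6], [6]]
y_train = ["square", "square", "triangle", "triangle", "hexagon", "hexagon"]

model = DecisionTreeClassifier()
model.fit(X_train, y_train)        # training under "supervision"

# Test with a new, unseen shape described by its number of sides.
print(model.predict([[3]]))        # -> ['triangle']
```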
1. Regression
Regression algorithms are used if there is a relationship between the input variable and the
output variable. It is used for the prediction of continuous variables, such as Weather forecasting,
Market Trends, etc. Below are some popular Regression algorithms which come under
supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, which means there are two or more classes such as Yes-No, Male-Female, True-False, etc.; spam filtering is a typical example. Below are some popular classification algorithms which come under supervised learning:
o Random Forest
o Decision Trees
o Logistic Regression
o Support Vector Machines
Advantages of Supervised learning:
o With the help of supervised learning, the model can predict the output on the basis of prior experience.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning models help us to solve various real-world problems such as fraud detection, spam filtering, etc.
Unsupervised learning is a type of machine learning in which models are trained using an unlabeled dataset and are allowed to act on that data without any supervision.
Example: Suppose the unsupervised learning algorithm is given an input dataset containing images of different types of cats and dogs. The algorithm is never trained on the given dataset, which means it has no idea about the features of the dataset. The task of the unsupervised learning algorithm is to identify the image features on its own. The unsupervised learning algorithm will perform this task by clustering the image dataset into groups according to the similarities between images.
o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is much like how a human learns to think through their own experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes unsupervised learning more important.
o In the real world, we do not always have input data with corresponding output, so to solve such cases we need unsupervised learning.
Once the suitable algorithm is applied, the algorithm divides the data objects into groups according to the similarities and differences between the objects. Unsupervised learning can be used for two types of problems: Clustering and Association.
o Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between the data objects and categorizes them as per the presence and absence of those commonalities.
o Association: An association rule is an unsupervised learning method which is used for finding relationships between variables in a large database. It determines the sets of items that occur together in the dataset. Association rules make marketing strategies more effective: for example, people who buy item X (say, bread) also tend to purchase item Y (butter/jam). A typical application of association rules is Market Basket Analysis.
o K-means clustering
o KNN (k-nearest neighbors)
o Hierarchical clustering
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
Example: Suppose we have images of different types of fruits. The task of our supervised learning model is to identify the fruits and classify them accordingly. To identify the images in supervised learning, we give the input data as well as the output for it, which means we train the model on the shape, size, color, and taste of each fruit. Once the training is completed, we test the model by giving it a new set of fruits. The model identifies the fruit and predicts the output using a suitable algorithm.
Example: To understand unsupervised learning, we will use the example given above. Unlike supervised learning, here we will not provide any supervision to the model. We will just provide the input dataset to the model and allow the model to find patterns in the data. With the help of a suitable algorithm, the model will train itself and divide the fruits into different groups according to the most similar features between them.
The main differences between Supervised and Unsupervised learning are given below:
o Supervised learning algorithms are trained using labeled data; unsupervised learning algorithms are trained using unlabeled data.
o A supervised learning model takes direct feedback to check whether it is predicting the correct output; an unsupervised learning model does not take any feedback.
o A supervised learning model predicts the output; an unsupervised learning model finds the hidden patterns in data.
o In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided to the model.
o The goal of supervised learning is to train the model so that it can predict the output when given new data; the goal of unsupervised learning is to find hidden patterns and useful insights from an unknown dataset.
o Supervised learning needs supervision to train the model; unsupervised learning does not need any supervision.
o Supervised learning can be used in cases where we know the inputs as well as the corresponding outputs; unsupervised learning can be used in cases where we have only input data and no corresponding output data.
o A supervised learning model produces an accurate result; an unsupervised learning model may give a less accurate result in comparison.
o Supervised learning is not close to true Artificial Intelligence, since we first train the model on each class of data and only then can it predict the correct output; unsupervised learning is closer to true Artificial Intelligence, as it learns similarly to how a child learns daily routine things through experience.
o Supervised learning includes algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision Tree, Bayesian Logic, etc.; unsupervised learning includes algorithms such as Clustering, KNN, and the Apriori algorithm.
Note: Supervised and unsupervised learning are both machine learning methods; the choice between them depends on factors related to the structure and volume of your dataset and on the use case of the problem.
We can understand the concept of regression analysis using the example below:
Example: Suppose there is a marketing company A, which runs various advertisements every year and gets sales accordingly; its records show the advertisement spend of each of the last 5 years and the corresponding sales. Now, the company wants to spend $200 on advertisement in the year 2019 and wants to know the prediction about the sales for this year. To solve such prediction problems in machine learning, we need regression analysis.
Regression is a supervised learning technique which helps in finding the correlation between variables and enables us to predict a continuous output variable based on one or more predictor variables. It is mainly used for prediction, forecasting, time series modeling, and determining the cause-and-effect relationship between variables.
In regression, we plot a graph between the variables which best fits the given datapoints; using this plot, the machine learning model can make predictions about the data. In simple words, "Regression shows a line or curve that passes through the datapoints on the target-predictor graph in such a way that the vertical distance between the datapoints and the regression line is minimum." The distance between the datapoints and the line tells whether the model has captured a strong relationship or not.
o Regression estimates the relationship between the target and the independent variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing regression, we can determine the most important factor, the least important factor, and how the factors affect each other.
Types of Regression
There are various types of regression which are used in data science and machine learning. Each type has its own importance in different scenarios, but at the core, all regression methods analyze the effect of the independent variables on the dependent variable. Here we discuss some important types of regression, which are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
Linear Regression in Machine Learning
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a statistical
method that is used for predictive analysis. Linear regression makes predictions for continuous/real or
numeric variables such as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent (y) variable and one or more independent (x) variables, hence it is called linear regression. Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship between the
variables. Consider the below image:
Mathematically, we can represent a linear regression as:
y = a0 + a1x + ε
Here, the values for the x and y variables are training data for the linear regression model representation.
If the dependent variable decreases on the Y-axis while the independent variable increases on the X-axis, then such a relationship is called a negative linear relationship.
Cost function:
o Different values for the weights or coefficients of the line (a0, a1) give different regression lines, and the cost function is used to estimate the values of the coefficients for the best-fit line.
o The cost function optimizes the regression coefficients or weights. It measures how well a linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps the input variable to the output variable. This mapping function is also known as the Hypothesis function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values. For N observations it can be written as:
MSE = (1/N) Σ (yi − (a0 + a1xi))²
where yi is the actual value and (a0 + a1xi) is the predicted value of the i-th observation.
Residuals: The distance between an actual value and the predicted value is called a residual. If the observed points are far from the regression line, the residuals will be high, and so will the cost function. If the scatter points are close to the regression line, the residuals will be small, and hence the cost function will be small too.
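To make the cost function concrete, here is a small NumPy sketch that evaluates the MSE for two candidate coefficient pairs; the data points are made up for illustration.

```python
# Evaluating the MSE cost function for candidate lines y = a0 + a1*x.
# The data below is made up: it roughly follows y = 2x.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

def mse(a0, a1):
    predicted = a0 + a1 * x
    return np.mean((y - predicted) ** 2)   # average squared residual

print(mse(0.0, 2.0))   # good coefficients -> small cost
print(mse(0.0, 5.0))   # poor coefficients -> large cost
```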
The key point in Simple Linear Regression is that the dependent variable must be a continuous/real
value. However, the independent variable can be measured on continuous or categorical values.
o Modelling the relationship between two variables, such as the relationship between income and expenditure, or experience and salary.
o Forecasting new observations, such as weather forecasting according to temperature, or the revenue of a company according to its investments in a year.
y = a0 + a1x + ε
Where,
a0 = the intercept of the regression line (it can be obtained by putting x = 0)
a1 = the slope of the regression line, which tells whether the line is increasing or decreasing
ε = the error term (for a good model it will be negligible)
Here we are taking a dataset that has two variables: salary (dependent variable) and experience (independent variable). The goals of this problem are:
o To find out whether there is any correlation between these two variables.
o To find the best-fit line for the dataset.
o To see how the dependent variable changes as the independent variable changes.
In this section, we will create a Simple Linear Regression model to find out the best fitting line for
representing the relationship between these two variables.
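As a concrete illustration, the following sketch fits such a model with scikit-learn; the experience and salary numbers are made-up stand-ins for the dataset described above.

```python
# Simple Linear Regression on a made-up salary/experience dataset,
# assuming scikit-learn is installed.
import numpy as np
from sklearn.linear_model import LinearRegression

experience = np.array([[1], [2], [3], [4], [5], [6]])            # x, years
salary = np.array([35000, 40000, 47000, 52000, 60000, 68000])    # y

model = LinearRegression()
model.fit(experience, salary)

print("intercept a0:", model.intercept_)
print("slope a1:", model.coef_[0])
print("salary predicted for 7 years:", model.predict([[7]])[0])
```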
For example, a person's salary can be affected by their years of experience, years of education,
daily working hours, etc. In this case we would use multiple linear regression.
In the previous topic, we have learned about Simple Linear Regression, where a single
Independent/Predictor(X) variable is used to model the response variable (Y). But there may be
various cases in which the response variable is affected by more than one predictor variable; for
such cases, the Multiple Linear Regression algorithm is used.
Multiple Linear Regression is one of the important regression algorithms which models
the linear relationship between a single dependent continuous variable and more than
one independent variable.
o For MLR, the dependent or target variable (Y) must be continuous/real, but the predictor or independent variables may be of continuous or categorical form.
o Each feature variable must model the linear relationship with the dependent variable.
o MLR tries to fit a regression line through a multidimensional space of data-points.
MLR equation:
In Multiple Linear Regression, the target variable (Y) is a linear combination of multiple predictor variables x1, x2, x3, ..., xn. Since it is an enhancement of Simple Linear Regression, the same form applies, and the equation becomes:
Y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn + ε
Where,
Y = Output/Response variable
b0, b1, ..., bn = Coefficients of the model
x1, x2, ..., xn = Independent/feature variables
ε = The error term
We have a dataset of 50 start-up companies. This dataset contains five main attributes: R&D Spend, Administration Spend, Marketing Spend, State, and Profit for a financial year. Our goal is to create a model that can easily determine which company has the maximum profit, and which factor most affects the profit of a company.
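A sketch of how such a model could be built is shown below, assuming scikit-learn and pandas; the file name "50_Startups.csv" and the exact column names are assumptions about how the dataset is stored.

```python
# Multiple Linear Regression on the 50-start-ups dataset described above.
# Assumes a CSV file named "50_Startups.csv" with columns including
# "State" (categorical) and "Profit" (target); both names are assumptions.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = pd.read_csv("50_Startups.csv")
X = pd.get_dummies(data.drop(columns="Profit"))   # one-hot encode "State"
y = data["Profit"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on test data:", model.score(X_test, y_test))
print("coefficients per feature:", dict(zip(X.columns, model.coef_)))
```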
Polynomial Regression
o Polynomial Regression is a regression algorithm that models the relationship between a dependent variable (y) and an independent variable (x) as an nth-degree polynomial. The Polynomial Regression equation is given below:
y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ
o If we apply a linear model to a linear dataset, it provides a good result, as we have seen in Simple Linear Regression; but if we apply the same model without any modification to a non-linear dataset, it will produce a drastically worse output: the loss function will increase, the error rate will be high, and accuracy will decrease.
o So for such cases, where data points are arranged in a non-linear fashion, we need the
Polynomial Regression model. We can understand it in a better way using the below comparison
diagram of the linear dataset and non-linear dataset.
o In the above image, we have a dataset that is arranged non-linearly. So if we try to cover it with a linear model, we can clearly see that it hardly covers any data points. On the other hand, a curve covers most of the data points, which corresponds to the Polynomial model.
o Hence, if the datasets are arranged in a non-linear fashion, then we should use the Polynomial
Regression model instead of Simple Linear Regression.
Note: A Polynomial Regression algorithm is also called Polynomial Linear Regression because it
does not depend on the variables, instead, it depends on the coefficients, which are arranged in a
linear fashion.
Consider the following three equations:
Simple Linear Regression: y = b0 + b1x
Multiple Linear Regression: y = b0 + b1x1 + b2x2 + ... + bnxn
Polynomial Regression: y = b0 + b1x + b2x² + ... + bnxⁿ
When we compare these three equations, we can clearly see that all three are polynomial equations but differ in the degree of the variables. The Simple and Multiple Linear equations are also polynomial equations of degree one, and the Polynomial Regression equation is a linear equation of degree n. So if we raise the degree of our linear equations, they are converted into Polynomial Linear equations.
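The following sketch shows this in code, assuming scikit-learn: PolynomialFeatures adds the higher-degree terms, after which an ordinary linear model is fitted on them. The data is generated artificially, and degree=2 is an illustrative choice.

```python
# Polynomial Regression: fit a linear model on polynomial features.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = x.ravel() ** 2 + np.random.normal(0, 0.5, 30)   # non-linear pattern

# Transform x into [x, x^2]; the model stays linear in its coefficients.
x_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
model = LinearRegression().fit(x_poly, y)
print("R^2 of the polynomial fit:", model.score(x_poly, y))
```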
Unlike regression, the output variable of classification is a category, not a value, such as "Green or Blue", "fruit or animal", etc. Since the classification algorithm is a supervised learning technique, it takes labeled input data, which means it contains inputs with the corresponding outputs.
In a classification algorithm, the input variable (x) is mapped to a discrete output function (y).
The main goal of a classification algorithm is to identify the category of a given dataset, and these algorithms are mainly used to predict the output for categorical data.
Classification algorithms can be better understood using the below diagram. In the below diagram, there
are two classes, class A and Class B. These classes have features that are similar to each other and
dissimilar to other classes.
The algorithm which implements the classification on a dataset is known as a classifier. There are two
types of Classifications:
o Binary Classifier: If the classification problem has only two possible outcomes, then it is called a Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
o Multi-class Classifier: If a classification problem has more than two outcomes, then it is called a Multi-class Classifier.
Classification algorithms can be divided into two main categories:
o Linear Models
  o Logistic Regression
  o Support Vector Machines
o Non-linear Models
  o K-Nearest Neighbours
  o Kernel SVM
  o Naïve Bayes
  o Decision Tree Classification
1. Log Loss or Cross-Entropy Loss
o It is used for evaluating the performance of a classifier whose output is a probability value between 0 and 1.
o For a good binary classification model, the value of log loss should be near 0.
o The value of log loss increases if the predicted value deviates from the actual value.
o A lower log loss represents a higher accuracy of the model.
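A quick sketch, assuming scikit-learn; the labels and the two sets of predicted probabilities below are made up to show how log loss grows as predictions drift from the actual values.

```python
# Comparing log loss for confident-and-correct vs. uncertain predictions.
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1, 0]
good_probs = [0.9, 0.1, 0.8, 0.95, 0.2]   # close to the true labels
bad_probs = [0.4, 0.6, 0.5, 0.4, 0.7]     # far from the true labels

print("good model:", log_loss(y_true, good_probs))  # near 0
print("bad model:", log_loss(y_true, bad_probs))    # noticeably larger
```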
2. Confusion Matrix in Machine Learning
The confusion matrix is a matrix used to determine the performance of the classification models for a
given set of test data. It can only be determined if the true values for test data are known. The matrix itself
can be easily understood, but the related terminology may be confusing. Since it shows the errors in the model's performance in the form of a matrix, it is also known as an error matrix. Some features of the confusion matrix are given below:
o For 2 prediction classes, the matrix is a 2*2 table; for 3 classes, it is a 3*3 table, and so on.
o The matrix is divided into two dimensions, predicted values and actual values, along with the total number of predictions.
o Predicted values are those values, which are predicted by the model, and actual values are the
true values for the given observations.
o True Negative: The model has predicted No, and the real or actual value was also No.
o True Positive: The model has predicted Yes, and the actual value was also Yes.
o False Negative: The model has predicted No, but the actual value was Yes; it is also called a Type-II error.
o False Positive: The model has predicted Yes, but the actual value was No; it is also called a Type-I error.
o It evaluates the performance of the classification models, when they make predictions on test
data, and tells how good our classification model is.
o It not only tells the errors made by the classifier but also the type of error, i.e., whether it is a Type-I or Type-II error.
o With the help of the confusion matrix, we can calculate the different parameters for the model,
such as accuracy, precision, etc.
Suppose we are trying to create a model that can predict whether or not a person has a certain disease. The confusion matrix for this is given as follows:
o The table is given for a two-class classifier, which has two predictions, "Yes" and "No." Here, Yes means the patient has the disease, and No means the patient does not have the disease.
o The classifier has made a total of 100 predictions. Out of 100 predictions, 89 are correct predictions, and 11 are incorrect predictions.
o The model has predicted "Yes" 32 times and "No" 68 times, whereas the actual "Yes" occurred 27 times and the actual "No" 73 times.
We can perform various calculations for the model, such as the model's accuracy, using this matrix. These
calculations are given below:
o Classification Accuracy: It is one of the important parameters to determine the accuracy of classification problems. It defines how often the model predicts the correct output. It can be calculated as the ratio of the number of correct predictions made by the classifier to the total number of predictions made:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
o Misclassification rate: It is also termed the error rate, and it defines how often the model gives wrong predictions. It can be calculated as the ratio of the number of incorrect predictions to the total number of predictions made by the classifier:
Error rate = (FP + FN) / (TP + TN + FP + FN)
o Precision: It can be defined as, out of all the classes the model predicted as positive, how many were actually positive. It can be calculated using the formula below:
Precision = TP / (TP + FP)
o Recall: It is defined as, out of all the actually positive classes, how many our model predicted correctly. The recall should be as high as possible:
Recall = TP / (TP + FN)
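These formulas can be checked in a few lines of code. The four cell counts below are the only values consistent with the totals quoted above (100 predictions, 89 correct, 32 predicted "Yes", 27 actual "Yes").

```python
# Metrics from the disease-prediction confusion matrix above.
TP, FP = 24, 8    # predicted "Yes": 32
FN, TN = 3, 65    # predicted "No": 68

total = TP + TN + FP + FN                 # 100 predictions
accuracy = (TP + TN) / total              # 0.89
error_rate = (FP + FN) / total            # 0.11
precision = TP / (TP + FP)                # 0.75
recall = TP / (TP + FN)                   # ~0.889

print(accuracy, error_rate, precision, recall)
```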
Classification algorithms can be used in different places, such as email spam detection, speech recognition, and biometric identification.
K-Nearest Neighbour (KNN)
Suppose we have a new data point and we need to put it in the required category. Consider the below image:
o Firstly, we will choose the number of neighbours; here we will choose k = 5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. It can be calculated as:
d = √((x2 − x1)² + (y2 − y1)²)
o By calculating the Euclidean distance, we get the nearest neighbours: three nearest neighbours in category A and two nearest neighbours in category B. Consider the below image:
o As 3 of the 5 nearest neighbours are from category A, this new data point must belong to category A.
o There is no particular way to determine the best value for "K", so we need to try some values to find the best among them. The most preferred value for K is 5.
o A very low value of K, such as K = 1 or K = 2, can be noisy and expose the model to the effects of outliers.
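The whole procedure is a one-liner in practice. Here is a minimal sketch with k = 5, assuming scikit-learn; the 2-D points and their category labels are made up to mirror the example above.

```python
# K-NN classification with k=5 (Euclidean distance is the default metric).
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 2], [2, 3], [3, 3], [6, 5], [7, 7], [8, 6]]
y = ["A", "A", "A", "B", "B", "B"]

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# The new point gets the majority category among its 5 nearest neighbours
# (here 3 of them are in category A, so it is classified as A).
print(knn.predict([[3, 4]]))   # -> ['A']
```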
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-
dimensional space into classes so that we can easily put the new data point in the correct category in the
future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider the below diagram, in which there are two different categories that are classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we used in the KNN classifier. Suppose we see a strange cat that also has some features of dogs; if we want a model that can accurately identify whether it is a cat or a dog, such a model can be created by using the SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can learn about the different features of cats and dogs, and then we test it with this strange creature. The support vector machine creates a decision boundary between these two classes of data (cat and dog) and chooses the extreme cases (support vectors), so it will look at the extreme cases of cats and dogs. On the basis of the support vectors, it will classify the creature as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
The dimensions of the hyperplane depend on the number of features present in the dataset: if there are 2 features (as shown in the image), then the hyperplane will be a straight line, and if there are 3 features, then the hyperplane will be a 2-dimensional plane.
We always create a hyperplane that has a maximum margin, which means the maximum distance between the hyperplane and the nearest data points.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed Support Vectors. Since these vectors support the hyperplane, they are called support vectors.
How does SVM work?
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset
that has two tags (green and blue), and the dataset has two features x1 and x2. We want a classifier that
can classify the pair(x1, x2) of coordinates in either green or blue. Consider the below image:
Since this is a 2-D space, by just using a straight line we can easily separate these two classes. But there can be multiple lines that separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors. The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we
cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we have used two
dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below image:
Since we are in 3-D space, the decision boundary looks like a plane parallel to the x-axis. If we convert it to 2-D space with z = 1, it becomes:
x² + y² = 1
Hence, we get a circle of radius 1 in the case of non-linear data.
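The following sketch demonstrates both cases, assuming scikit-learn: a linear kernel fails on circularly arranged data, while the RBF kernel plays the role of the added dimension z = x² + y² described above. The data is generated artificially.

```python
# Linear vs. non-linear SVM on circularly arranged (non-linear) data.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# One class inside a circle, the other outside it.
X, y = make_circles(n_samples=100, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
nonlinear_svm = SVC(kernel="rbf").fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))    # poor
print("rbf kernel accuracy:", nonlinear_svm.score(X, y))    # near 1.0
print("support vectors used:", len(nonlinear_svm.support_vectors_))
```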
The Apriori algorithm uses frequent itemsets to generate association rules, and it is designed to work on databases that contain transactions. With the help of these association rules, it determines how strongly or how weakly two objects are connected. This algorithm uses a breadth-first search and a Hash Tree to calculate the itemset associations efficiently. It is an iterative process for finding the frequent itemsets in a large dataset.
This algorithm was proposed by R. Agrawal and R. Srikant in 1994. It is mainly used for market basket analysis and helps to find products that can be bought together. It can also be used in the healthcare field to find drug reactions for patients.
Frequent itemsets are those itemsets whose support is greater than the threshold value or user-specified minimum support. This implies that if {A, B} is a frequent itemset, then A and B must individually also be frequent itemsets.
Suppose there are two transactions, A = {1,2,3,4,5} and B = {2,3,7}; in these two transactions, 2 and 3 are the frequent itemsets.
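Support counting for this two-transaction example takes only a few lines of plain Python; the minimum support of 2 is taken from the example itself.

```python
# Counting item support over the two transactions above.
transactions = [{1, 2, 3, 4, 5}, {2, 3, 7}]
min_support = 2

items = set().union(*transactions)
support = {i: sum(i in t for t in transactions) for i in items}

frequent = sorted(i for i, s in support.items() if s >= min_support)
print(frequent)   # -> [2, 3]
```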
Steps for Apriori Algorithm
Below are the steps for the Apriori algorithm:
Step-1: Determine the support of the itemsets in the transactional database, and select the minimum support and confidence.
Step-2: Take all the itemsets in the transactions with a support value higher than the minimum or selected support value.
Step-3: Find all the rules of these subsets that have a confidence value higher than the threshold or minimum confidence.
Example: Suppose we have the following dataset that has various transactions, and from this
dataset, we need to find the frequent itemsets and generate the association rules using the Apriori
algorithm:
Solution:
Step-1: Calculating C1 and L1:
o In the first step, we will create a table that contains the support count (the frequency of each itemset individually in the dataset) of each itemset in the given dataset. This table is called the Candidate set or C1.
o Now, we will take out all the itemsets that have a support count greater than the Minimum Support (2). This will give us the table for the frequent itemset L1.
Since all the itemsets except E have a support count greater than or equal to the minimum support, the E itemset will be removed.
Step-2: Candidate Generation C2, and L2:
o In this step, we will generate C2 with the help of L1. In C2, we will create pairs of the itemsets of L1 in the form of subsets.
o After creating the subsets, we will again find the support counts from the main transaction table of the dataset, i.e., how many times these pairs have occurred together in the given dataset. So we will get the below table for C2:
o Again, we need to compare the C2 support counts with the minimum support count; after comparing, the itemsets with a lower support count will be eliminated from table C2. This will give us the below table for L2.
As the given threshold or minimum confidence is 50%, the first three rules, A^B → C, B^C → A, and A^C → B, can be considered strong association rules for the given problem.
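The confidence check behind this step can be sketched in a few lines; the transactions below are made up (the original transaction table is not reproduced here), and confidence(X → Y) = support(X ∪ Y) / support(X).

```python
# Checking a rule's confidence against the 50% minimum-confidence threshold.
transactions = [
    {"A", "B", "C"}, {"A", "B", "C"}, {"B", "C"}, {"A", "C"}, {"A", "B"},
]

def support(itemset):
    # Number of transactions containing every item of the itemset.
    return sum(itemset <= t for t in transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

conf = confidence({"A", "B"}, {"C"})
print(f"confidence(A^B -> C) = {conf:.0%}")   # kept if >= 50%
```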
Advantages of Apriori Algorithm
o The algorithm is easy to understand.
o The join and prune steps of the algorithm can be easily implemented on large datasets.
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into
different clusters. Here K defines the number of pre-defined clusters that need to be created in the
process, as if K=2, there will be two clusters, and for K=3, there will be three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group, whose members have similar properties.
It allows us to cluster the data into different groups, and is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.
o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. The data points which are near a particular k-center form a cluster.
Hence each cluster has datapoints with some commonalities and is away from the other clusters.
The working of the K-means Clustering Algorithm is explained in the steps below:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They may be points other than those from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid in each cluster.
Step-5: Repeat the third step, i.e., reassign each datapoint to the new closest centroid of its cluster.
Step-6: If any reassignment occurs, go to Step-4; otherwise the model is ready.
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given
below:
o Let's take the number of clusters k, i.e., K = 2, to identify the dataset and put the data points into different clusters. It means here we will try to group these datasets into two different clusters.
o We need to choose some random k points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we are selecting the below two points as k points, which are not part of our dataset. Consider the below image:
o Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will
compute it by applying some mathematics that we have studied to calculate the distance between
two points. So, we will draw a median between both the centroids. Consider the below image:
From the above image, it is clear that points left side of the line is near to the K1 or blue centroid, and
points to the right of the line are close to the yellow centroid. Let's color them as blue and yellow for clear
visualization.
o As we need to find the closest cluster, we will repeat the process by choosing new centroids. To choose the new centroids, we will compute the center of gravity of the points in each cluster, and will find the new centroids as below:
o Next, we will reassign each datapoint to the new centroid. For this, we will repeat the same process of finding a median line. The median will be like the below image:
From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So these three points will be assigned to the new centroids.
As reassignment has taken place, we will again go to Step-4, which is finding new centroids or K-points.
o We will repeat the process by finding the center of gravity of the points in each cluster, so the new centroids will be as shown in the below image:
o As we have the new centroids, we will again draw the median line and reassign the data points. So the image will be:
o We can see in the above image that there are no dissimilar data points on either side of the line, which means our model is formed. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:
The performance of the K-means clustering algorithm depends on how efficient the clusters it forms are. But choosing the optimal number of clusters is a big task. There are different ways to find the optimal number of clusters, but here we are discussing the most appropriate method to find the number of clusters, or the value of K. The method is given below:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters. This method
uses the concept of WCSS value. WCSS stands for Within Cluster Sum of Squares, which defines the
total variations within a cluster. The formula to calculate the value of WCSS (for 3 clusters) is given
below:
WCSS = Σ(Pi in Cluster1) distance(Pi, C1)² + Σ(Pi in Cluster2) distance(Pi, C2)² + Σ(Pi in Cluster3) distance(Pi, C3)²
In the above formula of WCSS,
Σ(Pi in Cluster1) distance(Pi, C1)² is the sum of the squares of the distances between each data point and its centroid within cluster 1, and the same holds for the other two terms.
To measure the distance between data points and centroid, we can use any method such as Euclidean
distance or Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:
o It executes K-means clustering on a given dataset for different K values (ranging from 1 to 10).
o For each value of K, it calculates the WCSS value.
o It plots a curve between the calculated WCSS values and the number of clusters K.
o The sharp point of bend in the plot, where the curve looks like an arm, is considered the best value of K.
Since the graph shows a sharp bend that looks like an elbow, this is known as the elbow method. The graph for the elbow method looks like the below image:
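A sketch of the elbow method is given below, assuming scikit-learn and matplotlib; inertia_ is scikit-learn's name for the WCSS value, and the three-group data is generated artificially so the elbow should appear around K = 3.

```python
# Elbow method: plot WCSS (inertia_) for K = 1..10 and look for the bend.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(m, 0.5, (20, 2)) for m in (2, 5, 8)])

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)          # within-cluster sum of squares

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("WCSS")
plt.show()                            # the sharp bend suggests the best K
```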