
UNIT - III

Learning with Trees – Decision Trees – Constructing Decision Trees – Classification and Regression Trees – Ensemble Learning – Boosting – Bagging – Different Ways to Combine Classifiers – Basic Statistics – Gaussian Mixture Models – Nearest Neighbor Methods – Unsupervised Learning – K-Means Algorithms

Machine learning algorithms are programs that can learn hidden patterns from data, predict outputs, and improve their performance from experience on their own. Different algorithms are used for different tasks: for example, simple linear regression can be used for prediction problems such as stock market prediction, while the KNN algorithm can be used for classification problems.
This topic gives an overview of some popular and commonly used machine learning algorithms along with their use cases and categories.

Types of Machine Learning Algorithms

Machine learning algorithms can be broadly classified into three types:

1. Supervised Learning Algorithms
2. Unsupervised Learning Algorithms
3. Reinforcement Learning Algorithms

The diagram below illustrates the different ML algorithms along with their categories:
1) Supervised Learning Algorithm
Supervised learning is a type of machine learning in which the machine needs external supervision to learn. Supervised learning models are trained on a labeled dataset. Once training and processing are done, the model is tested on sample test data to check whether it predicts the correct output.

The goal of supervised learning is to map input data to output data. Supervised learning is based on supervision, much like a student learning under a teacher's supervision. An example of supervised learning is spam filtering.

Supervised learning can be divided further into two categories of problem:

○ Classification
○ Regression

Examples of popular supervised learning algorithms are Simple Linear Regression, Decision Tree, Logistic Regression, and the KNN algorithm.
2) Unsupervised Learning Algorithm
Unsupervised learning is a type of machine learning in which the machine does not need any external supervision to learn from the data. Unsupervised models are trained on an unlabeled dataset that is neither classified nor categorized, and the algorithm must act on that data without any supervision. In unsupervised learning, the model does not have a predefined output; instead it tries to find useful insights from large amounts of data. These algorithms are used to solve association and clustering problems. Hence, unsupervised learning can be further classified into two types:

○ Clustering
○ Association

Examples of unsupervised learning algorithms are K-Means Clustering, the Apriori algorithm, and Eclat.

3) Reinforcement Learning
In reinforcement learning, an agent interacts with its environment by producing actions and learns with the help of feedback. The feedback is given to the agent in the form of rewards: for each good action it receives a positive reward, and for each bad action it receives a negative reward. No supervision is provided to the agent. The Q-Learning algorithm is commonly used in reinforcement learning.

List of Popular Machine Learning Algorithms


1. Linear Regression Algorithm
2. Logistic Regression Algorithm
3. Decision Tree
4. SVM
5. Naïve Bayes
6. KNN
7. K-Means Clustering
8. Random Forest
9. Apriori
10. PCA

1. Linear Regression
Linear regression is one of the most popular and simplest machine learning algorithms used for predictive analysis. Predictive analysis here means predicting something, and linear regression makes predictions for continuous numbers such as salary, age, etc.

It shows the linear relationship between the dependent and independent variables, and shows how the dependent variable (y) changes according to the independent variable (x).

It tries to fit the best line between the dependent and independent variables, and this best-fit line is known as the regression line.

The equation for the regression line is:

y = a0 + a1*x + ε

Here, y = dependent variable, x = independent variable, a0 = intercept of the line, a1 = slope of the line (linear regression coefficient), and ε = random error.

Linear regression is further divided into two types:

○ Simple Linear Regression: In simple linear regression, a single independent variable is used to predict the value of the dependent variable.
○ Multiple Linear Regression: In multiple linear regression, more than one independent variable is used to predict the value of the dependent variable.

The below diagram shows the linear regression for prediction of weight
according to height:
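The diagram itself is not reproduced here, but the same idea can be sketched in code. The following is a minimal illustration (not part of the original notes) using scikit-learn's LinearRegression; the height/weight values are invented purely for demonstration.

```python
# A minimal sketch: simple linear regression of weight vs. height.
# The data values below are made up purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

heights_cm = np.array([150, 160, 165, 170, 180, 185]).reshape(-1, 1)  # independent variable x
weights_kg = np.array([50, 56, 61, 66, 74, 80])                       # dependent variable y

model = LinearRegression().fit(heights_cm, weights_kg)
print("intercept a0:", model.intercept_)            # where the regression line crosses the y-axis
print("slope a1:", model.coef_[0])                  # change in weight per cm of height
print("prediction for 175 cm:", model.predict([[175]])[0])
```

The fitted intercept_ and coef_ correspond to a0 and a1 in the regression equation above.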
2. Logistic Regression
Logistic regression is a supervised learning algorithm that is used to predict categorical variables or discrete values. It can be used for classification problems in machine learning, and the output of the logistic regression algorithm can be Yes or No, 0 or 1, Red or Blue, etc.

Logistic regression is similar to linear regression except in how it is used: linear regression is used to solve regression problems and predict continuous values, whereas logistic regression is used to solve classification problems and predict discrete values.

Instead of fitting a best-fit line, it fits an S-shaped curve that lies between 0 and 1. This S-shaped curve is known as the logistic (sigmoid) function and uses the concept of a threshold: any value above the threshold tends to 1, and any value below it tends to 0.
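As a hedged illustration (not from the original notes), the sketch below uses scikit-learn's LogisticRegression on a tiny invented dataset to show the discrete output and the sigmoid probability compared against the 0.5 threshold.

```python
# A minimal sketch: logistic regression producing a 0/1 class and a
# probability via the S-shaped (sigmoid) curve. The tiny dataset is invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6]])   # e.g. hours studied
y = np.array([0, 0, 0, 1, 1, 1])               # e.g. fail (0) / pass (1)

clf = LogisticRegression().fit(X, y)
print(clf.predict([[3.5]]))        # discrete output: 0 or 1
print(clf.predict_proba([[3.5]]))  # sigmoid probability, compared to the 0.5 threshold
```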

3. Decision Tree Algorithm

A decision tree is a supervised learning algorithm that is mainly used to solve classification problems but can also be used for regression problems. It can work with both categorical and continuous variables. It has a tree-like structure of nodes and branches, starting from the root node, which expands into further branches until the leaf nodes. Internal nodes represent the features of the dataset, branches represent the decision rules, and leaf nodes represent the outcome of the problem.
Some real-world applications of decision tree algorithms are distinguishing between cancerous and non-cancerous cells, suggesting to customers which car to buy, etc.

4. Support Vector Machine Algorithm


A support vector machine, or SVM, is a supervised learning algorithm that can be used for both classification and regression problems, though it is primarily used for classification. The goal of SVM is to create a hyperplane, or decision boundary, that can segregate the dataset into different classes.

The data points that help define the hyperplane are known as support vectors, hence the name support vector machine.

Some real-life applications of SVM are face detection, image classification, drug discovery, etc. Consider the below diagram:

As we can see in the above diagram, the hyperplane has separated the dataset into two different classes.
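As a rough sketch of the idea (an assumed example, not from the notes), the code below fits a linear SVM with scikit-learn on synthetic blob data and inspects the hyperplane and support vectors.

```python
# A minimal sketch: a linear SVM whose hyperplane separates two classes;
# the points defining the margin are the support vectors mentioned above.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=0)  # synthetic 2-class data
clf = SVC(kernel="linear").fit(X, y)

print("support vectors per class:", clf.n_support_)
print("hyperplane coefficients:", clf.coef_, "intercept:", clf.intercept_)
```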
5. Naïve Bayes Algorithm
Naïve Bayes is a supervised learning classifier that makes predictions based on the probability of an object. The algorithm is named Naïve Bayes because it is based on Bayes' theorem and follows the naïve assumption that the input variables are independent of each other.
Bayes' theorem is based on conditional probability: the likelihood that event A will happen given that event B has already happened. The equation for Bayes' theorem is:

P(A|B) = P(B|A) * P(A) / P(B)

The Naïve Bayes classifier is one of the best simple classifiers, providing good results for a given problem. It is easy to build a naïve Bayesian model, and it is well suited to very large datasets. It is mostly used for text classification.

6. K-Nearest Neighbour (KNN)

K-Nearest Neighbour is a supervised learning algorithm that can be used for both classification and regression problems. The algorithm works by assuming similarity between the new data point and the available data points; based on these similarities, the new data point is put in the most similar category. It is also known as the lazy learner algorithm because it stores the entire available dataset and classifies each new case with the help of its K neighbours. The new case is assigned to the class with the most similar neighbours, and a distance function measures the distance between the data points. The distance function can be Euclidean, Minkowski, Manhattan, or Hamming distance, depending on the requirement.

7. K-Means Clustering
K-means clustering is one of the simplest unsupervised learning algorithms, used to solve clustering problems. The data points are grouped into K different clusters based on similarities and dissimilarities: points with the most in common stay in one cluster, which has little or nothing in common with the other clusters. In K-means, K refers to the number of clusters, and 'means' refers to averaging the data points in order to find the centroid.

It is a centroid-based algorithm, and each cluster is associated with a centroid. The algorithm aims to reduce the distance between the data points and the centroid of their cluster.

The algorithm starts with a group of randomly selected centroids that form the initial clusters, and then performs an iterative process to optimize the positions of these centroids.
It can be used for spam detection and filtering, identification of fake news, etc.
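A minimal sketch of this centroid-based procedure, assuming scikit-learn's KMeans (which is not referenced in the notes), on a tiny invented dataset:

```python
# A minimal sketch: K clusters, their centroids, and the cluster assignments.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])       # two obvious groups of points

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster labels:", kmeans.labels_)        # which cluster each point belongs to
print("centroids:", kmeans.cluster_centers_)    # the 'means' that give the algorithm its name
```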

8. Random Forest Algorithm

Random forest is a supervised learning algorithm that can be used for both classification and regression problems in machine learning. It is an ensemble learning technique that combines multiple classifiers to improve the performance of the model.

It builds multiple decision trees on subsets of the given dataset and averages their results to improve the predictive accuracy of the model. A random forest typically contains 64-128 trees, and a greater number of trees leads to higher accuracy.

To classify a new dataset or object, each tree gives a classification result, and based on the majority vote the algorithm predicts the final output.

Random forest is a fast algorithm and can deal efficiently with missing and incorrect data.

9. Apriori Algorithm
The Apriori algorithm is an unsupervised learning algorithm used to solve association problems. It uses frequent itemsets to generate association rules and is designed to work on databases that contain transactions. With the help of these association rules, it determines how strongly or weakly two objects are connected to each other. The algorithm uses a breadth-first search and a hash tree to compute itemsets efficiently.

The algorithm proceeds iteratively to find the frequent itemsets in a large dataset.

The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994. It is mainly used for market basket analysis and helps to understand which products are likely to be bought together. It can also be used in the healthcare field to find drug reactions in patients.

10. Principal Component Analysis

Principal Component Analysis (PCA) is an unsupervised learning technique used for dimensionality reduction. It helps reduce the dimensionality of a dataset that contains many features correlated with each other. It is a statistical process that converts observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation. It is one of the popular tools used for exploratory data analysis and predictive modeling.

PCA works by considering the variance of each attribute, because high variance indicates a good split between classes, and thereby reduces the dimensionality.

Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing power allocation in various communication channels.

CLASSIFICATION VS REGRESSION

Classification Algorithms
Classification is the process of finding a model or function that helps separate the data into multiple categorical classes, i.e., discrete values. In classification, data is categorized under different labels according to some input parameters, and then the labels are predicted for new data.

In a classification task, we are supposed to predict discrete target variables (class labels) using independent features, and to find a decision boundary that can separate the different classes of the target variable.
The derived mapping function can be demonstrated in the form of "IF-THEN" rules. The classification process deals with problems where the data can be divided into binary or multiple discrete labels. For example, suppose we want to predict whether Team A will win a match on the basis of some parameters recorded earlier; then there would be two labels, Yes and No. This is binary classification; with more than two labels it becomes multiclass classification.

Types of Classification Algorithms

Different state-of-the-art classification algorithms have been developed over time to give the best results for classification tasks, some of them employing techniques like bagging and boosting:

Decision Tree
Random Forest Classifier
K-Nearest Neighbors
Support Vector Machine
Regression Algorithms
Regression is the process of finding a model or function that maps the data to continuous real values instead of classes or discrete values. It can also identify the distribution movement depending on the historical data. Because a regression model predicts a quantity, the skill of the model must be reported as an error in those predictions.

In a regression task, we are supposed to predict a continuous target variable using independent features. In regression tasks, we generally face two types of problems: linear and non-linear regression.
Taking a similar example for regression, suppose we are estimating the chance of rain in a particular region with the help of some parameters recorded earlier; then the prediction is a continuous value (a probability) associated with the rain.
Types of Regression Algorithms
Different state-of-the-art regression algorithms have been developed over time to give the best results for regression tasks, some of them employing techniques like bagging and boosting:

Lasso Regression
Ridge Regression
XGBoost Regressor
LGBM Regressor
Comparison between Classification and Regression

Classification | Regression
-------------- | ----------
The target variables are discrete. | The target variables are continuous.
Problems like spam email classification and disease prediction are solved using classification algorithms. | Problems like house price prediction and rainfall prediction are solved using regression algorithms.
We try to find the best possible decision boundary, which separates the classes with the maximum possible separation. | We try to find the best-fit line, which represents the overall trend in the data.
Evaluation metrics like Precision, Recall, and F1-score are used to evaluate classification algorithms. | Evaluation metrics like Mean Squared Error, R2-score, and MAPE are used to evaluate regression algorithms.
Here we face binary classification or multi-class classification problems. | Here we face linear as well as non-linear regression models.
Input data consists of independent variables and a categorical dependent variable. | Input data consists of independent variables and a continuous dependent variable.
The classification algorithm's task is mapping the input value x to a discrete output variable y. | The regression algorithm's task is mapping the input value x to a continuous output variable y.
Output is categorical labels. | Output is continuous numerical values.
The objective is to predict categorical/class labels. | The objective is to predict continuous numerical values.
Example use cases are spam detection, image recognition, and sentiment analysis. | Example use cases are stock price prediction, house price prediction, and demand forecasting.
Examples of classification algorithms are:

Logistic Regression, Decision Trees, Random Forest, Support Vector Machines (SVM), K-Nearest Neighbors (K-NN), Naive Bayes, Neural Networks, Multi-layer Perceptron (MLP), etc.

Examples of regression algorithms are:

Linear Regression, Polynomial Regression, Ridge Regression, Lasso Regression, Support Vector Regression (SVR), Decision Trees for Regression, Random Forest Regression, K-Nearest Neighbors (K-NN) Regression, Neural Networks for Regression, etc.

When to Use Regression/Classification?


Classification trees are employed when there’s a need to categorize the dataset
into distinct classes associated with the response variable. Often, these classes
are binary, such as “Yes” or “No,” and they are mutually exclusive. While there are
instances where there may be more than two classes, a modified version of the
classification tree algorithm is used in those scenarios.
On the other hand, regression trees are utilized when dealing with continuous
response variables. For instance, if the response variable represents continuous
values like the price of an object or the temperature for the day, a regression tree
is the appropriate choice.

There are situations where a blend of regression and classification approaches is


necessary. For instance, ordinal regression comes into play when dealing with
ranked or ordered categories, while multi-label classification is suitable for cases
where data points can be associated with multiple classes at the same time.

For non-linearly separable data (as in the SVM discussion above), to separate the data points we need to add one more dimension. For linear data we used the two dimensions x and y, so for non-linear data we add a third dimension z, which can be calculated as:

z = x² + y²

By adding the third dimension, the sample space becomes three-dimensional, and the SVM can then divide the dataset into classes with a hyperplane in that space. Consider the below image:

Decision Tree Classification Algorithm


○ Decision Tree is a Supervised learning technique that can be used for
both classification and Regression problems, but mostly it is preferred
for solving Classification problems. It is a tree-structured classifier,
where internal nodes represent the features of a dataset, branches
represent the decision rules and each leaf node represents the
outcome.
○ In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision nodes are used to make decisions and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
○ The decisions or the test are performed on the basis of features of the
given dataset.
○ It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
○ It is called a decision tree because, similar to a tree, it starts with the
root node, which expands on further branches and constructs a
tree-like structure.
○ In order to build a tree, we use the CART algorithm, which stands for
Classification and Regression Tree algorithm.
○ A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.
○ Below diagram explains the general structure of a decision tree:

Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.

Why use Decision Trees?


There are various algorithms in Machine learning, so choosing the best algorithm
for the given dataset and problem is the main point to remember while creating
a machine learning model. Below are the two reasons for using the Decision tree:
○ Decision trees usually mimic human thinking when making a decision, so they are easy to understand.
○ The logic behind a decision tree can be easily understood because it shows a tree-like structure.

Decision Tree Terminologies


Root Node: Root node is from where the decision tree starts. It represents the
entire dataset, which further gets divided into two or more homogeneous
sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be
segregated further after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into
sub-nodes according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the
tree.
Parent/Child node: The root node of the tree is called the parent node, and
other nodes are called the child nodes.
How does the Decision Tree algorithm Work?

In a decision tree, to predict the class of a given dataset, the algorithm starts from the root node of the tree. It compares the value of the root attribute with the corresponding attribute of the record (the real dataset) and, based on the comparison, follows a branch and jumps to the next node.

For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues this process until it reaches a leaf node of the tree. The complete process can be better understood using the algorithm below:

○ Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
○ Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
○ Step-3: Divide S into subsets that contain the possible values of the best attribute.
○ Step-4: Generate the decision tree node that contains the best attribute.
○ Step-5: Recursively make new decision trees using the subsets of the dataset created in Step 3. Continue this process until a stage is reached where the nodes cannot be classified further; these final nodes are called leaf nodes.

Example: Suppose a candidate has a job offer and wants to decide whether to accept it or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by ASM). The root node splits further into the next decision node (distance from the office) and one leaf node based on the corresponding labels. The next decision node further splits into one decision node (cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes (Accepted offer and Declined offer). Consider the below diagram:
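The diagram is not reproduced here, but the job-offer example can be sketched roughly in code. The snippet below is an assumed illustration (with invented salary/distance/cab values) using scikit-learn's CART-based DecisionTreeClassifier; it prints the learned rules and classifies a new offer.

```python
# A rough sketch of the job-offer example; feature values and labels are invented.
from sklearn.tree import DecisionTreeClassifier, export_text

# features: [salary_in_lakhs, distance_from_office_km, cab_facility (1=yes, 0=no)]
X = [[3, 5, 0], [8, 25, 0], [9, 4, 1], [10, 30, 1], [12, 8, 0], [4, 2, 1]]
y = ["Declined", "Declined", "Accepted", "Accepted", "Accepted", "Declined"]

tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["salary", "distance", "cab_facility"]))  # the IF-THEN rules
print(tree.predict([[9, 10, 1]]))   # classify a new offer
```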

Attribute Selection Measures

While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure, or ASM. With this measure, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:

○ Information Gain
○ Gini Index

1. Information Gain:
○ Information gain is the measurement of changes in entropy after the
segmentation of a dataset based on an attribute.
○ It calculates how much information a feature provides us about a class.
○ According to the value of information gain, we split the node and build
the decision tree.
○ A decision tree algorithm always tries to maximize the value of information gain, and the node/attribute with the highest information gain is split first. It can be calculated using the formula below:

Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It
specifies randomness in data. Entropy can be calculated as:

Entropy(S) = −P(yes) · log2 P(yes) − P(no) · log2 P(no)

Where,

○ S = total number of samples
○ P(yes) = probability of yes
○ P(no) = probability of no

2. Gini Index:
○ Gini index is a measure of impurity or purity used while creating a
decision tree in the CART(Classification and Regression Tree) algorithm.
○ An attribute with a low Gini index should be preferred over one with a high Gini index.
○ It only creates binary splits, and the CART algorithm uses the Gini index
to create binary splits.
○ Gini index can be calculated using the below formula:

Gini Index = 1 − Σj (Pj)²
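As a small worked sketch (the class counts are invented, not from the notes), the snippet below computes entropy, information gain, and the Gini index for a toy yes/no split using the formulas above.

```python
# A small sketch computing the two attribute selection measures for a toy split.
import numpy as np

def entropy(p_yes):
    """Entropy(S) = -P(yes)*log2(P(yes)) - P(no)*log2(P(no))."""
    p_no = 1 - p_yes
    terms = [p * np.log2(p) for p in (p_yes, p_no) if p > 0]
    return -sum(terms)

def gini(p_yes):
    """Gini = 1 - sum_j P_j^2 for the two classes yes/no."""
    return 1 - (p_yes ** 2 + (1 - p_yes) ** 2)

# parent node: 9 yes / 5 no; a split produces children with 6/2 and 3/3
parent = entropy(9 / 14)
weighted_children = (8 / 14) * entropy(6 / 8) + (6 / 14) * entropy(3 / 6)
print("Information gain:", parent - weighted_children)
print("Gini of parent node:", gini(9 / 14))
```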


Pruning: Getting an Optimal Decision Tree
Pruning is the process of deleting unnecessary nodes from a tree in order to get the optimal decision tree.

A tree that is too large increases the risk of overfitting, while a tree that is too small may not capture all the important features of the dataset. A technique that decreases the size of the learning tree without reducing accuracy is known as pruning. There are mainly two types of tree pruning techniques:

○ Cost Complexity Pruning
○ Reduced Error Pruning

Advantages of the Decision Tree


○ It is simple to understand, as it follows the same process a human follows when making a decision in real life.
○ It can be very useful for solving decision-related problems.
○ It helps to think about all the possible outcomes of a problem.
○ It requires less data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


○ The decision tree contains lots of layers, which makes it complex.
○ It may have an overfitting issue, which can be resolved using the
Random Forest algorithm.
○ For more class labels, the computational complexity of the decision tree
may increase.
Ensemble Learning
Ensemble means ‘a collection of things’ and in Machine
Learning terminology, Ensemble learning refers to the
approach of combining multiple ML models to produce a
more accurate and robust prediction compared to any
individual model. It implements an ensemble of fast
algorithms (classifiers) such as decision trees for learning and
allows them to vote.

What is Ensemble Learning with examples?

Ensemble learning is a machine learning technique that combines the


predictions from multiple individual models to obtain a better
predictive performance than any single model. The basic idea behind
ensemble learning is to leverage the wisdom of the crowd by
aggregating the predictions of multiple models, each of which may
have its own strengths and weaknesses. This can lead to improved
performance and generalization.

Ensemble learning can be thought of as compensating for weak learning algorithms by performing extra computation. An ensemble is computationally more expensive than a single model, but it is often more effective than a single non-ensemble model that has gone through a lot of training. In this article, we will have a comprehensive overview of the importance of ensemble learning and how it works, different types of ensemble classifiers, advanced ensemble learning techniques, and some algorithms (such as random forest and XGBoost) for better clarification of the common ensemble classifiers, and finally their uses in the technical world.

Several individual base models (experts) are fitted to learn from the
same data and produce an aggregation of output based on which a
final decision is taken. These base models can be machine learning
algorithms such as decision trees (mostly used), linear models, support
vector machines (SVM), neural networks, or any other model that is
capable of making predictions.

The most commonly used ensembles include techniques such as Bagging, used to build the Random Forest algorithm, and Boosting, used to build algorithms such as AdaBoost and XGBoost.

Ensemble Learning Techniques

Gradient Boosting Machines (GBM): Gradient Boosting is a popular


ensemble learning technique that sequentially builds a group of
decision trees and corrects the residual errors made by previous trees,
enhancing its predictive accuracy. It trains each new weak learner to fit
the residuals of the previous ensemble's predictions thus making it less
sensitive to individual data points or outliers in the data.

Extreme Gradient Boosting (XGBoost): XGBoost features tree pruning,


regularization, and parallel processing, which makes it a preferred
choice for data scientists seeking robust and accurate predictive
models.

CatBoost: It is designed to handle categorical features natively, which eliminates the need for extensive pre-processing. CatBoost is known for its high predictive accuracy, fast training, and automatic handling of overfitting.

Stacking: It combines the outputs of multiple base models by training a combiner (an algorithm that takes the predictions of the base models as input) and generates a more accurate prediction. Stacking allows for more flexibility in combining diverse models, and the combiner can be any machine learning algorithm.

Random Subspace Method (Random Subspace Ensembles): It is an


ensemble learning approach that improves the predictive accuracy by
training base models on random subsets of input features. It mitigates
overfitting and improves the generalization by introducing diversity in
the model space.
Random Forest Variants: They introduce variations in tree construction,
feature selection, or model optimization to enhance performance.

Selecting the right advanced ensemble technique depends on the


nature of the data, the specific problem trying to be solved, and the
computational resources available. It often requires experimentation
and changes to achieve the best results.

Algorithm based on Bagging and Boosting

Bagging Algorithm

Bagging is a supervised learning technique that can be used for both regression and classification tasks. Here is an overview of the steps of the Bagging classifier algorithm:

Bootstrap Sampling: Draw 'N' subsets from the original training data by randomly sampling rows with replacement. This step ensures that the base models are trained on diverse subsets of the data.

Base Model Training: For each bootstrapped sample, train a base model
independently on that subset of data. These weak models are trained in
parallel to increase computational efficiency and reduce time
consumption.

Prediction Aggregation: To make a prediction on testing data combine


the predictions of all base models. For classification tasks, it can include
majority voting or weighted majority while for regression, it involves
averaging the predictions.

Out-of-Bag (OOB) Evaluation: Some samples are excluded from the


training subset of particular base models during the bootstrapping
method. These “out-of-bag” samples can be used to estimate the
model’s performance without the need for cross-validation.

Final Prediction: After aggregating the predictions from all the base
models, Bagging produces a final prediction for each instance.
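A minimal sketch of these steps, assuming scikit-learn's BaggingClassifier (whose default base model is a decision tree); the synthetic dataset is generated only for illustration.

```python
# A minimal sketch of bagging: bootstrap sampling, parallel base-model
# training, vote aggregation, and out-of-bag (OOB) evaluation.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

bagging = BaggingClassifier(
    n_estimators=50,   # number of bootstrap samples / base decision trees
    oob_score=True,    # evaluate on the out-of-bag samples
    random_state=0,
).fit(X, y)

print("OOB accuracy estimate:", bagging.oob_score_)
print("majority-vote predictions:", bagging.predict(X[:3]))
```

The oob_score_ printed here corresponds to the out-of-bag evaluation step described above.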

Bagging Vs Boosting
We all use decision-tree-style reasoning in day-to-day life to make decisions, and organizations use supervised machine learning techniques like decision trees to make better decisions and to generate more surplus and profit.

Ensemble methods combine different decision trees to deliver better predictive results than using a single decision tree. The primary principle behind the ensemble model is that a group of weak learners come together to form a strong learner.

There are two techniques, given below, that are used to perform ensemble decision trees.

Bagging
Bagging is used when our objective is to reduce the variance of a decision tree. Here the concept is to create a few subsets of data from the training sample, chosen randomly with replacement. Each collection of subset data is used to train its own decision tree; thus, we end up with an ensemble of various models. The average of all the predictions from the numerous trees is used, which is more robust than a single decision tree.

Random Forest is an extension of bagging. It takes one additional step: in addition to taking a random subset of the data, it also takes a random selection of features, rather than using all features, to grow the trees. When we have numerous random trees, it is called a Random Forest.

These are the steps taken to implement a Random Forest:

○ Consider X observations and Y features in the training data set. First, a sample from the training data set is taken randomly with replacement.
○ A tree is grown to its largest extent.
○ The given steps are repeated, and the prediction is made based on the collection of predictions from the n trees.

Advantages of using the Random Forest technique:

○ It manages higher-dimensional data sets very well.
○ It handles missing values and maintains accuracy for missing data.

Disadvantages of using the Random Forest technique:

Since the final prediction is based on the mean of the predictions from the subset trees, it will not give precise continuous values for regression problems.

Boosting:
Boosting is another ensemble procedure for building a collection of predictors. In other words, we fit consecutive trees, usually on random samples, and at each step the objective is to reduce the net error from the prior trees.

If a given input is misclassified by a hypothesis, its weight is increased so that the next hypothesis is more likely to classify it correctly; combining the whole set at the end converts weak learners into better-performing models.

Gradient Boosting is an extension of the boosting procedure.

Gradient Boosting = Gradient Descent + Boosting

It uses a gradient descent algorithm that can optimize any differentiable loss function. An ensemble of trees is constructed one at a time, and the individual trees are summed sequentially. Each new tree tries to recover the loss (the difference between the actual and predicted values).
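The following is a hedged sketch (not from the notes) of gradient boosting for regression with scikit-learn, where each new tree is fitted against the residuals of the current ensemble and the default squared-error loss is the differentiable loss being optimized.

```python
# A minimal sketch: gradient boosting for regression on synthetic data.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbr = GradientBoostingRegressor(
    n_estimators=200,    # trees added one after another
    learning_rate=0.05,  # step size of the gradient descent on the loss
).fit(X_train, y_train)

print("test MSE:", mean_squared_error(y_test, gbr.predict(X_test)))
```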

Advantages of using Gradient Boosting methods:

○ It supports different loss functions.


○ It works well with interactions.

Disadvantages of using Gradient Boosting methods:

○ It requires cautious tuning of different hyper-parameters.


Difference between Bagging and Boosting:

Bagging | Boosting
------- | --------
Various training data subsets are randomly drawn with replacement from the whole training dataset. | Each new subset contains the components that were misclassified by previous models.
Bagging attempts to tackle the over-fitting issue. | Boosting tries to reduce bias.
If the classifier is unstable (high variance), then we need to apply bagging. | If the classifier is steady and straightforward (high bias), then we need to apply boosting.
Every model receives an equal weight. | Models are weighted by their performance.
The objective is to decrease variance, not bias. | The objective is to decrease bias, not variance.
It is the easiest way of combining predictions that belong to the same type. | It is a way of combining predictions that belong to different types.
Every model is constructed independently. | New models are affected by the performance of previously developed models.
Random Forest Algorithm
Random Forest is a popular machine learning algorithm that belongs to the
supervised learning technique. It can be used for both Classification and
Regression problems in ML. It is based on the concept of ensemble learning,
which is a process of combining multiple classifiers to solve a complex problem
and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of predictions, predicts the final output.

A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.

The below diagram explains the working of the Random Forest algorithm:
Assumptions for Random Forest
Since the random forest combines multiple trees to predict the class of the
dataset, it is possible that some decision trees may predict the correct output,
while others may not. But together, all the trees predict the correct output.
Therefore, below are two assumptions for a better Random forest classifier:

○ There should be some actual values in the feature variable of the


dataset so that the classifier can predict accurate results rather than a
guessed result.
○ The predictions from each tree must have very low correlations.

Why use Random Forest?


Below are some points that explain why we should use the Random Forest
algorithm:

○ It takes less training time compared to other algorithms.
○ It predicts output with high accuracy and runs efficiently even on large datasets.
○ It can also maintain accuracy when a large proportion of the data is missing.

How does Random Forest algorithm work?


Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions with each tree created in the first phase.

The Working process can be explained in the below steps and diagram:

Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points
(Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Step 1 & 2.


Step-5: For new data points, find the predictions of each decision tree, and assign
the new data points to the category that wins the majority votes.

The working of the algorithm can be better understood by the below example:

Example: Suppose there is a dataset that contains multiple fruit images. So, this
dataset is given to the Random forest classifier. The dataset is divided into
subsets and given to each decision tree. During the training phase, each decision
tree produces a prediction result, and when a new data point occurs, then based
on the majority of results, the Random Forest classifier predicts the final decision.
Consider the below image:
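The image is not reproduced here; as an assumed code sketch of the two phases (using invented synthetic data rather than the fruit images), scikit-learn's RandomForestClassifier builds N trees on random subsets and aggregates their votes.

```python
# A minimal sketch: build N trees on random subsets, then take the majority vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=12, n_informative=6, random_state=1)

forest = RandomForestClassifier(
    n_estimators=100,     # N, the number of decision trees
    max_features="sqrt",  # random subset of features considered at each split
    random_state=1,
).fit(X, y)

print("majority-vote class for one new point:", forest.predict(X[:1]))
print("vote proportions per class:", forest.predict_proba(X[:1]))
```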

Applications of Random Forest


There are mainly four sectors where Random Forest is mostly used:
1. Banking: The banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the
disease can be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest


○ Random Forest is capable of performing both Classification and
Regression tasks.
○ It is capable of handling large datasets with high dimensionality.
○ It enhances the accuracy of the model and prevents the overfitting
issue.

Disadvantages of Random Forest


○ Although random forest can be used for both classification and regression tasks, it is less suitable for regression tasks.

Note: The explanation given in class was referred from this URL:
https://livebook.manning.com/book/grokking-machine-learning/chapter-12/
Basic Statistics

Gathering Data
Gathering data is the first step in statistical analysis.

Say for example that you want to know something about all the people in
France.

The population is then all of the people in France.

It is too much effort to gather information about every member of a population (e.g. all 67+ million people living in France). It is often much easier to collect data from a smaller group of that population and analyze it. This group is called a sample.

A representative sample
The sample needs to be similar to the whole population of France. It should
have the same characteristics as the population. If you only include people
named Jacques living in Paris who are 48 years old, the sample will not be
similar to the whole population.

So for a good sample, you will need people from all over France, with different
ages, professions, and so on.

If the members of the sample have similar characteristics (like age, profession,
etc.) to the whole population of France, we say that the sample is
representative of the population.

A good representative sample is crucial for statistical methods.

Descriptive Statistics
The information (data) from your sample or population can be visualized with
graphs or summarized by numbers. This will show key information in a simpler
way than just looking at raw data. It can help us understand how the data is
distributed.
Graphs can visually show the data distribution.

Examples of graphs include:

● Histograms
● Pie charts
● Bar graphs
● Box plots

Some graphs have a close connection to numerical summary statistics.


Calculating those gives us the basis of these graphs.

For example, a box plot visually shows the quartiles of a data distribution.

Quartiles are the data split into four equal-sized parts, or quarters. A quartile is one type of summary statistic.

Summary statistics
Summary statistics take a large amount of information and sum it up in a few key values.

Numbers calculated from the data also describe the shape of the distribution. These are individual 'statistics'.

Some important examples are:

● Mean, median and mode


● Range and interquartile range
● Quartiles and percentiles
● Standard deviation and variance
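As a small sketch (the age values are invented), the snippet below computes the summary statistics listed above with NumPy.

```python
# A small sketch: mean, median, mode, range, quartiles/IQR, variance, std dev.
import numpy as np

ages = np.array([23, 25, 25, 27, 29, 31, 34, 35, 38, 41, 45, 52])

print("mean:", ages.mean())
print("median:", np.median(ages))
values, counts = np.unique(ages, return_counts=True)
print("mode:", values[counts.argmax()])                 # most frequent value
print("range:", ages.max() - ages.min())
q1, q3 = np.percentile(ages, [25, 75])
print("quartiles Q1, Q3:", q1, q3, "IQR:", q3 - q1)
print("variance:", ages.var(ddof=1), "standard deviation:", ages.std(ddof=1))
```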

Statistical Inference
Statistics from the sample data are used to draw conclusions about the whole population. This is a type of statistical inference.

Probability theory is used to calculate the certainty that those statistics also
apply to the population.
When using a sample, there will always be some uncertainty about what the
data looks like for the population.

Uncertainty is often expressed as confidence intervals.

Confidence intervals are numerical ways of showing how likely it is that the true
value of this statistic is within a certain range for the population.

Hypothesis testing is another way of checking whether a statement about a population is true. More precisely, it checks how likely it is that a hypothesis is true, based on the sample data.

Some examples of statements or questions that can be checked with hypothesis testing:

● Are people in the Netherlands taller than people in Denmark?
● Do people prefer Pepsi or Coke?
● Does a new medicine cure a disease?

Causal Inference
Causal inference is used to investigate if something causes another thing.

For example: Does rain make plants grow?

If we think two things are related we can investigate to see if they correlate.
Statistics can be used to find out how strong this relation is.

Even if things are correlated, finding out whether one thing is caused by another can be difficult. It can be done with good experimental design or other special statistical techniques.

The terms 'population' and 'sample' are important in statistics and refer to
key concepts that are closely related.

Population and Samples


Population: Everything in the group that we want to learn about.
Sample: A part of the population.

Examples of populations and a sample from those populations:

Population | Sample
---------- | ------
All of the people in Germany | 500 Germans
All of the customers of Netflix | 300 Netflix customers
Every car manufacturer | Tesla, Toyota, BMW, Ford

For good statistical analysis, the sample needs to be as "similar" as possible to


the population. If they are similar enough, we say that the sample is
representative of the population.

The sample is used to make conclusions about the whole population. If the
sample is not similar enough to the whole population, the conclusions could be
useless.

Parameters and Statistics


Parameter: A number that describes something about the whole population.

Sample statistic: A number that describes something about the sample.

The parameters are the key things we want to learn about. The parameters are
usually unknown.

Sample statistics give us estimates of parameters.

There will always be some uncertainty about how accurate estimates are. More
certainty gives us more useful knowledge.
For every parameter we want to learn about we can get a sample and calculate
a sample statistic, which gives us an estimate of the parameter.

Some Important Examples

Parameter | Sample statistic
--------- | ----------------
Mean | Sample mean
Median | Sample median
Mode | Sample mode
Variance | Sample variance
Standard deviation | Sample standard deviation

Mean, median and mode are different types of averages (typical values in a
population).

For example:

● The typical age of people in a country


● The typical profits of a company
● The typical range of an electric car

Variance and standard deviation are two types of values describing how spread
out the values are.
A single class of students in a school would usually be about the same age. The
age of the students will have low variance and standard deviation.

A whole country will have people of all kinds of different ages. The variance and
standard deviation of age in the whole country would then be bigger than in a
single school grade.

Different Types of Sampling Methods


Random Sampling
A random sample is where every member of the population has an equal chance
to be chosen.

Random sampling is the best. But, it can be difficult, or impossible, to make


sure that it is completely random.

Note: Every other sampling method is compared to how close it is to a random sample -
the closer, the better.

Convenience Sampling
A convenience sample is where the participants that are the easiest to reach are
chosen.

Note: Convenience sampling is the easiest to do.

In many cases this sample will not be similar enough to the population, and the
conclusions can potentially be useless.

Systematic Sampling
A systematic sample is where the participants are chosen by some regular
system.
For example:

● The first 30 people in a queue


● Every third on a list
● The first 10 and the last 10

Stratified Sampling
A stratified sample is where the population is split into smaller groups called
'strata'.

The 'strata' can, for example, be based on demographics, like:

● Different age groups


● Professions

Stratification of a sample is the first step. Another sampling method (like


random sampling) is used for the second step of choosing participants from all
of the smaller groups (strata).

Clustered Sampling
A clustered sample is where the population is split into smaller groups called
'clusters'.

The clusters are usually natural, like different cities in a country.

The clusters are chosen randomly for the sample.

All members of the clusters can participate in the sample, or members can be
chosen randomly from the clusters in a third step.

Different types of data


There are two main types of data: Qualitative (or 'categorical') and quantitative
(or 'numerical'). These main types also have different sub-types depending on
their measurement level.

Qualitative Data
Information about something that can be sorted into different categories that
can't be described directly by numbers.

Examples:

● Brands
● Nationality
● Professions

With categorical data we can calculate statistics like proportions. For example,
the proportion of Indian people in the world, or the percent of people who prefer
one brand to another.

Quantitative Data
Information about something that is described by numbers.

Examples:

● Income
● Age
● Height

With numerical data we can calculate statistics like the average income in a
country, or the range of heights of players in a football team.

Measurement Levels
The main types of data are Qualitative (categories) and Quantitative
(numerical). These are further split into the following measurement levels.

These measurement levels are also called measurement 'scales'

Nominal Level
Categories (qualitative data) without any order.

Examples:

● Brand names
● Countries
● Colors

Ordinal level
Categories that can be ordered (from low to high), but the precise "distance"
between each is not meaningful.

Examples:

● Letter grade scales from F to A


● Military ranks
● Level of satisfaction with a product

Consider letter grades from F to A: Is the grade A precisely twice as good as a B? And is the grade B twice as good as a C?

Exactly how much distance there is between grades is not clear or precise. If the grades are based on the number of points on a test, you can say that there is a precise "distance" on the point scale, but not between the grades themselves.

Interval Level
Data that can be ordered and the distance between them is objectively
meaningful. But there is no natural 0-value where the scale originates.

Examples:

● Years in a calendar
● Temperature measured in Fahrenheit

Note: Interval scales are usually invented by people, like degrees of temperature.

0 degrees Celsius is 32 degrees Fahrenheit. There are consistent distances between degrees (for every 1 extra degree Celsius, there are 1.8 extra degrees Fahrenheit), but the two scales do not agree on where 0 degrees is.

Ratio Level
Data that can be ordered and there is a consistent and meaningful distance
between them. And it also has a natural 0-value.

Examples:

● Money
● Age
● Time

Data that is on the ratio level (or "ratio scale") gives us the most detailed
information. Crucially, we can compare precisely how big one value is compared
to another. This would be the ratio between these values, like twice as big, or
ten times as small.

Gaussian Mixture Models

A Gaussian mixture model (GMM) is a probabilistic model which assumes that the data points are generated from a mixture of several Gaussian distributions with unknown parameters. Mixture models are an extension of k-means clustering that incorporates information not only about the locations of the latent Gaussian centers but also about the covariance structure of the data.

For Gaussian mixture model estimation, scikit-learn provides a range of training options that correspond to the different estimation methodologies outlined below.

Gaussian Mixture
○ The GaussianMixture object implements the expectation-maximization (EM) algorithm for fitting a mixture of Gaussian models.
○ Additionally, it can compute the Bayesian Information Criterion (BIC) to determine how many clusters there are in the data, and it can create confidence ellipsoids for multivariate models.
○ A Gaussian mixture model can be learned from training data using the GaussianMixture.fit method.
○ Given test data, the GaussianMixture.predict method can assign each sample to the Gaussian it most likely belongs to.
The GaussianMixture class offers several options (spherical, diagonal, tied, and full covariance) to constrain the covariance of the different classes being estimated.
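A minimal sketch of this API (synthetic data; the exact parameter choices are assumptions of this illustration): fit runs EM, predict assigns each sample to its most likely Gaussian, covariance_type selects one of the four covariance constraints, and bic helps choose the number of components.

```python
# A minimal sketch of scikit-learn's GaussianMixture on synthetic blob data.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
labels = gmm.predict(X)            # most likely component for each sample
print("component means:\n", gmm.means_)
print("BIC:", gmm.bic(X))          # a lower BIC suggests a better number of clusters
```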

GMM covariances
○ A variety of covariance types can be demonstrated with Gaussian mixture models.
○ For further details on the estimator, refer to the Gaussian mixture models documentation.
○ Even though GMM is frequently employed for clustering, we can compare the resulting clusters with the dataset's real classes.
○ To ensure the validity of this comparison, we initialize the Gaussian means using the class means from the training set.
○ On the iris dataset, we use several GMM covariance types and plot the predicted labels on both the training and held-out test data.
○ We compare GMMs with spherical, diagonal, full, and tied covariance matrices; full covariance is expected to perform best overall.
○ The plots display the test data as crosses and the training data as dots. The iris dataset is four-dimensional, and because only the first two dimensions are displayed, some points appear separated only by the additional dimensions.

Pros and Cons of the GaussianMixture class:

Pros:

Speed: It is the fastest method for learning mixture models.

Cons:

Singularities: When there are not enough points per mixture component, estimating the covariance matrices becomes difficult, and the algorithm is known to diverge and find solutions with infinite likelihood unless the covariances are artificially regularized.

Number of components: In the absence of external cues, this algorithm will always use all the components it has access to, so deciding how many components to use requires theoretical information criteria or held-out data.

Estimation algorithm: expectation-maximization

○ Expectation-maximization is a well-founded statistical approach for circumventing this issue through an iterative procedure.
○ One first assumes random components (randomly centered around the origin, initialized from k-means, or even just placed randomly around the data points) and computes, for each data point, the probability that it was generated by each component of the model.
○ The parameters are then adjusted to maximize the likelihood of the data given those assignments. Repeating this process is guaranteed to converge to a local optimum.

Normal or Gaussian Distribution

A large number of real-world datasets can be represented by Gaussian distributions, either univariate or multivariate. It therefore makes sense to assume that the clusters come from different Gaussian distributions; in other words, the model attempts to represent the dataset as a mixture of several Gaussian distributions.

Algorithm of Expectation-Maximization
EM calculates maximum-likelihood estimates for model parameters in cases where the data is missing some information, is incomplete, or contains hidden (latent) variables. EM starts by choosing some random values for the missing data points and uses them to estimate a first set of parameters; these values are then used recursively to produce better estimates by filling in the gaps.

The estimation (E-step) and maximization (M-step) steps of the Expectation-Maximization (EM) algorithm are the two crucial processes that are carried out iteratively.

Estimation Step (E-step):

The first part of the E-step is initializing the model's parameters: the means (μk), covariance matrices (Σk), and mixing coefficients (πk). The probabilities that each data point belongs to each component are frequently expressed using the latent variables (responsibilities) γk.
After that, using the current parameter values, we estimate the values of the latent variables γk.

Maximization Step (M-step)

○ Using the estimated latent variables γk, we update the parameter values μk, Σk, and πk in the maximization step.
○ The means and covariances are updated by taking the weighted average of the data points, weighted by the corresponding latent variable probabilities.
○ The mixing coefficients πk are updated by averaging the latent-variable probabilities associated with each component.

Repeat the E-step and M-step until convergence.

○ In essence, the latent variables are updated based on the current parameter values in the estimation step.
○ In the maximization step, we update the parameter values using the estimated latent variables. We keep repeating this until the model converges.
○ The steps mentioned above are specific to GMMs, but the general idea of the estimation and maximization steps applies to any model that employs the EM algorithm.
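For reference, a compact statement of the two steps in standard notation (this notation is an addition of this writeup, not spelled out in the notes): γ_nk are the responsibilities, μk, Σk, πk the component parameters, N the number of data points, and N_k the effective number of points assigned to component k.

```latex
% E-step: responsibility of component k for data point x_n
\gamma_{nk} = \frac{\pi_k \,\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}
                   {\sum_{j=1}^{K} \pi_j \,\mathcal{N}(x_n \mid \mu_j, \Sigma_j)}

% M-step: re-estimate parameters from the weighted data, with N_k = \sum_n \gamma_{nk}
\mu_k = \frac{1}{N_k}\sum_n \gamma_{nk}\, x_n, \qquad
\Sigma_k = \frac{1}{N_k}\sum_n \gamma_{nk}\,(x_n-\mu_k)(x_n-\mu_k)^{\top}, \qquad
\pi_k = \frac{N_k}{N}
```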

K-Nearest Neighbor (KNN) Algorithm for Machine Learning
○ K-Nearest Neighbour is one of the simplest machine learning algorithms, based on the supervised learning technique.
○ The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
○ The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
○ K-NN algorithm can be used for Regression as well as for Classification
but mostly it is used for the Classification problems.
○ K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
○ It is also called a lazy learner algorithm because it does not learn from
the training set immediately; instead, it stores the dataset and performs
an action on it at the time of classification.
○ At the training phase, the KNN algorithm just stores the dataset, and when
it gets new data, it classifies that data into the category that is most
similar to the new data.
○ Example: Suppose we have an image of a creature that looks similar to a
cat and a dog, and we want to know whether it is a cat or a dog. For this
identification, we can use the KNN algorithm, as it works on a similarity
measure. Our KNN model will find the features of the new image that are
most similar to the cat and dog images and, based on the most similar
features, will put it in either the cat or the dog category.

Why do we need a K-NN Algorithm?


Suppose there are two categories, Category A and Category B, and we have a new
data point x1: in which of these categories will this data point lie? To solve
this type of problem, we need a K-NN algorithm. With the help of K-NN, we can
easily identify the category or class of a particular data point. Consider the
below diagram:
How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:

○ Step-1: Select the number K of the neighbors


○ Step-2: Calculate the Euclidean distance of K number of neighbors
○ Step-3: Take the K nearest neighbors as per the calculated Euclidean
distance.
○ Step-4: Among these k neighbors, count the number of data points in each
category.
○ Step-5: Assign the new data point to the category for which the number of
neighbors is maximum.
○ Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required
category. Consider the below image:
○ Firstly, we will choose the number of neighbors, so we will choose the
k=5.
○ Next, we will calculate the Euclidean distance between the data points.
The Euclidean distance is the distance between two points, which we have
already studied in geometry. For two points A(x1, y1) and B(x2, y2), it
can be calculated as: √((x2 − x1)² + (y2 − y1)²)
○ By calculating the Euclidean distance, we get the nearest neighbors: three
nearest neighbors in category A and two nearest neighbors in category B.
Consider the below image:
○ As we can see, the three nearest neighbors are from category A; hence this
new data point must belong to category A.
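
The steps above can be reproduced with a few lines of code. The following is a
minimal sketch assuming scikit-learn is available; the training points, labels,
and the query point are made up for illustration and are not the ones in the
figures above.

```python
# Minimal K-NN classification sketch (scikit-learn assumed available).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two categories (A = 0, B = 1) in a 2-D feature space (illustrative values).
X_train = np.array([[1, 2], [2, 3], [3, 3],    # Category A
                    [6, 5], [7, 7], [8, 6]])   # Category B
y_train = np.array([0, 0, 0, 1, 1, 1])

# Step-1: choose K (here K = 5); Euclidean distance is the default metric.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)

# Steps 2-5 (distances, neighbor counting, majority vote) happen inside predict().
new_point = np.array([[3, 4]])
print("Predicted category:", knn.predict(new_point)[0])
print("Neighbor vote proportions:", knn.predict_proba(new_point)[0])
```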

How to select the value of K in the K-NN Algorithm?


Below are some points to remember while selecting the value of K in the K-NN
algorithm:

○ There is no particular way to determine the best value for "K", so we


need to try some values to find the best out of them. The most
preferred value for K is 5.
○ A very low value for K such as K=1 or K=2, can be noisy and lead to the
effects of outliers in the model.
○ Large values for K can smooth out the effect of noise, but if K is too
large the model may overlook smaller local patterns in the data.
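
One common practical way to pick K, sketched below, is to try several candidate
values and compare their cross-validated accuracy. This is a minimal illustration
assuming scikit-learn is available; the Iris dataset and the 1-15 range are
arbitrary choices for the example.

```python
# Sketch: choosing K for K-NN by cross-validated accuracy (scikit-learn assumed).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds for this candidate K.
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("Best K:", best_k, "with accuracy", round(scores[best_k], 3))
```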

Advantages of KNN Algorithm:


○ It is simple to implement.
○ It is robust to noisy training data.
○ It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
○ It always needs the value of K to be determined, which may be complex at
times.
○ The computation cost is high because the distance to all the training
samples must be calculated for every prediction.

K-Means Clustering Algorithm


K-Means Clustering is an unsupervised learning algorithm that is used to solve
clustering problems in machine learning or data science. In this topic, we will
learn what the K-means clustering algorithm is and how it works, along with the
Python implementation of k-means clustering.

What is K-Means Algorithm?


K-Means Clustering is an Unsupervised Learning algorithm, which groups the
unlabeled dataset into different clusters. Here K defines the number of
pre-defined clusters that need to be created in the process, as if K=2, there will be
two clusters, and for K=3, there will be three clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different
clusters in such a way that each data point belongs to only one group, and the
points within a group have similar properties.
It allows us to cluster the data into different groups and provides a convenient
way to discover the categories of groups in an unlabeled dataset on its own,
without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid.


The main aim of this algorithm is to minimize the sum of distances between the
data points and their corresponding cluster centroids.

The algorithm takes the unlabeled dataset as input, divides the dataset into
k clusters, and repeats the process until it finds the best clusters. The value
of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

○ Determines the best value for K center points or centroids by an


iterative process.
○ Assigns each data point to its closest k-center. The data points that are
nearest to a particular k-center form a cluster.

Hence each cluster has data points with some commonalities and is away from the
other clusters.

The below diagram explains the working of the K-means Clustering Algorithm:

How does the K-Means Algorithm Work?


The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points or centroids. (They can be points other than those
from the input dataset.)

Step-3: Assign each data point to their closest centroid, which will form the
predefined K clusters.

Step-4: Calculate the variance and place a new centroid of each cluster.

Step-5: Repeat the third step, which means reassign each data point to the new
closest centroid of its cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.
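
Before the visual walkthrough, here is how these steps look in code. This is a
minimal sketch assuming scikit-learn is available; the data points and K=2 are
chosen only for illustration.

```python
# Minimal K-means sketch (scikit-learn assumed available).
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled 2-D points forming two rough groups (illustrative values).
X = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2],
              [8.0, 8.0], [8.5, 7.5], [9.0, 8.2]])

# Step-1: choose K. Steps 2-6 (centroid initialization, assignment,
# centroid update, repeat until no reassignment) run inside fit().
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

print("Cluster labels:", kmeans.labels_)
print("Final centroids:\n", kmeans.cluster_centers_)
```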


Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two
variables is given below:

○ Let's take the number of clusters K=2, so we will try to group the
dataset into two different clusters.
○ We need to choose some random K points or centroids to form the clusters.
These points can be either points from the dataset or any other points.
So, here we are selecting the below two points as K points, which are not
part of our dataset. Consider the below image:

○ Now we will assign each data point of the scatter plot to its closest
K-point or centroid. We compute this by calculating the distance between
two points, as studied in geometry, and then drawing a median line
between the two centroids. Consider the below image:

From the above image, it is clear that points on the left side of the line are
nearer to the K1 or blue centroid, and points to the right of the line are closer
to the yellow centroid. Let's color them blue and yellow for clear visualization.
○ As we need to find the closest cluster, we will repeat the process by
choosing new centroids. To choose the new centroids, we compute the
center of gravity of the points in each cluster and place the new
centroids there, as below:

○ Next, we will reassign each data point to the new centroids. For this, we
will repeat the same process of finding a median line. The new median line
will be as in the below image:

From the above image, we can see that one yellow point is on the left side of the
line, and two blue points are to the right of the line. So, these three points
will be assigned to the new centroids.
As reassignment has taken place, we will again go to step-4, which is finding new
centroids or K-points.
○ We will repeat the process by finding the center of gravity of each
cluster's points, so the new centroids will be as shown in the below image:
○ As we have the new centroids, we will again draw the median line and
reassign the data points. The result is shown in the image:

○ We can see in the above image that no data points change sides of the
line, which means our model is formed. Consider the below image:

As our model is ready, we can now remove the assumed centroids, and the two final
clusters will be as shown in the below image:
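
To make the assign-and-update loop in the walkthrough concrete, here is a
bare-bones NumPy version of it. It is an illustrative sketch, not the
implementation referred to in the text; the function name, the data, and the
convergence check are my own choices, and empty-cluster handling is omitted
for brevity.

```python
# Illustrative NumPy sketch of the K-means assign/update loop.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Pick k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the center of gravity of its points.
        # (Empty-cluster handling omitted for brevity.)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # no reassignment occurs: the model is ready
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2],
              [8.0, 8.0], [8.5, 7.5], [9.0, 8.2]])
labels, centroids = kmeans(X, k=2)
print("Labels:", labels)
print("Centroids:\n", centroids)
```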
How to choose the value of "K number of clusters" in
K-means Clustering?

The performance of the K-means clustering algorithm depends on how cohesive and
well-separated the clusters it forms are. But choosing the optimal number of
clusters is a big task. There are different ways to find the optimal number of
clusters, but here we discuss the most appropriate method to find the number of
clusters, or the value of K. The method is given below:

Elbow Method
The Elbow method is one of the most popular ways to find the optimal number
of clusters. This method uses the concept of WCSS value. WCSS stands for Within
Cluster Sum of Squares, which defines the total variations within a cluster. The
formula to calculate the value of WCSS (for 3 clusters) is given below:

WCSS = ∑(Pi in Cluster1) distance(Pi, C1)² + ∑(Pi in Cluster2) distance(Pi, C2)²
       + ∑(Pi in Cluster3) distance(Pi, C3)²

In the above formula of WCSS,

∑(Pi in Cluster1) distance(Pi, C1)² is the sum of the squared distances between
each data point Pi in Cluster1 and its centroid C1, and the same holds for the
other two terms.

To measure the distance between data points and centroid, we can use any
method such as Euclidean distance or Manhattan distance.

To find the optimal value of clusters, the elbow method follows the below steps:

○ It executes the K-means clustering on a given dataset for different K


values (ranges from 1-10).
○ For each value of K, calculates the WCSS value.
○ Plots a curve between calculated WCSS values and the number of
clusters K.
○ The sharp point of bend in the plot (where the curve looks like an arm)
is considered the best value of K.

Since the graph shows a sharp bend that looks like an elbow, the method is known
as the elbow method. The graph for the elbow method looks like the below image:
Note: We can choose the number of clusters equal to the number of data points. If
we do so, the value of WCSS becomes zero, and that will be the endpoint of the
plot.
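
The elbow curve itself is easy to produce in code. The following is a minimal
sketch assuming scikit-learn and matplotlib are available; the synthetic
make_blobs dataset and the 1-10 range are example choices only (scikit-learn
exposes the WCSS of a fitted model as the inertia_ attribute).

```python
# Sketch: elbow method for choosing K (scikit-learn and matplotlib assumed available).
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic example data with a few obvious groups.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

wcss = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)  # within-cluster sum of squares for this K

plt.plot(list(k_values), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()  # the 'elbow' in the curve indicates the best K
```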
