
UNIT 4

Supervised Learning Classification


Supervised learning is a type of machine learning where the algorithm learns from labeled data.

In other words, during the learning process, the algorithm is given input data along with the
correct output, or label.

The goal of supervised learning is to learn a mapping from inputs to outputs, so the model can
predict the output for new, unseen data.

What is Classification in Supervised Learning?


Classification is one of the main tasks in supervised learning. It involves predicting a
categorical label or class for an input.

The model learns from examples where both the input data and the correct class label are
provided.

Once trained, the model can classify new, unseen data into one of the categories it learned from
the training data.

For example, in a medical diagnosis task, the goal of classification could be to predict whether a
patient has a certain disease based on their symptoms.

The input features (data) could include things like age, blood pressure, temperature, and heart
rate, while the output (label) would be either "disease present" or "disease not present."

Types of Classification Problems


1. Binary Classification:
This is the simplest form of classification, where there are only two possible classes. Common
examples include:

Email spam detection (spam or not spam).

Medical diagnosis (disease or no disease).

Credit card fraud detection (fraudulent or non-fraudulent transaction).

2. Multiclass Classification:
This involves more than two classes. The goal is to classify an input into one of multiple classes.
Examples include:

Handwritten digit recognition (digits 0-9).

News article categorization (e.g., politics, sports, entertainment, etc.).

Animal species classification (cat, dog, bird, etc.).

3. Multilabel Classification:
In multilabel classification, each instance can belong to more than one class at the same time.

For example:

Movie genre classification (a movie can be both "action" and "comedy").

Image tagging, where an image can have multiple tags like "cat", "beach", "sunset".

Popular Classification Algorithms


1. Logistic Regression:

A linear model used for binary classification.

It models the probability that a given input belongs to a particular class using a logistic
function (sigmoid).

2. Decision Trees:

A flowchart-like structure where each internal node represents a feature, each branch
represents a decision rule, and each leaf node represents a class label.

Simple to interpret and visualize, but prone to overfitting.

3. Support Vector Machines (SVM):

SVMs find a hyperplane that best separates data points of different classes.

They can handle both linear and non-linear classification problems using kernel tricks.

4. K-Nearest Neighbors (KNN):

A non-parametric method where a data point is classified based on the majority class of its
K nearest neighbors.

Simple and intuitive, but can be computationally expensive for large datasets.

5. Random Forests:

An ensemble method that combines multiple decision trees to improve accuracy and reduce
overfitting.

It builds many decision trees and averages their predictions.

6. Neural Networks:

Neural networks consist of layers of interconnected nodes (neurons) and are capable of
learning complex patterns.

They are particularly effective for large datasets and tasks like image classification, natural
language processing, and speech recognition.

Evaluation Metrics for Classification


Accuracy: The proportion of correct predictions over all predictions.

Precision: The ratio of correctly predicted positive observations to the total predicted positives.
It answers the question: Of all the instances the model predicted as positive, how many were
actually positive?

Recall (Sensitivity): The ratio of correctly predicted positive observations to all actual positives.
It answers: Of all the actual positive instances, how many did the model identify?

F1-Score: The harmonic mean of precision and recall, useful when there is an imbalance
between classes (i.e., one class is more prevalent than the other).

ROC and AUC:

ROC (Receiver Operating Characteristic) curve plots the true positive rate (recall) against
the false positive rate.

AUC (Area Under the Curve) represents the overall performance of the model.
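
As a minimal sketch of these metrics (assuming scikit-learn, which the notes do not prescribe; the labels and scores below are invented for illustration):

```python
# Computing the metrics above with scikit-learn on made-up labels/scores.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]                    # actual classes
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                    # predicted classes
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]   # predicted probabilities for class 1

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))  # area under the ROC curve
```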

Challenges in Classification
1. Class Imbalance:
If one class is much more prevalent than others, the model may be biased toward the majority
class, resulting in poor performance on the minority class. Techniques like oversampling,
undersampling, or using specialized algorithms can help address this.

2. Overfitting:
If the model is too complex, it may perform well on the training data but poorly on unseen data.
This is called overfitting. Regularization techniques, such as pruning decision trees or adding
penalties to the loss function, can help prevent overfitting.

3. Feature Selection:
Choosing the right features is crucial for model performance. Irrelevant or redundant features
can reduce accuracy and increase computational complexity. Feature selection methods like
forward selection, backward elimination, and L1 regularization can be used.

4. Noise and Outliers:


Noisy or outlier data can mislead the model during training. Proper preprocessing and anomaly
detection methods can help mitigate these issues.

💡 Discuss the K-nearest neighbor algorithm with a suitable example. [APR 2023]

K-Nearest Neighbours Classification


K-Nearest Neighbors (KNN) is a supervised learning algorithm used for both classification and
regression, but it is most commonly used for classification.

It falls under the category of instance-based learning, meaning that it makes predictions based
on the instances (or examples) in the training dataset rather than constructing an explicit model.

KNN is often used for classification problems, where the goal is to categorize a new data point
into one of several classes based on the classes of its neighboring points in the training set.

The algorithm assumes that similar data points exist close to each other in space.

The idea is to predict the class of a data point by looking at the classes of its nearest neighbors
in the feature space.

Basic Idea:
KNN works by finding the "K" closest data points to a given data point and then making a
prediction based on the majority class of those closest neighbors.

The "neighbors" refer to the data points in the dataset that are closest to the point being
classified.

K is a parameter that defines how many neighbors should be considered when making the
classification decision.

The prediction for a new data point is made by finding the K closest data points from the
training dataset and assigning the most common class among these neighbors to the new point.

Steps to Implement KNN Classification:


1. Choose a value for K.

A small K (like 1 or 3) makes the model sensitive to noise in the data, while a large K makes it
more general but less sensitive to local variations.

2. Calculate the distance between the new data point and all points in the training set.

3. Identify the K closest data points to the new data point based on the calculated distances.

4. Determine the majority class among these K nearest neighbors.

5. Assign the class label of the majority to the new data point.

Example of KNN Classification:


Imagine you have a dataset of animals with features like "height" and "weight" and a class label
"type" (e.g., cat, dog, or rabbit).

Suppose you're trying to classify a new animal with a height of 15 cm and weight of 5 kg.

You choose K = 3.

The algorithm calculates the distance between this new animal and all the animals in the training
set.

It then selects the 3 closest animals (the nearest neighbors) and looks at their classes.

If two out of the three closest neighbors are labeled as "cat" and one is labeled as "dog," the
new animal will be classified as a "cat" because it has the majority class label.

Choosing the K Value:


The choice of K is critical:

A small K can make the model very sensitive to noise, potentially leading to overfitting.

A larger K makes the model more general, but it can smooth over the local patterns in the data,
potentially leading to underfitting.

Cross-validation is often used to select the optimal K value by testing different values and
checking which one performs best on unseen data.
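
A small sketch of this idea (scikit-learn assumed; the dataset is synthetic):

```python
# Selecting K by 5-fold cross-validation over a few candidate values.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

best_k, best_score = None, 0.0
for k in range(1, 16, 2):                              # try odd K values 1, 3, ..., 15
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X, y, cv=5).mean()    # mean CV accuracy
    if score > best_score:
        best_k, best_score = k, score
print("Best K:", best_k, "with CV accuracy", round(best_score, 3))
```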

Key Components of KNN:


Data Points: Each data point has features (input variables) and a class label (the target
variable).

Distance Metric: A way to measure how far apart data points are in the feature space. Common
distance metrics are:

Euclidean distance: The straight-line distance between two points.

Manhattan distance: The sum of the absolute differences of their coordinates.

K value: The number of nearest neighbors to check. It is a hyperparameter that needs to be chosen by the user.

Advantages of KNN:
Simplicity: It’s easy to understand and implement.

Non-Parametric: KNN doesn't make assumptions about the underlying data distribution (e.g.,
normal distribution), which is useful when the data doesn't follow a specific pattern.

Flexible: It can be used for both classification and regression tasks.

Disadvantages of KNN:
Computationally Expensive: As the size of the training dataset increases, the algorithm
requires more time to compute the distances for each new prediction. This is because the
algorithm has to compare the new data point with every point in the training set.

Storage: KNN requires storing all the training data, which can be memory-intensive, especially
for large datasets.

Sensitive to Irrelevant Features: If the data has many irrelevant or redundant features, the
algorithm’s performance may decrease because irrelevant features can distort the distance
calculations.

Sensitivity to K Value: The choice of K significantly affects the performance. Choosing an appropriate K is important.

Optimizations and Variations:


To improve KNN's performance, several techniques can be used:

Dimensionality Reduction: Since KNN can suffer from the "curse of dimensionality" (where the
distance between points becomes less meaningful as the number of features increases),
techniques like PCA (Principal Component Analysis) can be used to reduce the number of
features.

Choose an appropriate K value: This can be done through cross-validation or using techniques
like the "elbow method" to find the optimal K value.

Normalize or scale features: Since KNN is distance-based, the algorithm is sensitive to the
scale of the features. For example, features with larger ranges (e.g., income in thousands vs.
age in years) can dominate the distance calculations. Normalizing or standardizing the data can
improve the performance of KNN.

Use distance weighting: Instead of giving equal importance to all neighbors, you can give
closer neighbors more weight when making the classification decision. This can be achieved by
using a weighted distance function, where the vote of each neighbor is weighted by its
proximity to the query point.
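
A minimal sketch combining two of these optimizations, feature scaling and distance-weighted voting (scikit-learn and the Iris dataset are assumptions, chosen only for illustration):

```python
# Pipeline: standardise features, then distance-weighted KNN.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardising prevents large-range features from dominating the distance;
# weights="distance" gives closer neighbours more influence on the vote.
model = make_pipeline(StandardScaler(),
                      KNeighborsClassifier(n_neighbors=5, weights="distance"))
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```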

Support Vector Machine

Support Vector Machine (SVM) is a powerful and widely used algorithm for classification tasks
in machine learning.

It is particularly effective for binary classification, where the goal is to separate data into two
distinct classes.

SVM works by finding an optimal hyperplane that best divides the data into these classes.

A hyperplane is a decision boundary that separates the feature space into two regions, each
corresponding to a class.

The margin is defined as the distance between the hyperplane and the closest data points from
each class.

These closest points are called support vectors, and they play a critical role in defining the
optimal hyperplane.

SVM aims to maximize this margin because a larger margin is associated with better
generalization to unseen data.

In simple terms, by creating a wide gap between the classes, SVM reduces the chances of
misclassification on new, unseen data.

This is why SVM is often preferred for problems where the classes are well-separated, or close
to it.

Linear SVM
In its simplest form, SVM is used for linearly separable data, meaning the two classes can be
separated by a straight line (in two dimensions) or a hyperplane (in higher dimensions).

For linearly separable data, SVM works by finding the hyperplane that maximizes the margin
between the two classes.

Mathematically, the goal of SVM in the linear case is to find a hyperplane described by the
equation:
w ⋅ x + b = 0
Where:

w is the normal vector to the hyperplane.


x is a data point.

b is the bias term.

The support vectors are the data points that lie closest to this hyperplane, and these points are used to determine the optimal w and b that maximize the margin.
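
As a hedged sketch (scikit-learn assumed, toy data invented), a linear SVM can be fitted and the learned w and b of the hyperplane w ⋅ x + b = 0 read off directly:

```python
# Fit a linear SVM on two separable blobs and inspect w, b and the support vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("w =", clf.coef_[0])              # normal vector to the hyperplane
print("b =", clf.intercept_[0])         # bias term
print("support vectors:\n", clf.support_vectors_)
```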

Non-Linear SVM and Kernel Trick


In many real-world scenarios, data is not linearly separable, meaning the two classes cannot be
separated by a straight line or hyperplane.

To address this, SVM uses a technique called the kernel trick.

The kernel trick involves transforming the data into a higher-dimensional space where it
becomes linearly separable.

In this new space, SVM can find a hyperplane that effectively separates the classes.

The kernel function takes the original data and maps it into a higher-dimensional space.
Common types of kernels include:

1. Linear Kernel: No transformation, used when data is already linearly separable.

2. Polynomial Kernel: Maps the data into a higher-dimensional space using polynomial
functions.

3. Radial Basis Function (RBF) Kernel: One of the most popular kernels, it maps data into an
infinite-dimensional space, making it effective for complex datasets.

4. Sigmoid Kernel: Uses the sigmoid function to transform the data.

The use of kernels allows SVM to perform well on complex datasets with non-linear decision
boundaries by implicitly mapping the data into a higher-dimensional space without the need to
compute this mapping explicitly.
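
A brief sketch comparing these kernels on a non-linear dataset (scikit-learn and the half-moons dataset are assumptions; the exact accuracies will vary with the data):

```python
# Compare the four kernel types on a non-linearly separable dataset.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, "accuracy:", round(clf.score(X_test, y_test), 3))
```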

Advantages of SVM
SVM has several advantages, making it a popular choice for classification tasks:

1. Effective in High-Dimensional Spaces: SVM works well when the number of features
(dimensions) is large, making it suitable for text classification and image recognition tasks.

2. Memory Efficiency: SVM is memory efficient because it only relies on the support vectors to
define the decision boundary, rather than using all the training data.

3. Robust to Overfitting: SVM is less prone to overfitting compared to some other algorithms,
especially in high-dimensional spaces, due to the use of the margin for classification.

4. Works Well with Non-linear Data: The kernel trick allows SVM to handle complex, non-linear
decision boundaries effectively.

Disadvantages of SVM
Despite its advantages, SVM has some limitations:

1. Computational Complexity: The training time for SVM can be high, especially for large
datasets. The need to compute pairwise distances between all data points and optimize the
margin can be computationally expensive.

2. Sensitivity to Noise: SVM can be sensitive to noisy data and outliers, particularly if the value of
the regularization parameter C is set too high.

3. Choice of Kernel and Hyperparameters: The performance of SVM heavily depends on the
choice of kernel and the appropriate setting of hyperparameters like C and kernel parameters.
Tuning these parameters can be time-consuming and require cross-validation.

Applications of SVM
SVM is used in a wide range of applications due to its ability to handle both linear and non-linear
data. Some common applications include:

1. Image Classification: SVM is widely used in computer vision tasks, such as classifying images
of objects or handwriting recognition.

2. Text Classification: SVM is often used in natural language processing (NLP) for tasks like spam
email detection or sentiment analysis.

3. Bioinformatics: SVM is used in areas like gene classification and protein structure prediction.

4. Face Detection: SVM has been used in detecting faces in images, a crucial task in computer
vision.
Maximum Margin Linear Separators (Optimal Decision Boundary) in Support Vector Machines (SVM)
In machine learning, the goal of classification is to divide data into different classes or
categories.

In cases where the data is linearly separable (i.e., the classes can be separated by a straight
line or hyperplane in a multi-dimensional space), a linear classifier can be used.

Support Vector Machines (SVM) are a popular method for classification that focuses on finding
the optimal decision boundary to separate the classes.

1. What is a Linear Separator?


In a classification task, the goal is to separate different classes of data points.

A linear separator is a straight line (in two dimensions), a plane (in three dimensions), or a
hyperplane (in higher dimensions) that divides the data into different classes.

In binary classification, the objective is to find a hyperplane that separates two classes of
data.

A linear separator is called "linear" because it is a straight line (or a flat plane) in the feature
space.

For example, in 2D, a linear separator is simply a straight line that divides the data points of one
class from the other class.

In higher dimensions, it's a hyperplane, but the underlying concept is the same.

2. What is the Margin?

The margin is the distance between the linear separator (or hyperplane) and the closest data
points from either of the two classes.

These closest points are called support vectors.

The margin is important because:

A larger margin between the classes means that the classifier is more confident in its
predictions for new, unseen data.

The goal of SVM is to find the separator that maximizes this margin, as a wider margin
generally results in better generalization to new data.

The margin γ is given by the formula:


γ = 1 / ∥w∥


3. Maximizing the Margin (Optimal Decision Boundary):


The idea behind the maximum margin linear separator is to choose the line or hyperplane that
maximizes this margin.

The intuition is that a larger margin reduces the chances of misclassification in future data.

By focusing on the data points closest to the decision boundary (the support vectors), the
classifier can make more accurate predictions on unseen data.

Mathematically, maximizing the margin is equivalent to solving an optimization problem, where we want to maximize the margin subject to certain constraints.

4. Support Vectors:
Support vectors are the data points that are closest to the decision boundary (hyperplane).

These points are crucial because they directly influence the position and orientation of the
hyperplane.

The SVM algorithm uses only these support vectors to determine the optimal boundary.

Why are support vectors important?

The other data points (those far from the boundary) do not influence the position of the
hyperplane.

The support vectors are the most difficult to classify, so they define the boundary.

If you remove or change a support vector, the position of the decision boundary might
change significantly, while removing non-support vectors will not.

5. Formulation of the Optimal Hyperplane:


Equation of a Hyperplane:
In an n-dimensional feature space, the equation of a hyperplane can be written as:

w ⋅ x + b = 0
where:

w is the weight vector, which is perpendicular to the hyperplane.

x is the vector of input features (the data point).

b is the bias term, which helps shift the hyperplane away from the origin.

The goal is to find values of w and b that maximize the margin between the two classes.

Maximizing the Margin:


To maximize the margin, we need to minimize the following objective function:
Minimize (1/2) ||w||²

where:

||w|| is the norm (length) of the weight vector.

The factor of 1/2 is included to simplify calculations later.

The reason we minimize ||w|| is that the margin is inversely proportional to the magnitude of the
weight vector. By minimizing the weight vector’s magnitude, we maximize the margin.

Constraints:
The margin is maximized subject to the constraint that the data points must be correctly
classified. For each data point, we need the following conditions:

For class 1 data points:

w ⋅ xᵢ + b ≥ +1

For class 2 data points:

w ⋅ xᵢ + b ≤ −1

Here, xᵢ represents the data points, and the constraints ensure that all data points are on the correct side of the margin, with a margin of at least 1 from the hyperplane.

SVM as Constrained Optimization Problem


Quadratic Programming Solution To Find Maximum Margin Separators [Refer Textbook**]
In the context of Quadratic Programming (QP) for finding maximum margin separators in
Support Vector Machines (SVM), the goal is to optimize the hyperplane parameters that best
separate two classes of data points. This optimization is done by minimizing the norm of the
weight vector w while also ensuring that the margin constraints are satisfied. Here's a
breakdown of the process, using the Lagrange multiplier method.

1. Objective: Minimize the Norm of the Weight Vector


The SVM optimization problem can be formulated as a Quadratic Programming problem, where
the objective is to minimize the norm of the weight vector w, i.e.,

min_w (1/2) ∥w∥²

The factor 1/2 simplifies the calculations later on.
2. Subject to Margin Constraints
For a given training dataset with features xᵢ and corresponding labels yᵢ, we need to enforce the margin constraints. These constraints ensure that the data points are correctly classified with a gap (margin) between them. The constraints can be written as:

yᵢ (w ⋅ xᵢ + b) ≥ 1,  for all i

Here, w is the weight vector, b is the bias term, and yᵢ is the label (either +1 or -1).

3. Lagrangian Function
To solve this constrained optimization problem, we use the Lagrange multiplier method. The Lagrangian L(w, b, α) for this problem combines the objective function with the margin constraints, where αᵢ are the Lagrange multipliers associated with each constraint:

L(w, b, α) = (1/2) ∥w∥² − ∑ᵢ₌₁ⁿ αᵢ (yᵢ (w ⋅ xᵢ + b) − 1)

The term ∑ᵢ₌₁ⁿ αᵢ (yᵢ (w ⋅ xᵢ + b) − 1) represents the penalties for violations of the margin constraints. The αᵢ are non-negative values that adjust how strongly each constraint is enforced.

4. Taking Partial Derivatives


Now, to find the optimal values of w and b, we take the partial derivatives of the Lagrangian with
respect to w and b, and set them to zero.

Partial Derivative with respect to w:

∂L/∂w = w − ∑ᵢ₌₁ⁿ αᵢ yᵢ xᵢ = 0

Rearranging this, we get the relationship:

w = ∑ᵢ₌₁ⁿ αᵢ yᵢ xᵢ

This shows that the weight vector w is a linear combination of the support vectors xᵢ, weighted by the Lagrange multipliers αᵢ.

Partial Derivative with respect to b:

∂L/∂b = − ∑ᵢ₌₁ⁿ αᵢ yᵢ = 0

This implies that the sum of the Lagrange multipliers weighted by the labels must be zero:

∑ᵢ₌₁ⁿ αᵢ yᵢ = 0

5. Optimization Problem
Once we have the derivatives, we substitute these results into the Lagrangian to obtain the optimization problem in terms of the Lagrange multipliers αᵢ. The dual form of the optimization problem is:

max_α ∑ᵢ₌₁ⁿ αᵢ − (1/2) ∑ᵢ₌₁ⁿ ∑ⱼ₌₁ⁿ αᵢ αⱼ yᵢ yⱼ (xᵢ ⋅ xⱼ)

subject to the constraints:

αᵢ ≥ 0,  ∑ᵢ₌₁ⁿ αᵢ yᵢ = 0

6. Solving the Quadratic Programming Problem
This is a quadratic optimization problem in terms of the Lagrange multipliers αᵢ. The solution of this problem gives the optimal values of αᵢ, which can then be used to compute the optimal weight vector w and bias term b.
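
As a hedged illustration (scikit-learn assumed), a fitted SVC exposes the products αᵢ yᵢ for the support vectors as dual_coef_ and the support vectors themselves as support_vectors_, so for a linear kernel the relation w = ∑ αᵢ yᵢ xᵢ can be checked directly:

```python
# Reconstruct w from the dual solution and compare with the stored weight vector.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

w_reconstructed = clf.dual_coef_ @ clf.support_vectors_   # Σ α_i y_i x_i
print("w from dual coefficients:", w_reconstructed[0])
print("w stored by scikit-learn :", clf.coef_[0])          # the two should match
```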


Kernels for Learning Non-Linear Functions

In Support Vector Machines (SVM), kernels are used to enable the algorithm to learn non-
linear decision boundaries, which is essential when the data is not linearly separable in its
original feature space.

Kernels allow SVMs to work in higher-dimensional spaces, where the data might become
linearly separable, without explicitly performing the transformation.

This technique is central to making SVMs applicable to a wider range of problems,


particularly those where the classes cannot be separated by a straight line (or hyperplane) in
the original feature space.

Basic Concept of Kernels in SVM


SVMs typically work by finding a hyperplane that best separates the classes in the feature
space.

When the data is not linearly separable, SVMs map the data to a higher-dimensional feature
space where it is more likely that a separating hyperplane can be found.

This mapping is achieved through a mathematical function called a kernel.

However, directly computing this transformation can be computationally expensive and


difficult.

The kernel trick allows SVMs to compute the inner product (dot product) of data points in
this higher-dimensional space without explicitly performing the transformation.

This is done by defining a kernel function that calculates the dot product in the new feature
space, making the computation more efficient.

General Form of Kernel


Given two data points x and y in the original feature space, a kernel function K(x, y) computes the dot product between their transformed representations in a higher-dimensional feature space:

K(x, y) = ϕ(x) ⋅ ϕ(y)

where ϕ(x) is the mapping of the input x into a higher-dimensional space.

In practice, the function ϕ(x) is not computed directly; instead, we use K(x, y) to compute the required quantities, which simplifies the process.
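
A tiny numerical check of this identity for the degree-2 polynomial kernel (x ⋅ y + 1)² in two dimensions, where one explicit feature map is ϕ(x) = (1, √2·x1, √2·x2, x1², x2², √2·x1·x2); the values below are arbitrary and NumPy is assumed:

```python
# Verify K(x, y) = phi(x) . phi(y) for the degree-2 polynomial kernel.
import numpy as np

def phi(v):
    x1, x2 = v
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

kernel_value = (x @ y + 1) ** 2          # kernel trick: no explicit mapping needed
explicit_value = phi(x) @ phi(y)         # explicit mapping, then dot product

print(kernel_value, explicit_value)      # both print 25.0
```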

Popular Kernels
Several types of kernel functions are commonly used in SVMs.

These kernels are designed to handle different types of data and transform the feature
space in a way that makes the data separable.

1. Linear Kernel

The linear kernel is the simplest kernel and is used when the data is already linearly
separable or nearly so.

The linear kernel function is:

K(x, y) = x ⋅ y
This kernel does not map the data into a higher-dimensional space—it just computes the
standard dot product in the original input space.

It is used when there is no need for a non-linear transformation.

2. Polynomial Kernel

The polynomial kernel is used for non-linear data where the decision boundary is a
polynomial function.

The polynomial kernel function is:

K(x, y) = (x ⋅ y + 1)ᵈ

Where:

d is the degree of the polynomial (e.g., d = 2 for quadratic boundaries).
This kernel allows the decision boundary to be more flexible, handling cases where the
relationship between data points requires a higher-order polynomial function.

3. Radial Basis Function (RBF) or Gaussian Kernel

The RBF kernel is one of the most popular and widely used kernels for non-linear
classification problems.

It can handle complex, non-linear relationships between data points.

The RBF kernel function is:

K(x, y) = exp(− ∥x − y∥² / (2σ²))

Where:

∥x − y∥² is the squared Euclidean distance between the points x and y.

σ is a parameter that controls the width of the Gaussian function (also called the bandwidth).

The RBF kernel maps data into an infinite-dimensional space and is very effective for
many types of non-linear decision boundaries, especially when the relationship between
the classes is complex or highly non-linear.

4. Sigmoid Kernel

The sigmoid kernel is derived from the activation function used in neural networks.

The sigmoid kernel function is:

K(x, y) = tanh(α x ⋅ y + c)

Where α and c are kernel parameters.

This kernel can model complex decision boundaries and has a behavior similar to a
neural network.

However, it is less commonly used than other kernels, like the RBF kernel.

Choosing the Right Kernel


Choosing the right kernel depends on the nature of the data and the problem being solved. For
example:

If the data appears to have a linear relationship, a linear kernel is the simplest and most
efficient choice.

If the data has a complex, non-linear structure, the RBF kernel is often a good choice
because it can create flexible decision boundaries.

The polynomial kernel is useful when the decision boundary resembles a polynomial
function (e.g., quadratic or cubic).

In practice, the RBF kernel is often preferred for most non-linear classification problems due to
its flexibility and ability to handle complex decision boundaries.

SVM for Non-Linear Classification using Radial Basis Functions (RBF)
Support Vector Machines (SVMs) are powerful tools for both linear and non-linear classification
problems. When dealing with non-linearly separable data, SVMs can be extended by using kernel
functions. One of the most popular kernels is the Radial Basis Function (RBF) kernel, which allows
SVMs to find non-linear decision boundaries in a high-dimensional feature space. This extension is
essential when the data cannot be separated by a linear hyperplane in its original space.

1. The Need for Non-Linear Classification in SVM


In many real-world classification problems, the data points from different classes are not linearly
separable. For example, imagine a dataset where one class forms a circular shape and the other

class is in the surrounding region. In such cases, a straight line (or a hyperplane in higher
dimensions) cannot separate the two classes effectively.
This is where non-linear classification comes in. To solve this, SVM uses a kernel trick to map the
data into a higher-dimensional space where it becomes easier to find a linear separator, even
though the data might not be linearly separable in its original space.

2. What is the RBF Kernel?


The Radial Basis Function (RBF) kernel is a function that computes the similarity between two points x and y based on their distance in the feature space. It is defined as:

K(x, y) = exp(− ∥x − y∥² / (2σ²))

Where:

x and y are the feature vectors of two data points.

∥x − y∥² is the squared Euclidean distance between the points x and y.

σ is a parameter that controls the width of the Gaussian function (the "spread" of the kernel).

The kernel value decays as the distance between x and y increases, meaning that points closer to each other have a larger kernel value.

The RBF kernel is known for its ability to map data into an infinite-dimensional space, which allows it to handle highly complex decision boundaries.
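
As a small numeric sketch (NumPy and scikit-learn assumed), the formula above can be evaluated directly and cross-checked against scikit-learn's rbf_kernel, which is parameterised by gamma = 1 / (2σ²) rather than σ:

```python
# Evaluate the RBF similarity manually and via scikit-learn.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[1.0, 2.0]])
y = np.array([[2.5, 0.5]])
sigma = 1.5

manual = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
library = rbf_kernel(x, y, gamma=1 / (2 * sigma ** 2))[0, 0]

print(manual, library)   # both give the same similarity value
```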

3. The SVM Problem with the RBF Kernel


The general goal of an SVM is to find a hyperplane (in some feature space) that maximizes the
margin between the two classes. However, in cases of non-linear separability, we map the data into
a higher-dimensional space using a kernel function such as RBF.

3.1. Formulating the Optimization Problem with the RBF Kernel


The main idea in SVM with the RBF kernel is to solve a quadratic optimization problem to find the
optimal separating hyperplane in this new, higher-dimensional space. The optimization problem is
formulated as follows:

1. Objective function:
The objective is to minimize the following function, which corresponds to the margin
maximization:
min_{w,b} (1/2) ∥w∥² + C ∑ᵢ₌₁ⁿ ξᵢ

where:

w is the weight vector defining the hyperplane.

b is the bias term.

ξᵢ are slack variables, which allow for some misclassification in case the data is not perfectly separable.

C is a regularization parameter that controls the trade-off between maximizing the margin
and minimizing classification errors (i.e., penalizing misclassification).

2. Constraints:
The constraints ensure that each data point is correctly classified (up to the allowed
misclassification):

yᵢ (w ⋅ xᵢ + b) ≥ 1 − ξᵢ,  for each data point i

where yᵢ is the class label of the data point xᵢ.

3.2. The Role of the Kernel Trick


Instead of directly transforming the data into the higher-dimensional space, SVMs use the kernel trick. The kernel trick allows us to compute the dot product of two transformed data points in the higher-dimensional space without explicitly computing their transformation. This avoids the need to calculate the feature mapping ϕ(x) directly, which would be computationally expensive.

In the case of the RBF kernel, the kernel computes the similarity between data points in the transformed feature space by simply using the formula:

K(x, y) = exp(− ∥x − y∥² / (2σ²))

Thus, we don't need to know the exact transformation ϕ(x), just the kernel function that can
compute the dot product in the higher-dimensional space.

4. How the RBF Kernel Solves Non-Linear Classification


The RBF kernel allows SVM to create non-linear decision boundaries by mapping the data points
into a higher-dimensional space where the decision boundary can be a hyperplane. The key steps
are:

1. Non-linear transformation:
By using the RBF kernel, SVM transforms the data points into a higher-dimensional space where
the classes may become separable.

2. Maximizing the margin in the transformed space:


In this transformed space, the SVM algorithm finds the optimal hyperplane that maximizes the
margin between the two classes, just as it would in a linear case.

3. Non-linear decision boundary:


When projected back into the original space, the hyperplane found in the transformed space
corresponds to a non-linear decision boundary.

For example, in a two-dimensional space, the decision boundary might be a curve (such as a circle
or an ellipse), which is not possible with a linear kernel. The RBF kernel enables the SVM to
automatically find such curved boundaries.

5. Hyperparameters in RBF Kernel

The performance of an SVM with the RBF kernel is influenced by two main hyperparameters:

1. C:
The regularization parameter controls the trade-off between maximizing the margin and minimizing the misclassification errors. A large C gives higher importance to minimizing errors, potentially leading to overfitting, while a small C allows for more margin violations, leading to underfitting.

2. σ:
The σ parameter (also known as the bandwidth of the kernel) controls the influence of individual training samples. A small σ makes the kernel more sensitive to the distance between points, meaning only very close points have a significant influence on the decision boundary. A large σ increases the influence of distant points, potentially making the boundary too smooth and unable to capture intricate patterns in the data.
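
A hedged sketch of tuning these two hyperparameters by cross-validation (scikit-learn and a synthetic dataset assumed; note that scikit-learn's SVC exposes the kernel width through gamma = 1 / (2σ²) rather than σ itself):

```python
# Grid search over C and gamma for an RBF-kernel SVM.
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100],
              "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("Best parameters :", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```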

6. Advantages and Disadvantages of Using RBF in SVM

Advantages:
Flexibility: The RBF kernel is highly flexible and can handle a wide range of non-linear patterns
in the data. It is particularly effective when the decision boundary is highly non-linear.

Generalization: The SVM with the RBF kernel is often effective in high-dimensional spaces and
has good generalization properties, especially when tuned correctly.

Disadvantages:
Parameter sensitivity: The performance of the SVM with the RBF kernel is highly sensitive to the choice of the hyperparameters C and σ. Incorrect choices can lead to overfitting or underfitting.

Computational cost: The RBF kernel requires calculating pairwise distances between all points,
which can be computationally expensive for large datasets.

RBF

Support Vector Machines (SVM) is a powerful machine learning algorithm, particularly for
classification tasks. While SVM is well-known for its ability to create linear decision boundaries, it
can also be used for non-linear classification by transforming the input data into a higher-
dimensional space where a linear separator can be found. This transformation is achieved using
kernels, and one of the most commonly used kernels for non-linear classification is the Radial
Basis Function (RBF) kernel.

1. The Need for Non-Linear Classification


In many real-world classification problems, the data points from different classes are not linearly
separable. For example, imagine a dataset where one class forms a circular shape and the other
class is in the surrounding region. In such cases, a straight line (or a hyperplane in higher
dimensions) cannot separate the two classes effectively.

This is where non-linear classification comes in. To solve this, SVM uses a kernel trick to map the
data into a higher-dimensional space where it becomes easier to find a linear separator, even
though the data might not be linearly separable in its original space.

2. Radial Basis Function (RBF) Kernel


The RBF kernel is a popular kernel used in SVM for non-linear classification. It can handle complex
decision boundaries and works well for many types of data. The RBF kernel is defined as:

K(xᵢ, xⱼ) = exp(− ∥xᵢ − xⱼ∥² / (2σ²))

Where:

K(xᵢ, xⱼ) is the kernel function that computes the similarity between two data points xᵢ and xⱼ,

∥xᵢ − xⱼ∥² is the squared Euclidean distance between the two points,

σ is a parameter that controls the width of the kernel (also called bandwidth).
The RBF kernel measures how similar two points are. When the points are closer together, the
kernel value is high (indicating similarity). As the points move further apart, the kernel value
decreases. The parameter σ determines how quickly this similarity decays as the points get farther
apart.

3. How RBF Transforms Data for Non-Linear Classification


In non-linear classification with SVM using the RBF kernel, we never explicitly transform the data
into a higher-dimensional space. Instead, we use the kernel trick. The kernel trick allows us to
compute the similarity between two points in a higher-dimensional space without having to
calculate the actual coordinates of the points in that space.

The idea is that by using the RBF kernel, we implicitly map the input data into a higher-dimensional
space (often an infinite-dimensional space) where a linear decision boundary can separate the
data. This is important because:

In the higher-dimensional space, the data might become more linearly separable, even though
it wasn't in the original space.

The kernel trick makes this transformation computationally feasible, as we don't need to
compute the transformed data explicitly.

4. SVM Optimization with RBF Kernel


Just like in linear SVM, we need to solve an optimization problem to find the optimal separating
hyperplane. The key difference is that in non-linear SVM, the decision boundary is now defined
using the kernel function instead of the dot product in the original space.

a. The Objective Function:


The objective in SVM is to find a hyperplane that maximizes the margin between the two classes.
The goal is to minimize the following objective function:

min_{w,b} (1/2) ∥w∥²

Where w is the weight vector and b is the bias term. The optimization is subject to the constraints that all data points are correctly classified.

b. Dual Formulation with RBF Kernel:


Instead of solving this optimization problem in the original feature space, we work with the dual
form of the SVM problem, which allows us to use the kernel trick. The dual form of the SVM
problem with the RBF kernel becomes:

max_α ∑ᵢ₌₁ⁿ αᵢ − (1/2) ∑ᵢ₌₁ⁿ ∑ⱼ₌₁ⁿ αᵢ αⱼ yᵢ yⱼ K(xᵢ, xⱼ)

Where:

αᵢ are the Lagrange multipliers associated with the constraints,

yᵢ is the label of the data point xᵢ,

K(xᵢ, xⱼ) is the RBF kernel function.

This optimization problem finds the optimal values of αᵢ, which correspond to the support vectors, the data points that define the decision boundary.

c. Decision Function:
Once we solve for the Lagrange multipliers αᵢ, we can compute the final decision function, which predicts the class label for a new data point x. The decision function is:

f(x) = ∑ᵢ₌₁ⁿ αᵢ yᵢ K(xᵢ, x) + b

Here:

K(xᵢ, x) is the RBF kernel function, which computes the similarity between the support vectors and the new data point x,

αᵢ are the Lagrange multipliers corresponding to the support vectors,

yᵢ is the label of the support vector xᵢ,

b is the bias term.

The sign of f(x) determines the predicted class of the data point x.
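
As a sketch of this decision function in code (scikit-learn and NumPy assumed): a fitted SVC stores αᵢ yᵢ in dual_coef_, the support vectors in support_vectors_, and b in intercept_, so f(x) can be reproduced by hand and compared with decision_function:

```python
# Rebuild f(x) = sum_i alpha_i y_i K(x_i, x) + b from a fitted RBF SVC.
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

x_new = X[:1]                                              # one query point
K = rbf_kernel(clf.support_vectors_, x_new, gamma=0.5)     # K(x_i, x) for each support vector

f_manual = clf.dual_coef_ @ K + clf.intercept_             # sum of weighted similarities plus b
print(f_manual.item(), clf.decision_function(x_new).item())  # identical values
```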

5. Choosing Parameters for RBF Kernel


The performance of an SVM with an RBF kernel depends on the choice of the two main
parameters:

1. C: This is the regularization parameter that controls the trade-off between maximizing the
margin and minimizing the classification error. A larger value means the model will focus more
on classifying all data points correctly (with less tolerance for misclassifications), while a
smaller value will allow some misclassifications to maximize the margin.

2. σ: This controls the width of the RBF kernel. A small σ makes the kernel more sensitive to the distance between points, leading to a more complex model with a tighter decision boundary. A large σ makes the kernel less sensitive, leading to a smoother decision boundary.

6. Advantages of SVM with RBF Kernel


Flexibility: The RBF kernel can model highly complex decision boundaries, allowing SVM to
classify non-linear data effectively.

No Need for Explicit Transformation: The kernel trick allows us to work in a higher-
dimensional space without explicitly transforming the data, making it computationally efficient.

Effective for Many Data Types: The RBF kernel works well for many types of data and is often
the default choice in practice.

7. Disadvantages of SVM with RBF Kernel


Parameter Sensitivity: The performance of the model depends on the choice of C and σ .
Choosing the right values requires careful tuning, often through cross-validation.

Computational Complexity: While the kernel trick avoids explicit mapping, training an SVM
with the RBF kernel can still be computationally expensive, especially for large datasets.

Explain with an example the variant of SVM, Support Vector Regression. [APR 2023]
Support Vector Regression (SVR) is a machine learning technique used for regression problems,
where the goal is to predict a continuous output variable based on input features. SVR is based
on the concept of Support Vector Machines (SVM), which is primarily used for classification
tasks.

In SVR, the focus is on finding a function that deviates as little as possible from the actual data
while also allowing for some flexibility (called a margin) to handle outliers.

SVR works by finding a function that best fits the data points while keeping the error (difference
between predicted and actual values) within a certain margin.

This is done by mapping the data points to a higher-dimensional space and then constructing a
hyperplane that approximates the underlying relationship between the input variables and the
target variable.

Key Concepts of Support Vector Regression


1. Regression Line:
The goal of SVR is to find a function f(x) that best predicts the target variable y based on the input features x. The function must not only minimize the error but also be as simple as possible. This is similar to linear regression, where the goal is to find the best-fitting line, but in SVR, there is more flexibility allowed.

2. Epsilon (ε) - The Margin of Tolerance:

One of the unique features of SVR is the introduction of a margin of tolerance, defined by a parameter called ε. The idea is that errors smaller than ε are tolerated, meaning that if the predicted value ŷ is within ε of the actual value y, it is considered acceptable. This helps prevent the model from fitting noise in the data, focusing only on the significant deviations.

3. Support Vectors:
In SVR, the model is defined not by all the data points but by a small subset of points called
support vectors. These are the points that lie on the boundaries of the ε-tube or that are outside it. The model focuses on minimizing the error around these support vectors.

How SVR Works:


Step 1: Mapping to Higher Dimensions: SVR first maps the original data into a higher-
dimensional space using a function called a kernel. The kernel function helps transform the
input data into a space where a linear relationship can be found, even if the original data is non-
linear.

Step 2: Defining a Hyperplane: Once the data is in a higher-dimensional space, SVR finds a
hyperplane (a generalized version of a line in higher dimensions) that best fits the data. In the
case of regression, instead of finding a hyperplane that separates the data into classes (like in
SVM classification), the goal is to find a hyperplane that predicts the continuous target variable.

Step 3: Defining the Margin: The ε-tube (epsilon tube) is the region around the hyperplane
where no penalty is applied to errors. If the predicted value lies within this tube, no penalty is
applied. If the predicted value lies outside the tube, the error is penalized, and the model is
adjusted to minimize this error.

Step 4: Support Vectors: The support vectors are the data points that lie closest to the
boundary of the ε-tube. These support vectors are the critical elements of the model, as they
help determine the position of the hyperplane.

Mathematical Formulation of SVR


Consider a dataset with input-output pairs (xᵢ, yᵢ), where xᵢ ∈ Rᵈ (i.e., the input is a d-dimensional vector) and yᵢ ∈ R (i.e., the output is a scalar).

Linear Function:
The prediction function in linear SVR can be written as:

y = w ⋅ x + b

where:

w is the weight vector (the direction of the regression hyperplane),

b is the bias term (the intercept of the hyperplane),

x is the input feature vector.
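
A minimal sketch of SVR in practice (scikit-learn assumed; the noisy sine data is synthetic, chosen only to illustrate the ε-tube and the support vectors):

```python
# Fit an RBF-kernel SVR on a noisy sine curve.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)     # noisy target values

# epsilon defines the tube within which errors are not penalised;
# C trades off flatness of the function against tolerance to larger errors.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1)
svr.fit(X, y)

print("Number of support vectors:", len(svr.support_vectors_))
print("Prediction at x = 2.0    :", svr.predict([[2.0]])[0])
```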

Advantages of SVR
Robust to Outliers: By allowing deviations within the margin ϵ, SVR can handle noise and
outliers more effectively than other regression techniques.

Flexibility: The use of kernels allows SVR to model nonlinear relationships between features
and target variables.

High-dimensional Data: SVR can work well in high-dimensional spaces, especially with the
kernel trick.

Disadvantages of SVR
Computational Complexity: SVR can be computationally expensive, particularly with large
datasets and complex kernel functions.

Choice of Parameters: The performance of SVR is highly sensitive to the choice of C , ϵ, and
kernel parameters, requiring careful tuning.

💡 Define the following terms with reference to SVM: [OCT 22]


i) Separating hyperplane

ii) Margin

Multiclass Classification Techniques


Multiclass classification, also referred to as multinomial classification, is a type of supervised
learning task where the goal is to assign an instance (or data point) to one of several possible
classes or categories.

Unlike binary classification, where there are only two possible outcomes (e.g., yes/no,
true/false), multiclass classification involves more than two possible classes.

For example, in a fruit classification problem, the goal might be to categorize an image of a fruit
as either an apple, banana, or orange, which are three distinct classes.

In multiclass classification, each input is associated with only one label, but there are more than
two options.

This can be seen in many real-world applications such as:

Recognizing handwritten digits (0-9) in the MNIST dataset.

Classifying types of flowers, such as setosa, versicolor, and virginica in the Iris dataset.

Categorizing text into different topics, like sports, politics, and entertainment.

Approaches to Multiclass Classification


There are several methods to tackle multiclass classification problems.

These methods depend on how they manage multiple classes within the model.

Some popular approaches include:


One-vs-Rest (OvR) or One-vs-All (OvA)

The One-vs-Rest (OvR) technique, also known as One-vs-All (OvA), is one of the most
popular and straightforward approaches to extend binary classification algorithms to
multiclass classification problems.

The general idea is to convert the multiclass problem into multiple binary classification
tasks, where each binary classifier attempts to distinguish one class from all the others.

OvR is particularly useful for machine learning models that inherently support binary
classification (such as Support Vector Machines, Logistic Regression, and Decision
Trees).

This technique works by creating one classifier per class, with each classifier being
responsible for predicting whether an instance belongs to the target class or not.

What is One-vs-Rest (OvR)?


In One-vs-Rest (OvR) classification, we treat the problem as multiple binary
classification tasks.

For each class, a separate classifier is trained to distinguish that class from all the other
classes combined.

In other words, for each class, the model learns to predict whether an input belongs to
that class or not.

How One-vs-Rest Works


The general idea of OvR is to train one binary classifier for each class, where the classifier is
responsible for recognizing that class versus all other classes. Here’s how OvR works step-
by-step:

1. Training Phase:

For a multiclass problem with N classes (say Class A, B, C, and D), OvR trains N
classifiers.

For each classifier, the data for the given class is treated as the positive class, and all
other classes are treated as the negative class.

For example, if we have 4 classes: A, B, C, and D, we train:

Classifier 1: Class A vs. (Classes B, C, D)

Classifier 2: Class B vs. (Classes A, C, D)

Classifier 3: Class C vs. (Classes A, B, D)

Classifier 4: Class D vs. (Classes A, B, C)

Each classifier only learns to separate one class from the rest of the classes.

2. Prediction Phase:

Once the classifiers are trained, when a new input is provided, each classifier makes
a prediction:

Each classifier will predict whether the input belongs to the class it is trained for
(positive) or not (negative).

The final prediction for the input is the class whose classifier provides the highest
confidence score or probability.

For example, suppose we have a new input, and the predictions from the
classifiers are:

Classifier 1 (A vs. Not A): Predicts Class A with a probability of 0.7

Classifier 2 (B vs. Not B): Predicts Class B with a probability of 0.2

Classifier 3 (C vs. Not C): Predicts Class C with a probability of 0.4

Classifier 4 (D vs. Not D): Predicts Class D with a probability of 0.3

The final prediction would be Class A, because the classifier for Class A has the
highest probability (0.7).
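
A short sketch of this scheme with scikit-learn's OneVsRestClassifier (the library, the logistic-regression base estimator, and the Iris dataset are assumptions made for illustration):

```python
# One-vs-Rest: one binary classifier per class; the highest score wins.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)          # 3 classes -> 3 binary classifiers

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)

print("Number of binary classifiers  :", len(ovr.estimators_))
print("Per-class scores for one sample:", ovr.decision_function(X[:1]))
print("Predicted class               :", ovr.predict(X[:1]))
```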

Advantages:
Simplicity: OvR is easy to implement and works with any binary classifier, like logistic
regression or support vector machines.

Scalability: It handles problems with many classes efficiently, as each classifier deals
with just one class vs. the others.

Interpretability: Each classifier is focused on one class, making it easier to understand the model's decisions.

Disadvantages:
Imbalanced Data: OvR can struggle with imbalanced datasets, where some classes have
significantly fewer examples.

Overlapping Classes: If classes overlap, classifiers may misclassify inputs.

Computational Cost: With many classes, the number of classifiers increases, which can
be computationally expensive.
One-vs-One (OvO)
The One-vs-One (OvO) approach is a strategy used to extend binary classification
algorithms to multiclass classification problems.

It is one of the most commonly used techniques for tackling multiclass classification,
especially when dealing with models that are inherently binary classifiers, such as
Support Vector Machines (SVM), logistic regression, or decision trees.

In OvO, instead of creating one classifier that tries to separate all classes, we create a
binary classifier for each pair of classes in the dataset.

Each classifier distinguishes between just two classes, and the final prediction is made
based on the results from all these pairwise classifiers.

The class that wins the most pairwise comparisons is selected as the final predicted
class.

What is One-vs-One (OvO)?


In One-vs-One (OvO) classification, a binary classifier is trained for every possible pair
of classes.

This means that for a problem with C classes, the number of classifiers needed will be
the number of ways you can select two classes from C, which is mathematically given by
the combination formula:
Number of classifiers = C(C − 1) / 2

For example, if there are 3 classes: A, B, and C, OvO will train the following binary
classifiers:

1. Classifier 1: A vs B

2. Classifier 2: A vs C

3. Classifier 3: B vs C

Each classifier will try to distinguish between two classes.

During training, the classifier will be given data that belongs to only two classes at a time,
and it will learn how to separate those classes.

Steps in One-vs-One Classification


1. Training Phase:

For each pair of classes, a binary classifier is trained. Each classifier gets a dataset
with only the two classes involved in that classifier.

2. Prediction Phase:

For a new, unseen input, each of the binary classifiers makes a prediction.

Each classifier votes on which class the data point belongs to.

The class with the most votes across all classifiers is selected as the final prediction
for that input.

Example of OvO for 3 Classes


Let’s consider a multiclass classification problem with three classes: A, B, and C. The steps
for OvO would look like this:

1. Step 1: Pairwise Classifier Creation


We create three binary classifiers:

Classifier 1: Class A vs. Class B

Classifier 2: Class A vs. Class C

Classifier 3: Class B vs. Class C

2. Step 2: Training the Classifiers

Classifier 1 is trained on the instances from Class A and Class B.

Classifier 2 is trained on the instances from Class A and Class C.

Classifier 3 is trained on the instances from Class B and Class C.

3. Step 3: Making Predictions

Suppose you have a new test instance. Each classifier will vote for one of the two
classes it was trained on:

Classifier 1 might predict A or B.

Classifier 2 might predict A or C.

Classifier 3 might predict B or C.

After gathering the votes from all classifiers, the class with the most votes is the final
prediction.

Example with Votes


Let’s assume we have the following predictions from the classifiers for a test instance:

Classifier 1 (A vs B): Predicts A.

Classifier 2 (A vs C): Predicts C.

Classifier 3 (B vs C): Predicts C.

In this case, the class C receives 2 votes (from Classifier 2 and Classifier 3), while A and B
each receive 1 vote. Therefore, the final prediction would be C.
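
A brief sketch of this voting scheme with scikit-learn's OneVsOneClassifier (the library, the linear-SVM base estimator, and the Iris dataset are assumptions made for illustration):

```python
# One-vs-One: with 3 classes, 3 pairwise classifiers vote on the final label.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)          # 3 classes -> 3·2/2 = 3 classifiers

ovo = OneVsOneClassifier(SVC(kernel="linear"))
ovo.fit(X, y)

print("Number of pairwise classifiers:", len(ovo.estimators_))
print("Predicted class for one sample:", ovo.predict(X[:1]))
```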

Advantages of One-vs-One
1. Simpler Binary Classification Problems: Each binary classifier only needs to separate
two classes, which can be simpler to model, especially for complex problems.

2. Better Performance with Some Algorithms: In some cases, binary classifiers like
Support Vector Machines (SVMs) perform better when they only need to distinguish
between two classes at a time, rather than having to consider all classes together in a
multiclass problem.

3. Handling Class Imbalance: OvO can be beneficial in cases where the classes are
imbalanced (i.e., some classes have many more examples than others). Since each
classifier only deals with two classes at a time, it may be easier to address imbalance
within each pair.

Disadvantages of One-vs-One
1. Computational Complexity: The main drawback of OvO is that it requires training a large number of classifiers. If there are C classes, C(C − 1)/2 classifiers are needed. As the number of classes grows, the computational cost increases quadratically. For example, with 10 classes, you would need to train 45 classifiers, and with 100 classes, 4,950 classifiers.

2. Memory and Storage Requirements: Since each pair of classes has its own classifier,
the storage and memory required to keep track of all these models can be quite large,
especially when dealing with a dataset that has many classes.

3. Prediction Time: After training the classifiers, predicting the class of a new sample
involves running it through all the classifiers, which can be time-consuming, especially
when the number of classes is large.

Applications of One-vs-One
The One-vs-One technique is commonly used in a variety of machine learning tasks, such
as:

1. Image Classification: In problems like object recognition, where an image needs to be classified into one of several categories, OvO can be effective because it simplifies the task by breaking it down into smaller, pairwise classification problems.

2. Text Classification: When classifying documents into multiple categories (e.g., news
topics like sports, politics, or entertainment), OvO allows each classifier to specialize in
distinguishing between two categories, which can improve overall performance.

3. Medical Diagnosis: In healthcare, OvO can be used for classifying medical conditions or
diseases based on symptoms, where each classifier distinguishes between pairs of
diseases.

4. Speech Recognition: In automatic speech recognition systems that classify spoken words into different classes, OvO can be used to distinguish between pairs of words or phrases.

Challenges in Multiclass Classification


1. Class Imbalance
Like in binary classification, multiclass problems can suffer from class imbalance, where some
classes have far more examples than others. This can lead to biased models that perform well
on the majority class but poorly on the minority classes. Techniques like oversampling,
undersampling, or using balanced weights can help address this issue.

2. Scalability

As the number of classes increases, training multiple classifiers (as in OvR or OvO) can become
computationally expensive. Additionally, the model might struggle to learn effectively from a
large number of classes, especially if the dataset is not sufficiently large or well-represented for
each class.

3. Complexity in Model Interpretation


Multiclass classification models can become harder to interpret, especially when they are
complex models like decision trees, neural networks, or random forests. Visualizing the decision

boundaries and understanding the decision process for each class may require more advanced
techniques.

Applications of Multiclass Classification


Multiclass classification is used in various domains:

Image recognition: Identifying objects in an image, where each object corresponds to a class.

Text classification: Categorizing documents or messages into topics such as sports, politics, or
entertainment.

Speech recognition: Recognizing spoken words or commands.

Medical diagnosis: Classifying medical conditions based on symptoms or test results.

💡 What are different variants of multi-class classification? Explain them with suitable examples. [APR 2023]

Calculate macro average precision, macro average recall, and macro average F-score for the following confusion matrix of multi-class classification. [APR 2023]
| Predictions \ Actual values | A   | B  | C  | D  |
|-----------------------------|-----|----|----|----|
| A                           | 100 | 80 | 10 | 10 |
| B                           | 0   | 9  | 0  | 1  |
| C                           | 0   | 1  | 8  | 1  |
| D                           | 0   | 1  | 0  | 9  |
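
As a hedged worked sketch of the calculation requested above (assuming, per the table header, that rows are predicted classes and columns are actual classes), per-class precision is the diagonal entry divided by its row total, recall is the diagonal entry divided by its column total, and the macro averages are the unweighted means over the four classes:

```python
# Minimal sketch (NumPy assumed): macro-averaged metrics from the confusion matrix above.
import numpy as np

# Rows = predicted class (A, B, C, D), columns = actual class (A, B, C, D).
cm = np.array([[100, 80, 10, 10],
               [  0,  9,  0,  1],
               [  0,  1,  8,  1],
               [  0,  1,  0,  9]])

tp = np.diag(cm).astype(float)
precision = tp / cm.sum(axis=1)      # TP / (everything predicted as that class)
recall = tp / cm.sum(axis=0)         # TP / (everything actually in that class)
f1 = 2 * precision * recall / (precision + recall)

print("macro precision:", precision.mean())   # (0.50 + 0.90 + 0.80 + 0.90) / 4 = 0.775
print("macro recall:   ", recall.mean())      # roughly (1.00 + 0.10 + 0.44 + 0.43) / 4 ≈ 0.49
print("macro F-score:  ", f1.mean())          # unweighted mean of the per-class F1 scores
```

Under these assumptions the macro precision works out to about 0.775 and the macro recall to roughly 0.49; note that the macro F-score can be defined either as the mean of per-class F1 values (as in the sketch) or from the macro precision and recall, so it is worth stating which convention is used.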

Balanced and Imbalanced Multi-Class Classification Problems
In machine learning, a multiclass classification problem involves predicting one of several
possible classes or categories for each data point.

When the classes in the dataset are not evenly distributed, meaning that some classes have
significantly more examples than others, the problem is referred to as an imbalanced multi-
class classification problem.

This type of problem can create significant challenges for machine learning algorithms because
the model may become biased toward the majority class, leading to poor performance on the
underrepresented classes.

What Is an Imbalanced Multi-Class Classification Problem?


An imbalanced multi-class classification problem occurs when the distribution of samples
across the different classes is uneven.

In a typical classification task, you would expect each class to have a similar number of
samples, but in many real-world scenarios, this is not the case.

For example, in a medical dataset for diagnosing diseases, you might have far more healthy
patients than patients with a rare disease.

In such cases, the minority classes (such as those representing the rare disease) are
underrepresented, and the majority classes (such as the healthy patients) dominate the dataset.

For instance, in a three-class problem (A, B, and C):

Class A might have 80% of the data (majority class).

Class B might have 15% of the data (minority class).

Class C might have only 5% of the data (very minority class).

This imbalance can significantly affect the ability of the machine learning model to correctly
classify the minority classes, because it might get "distracted" by the large number of samples
from the majority class.

Why Is It a Problem?
Imbalanced datasets present several challenges for machine learning models:

1. Bias Toward the Majority Class: Many machine learning algorithms are designed to maximize
overall accuracy, meaning that if the majority class is overwhelmingly represented, the
algorithm may simply predict the majority class for all data points, leading to high accuracy but
poor performance on the minority classes.

2. Poor Model Generalization: The model might not learn to recognize the distinguishing features
of the minority classes effectively. Since there are fewer examples of the minority classes, the
model does not get enough exposure to these classes during training, which leads to poor
generalization to new, unseen examples from the minority classes.

3. Ineffective Performance Metrics: Accuracy, the most commonly used performance metric, is
not always a reliable indicator in imbalanced problems. A model that predicts the majority class
for every data point can achieve a high accuracy, but it would fail to identify minority class
examples correctly. This makes metrics like precision, recall, F1-score, and area under the ROC
curve more important in evaluating model performance.

Causes of Class Imbalance in Machine Learning


Class imbalance in machine learning refers to situations where the distribution of instances
across the different classes in a dataset is uneven.

One or more classes have significantly more samples than others, leading to a situation where
the machine learning model may develop biases toward the majority class.

Understanding the causes of class imbalance is important, as it helps in identifying why the
problem occurs and how to address it.

Below are some common causes of class imbalance in machine learning:

1. Natural Occurrence in Data
One of the most common causes of class imbalance arises from the natural distribution of
events in the real world. Many phenomena naturally occur more frequently in one class than
another. For instance:

In medical diagnostics, most people may be healthy, and only a small percentage may
suffer from a particular disease. As a result, datasets will naturally have more "healthy"
samples compared to "diseased" samples.

In fraud detection, fraudulent transactions occur much less frequently than legitimate
transactions, which leads to an imbalanced dataset with far more non-fraudulent
transactions.

In object detection, some objects (such as cars or people) are much more common in
everyday images, while others (like rare animals or specific products) are much less
frequent.

When these natural disparities exist, the data generated for machine learning will reflect these
uneven distributions, resulting in imbalanced classes.

2. Data Collection Bias


Another significant cause of class imbalance is data collection bias, which occurs when the
data collected for a machine learning task is not representative of the true distribution of
classes. This can happen for several reasons:

Sampling Bias: Sometimes, the method used to collect data favors certain groups or
outcomes over others. For example, if a survey is conducted to detect a specific disease but
is only done in an area where the disease is rare, the dataset will have a lower proportion of
disease cases.

Data Availability: In many cases, it is simply easier or cheaper to collect data from the
majority class. For instance, if you are working with an e-commerce platform, it might be
easy to collect data on regular customer purchases but much harder to obtain data on
fraudulent transactions due to their rarity.

Event Monitoring: Some rare events, like equipment failures in industrial settings or natural
disasters, might not happen frequently enough to be adequately represented in the dataset.
Hence, datasets based on these types of events will tend to have fewer instances from the
rare class.

3. Rare or Expensive Events


Some datasets are imbalanced because the events themselves are rare or costly to observe. In
domains like healthcare, detecting rare diseases or monitoring rare outcomes is inherently
challenging due to the scarcity of such events. Similarly, detecting fraudulent transactions, rare
industrial failures, or abnormal behavior in security surveillance requires monitoring a large
number of normal events to identify the rare ones.
Because rare events are difficult to observe, they may be underrepresented in the data, leading
to class imbalance. This is especially common when gathering real-world data over long

periods, where the occurrence of rare events is low relative to more common events.

Techniques to handle class imbalance


1. Random Resampling
Random resampling is one of the simplest techniques used to address class imbalance.

It involves modifying the dataset by either oversampling the minority class or undersampling
the majority class to make the class distribution more balanced.

Oversampling the Minority Class


In oversampling, more copies of data points from the minority class are added to the
dataset.

This can be done by duplicating existing samples or generating new ones by random
sampling from the minority class.

The goal is to increase the number of minority class instances so that the model is exposed
to them more often during training, thus improving its ability to predict these rare classes.

Undersampling the Majority Class


In undersampling, samples from the majority class are randomly removed from the dataset
to reduce its size and make the class distribution more balanced.

This technique helps prevent the model from becoming biased towards the majority class,
but it can also lead to a loss of valuable data and underfitting if too much data is removed.

Both methods have their pros and cons. Oversampling helps the model learn more from the
minority class but can lead to overfitting since duplicated samples might cause the model to
memorize rather than generalize.

On the other hand, undersampling can lead to a loss of important information from the majority
class, potentially reducing the model’s performance on the majority class.
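
A minimal sketch of both ideas is shown below using scikit-learn's `resample` utility (an assumption; the imbalanced-learn package also offers ready-made `RandomOverSampler` and `RandomUnderSampler` classes for the same purpose). The toy data is purely illustrative:

```python
# Minimal sketch of random over- and under-sampling (assumes NumPy and scikit-learn).
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
X_major = rng.randn(900, 2)              # 900 majority-class samples
X_minor = rng.randn(100, 2) + 3          # 100 minority-class samples

# Oversampling: draw minority samples with replacement until the classes are balanced.
X_minor_over = resample(X_minor, replace=True, n_samples=len(X_major), random_state=0)

# Undersampling: keep only a random subset of the majority class.
X_major_under = resample(X_major, replace=False, n_samples=len(X_minor), random_state=0)

print(len(X_minor_over), len(X_major_under))   # 900 and 100
```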

2. Tomek Links

Tomek Links are pairs of instances that are very close to each other but belong to opposite
classes.

The technique aims to improve the performance of classifiers in imbalanced multiclass classification problems by selectively removing instances from the majority class.

When two instances form a Tomek Link, they are the nearest neighbors to each other, but they
belong to different classes.

By removing one of the instances from the majority class in each Tomek Link pair, the decision
boundary between the classes becomes clearer, and the classifier can distinguish between the
classes more effectively.

The main idea is that the instances in a Tomek Link are too close to each other and thus
represent noise or ambiguity in the classification process.

By removing the majority class instance of each pair, you create more space between the
classes.

This results in a more distinct separation, making the task of classification easier and improving
the model's generalization capabilities.

Tomek Links are a form of under-sampling because they remove datapoints from the majority
class.

However, this technique only removes those majority class instances that are very close to
instances of the minority class, focusing on cleaning up the boundary areas rather than
randomly reducing the majority class size.

This helps prevent overfitting and ensures that the classifier is not overwhelmed by the majority
class.

As a result, Tomek Links can be an effective tool for dealing with class imbalance in multiclass
classification problems, improving both model accuracy and decision boundary clarity.
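
As a hedged sketch, the imbalanced-learn package (an assumption, not part of these notes) provides a `TomekLinks` resampler that removes the majority-class member of each detected link:

```python
# Minimal sketch of Tomek Link removal (assumes the imbalanced-learn package).
from collections import Counter

from imblearn.under_sampling import TomekLinks
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

tl = TomekLinks()                      # under-samples only majority points that form Tomek Links
X_res, y_res = tl.fit_resample(X, y)

print("before:", Counter(y), "after:", Counter(y_res))
```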

3. SMOTE (Synthetic Minority Over-sampling Technique)

SMOTE (Synthetic Minority Over-sampling Technique) is a more advanced technique for
handling class imbalance by generating synthetic examples rather than duplicating existing
ones.

It focuses on creating new, synthetic samples for the minority class to make the class
distribution more balanced.

For each minority class instance, SMOTE selects a random sample of its k-nearest neighbors.

It randomly selects one or more neighbors and creates synthetic instances along the line
segment between the original instance and the selected neighbor.

These synthetic instances are placed in the feature space, essentially creating new, realistic
examples based on the existing data.

Example:
In a two-dimensional feature space, if an instance from the minority class is located at (3, 4),
and its nearest neighbor is at (4, 5), SMOTE would generate new points between these two
positions, such as (3.5, 4.5), which is a synthetic point that combines information from both the
original instance and its neighbor.
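
A minimal sketch using the SMOTE implementation from the imbalanced-learn package (assumed to be installed) is given below; `k_neighbors` controls how many nearest neighbours are considered when interpolating new minority points:

```python
# Minimal sketch of SMOTE oversampling (assumes the imbalanced-learn package).
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

smote = SMOTE(k_neighbors=5, random_state=0)
X_res, y_res = smote.fit_resample(X, y)   # synthetic minority points are interpolated between neighbours

print("before:", Counter(y), "after:", Counter(y_res))
```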

4. Class Weights
Class weights are a method to adjust the learning process in order to give more importance to
the minority class during training, without needing to modify the dataset itself.

In this approach, the algorithm is modified to penalize mistakes on the minority class more
than mistakes on the majority class.

The idea is to make the classifier more sensitive to the minority class by assigning higher
penalties (or weights) to misclassifications of minority class instances.

How Class Weights Work


Many machine learning algorithms, such as logistic regression, decision trees, support vector
machines (SVMs), and neural networks, allow users to assign class weights.

These weights help adjust the algorithm’s behavior when training the model.

The general approach is to:

Increase the weight of the minority class.

Decrease the weight of the majority class.

For example, in a binary classification problem where class A is the majority class and class B is
the minority class, the classifier might be penalized more for misclassifying class B instances.

This encourages the model to focus on correctly identifying the minority class to improve its
performance.

Advantages of Class Weights


No Data Loss: Unlike oversampling and undersampling, class weights do not require any
changes to the dataset itself, and thus no data is lost or duplicated.

Flexible: Class weights can be adjusted for any classifier that supports this feature, making it a
very flexible and widely applicable technique.

Disadvantages of Class Weights


Tuning Required: Finding the right class weights often requires tuning, and the wrong balance
can still lead to a biased model.

Overfitting Risk: If the weights are too high, the model may become too focused on the minority
class, leading to overfitting on the minority class and poor generalization.

Ensemble Learning
Ensemble learning is a powerful technique in machine learning that combines multiple models
to make better predictions than any single model could on its own.

The core idea is that by combining the strengths of different models, an ensemble can often
achieve higher accuracy and more robust predictions, especially when individual models may
have their own weaknesses.

In ensemble learning, several models, often referred to as "learners" or "base models," are
trained to solve the same problem.

Once trained, these models' predictions are combined in some way to produce a final output.

The reasoning behind ensemble learning is that different models may make different errors, and
by combining their predictions, these errors can cancel each other out, leading to improved
overall performance.

There are two main types of ensemble learning: Bagging and Boosting, each with its own
approach to combining models and reducing errors.

1. Bagging (Bootstrap Aggregating):

Bagging involves training multiple models (typically of the same type) on different
subsets of the data. These subsets are created by randomly sampling the original
dataset with replacement (bootstrap sampling). Each model is trained independently.

The most common example of bagging is the Random Forest algorithm, where many
decision trees are trained on different subsets of the data, and their predictions are
combined through voting or averaging.

Goal: To reduce variance and prevent overfitting.

2. Boosting:

Boosting is a sequential ensemble method where each new model is trained to correct
the mistakes made by previous models. The idea is to focus more on the data points that
were misclassified or poorly predicted by previous models.

In boosting, models are added one after another, and each model gives more weight to
the instances that previous models got wrong. This way, the ensemble focuses on
improving weak points.

Popular Boosting Algorithms:

AdaBoost (Adaptive Boosting): Each new model gives more weight to the
misclassified data from previous models.

Gradient Boosting: Models are built in a way that corrects the errors from the
previous model using gradient descent.

Goal: To reduce both bias and variance, improving accuracy on complex problems.

Advantages of Ensemble Learning


1. Improved Accuracy: By combining different models, ensemble methods can achieve higher
accuracy than individual models. Even if each model has a different bias or error pattern, the
ensemble can smooth out these differences and make more accurate predictions.

2. Robustness: Ensembles tend to be more robust than individual models, especially in the
presence of noisy data or outliers. The diversity of models in an ensemble makes it less likely
that the entire system will be affected by issues that might impact one model.

3. Reduction of Overfitting: While individual models, particularly complex ones, can overfit the
training data, ensemble methods like bagging (e.g., random forests) help to reduce overfitting
by averaging predictions across multiple models, which can lead to better generalization.

Challenges of Ensemble Learning


1. Increased Computational Cost: Since multiple models are trained and evaluated, ensemble
methods are generally more computationally expensive than using a single model. This can be a
concern in situations where speed or computational resources are limited.

2. Complexity in Implementation: While the concept of ensemble learning is simple, implementing it effectively, particularly with more advanced techniques like stacking or boosting, can be complex and require careful tuning of hyperparameters.

3. Risk of Diminishing Returns: As more models are added to an ensemble, there may come a
point where the benefit of adding additional models becomes minimal. This means that after a

certain point, the improvement in performance might not justify the increased computational
effort.

💡 What do you mean by ensemble learning? Differentiate between bagging & boosting. [APR 2023]

What is ensemble learning? Explain the concept of Random Forest ensemble learning. [OCT 23]

Bagging
Bagging stands for Bootstrap Aggregating, a powerful ensemble learning technique that aims
to improve the performance and stability of machine learning models.

The concept of bagging is based on the idea of combining multiple models (often of the same
type) trained on different subsets of the data to achieve a more accurate and robust final
prediction.

The goal is to reduce the variance of the model, making it less prone to overfitting and more
generalized to new, unseen data.

Bagging works by creating multiple models that are trained on different parts of the training
data.

These parts are generated using a method called bootstrapping, which involves randomly
sampling the training data with replacement.

This means that some data points may appear multiple times in the same subset, while others
may not appear at all.

After training, each of these models makes predictions, and the final prediction is made by
combining the outputs of all the models, typically through voting (for classification tasks) or
averaging (for regression tasks).

The core principle behind bagging is to reduce variance. Variance in a model refers to the
sensitivity of the model to fluctuations in the training data.

High variance models tend to overfit, meaning they perform well on the training data but fail to
generalize to new data.

By averaging the predictions from several models trained on different data subsets, bagging
helps to smooth out errors and reduce the model’s overall variance.

Step-by-Step Process of Bagging


1. Data Subsampling (Bootstrapping): The first step in bagging is to create multiple subsets of the
training data. This is done by randomly sampling the data with replacement. For example, if you
have a training dataset of 1000 samples, each subset might also have 1000 samples, but some
samples may be repeated in the subset while others are omitted.

2. Model Training: Each of these subsets is then used to train a separate model. These models are
typically of the same type, such as decision trees or support vector machines. However, each
model is exposed to different data, which introduces diversity in the predictions made by each
model.

3. Prediction: Once all the models are trained, they are used to make predictions on new data. For
classification problems, the predictions from all the models are combined using a majority vote,
meaning the class that most models predict is chosen as the final prediction. For regression
problems, the predictions are averaged to get the final result.

4. Final Aggregation: The final step is to aggregate the results from all the individual models. This
aggregation typically reduces the impact of any one model’s errors. In classification, it’s based
on voting (majority rule), while in regression, it’s the average of the predictions.

Example: Random Forest


One of the most popular examples of bagging is the Random Forest algorithm. In a random
forest, multiple decision trees are trained using bagging, and their predictions are aggregated.

Since each tree is trained on a different bootstrap sample, the trees are likely to make different
mistakes.

When the results of many trees are combined, the overall error tends to decrease, as errors
made by some trees are canceled out by others.

Benefits of Bagging
1. Reduction of Overfitting: Bagging helps reduce overfitting, particularly in complex models like
decision trees, which can be highly sensitive to small variations in the training data. By
averaging multiple models, bagging smooths out the predictions and leads to a model that
generalizes better to unseen data.

2. Improved Accuracy: Since bagging aggregates the predictions of multiple models, it can lead to
a more accurate prediction compared to a single model. Even if individual models perform
poorly, the ensemble can still achieve good overall performance.

3. Stability: Bagging stabilizes the model by making it less sensitive to fluctuations in the training
data. If one model is overfitted or underfitted, the effect of this is lessened by the other models
in the ensemble.

4. Parallelization: Bagging is a technique that can be easily parallelized because each model is
trained independently on a different subset of the data. This makes it efficient to implement,
especially on large datasets.

Drawbacks of Bagging
1. Increased Computational Cost: One of the main drawbacks of bagging is that it requires
training multiple models, which increases the computational cost. For large datasets or complex
models, this can be a significant concern in terms of both time and resources.

2. Diminishing Returns: After a certain point, adding more models to the ensemble may lead to
diminishing returns in terms of performance improvement. The additional models may not

contribute significantly to the accuracy, and their inclusion may only increase computational
complexity.

3. Model Diversity: Bagging typically uses the same type of model for each subset of the data.
While this is useful for reducing variance, it may not be as effective in situations where different
model types or approaches could offer additional diversity and improve the ensemble's
performance.
Subagging
Subagging (Subset Aggregating) is a variant of the bagging technique in ensemble learning
that aims to improve the accuracy and robustness of machine learning models.

While it shares some similarities with bagging, subagging has a distinct approach in how the
data is sampled for training individual models, offering potential advantages in certain
scenarios.

The main idea behind subagging is to use smaller, randomly selected subsets of the training
data instead of bootstrapping, which is used in traditional bagging.

Subagging works by training multiple models, similar to bagging, but with one key difference:
instead of sampling with replacement (as in bagging), subagging uses random subsets of the
data without replacement.

This means that each model in the ensemble is trained on a different, smaller subset of the data,
and no data point is repeated in any subset.

By using subsets that are smaller than the full training set, subagging can provide a more
diverse set of models, while still maintaining the overall goal of improving model performance
through aggregation.

In subagging, just like in bagging, the predictions of all individual models are combined to make
the final prediction.

For classification tasks, this typically involves majority voting, while for regression tasks, the
predictions are averaged.

Step-by-Step Process of Subagging


1. Data Subsampling: Instead of using the full training data or sampling with replacement as in
bagging, subagging selects smaller random subsets of the training data. The subsets are
chosen without replacement, meaning each data point appears only once in a given subset.
These subsets are typically smaller in size compared to the original dataset.

2. Model Training: Each of these randomly selected subsets is used to train a separate model.
Like bagging, these models are typically the same type (e.g., decision trees, neural networks)
but are trained on different portions of the data. Since each model is trained on a different
subset, the ensemble members are likely to make different errors, which enhances the diversity
of the ensemble.

3. Prediction: Once all models are trained, they are used to make predictions on new, unseen data.
For classification tasks, each model casts a vote, and the class with the majority of votes is

selected as the final prediction. For regression tasks, the final prediction is the average of all the
model predictions.

4. Final Aggregation: As with bagging, the final result is derived from aggregating the predictions
of all individual models. This aggregation helps reduce the overall error, particularly by
averaging out individual model biases and errors.

Key Differences Between Bagging and Subagging


Data Sampling: In bagging, data is sampled with replacement, meaning some data points can
appear multiple times in the training set for a particular model. In subagging, data is sampled
without replacement, ensuring that each data point in the training set appears only once in each
model’s training data.

Subset Size: In bagging, the training sets are typically the same size as the original dataset. In
subagging, the training subsets are smaller and are typically a fraction of the original dataset,
often around 50-80% of the total data size.

Goal: Bagging focuses on reducing variance by using multiple models trained on different data
samples, whereas subagging aims to achieve a balance between reducing variance and
introducing more diversity by using smaller, non-repetitive subsets of data.
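
One way to approximate subagging (a sketch under the assumption that scikit-learn's `BaggingClassifier` is acceptable for this purpose) is to turn off bootstrap sampling and train each model on a smaller random subset drawn without replacement:

```python
# Minimal sketch of subagging via BaggingClassifier (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# bootstrap=False -> sampling WITHOUT replacement; max_samples=0.6 -> each tree
# sees a random 60% subset of the data, so no point is repeated within a subset.
subag = BaggingClassifier(DecisionTreeClassifier(),
                          n_estimators=50,
                          max_samples=0.6,
                          bootstrap=False,
                          random_state=0)
subag.fit(X, y)
print("training accuracy:", subag.score(X, y))
```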

Advantages of Subagging
1. Reduced Variance: Similar to bagging, subagging helps reduce the variance of high-variance
models (e.g., decision trees), making the ensemble more stable and less prone to overfitting.
The different subsets of data prevent overfitting to any particular portion of the data.

2. Diversity of Models: Since subagging uses smaller random subsets of data without
replacement, the models in the ensemble are likely to make different types of errors. This
diversity can lead to better generalization and improved performance on new data.

3. Efficiency: Subagging can be more computationally efficient than bagging, especially when
working with large datasets. By using smaller subsets of data for training the models, the
training process requires less computational power, while still benefiting from the aggregation
of multiple models.

4. Improved Performance: Subagging can sometimes improve performance over bagging, especially when the base model is relatively complex. The smaller subsets lead to less overfitting, which can make the ensemble perform better on unseen data.

Disadvantages of Subagging
1. Potential Underfitting: Since each model in subagging is trained on a smaller subset of the
data, there is a risk of underfitting, especially if the base models are too simple or if the subsets
are too small. The model might not capture enough information to make accurate predictions on
its own.

2. Less Data Utilization: Since data points are not repeated in the subsets, there is less data
available for each individual model. This could potentially lead to weaker models if the base
model requires a lot of data to perform well.

3. Computational Overhead: Although subagging can be more efficient than bagging, the
computational cost can still be high because multiple models are being trained. In cases where
the models are particularly large or complex, the cost of training many models could still be
substantial.

Example Use Case: Random Forest vs. Subagging


To illustrate the difference between bagging and subagging, let’s compare Random Forest (which
uses bagging) with a model trained using subagging.

In Random Forest (bagging), multiple decision trees are trained on different bootstrapped
samples of the data. Each tree is trained on a subset of the data where some data points are
repeated, which can lead to models that overfit specific aspects of the data.

In subagging, smaller random subsets of data (without replacement) are used for training. This
reduces the likelihood of overfitting and can lead to greater diversity in the models, as each tree
has seen different data. The ensemble of these diverse models can perform better in terms of
generalization.

Boosting
Boosting is a technique that aims to reduce both bias and variance of the model by combining
many weak learners (models that perform slightly better than random guessing) to create a
strong learner (a model with high performance).

The central idea of boosting is to train models sequentially, where each new model corrects the
mistakes made by the previous ones.

Unlike bagging, where models are trained independently and aggregated, boosting focuses on improving the performance of the model iteratively by trying to correct the mistakes made by the previous ones.

This allows boosting to enhance the overall accuracy of the ensemble and produce highly
accurate models even when the individual base models are weak.

The term "weak learner" refers to a model that performs slightly better than random guessing.
In boosting, these weak models are usually simple algorithms like decision trees with a single
split, also known as decision stumps.

Boosting takes these simple models and builds a strong learner by adjusting the weights of the
data points to focus on the most difficult-to-predict instances.

How Boosting Works


1. Initial Model Training: Boosting starts by training the first model on the entire training dataset.
This model is often weak, meaning it doesn’t perform very well initially.

2. Error Calculation: After the first model makes predictions, the algorithm identifies which data
points were misclassified or wrongly predicted. These misclassified data points are the focus of
the next model, as the goal is to correct these errors in the subsequent model.

3. Weighting Misclassified Points: In boosting, the training instances that were misclassified are
given more weight, meaning they will have more influence on the training of the next model.
Conversely, correctly classified data points have less influence on the next model. By focusing
on the difficult cases, boosting ensures that the next model is better at handling the errors of
the previous one.

4. Training New Model: The next model is trained on the adjusted dataset, which now includes the
re-weighted data points that were misclassified by the previous model.

5. Iterative Process: This process is repeated for several iterations, with each new model being
trained to fix the errors of the previous models. As more models are added to the ensemble, the
system becomes increasingly accurate because each new model corrects the weaknesses of
the combined ensemble.

6. Final Prediction: Once all models are trained, their predictions are combined. In classification
tasks, this is typically done through a weighted voting scheme, where models that performed
better have more influence in the final prediction. In regression tasks, predictions are usually
averaged.

Types of Boosting Algorithms


1. AdaBoost (Adaptive Boosting):

AdaBoost was one of the earliest and most widely used boosting algorithms.

It works by assigning equal weights to all training data points at the start. After each model
is trained, the weights of misclassified points are increased, making those points more
influential in the next round of learning.

AdaBoost then combines the predictions of all models, with each model’s influence
determined by its accuracy (i.e., better models have a higher weight in the final prediction).

Strength: AdaBoost is relatively simple and can significantly improve the performance of
weak models.

2. Gradient Boosting:

Gradient Boosting takes a different approach by fitting a new model to the residuals
(errors) of the combined ensemble of previous models, rather than directly focusing on
misclassified points.

It uses gradient descent, an optimization technique, to minimize the errors (residuals) at each step, adjusting the learning process to reduce the loss function over time.

Gradient boosting algorithms like XGBoost, LightGBM, and CatBoost are highly optimized
variants of gradient boosting that are widely used in machine learning competitions because
of their efficiency and performance.

Strength: Gradient boosting is flexible, powerful, and can handle various types of predictive
tasks, including classification and regression.

3. Other Variants:

LogitBoost: A variant of boosting that is specifically designed for binary classification problems using logistic regression models as the base learners.

Stochastic Gradient Boosting: A modification of gradient boosting that introduces randomness by selecting a subset of the data for each model, helping to reduce overfitting.

Why Boosting Works


Improves Performance by Focusing on Errors: Boosting helps to improve performance by
correcting the errors made by previous models. Each subsequent model focuses on difficult-to-
classify instances, and by doing so, the ensemble learns to generalize better to new data.

Reduction of Bias and Variance: Boosting can reduce both bias and variance. By combining
several weak models, boosting reduces bias by improving the model’s predictions. It also
reduces variance by averaging out errors, ensuring that the final model is less likely to overfit
the data.

Flexibility: Boosting works well with both weak classifiers and complex models. It can be used
for a wide range of machine learning tasks, from regression to classification, and is effective for
both simple and highly complex datasets.

Advantages of Boosting
1. High Accuracy: Boosting typically results in highly accurate models. By focusing on the
hardest-to-predict examples, boosting can improve the performance of weak learners and
achieve better accuracy than many other models.

2. Works Well with Weak Learners: Boosting is particularly useful when the base learners (weak
models) are simple, like decision stumps (small decision trees). Despite their simplicity, boosting
can create an ensemble that performs at a high level.

3. Reduces Overfitting: Although boosting can overfit if not carefully tuned, it generally reduces
overfitting by focusing on the most difficult instances and smoothing out model errors.

4. Versatility: Boosting algorithms, especially Gradient Boosting, can be adapted to both
regression and classification tasks and can handle both linear and non-linear relationships in
the data.

Disadvantages of Boosting
1. Overfitting Risk: If too many models are added or if the base learner is too complex, boosting
can start to overfit the training data. Therefore, it's crucial to tune hyperparameters like the
number of models and the learning rate to avoid overfitting.

2. Computationally Expensive: Boosting algorithms can be computationally intensive, especially when training many models in the sequence. Gradient boosting, in particular, can be slow for large datasets.

3. Sensitive to Noisy Data: Boosting can be sensitive to noisy data and outliers. Since it focuses
on correcting errors from previous models, noisy data points or outliers may be overemphasized
in the model-building process, leading to overfitting.

4. Interpretability: Boosting models, especially those like Gradient Boosting or XGBoost, can be
hard to interpret compared to simpler models like decision trees. The final model is an
aggregate of many weak learners, making it challenging to understand the reasoning behind
specific predictions.
Stumping (Decision Stump)
A decision stump is a very simple machine learning model that is often used as a weak learner
in ensemble methods like boosting.

A decision stump is essentially a decision tree with a very limited structure—often consisting of
just one level (a root and two leaves).

Despite its simplicity, decision stumps can be highly effective when used in boosting
algorithms, where multiple stumps are combined to form a strong learner.

A decision stump is a decision tree with only one decision node, meaning it splits the data
based on a single feature, and then assigns a prediction to each of the resulting branches
(leaves).

This simple structure allows the decision stump to make basic, but potentially useful,
predictions.

However, because of its simplicity, a single decision stump is usually a weak learner, meaning it
has a high error rate when used independently.

Since it’s a single decision, it is a very weak learner, meaning it’s not very accurate on its own,
but it can still be useful when combined with other models in an ensemble.

This is done through a process of iterative correction, where each stump in the sequence
focuses on the mistakes made by the previous stumps.

Structure of a Decision Stump


A decision stump operates by making a decision based on the value of a single feature.

For a classification problem, this decision is typically a binary split: if the value of the feature is
above or below a certain threshold, the stump assigns one of two possible labels.

For a regression problem, the decision stump predicts a constant value based on the feature.

The structure can be illustrated as follows:

For Classification: The stump examines one feature, compares it to a threshold, and
classifies the data into one of two classes. For example, if the feature value is less than 5,
assign class "A", otherwise, assign class "B".

For Regression: The stump predicts a constant value, typically the mean value of the target
variable, based on the value of the feature. For example, if the feature is less than 5, predict
the mean target value for all data points where the feature is less than 5.

Why Use Decision Stumps in Boosting?


1. Weak Learners: A decision stump, on its own, is a weak learner because it only uses one
feature to make predictions. This means it’s likely to have a high error rate, making it unsuitable
for standalone predictive tasks. However, in boosting, weak learners are combined to form a
strong learner, so the goal is not to rely on a single stump but to aggregate multiple stumps to
improve predictive power.

2. Simplicity and Speed: One of the advantages of decision stumps is their simplicity. Since they
are essentially one-level decision trees, they are computationally inexpensive and can be
trained quickly. This is important in boosting, as the algorithm requires many iterations of model
training. Using decision stumps allows boosting algorithms to train quickly, even with large
datasets.

3. Focusing on Mistakes: In boosting algorithms like AdaBoost or Gradient Boosting, each new
model is trained to correct the errors made by the previous model. Since decision stumps are
weak learners, they are particularly useful in this process because they can focus on small,
specific parts of the data (the mistakes made by previous models). This iterative correction
process allows boosting to significantly improve the overall performance of the ensemble
model.

How Decision Stumps Work in Boosting


When decision stumps are used in boosting, the general process follows these steps:

1. Initial Training: The first decision stump is trained on the entire dataset, where each data point
is treated equally in the training process. This stump will likely make many errors, but it starts
the boosting process.

2. Error Identification: After the first stump is trained, the boosting algorithm calculates the errors
it made, typically focusing on the misclassified data points. In AdaBoost, for instance, the
misclassified data points are given higher weights, so that subsequent stumps will focus on
these harder-to-predict instances.

3. Sequential Stump Training: A new decision stump is trained on the updated dataset, which now
includes the re-weighted data points. This second stump is more likely to correct the errors
made by the first stump.

4. Combining Stumps: Once all stumps are trained, their predictions are combined in a way that
emphasizes the most accurate models. In AdaBoost, this involves assigning more weight to the
predictions of stumps that performed well and less weight to those that performed poorly. The
final prediction is made by combining the outputs of all stumps, often through weighted voting
(classification) or averaging (regression).

Example of Decision Stumps


Imagine you have a dataset where you want to classify whether a customer will buy a product
(Yes or No) based on their age.

A decision stump might look like this:

Feature: Age

Threshold: 30

If the customer's age is greater than 30, the model might predict "Yes" (the customer will buy
the product), and if the customer's age is less than or equal to 30, it might predict "No" (the
customer will not buy the product). This is a simple, one-level decision tree that only splits the
data on one feature—age.

Advantages of Using Decision Stumps in Boosting


1. Simplicity: Decision stumps are extremely simple models, which makes them fast to train and
easy to understand. This simplicity allows boosting algorithms to focus on correcting errors
rather than building complex individual models.

2. Flexibility: Despite their simplicity, decision stumps can be effective when used in boosting
because they allow the boosting algorithm to create a more powerful ensemble model by
focusing on difficult data points.

3. Efficiency: Since decision stumps are simple, they can be trained quickly. This efficiency is
important in boosting, as many stumps are trained in sequence.

4. Reduction of Bias: When combined in boosting algorithms, decision stumps help reduce bias by
iteratively correcting errors in previous stumps, leading to a more accurate overall model.

Disadvantages of Using Decision Stumps


1. Limited Capacity: Since decision stumps only use a single feature to make predictions, their
capacity to learn complex relationships in the data is very limited. This is why they are
considered weak learners on their own.

2. Risk of Overfitting: Although decision stumps themselves are simple, boosting algorithms can
still overfit if too many stumps are used, especially in noisy datasets. The model may begin to
focus too heavily on the noise in the data, leading to overfitting.
AdaBoost
AdaBoost (Adaptive Boosting) is one of the most popular and influential boosting algorithms in
machine learning.

It was introduced by Yoav Freund and Robert Schapire in 1995 and is known for its simplicity,
effectiveness, and ability to improve the performance of weak learners.

AdaBoost is designed to create a strong predictive model by combining the outputs of multiple
weak learners.

The key idea behind AdaBoost is to iteratively focus on the data points that are difficult to
classify, adjusting the weight of each data point based on the mistakes made by the previous
models.

The goal of AdaBoost is to improve the performance of weak learners by combining them into a
strong learner.

A weak learner is a model that performs slightly better than random guessing.

In AdaBoost, the weak learners are typically decision stumps (one-level decision trees),
although other models can be used as well.

Each decision stump is trained sequentially, and AdaBoost focuses more on the misclassified
examples from previous iterations, ensuring that each new model corrects the errors made by
its predecessors.

How AdaBoost Works: Step-by-Step Process

Step 1: Initialize Weights


Initially, each training instance x_i is assigned an equal weight w_i = 1/N, where N is the total number of training samples.

These weights determine how much influence each training instance has on the learning
process.

Step 2: Train the First Weak Learner


The first weak learner (typically a decision stump) is trained using the training data with the
current weights.

The learner makes predictions for all data points, and an error rate ϵ is computed as the weighted sum of the misclassified instances:

ϵ = Σ_i w_i · I(y_i ≠ ŷ_i)

where I(y_i ≠ ŷ_i) is 1 if the prediction is wrong, and 0 if the prediction is correct.

Step 3: Update Weights Based on Errors


After the first weak learner is trained, AdaBoost increases the weights of the misclassified
instances to give them more importance in the next round of learning. The new weight for
misclassified instances is adjusted using the formula:

w_i ← w_i · exp(α)   for misclassified samples

w_i ← w_i · exp(−α)   for correctly classified samples

Here, α is a coefficient that measures the accuracy of the current learner, given by:

α = (1/2) · ln((1 − ϵ) / ϵ)

The coefficient α becomes larger when the learner performs well, and smaller when the learner performs poorly. This ensures that the weak learners that perform well are given higher weight, while those that perform poorly have less influence.

Step 4: Train the Next Learner


A new weak learner is then trained, but this time, it is trained with the updated weights, so it
focuses more on the previously misclassified instances. The process of training the learner,
calculating the error, updating the weights, and adjusting α is repeated for a predefined number
of iterations or until no further improvement can be made.

Step 5: Final Model


After multiple iterations, the final strong model is formed by taking a weighted vote of the
predictions of all weak learners. The weak learners that performed well are given more
influence in the final prediction, while those that performed poorly have less influence.

For classification, the final prediction is made by majority voting, where each weak learner casts
a weighted vote. For regression, the predictions of all weak learners are averaged, weighted by
their respective αvalues.

Example of AdaBoost in Action


Suppose you have a dataset with a mix of easy and difficult-to-classify points. Here's how
AdaBoost would process this:

In the first round, the weak learner might correctly classify most of the easy points but make
mistakes on the difficult ones. AdaBoost increases the weights of the difficult points.

In the second round, the next weak learner is trained, and because the difficult points now have
higher weights, the new model focuses more on getting these points right.

This process continues, with each new model focusing on correcting the mistakes of previous
ones, gradually improving the overall model’s performance.

By the end of the process, AdaBoost has created an ensemble of weak learners, where each
learner focuses on different parts of the data. The final strong model is built by combining these
learners into a single, highly accurate model.

Advantages of AdaBoost
1. High Accuracy: AdaBoost can achieve high accuracy by combining weak learners in a way that
improves their performance. It is particularly effective for classification tasks with complex
decision boundaries.

2. Simplicity: The AdaBoost algorithm is relatively simple to implement, making it an attractive choice for machine learning practitioners.

3. Focus on Hard-to-Classify Examples: AdaBoost is excellent at identifying and correctly classifying difficult examples that might otherwise be misclassified by a simple model.

4. Versatility: While decision stumps are commonly used as base learners, AdaBoost can work
with other types of models as well, such as decision trees, linear models, or even neural
networks, though decision stumps are preferred due to their simplicity.

Disadvantages of AdaBoost
1. Sensitive to Noisy Data and Outliers: Since AdaBoost increases the weight of misclassified
points, it can give too much attention to noisy data or outliers. If these points are incorrectly
labeled, the algorithm may overfit to them, reducing overall performance.

2. Overfitting: If AdaBoost is run for too many iterations, it may overfit the training data, especially
if the weak learners become too specialized in correcting noise or outliers.

3. Computational Cost: Although AdaBoost is efficient in terms of the individual learners, the
iterative process can still be computationally expensive, particularly when dealing with large
datasets or many iterations.

Differentiate between bagging and boosting. [OCT 23]
| Aspect | Bagging | Boosting |
|---|---|---|
| Full Form | Bootstrap Aggregating | Boosting |
| Purpose | Reduces variance by averaging predictions | Reduces bias by focusing on difficult cases |
| Technique | Combines predictions of multiple models (often the same type) by averaging (regression) or voting (classification) | Combines weak learners sequentially, with each new model focusing on mistakes made by the previous one |
| Model Types | Usually uses the same type of model (e.g., decision trees) | Uses weak learners (often decision trees) which can be of any type |
| Sampling Method | Data is sampled with replacement (bootstrapping) to create multiple datasets | No resampling; each new model is trained on the entire dataset with adjusted weights for misclassified samples |
| Model Training | Models are trained independently in parallel | Models are trained sequentially, with each new model building on the previous one |
| Impact on Bias | Primarily reduces variance; little effect on bias | Reduces both bias and variance (focuses on correcting errors) |
| Overfitting | Less prone to overfitting, as variance is reduced | More prone to overfitting, particularly with many models |
| Example Algorithms | Random Forest, Bagged Decision Trees | AdaBoost, Gradient Boosting Machines (GBM), XGBoost, LightGBM |
| Weights of Models | All models are given equal weight in the final prediction | Models are weighted based on their accuracy; more accurate models have higher weights |
| Focus on Errors | Equal treatment for all data points; does not focus on previous errors | Focuses on difficult-to-classify points by adjusting the weights of misclassified data |
| Computation | Faster in parallel since models are independent | Slower due to sequential training of models |
| Performance | Works well with high variance models (e.g., decision trees) | Generally leads to better performance when bias is high, especially with weak learners |

Write a short note on Ensemble learning methods: [OCT 22]
Ensemble learning refers to a technique in machine learning where multiple models (often called
"learners") are combined to solve a problem and make predictions.

The idea is that by combining the outputs of several models, we can achieve better
performance than any single model could on its own.

The core assumption behind ensemble methods is that combining different models can reduce
errors and improve the overall prediction accuracy.

Ensemble learning methods can be broadly categorized into two types: Simple and Advanced.

i) Simple
1. Simple Ensemble Learning Methods
Simple ensemble methods typically involve combining several similar models in a
straightforward manner. They are often easier to implement and understand. The most common
simple ensemble techniques are:

1. Bagging (Bootstrap Aggregating)

Bagging is a technique that aims to reduce variance and avoid overfitting. In this method,
several models are trained independently on different random subsets of the training
data.

These subsets are created through a process called bootstrapping, where random
samples of the data are drawn with replacement.

This means that some data points might appear more than once in each subset, while
others may be left out.

Once the models are trained, their predictions are combined.

For regression tasks, the final prediction is usually the average of the predictions made
by each model.

For classification tasks, the final prediction is determined by majority voting, where the
class that gets the most votes from all models is chosen.

Example: A popular algorithm that uses bagging is Random Forests. In a random forest,
multiple decision trees are trained on different bootstrap samples of the dataset, and the
final classification or regression prediction is made by aggregating the results of all the
trees.

2. Boosting

Boosting is another ensemble method that aims to improve the accuracy of weak
models.

In boosting, models are trained sequentially, with each new model focusing on the
mistakes made by the previous ones.

The idea is that each model in the sequence tries to correct the errors of the previous
model by giving more weight to the misclassified data points.

The models are combined in a weighted manner, where more weight is given to the
predictions of models that perform better.

The final prediction is typically made by summing up the predictions from all the models.

Example: AdaBoost (Adaptive Boosting) is a popular boosting algorithm. It starts by training a weak model (like a decision tree stump) on the data, then adjusts the weights of the misclassified examples so that subsequent models focus more on these challenging instances. The final prediction is a weighted sum of the predictions of all the models.

3. Voting

Voting is a simple ensemble method that combines multiple models to make a final
prediction by majority rule.

In classification problems, each model in the ensemble casts a "vote" for a particular
class, and the class that receives the most votes is selected as the final prediction.

In regression tasks, the average of the individual model predictions is taken as the final
prediction.

Voting can be used with different types of models, such as decision trees, support vector
machines, or neural networks, as long as they are independent of each other. There are
two common types of voting:

Hard Voting: Each model predicts a class, and the class with the majority of votes is
chosen.

Soft Voting: Each model predicts class probabilities, and the class with the highest
average probability across all models is selected.

Example: A common approach is using different classifiers like logistic regression, decision trees, and k-nearest neighbors in a voting ensemble to classify data points. The diversity of models helps to reduce the risk of overfitting and bias.
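
A short sketch of hard and soft voting with scikit-learn's VotingClassifier follows; the particular base models and settings are illustrative assumptions rather than a recommended configuration.

```python
# Hedged voting sketch; base models and data are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=1)

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(max_depth=4)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]

# Hard voting: majority class wins; soft voting: average predicted probabilities.
hard_vote = VotingClassifier(estimators=estimators, voting="hard").fit(X, y)
soft_vote = VotingClassifier(estimators=estimators, voting="soft").fit(X, y)
print(hard_vote.predict(X[:5]), soft_vote.predict(X[:5]))
```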
ii) Advanced
Advanced ensemble learning methods build on the basic ideas of simple methods but often
involve more sophisticated techniques, complex strategies, or deeper integrations of models.
These methods generally aim to reduce bias and variance more effectively or to improve the
model's interpretability.

1. Stacking

Stacking (or stacked generalization) is an advanced ensemble method where the
predictions of several base models are used as input to a higher-level model, called a
meta-learner.

Unlike bagging and boosting, which combine predictions directly, stacking uses a
second-level model to learn how best to combine the outputs of the base models.

In stacking, multiple models are trained on the training data, and their predictions are
then used as features for a meta-model.

The meta-model is trained to make the final prediction based on the predictions of the
base models.

This approach allows for the possibility of combining the strengths of different types of
models.

Example: Imagine you use a decision tree, a support vector machine (SVM), and a neural
network as base models. The meta-learner could be a logistic regression model that
takes the predictions of these models as input to predict the final output.
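
The following sketch mirrors that idea with scikit-learn's StackingClassifier, using a decision tree and an SVM as base models and logistic regression as the meta-learner (the neural network from the example is omitted to keep the snippet short); all settings are illustrative assumptions.

```python
# Hedged stacking sketch; models, data, and hyperparameters are assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=7)

base_models = [
    ("tree", DecisionTreeClassifier(max_depth=4)),
    ("svm", SVC(probability=True)),
]

# The meta-learner (final_estimator) learns how to combine the base models'
# predictions, which scikit-learn generates via internal cross-validation.
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression())
stack.fit(X, y)
print("Stacking training accuracy:", stack.score(X, y))
```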

2. Gradient Boosting Machines (GBM)

Gradient Boosting is a more advanced and powerful variant of boosting.

Instead of just giving more weight to the misclassified instances, gradient boosting fits
new models to the residual errors (the difference between the observed and predicted
values) of the previous models.

This technique reduces both bias and variance and can produce highly accurate models.

In gradient boosting, each new model in the sequence tries to predict the residuals of the
previous models by focusing on where the errors are largest.

The final prediction is the sum of all the predictions from the models in the sequence.

Example: XGBoost and LightGBM are implementations of gradient boosting that are
highly efficient and have become very popular due to their accuracy and speed in
handling large datasets.
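
As a hedged illustration, the snippet below uses scikit-learn's GradientBoostingClassifier rather than XGBoost or LightGBM, since the core idea of fitting each new tree to the residuals of the current ensemble is the same; the data and hyperparameters are assumptions.

```python
# Hedged gradient boosting sketch; data and settings are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=12, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

# Each new tree is fitted to the residual errors of the trees built so far.
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
gbm.fit(X_train, y_train)
print("GBM test accuracy:", gbm.score(X_test, y_test))
```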

3. Stochastic Gradient Boosting

Stochastic Gradient Boosting is a variation of gradient boosting where random subsets of the data are used to build each model in the sequence, reducing the correlation between the models.

This randomness helps prevent overfitting and increases the model’s generalization
ability.

Example: In XGBoost, stochastic gradient boosting is achieved by subsampling the training data for each boosting iteration.

4. Blending

Blending is similar to stacking but typically uses a simple holdout validation set to train
the meta-learner.

In stacking, the meta-model is trained using cross-validation, which can be
computationally expensive.

Blending, on the other hand, usually splits the training data into two parts: one to train the
base models and the other to train the meta-model.

Example: If you're combining decision trees, support vector machines, and neural
networks, you would use a holdout set to train a logistic regression model that takes the
outputs of these models as input.

Random Forests

Random Forests is an ensemble learning technique primarily used for classification and
regression tasks.

It is an extension of decision trees that aims to increase accuracy, reduce overfitting, and
improve robustness.

The key idea behind Random Forests is to combine the outputs of many decision trees, each
trained on different parts of the data, to make better overall predictions.

A decision tree is a model that makes predictions by asking a series of questions about the
features of the input data.

Each question divides the data into different subsets, and the process continues until the data is
divided into homogenous groups (where the output is as pure as possible).

While decision trees are simple and interpretable, they tend to overfit the training data,
especially when the tree is very deep and complex.

Instead of relying on a single model, ensemble methods like Random Forest use the idea that "a
group of weak learners can make a strong learner."

By combining many trees, Random Forests help overcome the limitations of individual decision
trees.

This technique involves training each tree on a random sample of the data (with replacement).

This means that each tree is trained on a slightly different subset of the dataset, which
introduces diversity among the trees.

Random Forests also introduce randomness in the features used to split the data at each node
of a tree.

Instead of considering all features when making a split, a random subset of features is chosen
at each node, further increasing the diversity between the trees.

How Random Forests Work: Step-by-Step


1. Step 1: Create Multiple Decision Trees:

A Random Forest begins by training multiple decision trees.

Each tree is built using a different random subset of the training data (via bootstrap
sampling), and at each split within a tree, only a random subset of features is considered.

2. Step 2: Train Each Tree Independently:

Each decision tree is trained independently.

Since the trees are built on slightly different data and use different features, each tree may
learn different patterns in the data.

This variability helps to make the ensemble more robust.

3. Step 3: Make Predictions Using All Trees:

Once all the trees are trained, they are used to make predictions. For a given input, each
tree in the forest will produce its own prediction.

For Classification: The final prediction is made using majority voting. The class that is
predicted by the majority of trees becomes the final prediction.

For Regression: The final prediction is typically the average of the individual tree
predictions.

4. Step 4: Combine the Predictions:

The predictions from all the trees are combined to make the final prediction, which usually
leads to better performance than any single decision tree.

Advantages of Random Forests


1. Reduces Overfitting:

One of the biggest strengths of Random Forests is their ability to reduce overfitting. Since
the trees are trained on different data subsets and use different features, they are less likely
to memorize (overfit) the training data. The ensemble nature helps to smooth out individual
errors made by individual trees.

2. Improved Accuracy:

Random Forests usually perform better than a single decision tree because they aggregate
the predictions of multiple trees, which helps to improve generalization and reduce bias.

3. Handles Missing Data Well:

Random Forests can handle missing data in a robust way. During training, when some data
points are missing, Random Forests can still create trees based on the available data and
make predictions even for missing values.

4. Handles Both Classification and Regression:

Random Forests can be applied to both classification and regression tasks. For
classification, it predicts the majority class, and for regression, it predicts the average of the
individual tree predictions.

5. Feature Importance:

Random Forests provide a built-in feature importance metric, which tells you which features
are most influential in making predictions. This can help with feature selection and
understanding the data.

6. Robust to Noise:

Because of the randomness introduced during training, Random Forests are relatively robust
to noise and can handle high-dimensional datasets (datasets with many features).

Disadvantages of Random Forests


1. Model Complexity:

While Random Forests are powerful, they can become computationally expensive,
especially with large datasets and many trees. Training many trees and making predictions
with them can require more memory and processing power than simpler models.

2. Less Interpretability:

One of the trade-offs of using an ensemble method like Random Forest is that it is much
harder to interpret than a single decision tree. While individual decision trees are easy to
understand and visualize, a forest of hundreds or thousands of trees is not easy to interpret.

3. Slower for Real-Time Predictions:

Due to the large number of trees, Random Forests can be slower for making predictions
compared to simpler models. For tasks requiring real-time predictions, the prediction time
might be a concern.

Example of Random Forests in Action


Let’s consider a simple example where we want to predict whether a customer will buy a product
based on their age and income. The dataset might look something like this:

| Customer | Age | Income | Buy Product (Y/N) |
| --- | --- | --- | --- |
| 1 | 25 | 50k | N |
| 2 | 30 | 60k | Y |
| 3 | 35 | 70k | Y |
| 4 | 40 | 80k | N |

To build a Random Forest model:

1. Step 1: Use bootstrap sampling to create multiple different subsets of the dataset. For example,
one tree might be trained on customers 1, 2, and 3, while another tree might be trained on
customers 2, 3, and 4.

2. Step 2: Each tree in the Random Forest will build a decision tree using different features at each
split. For example, one tree might use "Age" to split the data, while another might use "Income."

3. Step 3: When a new customer with a certain age and income wants to be predicted, each tree in
the forest will predict whether they will buy the product or not. If the majority of trees predict
"Yes," then the final prediction will be "Yes."

4. Step 4: Combine the predictions from all the trees to give the final prediction.
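
A small sketch of these steps with scikit-learn's RandomForestClassifier is given below. It reuses the four illustrative rows from the table, so it only demonstrates the API; a real model would need far more data.

```python
# Hedged Random Forest sketch on the tiny customer table above; purely illustrative.
from sklearn.ensemble import RandomForestClassifier

X = [[25, 50], [30, 60], [35, 70], [40, 80]]   # [Age, Income in thousands]
y = ["N", "Y", "Y", "N"]                       # Buy Product (Y/N)

# 100 trees, each trained on a bootstrap sample with random feature subsets per split.
forest = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
forest.fit(X, y)

# Each tree votes; the majority class becomes the prediction for a new customer.
print("Prediction for a 28-year-old earning 55k:", forest.predict([[28, 55]]))
print("Feature importances (Age, Income):", forest.feature_importances_)
```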

💡 1. Explain the Random Forest Algorithm with an example. [OCT 22]

2. Write a short note on: [APR 2023]

a. Random Forest

b. Adaboost

Diagnostics of Classifiers
In machine learning, evaluating the performance of a classifier is crucial for understanding how
well it can make predictions on unseen data.

A classifier is an algorithm that assigns labels to data points based on learned patterns,
typically used for classification tasks where the goal is to categorize input data into predefined
classes.

To determine the effectiveness of a classifier, various evaluation metrics are employed.

These metrics help assess different aspects of classifier performance, including accuracy,
precision, recall, and the trade-offs between them.

1. Accuracy

Accuracy is the most commonly used metric to evaluate classifiers.

It is a simple measure of how often the classifier makes correct predictions.

Accuracy is calculated as the ratio of correct predictions to the total number of predictions
made.

The formula for accuracy is:

$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$


While accuracy gives a good overview of model performance, it may not be suitable in
cases where the data is imbalanced (i.e., one class is much more frequent than others).

In such situations, even if the model predicts the majority class most of the time, it could still
achieve high accuracy while performing poorly on the minority class.

2. Precision

Precision is a metric that focuses on the accuracy of positive predictions.

It measures how many of the instances predicted as positive by the classifier were actually
positive.

Precision is important when the cost of false positives (predicting an incorrect positive
outcome) is high, such as in spam detection or medical diagnosis.

The formula for precision is:


$$\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}$$

Where:

True Positives (TP) are the correctly predicted positive instances.

False Positives (FP) are the instances that were incorrectly classified as positive.

A high precision means that the classifier is reliable when it predicts a positive outcome, but
it does not tell us about how well the model performs on negative instances.

3. Recall (Sensitivity)

Recall, also known as sensitivity or true positive rate, measures the classifier's ability to
identify all the relevant instances within the data.

It focuses on the true positives (TP) and examines how many of the actual positive
instances the model successfully identifies.

Recall is critical when the cost of false negatives (failing to identify a positive outcome) is
high, such as in detecting diseases or fraud detection.

The formula for recall is:


$$\text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}$$

Where:

False Negatives (FN) are the instances that were incorrectly classified as negative.

Recall is useful when the goal is to minimize missed positives, but it does not consider how
many false positives the model makes.

4. F1-Score

The F1-Score is the harmonic mean of precision and recall.

It balances the trade-off between precision and recall by combining them into a single
number.

The F1-Score is particularly useful when you need a balance between precision and recall
and when you are dealing with an imbalanced dataset.

The formula for F1-Score is:


$$\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

The F1-Score ranges from 0 to 1, where 1 indicates perfect precision and recall, and 0
indicates the worst performance.

It provides a more nuanced evaluation when both false positives and false negatives are
important.

5. Confusion Matrix

The confusion matrix is a tool for summarizing the performance of a classifier.

It is a table that compares the predicted labels with the true labels of a dataset.

The confusion matrix is especially useful in understanding the types of errors the classifier
is making.

It contains four key components:

True Positives (TP): Instances that were correctly classified as positive.

False Positives (FP): Instances that were incorrectly classified as positive.

True Negatives (TN): Instances that were correctly classified as negative.

False Negatives (FN): Instances that were incorrectly classified as negative.

From the confusion matrix, a variety of evaluation metrics can be derived, such as
precision, recall, and accuracy.

It also provides insight into the specific types of mistakes the model is making (e.g., false
positives vs. false negatives).
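
A short sketch of computing a confusion matrix and the metrics derived from it with scikit-learn follows; the label vectors are invented purely for illustration.

```python
# Hedged confusion-matrix sketch; the toy labels are assumptions for illustration.
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() returns the four cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
```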

6. ROC Curve and AUC

The Receiver Operating Characteristic (ROC) curve is a graphical representation of the
classifier’s performance across all possible classification thresholds.

It plots the true positive rate (recall) on the y-axis against the false positive rate (1 -
specificity) on the x-axis.

Each point on the curve represents a different threshold for classifying a positive instance,
and as you move along the curve, you observe the trade-off between the number of true
positives and false positives.

The Area Under the Curve (AUC) quantifies the overall ability of the classifier to
discriminate between positive and negative classes.

The AUC score ranges from 0 to 1, with higher values indicating better model performance.

A model with an AUC of 0.5 indicates no discrimination (i.e., random guessing), while a
value of 1 indicates perfect classification.

7. Specificity

Specificity, also known as the true negative rate, is the measure of how well the classifier
can identify negative instances.

It is the proportion of actual negatives that are correctly identified as such.

Specificity is especially important when the cost of false positives is high.

The formula for specificity is:


True Negatives (TN)
Specificity = True Negatives (TN) + False Positives (FP) 

A high specificity means that the model does a good job of identifying negative instances
and avoiding false positives.

8. Matthews Correlation Coefficient (MCC)

The Matthews Correlation Coefficient (MCC) is a measure that takes into account all four
quadrants of the confusion matrix.

It is considered a more balanced measure than accuracy, especially when dealing with
imbalanced datasets.

MCC returns a value between -1 and 1, where:

1 indicates perfect predictions.

0 indicates random predictions.

-1 indicates completely wrong predictions.

The formula for MCC is:


$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

MCC is particularly useful for evaluating binary classification tasks with imbalanced classes.
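
The toy snippet below checks the MCC formula against scikit-learn's matthews_corrcoef; the labels are made up only to show that the manual calculation and the library agree.

```python
# Hedged MCC check; the label vectors are illustrative assumptions.
import math
from sklearn.metrics import confusion_matrix, matthews_corrcoef

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
mcc_manual = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

# Both values should be identical (0.5 for this toy example).
print(mcc_manual, matthews_corrcoef(y_true, y_pred))
```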

💡 1. Explain: [OCT 23]
i) Accuracy

ii) Precision
iii) Recall
iv) F-Score

Explain in brief methods used for evaluating classification models. [OCT 22]

What is the relation between precision and recall? Explain with an example. [OCT 23]
Precision and recall are two important performance metrics used to evaluate classification
models, particularly when the classes are imbalanced (i.e., one class is much more frequent
than the other).

These metrics are related to how the model handles positive predictions, but they focus on
different aspects of performance.

1. Precision measures the accuracy of positive predictions. It answers the question: Of all the
instances the model classified as positive, how many were actually positive?

Formula:

$$\text{Precision} = \frac{TP}{TP + FP}$$

where:

TP = True Positives (correctly predicted positive instances)

FP = False Positives (incorrectly predicted as positive)

2. Recall (also known as Sensitivity or True Positive Rate) measures the ability of the model
to identify all positive instances. It answers the question: Of all the actual positive instances,
how many did the model correctly identify?

Formula:
$$\text{Recall} = \frac{TP}{TP + FN}$$


where:

FN = False Negatives (instances that were actually positive but predicted as negative)

Example to Understand Precision and Recall


Let's consider a binary classification problem where the task is to predict whether an email is spam
or not spam. Suppose the model makes the following predictions:

True Positives (TP): 80 emails correctly classified as spam

False Positives (FP): 10 emails incorrectly classified as spam (but are not spam)

False Negatives (FN): 5 emails incorrectly classified as not spam (but are actually spam)

True Negatives (TN): 200 emails correctly classified as not spam

Now, we can calculate the precision and recall:

Precision:

$$\text{Precision} = \frac{TP}{TP + FP} = \frac{80}{80 + 10} = \frac{80}{90} \approx 0.89$$

This means that 89% of the emails classified as spam by the model are actually spam.

Recall:

$$\text{Recall} = \frac{TP}{TP + FN} = \frac{80}{80 + 5} = \frac{80}{85} \approx 0.94$$

This means that 94% of all actual spam emails were correctly identified by the model.
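
The same arithmetic can be reproduced in a few lines of Python; the counts are taken directly from the example above.

```python
# Reproducing the spam-filter precision/recall calculation from the example.
TP, FP, FN, TN = 80, 10, 5, 200

precision = TP / (TP + FP)   # 80 / 90  ≈ 0.89
recall = TP / (TP + FN)      # 80 / 85  ≈ 0.94
print(f"Precision = {precision:.2f}, Recall = {recall:.2f}")
```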

Trade-off Between Precision and Recall


Precision and recall often exhibit a trade-off. That is, improving one can sometimes result in the
deterioration of the other:

If the model is biased towards precision (i.e., trying to minimize false positives), it may classify
fewer emails as spam, reducing the chance of false positives but also missing some actual
spam emails, leading to lower recall.

If the model is biased towards recall (i.e., trying to minimize false negatives), it may classify
more emails as spam, which increases the number of correct spam classifications but also
increases the risk of incorrectly labeling non-spam emails as spam (false positives), lowering
precision.

Write a short note on the importance of the confusion matrix. [OCT 22]
The confusion matrix is a key tool in evaluating the performance of classification models,
particularly in binary and multi-class classification tasks.

It provides a detailed breakdown of the model's predictions by comparing them with the actual
outcomes in the dataset.

This matrix helps assess not just the overall accuracy of the model but also its ability to classify
each class correctly and the types of errors it is making.

A confusion matrix typically includes four components for binary classification:

True Positives (TP): The number of instances that are correctly predicted as positive.

True Negatives (TN): The number of instances that are correctly predicted as negative.

False Positives (FP): The number of instances that are incorrectly predicted as positive
(type I error).

False Negatives (FN): The number of instances that are incorrectly predicted as negative
(type II error).

These components allow us to calculate important metrics such as:

Accuracy: The proportion of total correct predictions.

Precision: The proportion of positive predictions that are actually correct (TP / (TP + FP)).

Recall (Sensitivity): The proportion of actual positives that are correctly identified (TP / (TP
+ FN)).

F1-Score: The harmonic mean of precision and recall, providing a balanced measure when
classes are imbalanced.

Why is the Confusion Matrix Important?


1. Detailed Performance Analysis: Unlike simple accuracy, the confusion matrix allows for a more
nuanced understanding of model performance. It helps identify whether the model is biased
towards one class (e.g., predicting the majority class more frequently) or if it’s making specific
types of errors.

2. Handling Imbalanced Classes: In cases of imbalanced datasets, where one class is much more
frequent than the other, accuracy alone can be misleading. The confusion matrix gives insights
into the types of misclassifications (false positives and false negatives), allowing for more
effective strategies to handle class imbalance.

3. Improving Model Decision-Making: By analyzing the confusion matrix, you can identify where
your model is going wrong. For example, if the model has a high number of false positives, you
might want to adjust the decision threshold or explore techniques like class weighting or
resampling.

4. Evaluation of Specific Metrics: Precision, recall, and F1-score are particularly useful when the
costs of different types of errors are unequal, or when the classes are not equally important.
The confusion matrix enables the calculation of these metrics, providing a more comprehensive
view of model performance.

Macro-Average and Micro-Average Precision, Recall and F1-Score
Macro-Average Precision, Recall, and F1-Score
Macro-averaging involves calculating the performance metrics for each class independently,
and then taking the average of these metrics.

This method treats all classes equally, regardless of how many instances belong to each class.

1. Macro-Average Precision:

To compute Macro-Average Precision, we first calculate the precision for each class
individually.

Precision for a class is the ratio of correctly predicted positive instances (True Positives)
to the total predicted positives (True Positives + False Positives).

Once we have the precision for each class, we take the average of these values across
all classes.
Formula:

$$\text{Macro Precision} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FP_i}$$

Where N is the number of classes, TP_i is the number of True Positives for class i, and FP_i is the number of False Positives for class i.


2. Macro-Average Recall:

Similarly, we calculate recall for each class independently.

Recall for a class is the ratio of correctly predicted positive instances (True Positives) to
the total actual positives (True Positives + False Negatives).

After computing the recall for each class, we take the average of these recall values
across all classes.

Formula:

$$\text{Macro Recall} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FN_i}$$

Where FN_i is the number of False Negatives for class i.


3. Macro-Average F1-Score:

The F1-Score is the harmonic mean of precision and recall, providing a balance between
the two.

For Macro-Average F1-Score, we calculate the F1-Score for each class individually and
then average these scores.

Formula:

$$\text{Macro F1} = \frac{1}{N} \sum_{i=1}^{N} \frac{2 \cdot \text{Precision}_i \cdot \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i}$$

Macro-Average is useful when you want to treat all classes equally, regardless of their size or
distribution in the dataset.

However, it can be heavily influenced by classes with fewer instances, which may lead to a
lower score if the model performs poorly on rare classes.

Micro-Average Precision, Recall, and F1-Score


Micro-averaging, on the other hand, aggregates the contributions of all classes before
computing the performance metrics.

Instead of calculating Precision, Recall, and F1-Score for each class individually and then
averaging, micro-averaging sums up the True Positives, False Positives, and False Negatives
across all classes and then computes the metric.

1. Micro-Average Precision:

a. To compute Micro-Average Precision, we sum up all True Positives (TP) and False
Positives (FP) across all classes, and then calculate the precision based on these totals.
Formula:

$$\text{Micro Precision} = \frac{\sum_{i=1}^{N} TP_i}{\sum_{i=1}^{N} TP_i + \sum_{i=1}^{N} FP_i}$$

Where TP_i and FP_i are the True Positives and False Positives for class i.

2. Micro-Average Recall:

For Micro-Average Recall, we sum up all True Positives and False Negatives across all
classes, and then compute the recall.

Formula:

$$\text{Micro Recall} = \frac{\sum_{i=1}^{N} TP_i}{\sum_{i=1}^{N} TP_i + \sum_{i=1}^{N} FN_i}$$

Where FN_i is the number of False Negatives for class i.


3. Micro-Average F1-Score:

Micro-Average F1-Score is the harmonic mean of Micro Precision and Micro Recall,
which are calculated using the aggregated sums of True Positives, False Positives, and
False Negatives.

Formula:

$$\text{Micro F1} = 2 \times \frac{\text{Micro Precision} \times \text{Micro Recall}}{\text{Micro Precision} + \text{Micro Recall}}$$

Micro-Averaging is useful when the dataset is imbalanced, or when you want to give more
weight to the overall performance of the model rather than the individual class performance.

Since it aggregates the contributions from all classes, it is less sensitive to the performance on
any specific class, particularly small classes.

It focuses more on the classes with a larger number of instances.
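
The sketch below contrasts macro- and micro-averaged scores on a small invented multiclass example using scikit-learn's `average` parameter; the labels are assumptions chosen only for illustration.

```python
# Hedged macro vs. micro averaging sketch; toy multiclass labels are assumptions.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 0, 2]

for avg in ("macro", "micro"):
    p = precision_score(y_true, y_pred, average=avg)
    r = recall_score(y_true, y_pred, average=avg)
    f = f1_score(y_true, y_pred, average=avg)
    # Macro averages per-class scores equally; micro pools TP/FP/FN across classes.
    print(f"{avg:5s}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}")
```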

When to Use Each


Macro-Average is suitable when you care equally about the performance on each class,
especially when you have imbalanced datasets and want to avoid bias toward larger classes.

Micro-Average is useful when you are more interested in the model’s overall ability to predict
correctly, especially in cases where large classes dominate the data and you want to prioritize
their performance.

ROC Curve
The Receiver Operating Characteristic (ROC) curve is a graphical representation used to
evaluate the performance of a binary classification model.

It provides a way to assess how well a model distinguishes between two classes (e.g., positive
and negative).

The ROC curve is widely used in fields such as medicine, machine learning, and signal
detection to compare different classifiers and to choose the best model based on certain
performance metrics.

How the ROC Curve Works


The ROC curve is created by plotting the TPR (on the vertical axis) against the FPR (on the
horizontal axis) for different thresholds.

In binary classification, a model usually outputs a probability score for each instance, indicating
how likely it is to belong to the positive class.

A threshold is then applied to this score to decide whether an instance is classified as positive
or negative.

For example, if the model predicts a probability above 0.5, it might classify the instance as
positive; otherwise, it classifies it as negative.

By varying this threshold from 0 to 1, we can observe how the TPR and FPR change, which
produces different points on the ROC curve.

The more the curve leans toward the top-left corner (where TPR is high and FPR is low), the
better the model’s performance.

Interpreting the ROC Curve
Top-left corner (ideal point): This represents a point where the model has a high TPR (almost 1)
and a low FPR (close to 0). This means the model is correctly identifying most positive cases
and minimizing false positives.

Diagonal line (random classifier): If the ROC curve lies along the diagonal line from the bottom-
left to the top-right (also known as the line of no-discrimination), the model is no better than
random guessing. In this case, the TPR is approximately equal to the FPR, meaning the model
fails to distinguish between the positive and negative classes.

Area under the ROC Curve (AUC): The area under the ROC curve is often used as a summary
statistic for the model's performance. The AUC ranges from 0 to 1:

An AUC of 1 means perfect classification.

An AUC of 0.5 means the model has no discriminative ability, equivalent to random
guessing.

An AUC less than 0.5 indicates a model that is worse than random guessing (which
suggests that the model might need to be re-evaluated or retrained).

Advantages of the ROC Curve


1. Threshold-Invariant: The ROC curve evaluates the performance of the model across all
possible thresholds, providing a comprehensive view of its behavior.

2. Class Imbalance: The ROC curve is particularly useful in situations where the dataset is
imbalanced (i.e., one class is much more prevalent than the other). It focuses on the relationship
between TPR and FPR, so it does not get overly affected by class imbalance as accuracy might.

3. Comparison of Models: Since the ROC curve plots TPR against FPR, it allows easy comparison
between different models. A model that consistently stays above the curve of another model is
considered better.

Limitations of the ROC Curve


1. Insensitive to Class Imbalance (in some cases): Although the ROC curve is less sensitive to
imbalanced datasets compared to accuracy, it can still be misleading if the dataset is highly
imbalanced. In cases where the negative class dominates, the FPR can become very small,
leading to a seemingly high ROC curve even if the model fails to detect the positive class
effectively. In such cases, other metrics like Precision-Recall curves may be more informative.

2. No Information about the Cost of Misclassification: The ROC curve does not take into account
the different costs of misclassifications (i.e., False Positives vs. False Negatives). In some
applications, False Negatives might be more costly than False Positives (e.g., in medical
diagnoses), and this is not captured by the ROC curve alone.

Area Under the Curve (AUC)

AUC, or "Area Under the ROC Curve," is a metric used to evaluate the performance of a
classifier.

It is closely linked to the Receiver Operating Characteristic (ROC) curve, which visually shows
how well a model distinguishes between positive and negative classes at various decision
thresholds.

AUC represents the area under the ROC curve, and it provides a measure of how well the model
can distinguish between two classes.

Instead of assessing a model through the ROC curve visually, AUC summarizes it into a single
numerical value, with a higher AUC indicating superior model performance.

A single AUC value allows for easy comparison between different models. The model with the
higher AUC typically performs better in classification when tested on the same dataset.

Interpreting AUC Values


The AUC score ranges from 0 to 1, with the following interpretations:

AUC = 1: This indicates perfect performance. A model with an AUC of 1 has no errors and is able
to perfectly distinguish between positive and negative instances. It means that for any pair of a
positive and a negative instance, the model will always rank the positive instance higher than
the negative one.

AUC = 0.5: This indicates no discrimination ability. A model with an AUC of 0.5 performs no
better than random guessing. For example, it might rank positive and negative instances
randomly, with no preference for correctly identifying one class over the other. In this case, the
ROC curve lies along the diagonal line from the bottom-left to the top-right, also known as the
line of no discrimination.

AUC < 0.5: This indicates poor performance, worse than random guessing. A model with an
AUC less than 0.5 means that, for some reason, the model is consistently predicting the
opposite of what it should—ranking negative instances higher than positive ones. This could be
due to a model that is fundamentally flawed or misconfigured.

AUC > 0.5: Any AUC value above 0.5 indicates that the model has some discriminative power
and is better than random guessing. A higher AUC value corresponds to a model that is better at
distinguishing between the classes.

How AUC Is Calculated


To calculate AUC, the True Positive Rate (TPR) and False Positive Rate (FPR) are computed for
various classification thresholds.

These rates are then plotted on the ROC curve, and the area under the curve is determined.

AUC can be computed using various methods, such as the trapezoidal rule, which approximates
the area under the curve by dividing it into smaller trapezoids and summing their areas.
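
A minimal sketch of computing ROC curve points and the AUC with scikit-learn follows; the model, data, and split are illustrative assumptions.

```python
# Hedged ROC/AUC sketch; dataset and classifier choice are assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]        # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # points of the ROC curve
print("AUC:", roc_auc_score(y_test, scores))      # area under that curve
```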

Advantages of AUC
1. Threshold Independence: A key advantage of AUC is that it evaluates the model's performance
across all possible decision thresholds, meaning it doesn't depend on a specific threshold (like
0.5). This is useful because in some situations, you may not want to set a fixed threshold for
classification, and AUC allows you to assess the model's overall capability across a range of
thresholds.

2. Comprehensive Evaluation: AUC provides a single value that summarizes the model’s ability to
distinguish between positive and negative classes, making it easier to compare different
models. It gives a holistic view of a model's performance, regardless of the threshold used to
classify the instances.

3. Better for Imbalanced Datasets: In cases where one class is much more frequent than the other
(a common issue in real-world data), AUC can still provide useful insights into the model's
performance. Traditional metrics like accuracy may be misleading in imbalanced datasets, as a
model that predicts the majority class for all instances could still have high accuracy. AUC,
however, focuses on how well the model differentiates between classes, making it more
informative in such situations.

Limitations of AUC
1. Does Not Consider Costs of Misclassification: AUC treats all false positives and false
negatives equally, without considering the potential costs of these errors. In some applications,
false positives might be much more costly than false negatives (or vice versa). AUC does not
take this into account, so it may not always reflect the true importance of the model’s errors for
a particular use case.

2. Can Be Misleading with Highly Imbalanced Data: While AUC is generally a good metric, it can sometimes be misleading when there is a very large class imbalance. In extreme cases, a model might show a high AUC even if it fails to identify the minority class adequately, since the majority class can dominate the evaluation.

3. No Direct Interpretation of Classification Performance: Although AUC provides a good


overview of how well a model discriminates between classes, it does not directly tell you how
well the model will perform in practice for a specific threshold. For example, a model might have
a high AUC but perform poorly for a threshold that is important for a particular application.

Cross Validation
Cross-validation is a technique used in machine learning and statistics to check how well a
model performs on new, unseen data.

It helps in evaluating the performance of a predictive model by splitting the available dataset
into multiple subsets, training the model on some of these subsets, and testing it on the
remaining ones.

This process helps us estimate how the model will perform when faced with new data, which
reduces the risk of overfitting (when a model performs well on training data but poorly on new
data).

In machine learning, we typically split our data into two parts: one for training the model and
one for testing it.

The model is trained on the training data and then evaluated on the test data.

However, if we only split the data once, there's a chance that the model's performance could
depend too much on how the data was split.

For example, if the test set doesn't represent the entire dataset well, the performance might not
show how the model will do in real-life situations.

Cross-validation addresses this by making multiple splits, which helps provide a more accurate
estimate of the model's performance.

This gives a clearer picture of how the model might behave when applied to new, unseen data.

One common challenge in training machine learning models is overfitting, where the model gets
too good at fitting the training data but struggles with new data.

This can happen when the model is too complex or trained for too long on a small dataset.

Cross-validation helps reduce the risk of overfitting by ensuring the model is tested on different
subsets of the data, not just the training set it was first trained on.

Types of Cross-Validation
1. Holdout Method:

The Holdout Method is one of the simplest types of cross-validation. In this method, the
entire dataset is randomly divided into two separate sets: one for training the model and the
other for testing it.

Typically, a fixed percentage of the data, like 70% or 80%, is used for training, and the
remaining 20% or 30% is used for testing.

The model is trained on the training set and then evaluated on the test set.

This method is easy to implement and computationally inexpensive, making it useful for
quick evaluations.

2. k-Fold Validation:

k-Fold Cross-Validation is an extension of the Holdout Method designed to provide a more robust estimate of a model's performance.

In this technique, the dataset is randomly divided into k equal-sized subsets or "folds."

For each fold, the model is trained using the remaining k-1 folds and tested on the fold that
was left out.

This process is repeated k times, each time using a different fold as the test set while the
remaining folds are used for training.

The final performance score is averaged across all k iterations.

3. Leave-P-Out Cross-Validation:

Leave-P-Out Cross-Validation (LPOCV) is an even more exhaustive approach compared to k-Fold Cross-Validation.

In this method, p data points are left out as the test set, and the model is trained on the
remaining data points.

This process is repeated such that every possible combination of p data points is used as
the test set at least once.

For example, if you have a dataset of 100 samples and choose to use Leave-1-Out (which is a special case of LPOCV with p = 1), the model would be trained 100 times, each time leaving out one data point for testing. In the case of Leave-2-Out, the model would be trained once for every possible pair of held-out points, which is $\binom{100}{2} = 4950$ times, and so on.

Leave-P-Out Cross-Validation provides a very thorough estimate of the model's performance because it ensures that every data point in the dataset gets tested.

It is particularly useful when the dataset is small, as it maximizes the use of available data
for training and testing.

However, it comes with a significant computational cost because it requires training the
model multiple times, which can be impractical for larger datasets or more complex models.
Holdout Method
The Holdout Method is one of the simplest and most commonly used techniques for evaluating
the performance of machine learning models.

The main idea behind the holdout method is to split the available dataset into two distinct
subsets: one used for training the model and the other used for testing it.

This method is widely applied because it is easy to understand, simple to implement, and
computationally efficient.

In the Holdout Method, the dataset is randomly divided into two parts: a training set and a test
set.

The training set is used to build the model, while the test set is kept aside and used solely to
evaluate the model's performance after training.

Typically, the data is split in a 70-30 or 80-20 ratio, where 70% or 80% of the data is allocated
for training and the remaining 30% or 20% is used for testing.

For example, in a dataset with 100 data points, if the holdout ratio is 80-20, then 80 data points
would be randomly selected for training, and the remaining 20 data points would form the test
set.

The model is trained on the 80 data points, and after the training is complete, it is tested on the
remaining 20 data points.

The performance of the model on the test set (such as accuracy, precision, recall, etc.) is then
reported as an estimate of how well the model will perform on new, unseen data.
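
A holdout split like the one described above might look as follows with scikit-learn's train_test_split; the dataset and classifier are assumptions used only to show the 80-20 split.

```python
# Hedged holdout sketch; data and model choice are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# 80% of the rows train the model; the held-out 20% estimate its performance.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)
print("Holdout test accuracy:", model.score(X_test, y_test))
```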

Advantages of the Holdout Method


1. Simplicity: It is one of the easiest methods to implement. All that is required is to randomly split
the dataset into two parts and then train and test the model.

2. Speed: Since the data is split into two sets (training and testing), this method can be very fast
compared to more complex cross-validation techniques, particularly when dealing with large
datasets.

3. Computational Efficiency: The holdout method is computationally light because the model is
trained only once on the training set and tested once on the test set. This makes it suitable for
situations where computational resources are limited.

Disadvantages of the Holdout Method


1. Variance in Performance Estimates: The main limitation of the holdout method is that the
model's performance can vary depending on how the data is split. If the dataset is not large
enough or if the random split is not representative of the entire dataset, the performance
estimate may not be reliable. For example, if the model is tested on a particularly difficult or
easy subset of the data, the results may not reflect how the model would perform on unseen
data.

2. Risk of Underfitting or Overfitting: If the split between the training and test set is not well-
balanced or if the model is too complex or too simple, it can lead to underfitting (where the
model fails to capture the underlying patterns) or overfitting (where the model performs very
well on the training set but poorly on the test set). This issue is particularly problematic when
working with small datasets.

3. Not Fully Utilizing the Data: Since only a subset of the data is used for training and the rest is used for testing, the model is not trained on the entire dataset. In scenarios where the dataset is small, this can be inefficient, as it does not fully leverage all available data for model training.

When to Use the Holdout Method


The Holdout Method is most useful when:

Data is abundant: If there is a large enough dataset, a single train-test split can give a
reasonably good estimate of model performance, and the limitations of the method become less
significant.

Speed is a priority: In situations where quick results are needed, the holdout method can be an
effective way to get an initial sense of how well the model might perform on unseen data.

Computational resources are limited: When training multiple models or performing more
complex cross-validation methods is computationally expensive, the holdout method offers a
quick alternative.
k-Fold Validation
K-Fold Cross Validation is a widely used technique in machine learning for evaluating the
performance and generalizability of a model.

It divides the dataset into K equal or nearly equal subsets (called "folds") and uses each subset
as a testing set while using the remaining K-1 subsets for training the model.

This process is repeated K times, with each fold being used as the test set exactly once.

The results of these K evaluations are then averaged to provide a more reliable estimate of the
model's performance.

How K-Fold Cross Validation Works


1. Divide the Data into K Folds: The dataset is randomly split into K equally sized subsets, or
"folds." The number K is a parameter that you define (commonly K=5 or K=10). Each fold will
contain a subset of the data, and all the folds together will cover the entire dataset.

2. Train and Test the Model: The model is trained K times. For each of the K iterations:

One fold is used as the test set (the data used to evaluate the model’s performance).

The remaining K-1 folds are combined to form the training set (the data used to train the
model).

3. Evaluate the Model: After each training iteration, the model is evaluated on the test fold.
Performance metrics like accuracy, precision, recall, or F1-score are computed for that fold.

4. Average the Results: Once all K iterations are complete, the performance results from each fold
are averaged to provide a final performance metric. This averaged score represents the model's
general performance on the dataset.

Example of K-Fold Cross Validation


Suppose we have a dataset of 1000 instances, and we choose K=5 for our cross-validation:

1. Step 1 (Split the Data): Divide the data into 5 folds (each fold contains 200 instances).

2. Step 2 (Train and Test):

In the first iteration, the model is trained on folds 2, 3, 4, and 5, and tested on fold 1.

In the second iteration, the model is trained on folds 1, 3, 4, and 5, and tested on fold 2.

This process continues until each fold has been used as a test set once.

3. Step 3 (Evaluate): After each iteration, the performance metric (e.g., accuracy) is calculated.

4. Step 4 (Average the Results): Once all iterations are complete, the average performance across
all 5 folds is computed. This final average provides a more reliable estimate of the model's
performance compared to using a single train-test split.
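
The same procedure can be expressed compactly with scikit-learn's KFold and cross_val_score, as in the hedged sketch below; the dataset and model are illustrative assumptions.

```python
# Hedged 5-fold cross-validation sketch; data and model are assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Each of the 5 folds serves as the test set exactly once.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=kfold)

print("Per-fold accuracy:", scores)
print("Average accuracy :", scores.mean())
```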

Advantages of K-Fold Cross Validation


1. Better Performance Estimation: Unlike a single train-test split, k-fold cross-validation reduces
the variance of the performance estimate. Since the model is tested on different subsets of
data, it is less likely to be biased toward a particular train-test split, leading to a more reliable
performance measure.

2. Efficient Use of Data: Every data point is used for both training and testing. This helps
especially when you have limited data, ensuring that all data points contribute to the evaluation
process.

3. Prevents Overfitting: By training and testing the model on different subsets, k-fold cross-
validation helps in reducing the risk of overfitting. Overfitting occurs when a model performs
well on a specific training set but poorly on unseen data, and k-fold cross-validation can detect
this issue by testing the model on different portions of the data.

4. Flexibility: K-fold cross-validation can be adapted to different datasets by varying the number
of folds. Common values for K are 5 or 10, but you can choose a different number based on
the size and characteristics of your data.

Disadvantages of K-Fold Cross Validation


1. Computationally Expensive: The main downside of k-fold cross-validation is that it can be
computationally expensive, especially with large datasets or complex models. Since the model
is trained K times (once for each fold), it can take significantly longer to run than a single
train-test split.

2. Bias in Small Datasets: In cases where the dataset is very small, even small changes in how the
data is split into folds can lead to biased performance estimates. In such cases, the
performance scores might not be as reliable as those obtained from larger datasets.

3. Not Ideal for Time Series: K-fold cross-validation assumes that the data is independent and
identically distributed (i.i.d.). However, for time series data, where the order of data points
matters (e.g., stock prices or weather data), this technique may not work well because the data
points are not independent. In such cases, other techniques like Time Series Cross Validation
should be considered.

Choosing the Right Value of K

The value of K plays an important role in the performance of K-Fold Cross Validation. The choice of
K depends on the size of the dataset and the trade-off between computational efficiency and model
evaluation reliability:

Small K (e.g., K=2 or K=3): This leads to fewer training iterations and may provide faster
results, but it increases the variance in the performance estimate because the training and test
sets are not as varied.

Large K (e.g., K=10): This provides more reliable and stable performance estimates, but it is
more computationally expensive. K=10 is a common choice in many applications because it
strikes a balance between reliability and efficiency.
Leave-P-Out Cross-Validation (LPOCV)
Leave-P-Out Cross-Validation (LPOCV) is a more generalized version of cross-validation
techniques that is used to evaluate the performance of a machine learning model.

In this approach, instead of leaving out just one data point (as in Leave-One-Out Cross-Validation or LOOCV), a subset of p data points is left out from the training set during each iteration.

The model is trained on the remaining data and tested on the held-out p data points.

This process is repeated for all possible combinations of p data points being left out, providing a robust performance measure of the model.

The Basic Process of Leave-P-Out Cross-Validation


The steps for performing Leave-P-Out Cross-Validation are as follows:

1. Divide the Dataset: Given a dataset of size n, you select a value for p (the number of data points to be left out in each iteration). The dataset is then used to form combinations of p points that will be held out for testing, while the remaining n − p data points are used for training the model.

2. Train and Test the Model: For each combination of p data points held out, the model is trained using the remaining n − p data points. After training, the model is evaluated on the p held-out points (i.e., the test set for that iteration).

3. Repeat the Process: This process is repeated for all possible combinations of p data points. If n is the total number of data points in the dataset, the number of iterations will be the number of ways to choose p points out of n, which is given by the binomial coefficient $\binom{n}{p}$.

4. Performance Evaluation: After performing all iterations, the performance scores (such as
accuracy, precision, recall, etc.) from each iteration are averaged to obtain an overall
performance metric for the model.

Example of Leave-P-Out Cross-Validation


Let’s consider an example where n = 5 and p = 2 (Leave-2-Out Cross-Validation). You have a dataset with 5 data points: {x1, x2, x3, x4, x5}

In Leave-2-Out Cross-Validation, the model will be trained and tested on combinations of 2 data points left out during each iteration. The different combinations of held-out data points would be:

Iteration 1: Test on {x1, x2}, train on {x3, x4, x5}

Iteration 2: Test on {x1, x3}, train on {x2, x4, x5}

Iteration 3: Test on {x1, x4}, train on {x2, x3, x5}

Iteration 4: Test on {x1, x5}, train on {x2, x3, x4}

Iteration 5: Test on {x2, x3}, train on {x1, x4, x5}

And so on for all combinations.

In each iteration, the model is trained on 3 data points and tested on 2 data points. After all
iterations, the average performance score across all tests is used to evaluate the model.
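
The example above can be reproduced with scikit-learn's LeavePOut splitter, as in the sketch below; the tiny dataset and the 1-nearest-neighbour classifier are assumptions made only to keep the illustration small.

```python
# Hedged Leave-2-Out sketch on a five-point toy dataset; purely illustrative.
import numpy as np
from sklearn.model_selection import LeavePOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1], [2], [3], [4], [5]])    # x1 ... x5
y = np.array([0, 0, 1, 1, 1])

lpo = LeavePOut(p=2)                       # C(5, 2) = 10 train/test splits
scores = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=lpo)

print("Number of iterations:", lpo.get_n_splits(X))
print("Average accuracy    :", scores.mean())
```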

Advantages of Leave-P-Out Cross-Validation


1. Comprehensive Evaluation: By testing the model on every possible combination of p data points, LPOCV provides a very thorough evaluation of how well the model generalizes to unseen data. It uses all possible test sets, which helps ensure the evaluation is robust.

2. Reduces Bias: Like other cross-validation techniques, Leave-P-Out Cross-Validation minimizes bias by ensuring that every subset of data gets a chance to be used as a test set. This reduces the likelihood that the model's performance is overestimated due to an unrepresentative train-test split.

3. Flexibility: Leave-P-Out Cross-Validation allows for different values of p, making it a flexible tool. This can be especially useful when you want to test the model's performance on different sizes of test sets (e.g., testing with 2 points out versus 3 points out).

Disadvantages of Leave-P-Out Cross-Validation


1. Computationally Expensive: The most significant drawback of Leave-P-Out Cross-Validation is that it can be computationally expensive, particularly for large datasets. For each iteration, the model must be trained on n − p data points and tested on p data points. As the number of combinations grows rapidly with n, the computational cost can become prohibitive, especially for large datasets or complex models.

2. Infeasibility for Large Datasets: For large datasets, the number of possible combinations of data points increases quickly. For example, if n = 100 and p = 2, the number of iterations is $\binom{100}{2} = 4950$. This is often too large to be computationally feasible, making LPOCV impractical for datasets with many data points or when p is large.

3. Overfitting Risk in Small Datasets: While Leave-P-Out Cross-Validation is excellent for avoiding overfitting in large datasets by ensuring thorough testing, it can sometimes lead to overfitting in smaller datasets. This is because with each test set consisting of just p data points, the model might overly specialize on very small subsets of data.

💡 1. What is K-fold cross-validation? In K-fold cross-validation, comment on the following
situations: [OCT 23] [APR 23]

a. When the value of K is too large

b. When the value of K is too small

How do you decide the value of K in K-fold cross-validation?

Consider the following data to predict student pass or fail using the K-Nearest Neighbor Algorithm (KNN) for the values Physics = 6 marks, Chemistry = 8 marks with a number of Neighbors K=3. [OCT 22]

| Physics (marks) | Chemistry (marks) | Results |
| --- | --- | --- |
| 4 | 3 | Fail |
| 6 | 7 | Pass |
| 7 | 8 | Pass |
| 5 | 5 | Fail |
| 8 | 8 | Pass |
