ML Unit 4
Supervised learning is a type of machine learning in which, during the learning process, the algorithm is given input data along with the correct output, or label.
The goal of supervised learning is to learn a mapping from inputs to outputs, so the model can
predict the output for new, unseen data.
The model learns from examples where both the input data and the correct class label are
provided.
Once trained, the model can classify new, unseen data into one of the categories it learned from
the training data.
Types of Classification Problems
1. Binary Classification:
This involves exactly two classes. For example, in a medical diagnosis task, the goal of classification could be to predict whether a patient has a certain disease based on their symptoms.
The input features (data) could include things like age, blood pressure, temperature, and heart rate, while the output (label) would be either "disease present" or "disease not present." A minimal code sketch of such a classifier is shown after this list.
2. Multiclass Classification:
This involves more than two classes. The goal is to classify an input into one of multiple classes.
Examples include handwritten digit recognition (digits 0–9) and classifying an image of a fruit as an apple, banana, or orange.
3. Multilabel Classification:
In multilabel classification, each instance can belong to more than one class at the same time.
For example:
Image tagging, where an image can have multiple tags like "cat", "beach", "sunset".
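As an illustration of the binary (disease present / disease not present) example above, here is a minimal sketch using scikit-learn; the patient values and 0/1 labels are invented purely for demonstration:

```python
# Minimal supervised binary classification sketch (illustrative values only).
from sklearn.linear_model import LogisticRegression

# Hypothetical patients: [age, blood_pressure, temperature, heart_rate]
X_train = [[25, 120, 36.6, 70],
           [63, 150, 38.2, 95],
           [47, 135, 37.9, 88],
           [31, 118, 36.5, 72]]
y_train = [0, 1, 1, 0]  # 1 = "disease present", 0 = "disease not present"

model = LogisticRegression()
model.fit(X_train, y_train)                   # learn the mapping from inputs to labels
print(model.predict([[52, 140, 38.0, 90]]))   # classify a new, unseen patient
```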
Common Classification Algorithms
1. Logistic Regression:
It models the probability that a given input belongs to a particular class using a logistic function (sigmoid); the sigmoid formula is given after this list.
2. Decision Trees:
A flowchart-like structure where each internal node represents a feature, each branch
represents a decision rule, and each leaf node represents a class label.
3. Support Vector Machines (SVM):
SVMs find a hyperplane that best separates data points of different classes.
They can handle both linear and non-linear classification problems using kernel tricks.
4. K-Nearest Neighbors (KNN):
A non-parametric method where a data point is classified based on the majority class of its
K nearest neighbors.
Simple and intuitive, but can be computationally expensive for large datasets.
5. Random Forests:
An ensemble method that combines multiple decision trees to improve accuracy and reduce
overfitting.
6. Neural Networks:
Neural networks consist of layers of interconnected nodes (neurons) and are capable of
learning complex patterns.
They are particularly effective for large datasets and tasks like image classification, natural
language processing, and speech recognition.
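For reference, the logistic (sigmoid) function used by logistic regression is:
σ(z) = 1 / (1 + e^(−z))
and the model estimates P(y = 1 | x) = σ(w ⋅ x + b), predicting class 1 when this probability exceeds a chosen threshold (commonly 0.5).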
Evaluation Metrics for Classification
Precision: The ratio of correctly predicted positive observations to the total predicted positives.
It answers the question: Of all the instances the model predicted as positive, how many were
actually positive?
Recall (Sensitivity): The ratio of correctly predicted positive observations to all actual positives.
It answers: Of all the actual positive instances, how many did the model identify?
F1-Score: The harmonic mean of precision and recall, useful when there is an imbalance
between classes (i.e., one class is more prevalent than the other).
ROC (Receiver Operating Characteristic) curve plots the true positive rate (recall) against
the false positive rate.
AUC (Area Under the Curve) represents the overall performance of the model.
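These metrics can be computed directly with scikit-learn's metrics module (assumed available); the labels and scores below are made up for illustration:

```python
# Sketch: computing the evaluation metrics above with scikit-learn.
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                    # actual labels
y_pred  = [1, 0, 0, 1, 0, 1, 1, 0]                    # predicted labels
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]    # predicted probabilities for class 1

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1-score: ", f1_score(y_true, y_pred))
print("ROC AUC:  ", roc_auc_score(y_true, y_score))   # area under the ROC curve
```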
Challenges in Classification
1. Class Imbalance:
If one class is much more prevalent than others, the model may be biased toward the majority
class, resulting in poor performance on the minority class. Techniques like oversampling,
undersampling, or using specialized algorithms can help address this.
2. Overfitting:
If the model is too complex, it may perform well on the training data but poorly on unseen data.
This is called overfitting. Regularization techniques, such as pruning decision trees or adding
penalties to the loss function, can help prevent overfitting.
3. Feature Selection:
Choosing the right features is crucial for model performance. Irrelevant or redundant features
can reduce accuracy and increase computational complexity. Feature selection methods like
forward selection, backward elimination, and L1 regularization can be used.
K-Nearest Neighbors (KNN)
KNN falls under the category of instance-based learning, meaning that it makes predictions based
on the instances (or examples) in the training dataset rather than constructing an explicit model.
KNN is often used for classification problems, where the goal is to categorize a new data point
into one of several classes based on the classes of its neighboring points in the training set.
The algorithm assumes that similar data points exist close to each other in space.
The idea is to predict the class of a data point by looking at the classes of its nearest neighbors
in the feature space.
Basic Idea:
KNN works by finding the "K" closest data points to a given data point and then making a
prediction based on the majority class of those closest neighbors.
The "neighbors" refer to the data points in the dataset that are closest to the point being
classified.
K is a parameter that defines how many neighbors should be considered when making the
classification decision.
The prediction for a new data point is made by finding the K closest data points from the
training dataset and assigning the most common class among these neighbors to the new point.
A small K (like 1 or 3) makes the model sensitive to noise in the data, while a large K makes it
more general but less sensitive to local variations.
Steps of the KNN Algorithm:
1. Choose the value of K, the number of neighbors to consider.
2. Calculate the distance between the new data point and all points in the training set.
3. Identify the K closest data points to the new data point based on the calculated distances.
4. Look at the class labels of these K nearest neighbors.
5. Assign the class label of the majority to the new data point.
Suppose you're trying to classify a new animal with a height of 15 cm and weight of 5 kg.
You choose K = 3.
The algorithm calculates the distance between this new animal and all the animals in the training
set.
It then selects the 3 closest animals (the nearest neighbors) and looks at their classes.
If two out of the three closest neighbors are labeled as "cat" and one is labeled as "dog," the
new animal will be classified as a "cat" because it has the majority class label.
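A minimal sketch of this example with scikit-learn's KNN implementation; the cat/dog measurements are invented for illustration:

```python
# Sketch of the cat/dog example with K = 3 (heights/weights are illustrative).
from sklearn.neighbors import KNeighborsClassifier

X_train = [[25, 4], [23, 5], [28, 6],     # cats: [height_cm, weight_kg]
           [50, 20], [55, 25], [48, 18]]  # dogs
y_train = ["cat", "cat", "cat", "dog", "dog", "dog"]

knn = KNeighborsClassifier(n_neighbors=3)   # K = 3
knn.fit(X_train, y_train)
print(knn.predict([[15, 5]]))               # classify the new animal (15 cm, 5 kg)
```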
A small K can make the model very sensitive to noise, potentially leading to overfitting.
A larger K makes the model more general, but it can smooth over the local patterns in the data,
potentially leading to underfitting.
Cross-validation is often used to select the optimal K value by testing different values and
checking which one performs best on unseen data.
Distance Metric: A way to measure how far apart data points are in the feature space. Common distance metrics are Euclidean distance, Manhattan distance, and Minkowski distance.
Advantages of KNN:
Simplicity: It’s easy to understand and implement.
Non-Parametric: KNN doesn't make assumptions about the underlying data distribution (e.g.,
normal distribution), which is useful when the data doesn't follow a specific pattern.
Disadvantages of KNN:
Computationally Expensive: As the size of the training dataset increases, the algorithm
requires more time to compute the distances for each new prediction. This is because the
algorithm has to compare the new data point with every point in the training set.
Storage: KNN requires storing all the training data, which can be memory-intensive, especially
for large datasets.
Sensitive to Irrelevant Features: If the data has many irrelevant or redundant features, the
algorithm’s performance may decrease because irrelevant features can distort the distance
calculations.
Dimensionality Reduction: Since KNN can suffer from the "curse of dimensionality" (where the
distance between points becomes less meaningful as the number of features increases),
techniques like PCA (Principal Component Analysis) can be used to reduce the number of
features.
Choose an appropriate K value: This can be done through cross-validation or using techniques
like the "elbow method" to find the optimal K value.
Normalize or scale features: Since KNN is distance-based, the algorithm is sensitive to the
scale of the features. For example, features with larger ranges (e.g., income in thousands vs.
age in years) can dominate the distance calculations. Normalizing or standardizing the data can
improve the performance of KNN.
Use distance weighting: Instead of giving equal importance to all neighbors, you can give
closer neighbors more weight when making the classification decision. This can be achieved by
using a weighted distance function, where the vote of each neighbor is weighted by its
proximity to the query point.
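The tips above (scaling, choosing K by cross-validation, and distance weighting) can be combined in a short sketch; the Iris dataset and the candidate K values are just example choices:

```python
# Sketch: scaled, distance-weighted KNN with K chosen by cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

for k in [1, 3, 5, 7, 9]:
    pipe = make_pipeline(
        StandardScaler(),                                          # normalize features
        KNeighborsClassifier(n_neighbors=k, weights="distance"),   # closer neighbors count more
    )
    score = cross_val_score(pipe, X, y, cv=5).mean()               # 5-fold cross-validation
    print(f"K={k}: accuracy={score:.3f}")
```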
Support Vector Machine (SVM) is a powerful and widely used algorithm for classification tasks
in machine learning.
It is particularly effective for binary classification, where the goal is to separate data into two
distinct classes.
SVM works by finding an optimal hyperplane that best divides the data into these classes.
A hyperplane is a decision boundary that separates the feature space into two regions, each
corresponding to a class.
The margin is defined as the distance between the hyperplane and the closest data points from
each class.
These closest points are called support vectors, and they play a critical role in defining the
optimal hyperplane.
SVM aims to maximize this margin because a larger margin is associated with better
generalization to unseen data.
In simple terms, by creating a wide gap between the classes, SVM reduces the chances of
misclassification on new, unseen data.
This is why SVM is often preferred for problems where the classes are well-separated, or close
to it.
Linear SVM
In its simplest form, SVM is used for linearly separable data, meaning the two classes can be
separated by a straight line (in two dimensions) or a hyperplane (in higher dimensions).
For linearly separable data, SVM works by finding the hyperplane that maximizes the margin
between the two classes.
Mathematically, the goal of SVM in the linear case is to find a hyperplane described by the
equation:
w ⋅ x + b = 0
Where:
w is the weight vector (perpendicular to the hyperplane), x is the input feature vector, and b is the bias term.
Non-Linear SVM and the Kernel Trick
The kernel trick involves transforming the data into a higher-dimensional space where it
becomes linearly separable.
In this new space, SVM can find a hyperplane that effectively separates the classes.
The kernel function takes the original data and maps it into a higher-dimensional space.
Common types of kernels include:
1. Linear Kernel: Computes the standard dot product in the original input space; suitable when the data is already (nearly) linearly separable.
2. Polynomial Kernel: Maps the data into a higher-dimensional space using polynomial
functions.
3. Radial Basis Function (RBF) Kernel: One of the most popular kernels, it maps data into an
infinite-dimensional space, making it effective for complex datasets.
The use of kernels allows SVM to perform well on complex datasets with non-linear decision
boundaries by implicitly mapping the data into a higher-dimensional space without the need to
compute this mapping explicitly.
Advantages of SVM
SVM has several advantages, making it a popular choice for classification tasks:
1. Effective in High-Dimensional Spaces: SVM works well when the number of features
(dimensions) is large, making it suitable for text classification and image recognition tasks.
2. Memory Efficiency: SVM is memory efficient because it only relies on the support vectors to
define the decision boundary, rather than using all the training data.
3. Robust to Overfitting: SVM is less prone to overfitting compared to some other algorithms,
especially in high-dimensional spaces, due to the use of the margin for classification.
4. Works Well with Non-linear Data: The kernel trick allows SVM to handle complex, non-linear
decision boundaries effectively.
Disadvantages of SVM
Despite its advantages, SVM has some limitations:
1. Computational Complexity: The training time for SVM can be high, especially for large
datasets. The need to compute pairwise distances between all data points and optimize the
margin can be computationally expensive.
2. Sensitivity to Noise: SVM can be sensitive to noisy data and outliers, particularly if the value of
the regularization parameter C is set too high.
3. Choice of Kernel and Hyperparameters: The performance of SVM heavily depends on the
choice of kernel and the appropriate setting of hyperparameters like C and kernel parameters.
Tuning these parameters can be time-consuming and require cross-validation.
Applications of SVM
SVM is used in a wide range of applications due to its ability to handle both linear and non-linear
data. Some common applications include:
1. Image Classification: SVM is widely used in computer vision tasks, such as classifying images
of objects or handwriting recognition.
2. Text Classification: SVM is often used in natural language processing (NLP) for tasks like spam
email detection or sentiment analysis.
3. Bioinformatics: SVM is used in areas like gene classification and protein structure prediction.
4. Face Detection: SVM has been used in detecting faces in images, a crucial task in computer
vision.
Maximum Margin Linear Separators (Optimal Decision Boundary) in Support
Vector Machines (SVM)
In machine learning, the goal of classification is to divide data into different classes or
categories.
In cases where the data is linearly separable (i.e., the classes can be separated by a straight
line or hyperplane in a multi-dimensional space), a linear classifier can be used.
Support Vector Machines (SVM) are a popular method for classification that focuses on finding
the optimal decision boundary to separate the classes.
A linear separator is a straight line (in two dimensions), a plane (in three dimensions), or a
hyperplane (in higher dimensions) that divides the data into different classes.
In binary classification, the objective is to find a hyperplane that separates two classes of
data.
A linear separator is called "linear" because it is a straight line (or a flat plane) in the feature
space.
For example, in 2D, a linear separator is simply a straight line that divides the data points of one
class from the other class.
In higher dimensions, it's a hyperplane, but the underlying concept is the same.
The margin is the distance between the linear separator (or hyperplane) and the closest data
points from either of the two classes.
A larger margin between the classes means that the classifier is more confident in its
predictions for new, unseen data.
The goal of SVM is to find the separator that maximizes this margin, as a wider margin
generally results in better generalization to new data.
The intuition is that a larger margin reduces the chances of misclassification in future data.
By focusing on the data points closest to the decision boundary (the support vectors), the
classifier can make more accurate predictions on unseen data.
4. Support Vectors:
Support vectors are the data points that are closest to the decision boundary (hyperplane).
These points are crucial because they directly influence the position and orientation of the
hyperplane.
The SVM algorithm uses only these support vectors to determine the optimal boundary.
The other data points (those far from the boundary) do not influence the position of the
hyperplane.
The support vectors are the most difficult to classify, so they define the boundary.
If you remove or change a support vector, the position of the decision boundary might
change significantly, while removing non-support vectors will not.
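A short sketch showing that, once a linear SVM is fitted, only the support vectors (and the resulting w and b) define the boundary; the blob dataset is an arbitrary toy example:

```python
# Sketch: inspecting the support vectors of a fitted linear SVM.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)  # toy 2-class data
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("Support vectors per class:", clf.n_support_)
print("Support vectors:\n", clf.support_vectors_)    # the points closest to the hyperplane
print("w =", clf.coef_, " b =", clf.intercept_)      # parameters of w ⋅ x + b = 0
```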
w ⋅ x + b = 0
where:
w is the weight vector, which is perpendicular to the hyperplane.
b is the bias term, which helps shift the hyperplane away from the origin.
The goal is to find values of w and b that maximize the margin between the two classes.
The margin is equal to 2 / ||w||, where ||w|| is the norm (magnitude) of the weight vector.
The reason we minimize ||w|| is that the margin is inversely proportional to the magnitude of the weight vector. By minimizing the weight vector's magnitude, we maximize the margin.
Constraints:
The margin is maximized subject to the constraint that the data points must be correctly
classified. For each data point, we need the following conditions:
w ⋅ xi + b ≥ +1 for data points with label yi = +1
w ⋅ xi + b ≤ −1 for data points with label yi = −1
Here, xi represents the data points, and the constraints ensure that all data points are on the
correct side of the margin, with a margin of at least 1 from the hyperplane.
1. Objective Function
min_w (1/2) ∥w∥²
The factor 1/2 simplifies the calculations later on.
2. Subject to Margin Constraints
For a given training dataset with features xi and corresponding labels yi , we need to enforce the
margin constraints. These constraints ensure that the data points are correctly classified with a
gap (margin) between them. The constraints can be written as:
yi (w ⋅ xi + b) ≥ 1, for all i
Here, w is the weight vector, b is the bias term, and yi is the label (either +1 or −1).
3. Lagrangian Function
To solve this constrained optimization problem, we use the Lagrange multiplier method. The
Lagrangian L(w, b, α) for this problem combines the objective function with the margin constraints, where αi are the Lagrange multipliers associated with each constraint:
L(w, b, α) = (1/2) ∥w∥² − ∑ᵢ αi [ yi (w ⋅ xi + b) − 1 ]   (sum over i = 1, …, n)
The term ∑ᵢ αi (yi (w ⋅ xi + b) − 1) represents the penalties for violations of the margin constraints. The αi are non-negative values that adjust how strongly each constraint is enforced.
4. Derivatives of the Lagrangian
Setting the derivative of L with respect to w to zero gives w = ∑ᵢ αi yi xi.
This shows that the weight vector w is a linear combination of the support vectors xi, weighted by αi yi.
Setting the derivative of L with respect to b to zero implies that the sum of the Lagrange multipliers weighted by the labels must be zero:
∑ᵢ αi yi = 0
5. Optimization Problem
Once we have the derivatives, we substitute these results into the Lagrangian to obtain the optimization problem in terms of the Lagrange multipliers αi. The dual form of the optimization problem is:
max_α ∑ᵢ αi − (1/2) ∑ᵢ ∑ⱼ αi αj yi yj (xi ⋅ xj)
subject to αi ≥ 0 and ∑ᵢ αi yi = 0.
6. Solving the Quadratic Programming Problem
This is a quadratic optimization problem in terms of the Lagrange multipliers αi. The solution of this problem gives the optimal values of αi, which can then be used to compute the optimal weight vector w and the bias term b.
In Support Vector Machines (SVM), kernels are used to enable the algorithm to learn non-
linear decision boundaries, which is essential when the data is not linearly separable in its
original feature space.
Kernels allow SVMs to work in higher-dimensional spaces, where the data might become
linearly separable, without explicitly performing the transformation.
When the data is not linearly separable, SVMs map the data to a higher-dimensional feature
space where it is more likely that a separating hyperplane can be found.
The kernel trick allows SVMs to compute the inner product (dot product) of data points in
this higher-dimensional space without explicitly performing the transformation.
This is done by defining a kernel function that calculates the dot product in the new feature
space, making the computation more efficient.
Mathematically, the kernel function corresponds to a dot product in a higher-dimensional feature space:
K(x, y) = ϕ(x) ⋅ ϕ(y)
In practice, the function ϕ(x) is not computed directly; instead, we use K(x, y) to compute the required quantities, which simplifies the process.
Popular Kernels
Several types of kernel functions are commonly used in SVMs.
These kernels are designed to handle different types of data and transform the feature
space in a way that makes the data separable.
1. Linear Kernel
The linear kernel is the simplest kernel and is used when the data is already linearly
separable or nearly so.
K(x, y) = x ⋅ y
This kernel does not map the data into a higher-dimensional space—it just computes the
standard dot product in the original input space.
2. Polynomial Kernel
The polynomial kernel is used for non-linear data where the decision boundary is a
polynomial function.
K(x, y) = (x ⋅ y + 1)^d
Where:
d is the degree of the polynomial (e.g., d = 2 for quadratic boundaries).
This kernel allows the decision boundary to be more flexible, handling cases where the
relationship between data points requires a higher-order polynomial function.
The RBF kernel is one of the most popular and widely used kernels for non-linear
classification problems.
K(x, y) = exp( −∥x − y∥² / (2σ²) )
Where:
∥x − y∥² is the squared Euclidean distance between the points x and y
σ is a parameter that controls the width of the Gaussian function (also called the
bandwidth).
The RBF kernel maps data into an infinite-dimensional space and is very effective for
many types of non-linear decision boundaries, especially when the relationship between
the classes is complex or highly non-linear.
4. Sigmoid Kernel
The sigmoid kernel is derived from the activation function used in neural networks.
This kernel can model complex decision boundaries and has a behavior similar to a
neural network.
However, it is less commonly used than other kernels, like the RBF kernel.
If the data appears to have a linear relationship, a linear kernel is the simplest and most
efficient choice.
If the data has a complex, non-linear structure, the RBF kernel is often a good choice
because it can create flexible decision boundaries.
The polynomial kernel is useful when the decision boundary resembles a polynomial
function (e.g., quadratic or cubic).
In practice, the RBF kernel is often preferred for most non-linear classification problems due to
its flexibility and ability to handle complex decision boundaries.
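A rough comparison of the kernels above on data that is clearly not linearly separable (two concentric circles, an assumed toy dataset):

```python
# Sketch: comparing SVM kernels on a non-linearly separable dataset.
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

for kernel in ["linear", "poly", "rbf"]:
    clf = SVC(kernel=kernel)                       # 'poly' defaults to degree 3
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{kernel}: accuracy={acc:.2f}")         # RBF typically wins on this data
```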
For example, consider a dataset where one class forms a cluster in the center of the feature space while the other class is in the surrounding region. In such cases, a straight line (or a hyperplane in higher dimensions) cannot separate the two classes effectively.
This is where non-linear classification comes in. To solve this, SVM uses a kernel trick to map the
data into a higher-dimensional space where it becomes easier to find a linear separator, even
though the data might not be linearly separable in its original space.
The most commonly used kernel for this purpose is the RBF kernel:
K(x, y) = exp( −∥x − y∥² / (2σ²) )
where ∥x − y∥² is the squared Euclidean distance between x and y, and σ controls the width of the kernel.
The RBF kernel is known for its ability to map data into an infinite-dimensional space, which allows
it to handle highly complex decision boundaries.
1. Objective function:
The objective is to minimize the following function, which corresponds to the margin
maximization:
min_{w,b} (1/2) ∥w∥² + C ∑ᵢ ξi
where:
ξi are slack variables, which allow for some misclassification in case the data is not
perfectly separable.
C is a regularization parameter that controls the trade-off between maximizing the margin
and minimizing classification errors (i.e., penalizing misclassification).
2. Constraints:
The constraints ensure that each data point is correctly classified (up to the allowed
misclassification):
yi (w ⋅ xi + b) ≥ 1 − ξi, with ξi ≥ 0, for each data point i,
where yi is the class label of the data point xi.
In the case of the RBF kernel, the kernel computes the similarity between data points in the transformed feature space by simply using the formula:
K(x, y) = exp( −∥x − y∥² / (2σ²) )
Thus, we don't need to know the exact transformation ϕ(x), just the kernel function that can
compute the dot product in the higher-dimensional space.
1. Non-linear transformation:
By using the RBF kernel, SVM transforms the data points into a higher-dimensional space where
the classes may become separable.
For example, in a two-dimensional space, the decision boundary might be a curve (such as a circle
or an ellipse), which is not possible with a linear kernel. The RBF kernel enables the SVM to
automatically find such curved boundaries.
The performance of an SVM with the RBF kernel is influenced by two main hyperparameters:
1. C:
The regularization parameter controls the trade-off between maximizing the margin and
minimizing the misclassification errors. A large
C gives higher importance to minimizing errors, potentially leading to overfitting, while a small
C allows for more margin violations, leading to underfitting.
2. σ :
The σ parameter (also known as the bandwidth of the kernel) controls the influence of individual training samples. A small σ makes the kernel more sensitive to the distance between points, meaning only very close points have a significant influence on the decision boundary. A large σ increases the influence of distant points, potentially making the boundary too smooth and unable to capture intricate patterns in the data.
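In practice C and the kernel width are usually tuned together with cross-validation. Note that scikit-learn parameterizes the RBF kernel with gamma rather than σ (roughly gamma = 1/(2σ²)), so a large gamma corresponds to a small σ; the grid values below are arbitrary starting points:

```python
# Sketch: tuning C and the kernel width of an RBF SVM with a grid search.
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```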
Advantages:
Flexibility: The RBF kernel is highly flexible and can handle a wide range of non-linear patterns
in the data. It is particularly effective when the decision boundary is highly non-linear.
Generalization: The SVM with the RBF kernel is often effective in high-dimensional spaces and
has good generalization properties, especially when tuned correctly.
Disadvantages:
Parameter sensitivity: The performance of the SVM with the RBF kernel is highly sensitive to the choice of the hyperparameters C and σ. Incorrect choices can lead to overfitting or underfitting.
Computational cost: The RBF kernel requires calculating pairwise distances between all points,
which can be computationally expensive for large datasets.
SVM with the RBF Kernel
Support Vector Machines (SVM) is a powerful machine learning algorithm, particularly for
classification tasks. While SVM is well-known for its ability to create linear decision boundaries, it
can also be used for non-linear classification by transforming the input data into a higher-
dimensional space where a linear separator can be found. This transformation is achieved using
kernels, and one of the most commonly used kernels for non-linear classification is the Radial
Basis Function (RBF) kernel.
The RBF kernel is defined as:
K(xi, xj) = exp( −∥xi − xj∥² / (2σ²) )
Where:
K(xi, xj) is the kernel function that computes the similarity between two data points xi and xj,
∥xi − xj∥² is the squared Euclidean distance between the two points,
σ is a parameter that controls the width of the kernel (also called bandwidth).
The RBF kernel measures how similar two points are. When the points are closer together, the
kernel value is high (indicating similarity). As the points move further apart, the kernel value
decreases. The parameter σ determines how quickly this similarity decays as the points get farther
apart.
The idea is that by using the RBF kernel, we implicitly map the input data into a higher-dimensional
space (often an infinite-dimensional space) where a linear decision boundary can separate the
data. This is important because:
In the higher-dimensional space, the data might become more linearly separable, even though
it wasn't in the original space.
The kernel trick makes this transformation computationally feasible, as we don't need to
compute the transformed data explicitly.
a. Primal Problem:
min_{w,b} (1/2) ∥w∥²
Where w is the weight vector and b is the bias term. The optimization is subject to the constraints that all data points are correctly classified.
b. Dual Problem:
Introducing Lagrange multipliers αi and the kernel function K, the problem is rewritten in its dual form:
max_α ∑ᵢ αi − (1/2) ∑ᵢ ∑ⱼ αi αj yi yj K(xi, xj)
subject to αi ≥ 0 and ∑ᵢ αi yi = 0
Where:
αi are the Lagrange multipliers and K(xi, xj) is the RBF kernel.
This optimization problem finds the optimal values of αi, which correspond to the support vectors.
c. Decision Function:
Once we solve for the Lagrange multipliers αi , we can compute the final decision function, which
predicts the class label for a new data point x. The decision function is:
f(x) = ∑ᵢ αi yi K(xi, x) + b
Here:
K(xi, x) is the RBF kernel function, which computes the similarity between the support vectors xi and the new data point x.
The sign of f(x) determines the predicted class of the data point x.
Hyperparameters
1. C: This is the regularization parameter that controls the trade-off between maximizing the margin and minimizing the classification error. A larger value of C means the model will focus more on classifying all data points correctly (with less tolerance for misclassifications), while a smaller value will allow some misclassifications to maximize the margin.
2. σ: This controls the width of the RBF kernel. A small σ makes the kernel more sensitive to the distance between points, leading to a more complex model with a tighter decision boundary. A large σ makes the kernel less sensitive, leading to a smoother decision boundary.
No Need for Explicit Transformation: The kernel trick allows us to work in a higher-
dimensional space without explicitly transforming the data, making it computationally efficient.
Effective for Many Data Types: The RBF kernel works well for many types of data and is often
the default choice in practice.
Computational Complexity: While the kernel trick avoids explicit mapping, training an SVM
with the RBF kernel can still be computationally expensive, especially for large datasets.
Support Vector Regression (SVR)
In SVR, the focus is on finding a function that deviates as little as possible from the actual data
while also allowing for some flexibility (called a margin) to handle outliers.
SVR works by finding a function that best fits the data points while keeping the error (difference
between predicted and actual values) within a certain margin.
This is done by mapping the data points to a higher-dimensional space and then constructing a
hyperplane that approximates the underlying relationship between the input variables and the
target variable.
2. The ε (Epsilon) Margin:
The parameter ε defines a tube around the fitted function within which deviations from the actual values are considered acceptable. This helps prevent the model from fitting noise in the data, focusing only on the significant deviations.
3. Support Vectors:
In SVR, the model is defined not by all the data points but by a small subset of points called support vectors. These are the points that lie on the boundaries of the ε-tube or that are outside it. The model focuses on minimizing the error around these support vectors.
Step 2: Defining a Hyperplane: Once the data is in a higher-dimensional space, SVR finds a
hyperplane (a generalized version of a line in higher dimensions) that best fits the data. In the
case of regression, instead of finding a hyperplane that separates the data into classes (like in
SVM classification), the goal is to find a hyperplane that predicts the continuous target variable.
Step 3: Defining the Margin: The ε-tube (epsilon tube) is the region around the hyperplane
where no penalty is applied to errors. If the predicted value lies within this tube, no penalty is
applied. If the predicted value lies outside the tube, the error is penalized, and the model is
adjusted to minimize this error.
Step 4: Support Vectors: The support vectors are the data points that lie closest to the
boundary of the ε-tube. These support vectors are the critical elements of the model, as they
help determine the position of the hyperplane.
Y = w ⋅ x + b
where:
Y is the predicted target value, w is the weight vector, x is the input feature vector, and b is the bias term.
Advantages of SVR
Robust to Outliers: By allowing deviations within the margin ϵ, SVR can handle noise and
outliers more effectively than other regression techniques.
Flexibility: The use of kernels allows SVR to model nonlinear relationships between features
and target variables.
High-dimensional Data: SVR can work well in high-dimensional spaces, especially with the
kernel trick.
Disadvantages of SVR
Computational Complexity: SVR can be computationally expensive, particularly with large
datasets and complex kernel functions.
Choice of Parameters: The performance of SVR is highly sensitive to the choice of C , ϵ, and
kernel parameters, requiring careful tuning.
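A minimal SVR sketch on a noisy sine curve (the data, C, and ε values are illustrative only):

```python
# Sketch: Support Vector Regression with an RBF kernel on synthetic data.
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)      # noisy continuous target

# epsilon sets the width of the tube; C trades off flatness vs. errors outside the tube.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1)
svr.fit(X, y)
print(svr.predict([[2.5]]))                      # predicted continuous value at x = 2.5
```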
Multiclass Classification
Unlike binary classification, where there are only two possible outcomes (e.g., yes/no,
true/false), multiclass classification involves more than two possible classes.
For example, in a fruit classification problem, the goal might be to categorize an image of a fruit
as either an apple, banana, or orange, which are three distinct classes.
In multiclass classification, each input is associated with only one label, but there are more than
two options.
Classifying types of flowers, such as setosa, versicolor, and virginica in the Iris dataset.
Categorizing text into different topics, like sports, politics, and entertainment.
There are several techniques for handling multiclass classification, such as One-vs-Rest (OvR) and One-vs-One (OvO); these methods differ in how they manage multiple classes within the model.
One-vs-Rest (OvR)
The One-vs-Rest (OvR) technique, also known as One-vs-All (OvA), is one of the most
popular and straightforward approaches to extend binary classification algorithms to
multiclass classification problems.
The general idea is to convert the multiclass problem into multiple binary classification
tasks, where each binary classifier attempts to distinguish one class from all the others.
OvR is particularly useful for machine learning models that inherently support binary
classification (such as Support Vector Machines, Logistic Regression, and Decision
Trees).
This technique works by creating one classifier per class, with each classifier being
responsible for predicting whether an instance belongs to the target class or not.
For each class, a separate classifier is trained to distinguish that class from all the other
classes combined.
In other words, for each class, the model learns to predict whether an input belongs to
that class or not.
1. Training Phase:
For a multiclass problem with N classes (say Class A, B, C, and D), OvR trains N
classifiers.
For each classifier, the data for the given class is treated as the positive class, and all
other classes are treated as the negative class.
Each classifier only learns to separate one class from the rest of the classes.
2. Prediction Phase:
Once the classifiers are trained, when a new input is provided, each classifier makes
a prediction:
Each classifier will predict whether the input belongs to the class it is trained for
(positive) or not (negative).
The final prediction for the input is the class whose classifier provides the highest
confidence score or probability.
For example, suppose we have a new input, and the classifier for Class A outputs the highest probability, 0.7, while the classifiers for the other classes output lower probabilities.
The final prediction would be Class A, because the classifier for Class A has the highest probability (0.7).
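A brief OvR sketch with a binary base learner; the Iris dataset (3 classes, so 3 binary classifiers) is used only as an example:

```python
# Sketch: One-vs-Rest with logistic regression as the base binary classifier.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)                # 3 classes -> 3 binary classifiers

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)
print(len(ovr.estimators_), "binary classifiers trained")   # one per class
print(ovr.predict(X[:5]))                        # class with the highest score wins
```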
Advantages:
Simplicity: OvR is easy to implement and works with any binary classifier, like logistic
regression or support vector machines.
Scalability: It handles problems with many classes efficiently, as each classifier deals
with just one class vs. the others.
Disadvantages:
Imbalanced Data: OvR can struggle with imbalanced datasets, where some classes have
significantly fewer examples.
Computational Cost: With many classes, the number of classifiers increases, which can
be computationally expensive.
One-vs-One (OvO)
The One-vs-One (OvO) approach is a strategy used to extend binary classification
algorithms to multiclass classification problems.
It is one of the most commonly used techniques for tackling multiclass classification,
especially when dealing with models that are inherently binary classifiers, such as
Support Vector Machines (SVM), logistic regression, or decision trees.
In OvO, instead of creating one classifier that tries to separate all classes, we create a
binary classifier for each pair of classes in the dataset.
Each classifier distinguishes between just two classes, and the final prediction is made
based on the results from all these pairwise classifiers.
The class that wins the most pairwise comparisons is selected as the final predicted
class.
This means that for a problem with C classes, the number of classifiers needed will be
the number of ways you can select two classes from C, which is mathematically given by
the combination formula:
Number of classifiers = C(C − 1) / 2
For example, if there are 3 classes: A, B, and C, OvO will train the following binary
classifiers:
1. Classifier 1: A vs B
2. Classifier 2: A vs C
3. Classifier 3: B vs C
During training, the classifier will be given data that belongs to only two classes at a time,
and it will learn how to separate those classes.
1. Training Phase:
For each pair of classes, a binary classifier is trained. Each classifier gets a dataset with only the two classes involved in that classifier.
2. Prediction Phase:
For a new, unseen input, each of the binary classifiers makes a prediction.
Each classifier votes on which class the data point belongs to.
The class with the most votes across all classifiers is selected as the final prediction
for that input.
2. Step 2: Training the Classifiers
Each pairwise classifier is trained only on the data belonging to its two classes.
3. Step 3: Prediction (Voting)
Suppose you have a new test instance. Each classifier will vote for one of the two classes it was trained on: Classifier 1 (A vs B) votes for either A or B, while Classifier 2 (A vs C) and Classifier 3 (B vs C) both vote for C.
After gathering the votes from all classifiers, the class with the most votes is the final prediction.
In this case, class C receives 2 votes (from Classifier 2 and Classifier 3), more than any other class, so the final prediction would be C.
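A brief OvO sketch; with the 3-class Iris dataset, 3·2/2 = 3 pairwise classifiers are trained:

```python
# Sketch: One-vs-One with an SVM base classifier.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                # 3 classes -> 3 pairwise classifiers

ovo = OneVsOneClassifier(SVC(kernel="linear"))
ovo.fit(X, y)
print(len(ovo.estimators_), "pairwise classifiers trained")
print(ovo.predict(X[:5]))                        # class winning the most pairwise votes
```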
Advantages of One-vs-One
1. Simpler Binary Classification Problems: Each binary classifier only needs to separate
two classes, which can be simpler to model, especially for complex problems.
2. Better Performance with Some Algorithms: In some cases, binary classifiers like
Support Vector Machines (SVMs) perform better when they only need to distinguish
between two classes at a time, rather than having to consider all classes together in a
multiclass problem.
3. Handling Class Imbalance: OvO can be beneficial in cases where the classes are
imbalanced (i.e., some classes have many more examples than others). Since each
classifier only deals with two classes at a time, it may be easier to address imbalance
within each pair.
Disadvantages of One-vs-One
1. Computational Complexity: The main drawback of OvO is that it requires training a large number of classifiers. If there are C classes, C(C − 1)/2 classifiers are needed. As the number of classes grows, the computational cost grows quadratically. For example, with 10 classes, you would need to train 45 classifiers, and with 100 classes, 4,950 classifiers.
2. Memory and Storage Requirements: Since each pair of classes has its own classifier,
the storage and memory required to keep track of all these models can be quite large,
especially when dealing with a dataset that has many classes.
3. Prediction Time: After training the classifiers, predicting the class of a new sample
involves running it through all the classifiers, which can be time-consuming, especially
when the number of classes is large.
Applications of One-vs-One
The One-vs-One technique is commonly used in a variety of machine learning tasks, such
as:
2. Text Classification: When classifying documents into multiple categories (e.g., news
topics like sports, politics, or entertainment), OvO allows each classifier to specialize in
distinguishing between two categories, which can improve overall performance.
3. Medical Diagnosis: In healthcare, OvO can be used for classifying medical conditions or
diseases based on symptoms, where each classifier distinguishes between pairs of
diseases.
2. Scalability
As the number of classes increases, training multiple classifiers (as in OvR or OvO) can become
computationally expensive. Additionally, the model might struggle to learn effectively from a
large number of classes, especially if the dataset is not sufficiently large or well-represented for
each class.
3. Interpretability: With many classes, visualizing decision boundaries and understanding the decision process for each class may require more advanced techniques.
Image recognition: Identifying objects in an image, where each object corresponds to a class.
Text classification: Categorizing documents or messages into topics such as sports, politics, or
entertainment.
Example confusion matrix (rows = actual class, columns = predicted class A, B, C, D):
Actual\Predicted    A    B    C    D
A                 100   80   10   10
B                   0    9    0    1
C                   0    1    8    1
D                   0    1    0    9
Imbalanced Multiclass Classification
When the classes in the dataset are not evenly distributed, meaning that some classes have
significantly more examples than others, the problem is referred to as an imbalanced multi-
class classification problem.
This type of problem can create significant challenges for machine learning algorithms because
the model may become biased toward the majority class, leading to poor performance on the
underrepresented classes.
In a typical classification task, you would expect each class to have a similar number of
samples, but in many real-world scenarios, this is not the case.
For example, in a medical dataset for diagnosing diseases, you might have far more healthy
patients than patients with a rare disease.
In such cases, the minority classes (such as those representing the rare disease) are
underrepresented, and the majority classes (such as the healthy patients) dominate the dataset.
This imbalance can significantly affect the ability of the machine learning model to correctly
classify the minority classes, because it might get "distracted" by the large number of samples
from the majority class.
Why Is It a Problem?
Imbalanced datasets present several challenges for machine learning models:
1. Bias Toward the Majority Class: Many machine learning algorithms are designed to maximize
overall accuracy, meaning that if the majority class is overwhelmingly represented, the
algorithm may simply predict the majority class for all data points, leading to high accuracy but
poor performance on the minority classes.
2. Poor Model Generalization: The model might not learn to recognize the distinguishing features
of the minority classes effectively. Since there are fewer examples of the minority classes, the
model does not get enough exposure to these classes during training, which leads to poor
generalization to new, unseen examples from the minority classes.
3. Ineffective Performance Metrics: Accuracy, the most commonly used performance metric, is
not always a reliable indicator in imbalanced problems. A model that predicts the majority class
for every data point can achieve a high accuracy, but it would fail to identify minority class
examples correctly. This makes metrics like precision, recall, F1-score, and area under the ROC
curve more important in evaluating model performance.
One or more classes have significantly more samples than others, leading to a situation where
the machine learning model may develop biases toward the majority class.
Understanding the causes of class imbalance is important, as it helps in identifying why the
problem occurs and how to address it.
1. Natural Occurrence in Data
One of the most common causes of class imbalance arises from the natural distribution of
events in the real world. Many phenomena naturally occur more frequently in one class than
another. For instance:
In medical diagnostics, most people may be healthy, and only a small percentage may
suffer from a particular disease. As a result, datasets will naturally have more "healthy"
samples compared to "diseased" samples.
In fraud detection, fraudulent transactions occur much less frequently than legitimate
transactions, which leads to an imbalanced dataset with far more non-fraudulent
transactions.
In object detection, some objects (such as cars or people) are much more common in
everyday images, while others (like rare animals or specific products) are much less
frequent.
When these natural disparities exist, the data generated for machine learning will reflect these
uneven distributions, resulting in imbalanced classes.
2. Data Collection Issues
Sampling Bias: Sometimes, the method used to collect data favors certain groups or
outcomes over others. For example, if a survey is conducted to detect a specific disease but
is only done in an area where the disease is rare, the dataset will have a lower proportion of
disease cases.
Data Availability: In many cases, it is simply easier or cheaper to collect data from the
majority class. For instance, if you are working with an e-commerce platform, it might be
easy to collect data on regular customer purchases but much harder to obtain data on
fraudulent transactions due to their rarity.
Event Monitoring: Some rare events, like equipment failures in industrial settings or natural
disasters, might not happen frequently enough to be adequately represented in the dataset.
Hence, datasets based on these types of events will tend to have fewer instances from the
rare class.
Similarly, data may be collected over limited time periods, where the occurrence of rare events is low relative to more common events.
Techniques for Handling Class Imbalance
1. Resampling:
Resampling involves modifying the dataset by either oversampling the minority class or undersampling the majority class to make the class distribution more balanced.
Oversampling increases the number of minority class instances by duplicating existing samples or generating new ones by random sampling from the minority class.
The goal is to expose the model to the minority class more often during training, thus improving its ability to predict these rare classes.
Undersampling, in contrast, removes samples from the majority class to balance the class distribution.
This technique helps prevent the model from becoming biased towards the majority class, but it can also lead to a loss of valuable data and underfitting if too much data is removed.
Both methods have their pros and cons. Oversampling helps the model learn more from the
minority class but can lead to overfitting since duplicated samples might cause the model to
memorize rather than generalize.
On the other hand, undersampling can lead to a loss of important information from the majority
class, potentially reducing the model’s performance on the majority class.
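A sketch of random oversampling and undersampling using the imbalanced-learn (imblearn) library, which is assumed to be installed; the 90/10 class split is artificial:

```python
# Sketch: random oversampling vs. undersampling on an artificial imbalanced dataset.
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("Original:", Counter(y))                         # roughly 900 vs 100

X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
print("Oversampled:", Counter(y_over))                 # minority duplicated up to parity

X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("Undersampled:", Counter(y_under))               # majority reduced down to parity
```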
2. Tomek Links
Tomek Links are pairs of instances that are very close to each other but belong to opposite
classes.
When two instances form a Tomek Link, they are the nearest neighbors to each other, but they
belong to different classes.
By removing one of the instances from the majority class in each Tomek Link pair, the decision
boundary between the classes becomes clearer, and the classifier can distinguish between the
classes more effectively.
The main idea is that the instances in a Tomek Link are too close to each other and thus
represent noise or ambiguity in the classification process.
By removing the majority class instance of each pair, you create more space between the
classes.
This results in a more distinct separation, making the task of classification easier and improving
the model's generalization capabilities.
Tomek Links are a form of under-sampling because they remove datapoints from the majority
class.
However, this technique only removes those majority class instances that are very close to
instances of the minority class, focusing on cleaning up the boundary areas rather than
randomly reducing the majority class size.
This helps prevent overfitting and ensures that the classifier is not overwhelmed by the majority
class.
As a result, Tomek Links can be an effective tool for dealing with class imbalance in multiclass
classification problems, improving both model accuracy and decision boundary clarity.
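A short sketch of Tomek Link removal, again assuming the imbalanced-learn library is available:

```python
# Sketch: cleaning the class boundary by removing Tomek Links.
from collections import Counter
from imblearn.under_sampling import TomekLinks
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

tl = TomekLinks()                        # by default removes only the majority-class member of each link
X_res, y_res = tl.fit_resample(X, y)
print("Before:", Counter(y), "After:", Counter(y_res))
```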
SMOTE (Synthetic Minority Over-sampling Technique) is a more advanced technique for
handling class imbalance by generating synthetic examples rather than duplicating existing
ones.
It focuses on creating new, synthetic samples for the minority class to make the class
distribution more balanced.
For each minority class instance, SMOTE finds its k nearest neighbors within the minority class.
It then randomly selects one or more of these neighbors and creates synthetic instances along the line segment between the original instance and the selected neighbor.
These synthetic instances are placed in the feature space, essentially creating new, realistic
examples based on the existing data.
Example:
In a two-dimensional feature space, if an instance from the minority class is located at (3, 4),
and its nearest neighbor is at (4, 5), SMOTE would generate new points between these two
positions, such as (3.5, 4.5), which is a synthetic point that combines information from both the
original instance and its neighbor.
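A short SMOTE sketch, assuming the imbalanced-learn library; k_neighbors=5 is its usual default and the dataset is artificial:

```python
# Sketch: generating synthetic minority samples with SMOTE.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

smote = SMOTE(k_neighbors=5, random_state=0)   # interpolate between minority points and their neighbors
X_res, y_res = smote.fit_resample(X, y)
print("Before:", Counter(y), "After:", Counter(y_res))   # classes balanced with synthetic points
```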
4. Class Weights
Class weights are a method to adjust the learning process in order to give more importance to
the minority class during training, without needing to modify the dataset itself.
In this approach, the algorithm is modified to penalize mistakes on the minority class more
than mistakes on the majority class.
The idea is to make the classifier more sensitive to the minority class by assigning higher
penalties (or weights) to misclassifications of minority class instances.
These weights help adjust the algorithm’s behavior when training the model.
The general approach is to assign each class a weight that is inversely proportional to its frequency in the training data, so that rarer classes receive larger weights.
For example, in a binary classification problem where class A is the majority class and class B is
the minority class, the classifier might be penalized more for misclassifying class B instances.
This encourages the model to focus on correctly identifying the minority class to improve its
performance.
Flexible: Class weights can be adjusted for any classifier that supports this feature, making it a
very flexible and widely applicable technique.
Overfitting Risk: If the weights are too high, the model may become too focused on the minority
class, leading to overfitting on the minority class and poor generalization.
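A sketch of class weighting in scikit-learn; "balanced" weights classes inversely to their frequency, and the explicit {0: 1, 1: 10} weighting is an arbitrary example:

```python
# Sketch: penalizing minority-class mistakes more heavily via class weights.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# 'balanced' sets each class weight inversely proportional to its frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y)

# Weights can also be set explicitly, e.g. make errors on class 1 ten times as costly.
svm = SVC(class_weight={0: 1, 1: 10})
svm.fit(X, y)
```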
Ensemble Learning
Ensemble learning is a powerful technique in machine learning that combines multiple models
to make better predictions than any single model could on its own.
The core idea is that by combining the strengths of different models, an ensemble can often
achieve higher accuracy and more robust predictions, especially when individual models may
have their own weaknesses.
In ensemble learning, several models, often referred to as "learners" or "base models," are
trained to solve the same problem.
Once trained, these models' predictions are combined in some way to produce a final output.
The reasoning behind ensemble learning is that different models may make different errors, and
by combining their predictions, these errors can cancel each other out, leading to improved
overall performance.
There are two main types of ensemble learning: Bagging and Boosting, each with its own
approach to combining models and reducing errors.
1. Bagging (Bootstrap Aggregating):
Bagging involves training multiple models (typically of the same type) on different
subsets of the data. These subsets are created by randomly sampling the original
dataset with replacement (bootstrap sampling). Each model is trained independently.
The most common example of bagging is the Random Forest algorithm, where many
decision trees are trained on different subsets of the data, and their predictions are
combined through voting or averaging.
2. Boosting:
Boosting is a sequential ensemble method where each new model is trained to correct
the mistakes made by previous models. The idea is to focus more on the data points that
were misclassified or poorly predicted by previous models.
In boosting, models are added one after another, and each model gives more weight to
the instances that previous models got wrong. This way, the ensemble focuses on
improving weak points.
AdaBoost (Adaptive Boosting): Each new model gives more weight to the
misclassified data from previous models.
Gradient Boosting: Models are built in a way that corrects the errors from the
previous model using gradient descent.
Goal: To reduce both bias and variance, improving accuracy on complex problems.
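A quick sketch of both boosting flavours using scikit-learn's implementations (the dataset and n_estimators are arbitrary):

```python
# Sketch: AdaBoost vs. Gradient Boosting on a toy dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

ada = AdaBoostClassifier(n_estimators=100)           # re-weights misclassified samples each round
gbm = GradientBoostingClassifier(n_estimators=100)   # fits each new tree to the remaining errors

print("AdaBoost:         ", cross_val_score(ada, X, y, cv=5).mean())
print("Gradient Boosting:", cross_val_score(gbm, X, y, cv=5).mean())
```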
2. Robustness: Ensembles tend to be more robust than individual models, especially in the
presence of noisy data or outliers. The diversity of models in an ensemble makes it less likely
that the entire system will be affected by issues that might impact one model.
3. Reduction of Overfitting: While individual models, particularly complex ones, can overfit the
training data, ensemble methods like bagging (e.g., random forests) help to reduce overfitting
by averaging predictions across multiple models, which can lead to better generalization.
3. Risk of Diminishing Returns: As more models are added to an ensemble, there may come a
point where the benefit of adding additional models becomes minimal. This means that after a
certain point, the improvement in performance might not justify the increased computational
effort.
Bagging
Bagging stands for Bootstrap Aggregating, a powerful ensemble learning technique that aims
to improve the performance and stability of machine learning models.
The concept of bagging is based on the idea of combining multiple models (often of the same
type) trained on different subsets of the data to achieve a more accurate and robust final
prediction.
The goal is to reduce the variance of the model, making it less prone to overfitting and more
generalized to new, unseen data.
Bagging works by creating multiple models that are trained on different parts of the training
data.
These parts are generated using a method called bootstrapping, which involves randomly
sampling the training data with replacement.
This means that some data points may appear multiple times in the same subset, while others
may not appear at all.
After training, each of these models makes predictions, and the final prediction is made by
combining the outputs of all the models, typically through voting (for classification tasks) or
averaging (for regression tasks).
The core principle behind bagging is to reduce variance. Variance in a model refers to the
sensitivity of the model to fluctuations in the training data.
High variance models tend to overfit, meaning they perform well on the training data but fail to
generalize to new data.
By averaging the predictions from several models trained on different data subsets, bagging
helps to smooth out errors and reduce the model’s overall variance.
2. Model Training: Each of these subsets is then used to train a separate model. These models are
typically of the same type, such as decision trees or support vector machines. However, each
model is exposed to different data, which introduces diversity in the predictions made by each
model.
3. Prediction: Once all the models are trained, they are used to make predictions on new data. For
classification problems, the predictions from all the models are combined using a majority vote,
meaning the class that most models predict is chosen as the final prediction. For regression
problems, the predictions are averaged to get the final result.
4. Final Aggregation: The final step is to aggregate the results from all the individual models. This
aggregation typically reduces the impact of any one model’s errors. In classification, it’s based
on voting (majority rule), while in regression, it’s the average of the predictions.
Since each tree is trained on a different bootstrap sample, the trees are likely to make different
mistakes.
When the results of many trees are combined, the overall error tends to decrease, as errors
made by some trees are canceled out by others.
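A sketch contrasting a single tree, hand-rolled bagging of trees, and a Random Forest (which adds random feature selection on top of bagging); the dataset is an arbitrary toy example:

```python
# Sketch: bagging decision trees with BaggingClassifier vs. a single tree and a Random Forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

single_tree = DecisionTreeClassifier()
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, bootstrap=True)
forest = RandomForestClassifier(n_estimators=100)

for name, model in [("single tree", single_tree), ("bagged trees", bagged), ("random forest", forest)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```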
Benefits of Bagging
1. Reduction of Overfitting: Bagging helps reduce overfitting, particularly in complex models like
decision trees, which can be highly sensitive to small variations in the training data. By
averaging multiple models, bagging smooths out the predictions and leads to a model that
generalizes better to unseen data.
2. Improved Accuracy: Since bagging aggregates the predictions of multiple models, it can lead to
a more accurate prediction compared to a single model. Even if individual models perform
poorly, the ensemble can still achieve good overall performance.
3. Stability: Bagging stabilizes the model by making it less sensitive to fluctuations in the training
data. If one model is overfitted or underfitted, the effect of this is lessened by the other models
in the ensemble.
4. Parallelization: Bagging is a technique that can be easily parallelized because each model is
trained independently on a different subset of the data. This makes it efficient to implement,
especially on large datasets.
Drawbacks of Bagging
1. Increased Computational Cost: One of the main drawbacks of bagging is that it requires
training multiple models, which increases the computational cost. For large datasets or complex
models, this can be a significant concern in terms of both time and resources.
2. Diminishing Returns: After a certain point, adding more models to the ensemble may lead to
diminishing returns in terms of performance improvement. The additional models may not
contribute significantly to the accuracy, and their inclusion may only increase computational
complexity.
3. Model Diversity: Bagging typically uses the same type of model for each subset of the data.
While this is useful for reducing variance, it may not be as effective in situations where different
model types or approaches could offer additional diversity and improve the ensemble's
performance.
Subagging
Subagging (Subset Aggregating) is a variant of the bagging technique in ensemble learning
that aims to improve the accuracy and robustness of machine learning models.
While it shares some similarities with bagging, subagging has a distinct approach in how the
data is sampled for training individual models, offering potential advantages in certain
scenarios.
The main idea behind subagging is to use smaller, randomly selected subsets of the training
data instead of bootstrapping, which is used in traditional bagging.
Subagging works by training multiple models, similar to bagging, but with one key difference:
instead of sampling with replacement (as in bagging), subagging uses random subsets of the
data without replacement.
This means that each model in the ensemble is trained on a different, smaller subset of the data,
and no data point is repeated in any subset.
By using subsets that are smaller than the full training set, subagging can provide a more
diverse set of models, while still maintaining the overall goal of improving model performance
through aggregation.
In subagging, just like in bagging, the predictions of all individual models are combined to make
the final prediction.
For classification tasks, this typically involves majority voting, while for regression tasks, the
predictions are averaged.
1. Subset Selection: Several random subsets of the training data are drawn without replacement, each smaller than the full training set, so no data point appears more than once within a subset.
2. Model Training: Each of these randomly selected subsets is used to train a separate model.
Like bagging, these models are typically of the same type (e.g., decision trees, neural networks)
but are trained on different portions of the data. Since each model is trained on a different
subset, the ensemble members are likely to make different errors, which enhances the diversity
of the ensemble.
3. Prediction: Once all models are trained, they are used to make predictions on new, unseen data.
For classification tasks, each model casts a vote, and the class with the majority of votes is
selected as the final prediction. For regression tasks, the final prediction is the average of all the
model predictions.
4. Final Aggregation: As with bagging, the final result is derived from aggregating the predictions
of all individual models. This aggregation helps reduce the overall error, particularly by
averaging out individual model biases and errors.
Key differences between bagging and subagging:
Subset Size: In bagging, the training sets are typically the same size as the original dataset. In
subagging, the training subsets are smaller, typically a fraction of the original dataset,
often around 50-80% of the total data size.
Goal: Bagging focuses on reducing variance by using multiple models trained on different data
samples, whereas subagging aims to achieve a balance between reducing variance and
introducing more diversity by using smaller, non-repetitive subsets of data.
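Subagging has no dedicated class in common libraries, but one reasonable sketch (an assumption on my part, not something stated in these notes) is scikit-learn's BaggingClassifier configured to sample without replacement and to use subsets smaller than the full training set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Subagging-style ensemble: each model sees ~60% of the training data,
# drawn WITHOUT replacement (bootstrap=False), so no point repeats within a subset.
subag = BaggingClassifier(n_estimators=100, max_samples=0.6, bootstrap=False, random_state=0)
subag.fit(X_train, y_train)

print("subagging-style accuracy:", subag.score(X_test, y_test))
```

The 0.6 fraction is just one choice within the 50-80% range mentioned above.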
Advantages of Subagging
1. Reduced Variance: Similar to bagging, subagging helps reduce the variance of high-variance
models (e.g., decision trees), making the ensemble more stable and less prone to overfitting.
The different subsets of data prevent overfitting to any particular portion of the data.
2. Diversity of Models: Since subagging uses smaller random subsets of data without
replacement, the models in the ensemble are likely to make different types of errors. This
diversity can lead to better generalization and improved performance on new data.
3. Efficiency: Subagging can be more computationally efficient than bagging, especially when
working with large datasets. By using smaller subsets of data for training the models, the
training process requires less computational power, while still benefiting from the aggregation
of multiple models.
Disadvantages of Subagging
1. Potential Underfitting: Since each model in subagging is trained on a smaller subset of the
data, there is a risk of underfitting, especially if the base models are too simple or if the subsets
are too small. The model might not capture enough information to make accurate predictions on
its own.
2. Less Data Utilization: Since data points are not repeated in the subsets, there is less data
available for each individual model. This could potentially lead to weaker models if the base
model requires a lot of data to perform well.
3. Computational Overhead: Although subagging can be more efficient than bagging, the
computational cost can still be high because multiple models are being trained. In cases where
the models are particularly large or complex, the cost of training many models could still be
substantial.
In Random Forest (bagging), multiple decision trees are trained on different bootstrapped
samples of the data. Each tree is trained on a subset of the data where some data points are
repeated, which can lead to models that overfit specific aspects of the data.
In subagging, smaller random subsets of data (without replacement) are used for training. This
reduces the likelihood of overfitting and can lead to greater diversity in the models, as each tree
has seen different data. The ensemble of these diverse models can perform better in terms of
generalization.
Boosting
Boosting is a technique that aims to reduce both bias and variance of the model by combining
many weak learners (models that perform slightly better than random guessing) to create a
strong learner (a model with high performance).
The central idea of boosting is to train models sequentially, where each new model corrects the
mistakes made by the previous ones.
Unlike bagging, where models are trained independently and aggregated, boosting improves the
model iteratively, with each new model trying to correct the mistakes made by the previous
ones.
This allows boosting to enhance the overall accuracy of the ensemble and produce highly
accurate models even when the individual base models are weak.
The term "weak learner" refers to a model that performs slightly better than random guessing.
In boosting, these weak models are usually simple algorithms like decision trees with a single
split, also known as decision stumps.
Boosting takes these simple models and builds a strong learner by adjusting the weights of the
data points to focus on the most difficult-to-predict instances.
1. Initial Model Training: The first model is trained on the original dataset, with every data point treated equally.
2. Error Calculation: After the first model makes predictions, the algorithm identifies which data
points were misclassified or wrongly predicted. These misclassified data points are the focus of
the next model, as the goal is to correct these errors in the subsequent model.
3. Weighting Misclassified Points: In boosting, the training instances that were misclassified are
given more weight, meaning they will have more influence on the training of the next model.
Conversely, correctly classified data points have less influence on the next model. By focusing
on the difficult cases, boosting ensures that the next model is better at handling the errors of
the previous one.
4. Training New Model: The next model is trained on the adjusted dataset, which now includes the
re-weighted data points that were misclassified by the previous model.
5. Iterative Process: This process is repeated for several iterations, with each new model being
trained to fix the errors of the previous models. As more models are added to the ensemble, the
system becomes increasingly accurate because each new model corrects the weaknesses of
the combined ensemble.
6. Final Prediction: Once all models are trained, their predictions are combined. In classification
tasks, this is typically done through a weighted voting scheme, where models that performed
better have more influence in the final prediction. In regression tasks, predictions are usually
averaged.
1. AdaBoost (Adaptive Boosting):
AdaBoost was one of the earliest and most widely used boosting algorithms.
It works by assigning equal weights to all training data points at the start. After each model
is trained, the weights of misclassified points are increased, making those points more
influential in the next round of learning.
AdaBoost then combines the predictions of all models, with each model’s influence
determined by its accuracy (i.e., better models have a higher weight in the final prediction).
Strength: AdaBoost is relatively simple and can significantly improve the performance of
weak models.
2. Gradient Boosting:
Gradient Boosting takes a different approach by fitting a new model to the residuals
(errors) of the combined ensemble of previous models, rather than directly focusing on
misclassified points.
Gradient boosting algorithms like XGBoost, LightGBM, and CatBoost are highly optimized
variants of gradient boosting that are widely used in machine learning competitions because
of their efficiency and performance.
Strength: Gradient boosting is flexible, powerful, and can handle various types of predictive
tasks, including classification and regression.
3. Other Variants:
Reduction of Bias and Variance: Boosting can reduce both bias and variance. By combining
several weak models, boosting reduces bias by improving the model’s predictions. It also
reduces variance by averaging out errors, ensuring that the final model is less likely to overfit
the data.
Flexibility: Boosting works well with both weak classifiers and complex models. It can be used
for a wide range of machine learning tasks, from regression to classification, and is effective for
both simple and highly complex datasets.
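A minimal usage sketch of the two boosting variants described above (assuming scikit-learn; the dataset and hyperparameters are illustrative, not taken from these notes):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# AdaBoost: re-weights misclassified points; its default base learner is a decision stump.
ada = AdaBoostClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)

# Gradient boosting: each new tree is fitted to the errors of the current ensemble.
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=1)
gbm.fit(X_train, y_train)

print("AdaBoost accuracy         :", ada.score(X_test, y_test))
print("Gradient boosting accuracy:", gbm.score(X_test, y_test))
```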
Advantages of Boosting
1. High Accuracy: Boosting typically results in highly accurate models. By focusing on the
hardest-to-predict examples, boosting can improve the performance of weak learners and
achieve better accuracy than many other models.
2. Works Well with Weak Learners: Boosting is particularly useful when the base learners (weak
models) are simple, like decision stumps (small decision trees). Despite their simplicity, boosting
can create an ensemble that performs at a high level.
3. Reduces Overfitting: Although boosting can overfit if not carefully tuned, it generally reduces
overfitting by focusing on the most difficult instances and smoothing out model errors.
4. Versatility: Boosting algorithms, especially Gradient Boosting, can be adapted to both
regression and classification tasks and can handle both linear and non-linear relationships in
the data.
Disadvantages of Boosting
1. Overfitting Risk: If too many models are added or if the base learner is too complex, boosting
can start to overfit the training data. Therefore, it's crucial to tune hyperparameters like the
number of models and the learning rate to avoid overfitting.
3. Sensitive to Noisy Data: Boosting can be sensitive to noisy data and outliers. Since it focuses
on correcting errors from previous models, noisy data points or outliers may be overemphasized
in the model-building process, leading to overfitting.
4. Interpretability: Boosting models, especially those like Gradient Boosting or XGBoost, can be
hard to interpret compared to simpler models like decision trees. The final model is an
aggregate of many weak learners, making it challenging to understand the reasoning behind
specific predictions.
Stumping (Decision Stump)
A decision stump is a very simple machine learning model that is often used as a weak learner
in ensemble methods like boosting.
A decision stump is essentially a decision tree with a very limited structure—often consisting of
just one level (a root and two leaves).
Despite its simplicity, decision stumps can be highly effective when used in boosting
algorithms, where multiple stumps are combined to form a strong learner.
A decision stump is a decision tree with only one decision node, meaning it splits the data
based on a single feature, and then assigns a prediction to each of the resulting branches
(leaves).
This simple structure allows the decision stump to make basic, but potentially useful,
predictions.
However, because of its simplicity, a single decision stump is usually a weak learner, meaning it
has a high error rate when used independently.
Since it’s a single decision, it is a very weak learner, meaning it’s not very accurate on its own,
but it can still be useful when combined with other models in an ensemble.
This is done through a process of iterative correction, where each stump in the sequence
focuses on the mistakes made by the previous stumps.
For a classification problem, this decision is typically a binary split: if the value of the feature is
above or below a certain threshold, the stump assigns one of two possible labels.
For a regression problem, the decision stump predicts a constant value based on the feature.
For Classification: The stump examines one feature, compares it to a threshold, and
classifies the data into one of two classes. For example, if the feature value is less than 5,
assign class "A", otherwise, assign class "B".
For Regression: The stump predicts a constant value, typically the mean value of the target
variable, based on the value of the feature. For example, if the feature is less than 5, predict
the mean target value for all data points where the feature is less than 5.
2. Simplicity and Speed: One of the advantages of decision stumps is their simplicity. Since they
are essentially one-level decision trees, they are computationally inexpensive and can be
trained quickly. This is important in boosting, as the algorithm requires many iterations of model
training. Using decision stumps allows boosting algorithms to train quickly, even with large
datasets.
3. Focusing on Mistakes: In boosting algorithms like AdaBoost or Gradient Boosting, each new
model is trained to correct the errors made by the previous model. Since decision stumps are
weak learners, they are particularly useful in this process because they can focus on small,
specific parts of the data (the mistakes made by previous models). This iterative correction
process allows boosting to significantly improve the overall performance of the ensemble
model.
1. Initial Training: The first decision stump is trained on the entire dataset, where each data point
is treated equally in the training process. This stump will likely make many errors, but it starts
the boosting process.
2. Error Identification: After the first stump is trained, the boosting algorithm calculates the errors
it made, typically focusing on the misclassified data points. In AdaBoost, for instance, the
misclassified data points are given higher weights, so that subsequent stumps will focus on
these harder-to-predict instances.
3. Sequential Stump Training: A new decision stump is trained on the updated dataset, which now
includes the re-weighted data points. This second stump is more likely to correct the errors
made by the first stump.
4. Combining Stumps: Once all stumps are trained, their predictions are combined in a way that
emphasizes the most accurate models. In AdaBoost, this involves assigning more weight to the
predictions of stumps that performed well and less weight to those that performed poorly. The
final prediction is made by combining the outputs of all stumps, often through weighted voting
(classification) or averaging (regression).
Example: Consider a decision stump that predicts whether a customer will buy a product, using:
Feature: Age
Threshold: 30
If the customer's age is greater than 30, the model might predict "Yes" (the customer will buy
the product), and if the customer's age is less than or equal to 30, it might predict "No" (the
customer will not buy the product). This is a simple, one-level decision tree that only splits the
data on one feature—age.
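A minimal sketch of that stump in plain Python (the sample ages are made up for illustration):

```python
def stump_predict(age, threshold=30):
    """One-level decision: a single split on a single feature (age)."""
    return "Yes" if age > threshold else "No"

ages = [22, 28, 31, 45, 30]
print([stump_predict(a) for a in ages])   # ['No', 'No', 'Yes', 'Yes', 'No']
```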
2. Flexibility: Despite their simplicity, decision stumps can be effective when used in boosting
because they allow the boosting algorithm to create a more powerful ensemble model by
focusing on difficult data points.
3. Efficiency: Since decision stumps are simple, they can be trained quickly. This efficiency is
important in boosting, as many stumps are trained in sequence.
4. Reduction of Bias: When combined in boosting algorithms, decision stumps help reduce bias by
iteratively correcting errors in previous stumps, leading to a more accurate overall model.
2. Risk of Overfitting: Although decision stumps themselves are simple, boosting algorithms can
still overfit if too many stumps are used, especially in noisy datasets. The model may begin to
focus too heavily on the noise in the data, leading to overfitting.
AdaBoost
AdaBoost (Adaptive Boosting) is one of the most popular and influential boosting algorithms in
machine learning.
It was introduced by Yoav Freund and Robert Schapire in 1995 and is known for its simplicity,
effectiveness, and ability to improve the performance of weak learners.
AdaBoost is designed to create a strong predictive model by combining the outputs of multiple
weak learners.
The key idea behind AdaBoost is to iteratively focus on the data points that are difficult to
classify, adjusting the weight of each data point based on the mistakes made by the previous
models.
The goal of AdaBoost is to improve the performance of weak learners by combining them into a
strong learner.
A weak learner is a model that performs slightly better than random guessing.
In AdaBoost, the weak learners are typically decision stumps (one-level decision trees),
although other models can be used as well.
Each decision stump is trained sequentially, and AdaBoost focuses more on the misclassified
examples from previous iterations, ensuring that each new model corrects the errors made by
its predecessors.
AdaBoost starts by assigning an equal weight to every training instance (wᵢ = 1/n for n instances); these weights determine how much influence each training instance has on the learning process.
The learner makes predictions for all data points, and an error rate ϵ is computed as the
weighted sum of the misclassified instances:
ϵ = Σᵢ wᵢ · I(yᵢ ≠ ŷᵢ)
where I(yᵢ ≠ ŷᵢ) is 1 if the prediction is wrong, and 0 if the prediction is correct.
Here, α is a coefficient that measures the accuracy of the current learner, given by:
α = (1/2) · ln((1 − ϵ) / ϵ)
The coefficient α becomes larger when the learner performs well, and smaller when the learner
performs poorly. This ensures that the weak learners that perform well are given higher weight,
while those that perform poorly have less influence.
For classification, the final prediction is made by majority voting, where each weak learner casts
a weighted vote. For regression, the predictions of all weak learners are averaged, weighted by
their respective α values.
In the first round, the weak learner might correctly classify most of the easy points but make
mistakes on the difficult ones. AdaBoost increases the weights of the difficult points.
In the second round, the next weak learner is trained, and because the difficult points now have
higher weights, the new model focuses more on getting these points right.
This process continues, with each new model focusing on correcting the mistakes of previous
ones, gradually improving the overall model’s performance.
By the end of the process, AdaBoost has created an ensemble of weak learners, where each
learner focuses on different parts of the data. The final strong model is built by combining these
learners into a single, highly accurate model.
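To make the ϵ / α / re-weighting steps concrete, here is a minimal, educational AdaBoost-style loop (assuming NumPy and scikit-learn; labels are mapped to {-1, +1} and the dataset is synthetic). This is a sketch of the idea, not a substitute for library implementations such as AdaBoostClassifier:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=7)
y = np.where(y == 1, 1, -1)                 # work with {-1, +1} labels

n = len(y)
w = np.full(n, 1.0 / n)                     # start with equal weights
stumps, alphas = [], []

for t in range(20):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)        # weak learner trained on weighted data
    pred = stump.predict(X)

    eps = np.sum(w * (pred != y))           # weighted error rate
    eps = np.clip(eps, 1e-10, 1 - 1e-10)    # avoid division by zero / log(0)
    alpha = 0.5 * np.log((1 - eps) / eps)   # learner's weight in the ensemble

    w = w * np.exp(-alpha * y * pred)       # up-weight mistakes, down-weight correct points
    w = w / w.sum()                         # renormalise

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted vote of all stumps.
ensemble = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print("training accuracy:", np.mean(ensemble == y))
```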
Advantages of AdaBoost
1. High Accuracy: AdaBoost can achieve high accuracy by combining weak learners in a way that
improves their performance. It is particularly effective for classification tasks with complex
decision boundaries.
4. Versatility: While decision stumps are commonly used as base learners, AdaBoost can work
with other types of models as well, such as decision trees, linear models, or even neural
networks, though decision stumps are preferred due to their simplicity.
Disadvantages of AdaBoost
1. Sensitive to Noisy Data and Outliers: Since AdaBoost increases the weight of misclassified
points, it can give too much attention to noisy data or outliers. If these points are incorrectly
labeled, the algorithm may overfit to them, reducing overall performance.
2. Overfitting: If AdaBoost is run for too many iterations, it may overfit the training data, especially
if the weak learners become too specialized in correcting noise or outliers.
3. Computational Cost: Although AdaBoost is efficient in terms of the individual learners, the
iterative process can still be computationally expensive, particularly when dealing with large
datasets or many iterations.
Aspect | Bagging | Boosting
Purpose | Reduces variance by averaging predictions | Reduces bias by focusing on difficult cases
Model Types | Usually uses the same type of model (e.g., decision trees) | Uses weak learners (often decision trees), which can be of any type
Impact on Bias | Primarily reduces variance; little effect on bias | Reduces both bias and variance (focuses on correcting errors)
Focus on Errors | Equal treatment for all data points; does not focus on previous errors | Focuses on difficult-to-classify points by adjusting the weights of misclassified data
Performance | Works well with high-variance models (e.g., decision trees) | Generally leads to better performance when bias is high, especially with weak learners
The idea is that by combining the outputs of several models, we can achieve better
performance than any single model could on its own.
The core assumption behind ensemble methods is that combining different models can reduce
errors and improve the overall prediction accuracy.
Ensemble learning methods can be broadly categorized into two types: Simple and Advanced.
i) Simple
1. Simple Ensemble Learning Methods
Simple ensemble methods typically involve combining several similar models in a
straightforward manner. They are often easier to implement and understand. The most common
simple ensemble techniques are:
1. Bagging
Bagging is a technique that aims to reduce variance and avoid overfitting. In this method,
several models are trained independently on different random subsets of the training
data.
These subsets are created through a process called bootstrapping, where random
samples of the data are drawn with replacement.
This means that some data points might appear more than once in each subset, while
others may be left out.
For regression tasks, the final prediction is usually the average of the predictions made
by each model.
For classification tasks, the final prediction is determined by majority voting, where the
class that gets the most votes from all models is chosen.
Example: A popular algorithm that uses bagging is Random Forests. In a random forest,
multiple decision trees are trained on different bootstrap samples of the dataset, and the
final classification or regression prediction is made by aggregating the results of all the
trees.
2. Boosting
Boosting is another ensemble method that aims to improve the accuracy of weak
models.
In boosting, models are trained sequentially, with each new model focusing on the
mistakes made by the previous ones.
The idea is that each model in the sequence tries to correct the errors of the previous
model by giving more weight to the misclassified data points.
The models are combined in a weighted manner, where more weight is given to the
predictions of models that perform better.
The final prediction is typically made by summing up the predictions from all the models.
3. Voting
Voting is a simple ensemble method that combines multiple models to make a final
prediction by majority rule.
In classification problems, each model in the ensemble casts a "vote" for a particular
class, and the class that receives the most votes is selected as the final prediction.
In regression tasks, the average of the individual model predictions is taken as the final
prediction.
Voting can be used with different types of models, such as decision trees, support vector
machines, or neural networks, as long as they are independent of each other. There are
two common types of voting:
Hard Voting: Each model predicts a class, and the class with the majority of votes is
chosen.
Soft Voting: Each model predicts class probabilities, and the class with the highest
average probability across all models is selected.
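A short sketch of hard vs. soft voting with three different base models (assuming scikit-learn; the models and dataset are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

base = [
    ("tree", DecisionTreeClassifier(random_state=3)),
    ("logreg", LogisticRegression(max_iter=1000)),
    ("svm", SVC(probability=True, random_state=3)),  # probability=True is needed for soft voting
]

hard = VotingClassifier(estimators=base, voting="hard").fit(X_train, y_train)  # majority of class votes
soft = VotingClassifier(estimators=base, voting="soft").fit(X_train, y_train)  # average of class probabilities

print("hard voting accuracy:", hard.score(X_test, y_test))
print("soft voting accuracy:", soft.score(X_test, y_test))
```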
2. Advanced Ensemble Learning Methods
1. Stacking
Stacking (or stacked generalization) is an advanced ensemble method where the
predictions of several base models are used as input to a higher-level model, called a
meta-learner.
Unlike bagging and boosting, which combine predictions directly, stacking uses a
second-level model to learn how best to combine the outputs of the base models.
In stacking, multiple models are trained on the training data, and their predictions are
then used as features for a meta-model.
The meta-model is trained to make the final prediction based on the predictions of the
base models.
This approach allows for the possibility of combining the strengths of different types of
models.
Example: Imagine you use a decision tree, a support vector machine (SVM), and a neural
network as base models. The meta-learner could be a logistic regression model that
takes the predictions of these models as input to predict the final output.
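A sketch of that example with scikit-learn's StackingClassifier (the MLP stands in for the "neural network"; all settings are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)

stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=4)),
        ("svm", SVC(random_state=4)),
        ("mlp", MLPClassifier(max_iter=1000, random_state=4)),
    ],
    final_estimator=LogisticRegression(),  # meta-learner trained on the base models' predictions
    cv=5,                                  # out-of-fold predictions are used as meta-features
)
stack.fit(X_train, y_train)
print("stacking accuracy:", stack.score(X_test, y_test))
```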
2. Gradient Boosting
Instead of just giving more weight to the misclassified instances, gradient boosting fits
new models to the residual errors (the difference between the observed and predicted
values) of the previous models.
This technique reduces both bias and variance and can produce highly accurate models.
In gradient boosting, each new model in the sequence tries to predict the residuals of the
previous models by focusing on where the errors are largest.
The final prediction is the sum of all the predictions from the models in the sequence.
Example: XGBoost and LightGBM are implementations of gradient boosting that are
highly efficient and have become very popular due to their accuracy and speed in
handling large datasets.
This randomness helps prevent overfitting and increases the model’s generalization
ability.
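To make the residual-fitting idea concrete, here is a deliberately simple gradient-boosting-style loop for squared-error regression (assuming NumPy and scikit-learn; this is an educational sketch, not how XGBoost or LightGBM are implemented internally):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)   # noisy target

learning_rate = 0.1
prediction = np.full_like(y, y.mean())    # start from a constant prediction
trees = []

for _ in range(100):
    residuals = y - prediction            # where is the current ensemble still wrong?
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)                # fit the next weak learner to the residuals
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("final training MSE:", np.mean((y - prediction) ** 2))
```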
4. Blending
Blending is similar to stacking but typically uses a simple holdout validation set to train
the meta-learner.
In stacking, the meta-model is trained using cross-validation, which can be
computationally expensive.
Blending, on the other hand, usually splits the training data into two parts: one to train the
base models and the other to train the meta-model.
Example: If you're combining decision trees, support vector machines, and neural
networks, you would use a holdout set to train a logistic regression model that takes the
outputs of these models as input.
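A rough sketch of blending (assuming scikit-learn and NumPy; the base models, split sizes, and the logistic-regression meta-model mirror the example above, but all settings are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1500, n_features=20, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)
# Split the training data again: one part for the base models, one holdout for the meta-model.
X_base, X_hold, y_base, y_hold = train_test_split(X_train, y_train, test_size=0.3, random_state=5)

base_models = [DecisionTreeClassifier(random_state=5), SVC(probability=True, random_state=5)]
for m in base_models:
    m.fit(X_base, y_base)

def meta_features(models, X):
    # Each base model's predicted probability of the positive class becomes one column.
    return np.column_stack([m.predict_proba(X)[:, 1] for m in models])

meta = LogisticRegression().fit(meta_features(base_models, X_hold), y_hold)
print("blending accuracy:", meta.score(meta_features(base_models, X_test), y_test))
```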
Random Forests
Random Forests is an ensemble learning technique primarily used for classification and
regression tasks.
It is an extension of decision trees that aims to increase accuracy, reduce overfitting, and
improve robustness.
The key idea behind Random Forests is to combine the outputs of many decision trees, each
trained on different parts of the data, to make better overall predictions.
A decision tree is a model that makes predictions by asking a series of questions about the
features of the input data.
Each question divides the data into different subsets, and the process continues until the data is
divided into homogenous groups (where the output is as pure as possible).
While decision trees are simple and interpretable, they tend to overfit the training data,
especially when the tree is very deep and complex.
Instead of relying on a single model, ensemble methods like Random Forest use the idea that "a
group of weak learners can make a strong learner."
By combining many trees, Random Forests help overcome the limitations of individual decision
trees.
Bootstrap sampling: each tree is trained on a random sample of the data (with replacement).
This means that each tree is trained on a slightly different subset of the dataset, which
introduces diversity among the trees.
Random Forests also introduce randomness in the features used to split the data at each node
of a tree.
Instead of considering all features when making a split, a random subset of features is chosen
at each node, further increasing the diversity between the trees.
Each tree is built using a different random subset of the training data (via bootstrap
sampling), and at each split within a tree, only a random subset of features is considered.
Since the trees are built on slightly different data and use different features, each tree may
learn different patterns in the data.
Once all the trees are trained, they are used to make predictions. For a given input, each
tree in the forest will produce its own prediction.
For Classification: The final prediction is made using majority voting. The class that is
predicted by the majority of trees becomes the final prediction.
For Regression: The final prediction is typically the average of the individual tree
predictions.
The predictions from all the trees are combined to make the final prediction, which usually
leads to better performance than any single decision tree.
One of the biggest strengths of Random Forests is their ability to reduce overfitting. Since
the trees are trained on different data subsets and use different features, they are less likely
to memorize (overfit) the training data. The ensemble nature helps to smooth out individual
errors made by individual trees.
2. Improved Accuracy:
Random Forests usually perform better than a single decision tree because they aggregate
the predictions of multiple trees, which helps to improve generalization and reduce bias.
Random Forests can handle missing data in a robust way. During training, when some data
points are missing, Random Forests can still create trees based on the available data and
make predictions even for missing values.
Random Forests can be applied to both classification and regression tasks. For
classification, it predicts the majority class, and for regression, it predicts the average of the
individual tree predictions.
5. Feature Importance:
Random Forests provide a built-in feature importance metric, which tells you which features
are most influential in making predictions. This can help with feature selection and
understanding the data.
6. Robust to Noise:
Because of the randomness introduced during training, Random Forests are relatively robust
to noise and can handle high-dimensional datasets (datasets with many features).
While Random Forests are powerful, they can become computationally expensive,
especially with large datasets and many trees. Training many trees and making predictions
with them can require more memory and processing power than simpler models.
2. Less Interpretability:
One of the trade-offs of using an ensemble method like Random Forest is that it is much
harder to interpret than a single decision tree. While individual decision trees are easy to
understand and visualize, a forest of hundreds or thousands of trees is not easy to interpret.
Due to the large number of trees, Random Forests can be slower for making predictions
compared to simpler models. For tasks requiring real-time predictions, the prediction time
might be a concern.
Customer | Age | Income | Buy Product (Y/N)
1 | 25 | 50k | N
2 | 30 | 60k | Y
3 | 35 | 70k | Y
4 | 40 | 80k | N
1. Step 1: Use bootstrap sampling to create multiple different subsets of the dataset. For example,
one tree might be trained on customers 1, 2, and 3, while another tree might be trained on
customers 2, 3, and 4.
2. Step 2: Each tree in the Random Forest will build a decision tree using different features at each
split. For example, one tree might use "Age" to split the data, while another might use "Income."
3. Step 3: When a new customer with a certain age and income wants to be predicted, each tree in
the forest will predict whether they will buy the product or not. If the majority of trees predict
"Yes," then the final prediction will be "Yes."
4. Step 4: Combine the predictions from all the trees to give the final prediction.
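A sketch of these steps with scikit-learn's RandomForestClassifier, using the tiny customer table above (the numeric encoding of income and the new customer's values are assumptions for illustration):

```python
from sklearn.ensemble import RandomForestClassifier

X = [[25, 50_000],   # features: Age, Income
     [30, 60_000],
     [35, 70_000],
     [40, 80_000]]
y = ["N", "Y", "Y", "N"]  # label: Buy Product

# 100 trees, each built on a bootstrap sample with random feature subsets at each split.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

new_customer = [[32, 65_000]]
print("prediction:", forest.predict(new_customer)[0])        # majority vote of all trees
print("feature importances (Age, Income):", forest.feature_importances_)
```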
a. Random Forest
b. Adaboost
Diagnostics of Classifiers
In machine learning, evaluating the performance of a classifier is crucial for understanding how
well it can make predictions on unseen data.
A classifier is an algorithm that assigns labels to data points based on learned patterns,
typically used for classification tasks where the goal is to categorize input data into predefined
classes.
These metrics help assess different aspects of classifier performance, including accuracy,
precision, recall, and the trade-offs between them.
1. Accuracy
Accuracy is calculated as the ratio of correct predictions to the total number of predictions
made.
Accuracy = Number of Correct Predictions / Total Number of Predictions
While accuracy gives a good overview of model performance, it may not be suitable in
cases where the data is imbalanced (i.e., one class is much more frequent than others).
In such situations, even if the model predicts the majority class most of the time, it could still
achieve high accuracy while performing poorly on the minority class.
2. Precision
It measures how many of the instances predicted as positive by the classifier were actually
positive.
Precision is important when the cost of false positives (predicting an incorrect positive
outcome) is high, such as in spam detection or medical diagnosis.
Precision = TP / (TP + FP)
Where:
True Positives (TP) are the instances that were correctly classified as positive.
False Positives (FP) are the instances that were incorrectly classified as positive.
A high precision means that the classifier is reliable when it predicts a positive outcome, but
it does not tell us about how well the model performs on negative instances.
3. Recall (Sensitivity)
Recall, also known as sensitivity or true positive rate, measures the classifier's ability to
identify all the relevant instances within the data.
It focuses on the true positives (TP) and examines how many of the actual positive
instances the model successfully identifies.
Recall is critical when the cost of false negatives (failing to identify a positive outcome) is
high, such as in detecting diseases or fraud detection.
Recall = TP / (TP + FN)
Where:
False Negatives (FN) are the instances that were incorrectly classified as negative.
Recall is useful when the goal is to minimize missed positives, but it does not consider how
many false positives the model makes.
4. F1-Score
The F1-Score balances the trade-off between precision and recall by combining them into a single number:
F1 = 2 · (Precision · Recall) / (Precision + Recall)
The F1-Score is particularly useful when you need a balance between precision and recall
and when you are dealing with an imbalanced dataset.
The F1-Score ranges from 0 to 1, where 1 indicates perfect precision and recall, and 0
indicates the worst performance.
It provides a more nuanced evaluation when both false positives and false negatives are
important.
5. Confusion Matrix
It is a table that compares the predicted labels with the true labels of a dataset.
The confusion matrix is especially useful in understanding the types of errors the classifier
is making.
From the confusion matrix, a variety of evaluation metrics can be derived, such as
precision, recall, and accuracy.
It also provides insight into the specific types of mistakes the model is making (e.g., false
positives vs. false negatives).
6. ROC Curve and AUC
The Receiver Operating Characteristic (ROC) curve is a graphical representation of the
classifier’s performance across all possible classification thresholds.
It plots the true positive rate (recall) on the y-axis against the false positive rate (1 -
specificity) on the x-axis.
Each point on the curve represents a different threshold for classifying a positive instance,
and as you move along the curve, you observe the trade-off between the number of true
positives and false positives.
The Area Under the Curve (AUC) quantifies the overall ability of the classifier to
discriminate between positive and negative classes.
The AUC score ranges from 0 to 1, with higher values indicating better model performance.
A model with an AUC of 0.5 indicates no discrimination (i.e., random guessing), while a
value of 1 indicates perfect classification.
7. Specificity
Specificity, also known as the true negative rate, measures how well the classifier identifies negative instances:
Specificity = TN / (TN + FP)
A high specificity means that the model does a good job of identifying negative instances
and avoiding false positives.
8. Matthews Correlation Coefficient (MCC)
The Matthews Correlation Coefficient (MCC) is a measure that takes into account all four quadrants of the confusion matrix:
MCC = (TP·TN − FP·FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
It is considered a more balanced measure than accuracy, especially when dealing with
imbalanced datasets.
MCC is particularly useful for evaluating binary classification tasks with imbalanced classes.
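A quick sketch of computing specificity and MCC from a confusion matrix (assuming scikit-learn; the toy labels below are made up):

```python
from sklearn.metrics import confusion_matrix, matthews_corrcoef

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)              # true negative rate
print("specificity:", specificity)
print("MCC:", matthews_corrcoef(y_true, y_pred))
```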
💡 1. Explain: [OCT 23]
i) Accuracy
ii) Precision
iii) Recall
iv) F-Score
Explain in brief methods used for evaluating classification models. [OCT 22]
These metrics are related to how the model handles positive predictions, but they focus on
different aspects of performance.
1. Precision measures the accuracy of positive predictions. It answers the question: Of all the
instances the model classified as positive, how many were actually positive?
Formula:
Precision = TP / (TP + FP)
where:
TP = True Positives (instances that were actually positive and predicted as positive)
FP = False Positives (instances that were actually negative but predicted as positive)
2. Recall (also known as Sensitivity or True Positive Rate) measures the ability of the model
to identify all positive instances. It answers the question: Of all the actual positive instances,
how many did the model correctly identify?
Formula:
Recall = TP / (TP + FN)
where:
FN = False Negatives (instances that were actually positive but predicted as negative)
True Positives (TP): 80 emails correctly classified as spam
False Positives (FP): 10 emails incorrectly classified as spam (but are not spam)
False Negatives (FN): 5 emails incorrectly classified as not spam (but are actually spam)
Precision:
Precision = TP / (TP + FP) = 80 / (80 + 10) = 80 / 90 ≈ 0.89
This means that 89% of the emails classified as spam by the model are actually spam.
Recall:
Recall = TP / (TP + FN) = 80 / (80 + 5) = 80 / 85 ≈ 0.94
This means that 94% of all actual spam emails were correctly identified by the model.
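The same arithmetic as a few lines of Python (values taken from the example above):

```python
tp, fp, fn = 80, 10, 5

precision = tp / (tp + fp)   # 80 / 90
recall = tp / (tp + fn)      # 80 / 85

print(f"precision = {precision:.2f}")   # ~0.89
print(f"recall    = {recall:.2f}")      # ~0.94
```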
If the model is biased towards precision (i.e., trying to minimize false positives), it may classify
fewer emails as spam, reducing the chance of false positives but also missing some actual
spam emails, leading to lower recall.
If the model is biased towards recall (i.e., trying to minimize false negatives), it may classify
more emails as spam, which increases the number of correct spam classifications but also
increases the risk of incorrectly labeling non-spam emails as spam (false positives), lowering
precision.
The confusion matrix provides a detailed breakdown of the model's predictions by comparing them with the actual outcomes in the dataset.
This matrix helps assess not just the overall accuracy of the model but also its ability to classify
each class correctly and the types of errors it is making.
True Positives (TP): The number of instances that are correctly predicted as positive.
True Negatives (TN): The number of instances that are correctly predicted as negative.
False Positives (FP): The number of instances that are incorrectly predicted as positive
(type I error).
False Negatives (FN): The number of instances that are incorrectly predicted as negative
(type II error).
These components allow us to calculate important metrics such as:
Precision: The proportion of positive predictions that are actually correct (TP / (TP + FP)).
Recall (Sensitivity): The proportion of actual positives that are correctly identified (TP / (TP
+ FN)).
F1-Score: The harmonic mean of precision and recall, providing a balanced measure when
classes are imbalanced.
2. Handling Imbalanced Classes: In cases of imbalanced datasets, where one class is much more
frequent than the other, accuracy alone can be misleading. The confusion matrix gives insights
into the types of misclassifications (false positives and false negatives), allowing for more
effective strategies to handle class imbalance.
3. Improving Model Decision-Making: By analyzing the confusion matrix, you can identify where
your model is going wrong. For example, if the model has a high number of false positives, you
might want to adjust the decision threshold or explore techniques like class weighting or
resampling.
4. Evaluation of Specific Metrics: Precision, recall, and F1-score are particularly useful when the
costs of different types of errors are unequal, or when the classes are not equally important.
The confusion matrix enables the calculation of these metrics, providing a more comprehensive
view of model performance.
Macro-Averaging treats all classes equally, regardless of how many instances belong to each class.
1. Macro-Average Precision:
To compute Macro-Average Precision, we first calculate the precision for each class
individually.
Precision for a class is the ratio of correctly predicted positive instances (True Positives)
to the total predicted positives (True Positives + False Positives).
Once we have the precision for each class, we take the average of these values across
all classes.
Formula:
Macro Precision = (1/N) Σᵢ Precisionᵢ, where Precisionᵢ = TPᵢ / (TPᵢ + FPᵢ)
Where N is the number of classes, TPᵢ is the number of True Positives for class i, and FPᵢ is the number of False Positives for class i.
2. Macro-Average Recall:
Recall for a class is the ratio of correctly predicted positive instances (True Positives) to
the total actual positives (True Positives + False Negatives).
After computing the recall for each class, we take the average of these recall values
across all classes.
Formula:
Macro Recall = (1/N) Σᵢ Recallᵢ, where Recallᵢ = TPᵢ / (TPᵢ + FNᵢ)
3. Macro-Average F1-Score:
The F1-Score is the harmonic mean of precision and recall, providing a balance between
the two.
For Macro-Average F1-Score, we calculate the F1-Score for each class individually and
then average these scores.
Formula:
Macro F1 = (1/N) Σᵢ (2 · Precisionᵢ · Recallᵢ) / (Precisionᵢ + Recallᵢ)
Macro-Average is useful when you want to treat all classes equally, regardless of their size or
distribution in the dataset.
However, it can be heavily influenced by classes with fewer instances, which may lead to a
lower score if the model performs poorly on rare classes.
Instead of calculating Precision, Recall, and F1-Score for each class individually and then
averaging, micro-averaging sums up the True Positives, False Positives, and False Negatives
across all classes and then computes the metric.
1. Micro-Average Precision:
a. To compute Micro-Average Precision, we sum up all True Positives (TP) and False
Positives (FP) across all classes, and then calculate the precision based on these totals.
Formula:
Micro Precision = Σᵢ TPᵢ / (Σᵢ TPᵢ + Σᵢ FPᵢ)
Where TPᵢ and FPᵢ are the True Positives and False Positives for class i.
2. Micro-Average Recall:
For Micro-Average Recall, we sum up all True Positives and False Negatives across all
classes, and then compute the recall.
Formula:
Micro Recall = Σᵢ TPᵢ / (Σᵢ TPᵢ + Σᵢ FNᵢ)
3. Micro-Average F1-Score:
Micro-Average F1-Score is the harmonic mean of Micro Precision and Micro Recall,
which are calculated using the aggregated sums of True Positives, False Positives, and
False Negatives.
Formula:
Micro F1 = 2 · (Micro Precision · Micro Recall) / (Micro Precision + Micro Recall)
Micro-Averaging is useful when the dataset is imbalanced, or when you want to give more
weight to the overall performance of the model rather than the individual class performance.
Since it aggregates the contributions from all classes, it is less sensitive to the performance on
any specific class, particularly small classes.
Micro-Average is useful when you are more interested in the model’s overall ability to predict
correctly, especially in cases where large classes dominate the data and you want to prioritize
their performance.
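A small sketch contrasting macro- and micro-averaged scores on an imbalanced toy problem (assuming scikit-learn; the labels are made up for illustration):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Three classes, with class 0 much more frequent than classes 1 and 2.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 2, 2, 2]

for avg in ("macro", "micro"):
    p = precision_score(y_true, y_pred, average=avg)
    r = recall_score(y_true, y_pred, average=avg)
    f = f1_score(y_true, y_pred, average=avg)
    print(f"{avg}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```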
ROC Curve
The Receiver Operating Characteristic (ROC) curve is a graphical representation used to
evaluate the performance of a binary classification model.
It provides a way to assess how well a model distinguishes between two classes (e.g., positive
and negative).
The ROC curve is widely used in fields such as medicine, machine learning, and signal
detection to compare different classifiers and to choose the best model based on certain
performance metrics.
In binary classification, a model usually outputs a probability score for each instance, indicating
how likely it is to belong to the positive class.
A threshold is then applied to this score to decide whether an instance is classified as positive
or negative.
For example, if the model predicts a probability above 0.5, it might classify the instance as
positive; otherwise, it classifies it as negative.
By varying this threshold from 0 to 1, we can observe how the TPR and FPR change, which
produces different points on the ROC curve.
The more the curve leans toward the top-left corner (where TPR is high and FPR is low), the
better the model’s performance.
Interpreting the ROC Curve
Top-left corner (ideal point): This represents a point where the model has a high TPR (almost 1)
and a low FPR (close to 0). This means the model is correctly identifying most positive cases
and minimizing false positives.
Diagonal line (random classifier): If the ROC curve lies along the diagonal line from the bottom-
left to the top-right (also known as the line of no-discrimination), the model is no better than
random guessing. In this case, the TPR is approximately equal to the FPR, meaning the model
fails to distinguish between the positive and negative classes.
Area under the ROC Curve (AUC): The area under the ROC curve is often used as a summary
statistic for the model's performance. The AUC ranges from 0 to 1:
An AUC of 0.5 means the model has no discriminative ability, equivalent to random
guessing.
An AUC less than 0.5 indicates a model that is worse than random guessing (which
suggests that the model might need to be re-evaluated or retrained).
2. Class Imbalance: The ROC curve is particularly useful in situations where the dataset is
imbalanced (i.e., one class is much more prevalent than the other). It focuses on the relationship
between TPR and FPR, so it does not get overly affected by class imbalance as accuracy might.
3. Comparison of Models: Since the ROC curve plots TPR against FPR, it allows easy comparison
between different models. A model that consistently stays above the curve of another model is
considered better.
2. No Information about the Cost of Misclassification: The ROC curve does not take into account
the different costs of misclassifications (i.e., False Positives vs. False Negatives). In some
applications, False Negatives might be more costly than False Positives (e.g., in medical
diagnoses), and this is not captured by the ROC curve alone.
AUC, or "Area Under the ROC Curve," is a metric used to evaluate the performance of a
classifier.
It is closely linked to the Receiver Operating Characteristic (ROC) curve, which visually shows
how well a model distinguishes between positive and negative classes at various decision
thresholds.
AUC represents the area under the ROC curve, and it provides a measure of how well the model
can distinguish between two classes.
Instead of assessing a model through the ROC curve visually, AUC summarizes it into a single
numerical value, with a higher AUC indicating superior model performance.
A single AUC value allows for easy comparison between different models. The model with the
higher AUC typically performs better in classification when tested on the same dataset.
AUC = 1: This indicates perfect performance. A model with an AUC of 1 has no errors and is able
to perfectly distinguish between positive and negative instances. It means that for any pair of a
positive and a negative instance, the model will always rank the positive instance higher than
the negative one.
AUC = 0.5: This indicates no discrimination ability. A model with an AUC of 0.5 performs no
better than random guessing. For example, it might rank positive and negative instances
randomly, with no preference for correctly identifying one class over the other. In this case, the
ROC curve lies along the diagonal line from the bottom-left to the top-right, also known as the
line of no discrimination.
AUC < 0.5: This indicates poor performance, worse than random guessing. A model with an
AUC less than 0.5 means that, for some reason, the model is consistently predicting the
opposite of what it should—ranking negative instances higher than positive ones. This could be
due to a model that is fundamentally flawed or misconfigured.
AUC > 0.5: Any AUC value above 0.5 indicates that the model has some discriminative power
and is better than random guessing. A higher AUC value corresponds to a model that is better at
distinguishing between the classes.
The TPR and FPR are computed at each threshold and plotted to form the ROC curve, and the area under the curve is then determined.
AUC can be computed using various methods, such as the trapezoidal rule, which approximates
the area under the curve by dividing it into smaller trapezoids and summing their areas.
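A short sketch of computing ROC points and AUC from predicted probabilities (assuming scikit-learn; the dataset and model are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=6)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=6)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]        # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # one (FPR, TPR) point per threshold
print("number of thresholds evaluated:", len(thresholds))
print("AUC (trapezoidal over the curve):", auc(fpr, tpr))
print("AUC (computed directly)         :", roc_auc_score(y_test, scores))
```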
Advantages of AUC
1. Threshold Independence: A key advantage of AUC is that it evaluates the model's performance
across all possible decision thresholds, meaning it doesn't depend on a specific threshold (like
0.5). This is useful because in some situations, you may not want to set a fixed threshold for
classification, and AUC allows you to assess the model's overall capability across a range of
thresholds.
2. Comprehensive Evaluation: AUC provides a single value that summarizes the model’s ability to
distinguish between positive and negative classes, making it easier to compare different
models. It gives a holistic view of a model's performance, regardless of the threshold used to
classify the instances.
3. Better for Imbalanced Datasets: In cases where one class is much more frequent than the other
(a common issue in real-world data), AUC can still provide useful insights into the model's
performance. Traditional metrics like accuracy may be misleading in imbalanced datasets, as a
model that predicts the majority class for all instances could still have high accuracy. AUC,
however, focuses on how well the model differentiates between classes, making it more
informative in such situations.
Limitations of AUC
1. Does Not Consider Costs of Misclassification: AUC treats all false positives and false
negatives equally, without considering the potential costs of these errors. In some applications,
false positives might be much more costly than false negatives (or vice versa). AUC does not
take this into account, so it may not always reflect the true importance of the model’s errors for
a particular use case.
2. Can Be Misleading with Highly Imbalanced Data: While AUC is generally a good metric, it can
sometimes be misleading when there is a very large class imbalance. In extreme cases, a model
might show a high AUC even if it fails to identify the minority class adequately, since the
majority class can dominate the evaluation.
Cross Validation
Cross-validation is a technique used in machine learning and statistics to check how well a
model performs on new, unseen data.
It helps in evaluating the performance of a predictive model by splitting the available dataset
into multiple subsets, training the model on some of these subsets, and testing it on the
remaining ones.
This process helps us estimate how the model will perform when faced with new data, which
reduces the risk of overfitting (when a model performs well on training data but poorly on new
data).
In machine learning, we typically split our data into two parts: one for training the model and
one for testing it.
The model is trained on the training data and then evaluated on the test data.
However, if we only split the data once, there's a chance that the model's performance could
depend too much on how the data was split.
For example, if the test set doesn't represent the entire dataset well, the performance might not
show how the model will do in real-life situations.
Cross-validation addresses this by making multiple splits, which helps provide a more accurate
estimate of the model's performance.
This gives a clearer picture of how the model might behave when applied to new, unseen data.
One common challenge in training machine learning models is overfitting, where the model gets
too good at fitting the training data but struggles with new data.
This can happen when the model is too complex or trained for too long on a small dataset.
Cross-validation helps reduce the risk of overfitting by ensuring the model is tested on different
subsets of the data, not just the training set it was first trained on.
Types of Cross-Validation
1. Holdout Method:
The Holdout Method is one of the simplest types of cross-validation. In this method, the
entire dataset is randomly divided into two separate sets: one for training the model and the
other for testing it.
Typically, a fixed percentage of the data, like 70% or 80%, is used for training, and the
remaining 20% or 30% is used for testing.
The model is trained on the training set and then evaluated on the test set.
This method is easy to implement and computationally inexpensive, making it useful for
quick evaluations.
2. k-Fold Validation:
In this technique, the dataset is randomly divided into k equal-sized subsets or "folds."
For each fold, the model is trained using the remaining k-1 folds and tested on the fold that
was left out.
This process is repeated k times, each time using a different fold as the test set while the
remaining folds are used for training.
3. Leave-P-Out Cross-Validation:
In this method, p data points are left out as the test set, and the model is trained on the
remaining data points.
This process is repeated such that every possible combination of p data points is used as
the test set at least once.
For example, if you have a dataset of 100 samples and choose to use Leave-1-Out (a special
case of LPOCV with p = 1), the model would be trained 100 times, each time leaving out one
data point for testing. In the case of Leave-2-Out, the model would be trained once for every
possible pair of held-out points, i.e. C(100, 2) = 4,950 times, and so on.
It is particularly useful when the dataset is small, as it maximizes the use of available data
for training and testing.
However, it comes with a significant computational cost because it requires training the
model multiple times, which can be impractical for larger datasets or more complex models.
Holdout Method
The Holdout Method is one of the simplest and most commonly used techniques for evaluating
the performance of machine learning models.
The main idea behind the holdout method is to split the available dataset into two distinct
subsets: one used for training the model and the other used for testing it.
This method is widely applied because it is easy to understand, simple to implement, and
computationally efficient.
In the Holdout Method, the dataset is randomly divided into two parts: a training set and a test
set.
The training set is used to build the model, while the test set is kept aside and used solely to
evaluate the model's performance after training.
Typically, the data is split in a 70-30 or 80-20 ratio, where 70% or 80% of the data is allocated
for training and the remaining 30% or 20% is used for testing.
For example, in a dataset with 100 data points, if the holdout ratio is 80-20, then 80 data points
would be randomly selected for training, and the remaining 20 data points would form the test
set.
The model is trained on the 80 data points, and after the training is complete, it is tested on the
remaining 20 data points.
The performance of the model on the test set (such as accuracy, precision, recall, etc.) is then
reported as an estimate of how well the model will perform on new, unseen data.
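The 80-20 holdout split described above, sketched with scikit-learn (dataset and model are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, n_features=10, random_state=8)

# 80 points for training, 20 held out for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=8)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout test accuracy:", model.score(X_test, y_test))
```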
2. Speed: Since the data is split into two sets (training and testing), this method can be very fast
compared to more complex cross-validation techniques, particularly when dealing with large
datasets.
3. Computational Efficiency: The holdout method is computationally light because the model is
trained only once on the training set and tested once on the test set. This makes it suitable for
situations where computational resources are limited.
2. Risk of Underfitting or Overfitting: If the split between the training and test set is not well-
balanced or if the model is too complex or too simple, it can lead to underfitting (where the
model fails to capture the underlying patterns) or overfitting (where the model performs very
well on the training set but poorly on the test set). This issue is particularly problematic when
working with small datasets.
3. Not Fully Utilizing the Data: Since only a subset of the data is used for training and the rest is
used for testing, the model is not trained on the entire dataset. In scenarios where the dataset is
small, this can be inefficient, as it does not fully leverage all available data for model training.
The holdout method is a reasonable choice when:
Data is abundant: If there is a large enough dataset, a single train-test split can give a
reasonably good estimate of model performance, and the limitations of the method become less
significant.
Speed is a priority: In situations where quick results are needed, the holdout method can be an
effective way to get an initial sense of how well the model might perform on unseen data.
Computational resources are limited: When training multiple models or performing more
complex cross-validation methods is computationally expensive, the holdout method offers a
quick alternative.
k-Fold Validation
K-Fold Cross Validation is a widely used technique in machine learning for evaluating the
performance and generalizability of a model.
It divides the dataset into K equal or nearly equal subsets (called "folds") and uses each subset
as a testing set while using the remaining K-1 subsets for training the model.
This process is repeated K times, with each fold being used as the test set exactly once.
The results of these K evaluations are then averaged to provide a more reliable estimate of the
model's performance.
1. Split the Data: The dataset is randomly divided into K equal (or nearly equal) folds.
2. Train and Test the Model: The model is trained K times. For each of the K iterations:
One fold is used as the test set (the data used to evaluate the model’s performance).
The remaining K-1 folds are combined to form the training set (the data used to train the
model).
3. Evaluate the Model: After each training iteration, the model is evaluated on the test fold.
Performance metrics like accuracy, precision, recall, or F1-score are computed for that fold.
4. Average the Results: Once all K iterations are complete, the performance results from each fold
are averaged to provide a final performance metric. This averaged score represents the model's
general performance on the dataset.
1. Step 1 (Split the Data): Divide the data into 5 folds (each fold contains 200 instances).
2. Step 2 (Train and Test):
In the first iteration, the model is trained on folds 2, 3, 4, and 5, and tested on fold 1.
In the second iteration, the model is trained on folds 1, 3, 4, and 5, and tested on fold 2.
This process continues until each fold has been used as a test set once.
3. Step 3 (Evaluate): After each iteration, the performance metric (e.g., accuracy) is calculated.
4. Step 4 (Average the Results): Once all iterations are complete, the average performance across
all 5 folds is computed. This final average provides a more reliable estimate of the model's
performance compared to using a single train-test split.
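A minimal sketch of the 5-fold procedure described above, assuming scikit-learn; the 1000-point synthetic dataset (200 instances per fold) and the logistic regression model are illustrative stand-ins.

```python
# 5-fold cross-validation sketch: train 5 times, test on each fold once, average.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []

for train_idx, test_idx in kf.split(X):
    # 4 folds (800 points) form the training set, 1 fold (200 points) is the test set.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

# The averaged score is the final cross-validated performance estimate.
print("Per-fold accuracy:", np.round(fold_scores, 3))
print("Mean accuracy:", np.mean(fold_scores))
```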
2. Efficient Use of Data: Every data point is used for both training and testing. This helps
especially when you have limited data, ensuring that all data points contribute to the evaluation
process.
3. Prevents Overfitting: By training and testing the model on different subsets, k-fold cross-
validation helps in reducing the risk of overfitting. Overfitting occurs when a model performs
well on a specific training set but poorly on unseen data, and k-fold cross-validation can detect
this issue by testing the model on different portions of the data.
4. Flexibility: K-fold cross-validation can be adapted to different datasets by varying the number
of folds. Common values for K are 5 or 10, but you can choose a different number based on
the size and characteristics of your data.
2. Bias in Small Datasets: In cases where the dataset is very small, even small changes in how the
data is split into folds can lead to biased performance estimates. In such cases, the
performance scores might not be as reliable as those obtained from larger datasets.
3. Not Ideal for Time Series: K-fold cross-validation assumes that the data is independent and
identically distributed (i.i.d.). However, for time series data, where the order of data points
matters (e.g., stock prices or weather data), this technique may not work well because the data
points are not independent. In such cases, other techniques like Time Series Cross Validation
should be considered.
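For reference, a brief sketch of the time-series-aware alternative mentioned above, assuming scikit-learn's TimeSeriesSplit; the 12 ordered observations are purely illustrative.

```python
# Time Series Cross Validation sketch: unlike K-fold, every test fold comes
# strictly after its training fold, so the temporal order of the data is respected.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)   # 12 ordered observations (e.g., months)
y = np.arange(12)

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    print("train:", train_idx, "test:", test_idx)
# First split: train [0 1 2], test [3 4 5] -- training always precedes testing.
```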
The value of K plays an important role in the performance of K-Fold Cross Validation. The choice of
K depends on the size of the dataset and the trade-off between computational efficiency and model
evaluation reliability:
Small K (e.g., K=2 or K=3): This leads to fewer training iterations and may provide faster
results, but it increases the variance in the performance estimate because the training and test
sets are not as varied.
Large K (e.g., K=10): This provides more reliable and stable performance estimates, but it is
more computationally expensive. K=10 is a common choice in many applications because it
strikes a balance between reliability and efficiency.
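A small sketch, assuming scikit-learn, of how different values of K can be compared in practice; the synthetic dataset and logistic regression model are illustrative. Larger K means more training runs per evaluation, which is where the extra computational cost comes from.

```python
# Comparing the effect of K: each call fits the model K separate times.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
model = LogisticRegression(max_iter=1000)

for k in (2, 5, 10):
    scores = cross_val_score(model, X, y, cv=k)   # K fits, K test scores
    print(f"K={k:2d}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```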
Leave-P-Out Cross-Validation (LPOCV)
Leave-P-Out Cross-Validation (LPOCV) is a more generalized version of cross-validation
techniques that is used to evaluate the performance of a machine learning model.
In this approach, instead of leaving out just one data point (as in Leave-One-Out Cross-
Validation or LOOCV), a subset of p data points is left out from the training set during each
iteration.
The model is trained on the remaining data and tested on the held-out p data points.
This process is repeated for all possible combinations of pdata points being left out, providing a
robust performance measure of the model.
1. Divide the Dataset: Given a dataset of size n, you select a value for p (the number of data points to be left out in each iteration). The dataset is then used to form combinations of p points that will be held out for testing, while the remaining n − p data points are used for training the model.
2. Train and Test the Model: For each combination of p data points held out, the model is trained using the remaining n − p data points. After training, the model is evaluated on the p held-out points (i.e., the test set for that iteration).
3. Repeat the Process: This process is repeated for all possible combinations of p data points. If n is the total number of data points in the dataset, the number of iterations is the number of ways to choose p points out of n, which is given by the binomial coefficient $\binom{n}{p}$.
4. Performance Evaluation: After performing all iterations, the performance scores (such as
accuracy, precision, recall, etc.) from each iteration are averaged to obtain an overall
performance metric for the model.
In Leave-2-Out Cross-Validation, 2 data points are left out during each iteration. With a dataset of 5 data points, the held-out sets are all possible pairs of points, giving $\binom{5}{2} = 10$ combinations.
In each iteration, the model is trained on 3 data points and tested on the 2 held-out points. After all iterations, the average performance score across all tests is used to evaluate the model.
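A minimal sketch of Leave-2-Out Cross-Validation, assuming scikit-learn's LeavePOut; the 5-point toy dataset and the 1-nearest-neighbour classifier are illustrative assumptions, not prescribed by the technique.

```python
# Leave-2-Out sketch: every pair of points is held out once; 3 points train, 2 test.
import math
import numpy as np
from sklearn.model_selection import LeavePOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1], [2], [3], [4], [5]])   # 5 illustrative data points, 1 feature
y = np.array([0, 0, 1, 1, 1])

lpo = LeavePOut(p=2)
print("Number of iterations:", lpo.get_n_splits(X))     # C(5, 2) = 10
print("Binomial coefficient check:", math.comb(5, 2))   # also 10

# Each iteration trains on 3 points and tests on the 2 held-out points;
# the mean over all 10 iterations is the final performance estimate.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=lpo)
print("Average accuracy over the 10 splits:", scores.mean())
```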
3. Flexibility: Leave-P-Out Cross-Validation allows for different values of p, making it a flexible
tool. This can be especially useful when you want to test the model's performance on different
sizes of test sets (e.g., testing with 2 points out versus 3 points out).
2. Infeasibility for Large Datasets: For large datasets, the number of possible combinations of data points increases quickly. For example, if $n = 100$ and $p = 2$, the number of iterations is $\binom{100}{2} = 4950$. This is often too large to be computationally feasible, making LPOCV impractical for datasets with many data points or when p is large.
3. Overfitting Risk in Small Datasets: While Leave-P-Out Cross-Validation is excellent for avoiding
overfitting in large datasets by ensuring thorough testing, it can sometimes lead to overfitting in
smaller datasets. This is because with each test set consisting of just p data points, the model
might overly specialize on very small subsets of data.
💡 1. What is K-fold cross-validation? In K-fold cross-validation, comment on the following
situations: [OCT 23] [APR 23]
4 3 Fail
6 7 Pass
7 8 Pass
5 5 Fail
8 8 Pass