
When choosing between these methods, consider the nature of your data, the problem you're trying to solve, and the assumptions of each technique.

Unit 3
Introduction to Classification and Classification Algorithms

What is Classification?
Classification is a supervised learning technique in machine learning that involves predicting
the category or class of a given data point based on its features. The goal is to map input
variables (features) to a discrete output variable (class label).

Definition: Classification is the process of identifying the category to which a new data
point belongs, based on a training dataset containing observations with known class labels.

Key Features:

1. Supervised Learning: Classification requires labeled training data.

2. Discrete Output: Output is typically categorical (e.g., spam or not spam).

3. Real-World Applications:

Email filtering (spam vs. non-spam)

Medical diagnosis (disease classification)

Sentiment analysis (positive, negative, or neutral sentiment)

Examples of Classification Problems


1. Binary Classification: Two possible classes.

Example: Classifying emails as spam or not spam.

2. Multi-class Classification: More than two classes.

Example: Classifying animals into categories like cats, dogs, or birds.

3. Multi-label Classification: Each instance may belong to multiple classes simultaneously.

Example: Classifying a movie into genres (action, drama, and comedy).

General Approach to Classification


The classification process typically involves the following steps:

1. Problem Understanding and Data Collection:

Clearly define the problem.

Collect a dataset with input features and corresponding class labels.

2. Data Preprocessing:

Handle missing data and outliers.

Normalize or standardize features.

Convert categorical data to numerical (e.g., one-hot encoding).

3. Feature Selection and Engineering:

Identify and retain the most relevant features.

Create new features if needed.

4. Splitting Data:

Divide the data into training and test sets (and sometimes a validation set).

5. Choose a Classification Algorithm:

Select an algorithm based on the problem and dataset characteristics (e.g., Logistic
Regression, Decision Trees).

6. Train the Model:

Fit the chosen model to the training dataset.

Adjust parameters (hyperparameters) to optimize performance.

7. Evaluate the Model:

Use performance metrics such as:

Accuracy: Proportion of correctly classified instances.

Precision and Recall: For imbalanced datasets.

F1-Score: Harmonic mean of precision and recall.

Confusion Matrix: Provides a detailed breakdown of correct and incorrect predictions.

8. Hyperparameter Tuning:

Optimize model parameters using methods like Grid Search or Random Search.

9. Test and Deploy:

Test the model on unseen data.

Deploy the model for real-world applications.
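As a rough end-to-end sketch of steps 4-7 (assuming scikit-learn is available; the synthetic dataset, the choice of Logistic Regression, and all parameter values below are placeholders, not part of the notes):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Step 4: split the data into training and test sets
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Steps 5-6: choose a classification algorithm and fit it to the training data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 7: evaluate on the held-out test set
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))  # precision, recall, F1-score per class
```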

Common Classification Algorithms
1. Linear Classifiers:

Logistic Regression

Linear Discriminant Analysis (LDA)

2. Non-linear Classifiers:

K-Nearest Neighbors (KNN)

Support Vector Machines (SVM)

3. Tree-based Methods:

Decision Trees

Random Forest

Gradient Boosting (e.g., XGBoost, LightGBM)

4. Neural Networks:

Multilayer Perceptrons (MLP)

Convolutional Neural Networks (CNNs) for image data.

5. Bayesian Models:

Naive Bayes

Challenges in Classification
Imbalanced Data: Some classes may have significantly more samples than others.

Overfitting: The model performs well on training data but poorly on test data.

High Dimensionality: A large number of features can make training difficult.

k-Nearest Neighbor (k-NN) Algorithm

Introduction
The k-Nearest Neighbors (k-NN) algorithm is a simple, yet powerful, instance-based, and non-
parametric classification technique in machine learning. It is widely used due to its ease of
implementation and interpretability. The key idea is that a data point’s class label can be
determined based on the majority class of its k closest neighbors in the feature space.

How k-NN Works


1. Training Phase:

In k-NN, there is no explicit training phase. The algorithm simply stores the training
dataset.

2. Prediction Phase:

Given a new input (test point), the algorithm calculates the distance between this point
and all points in the training set.

The k closest training examples are selected.

The class label of the test point is assigned based on the majority class among these k
neighbors.

Key Components of k-NN


1. Distance Metric:

To determine which points are the closest, a distance metric must be chosen. Common
metrics include:

Euclidean Distance: Euclidean Distance = sqrt(sum((x_i - y_i)^2)), where x_i and y_i are the coordinates of the two points.

Manhattan Distance (or L1 norm): Manhattan Distance = sum(|x_i - y_i|)

Minkowski Distance: A generalization of both Euclidean and Manhattan distances.

2. Choosing k:

The parameter k refers to the number of nearest neighbors to consider.

Small values of k (e.g., k=1) make the model highly sensitive to noise.

Large values of k make the algorithm more robust but can lead to underfitting.

3. Class Assignment:

The class label of the new data point is assigned based on the majority vote of the k
nearest neighbors.

In case of a tie, various tie-breaking strategies can be used (e.g., choosing the class
with the smallest distance sum).

4. Weighted Voting:

Instead of simple majority voting, weighted voting can be used where closer neighbors
have a higher influence on the classification.

Algorithm Steps
1. Input: A dataset of labeled instances, the number of neighbors (k), and the test instance.

2. For each test point:

Calculate the distance between the test point and all training data points using the
chosen distance metric.

Sort the distances in ascending order.

Select the top k training points with the smallest distances.

Perform a majority vote or weighted vote to assign the test point’s class.

3. Output: The predicted class for the test point.
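The steps above can be written almost verbatim as a small NumPy sketch (a from-scratch illustration only, using Euclidean distance and simple majority voting; the toy data is made up):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    # 1. Distance from the test point to every training point (Euclidean)
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # 2. Indices of the k nearest neighbours (smallest distances)
    nearest = np.argsort(distances)[:k]
    # 3. Majority vote among the labels of those neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny worked example with two classes A=0 and B=1
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.2], [3.1, 2.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # -> 0 (class A)
```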

Advantages of k-NN
1. Simplicity: k-NN is very simple to understand and implement.

2. Non-Parametric: There is no need to assume anything about the data distribution.

3. Versatility: k-NN can be used for both classification and regression problems.

4. Adaptability: The model can naturally adapt to changes in the data as no explicit training is
required.

Disadvantages of k-NN
1. Computational Cost:

As k-NN needs to compute distances for every test point to all training samples, it can
be computationally expensive, especially with large datasets.

The prediction phase is slow, especially if the dataset is large.

2. Memory-Intensive:

Since the algorithm stores the entire training dataset, it requires significant memory.

3. Sensitivity to Irrelevant Features:

If the dataset has irrelevant or noisy features, the performance of k-NN can degrade.

4. Curse of Dimensionality:

As the number of features increases, distances between data points grow and become less discriminative, which can make it harder for the algorithm to identify meaningful neighbors. This issue can be mitigated by dimensionality reduction techniques such as PCA.

5. Choice of k:

The performance of the algorithm heavily depends on selecting an appropriate value for k. Too large or too small k can lead to poor generalization.

Selecting the Optimal Value of k
Cross-validation is typically used to choose the best value for k.

A common approach is to try different values of k and select the one that minimizes the
classification error.

Odd values for k are preferred when dealing with binary classification problems to avoid
ties.
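One common way to implement this search, assuming scikit-learn, is a cross-validated grid search over candidate (odd) values of k; the dataset and candidate list below are only illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scale features first, since k-NN is distance-based
pipe = Pipeline([("scale", StandardScaler()),
                 ("knn", KNeighborsClassifier())])

# Try odd values of k and keep the one with the best cross-validated accuracy
grid = GridSearchCV(pipe, {"knn__n_neighbors": [1, 3, 5, 7, 9, 11]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```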

Practical Considerations
1. Feature Scaling:

Since k-NN uses distances to determine proximity, it's essential to normalize or standardize the features. Otherwise, features with larger ranges will dominate the distance calculation.

2. Curse of Dimensionality:

High-dimensional data can make distance metrics less meaningful. Dimensionality reduction techniques (e.g., PCA, t-SNE) may help improve performance.

3. Handling Missing Values:

k-NN can be sensitive to missing data. Simple approaches like imputation or removing
data points with missing values can be applied.

Example
Consider a dataset with two classes: A (label = 0) and B (label = 1). You are given a new data
point (test point) and need to classify it using k-NN.

Suppose you choose k = 3 .

Calculate the distance between the test point and every point in the training set.

Find the 3 closest points to the test point. If two of them belong to class A and one belongs
to class B , the test point is classified as A based on majority voting.

Applications of k-NN
1. Pattern Recognition: Classifying images based on pixel data.

2. Recommendation Systems: Identifying similar users or items based on previous behavior or attributes.

3. Medical Diagnosis: Classifying patients based on medical data (e.g., disease presence or
absence).

4. Handwriting Recognition: Classifying handwritten characters or digits.

Conclusion
k-NN is a powerful and intuitive algorithm widely used in classification tasks. However, it
comes with challenges such as computational inefficiency and sensitivity to the curse of
dimensionality. By understanding these trade-offs, k-NN can be effectively applied in various
domains with careful pre-processing and parameter tuning.

Random Forests

Introduction
Random Forest is an ensemble learning technique used for both classification and regression
tasks. It combines multiple Decision Trees to improve performance, overcome overfitting, and
provide more robust predictions. The concept behind Random Forests is to create a "forest" of
decision trees where each tree votes for the predicted class, and the majority vote is taken as
the final output.

Type: Ensemble Learning (Bagging technique)

Uses: Classification and Regression

Key Idea: Reduce overfitting and increase accuracy by averaging multiple decision trees.

How Random Forest Works
1. Bootstrap Aggregation (Bagging):

Random Forest uses a technique called bagging, where multiple subsets of the original
training data are created with replacement. This means that some data points may be
selected multiple times, while others may be left out.

Each subset is used to train a separate decision tree.

2. Random Feature Selection:

When building each tree, Random Forest does not consider all features for splitting;
instead, it selects a random subset of features. This randomness helps to reduce
correlation among the trees and makes the model more robust.

3. Building Decision Trees:

Each decision tree is trained independently on its subset of data.

Since each tree uses a random subset of features, it is slightly different from others.

4. Voting (Classification) / Averaging (Regression):

For Classification: Each tree in the Random Forest predicts a class label, and the final
prediction is based on the majority vote across all trees.

For Regression: Each tree predicts a numeric value, and the final output is the average
of all predictions.

Steps Involved in Random Forest Algorithm


1. Input: Training dataset with features and labels, number of trees to build ( N ).

2. For each tree (out of N trees):

Draw a bootstrap sample from the training data.

Select a random subset of features.

Grow a decision tree to the maximum possible depth (without pruning).

3. Output: A collection of N decision trees (the forest).

4. Prediction:

For classification: Take the majority vote from all trees.

For regression: Take the average value predicted by all trees.
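A minimal scikit-learn sketch of this procedure (the synthetic dataset and parameter values are placeholders; the library performs the bootstrap sampling, random feature selection, and majority voting internally):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# N = 100 trees, each grown on a bootstrap sample with a random subset of features
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)

# Prediction: majority vote across the 100 trees
print("Test accuracy:", accuracy_score(y_test, forest.predict(X_test)))
```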

Key Features of Random Forest


1. Ensemble of Decision Trees:

By using multiple trees, Random Forest avoids the overfitting that can happen with
individual decision trees.

2. Bagging and Random Subset of Features:

Bagging helps reduce variance, while random feature selection adds diversity to the
individual trees, preventing them from becoming too correlated.

3. Feature Importance:

Random Forest provides a measure of feature importance by evaluating how much each feature contributes to reducing the impurity in the splits of each tree. This is useful for feature selection.

Advantages of Random Forest


1. Accuracy:

The ensemble approach of Random Forest often provides higher accuracy than a single
decision tree.

2. Robustness:

Random Forests are less prone to overfitting due to the randomness introduced by
bagging and random feature selection.

3. Handles Large Datasets:

Random Forest is suitable for large datasets with higher dimensionality.

4. Feature Importance:

It can rank features based on their importance to the prediction, which is helpful in
understanding the data.

5. No Pruning Required:

Unlike decision trees, Random Forest does not require explicit pruning, as the ensemble
approach balances complexity and overfitting.

Disadvantages of Random Forest


1. Complexity:

Random Forests are much more complex compared to a single decision tree. This
complexity makes them harder to interpret.

2. Computational Cost:

Training a Random Forest can be more computationally expensive and slower, especially when there are many trees or high-dimensional data.

3. Black-Box Model:

While it is possible to determine feature importance, the internal workings of Random Forest are not as easily interpretable as a single decision tree.

Hyperparameters in Random Forest


Number of Trees ( n_estimators ):

The number of decision trees to be built in the forest. Increasing the number of trees
usually leads to higher accuracy but also increases training time.

Number of Features ( max_features ):

The number of features to consider when splitting each node. Options include:

'sqrt' : Use the square root of the total number of features.

'log2' : Use the logarithm base 2 of the number of features.

Tree Depth ( max_depth ):

Limits how deep each tree can grow. A shallow tree will help prevent overfitting.

Minimum Samples for Splitting ( min_samples_split ):

The minimum number of samples required to split a node. This helps control the growth
of the tree and avoid overfitting.

Feature Importance in Random Forest


One of the advantages of Random Forest is that it provides estimates of feature importance,
which is calculated as the average reduction in impurity for each feature across all the trees in
the forest. The higher the importance score, the more relevant the feature is for the
classification task.
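Assuming scikit-learn, these impurity-based importances can be read off a fitted forest as shown below (the dataset is just an example):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)

# feature_importances_ holds the average impurity reduction per feature across all trees
order = np.argsort(forest.feature_importances_)[::-1]
for i in order[:5]:  # five most important features
    print(f"{data.feature_names[i]}: {forest.feature_importances_[i]:.3f}")
```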

Example of Random Forest in Action


Suppose you have a dataset of patient information, and you want to predict whether a patient
has a particular disease. You would use Random Forest as follows:

1. Training Phase:

Collect a bootstrap sample of the training data.

Randomly select a subset of features at each node.

Train multiple decision trees (e.g., n_estimators=100 ) on different samples.

2. Prediction Phase:

Each of the 100 trees makes a prediction (Yes or No).

The Random Forest takes the majority vote as the final prediction.

Applications of Random Forest


1. Finance: Predicting loan defaults or identifying fraudulent transactions.

2. Healthcare: Disease prediction or identifying risk factors for patients.

3. Marketing: Customer segmentation, predicting customer churn.

4. Image Classification: Identifying objects in an image by using an ensemble approach to categorize the pixels.

Conclusion
Random Forest is a powerful and widely used machine learning algorithm for both classification
and regression tasks. It combines the strengths of multiple decision trees to reduce overfitting
and increase accuracy. Despite being more complex and computationally expensive compared
to individual decision trees, Random Forest's robustness, versatility, and feature importance
evaluation make it a popular choice for many practical applications.

Fuzzy Set Approaches

Introduction to Fuzzy Sets


Fuzzy Set Theory is an extension of classical set theory that allows partial membership of
elements in a set. Unlike classical sets, where an element either belongs or does not belong to
a set, fuzzy sets allow for degrees of membership ranging from 0 to 1. This flexibility is
particularly useful in dealing with real-world data, which often involves uncertainty and
imprecision.

Classical Set: In a classical set, an element can either fully belong (membership value = 1)
or not belong at all (membership value = 0).

Fuzzy Set: In a fuzzy set, each element has a degree of membership ranging between 0
and 1, representing the grade of membership.

Fuzzy sets were introduced by Lotfi A. Zadeh in 1965 to handle uncertainties and to model
problems that have vagueness or imprecision.

Characteristics of Fuzzy Sets


1. Membership Function (MF):

A fuzzy set is characterized by a membership function, denoted as μ_A(x), which maps elements of a given domain to membership values between 0 and 1.

The membership function defines how each element in the domain is mapped to its
corresponding degree of belonging to a fuzzy set.

2. Fuzzy Membership:

The membership value can be interpreted as the degree of truth or the degree to
which an element belongs to a particular set.

For example, a fuzzy set of "tall people" may assign a membership of 0.7 to a person
who is 180 cm tall and 0.2 to a person who is 160 cm tall.

3. Fuzzy vs. Crisp Sets:

Crisp Set: Elements have binary membership (either in or out).

Fuzzy Set: Elements have a graded membership, allowing partial inclusion.

Membership Functions
Membership functions can have different shapes, depending on the characteristics of the
fuzzy set:

1. Triangular Membership Function:

Defined by a triangular shape with parameters a , b , and c .

Formula:

μ_A(x) = max(min((x - a) / (b - a), (c - x) / (c - b)), 0)

Useful when the boundary of a set is not sharply defined.

2. Trapezoidal Membership Function:

Defined by four parameters a , b , c , and d that form a trapezoid.

It has two flat regions, indicating higher certainty over a range of values.

3. Gaussian Membership Function:

Defined by parameters c (mean) and σ (standard deviation).

Provides smooth boundaries, commonly used in situations involving gradual transitions.
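The triangular and Gaussian shapes above translate directly into code; a small NumPy sketch of the given formulas (the parameter values are illustrative):

```python
import numpy as np

def triangular_mf(x, a, b, c):
    # mu_A(x) = max(min((x - a)/(b - a), (c - x)/(c - b)), 0)
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def gaussian_mf(x, c, sigma):
    # Smooth membership centred at c with spread sigma
    return np.exp(-((x - c) ** 2) / (2 * sigma ** 2))

temps = np.array([20.0, 30.0, 40.0])
print(triangular_mf(temps, a=25, b=40, c=55))  # degrees of membership in "hot"
print(gaussian_mf(temps, c=40, sigma=8))
```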

Operations on Fuzzy Sets


Fuzzy sets support a variety of operations similar to classical sets but are adapted for degrees
of membership:

1. Union (OR Operation):

The union of two fuzzy sets A and B is given by the maximum of the membership
values.

Formula:

μ_(A ∪ B)(x) = max(μ_A(x), μ_B(x))

2. Intersection (AND Operation):

The intersection of two fuzzy sets A and B is given by the minimum of the
membership values.

Formula:

μ_(A ∩ B)(x) = min(μ_A(x), μ_B(x))

3. Complement (NOT Operation):

The complement of a fuzzy set A is given by subtracting the membership value from 1.

Formula:

μ_(¬A)(x) = 1 - μ_A(x)
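If the membership degrees of two fuzzy sets over the same domain are stored as arrays, these operations are one-liners (a sketch with made-up values):

```python
import numpy as np

# Membership degrees of fuzzy sets A and B over the same five domain points
mu_A = np.array([0.1, 0.4, 0.8, 1.0, 0.6])
mu_B = np.array([0.3, 0.2, 0.9, 0.5, 0.7])

union        = np.maximum(mu_A, mu_B)   # mu_(A ∪ B)(x) = max(mu_A(x), mu_B(x))
intersection = np.minimum(mu_A, mu_B)   # mu_(A ∩ B)(x) = min(mu_A(x), mu_B(x))
complement_A = 1.0 - mu_A               # mu_(¬A)(x) = 1 - mu_A(x)

print(union, intersection, complement_A, sep="\n")
```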

Fuzzy Set vs. Probability


Fuzzy Set Theory: Deals with degrees of membership of elements in a set. It is used to
model vagueness and imprecision in data.

Probability Theory: Deals with the likelihood of an event occurring. It is used to model
randomness and uncertainty.

For example, the statement "it is cloudy" can be modeled using fuzzy sets by assigning a
membership value to describe the "degree of cloudiness." On the other hand, probability
theory could be used to predict the chance of rain given the cloudiness.

Applications of Fuzzy Sets


1. Control Systems:

Fuzzy set theory is widely used in fuzzy logic controllers. It is particularly useful in
systems where human-like reasoning is required. For example, air conditioners,
washing machines, and other appliances use fuzzy logic to adjust settings like
temperature, washing time, etc.

2. Decision Making:

Fuzzy sets are used in multi-criteria decision-making where options are evaluated
based on multiple attributes with varying degrees of importance.

3. Pattern Recognition:

Fuzzy sets are applied to classification problems where the boundaries between
classes are not sharply defined (e.g., determining the membership of an image to
different object categories).

4. Medical Diagnosis:

Fuzzy sets help in medical diagnosis by allowing partial membership in diagnostic categories. For instance, a patient can partially belong to a category of having a particular disease based on symptoms with varying intensities.

Fuzzy Inference Systems (FIS)


Fuzzy Inference Systems are used to map inputs to outputs using fuzzy logic. It involves three
main steps:

1. Fuzzification:

Convert crisp inputs into fuzzy values using membership functions.

2. Rule Evaluation:

Apply fuzzy rules using a series of if-then conditions. These rules help in describing
the relationship between input and output variables in a human-understandable way.

3. Defuzzification:

Convert the fuzzy output back into a crisp value. Popular defuzzification methods
include centroid and max-membership methods.
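A minimal sketch of the three stages for a single rule ("if temperature is hot then fan speed is high"), using a triangular membership function and centroid defuzzification; every set and number here is illustrative:

```python
import numpy as np

def tri(x, a, b, c):
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

# 1. Fuzzification: how "hot" is the current temperature?
temperature = 34.0
hot_degree = tri(temperature, 25, 40, 55)        # firing strength of the rule

# 2. Rule evaluation: clip the output set "high fan speed" at the firing strength
speed = np.linspace(0, 100, 501)                 # fan speed domain (%)
high_speed = tri(speed, 50, 100, 150)            # membership of "high speed"
clipped = np.minimum(high_speed, hot_degree)     # min (Mamdani-style) implication

# 3. Defuzzification: centroid of the clipped output set
crisp_speed = np.sum(speed * clipped) / np.sum(clipped)
print(round(crisp_speed, 1))
```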

Example of Fuzzy Set
Suppose we want to create a fuzzy set to represent the concept of "hot temperature."

1. Fuzzy Set Definition:

The fuzzy set "Hot Temperature" could be represented with a membership function
that assigns a degree of membership to temperatures. For example:

20°C might have a membership value of 0 (not hot).

30°C might have a membership value of 0.6.

40°C might have a membership value of 1 (fully hot).

2. Rule-Based System:

A fuzzy rule might be: "If the temperature is hot, turn on the fan at high speed."

The fuzzy inference system would calculate the degree to which the current
temperature is "hot" and adjust the fan accordingly.

Conclusion
Fuzzy Set Theory provides a powerful framework for modeling uncertainty, vagueness, and
imprecision, making it ideal for applications that involve subjective, ambiguous, or linguistic
information. By allowing partial membership, fuzzy sets represent real-world concepts more
effectively compared to traditional binary sets.

Support Vector Machine (SVM)

Introduction
Support Vector Machine (SVM) is a powerful supervised learning algorithm used for
classification and regression tasks. It is especially effective in high-dimensional spaces and
for problems where the data is not linearly separable. The main idea of SVM is to find a
decision boundary (or hyperplane) that maximizes the margin between two classes of data
points.

Type: Supervised Learning (Classification and Regression)

Goal: Maximize the margin between the decision boundary and the closest data points
from either class.

Key Concepts in SVM


1. Hyperplane:

A hyperplane is a decision boundary that separates the data into different classes. For
a 2D space, it is simply a line, while in a 3D space, it is a plane. For higher dimensions,
it is called a hyperplane.

The goal of SVM is to find the optimal hyperplane that maximizes the separation
between classes.

2. Margin:

The margin is the distance between the hyperplane and the closest data points from
either class. SVM aims to maximize this margin.

The data points that are closest to the hyperplane are called support vectors, and they
are critical for defining the position of the hyperplane.

3. Support Vectors:

Support vectors are the data points that lie closest to the decision boundary. These
points influence the position and orientation of the hyperplane.

The hyperplane is uniquely defined by these support vectors, hence the name Support
Vector Machine.

How SVM Works


1. Linearly Separable Data:

In the simplest case, SVM finds a straight line (or hyperplane) that can completely
separate data points of two different classes.

If the data is linearly separable, SVM constructs the optimal hyperplane that maximizes
the margin.

2. Non-linearly Separable Data:

Often, real-world data is not linearly separable. To handle this, SVM uses a kernel trick
to transform the data into a higher-dimensional space, where it becomes linearly
separable.

Kernel functions help create the separation by projecting the data to a new dimension.

Kernel Trick
The kernel trick is used to transform non-linearly separable data into a higher dimension
where a hyperplane can separate the classes. This allows SVM to create complex decision
boundaries without explicitly computing the transformation. Common kernel functions include:

1. Linear Kernel:

Suitable when data is linearly separable. In this case, the SVM simply finds a straight
line to separate the data.

2. Polynomial Kernel:

Suitable when the relationship between the features is more complex and requires
curved decision boundaries.

Polynomial kernel of degree d can be used to transform data to a higher degree.

3. Radial Basis Function (RBF) / Gaussian Kernel:

The RBF kernel is the most widely used kernel in SVM.

It projects the data into an infinite-dimensional space, allowing it to handle complex relationships and curved boundaries.

4. Sigmoid Kernel:

This kernel behaves like a neural network activation function and can be used in
specific applications.

Mathematical Representation
The goal of SVM is to find the hyperplane that maximizes the margin between the classes.

Suppose you have data points (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where y_i is the class label (+1 or -1).
The hyperplane can be represented as:

w * x + b = 0

Where w is the weight vector and b is the bias.

The objective of SVM is to find w and b that maximize the margin, subject to the constraint
that:

y_i * (w * x_i + b) >= 1 for all i

This ensures that each data point is correctly classified and lies on the correct side of the
margin.
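For small linearly separable data, the learned w and b can be inspected directly with scikit-learn's SVC (a sketch; the toy points and the large C are only for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes labelled -1 and +1
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e3).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("support vectors:\n", clf.support_vectors_)
print("margin width:", 2 / np.linalg.norm(w))  # decision rule: sign(w · x + b)
```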

Soft Margin and Regularization


In many real-world scenarios, perfect separation of data may not be possible due to noise or
overlapping classes. SVM introduces a soft margin to allow some misclassification, which can
be controlled by a regularization parameter ( C ).

Regularization Parameter (C):

C is a hyperparameter that controls the trade-off between maximizing the margin and

minimizing classification errors.

High C: The model attempts to classify all training data points correctly, which may lead
to overfitting.

Low C: The model allows some misclassification to achieve a larger margin, which
helps in generalization.

Advantages of SVM
1. Effective in High Dimensions:

SVM is effective when the number of features is large compared to the number of
samples.

2. Works Well with Non-linear Data:

With the kernel trick, SVM can efficiently handle complex and non-linear data.

3. Memory Efficiency:

SVM only uses the support vectors for classification, making it memory-efficient as it
does not need to store the entire dataset.

4. Versatility:

Can be used for both classification and regression (referred to as Support Vector
Regression or SVR).

Disadvantages of SVM
1. Computational Complexity:

Training SVMs can be computationally intensive, especially for large datasets with
many features. Complexity grows quadratically with the number of samples.

2. Difficult to Choose Appropriate Kernel:

Choosing the right kernel and hyperparameters requires experimentation and experience. Incorrect kernel selection can lead to poor model performance.

3. Not Suitable for Large Datasets:

SVM can be inefficient when the number of samples is very large due to high training
times.

4. Interpretability:

The results of SVM are often harder to interpret compared to other algorithms like
Decision Trees.

Applications of SVM
1. Text Classification:

SVM is used for spam detection and sentiment analysis due to its ability to handle high-
dimensional data (like word counts in text).

2. Face Detection:

SVM is used to distinguish between faces and non-faces in images.

3. Bioinformatics:

SVMs are used for classifying genes, identifying proteins, and analyzing medical data.

4. Handwriting Recognition:

SVM is used to classify handwritten digits, often applied in digit recognition systems
like postal code reading.

Example of SVM
Consider a dataset with two classes: +1 and -1 , with features such as height and weight of
individuals.

If the data is linearly separable, SVM will find the line (hyperplane) that divides the two
classes with the maximum margin.

If the data is not linearly separable, an RBF kernel can be used to transform the data into a
higher-dimensional space, allowing SVM to find a non-linear decision boundary.

For example, in a binary classification task of recognizing cancerous vs non-cancerous cells, SVM would determine the best hyperplane that divides the feature space into two regions, each corresponding to one of the two classes.

Types of Support Vector Machine (SVM) Kernels


In Support Vector Machines, kernels are used to transform data into a higher-dimensional
space to enable finding an optimal decision boundary even when the original data is not
linearly separable. Kernels allow SVM to be flexible and effective at handling both linear and
non-linear problems. Here are the most commonly used types of SVM kernels:

1. Linear Kernel
Definition: The linear kernel is the simplest kernel type used in SVM. It is primarily used
when the data is linearly separable, meaning it can be separated by a straight line (or
hyperplane).

Mathematical Representation: The linear kernel function can be represented as:

K(x, y) = x · y

where x and y are vectors of features and the dot ( · ) represents the dot product between these vectors.

Use Cases:

It is effective when the number of features is significantly greater than the number of
training samples.

Commonly used for text classification problems like sentiment analysis or document
classification, where the features are word counts or term frequency vectors.

Advantages:

Speed: Linear SVM is computationally less expensive compared to non-linear kernels.

Interpretability: Linear models are easy to understand, making the decision-making process more transparent.

Limitation:

It cannot handle non-linear relationships in the data.

2. Polynomial Kernel
Definition: The polynomial kernel is used to handle non-linear relationships in the data by
allowing more complex decision boundaries. It introduces polynomial terms, effectively
enabling SVM to create curved separation boundaries.

Mathematical Representation: The polynomial kernel can be represented as:

K(x, y) = (x · y + c)^d

where:

x and y are vectors of features.

c is a constant that controls the flexibility of the decision boundary.

d is the degree of the polynomial, which determines the complexity of the decision surface.

Use Cases:

Suitable when the data has complex non-linear relationships between features.

Works well for applications where the relationships between input features and output
are better captured by polynomials, such as image recognition tasks.

Advantages:

Flexible Decision Boundaries: With an appropriate choice of the polynomial degree, SVM can model very intricate patterns.

Limitations:

Computational Complexity: It can be computationally expensive for large datasets or high-degree polynomials, making the training process slower.

Overfitting: High-degree polynomials can lead to overfitting, especially for small datasets, as the model can become too complex.

3. Gaussian Kernel (Radial Basis Function - RBF Kernel)


Definition: The Gaussian Kernel, also known as the Radial Basis Function (RBF) kernel, is
one of the most popular non-linear kernels. It is used to map input data points into an
infinite-dimensional feature space, which allows the SVM to find a linear separation in this
transformed space.

Mathematical Representation: The Gaussian kernel can be represented as:

K(x, y) = exp(-γ ||x - y||^2)

where:

||x - y||^2 represents the squared Euclidean distance between x and y.

γ (gamma) is a parameter that defines the influence of a single training example.

High Gamma: The decision boundary is highly influenced by individual data points,
which can lead to overfitting.

Low Gamma: The model captures broader trends, which might lead to underfitting.

Use Cases:

It is suitable when the relationship between classes is highly non-linear, such as in image classification, biometric identification, and medical diagnostics.

Advantages:

Flexibility: The RBF kernel is very flexible and can model a wide range of decision
surfaces by controlling the γ parameter.

Effective for Complex Boundaries: It is effective when the decision boundary is curved
or when the data is not linearly separable.

Limitations:

Parameter Tuning: The value of γ needs careful tuning, typically done using cross-
validation.

Computationally Intensive: The Gaussian kernel can be more computationally expensive than the linear kernel, especially for large datasets.

Choosing the Right Kernel


The choice of kernel depends on the nature of the problem and the data:

Linear Kernel: Choose this if the data is linearly separable or if interpretability and
speed are priorities.

Polynomial Kernel: Use when you suspect non-linear relationships but want to model
these relationships with polynomial decision boundaries.

Gaussian/RBF Kernel: Choose this if the data is complex and not linearly separable,
and you want the SVM to create a sophisticated decision boundary.
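A quick way to compare the three kernels on a given dataset, assuming scikit-learn, is cross-validated accuracy (a sketch; in practice C, degree, and gamma would also be tuned):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

for kernel in ["linear", "poly", "rbf"]:
    model = make_pipeline(StandardScaler(),
                          SVC(kernel=kernel, degree=3, gamma="scale", C=1.0))
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{kernel:6s} kernel: mean CV accuracy = {scores.mean():.3f}")
```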

Summary of Kernels
1. Linear Kernel:

Equation: K(x, y) = x · y

Characteristics: Suitable for linearly separable data.

Advantages: Simple, easy to interpret, computationally efficient.

2. Polynomial Kernel:

Equation: K(x, y) = (x · y + c)^d

Characteristics: Creates polynomial decision boundaries.

Advantages: Effective for moderately complex relationships.

Parameters: Degree d , and constant c .

3. Gaussian Kernel (RBF):

Equation: K(x, y) = exp(-γ ||x - y||^2)

Characteristics: Highly flexible and widely used.

Advantages: Suitable for non-linear and complex data.

Parameter: Gamma ( γ ), which controls the influence of points.

Examples of Kernel Applications


Linear Kernel:

Used for text categorization (e.g., spam detection) where the input features are often
sparse and high-dimensional.

Polynomial Kernel:

Applied in image classification when relationships between features are quadratic or higher-order.

Gaussian Kernel:

Used in medical diagnosis to distinguish between different classes of diseases where the relationship between input features is not straightforward.

Suitable for handwriting recognition where characters have high variability.

Conclusion
The different SVM kernels make it possible to adapt SVM for a wide range of problems, from
simple linear separations to very complex non-linear relationships. By transforming data into
different feature spaces using these kernels, SVM is able to effectively handle many practical
machine learning tasks. The selection of an appropriate kernel, along with its corresponding
parameters, plays a crucial role in the performance and success of the SVM model.

Hyperplane, Properties of SVM, and Issues in SVM

1. Hyperplane (Decision Surface)

A hyperplane is a decision surface that separates the data into different classes in a Support
Vector Machine (SVM). The hyperplane is essentially the boundary that best divides the
dataset into classes.

Definition:

A hyperplane is a linear decision boundary that divides the input space into two halves,
representing the different classes.

In two-dimensional space, it is simply a line, whereas in three-dimensional space it is a plane. In higher dimensions, it is referred to as a hyperplane.

Equation of a Hyperplane:

w · x + b = 0

Where:

w is the weight vector that determines the orientation of the hyperplane.

x represents the input features.

b is the bias term that helps adjust the position of the hyperplane.

Goal of SVM:

The main goal of SVM is to find the optimal hyperplane that maximizes the margin
between the two classes. The margin is defined as the distance between the
hyperplane and the closest data points from each class, which are called support
vectors.

The larger the margin, the better the generalization ability of the classifier.

Support Vectors:

Support vectors are the data points that lie closest to the hyperplane. These points are
critical as they determine the exact position and orientation of the hyperplane.

Only support vectors influence the decision boundary, making SVM computationally
efficient in terms of memory, as it does not need to store all data points.

Linear vs. Non-Linear Hyperplanes:

For linearly separable data, the hyperplane is a straight line.

For non-linearly separable data, a hyperplane might not be able to divide the classes
in the original feature space. In such cases, the kernel trick is used to transform the
data into a higher-dimensional space, where a hyperplane can effectively separate the
classes.

2. Properties of SVM
Support Vector Machines have several properties that make them effective for both
classification and regression tasks:

1. Margin Maximization:

SVM aims to maximize the margin between the hyperplane and the support vectors.
Maximizing the margin enhances the generalization of the classifier, making it robust to
noise and overfitting.

2. Support Vector Influence:

The decision boundary is determined entirely by the support vectors. This allows SVM
to be memory-efficient because it only needs to keep track of these critical data points
rather than all training samples.

3. Effective in High Dimensions:

SVM is particularly effective in high-dimensional spaces where the number of features is much greater than the number of training samples. This is because the complexity of SVM does not directly depend on the dimensionality of the feature space.

4. Kernel Trick:

The kernel trick allows SVM to create non-linear decision boundaries by mapping input
features into a higher-dimensional space without explicitly computing the
transformation. This makes SVM versatile in handling non-linearly separable data.

5. Regularization:

SVM uses a regularization parameter (C) to control the trade-off between achieving a
larger margin and minimizing classification error. A larger value of C places more
emphasis on classifying all training points correctly, potentially leading to overfitting. A
smaller value allows for more misclassification but ensures a wider margin.

6. Robustness to Outliers:

Although SVM can handle outliers with the soft margin approach, it is still sensitive to
noisy data. However, the impact of noisy points is minimized if they are not selected as
support vectors.

3. Issues in SVM
Despite its advantages, SVM has some issues and challenges that users should be aware of:

1. Computational Complexity:

SVM is computationally intensive, especially for large datasets, because it involves solving a quadratic optimization problem. The training time increases significantly as the number of training samples grows.

For very large datasets, SVM may become infeasible in terms of memory and
computation.

2. Choice of Kernel and Parameters:

Choosing the right kernel function and tuning the associated hyperparameters (such
as C and γ for RBF kernel) is crucial for good performance.

The selection of an appropriate kernel is often not straightforward, and an incorrect choice can lead to poor model performance. Extensive cross-validation is often needed to select the optimal combination.

3. Interpretability:

Unlike simple linear models, SVM models are less interpretable, especially when using
non-linear kernels. The decision boundary formed by the hyperplane in a higher-
dimensional space is difficult to visualize and explain.

When feature interpretability is needed, SVM may not be the best choice compared to
models like Decision Trees.

4. Sensitivity to Noisy Data:

SVM can be sensitive to outliers, especially when using a small value for the
regularization parameter C . The presence of noisy data points can significantly affect
the decision boundary if they lie close to the hyperplane.

5. Memory Usage:

In scenarios where a large number of support vectors are required, SVM can consume
a significant amount of memory. This can be problematic for very large datasets where
the number of support vectors is a large percentage of the total training set.

6. Class Imbalance:

SVM may perform poorly with imbalanced datasets where one class has significantly
more samples than the other. This is because the decision boundary is influenced by
the number of samples, and a small class may not have enough representation to shape
an appropriate boundary.

Techniques like SMOTE (Synthetic Minority Over-sampling Technique) or using different weights for different classes can help mitigate this issue.

7. Not Well-Suited for Probability Estimation:

SVM does not inherently provide probabilistic outputs. Extensions such as Platt
Scaling are used to estimate probabilities, but these tend to add complexity and are not
as reliable as models specifically designed for probabilistic interpretation.

Summary

Hyperplane: The decision surface that separates data into different classes. The goal of
SVM is to find the optimal hyperplane that maximizes the margin.

Properties of SVM: SVM is effective in high dimensions, uses support vectors to define the
decision boundary, applies kernel tricks for non-linear data, and maximizes the margin to
achieve good generalization.

Issues in SVM:

Computational Complexity: Training can be computationally intensive for large datasets.

Kernel Selection: Choosing the right kernel and tuning hyperparameters can be
challenging.

Interpretability: The decision boundary, especially in non-linear cases, is difficult to interpret.

Sensitivity to Noise and Outliers: SVM may be affected by outliers or noisy data points.

Memory Usage: SVM can be memory-intensive, especially if many support vectors are
needed.

Class Imbalance: SVM might struggle with imbalanced datasets, requiring additional
techniques to adjust.

SVM is a powerful algorithm that works well on a wide range of classification and regression
tasks, but it requires careful tuning and can be computationally expensive. Understanding
these properties and issues helps in deciding when SVM is the right choice for a given
problem.

Decision Trees: Decision Tree Learning Algorithm

Introduction
A Decision Tree is a popular supervised learning algorithm used for classification and
regression tasks. It is a tree-like structure where each internal node represents a decision
based on a feature, each branch represents the outcome of the decision, and each leaf node
represents a final output or class label. Decision trees are easy to understand and interpret,
making them very useful for a wide range of applications.

Type: Supervised Learning (Classification and Regression)

Goal: Create a model that predicts the value of a target variable based on input features.

Structure of a Decision Tree

1. Root Node: The topmost node in a decision tree, representing the entire dataset. The root
node is split based on the feature that provides the highest information gain.

2. Internal Nodes: Nodes that represent decisions on features. Each internal node splits into
two or more branches based on feature values.

3. Leaf Nodes (Terminal Nodes): Nodes that represent the final output value or class label.
Each leaf node contains a prediction that applies to the data reaching that point.

4. Branches: These are the connections between nodes, representing the outcome of
decisions.

Decision Tree Learning Algorithm


The goal of decision tree learning is to split the data recursively until a suitable decision rule is
formed for prediction. Here’s the basic process of constructing a decision tree:

1. Select the Best Feature to Split:

The algorithm starts by selecting the best feature that splits the data into subsets. The
quality of a split is evaluated using metrics such as:

Information Gain (IG)

Gini Impurity

Variance Reduction (for regression)

2. Split the Dataset:

The data is split into subsets based on the selected feature. This process continues
recursively for each subset.

3. Stopping Criteria:

The process stops when one of the following conditions is met:

All data points in a node belong to the same class.

No more features are left for splitting.

A pre-defined stopping condition, such as maximum tree depth or minimum samples per leaf, is reached.

Common Splitting Criteria


1. Information Gain (Entropy-based Splitting)

Entropy measures the impurity or uncertainty in the dataset.

The formula for entropy is:

Entropy(S) = -∑ (p_i * log2(p_i))

where p_i is the proportion of instances belonging to class i in dataset S.

Information Gain (IG) measures the reduction in entropy by splitting the dataset on a
particular feature.

A feature with the highest Information Gain is selected to split the data.

2. Gini Impurity (Gini Index)

Gini Impurity is another metric to evaluate the quality of a split. It measures how often a
randomly chosen element would be incorrectly labeled if it was randomly classified
based on the distribution of class labels.

The formula for Gini Impurity is:

Gini(S) = 1 - ∑ (p_i^2)

where p_i is the proportion of instances belonging to class i in dataset S.

A feature that provides the lowest Gini Impurity is selected for the split.

3. Variance Reduction (for Regression Trees)

When using decision trees for regression, Variance Reduction is used to measure the
effectiveness of a split.

The feature that results in the greatest reduction in variance is selected for splitting the
data.
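Both impurity measures are easy to compute directly from the class proportions; a small NumPy sketch of the two formulas above (the 9/5 label split matches the golf example used later in these notes):

```python
import numpy as np

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i))
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini(S) = 1 - sum(p_i^2)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

labels = np.array(["Yes"] * 9 + ["No"] * 5)
print(round(entropy(labels), 3))  # ~0.940
print(round(gini(labels), 3))     # ~0.459
```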

Algorithm Steps of Decision Tree Learning


1. Input: Training dataset with features and labels.

2. Start at the Root Node:

Calculate the splitting criteria (e.g., Information Gain, Gini Impurity) for all features.

Choose the feature with the highest information gain (or lowest Gini Impurity).

3. Split the Data:

Split the data into subsets based on the chosen feature.

Each subset forms a branch from the node.

4. Repeat:

Apply the same process to each subset (node) until one of the stopping criteria is met
(e.g., pure nodes, max depth reached).

5. Output: A decision tree where each leaf node provides the final classification or prediction.

Hyperparameters for Decision Trees


1. Maximum Depth ( max_depth ):

Controls the maximum number of levels in the tree. Limiting tree depth helps prevent
overfitting by reducing model complexity.

2. Minimum Samples Split ( min_samples_split ):

The minimum number of samples required to split an internal node. Higher values
prevent overfitting by reducing splits.

3. Minimum Samples Leaf ( min_samples_leaf ):

The minimum number of samples that must be present in a leaf node. This helps ensure
that leaves do not end up with only a few samples.

4. Maximum Features ( max_features ):

Controls the number of features to consider when looking for the best split. Helps in
controlling overfitting.

5. Criterion ( criterion ):

The function to measure the quality of a split. Common criteria are gini for the Gini
Impurity and entropy for Information Gain.
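These names map directly onto scikit-learn's DecisionTreeClassifier; a sketch with illustrative values (the dataset is just an example):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(
    criterion="entropy",      # or "gini"
    max_depth=3,              # limit depth to reduce overfitting
    min_samples_split=4,
    min_samples_leaf=2,
    max_features=None,
    random_state=0,
).fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))      # human-readable view of the learned splits
```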

Advantages of Decision Trees


1. Easy to Understand and Interpret:

Decision trees are intuitive and easily visualized, making it easier for non-experts to
understand the model.

2. No Requirement for Data Normalization:

Decision trees do not require feature scaling or normalization.

3. Handle Non-linear Relationships:

Decision trees can naturally handle complex, non-linear relationships between features.

4. Feature Importance:

Decision trees provide a measure of feature importance which can be used to understand which features are most influential.

Disadvantages of Decision Trees
1. Overfitting:

Decision trees are prone to overfitting, especially when the tree depth is not limited,
and the model becomes too complex. Pruning methods and limiting tree depth help
mitigate this issue.

2. Instability:

Decision trees are sensitive to small changes in the data. Small variations in the data
can lead to a completely different tree structure.

3. Bias Towards Dominant Classes:

Decision trees can be biased towards classes with more samples, especially when the
data is imbalanced.

4. Greedy Approach:

Decision trees use a greedy algorithm for splitting, which may not always lead to the
global optimal solution.

Regularization Techniques in Decision Trees


1. Pruning:

Pruning reduces the size of the tree by removing branches that have little importance.
It helps in simplifying the model and improving generalization.

Pre-pruning: Limit tree growth by setting constraints (e.g., max depth, minimum
samples per leaf).

Post-pruning: The tree is first grown fully, then pruned back based on error rates on
validation data.

2. Setting Minimum Split/Leaf Nodes:

By setting min_samples_split and min_samples_leaf , you control the size of nodes, reducing
the chances of forming too specific (overfitted) splits.
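In scikit-learn, one available form of post-pruning is cost-complexity pruning via the ccp_alpha parameter; a minimal sketch (the alpha value is illustrative and would normally be chosen using validation data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)                    # grown fully
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)  # post-pruned

print("Full tree:  ", full.get_n_leaves(), "leaves, test acc =", round(full.score(X_test, y_test), 3))
print("Pruned tree:", pruned.get_n_leaves(), "leaves, test acc =", round(pruned.score(X_test, y_test), 3))
```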

Applications of Decision Trees


1. Medical Diagnosis:

Decision trees are used in medical diagnostics to classify patients based on symptoms
and determine possible diseases.

2. Credit Scoring:

Banks use decision trees to assess whether a loan applicant is likely to default, based
on various financial parameters.

3. Customer Churn:

Decision trees help in predicting if a customer is likely to leave a service, enabling companies to take preventative actions.

4. Fraud Detection:

Identify potentially fraudulent transactions by classifying transactions based on transaction amount, frequency, and other factors.

Conclusion
Decision Trees are intuitive, easy to interpret, and capable of handling both numerical and
categorical data, making them useful for many practical machine learning tasks. However, they
tend to overfit on training data if not properly regularized, which is why pruning techniques and
limiting hyperparameters are crucial to improve their generalization capability. Despite their
disadvantages, decision trees are often used as the base learners for more advanced
ensemble models like Random Forest and Gradient Boosting Machines.

https://www.xoriant.com/blog/decision-trees-for-classification-a-machine-learning-algorithm
Introduction

Decision Trees are a type of Supervised Machine Learning (that is, you explain what the input is and what the corresponding output is in the training data) where the data is continuously split according to a certain parameter. The tree can be explained by two entities, namely decision nodes and leaves. The leaves are the decisions or the final outcomes, and the decision nodes are where the data is split.

As an example, suppose you want to predict whether a person is fit given information like age, eating habits, and physical activity. The decision nodes here are questions like 'What is the age?', 'Does he exercise?', 'Does he eat a lot of pizzas?', and the leaves are outcomes like 'fit' or 'unfit'. In this case, this is a binary classification problem (a yes/no type problem). There are two main
types of Decision Trees:

1. Classification trees (Yes/No types)

What we’ve seen above is an example of classification tree, where the outcome was a variable
like ‘fit’ or ‘unfit’. Here the decision variable is Categorical.

2. Regression trees (Continuous data types)

Here the decision or the outcome variable is Continuous, e.g. a number like 123.

Working

Now that we know what a Decision Tree is, we'll see how it works internally. There are many algorithms that construct Decision Trees, but one of the best known is the ID3 Algorithm. ID3 stands for Iterative Dichotomiser 3. Before discussing the ID3 algorithm, we'll go through a few definitions.

Entropy:

Entropy, also called Shannon entropy and denoted H(S) for a finite set S, is a measure of the amount of uncertainty or randomness in data. Intuitively, it tells us about the predictability of a certain event. For example, consider a coin toss whose probability of heads is 0.5 and probability of tails is 0.5. Here the entropy is the highest possible, since there is no way of determining what the outcome might be. Alternatively, consider a coin which has heads on both sides: the outcome of such an event can be predicted perfectly since we know beforehand that it will always be heads. In other words, this event has no randomness, hence its entropy is zero. In general, lower values imply less uncertainty while higher values imply more uncertainty.

Information Gain:

Information gain, also called Kullback-Leibler divergence and denoted IG(S, A) for a set S, is the effective change in entropy after deciding on a particular attribute A. It measures the relative change in entropy with respect to the independent variables:

IG(S, A) = H(S) - ∑_x P(x) * H(S_x)

where IG(S, A) is the information gain from applying feature A, H(S) is the entropy of the entire set, and the second term is the weighted entropy after splitting on feature A, with P(x) the probability (proportion) of each value x of A.

Let’s understand this with the help of an example. Consider a piece of data collected over the
course of 14 days where the features are Outlook, Temperature, Humidity, Wind and the
outcome variable is whether Golf was played on the day. Now, our job is to build a predictive
model which takes in above 4 parameters and predicts whether Golf will be played on the day.
We’ll build a decision tree to do that using ID3 algorithm.

Day Outlook Temperature Humidity Wind Play Golf

D1 Sunny Hot High Weak No

D2 Sunny Hot High Strong No

D3 Overcast Hot High Weak Yes

D4 Rain Mild High Weak Yes

D5 Rain Cool Normal Weak Yes

D6 Rain Cool Normal Strong No

D7 Overcast Cool Normal Strong Yes

D8 Sunny Mild High Weak No

D9 Sunny Cool Normal Weak Yes

D10 Rain Mild Normal Weak Yes

D11 Sunny Mild Normal Strong Yes

D12 Overcast Mild High Strong Yes

D13 Overcast Hot Normal Weak Yes

D14 Rain Mild High Strong No

ID3 Algorithm will perform following tasks recursively

1. Create root node for the tree

2. If all examples are positive, return leaf node ‘positive’

3. Else if all examples are negative, return leaf node ‘negative’

4. Calculate the entropy of current state H(S)

5. For each attribute, calculate the entropy with respect to the attribute ‘x’ denoted by H(S, x)

6. Select the attribute which has maximum value of IG(S, x)

7. Remove the attribute that offers highest IG from the set of attributes

8. Repeat until we run out of all attributes, or the decision tree has all leaf nodes.

Now, let's go ahead and grow the decision tree. The initial step is to calculate H(S), the Entropy
of the current state. In the above example, we can see in total there are 5 No’s and 9 Yes’s.

Yes No Total
9 5 14

Remember that the Entropy is 0 if all members belong to the same class, and 1 when half of them belong to one class and the other half to the other class (perfect randomness). Here H(S) = -(9/14)*log2(9/14) - (5/14)*log2(5/14) ≈ 0.94, which means the distribution is fairly random. Now, the next step is to choose the attribute that gives us the highest possible Information Gain; that attribute becomes the root node. Let's start with 'Wind':

For an attribute A, IG(S, A) = H(S) - ∑_x P(x) · H(S_x), where x ranges over the possible values
of the attribute. Here the attribute 'Wind' takes two possible values in the sample data, hence
x = {Weak, Strong}. Amongst all 14 examples we have 8 where the wind is Weak and 6 where the
wind is Strong.

Wind = Weak Wind = Strong Total

8 6 14

Now, out of the 8 Weak examples, 6 were 'Yes' for Play Golf and 2 were 'No', so

H(S_weak) = -(6/8) log2(6/8) - (2/8) log2(2/8) ≈ 0.811

Similarly, out of the 6 Strong examples, 3 had outcome 'Yes' and 3 had 'No'. Here half the items
belong to one class and half to the other, so we have perfect randomness and H(S_strong) = 1.
Now we have all the pieces required to calculate the Information Gain for 'Wind':

IG(S, Wind) = H(S) - (8/14)·H(S_weak) - (6/14)·H(S_strong) ≈ 0.940 - 0.463 - 0.429 ≈ 0.048

We must now similarly calculate the Information Gain for all the features. IG(S, Outlook) has the
highest information gain, 0.246, hence we choose the Outlook attribute as the root node. At this
point, the decision tree consists of Outlook at the root with one branch for each of its values:
Sunny, Overcast, and Rain.
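These numbers are easy to verify in code. The following is a minimal Python sketch (not part of the original notes) that recomputes H(S) and the information gain of every attribute directly from the 14-day table:

```python
from collections import Counter
from math import log2

# The 14-day Play Golf dataset from the table above:
# (Outlook, Temperature, Humidity, Wind, PlayGolf)
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),     ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRIBUTES = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

def entropy(rows):
    """Shannon entropy H(S) of the Play Golf labels in the given rows."""
    counts = Counter(row[-1] for row in rows)
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def information_gain(rows, attribute):
    """IG(S, A): H(S) minus the weighted entropy of the subsets induced by attribute A."""
    col = ATTRIBUTES[attribute]
    n = len(rows)
    remainder = 0.0
    for value in set(row[col] for row in rows):
        subset = [row for row in rows if row[col] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(rows) - remainder

print(round(entropy(data), 2))                           # 0.94
for attr in ATTRIBUTES:
    print(attr, round(information_gain(data, attr), 3))  # Outlook has the highest gain
```

Running this prints a gain of roughly 0.048 for Wind and roughly 0.25 for Outlook, matching the hand calculation above.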

Here we observe that whenever the outlook is Overcast, Play Golf is always 'Yes'. This is no
coincidence: the simple subtree results precisely because Outlook is the attribute with the
highest information gain. How do we proceed from this point? We simply apply recursion,
following the algorithm steps described earlier. Now that

we've used Outlook, we have three attributes remaining: Humidity, Temperature, and Wind.
Outlook had three possible values: Sunny, Overcast, and Rain. The Overcast branch already
ended in the leaf node 'Yes', so we're left with two subtrees to compute: Sunny and Rain. The
subset of the table where the value of Outlook is Sunny looks like:

Temperature Humidity Wind Play Golf

Hot High Weak No

Hot High Strong No

Mild High Weak No

Cool Normal Weak Yes

Mild Normal Strong Yes

In a similar fashion, we compute the information gains for the Sunny subset and see that the
highest Information Gain is given by Humidity. Proceeding in the same way with the Rain subset
gives Wind as the attribute with the highest information gain. The final decision tree therefore
splits on Outlook at the root, on Humidity under the Sunny branch, and on Wind under the Rain
branch, with the Overcast branch ending in the leaf 'Yes'.

Decision Tree Learning: ID3 Algorithm, Inductive Bias, Entropy, Information Theory, Information Gain, and Issues in Decision Tree Learning

1. ID3 Algorithm (Iterative Dichotomiser 3)


The ID3 (Iterative Dichotomiser 3) algorithm is one of the earliest and most well-known
decision tree learning algorithms developed by Ross Quinlan. ID3 is used to create a decision
tree based on a given dataset.

Working Principle of ID3 Algorithm


Goal: The goal of ID3 is to construct a decision tree that can classify a set of training
examples into given classes based on the features.

Criterion: The ID3 algorithm uses Information Gain based on Entropy as the splitting
criterion to determine the best feature at each node of the tree.

Steps in ID3 Algorithm


1. Start at the Root Node:

Begin with the full training dataset at the root.

2. Calculate Entropy and Information Gain for all features.

3. Choose the Best Feature:

The feature with the highest Information Gain is selected to split the data.

4. Split Data:

Create branches for each possible value of the feature, and assign subsets of the
training data to these branches.

5. Repeat Recursively:

Continue splitting each subset based on the next feature with the highest information
gain until all data points are classified or other stopping criteria are met (e.g., all
samples belong to the same class).

6. Create Leaf Nodes:

When all data points are classified, the nodes are turned into leaf nodes, which
represent the final decision.
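To make the recursion concrete, here is a compact, self-contained Python sketch of ID3 for categorical data (an illustration, not a production implementation). Each row is represented as a dictionary mapping attribute names to values, plus one key holding the class label:

```python
from collections import Counter
from math import log2

def entropy(rows, label):
    """Shannon entropy of the class labels in the given rows (list of dicts)."""
    counts = Counter(r[label] for r in rows)
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def information_gain(rows, attribute, label):
    """IG(S, A): entropy of S minus the weighted entropy of the subsets induced by A."""
    n = len(rows)
    remainder = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [r for r in rows if r[attribute] == value]
        remainder += (len(subset) / n) * entropy(subset, label)
    return entropy(rows, label) - remainder

def id3(rows, attributes, label):
    labels = [r[label] for r in rows]
    if len(set(labels)) == 1:                     # pure node -> leaf
        return labels[0]
    if not attributes:                            # no attributes left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    # Choose the best feature by information gain, then split and recurse.
    best = max(attributes, key=lambda a: information_gain(rows, a, label))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in set(r[best] for r in rows):      # one branch per attribute value
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = id3(subset, remaining, label)
    return tree
```

Calling `id3(rows, ["Outlook", "Temperature", "Humidity", "Wind"], "PlayGolf")` on the Play Golf table reproduces the tree derived earlier: Outlook at the root, Humidity under Sunny, and Wind under Rain.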

Features of ID3
Attribute Selection: ID3 uses Information Gain to determine the most informative attribute
at each level.

Works Well for Categorical Data: ID3 is suited for classification problems, particularly with
categorical data.



Disadvantages of ID3
Prone to Overfitting: ID3 tends to overfit the data, especially if the tree grows too deep.

Only for Categorical Attributes: The ID3 algorithm works primarily with categorical
features and requires discretization for numerical data.

Greedy Approach: The selection of attributes is done using a greedy approach, which
does not guarantee the global optimum solution.

2. Inductive Bias
Inductive Bias refers to the set of assumptions a machine learning algorithm makes to
generalize from the training data to unseen data.

Inductive Bias in Decision Trees:


The Inductive Bias of a decision tree is:

Shorter Trees Are Preferred: Decision trees aim to create the shortest possible tree
that fits the data.

Preference for Features with High Information Gain: The ID3 algorithm selects
features based on their information gain, assuming that features with higher
information gain lead to better classification results.

Importance of Inductive Bias:

Inductive bias helps decision trees generalize well to unseen data by preventing them from
creating unnecessarily complex models that overfit the training data.

3. Entropy and Information Theory


Entropy is a fundamental concept in Information Theory used to measure the uncertainty or
impurity in a dataset. It quantifies the amount of randomness or disorder in the data and helps
determine how well a dataset can be split.

Entropy in Decision Trees:


Definition: Entropy is used to measure the impurity or uncertainty of a dataset. The higher
the entropy, the more mixed the data in terms of its class labels.

Mathematical Formula:

Entropy(S) = - ∑ (p_i * log2(p_i))

Where:

S is the dataset.

p_i is the proportion of instances in class i.

The sum runs over all the classes.

Interpretation:
Low Entropy: When entropy is low (close to 0), the dataset is pure, meaning that most of
the instances belong to the same class.

High Entropy: When entropy is high (close to 1), the dataset is more mixed, with an almost
equal distribution of classes.

4. Information Gain
Information Gain is a measure used to evaluate the effectiveness of an attribute in classifying
the dataset. It is calculated as the reduction in entropy after splitting the dataset based on a
particular feature.

How Information Gain is Calculated:


1. Calculate Entropy of the Entire Dataset ( Entropy(S) ).

2. Split Dataset Based on Feature:

Partition the dataset using a particular feature.

3. Calculate Entropy for Each Subset:

Calculate the entropy of each subset after the split.

4. Calculate Information Gain:

The information gain is given by:

Information Gain = Entropy(S) - ∑ (Weighted Entropy of Subsets)

The feature with the highest information gain is selected as the split attribute for the
current node.

Significance:
Higher Information Gain indicates that the feature is more informative and results in a
better split of the dataset.

The goal is to maximize information gain at every node, which helps in making the decision
tree more efficient in classification.

5. Issues in Decision Tree Learning


Although decision trees are powerful, they come with several challenges:



1. Overfitting:

Decision trees tend to overfit the training data, especially when they grow too deep
and create too many branches. This results in poor generalization to unseen data.

Solution: Use pruning techniques, limit tree depth, or set constraints on the number of
samples per leaf node.

2. Bias Towards Features with Many Values:

When selecting attributes, decision trees may favor features with many distinct values
(e.g., ID number). This could lead to splits that do not actually provide meaningful
information.

Solution: Use different metrics like Gain Ratio that penalize features with a large
number of values.

3. Imbalanced Datasets:

Decision trees can be biased towards the majority class in imbalanced datasets.

Solution: Use sampling techniques (over-sampling or under-sampling) or adjust the class weights.

4. High Variance:

Decision trees are susceptible to high variance; small changes in data can lead to a
significantly different tree.

Solution: Use ensemble methods like Random Forests to reduce variance by averaging multiple trees.

5. Greedy Nature:

Decision trees use a greedy algorithm for attribute selection, which may not lead to the
optimal solution.

The local optimum choice may lead to sub-optimal overall performance.

6. Difficulty in Handling Numeric Features:

Decision trees can handle numeric features, but they require special handling to
determine optimal split points.

Solution: Pre-process numerical data to create appropriate thresholds for splitting.

7. Data Fragmentation:

As the tree grows deeper, the dataset gets split into smaller fragments, leading to
insufficient data at some nodes. This is known as the fragmentation problem and can
result in unreliable splits.

Solution: Limit the depth of the tree or prune nodes with insufficient data.



Summary
ID3 Algorithm: Builds decision trees by selecting attributes based on Information Gain. It is
used for classification tasks, but it can overfit and is limited to categorical features.

Inductive Bias: Represents the set of assumptions made by the decision tree to generalize
from the training data to unseen data. This includes a preference for smaller trees and
attributes with high information gain.

Entropy and Information Theory: Entropy measures the uncertainty or impurity in the data,
while Information Gain measures the reduction in entropy after splitting the dataset based
on a particular attribute.

Information Gain: Used as the criterion in ID3 to determine the best feature for splitting,
with higher information gain representing a better split.

Issues in Decision Tree Learning: Challenges include overfitting, high variance, bias
towards features with many values, difficulty in handling imbalanced data, and greedy
nature of the algorithm.

Understanding these concepts provides the foundation for effectively utilizing decision trees,
addressing their limitations, and applying more advanced algorithms such as Random Forest
and Gradient Boosting Machines for improved performance.

Bayesian Learning
Bayesian learning is a probabilistic framework for machine learning that leverages Bayes'
theorem to update the probability of a hypothesis based on observed evidence or data.
Bayesian learning methods are valuable for dealing with uncertainty and making predictions
that incorporate prior knowledge.

1. Bayes' Theorem
Bayes' theorem forms the backbone of Bayesian learning by providing a principled way to
revise probabilities in light of new data. It connects the prior probability of a hypothesis, the
likelihood of observing data given that hypothesis, and the posterior probability—which is the
updated belief after seeing the data.

Bayes' Theorem Formula

P(H|E) = P(E|H) · P(H) / P(E)

Where:

P(H) : Prior Probability of hypothesis H before considering the evidence. It is our initial
belief about H .



P(E) : Marginal Probability of evidence E . It represents the overall likelihood of the data,
regardless of the hypothesis.

P(E|H): Likelihood of the evidence E given hypothesis H . It measures how probable the
observed data is under the hypothesis.

P(H|E): Posterior Probability of hypothesis H after observing evidence E. This represents the revised belief about H after seeing the data.

Illustrative Example Using Bayes' Theorem


Suppose we want to determine the probability that a person has a specific disease given a
positive test result.

H: The event that the person has the disease.

E: The event that the test result is positive.

P(H): The prior probability of the disease.

P(E|H): The probability of getting a positive result given that the person has the disease
(i.e., the sensitivity of the test).

P(E): The probability of a positive result occurring overall (which depends on the
prevalence of the disease and the accuracy of the test).

P(H|E): The updated probability that the person has the disease after observing the
positive test result.

By combining these values using Bayes’ theorem, we can get a more accurate estimate of the
likelihood that the person actually has the disease.
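For a concrete illustration with made-up numbers (these values are assumptions, not from the original notes): suppose the disease affects 1% of the population (P(H) = 0.01), the test detects it 99% of the time (P(E|H) = 0.99), and it also returns a false positive for 5% of healthy people (P(E|not H) = 0.05). Then

P(E) = P(E|H)·P(H) + P(E|not H)·P(not H) = 0.99 × 0.01 + 0.05 × 0.99 = 0.0594

P(H|E) = P(E|H)·P(H) / P(E) = 0.0099 / 0.0594 ≈ 0.167

So even after a positive result from a fairly accurate test, the probability of actually having the disease is only about 17%, because the disease is rare; this is exactly the prior-versus-evidence trade-off that Bayes' theorem captures.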

2. Concept Learning in Bayesian Framework


Concept Learning involves inferring a general rule or concept from observed examples. In
Bayesian learning, this is achieved by calculating the probability of each hypothesis given the
training data.



The hypothesis that has the highest posterior probability after evaluating all the data is chosen
as the most probable hypothesis.

3. Bayes Optimal Classifier


The Bayes Optimal Classifier is a probabilistic approach that aims to make predictions by
considering all possible hypotheses from the hypothesis space (H). Instead of choosing
just one hypothesis, it makes predictions by calculating a weighted average over all
hypotheses, which makes it a theoretically optimal decision maker.

Bayes Optimal Prediction


The Bayes Optimal Prediction is made by combining all the possible hypotheses, weighted by
their posterior probabilities. It predicts the class that has the highest expected probability over
all hypotheses.



Benefits of Bayes Optimal Classifier
Optimal Decision Making: The Bayes Optimal Classifier provides the best possible
prediction with the least error by taking into account all hypotheses.

Handles Uncertainty: It naturally handles uncertainty by incorporating the posterior probability for each hypothesis.

Limitations
Computational Complexity: Evaluating every hypothesis in the hypothesis space can be
computationally challenging, especially for large hypothesis spaces.

Prior Knowledge Requirement: Requires prior knowledge and an initial probability distribution over all hypotheses, which can be difficult to estimate.

Example of Bayesian Learning and Bayes Optimal Classifier


Imagine you are trying to classify whether an email is spam or not spam:



Hypothesis Space (H): Different possible models that explain the classification of emails as spam or not spam.

Prior Probability (P(H)): Prior belief about how likely each model is to be correct.

Likelihood (P(D|H)): Probability of observing certain features in the email (e.g., certain keywords) given the hypothesis.

Posterior Probability: Updated belief about the hypothesis after seeing the email.

Bayes Optimal Classifier: Instead of selecting a single model, the classifier considers all
possible models and weighs their predictions by their posterior probabilities.

Summary
Bayes' Theorem: Provides a method for updating the probability of a hypothesis based on
observed evidence.

Concept Learning in Bayesian Framework: Involves using Bayes' theorem to update beliefs about hypotheses as new data becomes available.

Bayes Optimal Classifier: Provides an optimal prediction by combining all possible hypotheses weighted by their posterior probabilities.

Bayesian learning is a powerful framework that naturally integrates uncertainty and prior
knowledge into the learning process, making it useful for a wide range of applications such as
medical diagnosis, spam detection, and risk assessment.

https://mlarchive.com/machine-learning/the-ultimate-guide-to-naive-bayes/

What is Bayes’ theorem?


In statistics and probability theory, the Bayes’ theorem (also known as the Bayes’ rule) is a
mathematical formula used to determine the conditional probability of events.

Essentially, the Bayes’ theorem describes the probability of an event based on prior
knowledge of the conditions that might be relevant to the event.

The theorem is named after the English statistician Thomas Bayes, whose formulation was
published posthumously in 1763. It is considered the foundation of the special statistical
inference approach called Bayesian inference.

Bayes’ theorem formula is: P(A|B) = P(B|A)*P(A)/P(B)

Bayes’ theorem formula



What is Naive Bayes algorithm?
The name Naive Bayes consists of two words: (1) Naive, because it assumes independence
between the traits or features, and (2) Bayes, because it is based on Bayes' theorem.

To use the algorithm: (1) convert the given dataset into frequency tables; (2) create a
likelihood/probability table by finding the probabilities of the features; (3) apply Bayes'
theorem to calculate the posterior probability.

For example, let's solve the following problem: if the weather is sunny, should the player play or not?

Given the following dataset:

The given Dataset



The first step is to convert our data into frequency tables as follows:

Convert The Dataset to Frequency table.

Then to create the likelihood/probability table as follows:

Likelihood/probability table



Now let’s apply the algorithm to our case:

Naive Bayes Algorithm Calculations

P(Yes|Sunny) > P(No|Sunny) ⇒ so on a sunny day, the player can play the game.
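Because the frequency and likelihood tables above appear as images, here is a minimal Python sketch of the same three-step procedure using hypothetical (Outlook, Play) observations; the counts below are assumptions chosen for illustration, not the original table:

```python
from collections import Counter, defaultdict

# Hypothetical (Outlook, Play) observations -- assumed for illustration only.
observations = [
    ("Sunny", "Yes"), ("Sunny", "Yes"), ("Sunny", "Yes"), ("Sunny", "No"), ("Sunny", "No"),
    ("Overcast", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
    ("Rainy", "Yes"), ("Rainy", "Yes"), ("Rainy", "Yes"), ("Rainy", "No"), ("Rainy", "No"),
]

# Step 1: build frequency tables for classes and for (class, outlook) pairs.
class_counts = Counter(play for _, play in observations)
feature_counts = defaultdict(Counter)
for outlook, play in observations:
    feature_counts[play][outlook] += 1

def posterior(outlook, play):
    """P(play | outlook) up to the common normalising constant P(outlook)."""
    total = sum(class_counts.values())
    prior = class_counts[play] / total                                # P(play)
    likelihood = feature_counts[play][outlook] / class_counts[play]   # P(outlook | play)
    return likelihood * prior

# Step 3: compare the (unnormalised) posteriors for a sunny day.
p_yes = posterior("Sunny", "Yes")
p_no = posterior("Sunny", "No")
print("Play" if p_yes > p_no else "Don't play")
```

With these assumed counts, P(Yes|Sunny) exceeds P(No|Sunny), so the sketch prints "Play", mirroring the conclusion above.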

Advantages of using Naive Bayes


It is one of the fastest and easiest ML algorithms for predicting a class of datasets. It works
quickly and can save a lot of time.

It is suitable for solving multicategory forecasting problems.

It is used for both binary and multi-class classifications.

This algorithm performs well in multicategory predictions as compared to some other algorithms.

It is one of the most common choices for text classification problems.

When the assumption of feature independence holds, the algorithm can perform better than
other models, and it also requires much less training data.

It is suitable for classification with discrete features that are categorically distributed.

Disadvantages of using Naive Bayes



It assumes that all predictors (or features) are independent, which limits its applicability in
real-world use cases.

This algorithm has a "zero-frequency problem": it assigns zero probability to a categorical
value that appears in the test dataset but not in the training dataset. A smoothing method
(e.g., Laplace smoothing) can be used to overcome this problem.

Its probability estimates can be poorly calibrated in some cases, so they should not be
taken too literally.

Naive Bayes Types


Categorical: Used with categorical features; it fails when it encounters a category not seen
during training.

Gaussian: Used for continuous features assumed to follow a Gaussian distribution.

Complement: Used for imbalanced data, as it estimates the probability of each sample
belonging to every class other than its own.

Bernoulli: Used when features follow a Bernoulli distribution; it is designed for
binary/boolean features.

Multinomial: Unlike Bernoulli, it works with occurrence counts, not only binary features.

Bayesian Belief Networks and EM Algorithm

1. Bayesian Belief Networks (BBNs)


Bayesian Belief Networks (BBNs), also known as Bayesian Networks or Bayes Nets, are a
type of probabilistic graphical model that represent the probabilistic relationships among a set
of variables. They are powerful tools for modeling uncertainty and reasoning about complex
domains where direct computations may be infeasible.

Structure of a Bayesian Belief Network


1. Directed Acyclic Graph (DAG):

BBNs are represented as a Directed Acyclic Graph (DAG).

Nodes: Each node represents a random variable (e.g., symptoms, test results, or
conditions).

Edges: Directed edges represent dependencies between variables. An edge from node
A to node B means that A has a direct influence on B.

2. Conditional Probability Table (CPT):



Each node has a Conditional Probability Table (CPT) that quantifies the effects of the
parent nodes.

For a variable X with parents P1, P2, ..., Pn , the CPT defines:

P(X | P1, P2, ..., Pn)

If a node has no parents, its CPT defines its prior probability.

Properties of Bayesian Belief Networks


1. Conditional Independence:

BBNs represent dependencies explicitly, which allows conditional independence relationships to be defined easily. This significantly reduces the complexity of calculations.

2. Local and Global Representations:

Local Representation: The relationship of each node with its parents.

Global Representation: The joint probability distribution of all nodes can be determined
using local CPTs.

Joint Probability Distribution


The joint probability distribution of all the variables in a BBN can be calculated as the product
of all the conditional probabilities of nodes given their parents. If X1, X2, ..., Xn are the nodes
in the network:

P(X1, X2, ..., Xn) = ∏_{i=1}^{n} P(Xi | Parents(Xi))

Inference in Bayesian Networks


Inference in Bayesian networks involves calculating the posterior probability of certain nodes
given the observed evidence about other nodes.

1. Exact Inference:

Methods like Variable Elimination or Belief Propagation can be used to derive exact
probabilities in small networks.

2. Approximate Inference:

For larger networks, exact inference becomes computationally expensive. Sampling methods such as Monte Carlo methods or Markov Chain Monte Carlo (MCMC) are used for approximation.



Applications of Bayesian Belief Networks
1. Medical Diagnosis:

BBNs are used in medical diagnosis to model the relationships between symptoms,
test results, and diseases. They allow for reasoning about the likelihood of different
diseases given observed symptoms.

2. Risk Assessment:

Used in risk management to assess the probability of events based on various influencing factors.

3. Decision Support Systems:

Employed in decision-making systems to provide recommendations by analyzing the relationships between different variables.

Example of a Bayesian Network


Consider a network for diagnosing a disease based on symptoms and a test:

Nodes:

D : Disease (Yes/No)

S : Symptom (Present/Absent)

T : Test Result (Positive/Negative)

Edges:

D affects S (whether the disease causes the symptom).

D also affects T (whether the disease leads to a positive test result).

The network structure and conditional probability tables can help calculate the probability of
having the disease given that the symptom is present and the test is positive.
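As a tiny numerical sketch of exact inference in this three-node network — using assumed CPT values, since the text does not give any — the posterior can be computed by enumerating the two values of D:

```python
# Assumed CPTs for the Disease (D) -> Symptom (S), Disease (D) -> Test (T) network.
p_d = 0.01                               # P(D = yes), the prior
p_s_given = {True: 0.90, False: 0.10}    # P(S = present | D)
p_t_given = {True: 0.95, False: 0.05}    # P(T = positive | D)

# Joint probability of the evidence (S = present, T = positive) for each value of D,
# using the factorisation P(D, S, T) = P(D) * P(S | D) * P(T | D).
joint_d_yes = p_d * p_s_given[True] * p_t_given[True]
joint_d_no = (1 - p_d) * p_s_given[False] * p_t_given[False]

# Posterior P(D = yes | S = present, T = positive), obtained by normalising.
posterior = joint_d_yes / (joint_d_yes + joint_d_no)
print(round(posterior, 3))   # ~0.633 with these assumed numbers
```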

2. Expectation-Maximization (EM) Algorithm


The Expectation-Maximization (EM) Algorithm is an iterative optimization technique used for
finding maximum likelihood estimates of parameters in models that involve latent (hidden)
variables. The EM algorithm is especially useful when dealing with incomplete data or models
with missing variables.

Objective of the EM Algorithm


The EM algorithm is used to find the best estimate of the parameters of a statistical model
when some of the data is missing or hidden.



It aims to maximize the likelihood function (or log-likelihood) for the observed data, which
is difficult to do directly due to hidden or incomplete variables.

Steps of the EM Algorithm


The EM algorithm alternates between two steps:

1. Expectation Step (E-Step):

In the E-step, the algorithm calculates the expected value of the log-likelihood
function, considering the current estimate of the parameters.

In other words, it calculates the probability distribution over the possible values of the
hidden variables, using the current parameters to fill in the missing values.

2. Maximization Step (M-Step):

In the M-step, the algorithm maximizes the expected log-likelihood computed in the E-
step with respect to the model parameters.

The M-step updates the parameters to values that maximize the likelihood function
based on the distribution calculated in the E-step.

These two steps are repeated until convergence, meaning that the parameters no longer
change significantly.

Mathematical Representation of EM Algorithm


Given data X and a model with parameters theta, we aim to maximize the log-likelihood log
P(X | theta) .

The EM algorithm iteratively updates the parameters by alternating between the two steps:

1. E-Step: Compute the expected log-likelihood:

Q(θ | θ^(t)) = E[ log P(X, Z | θ) | X, θ^(t) ]

where Z represents the latent variables and θ^(t) represents the current parameter estimates.

2. M-Step: Maximize the expected log-likelihood:

θ^(t+1) = argmax_θ Q(θ | θ^(t))

This iterative process continues until convergence.

Applications of EM Algorithm
1. Gaussian Mixture Models (GMMs):



EM is used to estimate the parameters of Gaussian Mixture Models for clustering. In
this context, EM assigns data points probabilistically to different Gaussian components
and estimates the parameters (mean, variance) of each component.

2. Hidden Markov Models (HMMs):

The EM algorithm is used to train Hidden Markov Models by estimating the transition
and emission probabilities in the presence of hidden states.

3. Image Reconstruction:

In medical imaging or computer vision, the EM algorithm is used to reconstruct missing parts of images or improve resolution based on incomplete data.

4. Missing Data Problems:

EM is applied to datasets with missing entries by iteratively imputing values and optimizing the model parameters.

Example of EM Algorithm for Gaussian Mixture Model (GMM)


Consider a dataset of points that we believe were generated by a mixture of two Gaussian
distributions.

1. E-Step:

Calculate the responsibilities for each data point, which represents the probability that
each point belongs to each Gaussian component, using the current parameter
estimates (mean, variance, and mixing coefficients).

2. M-Step:

Update the parameters of each Gaussian component (mean, variance, and mixing
coefficients) by maximizing the expected complete log-likelihood using the
responsibilities from the E-step.

This process is repeated until the parameters converge, leading to the best-fit Gaussian
components for the data.
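A minimal NumPy sketch of this loop for a one-dimensional, two-component mixture (the synthetic data and starting values are chosen for illustration; this is not a production GMM implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians (the "true" components are unknown to EM).
data = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 0.8, 100)])

# Initial guesses for the parameters of the two components.
means = np.array([-1.0, 1.0])
variances = np.array([1.0, 1.0])
weights = np.array([0.5, 0.5])          # mixing coefficients

def gaussian_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: responsibilities r[i, k] = P(component k | data point i).
    likelihoods = np.stack([w * gaussian_pdf(data, m, v)
                            for w, m, v in zip(weights, means, variances)], axis=1)
    resp = likelihoods / likelihoods.sum(axis=1, keepdims=True)

    # M-step: re-estimate weights, means, and variances from the responsibilities.
    nk = resp.sum(axis=0)
    weights = nk / len(data)
    means = (resp * data[:, None]).sum(axis=0) / nk
    variances = (resp * (data[:, None] - means) ** 2).sum(axis=0) / nk

print(np.round(means, 2), np.round(variances, 2), np.round(weights, 2))
```

After a few dozen iterations the estimated means, variances, and mixing weights should settle near the values used to generate the data, illustrating the convergence behaviour described above.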

Summary
Bayesian Belief Networks are graphical models that represent probabilistic relationships
between variables using a Directed Acyclic Graph and Conditional Probability Tables.
They are highly useful for reasoning under uncertainty in domains like medical diagnosis
and risk assessment.

Expectation-Maximization (EM) Algorithm is an iterative method used to find maximum likelihood estimates in the presence of missing or hidden data. It involves alternating between the E-step (Expectation) and the M-step (Maximization) until convergence, making it valuable for training models like Gaussian Mixture Models and Hidden Markov Models.

Both Bayesian Belief Networks and the EM Algorithm are important tools in machine learning
and statistical modeling, enabling robust reasoning and parameter estimation even when
dealing with complex dependencies and incomplete information.

https://encord.com/blog/what-is-ensemble-learning/

Ensemble Learning
Imagine you are watching a football match. The sports analysts provide you with detailed
statistics and expert opinions, and at the same time you also take into account the opinions of
fellow enthusiasts who may have witnessed previous matches. Combining these viewpoints gives
a more informed and reliable prediction than relying on any single source. Similarly, in ensemble
learning, combining multiple models or algorithms harnesses the power of collective knowledge
to improve prediction accuracy and overcome the limitations of relying solely on one model.
Let us take a deeper dive into what ensemble learning actually is.
Ensemble learning is a machine learning technique that improves the performance of machine
learning models by combining predictions from multiple models. By leveraging the strengths of
diverse algorithms, ensemble methods aim to reduce both bias and variance, resulting in more
reliable predictions. It also increases the model’s robustness to errors and uncertainties,
especially in critical applications like healthcare or finance.
Ensemble learning techniques like bagging, boosting, and stacking enhance performance and
reliability, making them valuable for teams that want to build reliable ML systems.



This section highlights the benefits of ensemble learning for reducing bias and improving
predictive model accuracy, describes techniques to identify and manage uncertainties for more
reliable risk assessments, and provides guidance on applying ensemble learning to predictive
modeling tasks.
Here, we will address the following topics:

Brief overview

Ensemble learning techniques

Benefits of ensemble learning

Challenges and considerations

Applications of ensemble learning

Types of Ensemble Learning


Ensemble learning differs from deep learning; the latter focuses on complex pattern
recognition tasks through hierarchical feature learning. Ensemble techniques, such as bagging,
boosting, stacking, and voting, address different aspects of model training to enhance
prediction accuracy and robustness.

These techniques aim to reduce the bias and variance of individual models and to improve
prediction accuracy by learning from previous errors, ultimately leading to a consensus
prediction that is often more reliable than any single model.



The main challenge is not to obtain highly accurate base models but to obtain base models
that make different kinds of errors. If ensembles are used for classification, high accuracies
can be achieved if different base models misclassify different training examples, even if the
base classifier accuracy is low.

Bagging: Bootstrap Aggregating


Bootstrap aggregation, or bagging, is a technique that improves prediction accuracy by
combining predictions from multiple models. It applies bootstrap sampling to create random
subsets of the data, trains an individual base learner on each subset, and then combines their
predictions: by averaging in regression tasks and by majority vote in classification tasks.

Random forest
The Random Forest algorithm is a prime example of bagging. It creates an ensemble of
decision trees trained on samples of datasets. Ensemble learning effectively handles complex
features and captures nuanced patterns, resulting in more reliable predictions. However, it is
also true that the interpretability of ensemble models may be compromised due to the
combination of multiple decision trees. Ensemble models can provide more accurate
predictions than individual decision trees, but understanding the reasoning behind each
prediction becomes challenging. Bagging helps reduce overfitting by generating multiple
subsets of the training data and training individual decision trees on each subset. It also helps
reduce the impact of outliers or noisy data points by averaging the predictions of multiple
decision trees.
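As a minimal sketch of bagging in practice (assuming scikit-learn is installed and using synthetic data in place of a real dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic binary-classification data stands in for a real dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Each of the 200 trees is trained on a bootstrap sample of the training data;
# the forest's prediction is the majority vote of the trees.
forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X_train, y_train)
print(accuracy_score(y_test, forest.predict(X_test)))
```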



Ensemble Learning: Bagging & Boosting | Towards Data Science

Boosting: Iterative Learning


Boosting is a technique in ensemble learning that converts a collection of weak learners into a
strong one by focusing on the errors of previous iterations. The process involves incrementally
increasing the weight of misclassified data points, so subsequent models focus more on
difficult cases. The final model is created by combining these weak learners and prioritizing
those that perform better.

Gradient boosting
Gradient Boosting (GB) trains each model to minimize the errors of previous models by training
each new model on the remaining errors. This iterative process effectively handles numerical
and categorical data and can outperform other machine learning algorithms, making it
versatile for various applications.

For example, you can apply Gradient Boosting in healthcare to predict disease likelihood
accurately. Iteratively combining weak learners to build a strong learner can improve prediction
accuracy, which could be valuable in providing insights for early intervention and personalized
treatment plans based on demographic and medical factors such as age, gender, family
history, and biomarkers.
One potential challenge of gradient boosting in healthcare is its lack of interpretability. While it
excels at accurately predicting disease likelihood, the complex nature of the algorithm makes it
difficult to understand and interpret the underlying factors driving those predictions.

This can pose challenges for healthcare professionals who must explain the reasoning behind
a particular prediction or treatment recommendation to patients. However, efforts are being
made to develop techniques that enhance the interpretability of GB models in healthcare,
ensuring transparency and trust in their use for decision-making.
Boosting is an ensemble method that re-weights the training data to focus attention on the
examples that previously fitted models got wrong.



Boosting in Machine Learning | Boosting and AdaBoost
In the clinical literature, gradient boosting has been successfully used to predict, among other
things, cardiovascular events, the development of sepsis, delirium, and hospital readmissions
following lumbar laminectomy.

Stacking: Meta-learning
Stacking, or stacked generalization, is a model-ensembling technique that improves predictive
performance by combining predictions from multiple models. It involves training a meta-model
that takes the outputs of the base-level models as inputs and makes the final prediction; the
meta-model can be a linear regression, a neural network, a support vector machine, or any
other algorithm. This technique leverages the collective knowledge of different models to
generate more accurate and robust predictions.
Overfitting occurs when a model becomes too closely fitted
to the training data and performs poorly on new, unseen data. Stacking helps mitigate
overfitting by combining multiple models with different strengths and weaknesses, thereby
reducing the risk of relying too heavily on a single model’s biases or idiosyncrasies.
For example, in financial forecasting, stacking combines models like regression, random forest,
and gradient boosting to improve stock market predictions. This ensemble approach mitigates
the individual biases in the model and allows easy incorporation of new models or the removal
of underperforming ones, enhancing prediction performance over time.
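A minimal scikit-learn sketch of stacking, with a random forest and gradient boosting as base models and logistic regression as the meta-model (synthetic data and illustrative settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Base-level models produce predictions; a logistic-regression meta-model combines them.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
                ("gb", GradientBoostingClassifier(random_state=0))],
    final_estimator=LogisticRegression(),
)
print(cross_val_score(stack, X, y, cv=5).mean())
```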

Voting



Voting is a popular technique used in ensemble learning, where multiple models are combined
to make predictions. Majority voting, or max voting, involves selecting the class label that
receives the majority of votes from the individual models. On the other hand, weighted voting
assigns different weights to each model's prediction and combines them to make a final
decision. Both majority and weighted voting are methods of aggregating predictions from
multiple models through a voting mechanism and strongly influence the final decision.
Examples of algorithms that use voting in ensemble learning include random forests (gradient
boosting, by contrast, is an additive model that combines learners through weighted addition
rather than voting). A random forest uses decision tree models trained on different data
subsets, and a majority vote over the individual predictions determines the final forecast.
For instance, in a random forest applied to credit scoring, each decision tree might decide
whether an individual is a credit risk. The final credit risk classification is based on the majority
vote of all trees in the forest. This process typically improves predictive performance by
harnessing the collective decision-making power of multiple models.
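A minimal scikit-learn sketch of voting (synthetic data; the chosen estimators and weights are illustrative, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# "hard" voting takes the majority class label; "soft" voting averages predicted
# probabilities, and the weights argument implements weighted voting.
voter = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
                ("nb", GaussianNB())],
    voting="soft",
    weights=[1, 2, 1],
)
print(cross_val_score(voter, X, y, cv=5).mean())
```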

The application of either bagging or boosting requires the selection of a base learner
algorithm first. For example, if one chooses a classification tree, then boosting and bagging
would be a pool of trees with a size equal to the user’s preference.

Benefits of Ensemble Learning


Improved Accuracy and Stability
Ensemble methods combine the strengths of individual models by leveraging their diverse
perspectives on the data. Each model may excel in different aspects, such as capturing
different patterns or handling specific types of noise. By combining their predictions through
voting or weighted averaging, ensemble methods can improve overall accuracy by capturing a
more comprehensive understanding of the data. This helps to mitigate the weaknesses and
biases that may be present in any single model. Ensemble learning, which improves model
accuracy in the classification model while lowering mean absolute error in the regression
model, can make a stable model less prone to overfitting. Ensemble methods also have the
advantage of handling large datasets efficiently, making them suitable for big data applications.
Additionally, ensemble methods provide a way to incorporate diverse perspectives and
expertise from multiple models, leading to more robust and reliable predictions.

Robustness
Ensemble learning enhances robustness by considering multiple models' opinions and making
consensus-based predictions. This mitigates the impact of outliers or errors in a single model,
ensuring more accurate results. Combining diverse models reduces the risk of biases or
inaccuracies from individual models, enhancing the overall reliability and performance of the
ensemble learning approach. However, combining multiple models can increase the
computational complexity compared to using a single model. Furthermore, as ensemble models



incorporate different algorithms or variations of the same algorithm, their interpretability may
be somewhat compromised.

Reducing Overfitting
Ensemble learning reduces overfitting by using random data subsets for training each model.
Bagging introduces randomness and diversity, improving generalization performance. Boosting
assigns higher weights to difficult-to-classify instances, focusing on challenging cases and
improving accuracy. Iteratively adjusting weights allows boosting to learn from mistakes and
build models sequentially, resulting in a strong ensemble capable of handling complex data
patterns. Both approaches help improve generalization performance and accuracy in ensemble
learning.

Benefits of using Ensemble Learning on Land Use Data

Challenges and Considerations in Ensemble Learning


Model Selection and Weighting
Key challenges include selecting the right combination of models to include in the ensemble,
determining the optimal weighting of each model's predictions, and managing the computational
resources required to train and evaluate multiple models simultaneously. Additionally, ensemble learning may not
always improve performance if the individual models are too similar or if the training data has a
high degree of noise. The diversity of the models—in terms of algorithms, feature processing,
and data perspectives—is vital to covering a broader spectrum of data patterns. Optimal
weighting of each model's contribution, often based on performance metrics, is crucial to



harnessing their collective predictive power. Therefore, careful consideration and
experimentation are necessary to achieve the desired results with ensemble learning.

Computational Complexity
Ensemble learning, involving multiple algorithms and feature sets, requires more computational
resources than individual models. While parallel processing offers a solution, orchestrating an
ensemble of models across multiple processors can introduce complexity in both
implementation and maintenance. Also, more computation might not always lead to better
performance, especially if the ensemble is not set up correctly or if the models amplify each
other's errors in noisy datasets.

Diversity and Overfitting


Ensemble learning requires diverse models to avoid bias and enhance accuracy. By
incorporating different algorithms, feature sets, and training data, ensemble learning captures a
wider range of patterns, reducing the risk of overfitting and ensuring the ensemble can handle
various scenarios and make accurate predictions in different contexts. Strategies such as
cross-validation help in evaluating the ensemble's consistency and reliability, ensuring the
ensemble is robust against different data scenarios.

Interpretability
Ensemble learning models prioritize accuracy over interpretability, resulting in highly accurate
predictions; the trade-off is that the ensemble model is more challenging to interpret.
Techniques like feature importance analysis and model introspection can help provide insights
into the factors contributing to an ensemble's decisions, reducing the interpretability
challenge, but they may not fully demystify the predictions of complex ensembles.

AdaBoost and XGBoost


AdaBoost and XGBoost are two popular boosting algorithms used in machine learning for
improving the performance of weak classifiers to build a strong ensemble model. Both are
widely used due to their ability to handle various types of datasets effectively and enhance
prediction accuracy.

1. AdaBoost (Adaptive Boosting)

Introduction to AdaBoost
AdaBoost stands for Adaptive Boosting. It is an ensemble method that combines multiple
"weak" learners (often simple decision trees or stumps) into a single "strong" learner in an



iterative way. The basic idea behind boosting is to train multiple classifiers sequentially, each
one focusing more on the errors made by the previous classifiers.

Type: Ensemble Learning - Boosting

Key Idea: Create a strong classifier by combining multiple weak classifiers in sequence.

Weak Learners: Typically uses decision stumps (a decision tree with only one split).

How AdaBoost Works


1. Initialize Weights:

Each training sample is assigned an equal weight initially.

Weights help determine the importance of each sample.

2. Training Weak Classifiers:

Train a weak classifier (e.g., a decision stump) on the training dataset.

Calculate the error rate of the classifier:

Samples that are misclassified by the weak classifier are given higher weights.

Samples that are correctly classified are given lower weights.

3. Update Weights:

Update the weights for the next iteration, such that:

Misclassified samples get higher weights, which makes them more important in the
next round.

Correctly classified samples get reduced weights.

4. Combine Weak Classifiers:

After each iteration, assign a weight to the weak classifier based on its accuracy. This
weight indicates the importance of the classifier in the final ensemble.

Repeat the process to train multiple classifiers and combine their predictions to form
the final, strong classifier.

The final output is a weighted vote from all weak classifiers.
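A minimal scikit-learn sketch of AdaBoost (synthetic data; by default the weak learner is a depth-1 decision stump, and the hyperparameter values here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Each boosting round re-weights the misclassified samples; the final prediction
# is a weighted vote of all the weak classifiers (decision stumps by default).
ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
print(cross_val_score(ada, X, y, cv=5).mean())
```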



Advantages of AdaBoost
1. Simple and Fast:

AdaBoost is easy to implement and relatively fast compared to other boosting algorithms.

2. Versatile:

Works well with various types of weak learners (commonly decision stumps).

3. Emphasizes Difficult Cases:



Focuses on misclassified instances, helping the ensemble to perform well even when
dealing with challenging cases.

Disadvantages of AdaBoost
1. Sensitive to Noisy Data:

If there are outliers or noisy data, AdaBoost can focus too much on these instances,
leading to overfitting.

2. Dependent on Weak Learner Quality:

Performance depends on the quality of weak learners. If weak learners are very poor,
the ensemble won't perform well.

Applications of AdaBoost
Face Recognition: Used in computer vision tasks, such as face detection (e.g., Haar
cascades).

Text Classification: AdaBoost is also employed in classifying text or spam filtering.

2. XGBoost (Extreme Gradient Boosting)

Introduction to XGBoost
XGBoost stands for Extreme Gradient Boosting. It is an optimized implementation of the
Gradient Boosting Machine (GBM), specifically designed for speed and performance. XGBoost
is widely recognized for its high efficiency, scalability, and accuracy.

Type: Ensemble Learning - Boosting

Key Idea: Uses advanced regularization and gradient-based optimization to build strong
models from weak learners.

Weak Learners: Typically uses decision trees with a controlled maximum depth.

How XGBoost Works


1. Gradient Boosting Concept:

The idea of Gradient Boosting is to build new learners to correct the errors made by
previous learners.

Each learner in XGBoost is trained to minimize the residuals (errors) of the previous
learners, by fitting a decision tree to the gradient of the loss function with respect to the
predictions.

2. Gradient Descent Optimization:

Uses gradient descent to minimize a loss function.



The loss function can be chosen depending on the problem (e.g., squared error for
regression, log loss for classification).

3. Tree Pruning and Regularization:

Pruning: XGBoost uses depth-first pruning to grow trees only until no further
improvement can be made.

Regularization: Adds L1 and L2 regularization to the objective function, helping to prevent overfitting and improve generalization.

4. Additive Model Building:

XGBoost constructs an ensemble of decision trees in an additive fashion, adding one tree at a time to correct the residuals of the previous trees.

The trees are constructed iteratively, and the final prediction is obtained by summing
the output from all the trees.

5. Shrinkage and Learning Rate:

A learning rate parameter helps to control the contribution of each new tree added to
the model, allowing the model to make smaller, more controlled updates.
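A minimal sketch using the xgboost Python package (assuming it is installed; the hyperparameter values and synthetic data are illustrative):

```python
# Assumes the xgboost package is installed (pip install xgboost).
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Shrinkage (learning_rate), tree-depth control, and L1/L2 penalties (reg_alpha / reg_lambda)
# correspond to the regularization and additive-model ideas described above.
model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1,
                      reg_alpha=0.0, reg_lambda=1.0)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```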

Advantages of XGBoost
1. High Performance:

XGBoost is highly efficient and fast, making it suitable for large datasets and real-time
applications.

2. Regularization:



The inclusion of L1 and L2 regularization helps in controlling model complexity and prevents overfitting.

3. Handling Missing Data:

XGBoost has a built-in mechanism to handle missing values effectively, making it a robust option for real-world datasets.

4. Flexibility:

XGBoost allows various customized objective functions and evaluation metrics.

Disadvantages of XGBoost
1. Complexity:

XGBoost has many hyperparameters that need to be fine-tuned for optimal performance, which can make it more complex to use.

2. Computationally Intensive:

Although it is highly optimized, training XGBoost can be computationally intensive compared to simpler algorithms, particularly when dealing with very large datasets.

Applications of XGBoost
1. Kaggle Competitions:

XGBoost is a favorite among data scientists in Kaggle competitions due to its high
accuracy and performance.

2. Credit Scoring:

Widely used in the finance industry for predicting credit risk and assessing loan
eligibility.

3. Customer Behavior Analysis:

Used for predicting customer churn, recommendation systems, and marketing.

Summary: AdaBoost vs. XGBoost


Feature | AdaBoost | XGBoost
Type | Sequential boosting | Gradient boosting
Base Learner | Weak learners (usually decision stumps) | Decision trees (with depth control)
Regularization | No explicit regularization | L1 and L2 regularization
Speed | Fast and simple | Highly optimized, but computationally intensive
Handling Outliers | Sensitive to outliers | Regularization helps mitigate overfitting
Applications | Spam detection, face detection | Kaggle competitions, credit scoring, customer churn

AdaBoost is simpler and often used with decision stumps to create sequential weak
learners, emphasizing the misclassified data.

XGBoost uses gradient boosting to iteratively improve the model using decision trees, with
an emphasis on regularization and gradient optimization to reduce overfitting and increase
generalization performance.

Both algorithms are highly effective for classification and regression tasks, and their choice
depends on the problem, computational resources, and data characteristics.

Classification Metrics in Machine Learning


Classification is about predicting class labels given input data, and classification metrics
measure how well a model does this. In binary classification there are only two possible output
classes (i.e., a dichotomy); in multiclass classification, more than two classes can be present.
I'll focus only on binary classification here.
A very common example of binary classification is spam detection, where the input data could
include the email text and metadata (sender, sending time), and the output label is
either “spam” or “not spam.” (See Figure) Sometimes, people use some other names also for
the two classes: “positive” and “negative,” or “class 1” and “class 0.”
Email spam detection is a binary classification problem (source: From Book — Evaluating
Machine Learning Model — O’Reilly)

There are many ways of measuring classification performance. Accuracy, the confusion matrix,
log-loss, and AUC-ROC are some of the most popular metrics, and precision and recall are also
widely used for classification problems.

The Limitations of Accuracy



Accuracy simply measures how often the classifier correctly predicts. We can define accuracy
as the ratio of the number of correct predictions and the total number of predictions.

When a model gives an accuracy rate of 99%, you might think that it is performing very well,
but this is not always true and can be misleading in some situations. I am going to explain
this with the help of an example.

Example of Limitation of Accuracy


Consider a binary classification problem, where a model can achieve only two results, either
model gives a correct or incorrect prediction. Now imagine we have a classification task to
predict if an image is a dog or cat as shown in the image. In a supervised learning algorithm,
we first fit/train a model on training data, then test the model on testing data. Once we have
the model’s predictions from the X_test data, we compare them to the true y_values (the
correct labels).



We feed the image of the dog into the trained model. If the model predicts that this is a dog, we
compare the prediction to the correct label and it is correct; if the model predicts that this
image is a cat, we again compare it to the correct label and it is incorrect.

We repeat this process for all images in the X_test data. Eventually, we'll have a count of
correct and incorrect matches. But in reality, it is very rare that all correct or incorrect
matches hold equal value, therefore one metric won't tell the entire story.
Accuracy is useful when the target classes are well balanced, but it is not a good choice for
unbalanced classes. Imagine a scenario where we had 99 images of dogs and only 1 image of a
cat in our data: a model that always predicts "dog" would achieve 99% accuracy. In reality, data
is often imbalanced, for example in spam email detection, credit card fraud, and medical
diagnosis. Hence, if we want a better and fuller evaluation of the model, other metrics such as
recall and precision should also be considered.

What is Confusion Matrix?


Confusion Matrix is a performance measurement for the machine learning classification
problems where the output can be two or more classes. It is a table with combinations of
predicted and actual values.

A confusion matrix is defined as the table that is often used to describe the performance of a
classification model on a set of test data for which the true values are known.



It is extremely useful for computing Recall, Precision, Accuracy, and the AUC-ROC curve.
Let's try to understand TP, FP, FN, and TN with the help of a pregnancy analogy.



True Positive: We predicted positive and it’s true. In the image, we predicted that a woman
is pregnant and she actually is.

True Negative: We predicted negative and it’s true. In the image, we predicted that a man is
not pregnant and he actually is not.

False Positive (Type 1 Error): We predicted positive and it’s false. In the image, we
predicted that a man is pregnant but he actually is not.

False Negative (Type 2 Error): We predicted negative and it’s false. In the image, we
predicted that a woman is not pregnant but she actually is.

We discussed Accuracy, now let’s discuss some other metrics of the confusion matrix!

Precision
It explains how many of the correctly predicted cases actually turned out to be positive.
Precision is useful in the cases where False Positive is a higher concern than False Negatives.
The importance of Precision is in music or video recommendation systems, e-commerce
websites, etc. where wrong results could lead to customer churn and this could be harmful to
the business.

Precision for a label is defined as the number of true positives divided by the number of predicted positives.

Recall (Sensitivity)
It explains how many of the actual positive cases we were able to predict correctly with our
model. Recall is a useful metric in cases where False Negative is of higher concern than False
Positive. It is important in medical cases where it doesn’t matter whether we raise a false
alarm but the actual positive cases should not go undetected!

Recall for a label is defined as the number of true positives divided by the
total number of actual positives.




F1 Score
It gives a combined idea about Precision and Recall metrics. It is maximum when Precision is
equal to Recall.

F1 Score is the harmonic mean of precision and recall.

The F1 score punishes extreme values more. F1 Score could be an effective evaluation metric in
the following cases:

When FP and FN are equally costly.

Adding more data doesn’t effectively change the outcome

True Negative is high

AUC-ROC
The Receiver Operator Characteristic (ROC) is a probability curve that plots the TPR(True
Positive Rate) against the FPR(False Positive Rate) at various threshold values and separates
the ‘signal’ from the ‘noise’.
The Area Under the Curve (AUC) measures the ability of a classifier to distinguish between
classes; it is simply the area between the ROC curve and the X-axis.
The greater the AUC, the better the model performs at separating the positive and negative
classes across different threshold points. When AUC is equal to 1, the classifier can perfectly
distinguish between all positive and negative class points. When AUC is equal to 0, the
classifier predicts all negatives as positives and vice versa. When AUC is 0.5, the classifier
is unable to distinguish between the positive and negative classes.



Working of AUC
In a ROC curve, the X-axis shows the False Positive Rate (FPR) and the Y-axis shows the True
Positive Rate (TPR). A higher X-axis value indicates a larger proportion of false positives among
the actual negatives, while a higher Y-axis value indicates a larger proportion of true positives
among the actual positives. So the choice of threshold depends on the desired balance between
FP and FN.


Log Loss
Log loss (Logistic loss) or Cross-Entropy Loss is one of the major metrics to assess the
performance of a classification problem.

For a single sample with true label y ∈ {0, 1} and a probability estimate p = Pr(y = 1), the log loss is:

Log Loss = -( y · log(p) + (1 - y) · log(1 - p) )



Conclusion
Understanding how well a machine learning model will perform on unseen data is the main
purpose behind working with these evaluation metrics. Metrics like accuracy are a good way to
evaluate classification models on balanced datasets, but if the data is imbalanced then metrics
such as precision, recall, and ROC/AUC give a better picture of model performance.
ROC curve isn’t just a single number but it’s a whole curve that provides nuanced details about
the behavior of the classifier. It is also hard to quickly compare many ROC curves to each
other.

Classification Model Evaluation and Selection: Real Example


Let's use a simple example of evaluating and selecting a classification model to predict
whether a customer will subscribe to a term deposit after a marketing campaign by a bank. The
dataset features include customer age, job type, marital status, balance, and the outcome of
previous marketing campaigns.
We'll consider three models: Logistic Regression, Decision Tree, and Support Vector Machine
(SVM).

Step 1: Data Preparation


First, we split the data into a training set (80%) and a testing set (20%). This allows us to train
the models on one part of the data and test them on a completely independent set, ensuring
that our evaluations are unbiased.

Step 2: Model Training


Each model is trained on the training set:

Logistic Regression: Good for binary classification with a linear decision boundary.

Decision Tree: Offers a more flexible approach by segmenting the space into smaller
regions.

SVM: Works well for higher-dimensional spaces or when there is a clear margin of
separation.

Step 3: Model Evaluation


We test each model on the testing set and calculate key metrics from the confusion matrix for
each model.

Example Confusion Matrix Output


Assume after testing on the testing set, the results are as follows for each model:



Logistic Regression:

True Positives (TP): 78

True Negatives (TN): 85

False Positives (FP): 15

False Negatives (FN): 22

Decision Tree:

TP: 81

TN: 80

FP: 20

FN: 19

SVM:

TP: 75

TN: 90

FP: 10

FN: 25
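
Step 4: Metric Calculation

The metrics referenced in the next step follow directly from the confusion-matrix counts above. The minimal sketch below computes them (values rounded to three decimals); only the counts listed above are used, so no additional assumptions are involved:

```python
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# (TP, TN, FP, FN) counts from the confusion matrices above
models = {
    "Logistic Regression": (78, 85, 15, 22),
    "Decision Tree": (81, 80, 20, 19),
    "SVM": (75, 90, 10, 25),
}

for name, counts in models.items():
    acc, prec, rec, f1 = metrics(*counts)
    print(f"{name}: accuracy={acc:.3f}, precision={prec:.3f}, recall={rec:.3f}, F1={f1:.3f}")

# Logistic Regression: accuracy=0.815, precision=0.839, recall=0.780, F1=0.808
# Decision Tree:       accuracy=0.805, precision=0.802, recall=0.810, F1=0.806
# SVM:                 accuracy=0.825, precision=0.882, recall=0.750, F1=0.811
```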

Step 5: Model Selection
From the metrics:

SVM offers the highest Accuracy and Precision.

Decision Tree offers good Recall but has lower precision.

Logistic Regression offers a balance with a good F1-Score and decent performance across
all metrics.

Conclusion
For this task:

If prioritizing correctly identifying as many potential subscribers as possible (maximizing TP), Decision Tree might be preferred due to its higher recall.

If prioritizing the correctness of the positive classifications (ensuring that those predicted
as likely to subscribe are very likely to do so), SVM might be best due to its high precision.

For a balance between precision and recall, Logistic Regression presents a viable option, with an F1-Score close to the best and the most even trade-off between the two metrics.

Choosing the right model depends on the business goals and the cost of false positives vs.
false negatives. For instance, if missing out on potential subscribers is more costly than
mistakenly identifying non-subscribers as potential subscribers, a model with higher recall may
be preferable.

1. Sensitivity (Recall or True Positive Rate)


Definition: Measures the proportion of actual positives that are correctly identified by the
model.

Formula: Sensitivity = True Positives (TP) / (True Positives (TP) + False Negatives (FN))

Use: Indicates how well the model can detect positive cases (e.g., detecting actual
churners in a dataset).

2. Specificity (True Negative Rate)


Definition: Measures the proportion of actual negatives that are correctly identified by the
model.

Formula: Specificity = True Negatives (TN) / (True Negatives (TN) + False Positives (FP))



Use: Indicates how well the model can detect negative cases (e.g., identifying non-
churners correctly).

3. Positive Predictive Value (PPV or Precision)


Definition: Measures the proportion of predicted positives that are actually positive.

Formula: PPV (Precision) = True Positives (TP) / (True Positives (TP) + False Positives (FP))

Use: Indicates how reliable a positive prediction is (e.g., if the model says someone will
churn, what is the chance they actually will?).

4. Negative Predictive Value (NPV)


Definition: Measures the proportion of predicted negatives that are actually negative.

Formula: NPV = True Negatives (TN) / (True Negatives (TN) + False Negatives (FN))

Use: Indicates how reliable a negative prediction is (e.g., if the model says someone will not
churn, what is the chance they really won't?).

Summary
Sensitivity: How good is the model at finding actual positives?

Specificity: How good is the model at finding actual negatives?

PPV (Precision): When the model predicts positive, how often is it correct?

NPV: When the model predicts negative, how often is it correct?

These metrics help you understand the trade-offs in your model's performance and can be
chosen based on what is most important in your specific problem (e.g., minimizing false
positives vs. maximizing detection of positives).
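
A minimal sketch that computes all four quantities from a single confusion matrix; the counts below are purely illustrative:

```python
def diagnostic_metrics(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn)  # recall / true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    ppv = tp / (tp + fp)          # positive predictive value (precision)
    npv = tn / (tn + fn)          # negative predictive value
    return sensitivity, specificity, ppv, npv

# Hypothetical counts from a churn model
sens, spec, ppv, npv = diagnostic_metrics(tp=80, tn=150, fp=30, fn=20)
print(f"Sensitivity={sens:.2f}, Specificity={spec:.2f}, PPV={ppv:.2f}, NPV={npv:.2f}")
# Sensitivity=0.80, Specificity=0.83, PPV=0.73, NPV=0.88
```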

Gains chart
https://community.spotfire.com/articles/spotfire-statistica/gains-vs-roc-curves-do-you-
understand-the-difference/
Typically called a Cumulative Gains Chart, it can be explained simply with the following example:

For simplicity let's assume we have 1000 customers. If we run an advertising campaign for all
our customers, we might find that 30% (300 out of 1000) will respond and buy our new
product.



Marketing to all our customers is one possible strategy for running a campaign, but it is not the optimum use of our marketing dollars, especially for large customer bases. We would therefore like a better way of running the campaign: instead of targeting our whole customer base, we target only those customers with a high probability of responding positively. This will, firstly, lower the cost of the campaign and, secondly (and perhaps more importantly), avoid disturbing customers who have no interest in our new product with advertising.
This is where predictive classification models come in. There are many different models, but no matter which one we use, we can evaluate the results with Cumulative Gains Charts. If we have historical data on customers' reactions to past campaigns, we can use it to build a model that predicts whether a particular customer will respond by buying the product. The typical output of such a model is, for each customer, the probability of a positive (and of a negative) reaction. We can then sort customers by their probability of a positive reaction and run the campaign only for the percentage of customers with the highest probabilities.

The Gains chart is the visualization of that principle. On the X-axis we have the percentage of
the customer base we want to target with the campaign. The Y-axis gives us the answer to the
percentage of all positive responses customers have found in the targeted sample. In the
picture below you can see an example of the Gains chart. (The gains chart associated with the
model is the red curve).



What can we read from the graph? What happens if we only target 10% of our customer base?
According to the results of our model, if we will take the 10% of customers with the highest
probability of a positive response, we will get 28% of all the possible positive responses. This
means we will find 84 customers with positive responses from the 100 customers reached by
the campaign (84 is 28% of 300 positive response customers in our customer base).
With an increase of targeted customers to 50%, we already have more than 80% of those who
will, in a real situation, give a positive response. If this is our selected strategy for the real
campaign (reaching 50% of our customers by the model), then we will have reached 80% of all
the positive responses and saved 50% of our costs of running the campaign (we do not want
to run the campaign to customers that are not likely to respond positively).
The choice of the percentage to be targeted in the campaign depends on the concrete costs
for the campaign and the profit from the expected positive responses. The Gains chart is a
display of the expected results based on the choice of the percentage targeted. Our final
strategy, therefore, consists of the model and the targeted percentage (instead of the
percentage we can define the cut-off value for probabilities - if the probability is above this
value/threshold we will include the customer in the campaign).
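
A minimal sketch of how the points of a cumulative gains chart can be computed from model output, assuming y_true holds the observed responses and scores the predicted probabilities of a positive response (both arrays below are illustrative):

```python
import numpy as np

def cumulative_gains(y_true, scores):
    """Return (fraction of customers targeted, fraction of all positives captured)."""
    order = np.argsort(scores)[::-1]      # sort customers by predicted probability, descending
    y_sorted = np.asarray(y_true)[order]
    cum_positives = np.cumsum(y_sorted)   # positives captured in the top-k customers
    frac_targeted = np.arange(1, len(y_sorted) + 1) / len(y_sorted)
    frac_captured = cum_positives / y_sorted.sum()
    return frac_targeted, frac_captured

# Illustrative data: 10 customers, 3 of whom actually respond positively
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.1, 0.4, 0.8, 0.3, 0.2, 0.05, 0.7, 0.6, 0.15]
frac_targeted, frac_captured = cumulative_gains(y_true, scores)
# Here targeting the top 30% of customers (frac_targeted = 0.3) captures 100% of positives.
```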



It was already said that the red curve represents the proposed model. The blue curve represents the gains chart of a random model: picking customers randomly, without any selection criteria, so any sample contains the same proportion of positive responses as the whole customer base. In other words, if we target 10% of all customers, we will capture 10% of all the positive responses within that 10% sample. The curves meet at (0, 0) and (100, 100); the second point corresponds to running the campaign for all customers, so the output (all those who responded positively) is the same no matter how the customers were ordered, and using a predictive model brings no advantage once everyone is included.
The green curve is the optimal model, the best possible order for picking customers: we first target all customers with a positive response and only then those with a negative response. The slope of the first part of the green curve is 100/(percentage of all positive responses).

Confusion matrix
To test our strategy (defined by the model and the targeted percentage or equivalently the cut-
off value) we need to compare the output of the model to the actual results in the real world.
This is done by comparing the results and creating a contingency table of misclassification
errors (terminology as used in hypothesis testing - TP means true positive, FN false negative,
FP false positive, and TN true negative):

|              | Prediction YES                     | Prediction NO                        |
|--------------|------------------------------------|--------------------------------------|
| Observed YES | Count TP (right decision)          | Count FN (error of the second kind)  |
| Observed NO  | Count FP (error of the first kind) | Count TN (right decision)            |

Ideally, we want to have the right decisions made with high frequency. Such a table (usually
called a confusion matrix) is a very important decision tool when we evaluate the quality of the
model.

For better orientation, it is common practice to display the confusion matrix in the form of a bar graph. From such a graph we can see how many times the model predicts correctly (true negatives and true positives) and how many times the prediction is incorrect (false positives and false negatives). The better the model, the larger the TP and TN bars in comparison to FN and FP.



A point on the gains chart is therefore equivalent to the pair of values TP / (TP + FN) and (TP + FP) / (TP + FP + TN + FN). The first term is on the Y-axis and is the fraction of all positive responders captured; the second term is on the X-axis and is the fraction of targeted customers.

The curves discussed here (ROC, Gains, and Lift) are computed from the information in confusion matrices. It is important to realize that each curve is created from a large number of these confusion matrices, one for each targeted percentage/cut-off value.

ROC curve
Other terms connected with a confusion matrix are Sensitivity and Specificity. They are computed in the following way: Sensitivity = TP / (TP + FN) and Specificity = TN / (TN + FP).



The ROC curve (Receiver Operating Characteristics curve) is the display of sensitivity and
specificity for different cut-off values for probability (If the probability of a positive response is
above the cut-off, we predict a positive outcome, if not we are predicting a negative one). Each
cut-off value defines one point on the ROC curve, plotting the cut-off for the range of 0 to 1 will
draw the whole ROC curve. The Red curve on the ROC curve diagram below is the same
model as the example for the Gains chart:

The Y-axis measures the rate (as a percentage) of correctly predicted customers with a positive response (sensitivity). The X-axis measures the rate of customers with an actual negative response who are incorrectly predicted as positive (the false positive rate, i.e. 1 − specificity).
In the optimal model, sensitivity rises to its maximum while specificity stays at 1 the whole time (the optimal model is shown in green). The task is to get the ROC curve of the developed model as close as possible to that of the optimal model.
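
A minimal sketch of that threshold sweep: each cut-off produces one confusion matrix and therefore one (1 − specificity, sensitivity) point on the ROC curve; the labels and probabilities below are illustrative:

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
p_pos = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.55, 0.7])

for cutoff in np.arange(0.1, 1.0, 0.2):
    y_pred = (p_pos >= cutoff).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Each iteration yields one ROC point: x = 1 - specificity, y = sensitivity
    print(f"cut-off={cutoff:.1f}: FPR={1 - specificity:.2f}, TPR={sensitivity:.2f}")
```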



Usage
The Gains and ROC curves are visualizations showing the overall performance of models. The shape of the curves tells us a lot about the behavior of a model: it clearly shows how much better our model is than one assigning categories randomly, and how far we are from the optimal model, which is unachievable in practice. These curves can help in setting the final cut-off point that decides which probabilities are predicted as a positive and which as a negative response. The model together with the cut-off point defines our strategy of who should and should not be targeted by the campaign (the typical default value of 0.5 may neither meet the requirements of the use case nor be the best cut-off). While building the predictive model we can have many interim models, candidates for the final best model. Displaying the Gains (ROC) charts for several models in one graph makes it possible to compare them.
It is very important to mention that ROC, Gains, and Lift charts are connected with only one predicted category. In our example we were interested in finding customers with positive responses, because that was the main task of our use case. There are analogous Gains and ROC charts for the negative customer response as well. If the main goal of the prediction were to find the customers with a negative response, the criterion for the quality of the model would instead be the Gains or ROC curve for the negative response category.

So, what is the difference?


Both curves display how well the category in question (positive response in our example) is predicted as the cut-off for assignment to that category is varied. The difference is the scale on the X-axis, whereas the Y-axis is the same for the Gains chart and the ROC chart. In terms of formulas:

Y-axis (both charts): TP / (TP + FN)

X-axis (Gains chart): (TP + FP) / (TP + FP + TN + FN), i.e. the fraction of targeted customers

X-axis (ROC chart): FP / (FP + TN), i.e. the false positive rate (1 − specificity)

The results can also be represented graphically as a confusion matrix, showing the same quantities as in the table above:



The whole principle connecting Gains and ROC charts with confusion matrices (tables of good and bad classifications) is illustrated below. The main point is that a single confusion matrix (as well as other measures, such as the misclassification rate) corresponds to only one point on the Gains, ROC, or Lift chart.



Lift chart
We have mentioned the Lift chart several times but have not explained it. A Lift chart comes
directly from a Gains chart, where the X-axis is the same, but the Y-axis is the ratio of the Gains
value of the model and the Gains value of a model choosing customers randomly (red and blue
curve in the above Gains chart). In other words, it shows how many times the model is better
than the random choice of cases. We can see that the value of the lift chart at X=100 is 1
because if we choose all customers there would be no lift. The same customers will be picked
by both models.
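
A small sketch of that relationship, reusing the kind of output produced by the cumulative gains computation sketched earlier (the gains values below are illustrative):

```python
import numpy as np

# Fraction of customers targeted and fraction of positives captured (illustrative gains values)
frac_targeted = np.array([0.1, 0.2, 0.3, 0.5, 1.0])
frac_captured = np.array([0.33, 0.67, 1.0, 1.0, 1.0])

lift = frac_captured / frac_targeted  # lift = model gains / random-model gains
# e.g. lift at 10% targeted = 0.33 / 0.1 = 3.3, and lift at 100% targeted = 1.0 (no lift)
print(lift)
```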




Misclassification Cost Adjustment and Decision Cost/Benefit Analysis
When building a classification model, it's essential to account for the costs associated with misclassifications to reflect real-world implications. Not all types of errors (false positives vs. false negatives) have the same impact or cost in practical scenarios. Misclassification Cost Adjustment and Decision Cost/Benefit Analysis help make more informed decisions by explicitly considering these factors.

1. Misclassification Cost Adjustment


Misclassification Cost Adjustment involves adjusting the model to account for the different
costs associated with different types of errors:

False Positive (FP): When the model incorrectly predicts a positive outcome.

Example: Predicting a customer will churn when they won't. This might lead to
unnecessary retention efforts and increased costs.

False Negative (FN): When the model incorrectly predicts a negative outcome.

Example: Failing to predict a high-risk medical condition, leading to health consequences or higher costs later on.

The aim is to adjust the model’s focus to minimize high-cost errors based on the specific use
case.

Techniques for Misclassification Cost Adjustment


1. Weighted Cost Function:

Assign different weights to the types of errors during model training.

For example, if a false negative is more costly than a false positive, assign a higher
penalty for false negatives in the loss function.

2. Class Imbalance Techniques:

For imbalanced datasets, you can use methods like:

Resampling: Oversample the minority class (e.g., fraud cases) or undersample the
majority class.

Synthetic Data Generation: Methods like SMOTE (Synthetic Minority Over-sampling Technique) can help generate synthetic examples of the minority class.

Cost-Sensitive Learning: Modify the learning algorithm to be aware of different misclassification costs.

3. Custom Thresholds:

Adjust the classification threshold to change the balance between precision and
recall.

For example, in a fraud detection system, you may want to reduce false negatives (missed fraud), even if it means more false positives, by setting a lower threshold for classifying a transaction as "fraud" (see the sketch after this list).
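
A minimal sketch of techniques 1 and 3 from the list above, assuming scikit-learn; the synthetic dataset, the 5:1 class weighting, and the 0.3 threshold are illustrative choices rather than recommended values:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (roughly 10% positives, e.g. "fraud")
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Technique 1: weighted cost function: penalize errors on the positive class more heavily
model = LogisticRegression(class_weight={0: 1, 1: 5}, max_iter=1000)
model.fit(X_train, y_train)

# Technique 3: custom threshold: lower the cut-off to reduce costly false negatives
p_pos = model.predict_proba(X_test)[:, 1]
y_pred = (p_pos >= 0.3).astype(int)  # default decision threshold would be 0.5
```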



Example:
Imagine we are building a model to detect fraud in bank transactions:

False Positive Cost (FP): Annoying a customer by incorrectly flagging their legitimate
transaction as fraud.

False Negative Cost (FN): Failing to identify an actual fraud, leading to financial loss.

In this case, False Negative Cost is much higher, and hence, the model should be adjusted to
prioritize reducing false negatives, even at the cost of more false positives.

2. Decision Cost/Benefit Analysis


Decision Cost/Benefit Analysis is the process of evaluating the economic or real-world
impact of the model's decisions by comparing the costs of errors and the benefits of correct
predictions.
This type of analysis is critical when there are asymmetric costs and benefits—meaning the
impact of each type of decision varies significantly.

Steps for Decision Cost/Benefit Analysis


1. Identify Costs and Benefits:

Costs: Define the financial or operational cost of a False Positive and False Negative.

Benefits: Define the value associated with True Positive and True Negative
predictions.

2. Create a Cost-Benefit Matrix:

This is similar to a confusion matrix but includes the estimated costs and benefits for
each type of outcome.

|                 | Predicted Positive                                      | Predicted Negative                          |
|-----------------|---------------------------------------------------------|---------------------------------------------|
| Actual Positive | Benefit of correctly identifying (e.g., saved revenue)  | Cost of FN (e.g., lost customer)            |
| Actual Negative | Cost of FP (e.g., false alarms)                         | Benefit of correctly avoiding a false alarm |

3. Calculate Net Gain/Loss:

Use the number of True Positives (TP), True Negatives (TN), False Positives (FP), and
False Negatives (FN) to calculate the total cost or benefit.

Net Gain = (Benefit from TP * Number of TP) + (Benefit from TN * Number of TN) -
(Cost of FP * Number of FP) - (Cost of FN * Number of FN)
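
A minimal sketch of that calculation; the confusion-matrix counts and the per-case benefit and cost figures below are purely illustrative:

```python
# Confusion-matrix counts from an evaluated model (illustrative)
tp, tn, fp, fn = 78, 85, 15, 22

# Hypothetical per-case benefits and costs (e.g. in dollars)
benefit_tp = 100  # value of correctly identifying a positive case
benefit_tn = 10   # value of correctly avoiding a false alarm
cost_fp = 20      # cost of a false alarm
cost_fn = 200     # cost of a missed positive case

net_gain = (benefit_tp * tp) + (benefit_tn * tn) - (cost_fp * fp) - (cost_fn * fn)
print(net_gain)  # 7800 + 850 - 300 - 4400 = 3950
```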

Example: Bank Loan Approval



Imagine building a model to predict whether a customer will default on a loan:

True Positive (TP): Correctly identifying someone who would default—Benefit is avoiding a
loan loss.

True Negative (TN): Correctly approving a customer who will not default—Benefit is
gaining a profitable customer.

False Positive (FP): Incorrectly rejecting someone who wouldn’t default—Cost is the
opportunity loss of rejecting a good customer.

False Negative (FN): Incorrectly approving someone who ends up defaulting—Cost is the
financial loss from loan default.

Optimizing Based on Cost/Benefit:


If the net gain is negative, the model might need adjustments—such as changing the
classification threshold or using a different cost-sensitive learning algorithm—to improve
the net value.



You may choose a different classification threshold that minimizes FN (due to high cost)
and balances the benefits.

Summary
1. Misclassification Cost Adjustment is about adjusting your model to minimize errors with
high real-world costs. Techniques include custom loss functions, resampling to address
imbalances, and threshold tuning to minimize costly errors.

2. Decision Cost/Benefit Analysis involves evaluating the financial or operational impacts of different types of model predictions. You create a cost-benefit matrix and calculate the net gain or loss to determine the true value of deploying the model.

Both techniques are crucial for deploying machine learning models that have real-world
implications, ensuring that the model not only performs well on accuracy metrics but also
aligns with the financial and operational priorities of the business or use case.

Unit 4
Cluster Analysis and Clustering Methods

Introduction to Cluster Analysis


Cluster analysis is a technique used to group a set of objects into clusters such that objects
within a cluster are more similar to each other than to objects in other clusters. It is widely used
in fields such as data mining, machine learning, pattern recognition, and statistics.

Objective: To partition a dataset into distinct groups based on similarity.

Key Idea: Minimize intra-cluster distance (distance between objects in the same cluster)
and maximize inter-cluster distance (distance between objects in different clusters).
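
As a brief preview of the methods covered in this unit, the sketch below applies k-means clustering, which directly minimizes intra-cluster distance, to a small synthetic dataset; the data and the choice of k = 3 are illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)  # cluster assignment for each point
print(kmeans.inertia_)          # total within-cluster sum of squared distances
```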

Applications of Cluster Analysis


1. Marketing: Customer segmentation to target specific groups for campaigns.

2. Biology: Classifying species or genes with similar characteristics.

3. Image Processing: Identifying regions in images for segmentation.

4. Social Networks: Community detection.

5. Anomaly Detection: Identifying outliers or unusual patterns.

Steps in Cluster Analysis


1. Data Preparation:

