ML 7th Sem Aiml Ite Notes Complete Long (1) - 63-155
Unit 3
Introduction to Classification and Classification Algorithms
What is Classification?
Classification is a supervised learning technique in machine learning that involves predicting
the category or class of a given data point based on its features. The goal is to map input
variables (features) to a discrete output variable (class label).
Definition: Classification is the process of identifying the category to which a new data
point belongs, based on a training dataset containing observations with known class labels.
Key Features:
3. Real-World Applications:
Steps in Building a Classification Model
1. Problem Definition:
Clearly define the problem.
2. Data Preprocessing:
4. Splitting Data:
Divide the data into training and test sets (and sometimes a validation set).
Select an algorithm based on the problem and dataset characteristics (e.g., Logistic
Regression, Decision Trees).
8. Hyperparameter Tuning:
Optimize model parameters using methods like Grid Search or Random Search.
Common Classification Algorithms
1. Linear Classifiers:
Logistic Regression
2. Non-linear Classifiers:
k-Nearest Neighbors (k-NN), Support Vector Machines with non-linear kernels
3. Tree-based Methods:
Decision Trees
Random Forest
4. Neural Networks:
Multi-layer Perceptrons (MLPs)
5. Bayesian Models:
Naive Bayes
Challenges in Classification
Imbalanced Data: Some classes may have significantly more samples than others.
Overfitting: The model performs well on training data but poorly on test data.
k-Nearest Neighbors (k-NN)
Introduction
The k-Nearest Neighbors (k-NN) algorithm is a simple, yet powerful, instance-based, and non-
parametric classification technique in machine learning. It is widely used due to its ease of
implementation and interpretability. The key idea is that a data point’s class label can be
determined based on the majority class of its k closest neighbors in the feature space.
How k-NN Works
1. Training Phase:
In k-NN, there is no explicit training phase. The algorithm simply stores the training dataset.
2. Prediction Phase:
Given a new input (test point), the algorithm calculates the distance between this point and all points in the training set.
The k training points with the smallest distances are selected as the nearest neighbors.
The class label of the test point is assigned based on the majority class among these k neighbors.
1. Distance Metrics:
To determine which points are the closest, a distance metric must be chosen. The most common choice is the Euclidean distance:
d(x, y) = sqrt( ∑ (x_i - y_i)^2 )
Where x_i and y_i are the coordinates of the two points. Manhattan and Minkowski distances are other common metrics.
2. Choosing k:
Small values of k (e.g., k=1) make the model highly sensitive to noise.
Large values of k make the algorithm more robust but can lead to underfitting.
3. Class Assignment:
The class label of the new data point is assigned based on the majority vote of the k
nearest neighbors.
In case of a tie, various tie-breaking strategies can be used (e.g., choosing the class
with the smallest distance sum).
4. Weighted Voting:
Instead of simple majority voting, weighted voting can be used where closer neighbors
have a higher influence on the classification.
Algorithm Steps
1. Input: A dataset of labeled instances, the number of neighbors (k), and the test instance.
2. For each test point:
Calculate the distance between the test point and all training data points using the chosen distance metric.
Select the k training points with the smallest distances.
Perform a majority vote or weighted vote to assign the test point's class.
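The steps above can be sketched in a few lines of Python; the toy data, the value k = 3, and the function name are illustrative assumptions, not part of the notes.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify one test point by majority vote of its k nearest neighbors."""
    # Step 1: Euclidean distance from the test point to every training point
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # Step 2: indices of the k smallest distances (the nearest neighbors)
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote among the neighbors' class labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy dataset: class A (label 0) clustered near (1, 1), class B (label 1) near (5, 5)
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # predicts class A (0)
```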
Advantages of k-NN
1. Simplicity: k-NN is very simple to understand and implement.
3. Versatility: k-NN can be used for both classification and regression problems.
4. Adaptability: The model can naturally adapt to changes in the data as no explicit training is
required.
Disadvantages of k-NN
1. Computational Cost:
As k-NN needs to compute distances for every test point to all training samples, it can
be computationally expensive, especially with large datasets.
2. Memory-Intensive:
Since the algorithm stores the entire training dataset, it requires significant memory.
3. Sensitivity to Irrelevant Features:
If the dataset has irrelevant or noisy features, the performance of k-NN can degrade.
4. Curse of Dimensionality:
As the number of features increases, distances between data points become less informative (points tend to appear almost equally far apart), which can make it harder for the algorithm to identify meaningful neighbors. This issue can be mitigated by dimensionality reduction techniques such as PCA.
5. Choice of k:
The performance of k-NN depends strongly on the value chosen for k, as discussed in the next section.
Selecting the Optimal Value of k
Cross-validation is typically used to choose the best value for k.
A common approach is to try different values of k and select the one that minimizes the
classification error.
Odd values for k are preferred when dealing with binary classification problems to avoid
ties.
Practical Considerations
1. Feature Scaling:
Because k-NN is distance-based, features should be normalized or standardized so that features with large numeric ranges do not dominate the distance calculation.
2. Curse of Dimensionality:
For high-dimensional data, dimensionality reduction (e.g., PCA) is often applied before running k-NN.
3. Handling Missing Data:
k-NN can be sensitive to missing data. Simple approaches like imputation or removing data points with missing values can be applied.
Example
Consider a dataset with two classes: A (label = 0) and B (label = 1). You are given a new data
point (test point) and need to classify it using k-NN.
Calculate the distance between the test point and every point in the training set.
Find the 3 closest points to the test point. If two of them belong to class A and one belongs
to class B , the test point is classified as A based on majority voting.
Applications of k-NN
1. Pattern Recognition: Classifying images based on pixel data.
3. Medical Diagnosis: Classifying patients based on medical data (e.g., disease presence or
absence).
Conclusion
k-NN is a powerful and intuitive algorithm widely used in classification tasks. However, it
comes with challenges such as computational inefficiency and sensitivity to the curse of
dimensionality. By understanding these trade-offs, k-NN can be effectively applied in various
domains with careful pre-processing and parameter tuning.
Random Forests
Introduction
Random Forest is an ensemble learning technique used for both classification and regression
tasks. It combines multiple Decision Trees to improve performance, overcome overfitting, and
provide more robust predictions. The concept behind Random Forests is to create a "forest" of
decision trees where each tree votes for the predicted class, and the majority vote is taken as
the final output.
Key Idea: Reduce overfitting and increase accuracy by averaging multiple decision trees.
How Random Forest Works
1. Bootstrap Aggregation (Bagging):
Random Forest uses a technique called bagging, where multiple subsets of the original
training data are created with replacement. This means that some data points may be
selected multiple times, while others may be left out.
2. Random Feature Selection:
When building each tree, Random Forest does not consider all features for splitting; instead, it selects a random subset of features. This randomness helps to reduce correlation among the trees and makes the model more robust.
3. Building Multiple Trees:
Since each tree is trained on its own bootstrap sample and uses a random subset of features, it is slightly different from the others.
4. Prediction:
For Classification: Each tree in the Random Forest predicts a class label, and the final prediction is based on the majority vote across all trees.
For Regression: Each tree predicts a numeric value, and the final output is the average of all predictions.
Why Random Forest Reduces Overfitting
By using multiple trees, Random Forest avoids the overfitting that can happen with individual decision trees.
Bagging helps reduce variance, while random feature selection adds diversity to the individual trees, preventing them from becoming too correlated.
Advantages of Random Forest
1. Higher Accuracy:
The ensemble approach of Random Forest often provides higher accuracy than a single decision tree.
2. Robustness:
Random Forests are less prone to overfitting due to the randomness introduced by bagging and random feature selection.
4. Feature Importance:
It can rank features based on their importance to the prediction, which is helpful in
understanding the data.
5. No Pruning Required:
Unlike decision trees, Random Forest does not require explicit pruning, as the ensemble
approach balances complexity and overfitting.
Disadvantages of Random Forest
1. Complexity:
Random Forests are much more complex compared to a single decision tree. This complexity makes them harder to interpret.
2. Computational Cost:
Training and storing many trees requires more computation and memory than a single decision tree.
3. Black-Box Model:
Because predictions come from the combination of many trees, the reasoning behind an individual prediction is hard to trace.
Important Hyperparameters
Number of Trees ( n_estimators ):
The number of decision trees to be built in the forest. Increasing the number of trees usually leads to higher accuracy but also increases training time.
Number of Features per Split ( max_features ):
The number of features to consider when splitting each node. Common options include the square root or the logarithm of the total number of features.
Maximum Depth ( max_depth ):
Limits how deep each tree can grow. A shallow tree will help prevent overfitting.
Minimum Samples for Splitting ( min_samples_split ):
The minimum number of samples required to split a node. This helps control the growth
of the tree and avoid overfitting.
1. Training Phase:
Each tree is trained independently on a bootstrap sample of the data, using a random subset of features at each split.
2. Prediction Phase:
Each tree makes a prediction for the new input, and the Random Forest takes the majority vote (or the average, for regression) as the final prediction.
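As an illustration (not part of the original notes), a minimal scikit-learn sketch of training and querying a Random Forest; the dataset and hyperparameter values are placeholders.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# n_estimators = number of trees; max_features controls the random feature subset per split
model = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
model.fit(X_train, y_train)                              # training phase: builds the forest

print("Test accuracy:", model.score(X_test, y_test))     # prediction phase: majority vote
print("Feature importances:", model.feature_importances_)
```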
Conclusion
Random Forest is a powerful and widely used machine learning algorithm for both classification
and regression tasks. It combines the strengths of multiple decision trees to reduce overfitting
and increase accuracy. Despite being more complex and computationally expensive compared
to individual decision trees, Random Forest's robustness, versatility, and feature importance
evaluation make it a popular choice for many practical applications.
Fuzzy Set Approaches
Classical Set: In a classical set, an element can either fully belong (membership value = 1)
or not belong at all (membership value = 0).
Fuzzy Set: In a fuzzy set, each element has a degree of membership ranging between 0
and 1, representing the grade of membership.
Fuzzy sets were introduced by Lotfi A. Zadeh in 1965 to handle uncertainties and to model
problems that have vagueness or imprecision.
1. Membership Function:
The membership function defines how each element in the domain is mapped to its
corresponding degree of belonging to a fuzzy set.
2. Fuzzy Membership:
The membership value can be interpreted as the degree of truth or the degree to
which an element belongs to a particular set.
For example, a fuzzy set of "tall people" may assign a membership of 0.7 to a person
who is 180 cm tall and 0.2 to a person who is 160 cm tall.
Membership Functions
Membership functions can have different shapes, depending on the characteristics of the fuzzy set:
Triangular: Defined by three parameters a, b, and c; membership rises linearly from 0 at a to a peak of 1 at b and falls back to 0 at c.
Trapezoidal: Defined by four parameters a, b, c, and d that form a trapezoid.
It has two flat regions, indicating higher certainty over a range of values.
Operations on Fuzzy Sets
1. Union:
The union of two fuzzy sets A and B is given by the maximum of the membership values.
Formula: μ_(A∪B)(x) = max(μ_A(x), μ_B(x))
2. Intersection:
The intersection of two fuzzy sets A and B is given by the minimum of the membership values.
Formula: μ_(A∩B)(x) = min(μ_A(x), μ_B(x))
3. Complement:
The complement of a fuzzy set A is given by subtracting the membership value from 1.
Formula: μ_(¬A)(x) = 1 - μ_A(x)
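A small NumPy sketch of these operations on discrete fuzzy sets; the membership values below are made-up examples.

```python
import numpy as np

# Membership values of fuzzy sets A and B over the same five elements (illustrative values)
mu_A = np.array([0.2, 0.5, 0.8, 1.0, 0.4])
mu_B = np.array([0.6, 0.3, 0.9, 0.7, 0.1])

union        = np.maximum(mu_A, mu_B)   # mu_(A∪B)(x) = max(mu_A(x), mu_B(x))
intersection = np.minimum(mu_A, mu_B)   # mu_(A∩B)(x) = min(mu_A(x), mu_B(x))
complement_A = 1 - mu_A                 # mu_(¬A)(x)  = 1 - mu_A(x)

print(union)
print(intersection)
print(complement_A)
```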
Fuzzy Sets vs. Probability Theory
Fuzzy Set Theory: Deals with the degree to which a statement or property holds; it models vagueness and imprecision rather than randomness.
Probability Theory: Deals with the likelihood of an event occurring. It is used to model randomness and uncertainty.
For example, the statement "it is cloudy" can be modeled using fuzzy sets by assigning a
membership value to describe the "degree of cloudiness." On the other hand, probability
theory could be used to predict the chance of rain given the cloudiness.
Applications of Fuzzy Sets
1. Control Systems:
Fuzzy set theory is widely used in fuzzy logic controllers. It is particularly useful in
systems where human-like reasoning is required. For example, air conditioners,
washing machines, and other appliances use fuzzy logic to adjust settings like
temperature, washing time, etc.
2. Decision Making:
Fuzzy sets are used in multi-criteria decision-making where options are evaluated
based on multiple attributes with varying degrees of importance.
3. Pattern Recognition:
Fuzzy sets are applied to classification problems where the boundaries between
classes are not sharply defined (e.g., determining the membership of an image to
different object categories).
4. Medical Diagnosis:
Steps in a Fuzzy Inference System
1. Fuzzification:
Convert crisp input values into fuzzy values using the membership functions.
2. Rule Evaluation:
Apply fuzzy rules using a series of if-then conditions. These rules help in describing
the relationship between input and output variables in a human-understandable way.
3. Defuzzification:
Convert the fuzzy output back into a crisp value. Popular defuzzification methods
include centroid and max-membership methods.
Example of Fuzzy Set
Suppose we want to create a fuzzy set to represent the concept of "hot temperature."
1. Membership Function:
The fuzzy set “Hot Temperature” could be represented with a membership function that assigns a degree of membership to temperatures. For example, 25 °C might receive a membership close to 0.2, while 40 °C receives a membership close to 0.9.
2. Rule-Based System:
A fuzzy rule might be: "If the temperature is hot, turn on the fan at high speed."
The fuzzy inference system would calculate the degree to which the current
temperature is "hot" and adjust the fan accordingly.
Conclusion
Fuzzy Set Theory provides a powerful framework for modeling uncertainty, vagueness, and
imprecision, making it ideal for applications that involve subjective, ambiguous, or linguistic
information. By allowing partial membership, fuzzy sets represent real-world concepts more
effectively compared to traditional binary sets.
Support Vector Machines (SVM)
Introduction
Support Vector Machine (SVM) is a powerful supervised learning algorithm used for
classification and regression tasks. It is especially effective in high-dimensional spaces and
for problems where the data is not linearly separable. The main idea of SVM is to find a
decision boundary (or hyperplane) that maximizes the margin between two classes of data
points.
Goal: Maximize the margin between the decision boundary and the closest data points
from either class.
1. Hyperplane:
A hyperplane is a decision boundary that separates the data into different classes. For
a 2D space, it is simply a line, while in a 3D space, it is a plane. For higher dimensions,
it is called a hyperplane.
The goal of SVM is to find the optimal hyperplane that maximizes the separation
between classes.
2. Margin:
The margin is the distance between the hyperplane and the closest data points from
either class. SVM aims to maximize this margin.
The data points that are closest to the hyperplane are called support vectors, and they
are critical for defining the position of the hyperplane.
3. Support Vectors:
Support vectors are the data points that lie closest to the decision boundary. These
points influence the position and orientation of the hyperplane.
The hyperplane is uniquely defined by these support vectors, hence the name Support
Vector Machine.
How SVM Works
1. Linearly Separable Data:
In the simplest case, SVM finds a straight line (or hyperplane) that can completely separate data points of two different classes.
If the data is linearly separable, SVM constructs the optimal hyperplane that maximizes
the margin.
2. Non-Linearly Separable Data:
Often, real-world data is not linearly separable. To handle this, SVM uses a kernel trick
to transform the data into a higher-dimensional space, where it becomes linearly
separable.
Kernel functions help create the separation by projecting the data to a new dimension.
Kernel Trick
The kernel trick is used to transform non-linearly separable data into a higher dimension
where a hyperplane can separate the classes. This allows SVM to create complex decision
boundaries without explicitly computing the transformation. Common kernel functions include:
1. Linear Kernel:
Suitable when data is linearly separable. In this case, the SVM simply finds a straight
line to separate the data.
2. Polynomial Kernel:
Suitable when the relationship between the features is more complex and requires
curved decision boundaries.
3. Radial Basis Function (RBF) Kernel:
Useful when the data is not linearly separable and the decision boundary is highly non-linear; it is the most commonly used kernel in practice.
4. Sigmoid Kernel:
This kernel behaves like a neural network activation function and can be used in
specific applications.
Mathematical Representation
The goal of SVM is to find the hyperplane that maximizes the margin between the classes.
Suppose you have data points (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where y_i is the class label (+1 or -1).
The hyperplane can be represented as:
w * x + b = 0
The objective of SVM is to find w and b that maximize the margin, subject to the constraint that:
y_i (w · x_i + b) ≥ 1 for all i
This ensures that each data point is correctly classified and lies on the correct side of the margin.
C is a hyperparameter that controls the trade-off between maximizing the margin and minimizing classification error:
High C: The model attempts to classify all training data points correctly, which may lead
to overfitting.
Low C: The model allows some misclassification to achieve a larger margin, which
helps in generalization.
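A brief scikit-learn sketch of how C enters the picture when fitting an SVM; the synthetic data and the particular C values are arbitrary illustrations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Low C favors a wider margin; high C tries to classify every training point correctly
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X_train, y_train)
    print(f"C={C}: train acc={clf.score(X_train, y_train):.2f}, "
          f"test acc={clf.score(X_test, y_test):.2f}")
```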
Advantages of SVM
1. Effective in High Dimensions:
SVM is effective when the number of features is large compared to the number of
samples.
2. Handles Non-linear Data:
With the kernel trick, SVM can efficiently handle complex and non-linear data.
3. Memory Efficiency:
SVM only uses the support vectors for classification, making it memory-efficient as it
does not need to store the entire dataset.
4. Versatility:
Can be used for both classification and regression (referred to as Support Vector
Regression or SVR).
Disadvantages of SVM
1. Computational Complexity:
Training SVMs can be computationally intensive, especially for large datasets with many features. Training complexity grows roughly quadratically (or worse) with the number of samples.
2. Poor Scalability to Large Datasets:
SVM can be inefficient when the number of samples is very large due to high training times.
4. Interpretability:
The results of SVM are often harder to interpret compared to other algorithms like
Decision Trees.
Applications of SVM
1. Text Classification:
SVM is used for spam detection and sentiment analysis due to its ability to handle high-
dimensional data (like word counts in text).
2. Face Detection:
3. Bioinformatics:
SVMs are used for classifying genes, identifying proteins, and analyzing medical data.
4. Handwriting Recognition:
SVM is used to classify handwritten digits, often applied in digit recognition systems
like postal code reading.
Example of SVM
Consider a dataset with two classes: +1 and -1 , with features such as height and weight of
individuals.
If the data is linearly separable, SVM will find the line (hyperplane) that divides the two
classes with the maximum margin.
If the data is not linearly separable, an RBF kernel can be used to transform the data into a
higher-dimensional space, allowing SVM to find a non-linear decision boundary.
1. Linear Kernel
Definition: The linear kernel is the simplest kernel type used in SVM. It is primarily used
when the data is linearly separable, meaning it can be separated by a straight line (or
hyperplane).
K(x, y) = x · y
Use Cases:
It is effective when the number of features is significantly greater than the number of
training samples.
Commonly used for text classification problems like sentiment analysis or document
classification, where the features are word counts or term frequency vectors.
Advantages:
It is fast to train, introduces no additional kernel hyperparameters, and keeps the model comparatively easy to interpret.
Limitation:
It cannot capture non-linear relationships between features; if the data is not linearly separable, a non-linear kernel is required.
2. Polynomial Kernel
Definition: The polynomial kernel is used to handle non-linear relationships in the data by
allowing more complex decision boundaries. It introduces polynomial terms, effectively
enabling SVM to create curved separation boundaries.
K(x, y) = (x · y + c)^d
d is the degree of the polynomial, which determines the complexity of the decision surface, and c is a constant term.
Use Cases:
Suitable when the data has complex non-linear relationships between features.
Works well for applications where the relationships between input features and output
are better captured by polynomials, such as image recognition tasks.
Advantages:
It can model feature interactions and fit curved decision boundaries.
Limitations:
Higher degrees increase the risk of overfitting and the training cost, and the parameters d and c must be tuned.
3. Gaussian (Radial Basis Function, RBF) Kernel
Definition: The RBF kernel is the most widely used non-linear kernel. It measures similarity based on the distance between points and is controlled by the parameter γ (gamma):
K(x, y) = exp(-γ ||x - y||^2)
Where ||x - y||^2 represents the squared Euclidean distance between x and y.
High Gamma: The decision boundary is highly influenced by individual data points,
which can lead to overfitting.
Low Gamma: The model captures broader trends, which might lead to underfitting.
Use Cases:
Suitable when there is no prior knowledge of the data's structure and the decision boundary is expected to be highly non-linear.
Advantages:
Flexibility: The RBF kernel is very flexible and can model a wide range of decision
surfaces by controlling the γ parameter.
Effective for Complex Boundaries: It is effective when the decision boundary is curved
or when the data is not linearly separable.
Limitations:
Parameter Tuning: The value of γ needs careful tuning, typically done using cross-
validation.
Choosing the Right Kernel
Linear Kernel: Choose this if the data is linearly separable or if interpretability and
speed are priorities.
Polynomial Kernel: Use when you suspect non-linear relationships but want to model
these relationships with polynomial decision boundaries.
Gaussian/RBF Kernel: Choose this if the data is complex and not linearly separable,
and you want the SVM to create a sophisticated decision boundary.
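To make this choice concrete, a short sketch (with an assumed synthetic dataset) comparing the three kernels using scikit-learn.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=1)  # non-linearly separable toy data

for kernel, params in [("linear", {}),
                       ("poly", {"degree": 3, "coef0": 1}),
                       ("rbf", {"gamma": "scale"})]:
    clf = SVC(kernel=kernel, C=1.0, **params)
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{kernel} kernel: mean CV accuracy = {score:.3f}")
```

On curved data like this, the RBF (and to a lesser extent the polynomial) kernel typically outperforms the linear kernel, matching the guidance above.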
Summary of Kernels
1. Linear Kernel:
Equation: K(x, y) = x · y
Used for text categorization (e.g., spam detection) where the input features are often sparse and high-dimensional.
2. Polynomial Kernel:
Equation: K(x, y) = (x · y + c)^d
Used when the relationship between features is non-linear and curved decision boundaries are needed.
3. Gaussian (RBF) Kernel:
Equation: K(x, y) = exp(-γ ||x - y||^2)
Used for complex, non-linearly separable data where a flexible decision boundary is required.
Conclusion
The different SVM kernels make it possible to adapt SVM for a wide range of problems, from
simple linear separations to very complex non-linear relationships. By transforming data into
different feature spaces using these kernels, SVM is able to effectively handle many practical
machine learning tasks. The selection of an appropriate kernel, along with its corresponding
parameters, plays a crucial role in the performance and success of the SVM model.
1. Hyperplane in SVM
A hyperplane is a decision surface that separates the data into different classes in a Support
Vector Machine (SVM). The hyperplane is essentially the boundary that best divides the
dataset into classes.
Definition:
A hyperplane is a linear decision boundary that divides the input space into two halves,
representing the different classes.
Equation of a Hyperplane:
w · x + b = 0
Where:
w is the weight vector (normal to the hyperplane),
x is the input feature vector, and
b is the bias term that helps adjust the position of the hyperplane.
Goal of SVM:
The main goal of SVM is to find the optimal hyperplane that maximizes the margin
between the two classes. The margin is defined as the distance between the
hyperplane and the closest data points from each class, which are called support
vectors.
The larger the margin, the better the generalization ability of the classifier.
Support Vectors:
Support vectors are the data points that lie closest to the hyperplane. These points are
critical as they determine the exact position and orientation of the hyperplane.
Only support vectors influence the decision boundary, making SVM computationally
efficient in terms of memory, as it does not need to store all data points.
For non-linearly separable data, a hyperplane might not be able to divide the classes
in the original feature space. In such cases, the kernel trick is used to transform the
data into a higher-dimensional space, where a hyperplane can effectively separate the
classes.
2. Properties of SVM
Support Vector Machines have several properties that make them effective for both
classification and regression tasks:
1. Margin Maximization:
SVM aims to maximize the margin between the hyperplane and the support vectors.
Maximizing the margin enhances the generalization of the classifier, making it robust to
noise and overfitting.
The decision boundary is determined entirely by the support vectors. This allows SVM
to be memory-efficient because it only needs to keep track of these critical data points
rather than all training samples.
4. Kernel Trick:
The kernel trick allows SVM to create non-linear decision boundaries by mapping input
features into a higher-dimensional space without explicitly computing the
transformation. This makes SVM versatile in handling non-linearly separable data.
5. Regularization:
SVM uses a regularization parameter (C) to control the trade-off between achieving a
larger margin and minimizing classification error. A larger value of C places more
emphasis on classifying all training points correctly, potentially leading to overfitting. A
smaller value allows for more misclassification but ensures a wider margin.
6. Robustness to Outliers:
Although SVM can handle outliers with the soft margin approach, it is still sensitive to
noisy data. However, the impact of noisy points is minimized if they are not selected as
support vectors.
3. Issues in SVM
Despite its advantages, SVM has some issues and challenges that users should be aware of:
1. Computational Complexity:
For very large datasets, SVM may become infeasible in terms of memory and
computation.
2. Kernel Selection and Hyperparameter Tuning:
Choosing the right kernel function and tuning the associated hyperparameters (such
as C and γ for RBF kernel) is crucial for good performance.
3. Interpretability:
Unlike simple linear models, SVM models are less interpretable, especially when using
non-linear kernels. The decision boundary formed by the hyperplane in a higher-
dimensional space is difficult to visualize and explain.
When feature interpretability is needed, SVM may not be the best choice compared to
models like Decision Trees.
4. Sensitivity to Noise and Outliers:
SVM can be sensitive to outliers, particularly when a large value of the regularization parameter C is used. The presence of noisy data points can significantly affect the decision boundary if they lie close to the hyperplane.
5. Memory Usage:
In scenarios where a large number of support vectors are required, SVM can consume
a significant amount of memory. This can be problematic for very large datasets where
the number of support vectors is a large percentage of the total training set.
6. Class Imbalance:
SVM may perform poorly with imbalanced datasets where one class has significantly
more samples than the other. This is because the decision boundary is influenced by
the number of samples, and a small class may not have enough representation to shape
an appropriate boundary.
7. No Direct Probability Estimates:
SVM does not inherently provide probabilistic outputs. Extensions such as Platt
Scaling are used to estimate probabilities, but these tend to add complexity and are not
as reliable as models specifically designed for probabilistic interpretation.
Summary
Hyperplane: The decision surface that separates data into different classes. The goal of
SVM is to find the optimal hyperplane that maximizes the margin.
Properties of SVM: SVM is effective in high dimensions, uses support vectors to define the
decision boundary, applies kernel tricks for non-linear data, and maximizes the margin to
achieve good generalization.
Issues in SVM:
Kernel Selection: Choosing the right kernel and tuning hyperparameters can be
challenging.
Sensitivity to Noise and Outliers: SVM may be affected by outliers or noisy data points.
Memory Usage: SVM can be memory-intensive, especially if many support vectors are
needed.
Class Imbalance: SVM might struggle with imbalanced datasets, requiring additional
techniques to adjust.
SVM is a powerful algorithm that works well on a wide range of classification and regression
tasks, but it requires careful tuning and can be computationally expensive. Understanding
these properties and issues helps in deciding when SVM is the right choice for a given
problem.
Decision Trees
Introduction
A Decision Tree is a popular supervised learning algorithm used for classification and
regression tasks. It is a tree-like structure where each internal node represents a decision
based on a feature, each branch represents the outcome of the decision, and each leaf node
represents a final output or class label. Decision trees are easy to understand and interpret,
making them very useful for a wide range of applications.
Goal: Create a model that predicts the value of a target variable based on input features.
1. Root Node: The topmost node in a decision tree, representing the entire dataset. The root
node is split based on the feature that provides the highest information gain.
2. Internal Nodes: Nodes that represent decisions on features. Each internal node splits into
two or more branches based on feature values.
3. Leaf Nodes (Terminal Nodes): Nodes that represent the final output value or class label.
Each leaf node contains a prediction that applies to the data reaching that point.
4. Branches: These are the connections between nodes, representing the outcome of
decisions.
How a Decision Tree is Built
1. Selecting the Best Split:
The algorithm starts by selecting the best feature that splits the data into subsets. The quality of a split is evaluated using metrics such as:
Information Gain (Entropy)
Gini Impurity
Variance Reduction (for regression)
2. Recursive Splitting:
The data is split into subsets based on the selected feature. This process continues recursively for each subset.
3. Stopping Criteria:
Splitting stops when a node becomes pure, a maximum depth is reached, or too few samples remain.
Splitting Criteria in Detail
Entropy measures the impurity or randomness of a dataset S:
Entropy(S) = - ∑ p_i log2(p_i)
Where p_i is the proportion of instances belonging to class i in dataset S.
Information Gain (IG) measures the reduction in entropy by splitting the dataset on a particular feature A:
IG(S, A) = Entropy(S) - ∑_v (|S_v| / |S|) · Entropy(S_v)
where S_v is the subset of S for which feature A takes the value v. A feature with the highest Information Gain is selected to split the data.
Gini Impurity is another metric to evaluate the quality of a split. It measures how often a
randomly chosen element would be incorrectly labeled if it was randomly classified
based on the distribution of class labels.
Gini(S) = 1 - ∑ (p_i^2)
A feature that provides the lowest Gini Impurity is selected for the split.
When using decision trees for regression, Variance Reduction is used to measure the
effectiveness of a split.
The feature that results in the greatest reduction in variance is selected for splitting the
data.
Calculate the splitting criteria (e.g., Information Gain, Gini Impurity) for all features.
Choose the feature with the highest information gain (or lowest Gini Impurity).
4. Repeat:
Apply the same process to each subset (node) until one of the stopping criteria is met
(e.g., pure nodes, max depth reached).
5. Output: A decision tree where each leaf node provides the final classification or prediction.
Important Hyperparameters of Decision Trees
1. Maximum Depth ( max_depth ):
Controls the maximum number of levels in the tree. Limiting tree depth helps prevent overfitting by reducing model complexity.
2. Minimum Samples for Splitting ( min_samples_split ):
The minimum number of samples required to split an internal node. Higher values prevent overfitting by reducing splits.
3. Minimum Samples per Leaf ( min_samples_leaf ):
The minimum number of samples that must be present in a leaf node. This helps ensure that leaves do not end up with only a few samples.
4. Maximum Features ( max_features ):
Controls the number of features to consider when looking for the best split. Helps in controlling overfitting.
5. Criterion ( criterion ):
The function to measure the quality of a split. Common criteria are gini for the Gini
Impurity and entropy for Information Gain.
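For illustration only, here is how these hyperparameters map onto scikit-learn's DecisionTreeClassifier; the specific values chosen below are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(
    criterion="gini",       # or "entropy" for Information Gain
    max_depth=4,            # limits tree depth to reduce overfitting
    min_samples_split=10,   # minimum samples needed to split an internal node
    min_samples_leaf=5,     # minimum samples required in each leaf
    max_features=None,      # number of features considered per split
    random_state=0,
).fit(X, y)

print("depth:", tree.get_depth(), "| leaves:", tree.get_n_leaves())
```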
Advantages of Decision Trees
1. Interpretability:
Decision trees are intuitive and easily visualized, making it easier for non-experts to understand the model.
4. Feature Importance:
Features used near the top of the tree contribute most to the prediction, giving a rough measure of feature importance.
Disadvantages of Decision Trees
1. Overfitting:
Decision trees are prone to overfitting, especially when the tree depth is not limited,
and the model becomes too complex. Pruning methods and limiting tree depth help
mitigate this issue.
2. Instability:
Decision trees are sensitive to small changes in the data. Small variations in the data
can lead to a completely different tree structure.
3. Bias Toward Dominant Classes:
Decision trees can be biased towards classes with more samples, especially when the
data is imbalanced.
4. Greedy Approach:
Decision trees use a greedy algorithm for splitting, which may not always lead to the
global optimal solution.
Avoiding Overfitting: Pruning
Pruning reduces the size of the tree by removing branches that have little importance.
It helps in simplifying the model and improving generalization.
Pre-pruning: Limit tree growth by setting constraints (e.g., max depth, minimum
samples per leaf).
Post-pruning: The tree is first grown fully, then pruned back based on error rates on
validation data.
By setting min_samples_split and min_samples_leaf , you control the size of nodes, reducing
the chances of forming too specific (overfitted) splits.
Applications of Decision Trees
1. Medical Diagnosis:
Decision trees are used in medical diagnostics to classify patients based on symptoms
and determine possible diseases.
2. Credit Scoring:
Banks use decision trees to assess whether a loan applicant is likely to default, based
on various financial parameters.
3. Customer Churn:
4. Fraud Detection:
Conclusion
Decision Trees are intuitive, easy to interpret, and capable of handling both numerical and
categorical data, making them useful for many practical machine learning tasks. However, they
tend to overfit on training data if not properly regularized, which is why pruning techniques and
limiting hyperparameters are crucial to improve their generalization capability. Despite their
disadvantages, decision trees are often used as the base learners for more advanced
ensemble models like Random Forest and Gradient Boosting Machines.
https://www.xoriant.com/blog/decision-trees-for-classification-a-machine-learning-algorithm
Introduction
Decision Trees are a type of Supervised Machine Learning (that is, you explain
what the input is and what the corresponding output is in the training data) where the data is
continuously split according to a certain parameter. The tree can be explained by two entities,
namely decision nodes and leaves. The leaves are the decisions or the final outcomes. And the
decision nodes are where the data is split.
An example of a decision tree can be illustrated with a simple binary tree. Let’s say you want to
predict whether a person is fit given their information like age, eating habit, and physical
activity, etc. The decision nodes here are questions like ‘What’s the age?’, ‘Does he exercise?’,
‘Does he eat a lot of pizzas’? And the leaves, which are outcomes like either ‘fit’, or ‘unfit’. In
this case this was a binary classification problem (a yes no type problem). There are two main
types of Decision Trees:
1. Classification Trees: What we’ve seen above is an example of a classification tree, where the outcome is a variable like ‘fit’ or ‘unfit’. Here the decision variable is Categorical.
2. Regression Trees: Here the decision or outcome variable is Continuous, e.g. a number like 123.
Working
Now that we know what a Decision Tree is, we’ll see how it works internally. There are many algorithms out there which construct Decision Trees, but one of the best known is the ID3 Algorithm. ID3 stands for Iterative Dichotomiser 3. Before discussing the ID3 algorithm, we’ll go through a few definitions.
Entropy:
Entropy, also called Shannon entropy and denoted by H(S) for a finite set S, is a measure of the amount of uncertainty or randomness in data. Intuitively, it tells us about the predictability of a certain event. For example, consider a coin toss whose probability of heads is 0.5 and probability of tails is 0.5. Here the entropy is the highest possible, since there is no way of determining what the outcome might be. Alternatively, consider a coin which has heads on both sides: the outcome of such a toss can be predicted perfectly, since we know beforehand that it will always be heads. In other words, this event has no randomness, hence its entropy is zero. In particular, lower entropy values imply less uncertainty while higher values imply more uncertainty.
Information Gain:
Information gain, also called the Kullback-Leibler divergence and denoted by IG(S, A) for a set S, is the effective change in entropy after deciding on a particular attribute A. It measures the relative change in entropy with respect to the independent variables:
IG(S, A) = H(S) - ∑ P(x) · H(x)
where IG(S, A) is the information gain obtained by applying feature A, H(S) is the entropy of the entire set, and the second term calculates the entropy after applying the feature A, with P(x) being the probability (fraction of examples) of each value x of attribute A.
Let’s understand this with the help of an example. Consider a piece of data collected over the
course of 14 days where the features are Outlook, Temperature, Humidity, Wind and the
outcome variable is whether Golf was played on the day. Now, our job is to build a predictive
model which takes in above 4 parameters and predicts whether Golf will be played on the day.
We’ll build a decision tree to do that using ID3 algorithm.
5. For each attribute, calculate the entropy with respect to the attribute ‘x’ denoted by H(S, x)
7. Remove the attribute that offers the highest IG from the set of attributes (this attribute becomes the decision node at the current level)
8. Repeat until we run out of all attributes, or the decision tree has all leaf nodes.
Now, let's go ahead and grow the decision tree. The initial step is to calculate H(S), the Entropy
of the current state. In the above example, we can see in total there are 5 No’s and 9 Yes’s.
Yes | No | Total
9 | 5 | 14
H(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.94
Remember that the entropy is 0 if all members belong to the same class, and 1 when half of them belong to one class and the other half to the other class, i.e. perfect randomness. Here it is about 0.94, which means the distribution is fairly random.
attribute that gives us highest possible Information Gain which we’ll choose as the root node.
Let’s start with ‘Wind’:
IG(S, Wind) = H(S) - ∑ P(x) · H(S_x)
where ‘x’ ranges over the possible values of the attribute. Here, attribute ‘Wind’ takes two possible values in the sample data, hence x = {Weak, Strong}, so we will have to calculate H(S_weak) and H(S_strong). Amongst all 14 examples we have 8 where the wind is Weak and 6 where the wind is Strong:
Weak | Strong | Total
8 | 6 | 14
Now, out of the 8 Weak examples, 6 of them were ‘Yes’ for Play Golf and 2 of them were ‘No’, so H(S_weak) ≈ 0.811. Similarly, out of the 6 Strong examples, 3 had ‘Yes’ for Play Golf and 3 had ‘No’; here half the items belong to one class and the other half to the other, so we have perfect randomness and H(S_strong) = 1. We now have all the pieces required to calculate the Information Gain for ‘Wind’:
IG(S, Wind) = H(S) - (8/14) · H(S_weak) - (6/14) · H(S_strong) ≈ 0.048
We must similarly calculate the Information Gain for all the features. IG(S, Outlook) turns out to be the highest, at 0.246, hence we choose the Outlook attribute as the root node.
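The figures quoted above (H(S) ≈ 0.94 and IG(S, Wind) ≈ 0.048) can be reproduced with a few lines of Python; this sketch only uses the counts given in the example.

```python
import math

def entropy(counts):
    """Shannon entropy of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

H_S = entropy([9, 5])        # 9 Yes, 5 No over all 14 days  -> ~0.940
H_weak = entropy([6, 2])     # Wind = Weak: 6 Yes, 2 No      -> ~0.811
H_strong = entropy([3, 3])   # Wind = Strong: 3 Yes, 3 No    -> 1.0
IG_wind = H_S - (8 / 14) * H_weak - (6 / 14) * H_strong

print(f"H(S) = {H_S:.3f}, IG(S, Wind) = {IG_wind:.3f}")   # ~0.940 and ~0.048
```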
Here we observe that whenever the outlook is Overcast, Play Golf is always ‘Yes’. This is no coincidence: the simple sub-tree results because the attribute Outlook gives the highest information gain. How do we proceed from this point? We simply apply recursion (see the algorithm steps described earlier). Now that we have used Outlook, three attributes remain: Humidity, Temperature, and Wind. Outlook had three possible values: Sunny, Overcast, and Rain. Since the Overcast branch already ends in the leaf node ‘Yes’, we are left with two subtrees to compute: Sunny and Rain.
Restricting the table to the rows where Outlook is Sunny and computing the information gain of the remaining attributes in the same fashion, Humidity gives the highest Information Gain for that branch. Proceeding the same way with the Rain branch gives Wind as the attribute with the highest information gain, which completes the final Decision Tree.
Criterion: The ID3 algorithm uses Information Gain based on Entropy as the splitting
criterion to determine the best feature at each node of the tree.
The feature with the highest Information Gain is selected to split the data.
4. Split Data:
Create branches for each possible value of the feature, and assign subsets of the
training data to these branches.
5. Repeat Recursively:
Continue splitting each subset based on the next feature with the highest information
gain until all data points are classified or other stopping criteria are met (e.g., all
samples belong to the same class).
When all data points are classified, the nodes are turned into leaf nodes, which
represent the final decision.
Features of ID3
Attribute Selection: ID3 uses Information Gain to determine the most informative attribute
at each level.
Works Well for Categorical Data: ID3 is suited for classification problems, particularly with
categorical data.
Only for Categorical Attributes: The ID3 algorithm works primarily with categorical
features and requires discretization for numerical data.
Greedy Approach: The selection of attributes is done using a greedy approach, which
does not guarantee the global optimum solution.
2. Inductive Bias
Inductive Bias refers to the set of assumptions a machine learning algorithm makes to
generalize from the training data to unseen data.
Shorter Trees Are Preferred: Decision trees aim to create the shortest possible tree
that fits the data.
Preference for Features with High Information Gain: The ID3 algorithm selects
features based on their information gain, assuming that features with higher
information gain lead to better classification results.
Inductive bias helps decision trees generalize well to unseen data by preventing them from
creating unnecessarily complex models that overfit the training data.
3. Entropy and Information Theory
Entropy measures the uncertainty or impurity in the data:
Entropy(S) = - ∑ p_i log2(p_i)
Where:
S is the dataset and p_i is the proportion of instances in S that belong to class i.
Interpretation:
Low Entropy: When entropy is low (close to 0), the dataset is pure, meaning that most of
the instances belong to the same class.
High Entropy: When entropy is high (close to 1), the dataset is more mixed, with an almost
equal distribution of classes.
4. Information Gain
Information Gain is a measure used to evaluate the effectiveness of an attribute in classifying
the dataset. It is calculated as the reduction in entropy after splitting the dataset based on a
particular feature.
The feature with the highest information gain is selected as the split attribute for the
current node.
Significance:
Higher Information Gain indicates that the feature is more informative and results in a
better split of the dataset.
The goal is to maximize information gain at every node, which helps in making the decision
tree more efficient in classification.
5. Issues in Decision Tree Learning
1. Overfitting:
Decision trees tend to overfit the training data, especially when they grow too deep
and create too many branches. This results in poor generalization to unseen data.
Solution: Use pruning techniques, limit tree depth, or set constraints on the number of
samples per leaf node.
2. Bias Towards Features with Many Values:
When selecting attributes, decision trees may favor features with many distinct values
(e.g., ID number). This could lead to splits that do not actually provide meaningful
information.
Solution: Use different metrics like Gain Ratio that penalize features with a large
number of values.
3. Imbalanced Datasets:
Decision trees can be biased towards the majority class in imbalanced datasets.
4. High Variance:
Decision trees are susceptible to high variance; small changes in data can lead to a
significantly different tree.
5. Greedy Nature:
Decision trees use a greedy algorithm for attribute selection, which may not lead to the
optimal solution.
6. Handling Continuous Features:
Decision trees can handle numeric features, but they require special handling to
determine optimal split points.
7. Data Fragmentation:
As the tree grows deeper, the dataset gets split into smaller fragments, leading to
insufficient data at some nodes. This is known as the fragmentation problem and can
result in unreliable splits.
Solution: Limit the depth of the tree or prune nodes with insufficient data.
Summary
Inductive Bias: Represents the set of assumptions made by the decision tree to generalize
from the training data to unseen data. This includes a preference for smaller trees and
attributes with high information gain.
Entropy and Information Theory: Entropy measures the uncertainty or impurity in the data,
while Information Gain measures the reduction in entropy after splitting the dataset based
on a particular attribute.
Information Gain: Used as the criterion in ID3 to determine the best feature for splitting,
with higher information gain representing a better split.
Issues in Decision Tree Learning: Challenges include overfitting, high variance, bias
towards features with many values, difficulty in handling imbalanced data, and greedy
nature of the algorithm.
Understanding these concepts provides the foundation for effectively utilizing decision trees,
addressing their limitations, and applying more advanced algorithms such as Random Forest
and Gradient Boosting Machines for improved performance.
Bayesian Learning
Bayesian learning is a probabilistic framework for machine learning that leverages Bayes'
theorem to update the probability of a hypothesis based on observed evidence or data.
Bayesian learning methods are valuable for dealing with uncertainty and making predictions
that incorporate prior knowledge.
1. Bayes' Theorem
Bayes' theorem forms the backbone of Bayesian learning by providing a principled way to
revise probabilities in light of new data. It connects the prior probability of a hypothesis, the
likelihood of observing data given that hypothesis, and the posterior probability—which is the
updated belief after seeing the data.
P(H|E) = [ P(E|H) · P(H) ] / P(E)
Where:
P(H) : Prior Probability of hypothesis H before considering the evidence. It is our initial
belief about H .
P(E|H): Likelihood of the evidence E given hypothesis H . It measures how probable the
observed data is under the hypothesis.
P(E): The overall probability of observing the evidence E.
P(H|E): Posterior Probability, the updated belief about H after observing the evidence E.
For example, in medical testing, let H be the hypothesis that a person has a disease and E a positive test result. Then:
P(E|H): The probability of getting a positive result given that the person has the disease
(i.e., the sensitivity of the test).
P(E): The probability of a positive result occurring overall (which depends on the
prevalence of the disease and the accuracy of the test).
P(H|E): The updated probability that the person has the disease after observing the
positive test result.
By combining these values using Bayes’ theorem, we can get a more accurate estimate of the
likelihood that the person actually has the disease.
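A tiny numeric sketch of this reasoning; the prevalence, sensitivity, and false-positive rate below are assumed values chosen purely for illustration.

```python
# Assumed values: 1% prevalence, 95% sensitivity, 5% false-positive rate (hypothetical)
p_H = 0.01              # P(H): prior probability of having the disease
p_E_given_H = 0.95      # P(E|H): probability of a positive test given the disease
p_E_given_notH = 0.05   # P(E|not H): probability of a positive test without the disease

# P(E) via the law of total probability
p_E = p_E_given_H * p_H + p_E_given_notH * (1 - p_H)

# Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)
p_H_given_E = p_E_given_H * p_H / p_E
print(f"P(H|E) = {p_H_given_E:.3f}")   # ~0.161: still unlikely despite the positive test
```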
Limitations
Computational Complexity: Evaluating every hypothesis in the hypothesis space can be
computationally challenging, especially for large hypothesis spaces.
Example: Spam Classification
Prior Probability ( P(H) ): Prior belief about how likely each model is to be correct.
Likelihood ( P(D|H) ): Probability of observing certain features in the email (e.g., certain keywords) given the hypothesis.
Posterior Probability: Updated belief about the hypothesis after seeing the email.
Bayes Optimal Classifier: Instead of selecting a single model, the classifier considers all
possible models and weighs their predictions by their posterior probabilities.
Summary
Bayes' Theorem: Provides a method for updating the probability of a hypothesis based on
observed evidence.
Bayesian learning is a powerful framework that naturally integrates uncertainty and prior
knowledge into the learning process, making it useful for a wide range of applications such as
medical diagnosis, spam detection, and risk assessment.
https://mlarchive.com/machine-learning/the-ultimate-guide-to-naive-bayes/
Essentially, the Bayes’ theorem describes the probability of an event based on prior
knowledge of the conditions that might be relevant to the event.
The theorem is named after the English statistician Thomas Bayes, whose formulation of the result was published in 1763. It is considered the foundation of the approach to statistical inference known as Bayesian inference.
To use the algorithm:
1. Convert the presented data set into frequency tables.
2. Create a probability (likelihood) table by finding the probabilities of certain features.
3. Use Bayes' theorem to calculate the posterior probability.
For example, let’s solve the following problem: if the weather is sunny, should the player play or not?
Likelihood/probability table
P(Yes|Sunny) > P(No|Sunny) ⇒ So on a sunny day, the player can play the game.
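A minimal sketch of the three steps (frequency table, likelihoods, posterior) with made-up counts, since the full table is not reproduced here; the numbers are purely illustrative.

```python
# Hypothetical counts: 14 days in total, 10 with Play = Yes and 4 with Play = No;
# 5 sunny days, of which 3 were Yes and 2 were No (illustrative values only).
n_total, n_yes, n_no = 14, 10, 4
n_sunny, sunny_yes, sunny_no = 5, 3, 2

p_yes, p_no = n_yes / n_total, n_no / n_total            # priors P(Yes), P(No)
p_sunny_given_yes = sunny_yes / n_yes                    # likelihood P(Sunny|Yes)
p_sunny_given_no = sunny_no / n_no                       # likelihood P(Sunny|No)
p_sunny = n_sunny / n_total                              # evidence P(Sunny)

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny  # posterior P(Yes|Sunny)
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny     # posterior P(No|Sunny)
print(p_yes_given_sunny, p_no_given_sunny)               # play if P(Yes|Sunny) is larger
```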
When the assumption of feature independence holds, the algorithm can perform better than other models, and it also requires much less training data.
It is suitable for classification with discrete features that are categorically distributed.
This algorithm has a “zero-frequency problem”: it assigns zero probability to categorical values that appear in the test dataset but not in the training dataset. A smoothing method (e.g., Laplace smoothing) can be used to overcome this problem.
Its probability estimates can be inaccurate in some cases, so its predicted probabilities should not be interpreted as precise values.
Types of Naive Bayes Classifiers
Complement: Used for imbalanced data, as it estimates the probability of each sample belonging to all other classes rather than its own class.
Bernoulli: Used when features follow a Bernoulli distribution; it is suitable for discrete data and is designed for binary/boolean features.
Multinomial: Unlike Bernoulli, it works with occurrence counts, not only binary features.
Bayesian Belief Networks (BBN)
A Bayesian Belief Network is a graphical model that represents probabilistic relationships between random variables using a Directed Acyclic Graph (DAG) and Conditional Probability Tables (CPTs).
Nodes: Each node represents a random variable (e.g., symptoms, test results, or
conditions).
Edges: Directed edges represent dependencies between variables. An edge from node
A to node B means that A has a direct influence on B.
For a variable X with parents P_1, P_2, ..., P_n, the CPT defines:
P(X | P_1, P_2, ..., P_n)
Global Representation: The joint probability distribution of all nodes can be determined
using local CPTs.
P(X_1, X_2, ..., X_n) = ∏_{i=1}^{n} P(X_i | Parents(X_i))
1. Exact Inference:
Methods like Variable Elimination or Belief Propagation can be used to derive exact
probabilities in small networks.
2. Approximate Inference:
Sampling-based methods (e.g., Monte Carlo or Gibbs sampling) are used when exact inference is computationally infeasible in large networks.
Applications of Bayesian Belief Networks
1. Medical Diagnosis:
BBNs are used in medical diagnosis to model the relationships between symptoms,
test results, and diseases. They allow for reasoning about the likelihood of different
diseases given observed symptoms.
2. Risk Assessment:
Example: A Simple Diagnostic Network
Nodes:
D : Disease (Yes/No)
S : Symptom (Present/Absent)
T : Test result (Positive/Negative)
Edges:
D → S and D → T, since the disease directly influences both the symptom and the test result.
The network structure and conditional probability tables can help calculate the probability of having the disease given that the symptom is present and the test is positive.
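To make this inference concrete, a small enumeration sketch for a network with edges D → S and D → T; every probability below is an assumed, illustrative value.

```python
# Assumed CPTs (illustrative only)
P_D = {True: 0.01, False: 0.99}           # P(D): prior probability of the disease
P_S_given_D = {True: 0.80, False: 0.10}   # P(S = present | D)
P_T_given_D = {True: 0.90, False: 0.05}   # P(T = positive | D)

def joint(d, s=True, t=True):
    """Joint probability using the factorization P(D, S, T) = P(D) * P(S|D) * P(T|D)."""
    p_s = P_S_given_D[d] if s else 1 - P_S_given_D[d]
    p_t = P_T_given_D[d] if t else 1 - P_T_given_D[d]
    return P_D[d] * p_s * p_t

# P(D = yes | S = present, T = positive) by enumeration and normalization
posterior = joint(True) / (joint(True) + joint(False))
print(f"P(D | S, T) = {posterior:.3f}")
```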
The Expectation-Maximization (EM) Algorithm
The EM algorithm is an iterative method for estimating model parameters when some variables are hidden or the data is incomplete. It alternates between two steps:
1. E-Step (Expectation):
In the E-step, the algorithm calculates the expected value of the log-likelihood
function, considering the current estimate of the parameters.
In other words, it calculates the probability distribution over the possible values of the
hidden variables, using the current parameters to fill in the missing values.
2. M-Step (Maximization):
In the M-step, the algorithm maximizes the expected log-likelihood computed in the E-
step with respect to the model parameters.
The M-step updates the parameters to values that maximize the likelihood function
based on the distribution calculated in the E-step.
These two steps are repeated until convergence, meaning that the parameters no longer
change significantly.
The EM algorithm iteratively updates the parameters by alternating between the two steps:
Q(θ | θ^(t)) = E_{Z | X, θ^(t)} [ log L(θ; X, Z) ]
Where Z represents the latent variables, X the observed data, and θ^(t) the current parameter estimates.
Applications of EM Algorithm
1. Gaussian Mixture Models (GMMs):
The EM algorithm is the standard method for fitting GMMs, estimating the mean, variance, and mixing coefficient of each Gaussian component (see the worked steps below).
2. Hidden Markov Models (HMMs):
The EM algorithm is used to train Hidden Markov Models by estimating the transition and emission probabilities in the presence of hidden states.
3. Image Reconstruction:
Example: EM for a Gaussian Mixture Model
1. E-Step:
Calculate the responsibilities for each data point, which represents the probability that
each point belongs to each Gaussian component, using the current parameter
estimates (mean, variance, and mixing coefficients).
2. M-Step:
Update the parameters of each Gaussian component (mean, variance, and mixing
coefficients) by maximizing the expected complete log-likelihood using the
responsibilities from the E-step.
This process is repeated until the parameters converge, leading to the best-fit Gaussian
components for the data.
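As an illustration of EM in practice, a short sketch using scikit-learn's GaussianMixture, which fits a GMM by Expectation-Maximization; the synthetic data and the number of components are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussian components (illustrative)
data = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 0.5, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, max_iter=100, random_state=0)
gmm.fit(data)   # internally alternates E-steps (responsibilities) and M-steps (updates)

print("means:", gmm.means_.ravel())
print("mixing weights:", gmm.weights_)
print("converged:", gmm.converged_, "after", gmm.n_iter_, "iterations")
```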
Summary
Bayesian Belief Networks are graphical models that represent probabilistic relationships
between variables using a Directed Acyclic Graph and Conditional Probability Tables.
They are highly useful for reasoning under uncertainty in domains like medical diagnosis
and risk assessment.
The EM Algorithm alternates between an Expectation step and a Maximization step to estimate model parameters when some variables are hidden or the data is incomplete.
Both Bayesian Belief Networks and the EM Algorithm are important tools in machine learning
and statistical modeling, enabling robust reasoning and parameter estimation even when
dealing with complex dependencies and incomplete information.
https://encord.com/blog/what-is-ensemble-learning/
Ensemble Learning
Imagine you are watching a football match. The sports analysts provide you with detailed
statistics and expert opinions. At the same time, you also take into account the opinions of
fellow enthusiasts who may have witnessed previous matches. Combining these different sources of opinion helps overcome the limitations of relying solely on any single one and improves the overall judgement. Similarly, in ensemble learning, combining multiple models or algorithms can improve prediction accuracy.
In both cases, the power of collective knowledge and multiple viewpoints is harnessed to make
more informed and reliable predictions, overcoming the limitations of relying solely on one
model. Let us take a deeper dive into what Ensemble Learning actually is.
Ensemble learning is a machine learning technique that improves the performance of machine
learning models by combining predictions from multiple models. By leveraging the strengths of
diverse algorithms, ensemble methods aim to reduce both bias and variance, resulting in more
reliable predictions. It also increases the model’s robustness to errors and uncertainties,
especially in critical applications like healthcare or finance.
Ensemble learning techniques like bagging, boosting, and stacking enhance performance and
reliability, making them valuable for teams that want to build reliable ML systems.
Brief overview
These techniques aim to reduce bias and variance in individual models, and improve prediction
accuracy by learning previous errors, ultimately leading to a consensus prediction that is often
more reliable than any single model.
Random forest
The Random Forest algorithm is a prime example of bagging. It creates an ensemble of
decision trees trained on samples of datasets. Ensemble learning effectively handles complex
features and captures nuanced patterns, resulting in more reliable predictions. However, it is
also true that the interpretability of ensemble models may be compromised due to the
combination of multiple decision trees. Ensemble models can provide more accurate
predictions than individual decision trees, but understanding the reasoning behind each
prediction becomes challenging. Bagging helps reduce overfitting by generating multiple
subsets of the training data and training individual decision trees on each subset. It also helps
reduce the impact of outliers or noisy data points by averaging the predictions of multiple
decision trees.
Gradient boosting
Gradient Boosting (GB) trains each model to minimize the errors of previous models by training
each new model on the remaining errors. This iterative process effectively handles numerical
and categorical data and can outperform other machine learning algorithms, making it
versatile for various applications.
For example, you can apply Gradient Boosting in healthcare to predict disease likelihood
accurately. Iteratively combining weak learners to build a strong learner can improve prediction
accuracy, which could be valuable in providing insights for early intervention and personalized
treatment plans based on demographic and medical factors such as age, gender, family
history, and biomarkers.
One potential challenge of gradient boosting in healthcare is its lack of interpretability. While it
excels at accurately predicting disease likelihood, the complex nature of the algorithm makes it
difficult to understand and interpret the underlying factors driving those predictions.
This can pose challenges for healthcare professionals who must explain the reasoning behind
a particular prediction or treatment recommendation to patients. However, efforts are being
made to develop techniques that enhance the interpretability of GB models in healthcare,
ensuring transparency and trust in their use for decision-making.
Boosting is an ensemble method that adapts the training data to focus attention on the examples that previously fitted models have gotten wrong.
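A short scikit-learn sketch of gradient boosting on a binary classification task; the dataset and hyperparameter values are placeholders, not a healthcare model.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new tree is fit to the remaining errors of the current ensemble
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3,
                                 random_state=0).fit(X_train, y_train)
print("Test accuracy:", gbm.score(X_test, y_test))
```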
Stacking: Meta-learning
Stacking, or stacked generalization, is a model-ensembling technique that improves predictive
performance by combining predictions from multiple models. It involves training a meta-model
that uses the outputs of the base-level models to make a final prediction. The meta-model, which can be a linear regression model, a neural network, or any other algorithm, produces the final prediction.
This technique leverages the collective knowledge of different models to generate more
accurate and robust predictions. The meta-model can be trained using algorithms such as linear regression, neural networks, or support vector machines. The final prediction is
based on the meta-model's output. Overfitting occurs when a model becomes too closely fitted
to the training data and performs poorly on new, unseen data. Stacking helps mitigate
overfitting by combining multiple models with different strengths and weaknesses, thereby
reducing the risk of relying too heavily on a single model’s biases or idiosyncrasies.
For example, in financial forecasting, stacking combines models like regression, random forest,
and gradient boosting to improve stock market predictions. This ensemble approach mitigates
the individual biases in the model and allows easy incorporation of new models or the removal
of underperforming ones, enhancing prediction performance over time.
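For illustration, a minimal stacking sketch with scikit-learn; the choice of base models, meta-model, and dataset is an assumption, not a recipe for financial forecasting.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Base-level models feed their predictions into a logistic-regression meta-model
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
                ("svc", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
print("Mean CV accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```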
Voting
In a voting ensemble, several different models are trained on the same data and their predictions are combined, either by majority (hard) voting or by averaging predicted probabilities (soft voting). The application of either bagging or boosting requires the selection of a base learner algorithm first. For example, if one chooses a classification tree, then boosting and bagging would produce a pool of trees whose size is set by the user's preference.
Robustness
Ensemble learning enhances robustness by considering multiple models' opinions and making
consensus-based predictions. This mitigates the impact of outliers or errors in a single model,
ensuring more accurate results. Combining diverse models reduces the risk of biases or
inaccuracies from individual models, enhancing the overall reliability and performance of the
ensemble learning approach. However, combining multiple models can increase the computational complexity compared to using a single model. Furthermore, as ensemble models grow to include more learners, their memory footprint and prediction time grow as well.
Reducing Overfitting
Ensemble learning reduces overfitting by using random data subsets for training each model.
Bagging introduces randomness and diversity, improving generalization performance. Boosting
assigns higher weights to difficult-to-classify instances, focusing on challenging cases and
improving accuracy. Iteratively adjusting weights allows boosting to learn from mistakes and
build models sequentially, resulting in a strong ensemble capable of handling complex data
patterns. Both approaches help improve generalization performance and accuracy in ensemble
learning.
Computational Complexity
Ensemble learning, involving multiple algorithms and feature sets, requires more computational
resources than individual models. While parallel processing offers a solution, orchestrating an
ensemble of models across multiple processors can introduce complexity in both
implementation and maintenance. Also, more computation might not always lead to better
performance, especially if the ensemble is not set up correctly or if the models amplify each
other's errors in noisy datasets.
Interpretability
Ensemble learning models prioritize accuracy over interpretability, resulting in highly accurate
predictions. However, this trade-off makes the ensemble model more challenging to interpret.
Techniques like feature importance analysis and model introspection can help reveal the factors contributing to an ensemble model's decisions and reduce the interpretability challenge, but they may not fully demystify the predictions of complex ensembles.
Introduction to AdaBoost
AdaBoost stands for Adaptive Boosting. It is an ensemble method that combines multiple
"weak" learners (often simple decision trees or stumps) into a single "strong" learner in an
Key Idea: Create a strong classifier by combining multiple weak classifiers in sequence.
Weak Learners: Typically uses decision stumps (a decision tree with only one split).
How AdaBoost works:
1. Initialize Weights:
Every training sample starts with an equal weight.
2. Train a Weak Classifier:
Fit a weak learner (typically a decision stump) on the weighted training data.
3. Update Weights:
Samples that are misclassified by the weak classifier are given higher weights, which makes them more important in the next round.
4. Weight the Classifier:
After each iteration, assign a weight to the weak classifier based on its accuracy. This weight indicates the importance of the classifier in the final ensemble.
5. Repeat and Combine:
Repeat the process to train multiple classifiers and combine their weighted predictions to form the final, strong classifier.
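These steps correspond roughly to the following scikit-learn sketch, which uses decision stumps as the weak learners; the synthetic data and parameter values are illustrative assumptions:

```python
# AdaBoost: sequentially trained decision stumps, each paying more attention to
# the samples the previous stumps misclassified. Settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=3)

ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # a decision stump as the weak learner
    n_estimators=100,
    learning_rate=0.5,
    random_state=3,
)
ada.fit(X_train, y_train)
print("AdaBoost accuracy:", accuracy_score(y_test, ada.predict(X_test)))
```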
Advantages of AdaBoost
1. Simple:
Easy to implement and requires relatively little hyperparameter tuning.
2. Versatile:
Works well with various types of weak learners (commonly decision stumps).
Disadvantages of AdaBoost
1. Sensitive to Noisy Data:
If there are outliers or noisy data, AdaBoost can focus too much on these instances,
leading to overfitting.
2. Dependent on Weak Learners:
Performance depends on the quality of the weak learners. If the weak learners are very poor, the ensemble won't perform well.
Applications of AdaBoost
Face Recognition: Used in computer vision tasks, such as face detection (e.g., Haar
cascades).
Introduction to XGBoost
XGBoost stands for Extreme Gradient Boosting. It is an optimized implementation of the
Gradient Boosting Machine (GBM), specifically designed for speed and performance. XGBoost
is widely recognized for its high efficiency, scalability, and accuracy.
Key Idea: Uses advanced regularization and gradient-based optimization to build strong
models from weak learners.
The idea of Gradient Boosting is to build new learners to correct the errors made by
previous learners.
Each learner in XGBoost is trained to minimize the residuals (errors) of the previous
learners, by fitting a decision tree to the gradient of the loss function with respect to the
predictions.
Pruning: XGBoost grows each tree up to a maximum depth and then prunes backward, removing splits that do not provide a positive gain.
The trees are constructed iteratively, and the final prediction is obtained by summing
the output from all the trees.
A learning rate parameter helps to control the contribution of each new tree added to
the model, allowing the model to make smaller, more controlled updates.
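A minimal sketch using the xgboost Python package (assuming it is installed via pip install xgboost; the dataset and hyperparameter values are illustrative assumptions only):

```python
# XGBoost: regularized gradient boosting with shrinkage and depth-limited trees.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier  # requires the xgboost package

X, y = make_classification(n_samples=1000, n_features=20, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=4)

xgb = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,  # controls the contribution of each new tree
    max_depth=4,
    reg_lambda=1.0,      # L2 regularization on leaf weights
    subsample=0.8,       # row subsampling adds randomness
)
xgb.fit(X_train, y_train)
print("XGBoost accuracy:", accuracy_score(y_test, xgb.predict(X_test)))
```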
Advantages of XGBoost
1. High Performance:
XGBoost is highly efficient and fast, making it suitable for large datasets and real-time
applications.
2. Regularization:
Built-in L1 and L2 regularization penalize overly complex trees, which reduces overfitting.
3. Handling Missing Values:
XGBoost handles missing values internally by learning a default split direction for them.
4. Flexibility:
Supports custom objective functions and evaluation metrics, and can be used for classification, regression, and ranking tasks.
Disadvantages of XGBoost
1. Complexity:
XGBoost has many hyperparameters (learning rate, tree depth, regularization terms, subsampling rates), which makes tuning and interpretation more difficult.
2. Computationally Intensive:
Training can demand significant memory and compute on very large datasets, especially during extensive hyperparameter searches.
Applications of XGBoost
1. Kaggle Competitions:
XGBoost is a favorite among data scientists in Kaggle competitions due to its high
accuracy and performance.
2. Credit Scoring:
Widely used in the finance industry for predicting credit risk and assessing loan
eligibility.
Typical applications: AdaBoost is used for spam detection and face detection, while XGBoost is common in Kaggle competitions, credit scoring, and customer churn prediction.
AdaBoost is simpler and often used with decision stumps to create sequential weak
learners, emphasizing the misclassified data.
XGBoost uses gradient boosting to iteratively improve the model using decision trees, with
an emphasis on regularization and gradient optimization to reduce overfitting and increase
generalization performance.
Both algorithms are highly effective for classification and regression tasks, and their choice
depends on the problem, computational resources, and data characteristics.
Performance Metrics for Classification
There are many ways of measuring classification performance. Accuracy, confusion matrix, log-loss, and AUC-ROC are some of the most popular metrics. Precision and recall are also widely used metrics for classification problems.
When a model gives an accuracy rate of 99%, you might think that the model is performing very well, but this is not always true and can be misleading in some situations. Let us explain this with the help of an example.
Suppose we evaluate an image classifier by checking, for every image in the X_test data, whether the prediction matches the true label. Eventually, we'll have a count of correct and incorrect matches. But in reality, it is very rare that all correct or incorrect matches hold equal value, so one metric alone won't tell the entire story.
Accuracy is useful when the target classes are well balanced, but it is not a good choice for unbalanced classes. Imagine a scenario where our training data contained 99 images of dogs and only 1 image of a cat. A model that always predicts "dog" would then achieve 99% accuracy while being useless for finding cats. In reality, data is often imbalanced, for example in spam email detection, credit card fraud, and medical diagnosis. Hence, if we want a better model evaluation and a full picture of the model's behavior, other metrics such as recall and precision should also be considered.
Confusion Matrix
A confusion matrix breaks predictions down into four outcomes. Using the classic pregnancy-test analogy:
True Positive: We predicted positive and it's true; we predicted that a woman is pregnant and she actually is.
True Negative: We predicted negative and it's true; we predicted that a man is not pregnant and he actually is not.
False Positive (Type 1 Error): We predicted positive and it's false; we predicted that a man is pregnant but he actually is not.
False Negative (Type 2 Error): We predicted negative and it's false; we predicted that a woman is not pregnant but she actually is.
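In practice these four counts can be obtained with scikit-learn's confusion_matrix; the label vectors below are made-up values purely for illustration:

```python
# Confusion matrix for a binary problem: counts of TN, FP, FN, TP.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]  # actual labels (assumed example data)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]  # model predictions (assumed)

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```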
We discussed Accuracy, now let’s discuss some other metrics of the confusion matrix!
Precision
Precision tells us how many of the cases predicted as positive actually turned out to be positive: Precision = TP / (TP + FP).
Precision is useful in cases where a False Positive is a higher concern than a False Negative, for example in music or video recommendation systems and e-commerce websites, where wrong results could lead to customer churn and harm the business.
Recall (Sensitivity)
It explains how many of the actual positive cases we were able to predict correctly with our
model. Recall is a useful metric in cases where False Negative is of higher concern than False
Positive. It is important in medical cases where it doesn’t matter whether we raise a false
alarm but the actual positive cases should not go undetected!
Recall for a label is defined as the number of true positives divided by the total number of actual positives: Recall = TP / (TP + FN).
F1 Score
The F1 score is the harmonic mean of Precision and Recall: F1 = 2 × (Precision × Recall) / (Precision + Recall). It gives a combined picture of the two metrics and is high only when both Precision and Recall are high (it equals both when Precision is equal to Recall).
Because the harmonic mean punishes extreme values, the F1 score can be an effective evaluation metric in the following cases:
When False Positives and False Negatives are both costly and a balance between Precision and Recall is needed.
When the classes are imbalanced and accuracy alone would be misleading.
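A small sketch computing Precision, Recall, and F1 with scikit-learn on a deliberately imbalanced toy example (all labels are made up for illustration):

```python
# Precision, Recall, and F1 on an imbalanced toy example (assumed data).
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0] * 90 + [1] * 10                     # 90 negatives, 10 positives
y_pred = [0] * 88 + [1] * 2 + [1] * 6 + [0] * 4  # 2 FP, 6 TP, 4 FN

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1 score :", f1_score(y_true, y_pred))         # harmonic mean of the two
```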
AUC-ROC
The Receiver Operator Characteristic (ROC) is a probability curve that plots the TPR(True
Positive Rate) against the FPR(False Positive Rate) at various threshold values and separates
the ‘signal’ from the ‘noise’.
The Area Under the Curve (AUC) measures the ability of a classifier to distinguish between classes. Geometrically, it is the area enclosed between the ROC curve and the X-axis.
The greater the AUC, the better the model is at separating the positive and negative classes across different threshold points. When AUC is equal to 1, the classifier can perfectly distinguish between all Positive and Negative class points. When AUC is equal to 0, the classifier predicts all Negatives as Positives and vice versa. When AUC is 0.5, the classifier cannot distinguish between the Positive and Negative classes at all.
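A minimal sketch computing the ROC points and AUC from predicted probabilities with scikit-learn; the scores below are assumed example values:

```python
# ROC curve and AUC from predicted probabilities (assumed example values).
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]  # model's estimate of P(class = 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # TPR vs FPR at each threshold
print("AUC:", roc_auc_score(y_true, y_score))
```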
Log Loss
Log loss (logistic loss) or Cross-Entropy Loss is one of the major metrics used to assess the performance of a classification model that outputs probabilities.
For a single sample with true label y ∈ {0, 1} and a probability estimate p = Pr(y = 1), the log loss is:
Log loss = −[ y · log(p) + (1 − y) · log(1 − p) ]
The overall log loss is the average of this quantity over all samples; lower values indicate better probability estimates.
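A quick sketch with scikit-learn's log_loss; the probabilities are assumed example values:

```python
# Log loss heavily penalizes confident but wrong probability estimates.
from sklearn.metrics import log_loss

y_true = [0, 1, 1, 0]
y_prob = [0.1, 0.9, 0.6, 0.3]  # assumed predicted probabilities of class 1

print("Log loss:", log_loss(y_true, y_prob))
```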
Worked Example: Comparing Classifiers
Consider the task of predicting which customers are likely to subscribe to an offer, with three candidate models:
Logistic Regression: Good for binary classification with a linear decision boundary.
Decision Tree: Offers a more flexible approach by segmenting the feature space into smaller regions.
SVM: Works well in higher-dimensional spaces or when there is a clear margin of separation.
Suppose the models produce the following confusion-matrix counts on the test set:
Decision Tree:
TP: 81
TN: 80
FP: 20
FN: 19
SVM:
TP: 75
TN: 90
FP: 10
FN: 25
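From these counts, Precision, Recall, and F1 can be computed directly. A quick sketch (the Logistic Regression counts are not listed above, so only the two models given are evaluated):

```python
# Precision, recall, and F1 from the confusion-matrix counts listed above.
def metrics(tp, tn, fp, fn):
    # tn is not needed for these three metrics, but is kept to mirror the counts
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print("Decision Tree:", metrics(tp=81, tn=80, fp=20, fn=19))
# -> precision ~0.80, recall ~0.81, F1 ~0.81
print("SVM          :", metrics(tp=75, tn=90, fp=10, fn=25))
# -> precision ~0.88, recall ~0.75, F1 ~0.81
```

These numbers show the trade-off discussed below: the SVM has noticeably higher precision, while the decision tree has higher recall.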
Logistic Regression offers a balance with a good F1-Score and decent performance across
all metrics.
Conclusion
For this task:
If prioritizing the correctness of the positive classifications (ensuring that those predicted
as likely to subscribe are very likely to do so), SVM might be best due to its high precision.
For a balance between precision and recall, Logistic Regression presents a viable option
with the highest F1-Score.
Choosing the right model depends on the business goals and the cost of false positives vs.
false negatives. For instance, if missing out on potential subscribers is more costly than
mistakenly identifying non-subscribers as potential subscribers, a model with higher recall may
be preferable.
Sensitivity (Recall) = TP / (TP + FN). Use: Indicates how well the model can detect positive cases (e.g., detecting actual churners in a dataset).
PPV (Precision) = TP / (TP + FP). Use: Indicates how reliable a positive prediction is (e.g., if the model says someone will churn, what is the chance they actually will?).
NPV (Negative Predictive Value) = TN / (TN + FN). Use: Indicates how reliable a negative prediction is (e.g., if the model says someone will not churn, what is the chance they really won't?).
Summary
Sensitivity: How good is the model at finding actual positives?
PPV (Precision): When the model predicts positive, how often is it correct?
NPV: When the model predicts negative, how often is it correct?
These metrics help you understand the trade-offs in your model's performance and can be
chosen based on what is most important in your specific problem (e.g., minimizing false
positives vs. maximizing detection of positives).
Gains chart
https://community.spotfire.com/articles/spotfire-statistica/gains-vs-roc-curves-do-you-understand-the-difference/
Typically called a Cumulative Gains Chart, it can be simply explained by the following example:
For simplicity let's assume we have 1000 customers. If we run an advertising campaign for all
our customers, we might find that 30% (300 out of 1000) will respond and buy our new
product.
The Gains chart is the visualization of that principle. On the X-axis we have the percentage of the customer base we want to target with the campaign. The Y-axis shows the percentage of all positive responders that are found within that targeted sample. (In the source article's figure, the gains curve associated with the model is drawn as the red curve.)
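A minimal sketch of how a cumulative gains curve can be computed from model scores; the simulated customer data, response rate, and score distributions are all assumptions for illustration:

```python
# Cumulative gains: sort customers by predicted score, then track what share of
# all actual responders is captured as a growing share of the base is targeted.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
y = rng.binomial(1, 0.3, size=n)  # ~30% responders (assumed)
# Assumed scores: responders tend to receive higher scores than non-responders.
scores = y * rng.uniform(0.4, 1.0, n) + (1 - y) * rng.uniform(0.0, 0.7, n)

order = np.argsort(-scores)                # target the best-scored customers first
captured = np.cumsum(y[order]) / y.sum()   # share of all responders found so far
targeted = np.arange(1, n + 1) / n         # share of the customer base targeted

# e.g. share of responders captured when targeting only the top 20% of customers:
print("Gain at top 20%:", captured[int(0.2 * n) - 1])
```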
Confusion matrix
To test our strategy (defined by the model and the targeted percentage or equivalently the cut-
off value) we need to compare the output of the model to the actual results in the real world.
This is done by comparing the results and creating a contingency table of misclassification errors (terminology as used in hypothesis testing: TP means true positive, FN false negative, FP false positive, and TN true negative):

                Predicted YES                        Predicted NO
Observed YES    Count TP (right decision)            Count FN (error of the second kind)
Observed NO     Count FP (error of the first kind)   Count TN (right decision)
Ideally, we want to have the right decisions made with high frequency. Such a table (usually
called a confusion matrix) is a very important decision tool when we evaluate the quality of the
model.
For better orientation, it is common practice to display the confusion matrix as a bar graph. From such a graph we can see how many times the model predicts correctly (true negatives and true positives) and how many times it predicts incorrectly (false positives and false negatives). The better the model, the larger the TP and TN bars are in comparison to the FN and FP bars.
The curves discussed here (ROC, Gains, and Lift) are computed from the information in confusion matrices. It is important to realize that each curve is built from a large number of such confusion matrices, one for each targeted percentage or cut-off value.
ROC curve
Other terms connected with a confusion matrix are Sensitivity and Specificity. They are computed in the following way:
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
On the ROC curve, the Y-axis measures sensitivity (the true positive rate): the percentage of actual positive responders that are correctly predicted. The X-axis measures 1 - specificity (the false positive rate): the percentage of actual non-responders that are incorrectly predicted as positive.
An ideal model would have sensitivity rise to its maximum while specificity stays at 1 throughout. The task is to get the ROC curve of the developed model as close as possible to this ideal curve.
False Positive (FP): When the model incorrectly predicts a positive outcome.
Example: Predicting a customer will churn when they won't. This might lead to
unnecessary retention efforts and increased costs.
False Negative (FN): When the model incorrectly predicts a negative outcome.
Example: Predicting a customer will not churn when they actually do, so no retention effort is made and the customer is lost.
The aim is to adjust the model’s focus to minimize high-cost errors based on the specific use
case.
1. Custom Loss Functions:
For example, if a false negative is more costly than a false positive, assign a higher penalty for false negatives in the loss function.
2. Resampling: Oversample the minority class (e.g., fraud cases) or undersample the majority class.
3. Custom Thresholds:
Adjust the classification threshold to change the balance between precision and
recall.
For example, in a fraud detection system, you may want to reduce false negatives
(missing fraud), even if it means increasing false positives, by setting a lower threshold
for classifying something as "fraud".
False Positive Cost (FP): Annoying a customer by incorrectly flagging their legitimate
transaction as fraud.
False Negative Cost (FN): Failing to identify an actual fraud, leading to financial loss.
In this case, False Negative Cost is much higher, and hence, the model should be adjusted to
prioritize reducing false negatives, even at the cost of more false positives.
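A small sketch of such a threshold adjustment on predicted probabilities; the 0.3 cut-off and the example probabilities are assumptions, not recommended values:

```python
# Lowering the decision threshold flags more transactions as "fraud", trading
# extra false positives for fewer false negatives. Values are assumed examples.
import numpy as np

y_prob = np.array([0.05, 0.20, 0.35, 0.55, 0.80, 0.45])  # P(fraud) per transaction

default_pred = (y_prob >= 0.5).astype(int)  # standard 0.5 threshold
lowered_pred = (y_prob >= 0.3).astype(int)  # lower threshold -> higher recall

print("Default threshold:", default_pred)   # [0 0 0 1 1 0]
print("Lowered threshold:", lowered_pred)   # [0 0 1 1 1 1]
```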
Cost-Benefit Analysis
To carry out a cost-benefit analysis of a classifier:
1. Define Costs and Benefits:
Costs: Define the financial or operational cost of a False Positive and a False Negative.
Benefits: Define the value associated with True Positive and True Negative predictions.
2. Build a Cost-Benefit Matrix:
This is similar to a confusion matrix but includes the estimated costs and benefits for each type of outcome.
3. Calculate the Net Gain:
Use the number of True Positives (TP), True Negatives (TN), False Positives (FP), and
False Negatives (FN) to calculate the total cost or benefit.
Net Gain = (Benefit from TP * Number of TP) + (Benefit from TN * Number of TN) -
(Cost of FP * Number of FP) - (Cost of FN * Number of FN)
Example: In credit scoring, where predicting a loan default is the positive class:
True Positive (TP): Correctly identifying someone who would default—Benefit is avoiding a loan loss.
True Negative (TN): Correctly approving a customer who will not default—Benefit is
gaining a profitable customer.
False Positive (FP): Incorrectly rejecting someone who wouldn’t default—Cost is the
opportunity loss of rejecting a good customer.
False Negative (FN): Incorrectly approving someone who ends up defaulting—Cost is the
financial loss from loan default.
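Applying the net-gain formula above to these credit-scoring outcomes, with counts, costs, and benefits that are purely hypothetical illustration values:

```python
# Net gain = benefits from correct decisions minus costs of errors.
# All counts, costs, and benefits below are hypothetical illustration values.
tp, tn, fp, fn = 40, 900, 30, 30  # hypothetical outcome counts

benefit_tp = 5000  # loan loss avoided per correctly rejected defaulter
benefit_tn = 1000  # profit per correctly approved good customer
cost_fp = 800      # opportunity loss per wrongly rejected good customer
cost_fn = 5000     # loss per approved customer who defaults

net_gain = (benefit_tp * tp) + (benefit_tn * tn) - (cost_fp * fp) - (cost_fn * fn)
print("Net gain:", net_gain)  # 200000 + 900000 - 24000 - 150000 = 926000
```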
Summary
1. Misclassification Cost Adjustment is about adjusting your model to minimize errors with high real-world costs. Techniques include custom loss functions, resampling to address imbalances, and threshold tuning to minimize costly errors.
2. Cost-Benefit Analysis is about attaching explicit costs and benefits to each type of outcome (TP, TN, FP, FN) and choosing the model or threshold that maximizes the overall net gain.
Both techniques are crucial for deploying machine learning models that have real-world
implications, ensuring that the model not only performs well on accuracy metrics but also
aligns with the financial and operational priorities of the business or use case.
Unit 4
Cluster Analysis and Clustering Methods
Key Idea: Minimize intra-cluster distance (distance between objects in the same cluster)
and maximize inter-cluster distance (distance between objects in different clusters).