Unit-4: Machine Learning

Decision Tree in Machine Learning

A decision tree is a type of supervised learning algorithm that is
commonly used in machine learning to model and predict outcomes
based on input data. It is a tree-like structure where each internal
node tests an attribute, each branch corresponds to an attribute
value, and each leaf node represents the final decision or prediction.
Decision trees fall under the category of supervised learning and
can be used to solve both regression and classification problems.
Decision Tree Terminologies
There are specialized terms associated with decision trees that
denote various components and facets of the tree structure and the
decision-making procedure:
 Root Node: The topmost node of a decision tree; it
represents the original choice or feature from which the
tree branches out.
 Internal Nodes (Decision Nodes): Nodes in the tree
whose choices are determined by the values of particular
attributes. There are branches on these nodes that go to
other nodes.
 Leaf Nodes (Terminal Nodes): The endpoints of
branches, where the final decisions or predictions are
made. Leaf nodes have no further branches.
 Branches (Edges): Links between nodes that show how
decisions are made in response to particular
circumstances.
 Splitting: The process of dividing a node into two or more
sub-nodes based on a decision criterion. It involves
selecting a feature and a threshold to create subsets of
data.
 Parent Node: A node that is split into child nodes. The
original node from which a split originates.
 Child Node: Nodes created as a result of a split from a
parent node.
 Decision Criterion: The rule or condition used to
determine how the data should be split at a decision node.
It involves comparing feature values against a threshold.
 Pruning: The process of removing branches or nodes from
a decision tree to improve its generalization and prevent
overfitting.
Understanding these terminologies is crucial for interpreting and
working with decision trees in machine learning applications.
How Decision Tree is formed?
The process of forming a decision tree involves recursively
partitioning the data based on the values of different attributes.
The algorithm selects the best attribute to split the data at each
internal node, based on certain criteria such as information gain or
Gini impurity. This splitting process continues until a stopping
criterion is met, such as reaching a maximum depth or having a
minimum number of instances in a leaf node.
Why Decision Tree?
Decision trees are widely used in machine learning for a number of
reasons:
 Decision trees are versatile in simulating intricate
decision-making processes because of their
interpretability and flexibility.
 Their hierarchical structure makes it possible to portray
complex choice scenarios that take into account a variety
of causes and outcomes.
 Because they provide comprehensible insights into the
decision logic, decision trees are especially helpful for
tasks involving classification and regression.
 They are proficient with both numerical and categorical
data, and they can easily adapt to a variety of datasets
thanks to their built-in feature selection capability.
 Decision trees also allow simple visualization, which
helps to comprehend and explain the underlying decision
process of a model.
Decision Tree Approach
A decision tree uses the tree representation to solve a problem in
which each leaf node corresponds to a class label and attributes
are represented on the internal nodes of the tree. Any boolean
function on discrete attributes can be represented using a decision tree.

Below are some assumptions that we make while using a
decision tree:
 At the beginning, we consider the whole training set as the root.
 Feature values are preferred to be categorical. If the values
are continuous then they are discretized prior to building
the model.
 On the basis of attribute values, records are distributed
recursively.
 We use statistical methods for ordering attributes as root
or the internal node.

A decision tree works on the Sum of Product (SOP) form, which is
also known as Disjunctive Normal Form: each root-to-leaf path is a
product (conjunction) of attribute tests, and the whole tree is a sum
(disjunction) of these paths. For example, such a tree could predict
whether a person uses a computer in daily life. In a decision tree,
the major challenge is the identification of the attribute for the
root node at each level.
This process is known as attribute selection. We have two popular
attribute selection measures:
1. Information Gain
2. Gini Index
1. Information Gain:
When we use a node in a decision tree to partition the training
instances into smaller subsets, the entropy changes. Information
gain is a measure of this change in entropy.
Suppose S is a set of instances, A is an attribute, Sv is the subset of
S for which attribute A has value v, and Values(A) is the set of all
possible values of A. Then
Gain(S, A) = Entropy(S) − Σ (|Sv| / |S|) · Entropy(Sv),
where the sum runs over v ∈ Values(A).
Entropy is the measure of uncertainty of a random variable; it
characterizes the impurity of an arbitrary collection of examples.
The higher the entropy, the higher the information content. For a
set S whose instances belong to classes with proportions
p1, p2, …, pc,
Entropy(S) = − Σ pi · log2(pi),
where the sum runs over the classes i = 1, …, c.
Example:
For the set X = {a, a, a, b, b, b, b, b}:
Total instances: 8
Instances of a: 3, so p(a) = 3/8
Instances of b: 5, so p(b) = 5/8
Entropy(X) = −(3/8)·log2(3/8) − (5/8)·log2(5/8) ≈ 0.954
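As a quick check, this entropy can be computed directly in Python (a small illustrative snippet; the helper name entropy is ours):

from collections import Counter
from math import log2

def entropy(values):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the classes present in S
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())

X = list("aaabbbbb")           # 3 instances of 'a', 5 instances of 'b'
print(round(entropy(X), 3))    # prints 0.954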

Building a Decision Tree Using Information Gain: The Essentials
 Start with all training instances associated with the root
node
 Use info gain to choose which attribute to label each node
with
 Note: No root-to-leaf path should contain the same discrete
attribute twice
 Recursively construct each subtree on the subset of
training instances that would be classified down that path
in the tree.
 If all positive or all negative training instances remain,
label that node “yes” or “no” accordingly
 If no attributes remain, label with a majority vote of
training instances left at that node
 If no instances remain, label with a majority vote of the
parent’s training instances.
Example: Now, let us draw a Decision Tree for the following data
using Information gain. Training set: 3 features and 2 classes
X   Y   Z   C
1   1   1   I
1   1   0   I
0   0   1   II
1   0   0   II
Here, we have 3 features and 2 output classes. To build a decision
tree using information gain, we take each feature and calculate the
information gain that results from splitting on it. The overall
entropy of the dataset is 1 (two instances of each class).
Split on feature X: X = 1 gives {I, I, II} and X = 0 gives {II}, so the
weighted entropy is (3/4)·0.918 + (1/4)·0 ≈ 0.69 and the gain is about 0.31.
Split on feature Y: Y = 1 gives {I, I} and Y = 0 gives {II, II}; both
subsets are pure, so the gain is 1.
Split on feature Z: Z = 1 gives {I, II} and Z = 0 gives {I, II}; both
subsets still have entropy 1, so the gain is 0.
The information gain is maximum when we split on feature Y, so
feature Y is the best-suited feature for the root node. Moreover,
splitting on Y already produces pure subsets of the target variable,
so no further splits are needed: the final tree has Y at the root, with
a leaf labeled I for Y = 1 and a leaf labeled II for Y = 0.
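The same result can be reproduced with scikit-learn (a small sketch; the tiny arrays simply encode the table above):

from sklearn.tree import DecisionTreeClassifier, export_text

X_data = [[1, 1, 1],
          [1, 1, 0],
          [0, 0, 1],
          [1, 0, 0]]            # columns: X, Y, Z
y_data = ["I", "I", "II", "II"]

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X_data, y_data)

# The printed rules split on Y at the root, matching the manual calculation.
print(export_text(clf, feature_names=["X", "Y", "Z"]))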

2. Gini Index
 Gini Index is a metric that measures how often a randomly
chosen element would be incorrectly identified if it were
labeled randomly according to the class distribution at a node.
 An attribute with a lower Gini index should therefore be
preferred when splitting.
 Scikit-learn supports the “gini” criterion for the Gini Index,
and it is the default criterion for decision trees.
 The formula for the calculation of the Gini Index is given
below:
Gini = 1 − Σ pi², where pi is the proportion of instances of class i at
the node.
The Gini Index is a measure of the impurity of a distribution,
commonly used in decision trees and other machine learning
algorithms. It is 0 when the node is perfectly pure (all instances
belong to the same class) and grows toward 1 as the classes
become more evenly mixed.
Some additional features and characteristics of the Gini Index
are:
 It is calculated by summing the squared probabilities of
each outcome in a distribution and subtracting the result
from 1.
 A lower Gini Index indicates a more homogeneous or pure
distribution, while a higher Gini Index indicates a more
heterogeneous or impure distribution.
 In decision trees, the Gini Index is used to evaluate the
quality of a split by measuring the difference between the
impurity of the parent node and the weighted impurity of
the child nodes.
 Compared to other impurity measures like entropy, the Gini
Index is faster to compute and more sensitive to changes
in class probabilities.
 One disadvantage of the Gini Index is that it tends to favor
splits that create equally sized child nodes, even if they are
not optimal for classification accuracy.
 In practice, the choice between using the Gini Index or
other impurity measures depends on the specific problem
and dataset, and often requires experimentation and
tuning.
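To make the formula concrete, here is a small sketch that computes the Gini impurity of a node and the weighted impurity of a candidate split (the helper names are ours):

from collections import Counter

def gini(labels):
    # Gini = 1 - sum(p_i^2) over the classes present at the node
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_of_split(*children):
    # Weighted Gini impurity of the child nodes produced by a split
    total = sum(len(ch) for ch in children)
    return sum(len(ch) / total * gini(ch) for ch in children)

parent = ["yes", "yes", "yes", "no", "no"]
left, right = ["yes", "yes", "yes"], ["no", "no"]
print(round(gini(parent), 2))        # 0.48 for a 3/2 class mix
print(gini_of_split(left, right))    # 0.0 -- this split yields pure children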
Example of a Decision Tree Algorithm
Forecasting Activities Using Weather Information
 Root node: Whole dataset
 Attribute: “Outlook” (sunny, overcast, rainy).
 Subsets: Sunny, Overcast, and Rainy.
 Recursive Splitting: Divide the sunny subset even more
according to humidity, for example.
 Leaf Nodes: Activities include “swimming,” “hiking,” and
“staying inside.”
Beginning with the entire dataset as the root node of the
decision tree:
 Determine the best attribute to split the dataset based on
information gain, which is calculated by the formula:
Information gain = Entropy(parent) – [Weighted average] *
Entropy(children), where entropy is a measure of impurity
or disorder of a set of examples, and the weighted average
is based on the number of examples in each child node.
 Create a new internal node that corresponds to the best
attribute and connects it to the root node. For example, if
the best attribute is “outlook” (which can have values
“sunny”, “overcast”, or “rainy”), we create a new node
labeled “outlook” and connect it to the root node.
 Partition the dataset into subsets based on the values of
the best attribute. For example, we create three subsets:
one for instances where the outlook is “sunny”, one for
instances where the outlook is “overcast”, and one for
instances where the outlook is “rainy”.
 Recursively repeat steps 1-4 for each subset until all
instances in a given subset belong to the same class or no
further splitting is possible. For example, if the subset of
instances where the outlook is “overcast” contains only
instances where the activity is “hiking”, we assign a leaf
node labeled “hiking” to this subset. If the subset of
instances where the outlook is “sunny” is further split
based on the humidity attribute, we repeat steps 2-4 for
this subset.
 Assign a leaf node to each subset that contains instances
that belong to the same class. For example, if the subset of
instances where the outlook is “rainy” contains only
instances where the activity is “stay inside”, we assign a
leaf node labeled “stay inside” to this subset.
 Make predictions based on the decision tree by traversing
it from the root node to a leaf node that corresponds to the
instance being classified. For example, if the outlook is
“sunny” and the humidity is “high”, we traverse the
decision tree by following the “sunny” branch and then the
“high humidity” branch, and we end up at a leaf node
labeled “swimming”, which is our predicted activity.
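The walkthrough above can be mimicked in code on a tiny, made-up weather dataset (the rows below are our own illustrative assumptions, chosen to be consistent with the example):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "outlook":  ["sunny", "sunny", "overcast", "overcast", "rainy", "rainy"],
    "humidity": ["high", "normal", "high", "normal", "high", "normal"],
    "activity": ["swimming", "hiking", "hiking", "hiking", "stay inside", "stay inside"],
})

X = pd.get_dummies(data[["outlook", "humidity"]])   # one-hot encode the categorical attributes
y = data["activity"]

clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Classify a new day: outlook = sunny, humidity = high.
query = pd.DataFrame([{"outlook": "sunny", "humidity": "high"}])
query = pd.get_dummies(query).reindex(columns=X.columns, fill_value=0)
print(clf.predict(query)[0])   # expected: "swimming"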
Advantages of Decision Tree
 Easy to understand and interpret, making them accessible
to non-experts.
 Handle both numerical and categorical data without
requiring extensive preprocessing.
 Provides insights into feature importance for decision-
making.
 Handle missing values and outliers without significant
impact.
 Applicable to both classification and regression tasks.
Disadvantages of Decision Tree
 Potential for overfitting the training data.
 Sensitivity to small changes in the data, and limited
generalization if the training data is not representative.
 Potential bias in the presence of imbalanced data.
Conclusion
Decision trees, a key tool in machine learning, model and predict
outcomes based on input data through a tree-like structure. They
offer interpretability, versatility, and simple visualization, making
them valuable for both categorization and regression tasks. While
decision trees have advantages like ease of understanding, they
may face challenges such as overfitting. Understanding their
terminologies and formation process is essential for effective
application in diverse scenarios.
Frequently Asked Questions (FAQs)
1. What are the major issues in decision tree learning?
Major issues in decision tree learning include overfitting, sensitivity
to small data changes, and limited generalization. Ensuring proper
pruning, tuning, and handling imbalanced data can help mitigate
these challenges for more robust decision tree models.
2. How does decision tree help in decision making?
Decision trees aid decision-making by representing complex
choices in a hierarchical structure. Each node tests specific
attributes, guiding decisions based on data values. Leaf nodes
provide final outcomes, offering a clear and interpretable path for
decision analysis in machine learning.
3. What is the maximum depth of a decision tree?
The maximum depth of a decision tree is a hyperparameter that
determines the maximum number of levels or nodes from the root
to any leaf. It controls the complexity of the tree and helps prevent
overfitting.
4. What is the concept of decision tree?
A decision tree is a supervised learning algorithm that models
decisions based on input features. It forms a tree-like structure
where each internal node represents a decision based on an
attribute, leading to leaf nodes representing outcomes.
5. What is entropy in decision tree?
In decision trees, entropy is a measure of impurity or disorder
within a dataset. It quantifies the uncertainty associated with
classifying instances, guiding the algorithm to make informative
splits for effective decision-making.
6. What are the Hyperparameters of decision tree?
1. Max Depth: Maximum depth of the tree.
2. Min Samples Split: Minimum samples required to split an
internal node.
3. Min Samples Leaf: Minimum samples required in a leaf
node.
4. Criterion: The function used to measure the quality of a
split
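For reference, these hyperparameters map directly onto scikit-learn's DecisionTreeClassifier; a minimal sketch (the particular values are arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = DecisionTreeClassifier(
    max_depth=3,            # maximum depth of the tree
    min_samples_split=4,    # minimum samples required to split an internal node
    min_samples_leaf=2,     # minimum samples required in a leaf node
    criterion="gini",       # function used to measure the quality of a split
    random_state=0,
)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))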
Ensemble Learning and
Random forest


Ensemble learning

Ensemble learning is the process by which multiple


models, such as classifiers or experts, are strategically
generated and combined to solve a particular
computational intelligence problem. Ensemble learning is
primarily used to improve the (classification, prediction,
function approximation, etc.) performance of a model, or
reduce the likelihood of an unfortunate selection of a poor
one. Other applications of ensemble learning include
assigning a confidence to the decision made by the model,
selecting optimal (or near-optimal) features, data fusion,
incremental learning, non-stationary learning, and error
correction.

Commonly used Ensemble learning techniques

1. Bagging: Bagging (bootstrap aggregating) trains similar learners
on random samples of the population and then averages (or votes
over) their predictions. In generalized bagging, you can use
different learners on different populations. As you would expect,
this helps to reduce the variance error.

2. Boosting: Boosting is an iterative technique that adjusts the
weight of an observation based on the last classification. If an
observation was classified incorrectly, it increases the weight of
this observation, and vice versa. Boosting in general decreases
the bias error and builds strong predictive models. However,
boosted models may sometimes overfit the training data.
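Both techniques are available in scikit-learn; a brief sketch comparing them on a synthetic dataset (the dataset and settings are illustrative only):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Bagging: many trees fit on bootstrap samples, combined by voting -> reduces variance.
bagging = BaggingClassifier(n_estimators=100, random_state=0)   # base estimator defaults to a decision tree

# Boosting: estimators fit sequentially, re-weighting misclassified observations -> reduces bias.
boosting = AdaBoostClassifier(n_estimators=100, random_state=0)

print("bagging :", cross_val_score(bagging, X, y, cv=5).mean())
print("boosting:", cross_val_score(boosting, X, y, cv=5).mean())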
Random Forest

Random Forest is a flexible, easy-to-use machine learning
algorithm that produces a great result most of the time, even
without hyper-parameter tuning. It is also one of the most widely
used algorithms, because of its simplicity and the fact that it can
be used for both classification and regression tasks.

Random Forest is an ensemble machine learning algorithm


that follows the bagging technique. It is an extension of the
bagging estimator algorithm. The base estimators in
random forest are decision trees. Unlike bagging meta
estimator, random forest randomly selects a set of features
which are used to decide the best split at each node of the
decision tree.
Random Forest adds randomness to the model while forming the
decision trees. Instead of searching for the most important feature
among all features when splitting a node, it searches for the best
feature among a random subset of features. This results in greater
diversity, which generally makes the model better. The model can
be made even more random by using random thresholds for each
feature rather than searching for the best possible threshold.
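In scikit-learn this per-split feature subsampling is controlled by max_features; a short sketch (values are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees in the forest
    max_features="sqrt",   # size of the random feature subset considered at each split
    random_state=0,
)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
print("feature importances (first 5):", forest.feature_importances_.round(2)[:5])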

Applications

1. The random forest algorithm is used in a lot of


different fields, like Banking, Stock Market, Medicine
and E-Commerce.
2. In banking, it is used, for example, to identify
customers who will use the bank’s services more
frequently than others and repay their debt on time.
In this domain it is also used to detect fraudulent
customers who want to scam the bank.
3. In finance, it is used to determine a stock’s
behaviour in the future.

Conclusion

Random Forests are hard to beat in terms of performance.


Of course you can probably always find a model that performs
better, like a neural network, but such models usually take much
more time to develop. And on top of that, random forests can
handle a lot of different feature types, like binary, categorical and
numerical.

Overall, Random Forest is a (mostly) fast, simple and flexible tool,
although it has its limitations: a large number of trees can make
the algorithm too slow and ineffective for real-time predictions.

Gradient Boosting
Gradient Boosting is a popular boosting algorithm in machine
learning used for classification and regression tasks. Boosting is
one kind of ensemble Learning method which trains the model
sequentially and each new model tries to correct the previous
model. It combines several weak learners into strong learners.
The two most popular boosting algorithms are:
1. AdaBoost
2. Gradient Boosting
Gradient Boosting
Gradient Boosting is a powerful boosting algorithm that combines
several weak learners into strong learners, in which each new
model is trained to minimize the loss function such as mean
squared error or cross-entropy of the previous model using gradient
descent. In each iteration, the algorithm computes the gradient of
the loss function with respect to the predictions of the current
ensemble and then trains a new weak model to minimize this
gradient. The predictions of the new model are then added to the
ensemble, and the process is repeated until a stopping criterion is
met.
In contrast to AdaBoost, the weights of the training instances are
not tweaked, instead, each predictor is trained using the residual
errors of the predecessor as labels. There is a technique called
the Gradient Boosted Trees whose base learner is CART
(Classification and Regression Trees). The below diagram explains
how gradient-boosted trees are trained for regression problems.

Gradient Boosted Trees for Regression

The ensemble consists of M trees. Tree1 is trained using the


feature matrix X and the labels y. The predictions
labeled y1(hat) are used to determine the training set residual
errors r1. Tree2 is then trained using the feature matrix X and the
residual errors r1 of Tree1 as labels. The predicted
results r1(hat) are then used to determine the residual r2. The
process is repeated until all the M trees forming the ensemble are
trained. There is an important parameter used in this technique
known as Shrinkage. Shrinkage refers to the fact that the
prediction of each tree in the ensemble is shrunk by multiplying it
by the learning rate (eta), which ranges between 0 and 1. There is
a trade-off between eta and the number of estimators: decreasing
the learning rate must be compensated for by increasing the
number of estimators in order to reach a certain model
performance. Once all the trees are trained, predictions can be
made. Each tree predicts a correction to the previous prediction,
and the final prediction is given by the formula:
y(pred) = y1 + eta * r1(hat) + eta * r2(hat) + … + eta * rM(hat)
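The residual-fitting loop described above can be written out by hand with scikit-learn's DecisionTreeRegressor; a minimal sketch (the number of trees, depth and eta are arbitrary choices):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

eta, n_trees = 0.1, 100
prediction = np.full_like(y, y.mean())   # y1: a simple initial prediction
trees = []

for _ in range(n_trees):
    residuals = y - prediction                    # r_m: what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    trees.append(tree)
    prediction += eta * tree.predict(X)           # shrink each tree's contribution by eta

print("training MSE:", np.mean((y - prediction) ** 2))

scikit-learn's GradientBoostingRegressor packages this same loop (plus many refinements) behind a single estimator.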
Clustering in Machine Learning

Introduction to Clustering: It is basically a type of unsupervised


learning method. An unsupervised learning method is a method in
which we draw references from datasets consisting of input data
without labeled responses. Generally, it is used as a process to find
meaningful structure, explanatory underlying processes, generative
features, and groupings inherent in a set of examples.
Clustering is the task of dividing the population or data points into
a number of groups such that data points in the same group are
more similar to each other than to the data points in other groups.
In essence, it is a grouping of objects on the basis of the similarity
and dissimilarity between them.
For example, data points that lie close together in a scatter plot can
be classified into one single group. In such a plot we can often
distinguish the clusters visually, for instance identifying that there
are 3 clusters.

It is not necessary for clusters to be spherical. For example:
DBSCAN: Density-Based Spatial Clustering of Applications
with Noise
In DBSCAN, data points are clustered on the basis of density: a
point is assigned to a cluster when it lies within a given distance of
enough neighbouring points, and points that meet no such
constraint are treated as outliers (noise). Various distance measures
and techniques are used for the identification of these outliers.
Why Clustering?
Clustering is very much important as it determines the intrinsic
grouping among the unlabelled data present. There are no criteria
for good clustering. It depends on the user, and what criteria they
may use which satisfy their need. For instance, we could be
interested in finding representatives for homogeneous groups (data
reduction), finding “natural clusters” and describing their unknown
properties (“natural” data types), in finding useful and suitable
groupings (“useful” data classes) or in finding unusual data objects
(outlier detection). Every clustering algorithm must make some
assumptions about what constitutes the similarity of points, and
each set of assumptions produces different, but equally valid, clusters.
Clustering Methods:
 Density-Based Methods: These methods consider the
clusters as the dense region having some similarities and
differences from the lower dense region of the space. These
methods have good accuracy and the ability to merge two
clusters. Example DBSCAN (Density-Based Spatial
Clustering of Applications with Noise), OPTICS (Ordering
Points to Identify Clustering Structure), etc.
 Hierarchical Based Methods: The clusters formed in this
method form a tree-type structure based on the hierarchy.
New clusters are formed using the previously formed ones.
It is divided into two categories:
 Agglomerative (bottom-up approach)
 Divisive (top-down approach)
Examples: CURE (Clustering Using Representatives), BIRCH
(Balanced Iterative Reducing and Clustering using Hierarchies), etc.
 Partitioning Methods: These methods partition the
objects into k clusters and each partition forms one cluster.
They optimize an objective criterion (a similarity function),
typically one in which distance is the major parameter.
Examples: K-means, CLARANS (Clustering Large
Applications based upon Randomized Search), etc.
 Grid-based Methods: In this method, the data space is
divided into a finite number of cells that form a grid-like
structure. All the clustering operations performed on these
grids are fast and independent of the number of data objects.
Examples: STING (Statistical Information Grid), WaveCluster,
CLIQUE (CLustering In Quest), etc.
Clustering Algorithms: The K-means clustering algorithm is the
simplest unsupervised learning algorithm that solves the clustering
problem. The K-means algorithm partitions n observations into k
clusters, where each observation belongs to the cluster with the
nearest mean, which serves as a prototype of the cluster.
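A minimal K-means usage sketch with scikit-learn (the synthetic blobs are just for illustration):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("cluster sizes:", [list(kmeans.labels_).count(c) for c in range(3)])
print("centroids    :", kmeans.cluster_centers_.round(2))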
Applications of Clustering in different fields:
1. Marketing: It can be used to characterize & discover
customer segments for marketing purposes.
2. Biology: It can be used for classification among different
species of plants and animals.
3. Libraries: It is used in clustering different books on the
basis of topics and information.
4. Insurance: It is used to group customers and their
policies and to identify fraud.
5. City Planning: It is used to make groups of houses and to
study their values based on their geographical locations and
other factors present.
6. Earthquake studies: By learning the earthquake-affected
areas we can determine the dangerous zones.
7. Image Processing: Clustering can be used to group similar
images together, classify images based on content, and
identify patterns in image data.
8. Genetics: Clustering is used to group genes that have
similar expression patterns and identify gene networks that
work together in biological processes.
9. Finance: Clustering is used to identify market segments
based on customer behavior, identify patterns in stock
market data, and analyze risk in investment portfolios.
10. Customer Service: Clustering is used to group customer
inquiries and complaints into categories, identify common
issues, and develop targeted solutions.
11. Manufacturing: Clustering is used to group similar
products together, optimize production processes, and
identify defects in manufacturing processes.
12. Medical diagnosis: Clustering is used to group patients
with similar symptoms or diseases, which helps in making
accurate diagnoses and identifying effective treatments.
13. Fraud detection: Clustering is used to identify
suspicious patterns or anomalies in financial transactions,
which can help in detecting fraud or other financial crimes.
14. Traffic analysis: Clustering is used to group similar
patterns of traffic data, such as peak hours, routes, and
speeds, which can help in improving transportation planning
and infrastructure.
15. Social network analysis: Clustering is used to identify
communities or groups within social networks, which can
help in understanding social behavior, influence, and trends.
16. Cybersecurity: Clustering is used to group similar
patterns of network traffic or system behavior, which can
help in detecting and preventing cyberattacks.
17. Climate analysis: Clustering is used to group similar
patterns of climate data, such as temperature, precipitation,
and wind, which can help in understanding climate change
and its impact on the environment.
18. Sports analysis: Clustering is used to group similar
patterns of player or team performance data, which can
help in analyzing player or team strengths and weaknesses
and making strategic decisions.
19. Crime analysis: Clustering is used to group similar
patterns of crime data, such as location, time, and type,
which can help in identifying crime hotspots, predicting
future crime trends, and improving crime prevention
strategies.

Elbow Method for Optimal Value of k in K-Means

Prerequisites: K-Means Clustering
In this article, we will discuss how to select the best k (Number of
clusters) in the k-Means clustering algorithm.

Introduction To Elbow Method


A fundamental step for any unsupervised algorithm is to determine
the optimal number of clusters into which the data may be
clustered. Since we do not have any predefined number of clusters
in unsupervised learning, we tend to use some method that can
help us decide the best number of clusters. In the case of K-Means
clustering, we use the Elbow Method to choose the best number of
clusters.

What Is the Elbow Method in K-Means Clustering?
As we know, in the k-means clustering algorithm we randomly
initialize k clusters and iteratively adjust them until the k centroids
reach an equilibrium state. However, the main thing we do before
initializing these clusters is to determine how many clusters to use.
To determine K (the number of clusters) we use the Elbow
Method. The Elbow Method is a technique that we use to determine
the number of centroids (k) to use in a k-means clustering algorithm.
In this method, to determine the k-value we iterate over k = 1 to
k = n (here n is a hyperparameter that we choose as per our
requirement). For every value of k, we calculate the within-cluster
sum of squares (WCSS) value.
WCSS is defined as the sum of squared distances between each
point and the centroid of its cluster.
To determine the best number of clusters (k), we plot a graph of k
versus the corresponding WCSS value. The graph typically looks
like an elbow: when k = 1 the WCSS has its highest value, and as
k increases the WCSS starts to decrease. We choose the value of k
at the bend of the elbow, i.e. the point after which the curve starts
to flatten into a straight line.
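A short sketch of the elbow plot using KMeans's inertia_ attribute (which is exactly the WCSS of the fitted model); the data and range of k are illustrative:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]

plt.plot(ks, wcss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow Method")
plt.show()   # the bend of the elbow (here around k = 4) suggests the best k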
DBSCAN Clustering in ML | Density-Based Clustering

Clustering analysis or simply Clustering is basically an Unsupervised
learning method that divides the data points into a number of
specific batches or groups, such that the data points in the same
groups have similar properties and data points in different groups
have different properties in some sense. Clustering comprises many
different methods based on different notions of distance or similarity.
E.g. K-Means (distance between points), Affinity propagation (graph
distance), Mean-shift (distance between points), DBSCAN (distance
between nearest points), Gaussian mixtures (Mahalanobis distance
to centers), Spectral clustering (graph distance), etc.
Fundamentally, all clustering methods use the same approach i.e.
first we calculate similarities and then we use it to cluster the data
points into groups or batches. Here we will focus on the Density-
based spatial clustering of applications with noise (DBSCAN)
clustering method.
Density-Based Spatial Clustering Of
Applications With Noise (DBSCAN)
Clusters are dense regions in the data space, separated by regions
of the lower density of points. The DBSCAN algorithm is based on
this intuitive notion of “clusters” and “noise”. The key idea is that for
each point of a cluster, the neighborhood of a given radius has to
contain at least a minimum number of points.
Why DBSCAN?
Partitioning methods (K-means, PAM clustering) and hierarchical
clustering work for finding spherical-shaped clusters or convex
clusters. In other words, they are suitable only for compact and well-
separated clusters. Moreover, they are also severely affected by the
presence of noise and outliers in the data.
Real-life data may contain irregularities, like:
1. Clusters can be of arbitrary shape such as those shown in
the figure below.
2. Data may contain noise.
Consider a data set containing non-convex clusters and outliers:
given such data, the k-means algorithm has difficulty identifying
clusters with these arbitrary shapes.

Parameters Required For DBSCAN Algorithm

1. eps: It defines the neighborhood around a data point i.e. if


the distance between two points is lower or equal to ‘eps’
then they are considered neighbors. If the eps value is
chosen too small then a large part of the data will be
considered as an outlier. If it is chosen very large then the
clusters will merge and the majority of the data points will
be in the same clusters. One way to find the eps value is
based on the k-distance graph.
2. MinPts: Minimum number of neighbors (data points) within
the eps radius. The larger the dataset, the larger the value of
MinPts that should be chosen. As a general rule, the minimum
MinPts can be derived from the number of dimensions D in
the dataset as MinPts >= D + 1, and MinPts should be at
least 3.

In this algorithm, we have 3 types of data points.


Core Point: A point is a core point if it has more than MinPts points
within eps.
Border Point: A point which has fewer than MinPts within eps but it
is in the neighborhood of a core point.
Noise or outlier: A point which is not a core point or border point.

Steps Used In DBSCAN Algorithm


1. Find all the neighbor points within eps and identify the core
points, i.e. the points that have more than MinPts neighbors.
2. For each core point, if it is not already assigned to a cluster,
create a new cluster.
3. Recursively find all of its density-connected points and assign
them to the same cluster as the core point.
Two points a and b are said to be density-connected if there
exists a point c that has a sufficient number of points in its
neighborhood and both a and b are within eps distance of c.
This is a chaining process: if b is a neighbor of c, c is a
neighbor of d, and d is a neighbor of e, which in turn is a
neighbor of a, then b is density-connected to a.
4. Iterate through the remaining unvisited points in the
dataset. Those points that do not belong to any cluster are
noise.
Pseudocode For DBSCAN Clustering Algorithm
DBSCAN(dataset, eps, MinPts){
   # cluster index
   C = 0
   for each unvisited point p in dataset {
      mark p as visited
      # find neighbors
      N = points within eps distance of p
      if |N| < MinPts:
         mark p as noise
      else {
         C = C + 1
         add p to cluster C
         # expand the cluster through density-connected points
         for each point p' in N {
            if p' is unvisited {
               mark p' as visited
               N' = points within eps distance of p'
               if |N'| >= MinPts:
                  N = N U N'
            }
            if p' is not yet a member of any cluster:
               add p' to cluster C
         }
      }
   }
}
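In practice, scikit-learn's DBSCAN implements this algorithm; a brief sketch on a two-moons dataset where k-means would struggle (eps and min_samples are illustrative):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                      # -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
print("noise points  :", np.sum(labels == -1))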

Spectral Clustering
Spectral Clustering is a variant of clustering that uses the
connectivity between the data points to form the clusters. It uses
the eigenvalues and eigenvectors of a matrix derived from the data
to project the data points into a lower-dimensional space in which
they are clustered. It is based on the idea of a graph representation
of the data, where the data points are represented as nodes and
the similarity between data points is represented by an edge.
Steps performed for spectral Clustering

Building the Similarity Graph Of The Data: This step builds the
Similarity Graph in the form of an adjacency matrix which is
represented by A. The adjacency matrix can be built in the
following manners:
 Epsilon-neighbourhood Graph: A parameter epsilon is
fixed beforehand. Then, each point is connected to all the
points which lie in its epsilon-radius. If all the distances
between any two points are similar in scale then typically
the weights of the edges ie the distance between the two
points are not stored since they do not provide any
additional information. Thus, in this case, the graph built is
an undirected and unweighted graph.
 K-Nearest Neighbours A parameter k is fixed
beforehand. Then, for two vertices u and v, an edge is
directed from u to v only if v is among the k-nearest
neighbours of u. Note that this leads to the formation of a
weighted and directed graph because it is not always the
case that for each u having v as one of the k-nearest
neighbours, it will be the same case for v having u among
its k-nearest neighbours. To make this graph undirected,
one of the following approaches is followed:
1. Direct an edge from u to v and from v to u if either
v is among the k-nearest neighbours of u OR u is
among the k-nearest neighbours of v.
2. Direct an edge from u to v and from v to u if v is
among the k-nearest neighbours of u AND u is
among the k-nearest neighbours of v.
 Fully-Connected Graph: To build this graph, each point is
connected to every other point by an undirected edge
weighted by the distance between the two points. Since this
approach is used to model local neighbourhood
relationships, the Gaussian similarity metric is typically used
to calculate the distance.
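scikit-learn's SpectralClustering builds such a similarity graph internally (affinity='nearest_neighbors' corresponds to the k-nearest-neighbours construction above); a minimal sketch on synthetic data:

from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_circles

X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)

sc = SpectralClustering(
    n_clusters=2,
    affinity="nearest_neighbors",   # build a k-NN similarity graph
    n_neighbors=10,
    random_state=0,
)
labels = sc.fit_predict(X)
print("points per cluster:", [list(labels).count(c) for c in range(2)])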

ML | V-Measure for Evaluating Clustering Performance

One of the primary disadvantages of any clustering technique is that
it is difficult to evaluate its performance. To tackle this problem, the
metric of V-Measure was developed. The calculation of the V-
Measure first requires the calculation of two terms:-
1. Homogeneity: A perfectly homogeneous clustering is one
where each cluster has data-points belonging to the same
class label. Homogeneity describes the closeness of the
clustering algorithm to this perfection.
2. Completeness: A perfectly complete clustering is one
where all data-points belonging to the same class are
clustered into the same cluster. Completeness describes the
closeness of the clustering algorithm to this perfection.
Trivial Homogeneity: It is the case when the number of clusters is
equal to the number of data points and each point is in exactly one
cluster. It is the extreme case when homogeneity is highest while
completeness is minimum.
Trivial Completeness: It is the case when all the data points are
clustered into one cluster. It is the extreme case when homogeneity
is minimum and completeness is maximum.

Assume that, in the Trivial Homogeneity and Trivial Completeness
cases described above, each data point has a different class label.
Note: homogeneity differs from completeness in the following
sense. When talking about homogeneity, the base concept is the
respective cluster: we check whether, in each cluster, every data
point has the same class label. When talking about completeness,
the base concept is the respective class label: we check whether all
the data points of each class label are in the same cluster.
A clustering can be perfectly homogeneous (in each cluster, the
data points all have the same class label) and yet not complete,
because not all data points of the same class label belong to the
same cluster. Conversely, a clustering can be perfectly complete
(all data points of the same class label belong to the same cluster)
and yet not homogeneous, because a single cluster may contain
data points of many different class labels.
Let us assume that there are N data samples, C different class
labels, K clusters, and n(c, k) data points belonging to class c and
cluster k. Writing n(c) for the total number of points with class
label c and n(k) for the total number of points in cluster k, the
homogeneity h is given by:
h = 1 − H(C|K) / H(C)
where H(C|K) = − Σk Σc (n(c, k) / N) · log(n(c, k) / n(k)) and
H(C) = − Σc (n(c) / N) · log(n(c) / N).
The completeness c is given by:
c = 1 − H(K|C) / H(K)
where H(K|C) = − Σc Σk (n(c, k) / N) · log(n(c, k) / n(c)) and
H(K) = − Σk (n(k) / N) · log(n(k) / N).
Thus the weighted V-Measure V(β) is given by:
V(β) = ((1 + β) · h · c) / (β · h + c)
The factor β can be adjusted to favour either the homogeneity or
the completeness of the clustering algorithm. The primary
advantage of this evaluation metric is that it is independent of the
number of class labels, the number of clusters, the size of the data
and the clustering algorithm used, which makes it a very reliable
metric. The following code demonstrates how to compute the
V-Measure of a clustering algorithm. The data used is the Detection
of Credit Card Fraud dataset, which can be downloaded from
Kaggle. The clustering algorithm used is K-Means. Step 1:
Importing the required libraries
 Python3
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.cluster import KMeans

from sklearn.metrics import v_measure_score

Step 2: Loading and Cleaning the data


 Python3
# Changing the working location to the location of the file

import os
os.chdir(r'C:\Users\Dev\Desktop\Kaggle\Credit Card Fraud')

# Loading the data

df = pd.read_csv('creditcard.csv')

# Separating the dependent and independent variables

y = df['Class']

X = df.drop('Class', axis = 1)

X.head()
Step 3: Building different clustering models and comparing
their V-Measure scores
In this step, 5 different K-Means clustering models will be built,
with each model clustering the data into a different number of
clusters.
 Python3
# List of V-Measure Scores for different models

v_scores = []

# List of different numbers of clusters to try

N_Clusters = [2, 3, 4, 5, 6]

a) n_clusters = 2
 Python3
# Building the clustering model

kmeans2 = KMeans(n_clusters = 2)

# Training the clustering model

kmeans2.fit(X)

# Storing the predicted Clustering labels

labels2 = kmeans2.predict(X)

# Evaluating the performance

v_scores.append(v_measure_score(y, labels2))

b) n_clusters = 3
 Python3
# Building the clustering model

kmeans3 = KMeans(n_clusters = 3)

# Training the clustering model

kmeans3.fit(X)

# Storing the predicted Clustering labels

labels3 = kmeans3.predict(X)

# Evaluating the performance

v_scores.append(v_measure_score(y, labels3))

c) n_clusters = 4
 Python3
# Building the clustering model

kmeans4 = KMeans(n_clusters = 4)

# Training the clustering model

kmeans4.fit(X)

# Storing the predicted Clustering labels

labels4 = kmeans4.predict(X)

# Evaluating the performance

v_scores.append(v_measure_score(y, labels4))

d) n_clusters = 5
 Python3
# Building the clustering model

kmeans5 = KMeans(n_clusters = 5)
# Training the clustering model

kmeans5.fit(X)

# Storing the predicted Clustering labels

labels5 = kmeans5.predict(X)

# Evaluating the performance

v_scores.append(v_measure_score(y, labels5))

e) n_clusters = 6
 Python3
# Building the clustering model

kmeans6 = KMeans(n_clusters = 6)

# Training the clustering model

kmeans6.fit(X)

# Storing the predicted Clustering labels

labels6 = kmeans6.predict(X)

# Evaluating the performance

v_scores.append(v_measure_score(y, labels6))

Step 4: Visualizing the results and comparing the


performances
 Python3
# Plotting a Bar Graph to compare the models

plt.bar(N_Clusters, v_scores)

plt.xlabel('Number of Clusters')

plt.ylabel('V-Measure Score')
plt.title('Comparison of different Clustering Models')

plt.show()

Adjusted Rand Index (ARI)



The Adjusted Rand Index (ARI) is a measure of the similarity


between two data clusterings. It is a correction of the Rand Index,
which is a basic measure of similarity between two clusterings,
but it has the disadvantage of being sensitive to chance. The
Adjusted Rand Index takes into account the fact that some
agreement between two clusterings can occur by chance, and it
adjusts the Rand Index to account for this possibility. It is
calculated as follows:

1. Let N be the number of samples in the data set.

2. Let C1 and C2 be two different clusterings of the data set.

3. Let a be the number of pairs of samples that are placed in the
same cluster in both C1 and C2.

4. Let b be the number of pairs of samples that are placed in
different clusters in both C1 and C2.

5. Calculate the Rand Index as RI = (a + b) / (N choose 2),
where (N choose 2) is the total number of possible pairs of samples.

6. Calculate the expected value E of the index for random
clusterings with the same cluster sizes, given by
E = (sum_i(n_i choose 2) * sum_j(n_j choose 2)) / (N choose 2),
where n_i is the number of samples in cluster i of C1 and n_j is the
number of samples in cluster j of C2.

7. Calculate the Adjusted Rand Index as ARI = (RI - E) /
(max(RI) - E), where max(RI) = 1.

The higher the ARI value, the closer the two clusterings are to
each other. The ARI is bounded above by 1: a value of 1 indicates
perfect agreement between the two clusterings, values near 0
indicate agreement no better than random, and negative values
indicate agreement worse than expected by chance. The ARI is
widely used in machine learning, data mining, and pattern
recognition, especially for the evaluation of clustering algorithms.
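With scikit-learn, the ARI is a one-liner via adjusted_rand_score; a quick illustration on hand-written labelings:

from sklearn.metrics import adjusted_rand_score

true_labels      = [0, 0, 0, 1, 1, 1, 2, 2, 2]
predicted_labels = [1, 1, 1, 0, 0, 0, 2, 2, 2]   # same grouping, permuted cluster ids
shuffled_labels  = [0, 1, 2, 0, 1, 2, 0, 1, 2]

print(adjusted_rand_score(true_labels, predicted_labels))  # 1.0 -- identical partitions
print(adjusted_rand_score(true_labels, shuffled_labels))   # negative (about -0.33): worse than chance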

What is Meta Learning?


Meta-learning, also known as “learning to learn”, refers to
algorithms that aim to create AI systems which can adapt to new
tasks and improve their performance over time, without the need
for extensive retraining.
Meta-learning algorithms typically involve training a model on a
variety of different tasks, with the goal of learning generalizable
knowledge that can be transferred to new tasks. This is different
from traditional machine learning, where a model is typically
trained on a single task and then used for that task alone.
 Meta-learning, also called “learning to learn” algorithms, is
a branch of machine learning that focuses on teaching
models to self-adapt and solve new problems with little to
no human intervention.
 It entails using a different machine learning algorithm that
has already been trained to act as a mentor and transfer
knowledge. Through data analysis, meta-learning gains
insights from this mentor algorithm’s output and improves
the developing algorithm’s ability to solve problems
effectively.
 To increase the flexibility of automatic learning, meta-
learning makes use of algorithmic metadata. It
comprehends how algorithms adjust to a variety of
problems, improving the functionality of current algorithms
and possibly even learning the algorithm itself.
 Meta-learning optimizes learning by using algorithmic
metadata, including performance measures and data-
derived patterns, to strategically learn, select, alter, or
combine algorithms for specific problems.
The process of learning to learn, i.e. the meta-training process, can
be crudely summed up as follows: the model is first trained across
many related tasks (meta-training) and then adapts quickly to a
new, unseen task using only a few examples (meta-testing).

Working of Meta Learning


Training models to quickly adapt to new tasks with minimal data is
the focus of a machine learning paradigm known as “meta-
learning,” or “learning to learn.” In order to help models quickly
adapt to new, untested tasks using a limited amount of task-
specific data, meta-learning aims to enable models to generalize
learning experiences across different tasks.
Two primary phases are involved in the typical meta-learning
workflow:
 Meta – Learning
 Tasks: Exposure to a range of tasks, each with its
own set of parameters or characteristics, is part of
the meta-training phase.
 Model Training: Many tasks are used to train a
base model, also known as a learner. The purpose
of this model is to represent shared knowledge or
common patterns among various tasks.
 Adaptation: With few examples, the model is trained
to quickly adjust its parameters to new tasks.
 Meta-Testing (Adaptation)
 New Task: The model is given a brand-new task
during the meta-testing stage that it was not
exposed to during training.
 Few Shots: With only a small amount of data, the
model is modified for the new task (few-shot
learning). In order to make this adaptation, the
model’s parameters are frequently updated using
the examples from the new task.
 Generalization: Meta-learning efficacy is
evaluated by looking at how well the model quickly
generalizes to the new task.
Why we need Meta-Learning
Meta-Learning can enable the machine to learn more efficiently and
effectively from limited data and it can adapt to any changes in the
problem quickly. Here are some examples of meta-learning
processes:
 Few-shot Learning: It is a type of learning algorithm or
technique, which can learn in very few steps of training and
on limited examples.
 Transfer Learning: It is a technique in which knowledge is
transferred from one task to another if there are some
similarities between both tasks. In this case, another model
can be developed with very limited data and few-step
training using the knowledge of another pre-trained model.
Learning the meta-parameters
Throughout the whole training process, backpropagation is used in
meta-learning to back-propagate the meta-loss gradient all the
way back to the original model weights. This is computationally
expensive, uses second derivatives, and is made easier by
frameworks such as TensorFlow and PyTorch. The meta-loss, a
measure of the meta-learner’s efficacy, is obtained by contrasting
model predictions with ground-truth labels. Parameters are updated
during training by meta-optimizers such as SGD, RMSProp, and Adam.
Three main steps subsumed in meta-learning are as follows:
1. Inclusion of a learning sub-model.
2. A dynamic inductive bias: Altering the inductive bias of
a learning algorithm to match the given problem. This is
done by altering key aspects of the learning algorithm,
such as the hypothesis representation, heuristic formulae,
or parameters. Many different approaches exist.
3. Extracting useful knowledge and experience from
the metadata of the model: Metadata consists of
knowledge about previous learning episodes and is used to
efficiently develop an effective hypothesis for a new task.
This is also a form of Inductive transfer.
Meta-Learning Approaches
There are several approaches to Meta-Learning, some common
approaches are as follows:
1. Metric-based meta-learning: This approach basically
aims to find a metric space. It is similar to the nearest
neighbor algorithm which measures the similarity or
distance to learn the given examples. The goal is to learn a
function that converts input examples into a metric space
with labels that are similar for nearby points and dissimilar
for far-off points. The success of metric-based meta-
learning models depends on the selection of the kernel
function, which determines the weight of each labeled
example in predicting the label of a new example.
Applications of metric-based meta-learning include few-
shot classification, where the goal is to classify new classes
with very few examples.
2. Optimization-based Meta-Learning: This approach
focuses on optimizing algorithms in such a way that they
can quickly solve a new task with very few examples.
Usually, multiple neural networks are used to better
accomplish a task: one neural network is responsible for
the optimization (different techniques can be used) of the
hyperparameters of another neural network in order to
improve its performance.
Few-shot learning in reinforcement learning is an example
of an optimization-based meta-learning application, where
the objective is to learn a policy that can handle new issues
with a small number of examples.
3. Model-Agnostic Meta-Learning (MAML): It is an
optimization-based meta-learning framework that enables
a model to quickly adapt to new tasks with only a few
examples by learning generalizable features that can be
used in different tasks. In MAML, the model is trained on a
set of meta-training tasks, which are similar to the target
tasks but have a different distribution of data. The model
learns a set of generalizable parameters that can be
quickly adapted to new tasks with only a few examples by
performing a few gradient descent steps (a minimal
first-order sketch of this idea appears after this list).
4. Model-based Meta-Learning: Model-based Meta-
Learning is a well-known meta-learning algorithm that
learns how to initialize the model parameters correctly so
that it can quickly adapt to new tasks with few examples. It
updates its parameters rapidly with a few training steps
and quickly adapts to new tasks by learning a set of
common parameters. It could be a neural network with a
certain architecture that is designed for fast updates, or it
could be a more general optimization algorithm that can
quickly adapt to new tasks. The parameters of a model are
trained such that even a few iterations of applying gradient
descent with relatively few data samples from a new task
(new domain) can lead to good generalization on that task.
Model-based meta-learning has shown impressive results in
various domains, including few-shot learning, robotics,
and natural language processing.
 Memory-Augmented Neural
Networks: Memory-augmented neural networks,
such as Neural Turing Machines (NTMs) and
Differentiable Neural Computers (DNCs), utilize
external memory for improved meta-learning,
enabling complex reasoning and tasks like machine
translation and image captioning.
 Meta Networks: Meta Networks is a model-based
meta-learning. The key idea behind Meta Networks
is to use a meta-learner to generate the weights of
a task-specific network, which is then used to solve
a new task. The task-specific network is designed
to take input from the meta-learner and produce
output that is specific to the new task. In other
words, the architecture of the task-specific network
is learned on-the-fly by the meta-learner during the
meta-training phase, which enables rapid
adaptation to new tasks with only a few examples.
 Bayesian Meta-Learning: Bayesian Meta-
Learning or Bayesian optimization is a family of
meta-Learning algorithms that uses the bayesian
method for optimizing a black-box function that is
expensive to evaluate, by constructing a
probabilistic model of the function, which is then
iteratively updated as new data is acquired.
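As referenced in the MAML item above, here is a minimal first-order MAML (FOMAML) sketch in NumPy on toy 1-D linear-regression tasks. Everything here (the task distribution, step sizes, loop counts, and helper names) is our own illustrative assumption, not a reference implementation:

import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    # Each task is a random line y = a*x + b.
    a, b = rng.uniform(-2, 2, size=2)
    def make_batch(n=10):
        x = rng.uniform(-5, 5, size=n)
        return x, a * x + b
    return make_batch

def grad_mse(theta, x, y):
    # Analytic gradient of the mean squared error for y_hat = w*x + c.
    w, c = theta
    err = w * x + c - y
    return np.array([2 * np.mean(err * x), 2 * np.mean(err)])

theta = np.zeros(2)          # meta-parameters (w, c)
alpha, beta = 0.01, 0.001    # inner and outer learning rates

for step in range(2000):     # meta-training
    outer_grad = np.zeros(2)
    for _ in range(5):       # a meta-batch of 5 tasks
        task = sample_task()
        x_s, y_s = task()    # support set: inner adaptation
        theta_adapted = theta - alpha * grad_mse(theta, x_s, y_s)
        x_q, y_q = task()    # query set: evaluate the adapted parameters
        # First-order approximation: use the gradient at theta_adapted directly.
        outer_grad += grad_mse(theta_adapted, x_q, y_q)
    theta -= beta * outer_grad / 5

# Meta-test: adapt to a brand-new task with one gradient step on few examples.
new_task = sample_task()
x_s, y_s = new_task(n=5)
theta_new = theta - alpha * grad_mse(theta, x_s, y_s)
x_q, y_q = new_task(n=100)
print("query MSE after one adaptation step:", np.mean((theta_new[0] * x_q + theta_new[1] - y_q) ** 2))

Full MAML would differentiate through the inner update (second-order gradients); the first-order variant above simply reuses the gradient at the adapted parameters, which is cheaper and often performs comparably.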

Advantages of Meta-learning
1. Meta-Learning offers more speed: Meta-learning
approaches can produce learning architectures that
perform better and faster than hand-crafted models.
2. Better generalization: Meta-learning models can
frequently generalize to new tasks more effectively by
learning to learn, even when the new tasks are very
different from the ones they were trained on.
3. Scaling: Meta-learning can automate the process of
choosing and fine-tuning algorithms, thereby increasing the
potential to scale AI applications.
4. Fewer data required: These approaches assist in the
development of more general systems, which can transfer
knowledge from one context to another. This reduces the
amount of data you need in solving problems in the new
context.
5. Improved performance: Meta-learning can help improve
the performance of machine learning models by allowing
them to adapt to different datasets and learning
environments. By leveraging prior knowledge and
experience, meta-learning models can quickly adapt to new
situations and make better decisions.
6. Fewer hyperparameters: Meta-learning can help reduce
the number of hyperparameters that need to be tuned
manually. By learning to optimize these parameters
automatically, meta-learning models can improve their
performance and reduce the need for manual tuning.
Meta-learning Optimization
During the training process of a machine learning algorithm,
hyperparameters determine which parameters should be used.
These variables have a direct impact on how successfully a model
trains. Optimizing hyperparameters may be done in several ways.
1. Grid Search: The Grid Search technique uses manually
set hyperparameter ranges. All suitable combinations of
hyperparameter values (within the given ranges) are tested
during a grid search, and the model then selects the best
combination. Because the process is exhaustive and can
take a long time, this approach is seen as the conventional,
brute-force option. Grid Search is available in the Sklearn
library.
2. Random Search: The optimal configuration for the model
is found using the random search approach, which tries
random combinations of the hyperparameters. Even though
it is similar in spirit to grid search, it has been shown to
often produce superior results for the same budget. A
disadvantage of random search is that its results can vary
from run to run. Random Search is also available in the
Sklearn library.
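Both strategies are available in scikit-learn as GridSearchCV and RandomizedSearchCV; a brief sketch tuning a decision tree (the parameter grid is arbitrary):

from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Grid search: exhaustively try every combination in the grid.
param_grid = {"max_depth": [2, 3, 4, 5], "min_samples_leaf": [1, 2, 4]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)
print("grid search best params  :", grid.best_params_)

# Random search: sample a fixed number of random combinations from distributions.
param_dist = {"max_depth": randint(2, 10), "min_samples_leaf": randint(1, 10)}
rand = RandomizedSearchCV(DecisionTreeClassifier(random_state=0), param_dist,
                          n_iter=10, cv=5, random_state=0)
rand.fit(X, y)
print("random search best params:", rand.best_params_)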
Applications of Meta-learning
Meta-learning algorithms are already in use in various applications,
some of which are:
1. Online learning tasks in reinforcement learning
2. Sequence modeling in Natural language processing
3. Image classification tasks in Computer vision
4. Few-shot learning: Meta-learning can be used to train
models that can quickly adapt to new tasks with limited
data. This is particularly useful in scenarios where the cost
of collecting large amounts of data is prohibitively high,
such as in medical diagnosis or autonomous driving.
5. Model selection: Meta-learning can help automate the
process of model selection by learning to choose the best
model for a given task based on past experience. This can
save time and resources while also improving the accuracy
and robustness of the resulting model.
6. Hyperparameter optimization: Meta-learning can be
used to automatically tune hyperparameters for machine-
learning models. By learning from past experience, meta-
learning models can quickly find the best hyperparameters
for a given task, leading to better performance and faster
training times.
7. Transfer learning: Meta-learning can be used to facilitate
transfer learning, where knowledge learned in one domain
is transferred to another domain. This can be especially
useful in scenarios where data is scarce or where the target
domain is vastly different from the source domain.
8. Recommender systems: Meta-learning can be used to
build better recommender systems by learning to
recommend the most relevant items based on past user
behavior. This can improve the accuracy and relevance of
recommendations, leading to better user engagement and
satisfaction.
