
Clustering

KALINGA INSTITUTE OF INDUSTRIAL TECHNOLOGY
School of Computer Engineering

Data Mining and Data Warehousing (CS 2004), 3 Credit
Lecture Note 11

Dr. Pradeep Kumar Mallick
Associate Professor [II]
School of Computer Engineering,
Kalinga Institute of Industrial Technology (KIIT), Deemed to be University, Odisha

Clustering

• Cluster analysis is an unsupervised learning technique, meaning that you do not know in advance how many clusters exist in the data before running the model.
• Cluster analysis, also known as clustering, is a method of data mining
that groups similar data points together.
• The goal of cluster analysis is to divide a dataset into groups (or
clusters) such that the data points within each group are more similar to
each other than to data points in other groups.
• This process is often used for exploratory data analysis and can help
identify patterns or relationships within the data that may not be
immediately obvious.
• There are many different algorithms used for cluster analysis, such as
k-means, hierarchical clustering, and density-based clustering.
• The choice of algorithm will depend on the specific requirements of the
analysis and the nature of the data being analyzed.
Cluster Analysis

When should cluster analysis be used?


o Cluster analysis is for when you’re looking to segment or
categorise a dataset into groups based on similarities, but aren’t
sure what those groups should be.
o While it’s tempting to use cluster analysis in many different
research projects, it’s important to know when it’s genuinely the
right fit.
Scenarios where cluster analysis proves its worth.

o Here are three of the most common scenarios where cluster analysis
proves its worth.
o Exploratory data analysis
1. When you have a new dataset and are in the early stages of understanding it,
cluster analysis can provide a much-needed guide.
2. By forming clusters, you can get a read on potential patterns or trends that
could warrant deeper investigation.
o Market segmentation
1. This is a golden application for cluster analysis, especially in the business
world. Because when you aim to target your products or services more
effectively, understanding your customer base becomes paramount.
2. Cluster analysis can carve out specific customer segments based on buying
habits, preferences or demographics, allowing for tailored marketing strategies
that resonate more deeply.
o Resource allocation
1. Be it in healthcare, manufacturing, logistics or many other sectors, resource
allocation is often one of the biggest challenges. Cluster analysis can be used to
identify which groups or areas require the most attention or resources, enabling
more efficient and targeted deployment.
K-Means Clustering

• K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if K=2, there will be two clusters, for K=3 there will be three clusters, and so on.
• It is an iterative algorithm that divides the unlabeled dataset into K different clusters in such a way that each data point belongs to only one group of similar properties. It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.
• It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of the algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
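In symbols (a standard formulation, not stated explicitly on the slides), with clusters $C_1, \dots, C_K$ and centroids $\mu_j$, K-Means minimizes the within-cluster sum of squared distances:

$$J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$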
K-Means Clustering

• K-Means Clustering is an Unsupervised Machine Learning algorithm that groups the unlabeled dataset into different clusters.
• K-Means assigns data points to one of the K clusters depending on their distance from the centers of the clusters.
• It starts by randomly placing the cluster centroids in the space. Then each data point is assigned to one of the clusters based on its distance from the centroid of that cluster.
• After assigning each point to one of the clusters, new cluster centroids are computed.
• This process runs iteratively until good clusters are found. In the analysis we assume that the number of clusters is given in advance and we have to put the points into one of the groups.
• In some cases, K is not clearly defined, and we have to think about the optimal value of K.
K-Means Clustering

The algorithm works as follows (a runnable sketch follows the steps):

1. Select the number K to decide the number of clusters.
2. Select K random points as centroids. (They need not come from the input dataset.)
3. Assign each data point to its closest centroid, which forms the predefined K clusters.
4. Compute the variance and place a new centroid for each cluster.
5. Repeat step 3, i.e., reassign each data point to the new closest centroid.
6. If any reassignment occurred, go to step 4; otherwise FINISH.
7. The model is ready.
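The steps above can be expressed as a short NumPy sketch. This is an illustrative implementation, not the lecture's own code; the sample data, the value of k, and the kmeans helper name are placeholders.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Steps 3/5: assign each point to its closest centroid.
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=-1).argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its cluster
        # (assumes no cluster becomes empty, which holds for this toy data).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop when no centroid moves any more.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
print(kmeans(X, k=2))
```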
K-Means Clustering
[Slides 9-13: a worked K-Means example, presented in figures.]
Video reference: https://www.youtube.com/watch?v=KzJORp8bgqs
Problems of the K-Means Algorithm

• K-Medoids and K-Means are two types of clustering mechanisms in Partition Clustering.
• Clustering is the process of breaking down an abstract group of data points/objects into classes of similar objects, such that all the objects in one cluster have similar traits; a group of n objects is broken down into k clusters based on their similarities.
• K-Medoids is an unsupervised method that works with unlabelled data.
• It is an improved version of the K-Means algorithm, mainly designed to deal with K-Means' sensitivity to outlier data.
• Compared to other partitioning algorithms, the algorithm is simple, fast, and easy to implement.
K-Medoids Algorithm

• The problem with the K-Means algorithm is that it handles outlier data poorly.
• An outlier is a point very different from the rest of the points.
• Outlier data points can pull a cluster's centroid toward them and attract other clusters to merge.
• Outlier data can shift the mean of a cluster substantially.
• Hence, K-Means clustering is highly affected by outlier data.
K-Medoids Algorithm

Algorithm (a minimal sketch follows the feature list):
1. Randomly select k points from the data to be the initial medoids.
2. Calculate the distance between each medoid and each non-medoid point, and assign each point to the nearest medoid.
3. Calculate the cost, which is the sum of the distances of each data point from its assigned medoid.
4. Swap a medoid point with a non-medoid point from the same cluster, and recalculate the cost.
5. If the new cost is higher, undo the swap.
6. Otherwise, repeat step 4 until the medoids no longer change.

Features
1. The number of clusters, k, must be specified before running the algorithm.
2. K-Medoids is a variant of the K-Means algorithm, but it uses actual data points (medoids) instead of centroids to represent clusters.
3. K-Medoids is less sensitive to noise and outliers than K-Means.
4. K-Medoids can produce better solutions than other algorithms in some cases, but it can be very slow.
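A minimal NumPy sketch of the swap-based procedure above; the sample data, k, and the kmedoids helper name are illustrative placeholders, not from the lecture. It naively tries every (medoid, non-medoid) swap and keeps only swaps that lower the total cost.

```python
import numpy as np

def kmedoids(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    medoids = rng.choice(n, size=k, replace=False)   # step 1: random initial medoids
    cost = D[:, medoids].min(axis=1).sum()           # steps 2-3: assign points, total cost
    for _ in range(n_iter):
        improved = False
        for m in range(k):                           # step 4: try swapping each medoid
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids.copy()
                trial[m] = cand
                trial_cost = D[:, trial].min(axis=1).sum()
                if trial_cost < cost:                # step 5: keep only cheaper swaps
                    medoids, cost, improved = trial, trial_cost, True
        if not improved:                             # step 6: stop when medoids are stable
            break
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels, cost

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])
print(kmedoids(X, k=2))
```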
K-Medoids Algorithm: Example-1
[Slides 17-31: a worked K-Medoids example, presented in figures.]
Hierarchical Clustering

• Hierarchical clustering is a method of cluster analysis in data mining that creates a hierarchical representation of the clusters in a dataset.
• The method starts by treating each data point as a separate cluster and then iteratively combines the closest clusters until a stopping criterion is reached.
• The result of hierarchical clustering is a tree-like structure, called a dendrogram.
• Dendrogram:
✔ In Hierarchical Clustering, the aim is to produce a hierarchical series of nested clusters.
✔ A Dendrogram is a tree-like diagram that records the sequences of merges or splits. It graphically represents this hierarchy as an inverted tree that describes the order in which points are merged (bottom-up view) or clusters are broken up (top-down view).
Hierarchical Clustering

Advantages:
• The ability to handle non-convex clusters and clusters of different sizes and densities.
• The ability to handle missing data and noisy data.
• The ability to reveal the hierarchical structure of the data, which can be useful for understanding the relationships among the clusters.

Drawbacks of Hierarchical Clustering:
• The need for a criterion to stop the clustering process and determine the final number of clusters.
• The computational cost and memory requirements of the method can be high, especially for large datasets.
• The results can be sensitive to the linkage criterion and distance metric used.
Types of Hierarchical Clustering

Basically, there are two types of hierarchical clustering:
1. Agglomerative Clustering
2. Divisive Clustering

1. Agglomerative Clustering
• Initially consider every data point as an individual cluster and, at every step, merge the nearest pair of clusters. (It is a bottom-up method.)
• At first, every data point is considered an individual entity or cluster.
• At every iteration, clusters merge with other clusters until one cluster is formed.
Types of Hierarchical Clustering

Algorithm (a minimal sketch follows the steps):
1. Consider every data point as an individual cluster.
2. Calculate the similarity of each cluster with all the other clusters (compute the proximity matrix).
3. Merge the clusters that are most similar or closest to each other.
4. Recalculate the proximity matrix for the merged cluster.
5. Repeat steps 3 and 4 until only a single cluster remains.
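For reference, a hedged sketch of agglomerative (single-link) clustering using SciPy, assuming scipy is available; the sample points and the cluster count are placeholders. The dendrogram itself can be drawn from Z with scipy.cluster.hierarchy.dendrogram.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.5, 1.5], [5.0, 5.0], [3.0, 4.0], [4.0, 4.0], [3.0, 3.5]])
Z = linkage(X, method="single")                  # iteratively merge the nearest clusters
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
print(labels)
```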
Types of Hierarchical Clustering

2. Divisive Hierarchical Clustering
✔ Divisive hierarchical clustering is precisely the opposite of agglomerative hierarchical clustering.
✔ In divisive hierarchical clustering, we start with all of the data points as a single cluster and, in every iteration, we split off the data points that are least similar to the rest of their cluster.
✔ In the end, we are left with N clusters (one per data point).
Agglomerative Algorithm: Single Link
[Slides 38-50: a worked single-linkage example, presented in figures.]
Complete Hierarchical Clustering
[Slides 51-57: a worked complete-linkage example, presented in figures.]
Average Linkage
[Slides 58-66: a worked average-linkage example, presented in figures.]
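The linkage criteria used in the worked examples above differ only in how the distance between two clusters is defined: single link takes the minimum pairwise distance, complete link the maximum, and average link the mean. A small NumPy sketch (toy points and illustrative cluster indices) makes the contrast concrete:

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [5.0, 1.0]])
A, B = [0, 1], [2, 3]  # indices of two clusters
d = np.linalg.norm(X[A][:, None, :] - X[B][None, :, :], axis=-1)  # cross-cluster distances

print("single  :", d.min())   # nearest pair between the clusters
print("complete:", d.max())   # farthest pair between the clusters
print("average :", d.mean())  # mean over all cross-cluster pairs
```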
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

o Density-based clustering refers to one of the most popular unsupervised learning methodologies used in model building and machine learning algorithms.
o The data points in the low-density regions separating two clusters are considered noise.
o The surroundings within a radius ε of a given object are known as the ε-neighborhood of the object.
o If the ε-neighborhood of the object contains at least a minimum number of objects, MinPts, then it is called a core object.
o It is a single-scan method.
o It requires density parameters as a termination condition.
o It is used to manage noise in data clusters.
o Density-based clustering is used to identify clusters of arbitrary shape and size.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

o Clusters are dense regions in the data space, separated by regions of lower point density.
o The DBSCAN algorithm is based on this intuitive notion of "clusters" and "noise".
o The key idea is that, for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
Why DBSCAN?

o Partitioning methods (K-Means, PAM clustering) and hierarchical clustering work for finding spherical-shaped or convex clusters.
o In other words, they are suitable only for compact and well-separated clusters.
o Moreover, they are also severely affected by the presence of noise and outliers in the data.
o Real-life data may contain irregularities, like:
o Clusters can be of arbitrary shape, such as ring- or S-shaped groups of points.
o Data may contain noise.

o For a dataset containing such non-convex clusters and outliers, the K-Means algorithm has difficulty identifying clusters of arbitrary shape.
Parameters Required for the DBSCAN Algorithm

o eps: defines the neighborhood around a data point, i.e., if the distance between two points is lower than or equal to 'eps' then they are considered neighbors.
o If the eps value is chosen too small, then a large part of the data will be considered outliers.
o If it is chosen very large, then clusters will merge and the majority of the data points will end up in the same cluster.
o One way to find the eps value is based on the k-distance graph (a sketch follows below).

o MinPts: the minimum number of neighbors (data points) within the eps radius. The larger the dataset, the larger the value of MinPts that should be chosen.
o As a general rule, the minimum MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D+1.
o MinPts should be chosen to be at least 3.
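A minimal sketch of the k-distance graph heuristic mentioned above, assuming scikit-learn and matplotlib are available; the random data and the value of k are placeholders. Sorting each point's distance to its k-th nearest neighbor and looking for the "elbow" in the curve suggests a value for eps:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(200, 2)   # placeholder data
k = 4                        # plays the role of MinPts
# Distance of every point to its k nearest neighbors (the first one is itself).
dists, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
plt.plot(np.sort(dists[:, -1]))  # sorted k-distances; the elbow suggests eps
plt.xlabel("points sorted by k-distance")
plt.ylabel("distance to k-th nearest neighbor")
plt.show()
```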
Steps Used in the DBSCAN Algorithm

i. Find all the neighbor points within eps and identify the core points, i.e., the points with more than MinPts neighbors.
ii. For each core point, if it is not already assigned to a cluster, create a new cluster.
iii. Recursively find all of its density-connected points and assign them to the same cluster as the core point.
Points a and b are said to be density-connected if there exists a point c which has a sufficient number of points in its neighborhood and both a and b are reachable from it within eps distance. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then b is connected to a through the chain.
iv. Iterate through the remaining unvisited points in the dataset. Those points that do not belong to any cluster are noise.
Steps Used in the DBSCAN Algorithm

In this algorithm, we have 3 types of data points (a minimal sketch follows below):

Core Point: a point is a core point if it has at least MinPts points within eps.

Border Point: a point which has fewer than MinPts points within eps but lies in the neighborhood of a core point.

Noise or outlier: a point which is neither a core point nor a border point.
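A minimal sketch using scikit-learn's DBSCAN implementation (assumed available); the data, eps, and min_samples values here are illustrative. Points labeled -1 in the output are noise; the remaining points receive cluster ids, and db.core_sample_indices_ lists the core points.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
db = DBSCAN(eps=3, min_samples=2).fit(X)
print(db.labels_)                # e.g. [0 0 0 1 1 -1]; -1 marks noise
print(db.core_sample_indices_)   # indices of the core points
```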


DBSCAN: Example-1
[Slides 73-78: a worked DBSCAN example, presented in figures.]
References
• https://www.youtube.com/watch?v=oNYtYm0tFso
