
Clustering

KALINGA INSTITUTE OF INDUSTRIAL TECHNOLOGY
School of Computer Engineering

Data Mining and Data Warehousing (CS 2004), 3 Credit
Lecture Note 11

Dr. Pradeep Kumar Mallick
Associate Professor [II]
School of Computer Engineering,
Kalinga Institute of Industrial Technology (KIIT), Deemed to be University, Odisha

Clustering

• Cluster analysis is an unsupervised learning technique, meaning that you do not know in advance how many clusters exist in the data before running the model.
• Cluster analysis, also known as clustering, is a method of data mining
that groups similar data points together.
• The goal of cluster analysis is to divide a dataset into groups (or
clusters) such that the data points within each group are more similar to
each other than to data points in other groups.
• This process is often used for exploratory data analysis and can help
identify patterns or relationships within the data that may not be
immediately obvious.
• There are many different algorithms used for cluster analysis, such as
k-means, hierarchical clustering, and density-based clustering.
• The choice of algorithm will depend on the specific requirements of the
analysis and the nature of the data being analyzed.
Cluster Analysis

When should cluster analysis be used?


o Cluster analysis is for when you’re looking to segment or
categorise a dataset into groups based on similarities, but aren’t
sure what those groups should be.
o While it’s tempting to use cluster analysis in many different
research projects, it’s important to know when it’s genuinely the
right fit.
Scenarios where cluster analysis proves its worth.

o Here are three of the most common scenarios where cluster analysis
proves its worth.
o Exploratory data analysis
1. When you have a new dataset and are in the early stages of understanding it,
cluster analysis can provide a much-needed guide.
2. By forming clusters, you can get a read on potential patterns or trends that
could warrant deeper investigation.
o Market segmentation
1. This is a golden application for cluster analysis, especially in the business
world. Because when you aim to target your products or services more
effectively, understanding your customer base becomes paramount.
2. Cluster analysis can carve out specific customer segments based on buying
habits, preferences or demographics, allowing for tailored marketing strategies
that resonate more deeply.
o Resource allocation
1. Be it in healthcare, manufacturing, logistics or many other sectors, resource
allocation is often one of the biggest challenges. Cluster analysis can be used to
identify which groups or areas require the most attention or resources, enabling
more efficient and targeted deployment.
K-Means Clustering

• K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if K=2, there will be two clusters, for K=3 there will be three clusters, and so on.
• It is an iterative algorithm that divides the unlabeled dataset into K different clusters in such a way that each data point belongs to only one group of similar properties. It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.
• It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of the algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
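In symbols (a standard formulation, not stated explicitly on the slides), with clusters $C_1, \dots, C_K$ and centroids $\mu_j$, K-Means minimizes the within-cluster sum of squared distances:

$$J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$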
K-Means Clustering

• K-Means Clustering is an Unsupervised Machine Learning algorithm that groups the unlabeled dataset into different clusters.
• K-Means assigns data points to one of the K clusters depending on their distance from the centers of the clusters.
• It starts by randomly placing the cluster centroids in the space. Then each data point is assigned to one of the clusters based on its distance from the centroid of that cluster.
• After assigning each point to one of the clusters, new cluster centroids are computed.
• This process runs iteratively until good clusters are found. In the analysis we assume that the number of clusters is given in advance and we have to put the points into one of the groups.
• In some cases, K is not clearly defined, and we have to think about the optimal value of K.
K-Means Clustering

The algorithm works as follows (a runnable sketch follows the steps):

1. Select the number K to decide the number of clusters.
2. Select K random points as centroids. (They need not come from the input dataset.)
3. Assign each data point to its closest centroid, which forms the predefined K clusters.
4. Compute the variance and place a new centroid for each cluster.
5. Repeat step 3, i.e., reassign each data point to the new closest centroid.
6. If any reassignment occurred, go to step 4; otherwise FINISH.
7. The model is ready.
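The steps above can be expressed as a short NumPy sketch. This is an illustrative implementation, not the lecture's own code; the sample data, the value of k, and the kmeans helper name are placeholders.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Steps 3/5: assign each point to its closest centroid.
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=-1).argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its cluster
        # (assumes no cluster becomes empty, which holds for this toy data).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop when no centroid moves any more.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
print(kmeans(X, k=2))
```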
K-Means Clustering
[Slides 9-13: a worked K-Means example, presented in figures.]
Video reference: https://www.youtube.com/watch?v=KzJORp8bgqs
Problems of the K-Means Algorithm

• K-Medoids and K-Means are two types of clustering mechanisms in Partition Clustering.
• Clustering is the process of breaking down an abstract group of data points/objects into classes of similar objects, such that all the objects in one cluster have similar traits; a group of n objects is broken down into k clusters based on their similarities.
• K-Medoids is an unsupervised method that works with unlabelled data.
• It is an improved version of the K-Means algorithm, mainly designed to deal with K-Means' sensitivity to outlier data.
• Compared to other partitioning algorithms, the algorithm is simple, fast, and easy to implement.
K-Medoids Algorithm

• The problem with the K-Means algorithm is that it handles outlier data poorly.
• An outlier is a point very different from the rest of the points.
• Outlier data points can pull a cluster's centroid toward them and attract other clusters to merge.
• Outlier data can shift the mean of a cluster substantially.
• Hence, K-Means clustering is highly affected by outlier data.
K-Medoids Algorithm

Algorithm (a minimal sketch follows the feature list):
1. Randomly select k points from the data to be the initial medoids.
2. Calculate the distance between each medoid and each non-medoid point, and assign each point to the nearest medoid.
3. Calculate the cost, which is the sum of the distances of each data point from its assigned medoid.
4. Swap a medoid point with a non-medoid point from the same cluster, and recalculate the cost.
5. If the new cost is higher, undo the swap.
6. Otherwise, repeat step 4 until the medoids no longer change.

Features
1. The number of clusters, k, must be specified before running the algorithm.
2. K-Medoids is a variant of the K-Means algorithm, but it uses actual data points (medoids) instead of centroids to represent clusters.
3. K-Medoids is less sensitive to noise and outliers than K-Means.
4. K-Medoids can produce better solutions than other algorithms in some cases, but it can be very slow.
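A minimal NumPy sketch of the swap-based procedure above; the sample data, k, and the kmedoids helper name are illustrative placeholders, not from the lecture. It naively tries every (medoid, non-medoid) swap and keeps only swaps that lower the total cost.

```python
import numpy as np

def kmedoids(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    medoids = rng.choice(n, size=k, replace=False)   # step 1: random initial medoids
    cost = D[:, medoids].min(axis=1).sum()           # steps 2-3: assign points, total cost
    for _ in range(n_iter):
        improved = False
        for m in range(k):                           # step 4: try swapping each medoid
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids.copy()
                trial[m] = cand
                trial_cost = D[:, trial].min(axis=1).sum()
                if trial_cost < cost:                # step 5: keep only cheaper swaps
                    medoids, cost, improved = trial, trial_cost, True
        if not improved:                             # step 6: stop when medoids are stable
            break
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels, cost

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])
print(kmedoids(X, k=2))
```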
K-Medoids Algorithm: Example-1
[Slides 17-31: a worked K-Medoids example, presented in figures.]
Hierarchical Clustering

• Hierarchical clustering is a method of cluster analysis in data mining that creates a hierarchical representation of the clusters in a dataset.
• The method starts by treating each data point as a separate cluster and then iteratively combines the closest clusters until a stopping criterion is reached.
• The result of hierarchical clustering is a tree-like structure, called a dendrogram.
• Dendrogram:
✔ In Hierarchical Clustering, the aim is to produce a hierarchical series of nested clusters.
✔ A Dendrogram is a tree-like diagram that records the sequences of merges or splits. It graphically represents this hierarchy as an inverted tree that describes the order in which points are merged (bottom-up view) or clusters are broken up (top-down view).
Hierarchical Clustering

Advantages:
• The ability to handle non-convex clusters and clusters of different sizes and densities.
• The ability to handle missing data and noisy data.
• The ability to reveal the hierarchical structure of the data, which can be useful for understanding the relationships among the clusters.

Drawbacks of Hierarchical Clustering:
• The need for a criterion to stop the clustering process and determine the final number of clusters.
• The computational cost and memory requirements of the method can be high, especially for large datasets.
• The results can be sensitive to the linkage criterion and distance metric used.
Types of Hierarchical Clustering

Basically, there are two types of hierarchical clustering:
1. Agglomerative Clustering
2. Divisive Clustering

1. Agglomerative Clustering
• Initially consider every data point as an individual cluster and, at every step, merge the nearest pair of clusters. (It is a bottom-up method.)
• At first, every data point is considered an individual entity or cluster.
• At every iteration, clusters merge with other clusters until one cluster is formed.
Types of Hierarchical Clustering

Algorithm (a minimal sketch follows the steps):
1. Consider every data point as an individual cluster.
2. Calculate the similarity of each cluster with all the other clusters (compute the proximity matrix).
3. Merge the clusters that are most similar or closest to each other.
4. Recalculate the proximity matrix for the merged cluster.
5. Repeat steps 3 and 4 until only a single cluster remains.
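For reference, a hedged sketch of agglomerative (single-link) clustering using SciPy, assuming scipy is available; the sample points and the cluster count are placeholders. The dendrogram itself can be drawn from Z with scipy.cluster.hierarchy.dendrogram.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.5, 1.5], [5.0, 5.0], [3.0, 4.0], [4.0, 4.0], [3.0, 3.5]])
Z = linkage(X, method="single")                  # iteratively merge the nearest clusters
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
print(labels)
```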
Types of Hierarchical Clustering

2. Divisive Hierarchical Clustering
✔ Divisive hierarchical clustering is precisely the opposite of agglomerative hierarchical clustering.
✔ In divisive hierarchical clustering, we start with all of the data points as a single cluster and, in every iteration, we split off the data points that are least similar to the rest of their cluster.
✔ In the end, we are left with N clusters (one per data point).
Agglomerative Algorithm: Single Link
[Slides 38-50: a worked single-linkage example, presented in figures.]
Complete Hierarchical Clustering
[Slides 51-57: a worked complete-linkage example, presented in figures.]
Average Linkage
[Slides 58-66: a worked average-linkage example, presented in figures.]
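The linkage criteria used in the worked examples above differ only in how the distance between two clusters is defined: single link takes the minimum pairwise distance, complete link the maximum, and average link the mean. A small NumPy sketch (toy points and illustrative cluster indices) makes the contrast concrete:

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [5.0, 1.0]])
A, B = [0, 1], [2, 3]  # indices of two clusters
d = np.linalg.norm(X[A][:, None, :] - X[B][None, :, :], axis=-1)  # cross-cluster distances

print("single  :", d.min())   # nearest pair between the clusters
print("complete:", d.max())   # farthest pair between the clusters
print("average :", d.mean())  # mean over all cross-cluster pairs
```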
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

o Density-based clustering refers to one of the most popular unsupervised learning methodologies used in model building and machine learning algorithms.
o The data points in the low-density regions separating two clusters are considered noise.
o The surroundings within a radius ε of a given object are known as the ε-neighborhood of the object.
o If the ε-neighborhood of the object contains at least a minimum number of objects, MinPts, then it is called a core object.
o It is a single-scan method.
o It requires density parameters as a termination condition.
o It is used to manage noise in data clusters.
o Density-based clustering is used to identify clusters of arbitrary shape and size.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

o Clusters are dense regions in the data space, separated by regions of lower point density.
o The DBSCAN algorithm is based on this intuitive notion of "clusters" and "noise".
o The key idea is that, for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
Why DBSCAN?

o Partitioning methods (K-Means, PAM clustering) and hierarchical clustering work for finding spherical-shaped or convex clusters.
o In other words, they are suitable only for compact and well-separated clusters.
o Moreover, they are also severely affected by the presence of noise and outliers in the data.
o Real-life data may contain irregularities, like:
o Clusters can be of arbitrary shape, such as ring- or S-shaped groups of points.
o Data may contain noise.

o For a dataset containing such non-convex clusters and outliers, the K-Means algorithm has difficulty identifying clusters of arbitrary shape.
Parameters Required for the DBSCAN Algorithm

o eps: defines the neighborhood around a data point, i.e., if the distance between two points is lower than or equal to 'eps' then they are considered neighbors.
o If the eps value is chosen too small, then a large part of the data will be considered outliers.
o If it is chosen very large, then clusters will merge and the majority of the data points will end up in the same cluster.
o One way to find the eps value is based on the k-distance graph (a sketch follows below).

o MinPts: the minimum number of neighbors (data points) within the eps radius. The larger the dataset, the larger the value of MinPts that should be chosen.
o As a general rule, the minimum MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D+1.
o MinPts should be chosen to be at least 3.
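A minimal sketch of the k-distance graph heuristic mentioned above, assuming scikit-learn and matplotlib are available; the random data and the value of k are placeholders. Sorting each point's distance to its k-th nearest neighbor and looking for the "elbow" in the curve suggests a value for eps:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(200, 2)   # placeholder data
k = 4                        # plays the role of MinPts
# Distance of every point to its k nearest neighbors (the first one is itself).
dists, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
plt.plot(np.sort(dists[:, -1]))  # sorted k-distances; the elbow suggests eps
plt.xlabel("points sorted by k-distance")
plt.ylabel("distance to k-th nearest neighbor")
plt.show()
```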
Steps Used in the DBSCAN Algorithm

i. Find all the neighbor points within eps and identify the core points, i.e., the points with more than MinPts neighbors.
ii. For each core point, if it is not already assigned to a cluster, create a new cluster.
iii. Recursively find all of its density-connected points and assign them to the same cluster as the core point.
Points a and b are said to be density-connected if there exists a point c which has a sufficient number of points in its neighborhood and both a and b are reachable from it within eps distance. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then b is connected to a through the chain.
iv. Iterate through the remaining unvisited points in the dataset. Those points that do not belong to any cluster are noise.
Steps Used in the DBSCAN Algorithm

In this algorithm, we have 3 types of data points (a minimal sketch follows below):

Core Point: a point is a core point if it has at least MinPts points within eps.

Border Point: a point which has fewer than MinPts points within eps but lies in the neighborhood of a core point.

Noise or outlier: a point which is neither a core point nor a border point.
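A minimal sketch using scikit-learn's DBSCAN implementation (assumed available); the data, eps, and min_samples values here are illustrative. Points labeled -1 in the output are noise; the remaining points receive cluster ids, and db.core_sample_indices_ lists the core points.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
db = DBSCAN(eps=3, min_samples=2).fit(X)
print(db.labels_)                # e.g. [0 0 0 1 1 -1]; -1 marks noise
print(db.core_sample_indices_)   # indices of the core points
```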


DBSCAN: Example-1
[Slides 73-78: a worked DBSCAN example, presented in figures.]
References
• https://www.youtube.com/watch?v=oNYtYm0tFso
