Complete Clustering
School of Computer Engineering
Scenarios where cluster analysis proves its worth
o Here are three of the most common scenarios where cluster analysis
proves its worth.
o Exploratory data analysis
1. When you have a new dataset and are in the early stages of understanding it,
cluster analysis can provide a much-needed guide.
2. By forming clusters, you can get a read on potential patterns or trends that
could warrant deeper investigation.
o Market segmentation
1. This is a golden application for cluster analysis, especially in the business
world: when you aim to target your products or services more
effectively, understanding your customer base becomes paramount.
2. Cluster analysis can carve out specific customer segments based on buying
habits, preferences or demographics, allowing for tailored marketing strategies
that resonate more deeply.
o Resource allocation
1. Be it in healthcare, manufacturing, logistics or many other sectors, resource
allocation is often one of the biggest challenges. Cluster analysis can be used to
identify which groups or areas require the most attention or resources, enabling
more efficient and targeted deployment.
K-Means Clustering
https://www.youtube.com/watch?v=KzJORp8bgqs
Problems of the K-Means Algorithm
• The problem with the K-Means algorithm is that it does not handle outlier
data well.
• An outlier is a point that lies far from the rest of the points.
• An outlier may end up in its own cluster, or it may pull a nearby cluster's
centroid toward itself.
• Because a centroid is the arithmetic mean of all points assigned to it, even a
single extreme outlier can shift a cluster's mean substantially.
• Hence, K-Means clustering is highly affected by outlier data, as the sketch
below illustrates.
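As a minimal sketch of this sensitivity (assuming NumPy and scikit-learn are available; the data here is made up purely for illustration), the snippet below fits K-Means on two tight groups, with and without a single extreme outlier:

import numpy as np
from sklearn.cluster import KMeans

# Two tight groups of made-up 2-D points (hypothetical data for illustration).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0, 0], scale=0.3, size=(20, 2))
group_b = rng.normal(loc=[5, 5], scale=0.3, size=(20, 2))

X_clean = np.vstack([group_a, group_b])
X_outlier = np.vstack([group_a, group_b, [[50, 50]]])  # one extreme outlier

for name, X in [("clean", X_clean), ("with outlier", X_outlier)]:
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(name, "centroids:\n", km.cluster_centers_)

# With the outlier present, the centroids change drastically: the outlier
# either grabs a centroid of its own or drags one far from the true group
# means, because each centroid is simply the mean of its assigned points.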
K-Medoids Algorithm
Algorithm:
1. Randomly select k points from the data to be the initial medoids
2. Calculate the distance between each medoid and non-medoid point, and assign each
point to the nearest medoid
3. Calculate the cost, which is the sum of the distances of each data point from its
assigned medoid
4. Swap a medoid point with a non-medoid point from the same cluster, and recalculate
the cost
5. If the new cost is higher, undo the swap
6. Otherwise, keep the swap; repeat steps 4 and 5 until the medoids no longer change
Features
1. The number of clusters, k, must be specified before running the algorithm
2. K-medoids is a variant of the k-means algorithm, but uses actual data points instead of
centroids to represent clusters
3. K-medoids is less sensitive to noise and outliers than k-means
4. K-medoids can produce better clusterings than k-means in some cases, but evaluating
every possible medoid swap makes it slow on large datasets (see the sketch after this list)
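The following is a minimal NumPy sketch of the swap-based procedure described above (function and variable names are illustrative, and the implementation is deliberately unoptimized):

import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    """PAM-style k-medoids sketch (illustrative, not optimized)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    def total_cost(meds):
        # Steps 2-3: assign each point to its nearest medoid, sum the distances.
        return dist[:, meds].min(axis=1).sum()

    # Step 1: randomly pick k distinct points as the initial medoids.
    medoids = rng.choice(n, size=k, replace=False)
    cost = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        # Step 4: try swapping each medoid with each non-medoid point.
        for i in range(k):
            for p in range(n):
                if p in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = p
                new_cost = total_cost(candidate)
                if new_cost < cost:          # Step 5: keep only improving swaps
                    medoids, cost, improved = candidate, new_cost, True
        if not improved:                     # Step 6: stop when no swap helps
            break
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels, cost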
Example-1
(A worked K-Medoids example, carried step by step across several slides; the figures are not reproduced here.)
Hierarchical Clustering
Advantages:
• The ability to handle non-convex clusters and clusters of
different sizes and densities.
• The ability to handle missing data and noisy data.
• The ability to reveal the hierarchical structure of the data,
which can be useful for understanding the relationships
among the clusters.
Drawbacks of Hierarchical Clustering
• The need for a criterion to stop the clustering process and
determine the final number of clusters.
• The computational cost and memory requirements of the
method can be high, especially for large datasets.
• The results can be sensitive to the initial conditions, linkage
criterion, and distance metric used.
Types of Hierarchical Clustering
Agglomerative Clustering
• Initially, consider every data point as an individual cluster, and at
every step merge the nearest pair of clusters. (It is a bottom-up
method.)
• At first, every data point is considered an individual entity or cluster.
• At every iteration, clusters merge with other clusters until a single
cluster remains.
Algorithm:
1. Consider every data point as an individual cluster
2. Calculate the similarity of each cluster with all the other clusters
(compute the proximity matrix)
3. Merge the clusters that are most similar or closest to each other
4. Recalculate the proximity matrix for the merged clusters
5. Repeat steps 3 and 4 until only a single cluster remains (a library-based
sketch follows)
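As a minimal sketch of these steps (assuming SciPy is available; the linkage method and the cut level are illustrative choices, and the data is made up):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 2-D points: two loose groups (hypothetical data for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(4, 0.5, (10, 2))])

# Agglomerative (bottom-up) clustering: each point starts as its own
# cluster, and the closest pair of clusters is merged at every step.
# 'average' linkage defines cluster-to-cluster proximity as the mean
# pairwise distance; 'single' and 'complete' are common alternatives.
Z = linkage(X, method="average")

# Cut the resulting hierarchy into 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)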
DBSCAN Clustering
o eps: It defines the neighborhood around a data point, i.e., if the distance
between two points is less than or equal to eps, they are considered
neighbors.
o If the eps value is chosen too small, a large part of the data will be
treated as outliers.
o If it is chosen too large, clusters will merge and the majority of the
data points will end up in the same cluster.
o One way to choose a suitable eps value is the k-distance graph, as sketched below.
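A minimal sketch of that heuristic, assuming scikit-learn and Matplotlib are available (the k value mirrors MinPts and is an illustrative choice):

import numpy as np
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

def k_distance_graph(X, k=4):
    # Distance from each point to its k-th nearest neighbor
    # (n_neighbors=k+1 because each point counts as its own nearest neighbor).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)
    kth = np.sort(dists[:, -1])
    plt.plot(kth)
    plt.xlabel("points sorted by k-distance")
    plt.ylabel(f"distance to {k}-th nearest neighbor")
    plt.show()
    # eps is typically chosen at the "elbow" where the curve bends sharply.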
Steps Used In DBSCAN Algorithm
i. Find all the neighbor points within eps and identify the core points,
i.e., points with at least MinPts neighbors.
ii. For each core point, if it is not already assigned to a cluster, create a new
cluster.
iii. Recursively find all its density-connected points and assign them to the
same cluster as the core point.
Points a and b are said to be density-connected if there exists a
point c that has a sufficient number of points in its neighborhood and
both a and b can be reached from c through a chain of eps-neighbors.
This is a chaining process: if b is a neighbor of c, c is a neighbor of d,
d is a neighbor of e, and e is a neighbor of a, then b is density-connected
to a.
iv. Iterate through the remaining unvisited points in the dataset. Those
points that do not belong to any cluster are noise.
Core Point: A point is a core point if it has at least MinPts points
within eps.
Border Point: A point that has fewer than MinPts points within eps but
lies in the neighborhood of a core point.
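Putting the pieces together, here is a minimal sketch using scikit-learn's DBSCAN (the eps and min_samples values are illustrative and would normally come from the k-distance graph above; the data is made up):

import numpy as np
from sklearn.cluster import DBSCAN

# Made-up data: two dense groups plus one far-away point (hypothetical).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.2, (30, 2)),
    rng.normal(3, 0.2, (30, 2)),
    [[10, 10]],            # an isolated point
])

# min_samples plays the role of MinPts; points labeled -1 are noise.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("clusters found:", set(labels) - {-1})
print("noise points:", np.sum(labels == -1))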