ML12 Clustering
Nominal (categorical)
• dissimilarity is the fraction of mismatching variables:
d(i, j) = (p − m) / p
where m is the number of variables on which objects i and j match and p is the total number of variables
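A minimal sketch of this matching-based dissimilarity in Python; the function name and sample values are hypothetical:
import numpy as np

def nominal_dissimilarity(a, b):
    # d(i, j) = (p - m) / p for two vectors of categorical values
    a, b = np.asarray(a), np.asarray(b)
    p = a.size               # total number of variables
    m = np.sum(a == b)       # number of variables on which the objects match
    return (p - m) / p

print(nominal_dissimilarity(["red", "S", "yes"], ["red", "M", "yes"]))  # 1/3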
Ordinal (numerical)
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank r_if ∈ {1, ..., M_f}
• Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
z_if = (r_if − 1) / (M_f − 1)
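A minimal sketch of this mapping, assuming the ranks are stored in a NumPy array (the values are hypothetical):
import numpy as np

ranks = np.array([1, 2, 3, 4])   # hypothetical ranks r_if with M_f = 4
M_f = ranks.max()
z = (ranks - 1) / (M_f - 1)      # maps each rank onto [0, 1]
print(z)                         # [0.    0.333 0.667 1.   ]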
• Distances are normally used to measure the similarity or dissimilarity between two
data objects (data are generally standardized or normalized before computing distances)
• Some popular ones include the Minkowski distance:
d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + ... + |x_ip − x_jp|^q)^(1/q)
where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects,
and q is a positive integer
• If q = 1, d is the Manhattan distance; if q = 2, d is the Euclidean distance (see the sketch after the properties list)
• Properties
• d(i, j) ≥ 0 (non-negativity)
• d(i, i) = 0
• d(i, j) = d(j, i) (symmetry)
• d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
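A small sketch of the Minkowski formula in Python; the helper function and the two example points are hypothetical:
import numpy as np

def minkowski(x, y, q):
    # d(i, j) = (sum_k |x_k - y_k|^q)^(1/q)
    return np.sum(np.abs(x - y) ** q) ** (1 / q)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])
print(minkowski(x, y, 1))  # q = 1, Manhattan distance -> 7.0
print(minkowski(x, y, 2))  # q = 2, Euclidean distance -> 5.0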
Distances between clusters
• Single. Single linkage defines the distance between two objects
or clusters as the distance between the two closest members of
those clusters.
• Complete. Complete linkage uses the most distant pair of objects
in two clusters to compute between-cluster distances.
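Both definitions can be checked directly on two small point sets; a minimal sketch, with hypothetical cluster members A and B:
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0]])  # members of cluster A (hypothetical)
B = np.array([[4.0, 0.0], [6.0, 0.0]])  # members of cluster B (hypothetical)

pairwise = cdist(A, B)   # all member-to-member distances
print(pairwise.min())    # single linkage: closest pair -> 3.0
print(pairwise.max())    # complete linkage: most distant pair -> 6.0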
[Figure: cluster trees (dendrograms) of Cases 1–41 under four linkage methods: average, median, centroid, and Ward minimum variance; x-axis: Distances]
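The four cluster trees above could be reproduced along the following lines; this is a sketch, assuming X is an (n_samples, n_features) array holding the 41 cases:
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

methods = ["average", "median", "centroid", "ward"]
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
for ax, method in zip(axes.ravel(), methods):
    # median/centroid/ward assume Euclidean distances on raw observations
    Z = linkage(X, method=method)
    dendrogram(Z, orientation="left",
               labels=[f"Case {i + 1}" for i in range(len(X))], ax=ax)
    ax.set_title(f"Cluster Tree: {method} linkage")
    ax.set_xlabel("Distances")
plt.tight_layout()
plt.show()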
# Perform agglomerative (hierarchical) clustering; df is assumed to be a
# numeric feature table with one row per observation
from sklearn.cluster import AgglomerativeClustering

agg = AgglomerativeClustering(n_clusters=5)
ypred = agg.fit_predict(df)  # cluster label for each observation
print(agg.labels_)           # same labels, stored on the fitted estimator
# Create a dendrogram from Ward-linkage hierarchical clustering of df
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, ward

result = ward(df)  # linkage matrix (Ward minimum variance method)
dendrogram(result)
plt.title("DENDROGRAM")
plt.xlabel('Observations')
plt.ylabel('Distances')
plt.show()
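As a follow-up, the Ward tree computed above can be cut into a flat partition; here t=5 mirrors the n_clusters=5 setting used earlier:
from scipy.cluster.hierarchy import fcluster

labels = fcluster(result, t=5, criterion='maxclust')  # cut the tree into 5 flat clusters
print(labels)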