Clustering
Subhajit Chattopadhyay
• Supervised - such methods have a dependent variable (labelled data)
– regression
– time series
– multiple discriminant analysis
– support vector machine
– decision tree
Figure 1: ML Methods
• Unsupervised - such methods do not have any dependent variable or labelled data
– clustering
• Semi-supervised - such methods use labelled data for one part of the task and unlabelled data for another
Figure 2: Cluster Analysis
Types of Clustering Techniques
• Connectivity models: distance connectivity between observations is the measure, e.g. hierarchical clustering
• Centroid models: distance from the mean value (centroid) of each cluster is the measure, e.g. k-means clustering
• Distribution models: significance of the statistical distribution of variables in the dataset is the measure, e.g. expectation-maximization algorithms
• Density models: density in the data space is the measure, e.g. DBSCAN (one representative of each family is sketched below)
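As a quick illustration, here is a minimal scikit-learn sketch instantiating one representative algorithm from each family; the dataset and parameter values are illustrative, not prescriptive:

from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering, KMeans, DBSCAN
from sklearn.mixture import GaussianMixture

# Illustrative toy data: 300 points around 3 centres
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Connectivity model: hierarchical (agglomerative) clustering
hier = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Centroid model: k-means
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Distribution model: Gaussian mixture fitted by expectation-maximization
gmm = GaussianMixture(n_components=3, random_state=42).fit(X).predict(X)

# Density model: DBSCAN (label -1 marks noise points)
db = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)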
Another way to categorize clustering models is by the variant of the k-means algorithm used:
• MacQueen algorithm => the MacQueen and Lloyd algorithms are very similar; the only difference is (see the sketch after this list):
– The Lloyd algorithm updates the centroids only after each full pass over the data, so it is called a batch or offline algorithm.
– The MacQueen algorithm updates the centroids whenever a case changes cluster, as well as when the algorithm passes through the entire dataset. The MacQueen algorithm converges quicker than the Lloyd algorithm.
• Hartigan-Wong algorithm => this algorithm, for each case of the dataset, calculates the sum of squared error of that case’s current cluster excluding the case, $SS = \sum_{i \in k}(x_i - c_k)^2$, where $c_k$ is the centroid of the cluster under consideration, and also the sums of squared errors of the other clusters where the case may be assigned.
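A rough numpy sketch of the difference between the Lloyd and MacQueen update rules follows; the function names and the seeding scheme are illustrative, not from any library:

import numpy as np

def lloyd_pass(X, centroids):
    # Lloyd (batch/offline): assign every case first, then recompute
    # all centroids once at the end of the full pass.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    return np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                     else centroids[k]            # keep empty clusters in place
                     for k in range(len(centroids))])

def macqueen_pass(X, centroids, counts):
    # MacQueen (online): move the winning centroid immediately after
    # each assignment, maintaining a running mean of its members.
    centroids = centroids.copy()
    for x in X:
        k = ((centroids - x) ** 2).sum(axis=1).argmin()
        counts[k] += 1
        centroids[k] += (x - centroids[k]) / counts[k]
    return centroids

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
seeds = X[:3].copy()                              # illustrative seeding
after_lloyd = lloyd_pass(X, seeds)
after_macqueen = macqueen_pass(X, seeds, counts=np.ones(3))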
Measuring the distance between two data points
The distance between two cases expresses their similarity or dissimilarity: the higher the distance between two cases, the lower the similarity, and vice versa.
• Euclidean distance => $d(x, y) = \left[\sum_{i=1}^{n} (x_i - y_i)^2\right]^{1/2}$
• Manhattan distance => $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
• Pearson correlation index => $d(x, y) = 1 - \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\left[\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2\right]^{1/2}}$
• Eisen cosine correlation distance => $d(x, y) = 1 - \frac{\left|\sum_i x_i y_i\right|}{\left[\sum_i x_i^2 \sum_i y_i^2\right]^{1/2}}$
• Minkowski distance => $d(x_i, x_j) = \left[\sum_{k=1}^{p} |x_{ik} - x_{jk}|^r\right]^{1/r}$
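Most of these distances are available directly in scipy; a small sketch on two illustrative vectors (note that scipy’s cosine distance omits the absolute value used in the Eisen variant):

import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 1.0, 3.0])

print(distance.euclidean(x, y))       # [sum (x_i - y_i)^2]^(1/2)
print(distance.cityblock(x, y))       # Manhattan: sum |x_i - y_i|
print(distance.correlation(x, y))     # 1 - Pearson correlation
print(distance.cosine(x, y))          # 1 - cosine similarity
print(distance.minkowski(x, y, p=3))  # [sum |x_i - y_i|^r]^(1/r), here r = 3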
To decide the number of clusters, internal cluster metrics are examined:
• Davies-Bouldin’s index => It looks at the ratio of within-cluster variance (scatter) to the distance between the centroids of the clusters. This ratio is computed for every possible pair of clusters and for each cluster.
$$scatter_k = \left[\frac{1}{n_k}\sum_{i \in k}(x_i - c_k)^2\right]^{1/2}$$

$$separation_{j,k} = \left[\sum (c_j - c_k)^2\right]^{1/2}$$

$$ratio_{j,k} = \frac{scatter_j + scatter_k}{separation_{j,k}}$$
The largest such ratio for cluster $k$ is termed $R_k$, and $DBIndex = \frac{1}{N}\sum_{k=1}^{N} R_k$, where $N$ is the number of clusters; lower values indicate better clustering, as the scan below illustrates.
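scikit-learn exposes this metric as davies_bouldin_score; a sketch scanning candidate values of k on illustrative data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Lower Davies-Bouldin => more compact, better separated clusters
    print(k, davies_bouldin_score(X, labels))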
• Dunn’s index => It aims to identify dense and well-separated clusters. It is defined as the ratio of the minimal inter-cluster distance to the maximal cluster diameter (the largest within-cluster distance):

$$D = \frac{\min_{i \neq j} d(C_i, C_j)}{\max_k \, diam(C_k)}$$

Higher values indicate better clustering.
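scikit-learn has no built-in Dunn index, so here is a direct numpy/scipy sketch of the definition above (quadratic in the number of cases, so suitable only for small data):

import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    # Dunn = (minimal inter-cluster distance) / (maximal cluster diameter)
    clusters = [X[labels == k] for k in np.unique(labels)]
    # Diameter: largest pairwise distance within one cluster
    diameters = [cdist(c, c).max() for c in clusters]
    # Separation: smallest pairwise distance between two different clusters
    separations = [cdist(a, b).min()
                   for i, a in enumerate(clusters)
                   for b in clusters[i + 1:]]
    return min(separations) / max(diameters)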
• Pseudo F-statistic => It is the ratio of the between-cluster sum of squares to the within-cluster sum of squares, each divided by its degrees of freedom:

$$Pseudo\text{-}F = \frac{SS_{between}/(k-1)}{SS_{within}/(n-k)}$$

$$SS_{between} = \sum_{j=1}^{k} n_j (c_j - c_g)^2$$

$$SS_{within} = \sum_{j=1}^{k} \sum_{i \in j} (x_i - c_j)^2$$

where $k$ is the number of clusters, $n$ is the number of cases, $n_j$ is the size of cluster $j$, and $c_g$ is the grand centroid of the whole dataset.
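A numpy sketch computing the pseudo-F directly from these formulas; this is the same quantity that scikit-learn provides as calinski_harabasz_score:

import numpy as np

def pseudo_f(X, labels):
    # Pseudo-F = [SS_between / (k - 1)] / [SS_within / (n - k)]
    n, k = len(X), len(np.unique(labels))
    c_g = X.mean(axis=0)                      # grand centroid
    ss_between = ss_within = 0.0
    for lab in np.unique(labels):
        cluster = X[labels == lab]
        c = cluster.mean(axis=0)              # cluster centroid
        ss_between += len(cluster) * ((c - c_g) ** 2).sum()
        ss_within += ((cluster - c) ** 2).sum()
    return (ss_between / (k - 1)) / (ss_within / (n - k))

Higher values indicate a stronger cluster structure, so the number of clusters that maximizes the pseudo-F is preferred.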