Clustering
Sridhar S
Department of IST
Anna University
Clustering
Definition
Why Clustering?
Learning from data: clustering is unsupervised learning.
What does it do?
Pattern detection
Simplification
Concept construction
Examples
Clustering of documents or research groups by citation index, evolution of papers.
Problem: find a minimal set of papers describing the essential ideas.
Clustering of items from excavations according to cultural epochs.
Clustering of tooth fragments in anthropology.
Carl von Linne: Systema Naturae, 1735 (botany, later zoology).
etc.
Mining
Information Retrieval
Text Mining
Web Analysis
Marketing
Medical Diagnostics
Image Analysis
Bioinformatics
Overview of clustering
Feature Selection: identifying the most effective subset of the original features to use in clustering.
Feature Extraction: transformations of the input features to produce new salient features.
Inter-pattern Similarity: measured by a distance function defined on pairs of patterns.
Grouping: methods to group similar patterns in the same cluster.
Distance Measures
The Euclidean distance: $D_E(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$
The D4 (city-block) distance: $D_4(x, y) = |x_1 - y_1| + |x_2 - y_2|$
The D8 (chessboard) distance: $D_8(x, y) = \max(|x_1 - y_1|, |x_2 - y_2|)$
Example
Metric Distances
What properties must a distance measure D satisfy to be a metric?
D(A,B) = D(B,A)               Symmetry
D(A,A) = 0                    Constancy of Self-Similarity
D(A,B) >= 0                   Positivity
D(A,B) <= D(A,C) + D(B,C)     Triangular Inequality
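As a quick illustration (not from the original slides), here is a small Python sketch computing the three distance measures above for a pair of 2-D points; the point pair is arbitrary.

import math

def euclidean(p, q):
    # D_E: straight-line distance between two 2-D points
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def d4(p, q):
    # D4: city-block (Manhattan) distance
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def d8(p, q):
    # D8: chessboard (Chebyshev) distance
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

a, b = (4, 4), (15, 8)
for name, d in (("Euclidean", euclidean), ("D4", d4), ("D8", d8)):
    print(name, d(a, b))   # each of these satisfies symmetry, positivity and the triangle inequality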
Hierarchical Clustering
Produces a set of nested clusters organised as a hierarchical tree, which can be visualised as a dendrogram.
[Figure: example dendrogram; the vertical axis gives the distance at which clusters are merged]
Hierarchical Clustering: Agglomerative and Divisive Algorithms
Agglomerative: start with each sample as its own cluster and repeatedly merge the two closest clusters.
Divisive: start with all samples in one cluster and repeatedly split clusters.
The single-linkage (nearest-neighbour) distance between two clusters:
$D_{sl}(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$
Initial Table
Pixel    x     y
1        4     4
2        8     4
3        15    8
4        24    4
5        24    12
First Iteration
Pixel    1      2      3      4      5
1        0      4.0    11.7   20.0   21.5
2        4.0    0      8.1    16.0   17.9
3        11.7   8.1    0      9.8    9.8
4        20.0   16.0   9.8    0      8.0
5        21.5   17.9   9.8    8.0    0
Second Iteration
         {1,2}  3      4      5
{1,2}    0      8.1    16.0   17.9
3        8.1    0      9.8    9.8
4        16.0   9.8    0      8.0
5        17.9   9.8    8.0    0
Third Iteration
         {1,2}  3      {4,5}
{1,2}    0      8.1    16.0
3        8.1    0      9.8
{4,5}    16.0   9.8    0
Dendrogram (single linkage)
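The merge sequence above can be reproduced with a short from-scratch Python sketch (an illustration, not part of the slides), using the same five pixels:

import math

# The five pixels from the worked example above.
points = {1: (4, 4), 2: (8, 4), 3: (15, 8), 4: (24, 4), 5: (24, 12)}

def single_link(a, b):
    # Single-linkage distance between two clusters (tuples of pixel ids).
    return min(math.dist(points[i], points[j]) for i in a for j in b)

clusters = [(i,) for i in points]                     # start: every pixel is its own cluster
while len(clusters) > 1:
    # find and merge the closest pair of clusters
    a, b = min(((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
               key=lambda pair: single_link(*pair))
    print("merge", a, b, "at distance", round(single_link(a, b), 1))
    clusters = [c for c in clusters if c not in (a, b)] + [a + b]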
The complete-linkage (farthest-neighbour) distance between two clusters:
$D_{cl}(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} d(x, y)$
Initial Table
Pixel    x     y
1        4     4
2        8     4
3        15    8
4        24    4
5        24    12
First Iteration
Pixel    1      2      3      4      5
1        0      4.0    11.7   20.0   21.5
2        4.0    0      8.1    16.0   17.9
3        11.7   8.1    0      9.8    9.8
4        20.0   16.0   9.8    0      8.0
5        21.5   17.9   9.8    8.0    0
Second Iteration
         {1,2}  3      4      5
{1,2}    0      11.7   20.0   21.5
3        11.7   0      9.8    9.8
4        20.0   9.8    0      8.0
5        21.5   9.8    8.0    0
Third Iteration
         {1,2}  3      {4,5}
{1,2}    0      11.7   21.5
3        11.7   0      9.8
{4,5}    21.5   9.8    0
Dendrogram (complete linkage)
The group-average distance between two clusters:
$D_{avg}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i,\, y \in C_j} d(x, y)$
Initial Table
Pixel    x     y
1        4     4
2        8     4
3        15    8
4        24    4
5        24    12
First Iteration
Pixel    1      2      3      4      5
1        0      4.0    11.7   20.0   21.5
2        4.0    0      8.1    16.0   17.9
3        11.7   8.1    0      9.8    9.8
4        20.0   16.0   9.8    0      8.0
5        21.5   17.9   9.8    8.0    0
Second Iteration
         {1,2}  3      4      5
{1,2}    0      9.9    18.0   19.7
3        9.9    0      9.8    9.8
4        18.0   9.8    0      8.0
5        19.7   9.8    8.0    0
Third Iteration
         {1,2}  3      {4,5}
{1,2}    0      9.9    18.9
3        9.9    0      9.8
{4,5}    18.9   9.8    0
Dendrogram (average linkage)
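The three linkage variants can also be checked with SciPy (a sketch assuming SciPy is installed; it is not part of the original slides):

import numpy as np
from scipy.cluster.hierarchy import linkage

# The same five pixels as in the worked examples.
X = np.array([[4, 4], [8, 4], [15, 8], [24, 4], [24, 12]], dtype=float)

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)   # each row: (cluster a, cluster b, merge distance, new size)
    print(method)
    print(np.round(Z, 1))
    # scipy.cluster.hierarchy.dendrogram(Z) would plot the corresponding dendrogram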
Strengths: the number of clusters need not be specified in advance, and the dendrogram provides clusterings at every level of granularity.
Limitations: high time and space complexity, and greedy merge decisions cannot be revised later.
Ward's Algorithm
At each step, merge the pair of clusters whose union gives the smallest total squared error.
Squared Error (first merge)
Partition                     Squared Error
{1,2},{3},{4},{5}             8.0
{1,3},{2},{4},{5}             68.5
{1,4},{2},{3},{5}             200
{1,5},{2},{3},{4}             232
{2,3},{1},{4},{5}             32.5
{2,4},{1},{3},{5}             128
{2,5},{1},{3},{4}             160
{3,4},{1},{2},{5}             48.5
{3,5},{1},{2},{4}             48.5
{4,5},{1},{2},{3}             32.0
The minimum is 8.0, so pixels 1 and 2 are merged.
Squared Error (second merge)
Partition                     Squared Error
{1,2,3},{4},{5}               72.7
{1,2,4},{3},{5}               224
{1,2,5},{3},{4}               266.7
{1,2},{3,4},{5}               56.5
{1,2},{3,5},{4}               56.5
{1,2},{4,5},{3}               40
The minimum is 40, so pixels 4 and 5 are merged.
Squared Error (third merge)
Partition                     Squared Error
{1,2,3},{4,5}                 104.7
{1,2,4,5},{3}                 380.0
{1,2},{3,4,5}                 94
The minimum is 94, so pixel 3 is merged with {4,5}.
Dendrogram (Ward's method)
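The squared-error values in these tables are easy to verify with a few lines of Python (an illustrative check, not from the slides):

import numpy as np

points = {1: (4, 4), 2: (8, 4), 3: (15, 8), 4: (24, 4), 5: (24, 12)}

def squared_error(partition):
    # Sum over clusters of squared distances from each member to the cluster centroid.
    total = 0.0
    for cluster in partition:
        pts = np.array([points[i] for i in cluster], dtype=float)
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total

for partition in ([{1, 2}, {3}, {4}, {5}],
                  [{3, 4}, {1}, {2}, {5}],
                  [{1, 2}, {3, 4, 5}]):
    print(partition, squared_error(partition))    # 8.0, 48.5, 94.0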
Overall Assessment
A naive implementation takes O(N^3) time.
Once a decision is made to merge two clusters, it cannot be undone.
No global objective function is directly minimised.
Different linkage schemes can produce very different clusterings and are sensitive to noise and outliers.
Divisive Hierarchical Algorithms
Spanning Tree
A spanning tree of a graph G is a subgraph that is a tree and contains every vertex of G; a minimum spanning tree (MST) has the smallest total edge weight. A divisive clustering can be obtained by building the MST of the data and repeatedly removing its longest edges.
Two classical MST algorithms:
Kruskal's Algorithm
Prim's Algorithm
Kruskal's Algorithm
Kruskal's Algorithm:
    sort the edges of G in increasing order by length
    keep a subgraph S of G, initially empty
    for each edge e in sorted order
        if the endpoints of e are disconnected in S
            add e to S
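A compact Python version of this pseudocode (an illustrative sketch; the example edge weights are made up), using a union-find structure to test whether two endpoints are already connected:

def kruskal(n, edges):
    # edges: list of (weight, u, v) with vertices numbered 0..n-1
    parent = list(range(n))

    def find(x):                       # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges):      # edges in increasing order of length
        ru, rv = find(u), find(v)
        if ru != rv:                   # endpoints are in different components
            parent[ru] = rv
            mst.append((u, v, w))
    return mst

# Hypothetical example graph.
print(kruskal(4, [(1, 0, 1), (4, 1, 2), (3, 0, 2), (2, 2, 3), (5, 0, 3)]))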
Prim's Algorithm
Named after Robert Clay Prim.
Prim's approach: grow a single tree, at each step adding the cheapest edge that connects the tree to a vertex outside it.
Prim's Algorithm:
    let T be a single vertex x
    while (T has fewer than n vertices)
    {
        find the smallest edge connecting T to G-T
        add it to T
    }
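A direct Python rendering of this loop (illustrative; the adjacency matrix below is a made-up example):

import math

def prim(weights):
    # weights: symmetric adjacency matrix, math.inf where no edge exists
    n = len(weights)
    in_tree = [False] * n
    in_tree[0] = True                      # let T be a single vertex
    mst = []
    while sum(in_tree) < n:                # while T has fewer than n vertices
        # find the smallest edge connecting T to G - T
        w, u, v = min((weights[u][v], u, v)
                      for u in range(n) if in_tree[u]
                      for v in range(n) if not in_tree[v])
        in_tree[v] = True                  # add it to T
        mst.append((u, v, w))
    return mst

INF = math.inf
W = [[INF, 1, 3, INF],
     [1, INF, 4, 2],
     [3, 4, INF, 5],
     [INF, 2, 5, INF]]
print(prim(W))                             # [(0, 1, 1), (1, 3, 2), (0, 2, 3)]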
[Figure: step-by-step construction of the minimum spanning tree on an example graph with edge weights 1, 2, 9, 10, 11, 12, 14, 20, 40, 50]
Partitional Clustering
Introduction
Partitional Clustering:
Forgy's Algorithm
k-means Algorithm
Isodata Algorithm
Partitional Clustering
Forgy's Algorithm
Input: k, the number of clusters, and k samples used as seed points.
Partitional Clustering
Forgy's Algorithm
1. Initialise the k cluster centroids to the seed points.
2. For each sample, find the centroid nearest it and put the sample in the cluster identified with that centroid.
3. If no sample changed clusters in step 2, stop.
4. Compute the centroids of the resulting clusters and go to step 2.
Partitional Clustering
Forgy's Algorithm
Initialization: k = 2; the first two samples, (4,4) and (8,4), are used as the initial cluster centroids.
Samples: (4,4), (8,4), (15,8), (24,4), (24,12)
Partitional Clustering
Forgy's Algorithm
First iteration
Sample      Nearest centroid
(4,4)       (4,4)
(8,4)       (8,4)
(15,8)      (8,4)
(24,4)      (8,4)
(24,12)     (8,4)
New centroids: (4,4) and (17.75, 7)
Partitional Clustering
Forgy's Algorithm
Second iteration
Sample      Nearest centroid
(4,4)       (4,4)
(8,4)       (4,4)
(15,8)      (17.75, 7)
(24,4)      (17.75, 7)
(24,12)     (17.75, 7)
New centroids: (6, 4) and (21, 8)
Partitional Clustering
Forgy's Algorithm
Third iteration
Sample      Nearest centroid
(4,4)       (6, 4)
(8,4)       (6, 4)
(15,8)      (21, 8)
(24,4)      (21, 8)
(24,12)     (21, 8)
No sample changed clusters, so the algorithm terminates with clusters {(4,4), (8,4)} and {(15,8), (24,4), (24,12)}.
Partitional Clustering
Forgy's Algorithm
It has been proved that Forgy's algorithm terminates.
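The iterations above can be reproduced with a short Python sketch of Forgy's algorithm (illustrative; it assumes no cluster ever becomes empty):

import math

def forgy(samples, seeds):
    # Forgy's algorithm: batch reassignment of all samples, then recompute centroids.
    centroids = list(seeds)
    while True:
        # Step 2: assign every sample to the cluster of its nearest centroid.
        assign = [min(range(len(centroids)), key=lambda c: math.dist(s, centroids[c]))
                  for s in samples]
        # Step 4: recompute each centroid as the mean of its assigned samples.
        new_centroids = []
        for c in range(len(centroids)):
            members = [s for s, a in zip(samples, assign) if a == c]
            new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
        if new_centroids == centroids:        # Step 3: nothing changed, stop.
            return centroids, assign
        centroids = new_centroids

samples = [(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)]
print(forgy(samples, seeds=[(4, 4), (8, 4)]))
# final centroids: (6.0, 4.0) and (21.0, 8.0), matching the third iteration above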
k-means Algorithm
Partitional Clustering:
Forgy's Algorithm
k-means Algorithm
Isodata Algorithm
k-means Algorithm
Similar to Forgy's algorithm.
Differences: the centroid is recomputed as soon as a sample is added to a cluster, and the data is passed through only twice.
k-means Algorithm
Begin with k clusters, each consisting of one of the first k samples. For each of the remaining n - k samples, find the centroid nearest it. Put the sample in the cluster identified with this nearest centroid. After each sample is assigned, recompute the centroid of the altered cluster.
Go through the data a second time. For each sample, find the centroid nearest it. Put the sample in the cluster identified with the nearest centroid. (During this step, do not recompute any centroid.)
k-means Algorithm
Example (k = 2): after the first pass the clusters are {(8,4), (15,8), (4,4)} with centroid (9, 5.3) and {(24,4), (24,12)} with centroid (24, 8).
Second pass:
Sample      Distance to centroid (9, 5.3)    Distance to centroid (24, 8)
(8,4)       1.6                              16.5
(24,4)      15.1                             4.0
(15,8)      6.6                              9.0
(4,4)       5.2                              20.4
(24,12)     16.4                             4.0
Each sample is already nearest to its own cluster's centroid, so the final clusters are {(8,4), (15,8), (4,4)} and {(24,4), (24,12)}.
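Here is a sketch of the sequential, two-pass k-means variant described above (illustrative only; the sample order and seeds below are assumptions chosen to reproduce the centroids (9, 5.3) and (24, 8) from the example):

def sequential_kmeans(samples, k):
    # Two-pass k-means: incremental centroid updates, then one relabelling pass.
    clusters = [[s] for s in samples[:k]]            # the first k samples start the clusters
    centroids = [s for s in samples[:k]]

    def nearest(s):
        return min(range(k), key=lambda c: (s[0] - centroids[c][0]) ** 2 + (s[1] - centroids[c][1]) ** 2)

    # First pass: assign the remaining n - k samples, updating the centroid after each assignment.
    for s in samples[k:]:
        c = nearest(s)
        clusters[c].append(s)
        centroids[c] = tuple(sum(v) / len(clusters[c]) for v in zip(*clusters[c]))

    # Second pass: relabel every sample with its nearest centroid (centroids are frozen).
    labels = [nearest(s) for s in samples]
    return centroids, labels

# Assumed processing order that yields the centroids used in the worked example.
samples = [(8, 4), (24, 4), (15, 8), (4, 4), (24, 12)]
print(sequential_kmeans(samples, k=2))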
Isodata Algorithm
Partitional Clustering:
Forgy's Algorithm
k-means Algorithm
Isodata Algorithm
Isodata Algorithm
An enhancement of the k-means algorithm that adjusts the number of clusters as it runs.
Split clusters whose samples are too spread out, and merge clusters whose centroids are too close together.
Model-based clustering
Assume the data are generated by a mixture of underlying probability distributions, one component per cluster.
Model-based clustering
Assume, for example, that each component is a Gaussian; clustering then amounts to estimating the parameters of the mixture, which can be done with the EM algorithm.
EM Algorithm
E-step: the probability that sample $x_i$ belongs to cluster $C_k$:
$\Pr(x_i \in C_k) = \dfrac{w_k \Pr(x_i \mid C_k)}{\sum_{j} w_j \Pr(x_i \mid C_j)}$
M-step: the updated mixture weight of cluster $C_k$:
$w_k = \dfrac{1}{n} \sum_{i=1}^{n} \Pr(x_i \in C_k)$
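A minimal sketch of these two updates for a one-dimensional Gaussian mixture (illustrative only; the data are synthetic, and the means and variances are also re-estimated in the M-step):

import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians.
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(6, 1, 100)])

k = 2
w = np.full(k, 1 / k)                 # mixture weights w_k
mu = np.array([x.min(), x.max()])     # initial means
var = np.full(k, x.var())             # initial variances

for _ in range(50):
    # E-step: Pr(x_i in C_k) proportional to w_k * Pr(x_i | C_k)
    dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    resp = w * dens
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: re-estimate weights, means and variances from the memberships
    nk = resp.sum(axis=0)
    w = nk / len(x)                   # w_k = (1/n) * sum_i Pr(x_i in C_k)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(np.round(w, 2), np.round(mu, 2))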
Validity of clusters
Why assess the validity of clusters?
Measuring the optimality of clusters
Verification of the biological meaning of clusters
Optimality of clusters
Optimal clusters should be compact (small intra-cluster distances) and well separated from each other.
Example of an intra-cluster measure: the squared error
$se = \sum_{i=1}^{k} \sum_{p \in c_i} \lVert p - m_i \rVert^2$
where $m_i$ is the centroid of cluster $c_i$.
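As a quick worked instance, take the two clusters found by Forgy's algorithm above, with centroids m1 = (6, 4) and m2 = (21, 8):

se = [(4-6)^2 + (4-4)^2 + (8-6)^2 + (4-4)^2]
   + [(15-21)^2 + (8-8)^2 + (24-21)^2 + (4-8)^2 + (24-21)^2 + (12-8)^2]
   = 8 + 86 = 94

This matches the value 94 listed for the partition {1,2},{3,4,5} in the Ward's-algorithm tables.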
References