Clustering
[Figure: data points grouped into clusters; intra-cluster distances are minimized, inter-cluster distances are maximized]
What is Cluster Analysis?
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no predefined classes
• Clustering is used:
– As a stand-alone tool to get insight into data distribution
• Visualization of clusters may unveil important information
– As a preprocessing step for other algorithms
• Efficient indexing or compression often relies on clustering
Some Applications of Clustering
• Pattern Recognition
• Image Processing
– cluster images based on their visual content
• Bio-informatics
• WWW and IR
– document classification
– cluster Weblog data to discover groups of similar access patterns
Applications of Cluster Analysis
• Understanding
– Discovered clusters of stocks vs. the industry group they match, e.g.:
Cluster 3: MBNA-Corp-DOWN, Morgan-Stanley-DOWN → Financial-DOWN
Cluster 4: Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP → Oil-UP
(another cluster: Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, …)
• Summarization
– Reduce the size of large data sets
[Figure: clustering of precipitation in Australia]
What is not Cluster Analysis?
• Supervised classification
– Have class label information
• Simple segmentation
– Dividing students into different registration groups alphabetically,
by last name
• Results of a query
– Groupings are a result of an external specification
Notion of a Cluster can be Ambiguous
K-means Clustering – Details
• Initial centroids are often chosen randomly.
– Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• ‘Closeness’ is measured by Euclidean distance, cosine similarity,
correlation, etc.
• Most of the convergence happens in the first few iterations.
– Often the stopping condition is changed to ‘Until relatively few points
change clusters’
• Complexity is O( n * K * I * d )
– n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
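As a concrete reference, the details above can be rendered as a short NumPy sketch. This is a minimal illustration, not code from the slides; the function name, random initialization, and Euclidean distance are assumptions consistent with the bullets above.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=None):
    """Minimal K-means: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initial centroids chosen randomly from the data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment step: each point goes to the closest centroid (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return centroids, labels
```

Each iteration computes n * K distances over d attributes, which is exactly where the O(n * K * I * d) complexity quoted above comes from.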
K-Means Example
• Given: {2,4,10,12,3,20,30,11,25}, k=2
• Randomly assign means: m1=3,m2=4
• K1={2,3}, K2={4,10,12,20,30,11,25},
m1=2.5,m2=16
• K1={2,3,4},K2={10,12,20,30,11,25}, m1=3,m2=18
• K1={2,3,4,10},K2={12,20,30,11,25},
m1=4.75,m2=19.6
• K1={2,3,4,10,11,12},K2={20,30,25}, m1=7,m2=25
• Stop: reassigning points with these means leaves both clusters (and
hence the means) unchanged.
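The trace above can be checked with a few lines of plain Python (1-D data, the same initial means; ties are broken toward the first cluster):

```python
data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
m1, m2 = 3.0, 4.0                        # initial means from the example
while True:
    K1 = [x for x in data if abs(x - m1) <= abs(x - m2)]
    K2 = [x for x in data if abs(x - m1) > abs(x - m2)]
    new1, new2 = sum(K1) / len(K1), sum(K2) / len(K2)
    if (new1, new2) == (m1, m2):         # means unchanged: converged
        break
    m1, m2 = new1, new2
print(sorted(K1), sorted(K2), m1, m2)
# [2, 3, 4, 10, 11, 12] [20, 25, 30] 7.0 25.0
```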
Evaluating K-means Clusters
• The most common measure is the Sum of Squared Error (SSE)
– For each point, the error is the distance to the nearest cluster centroid
– To get the SSE, we square these errors and sum them:
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)
where m_i is the centroid of cluster C_i.
[Figure: scatter plots (x vs. y) of alternative K-means clusterings of the same data set]
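In code, the SSE is a direct transcription of this formula. A minimal NumPy sketch, assuming X, labels, and centroids as returned by a K-means run like the one sketched earlier:

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared errors: the squared distance of each point to
    the centroid of its own cluster, summed over all points."""
    diffs = X - centroids[labels]     # x - m_i for each point's cluster i
    return float(np.sum(diffs ** 2))
```

Given two clusterings of the same data, the one with the lower SSE is preferred; this is the criterion used below to pick among multiple runs.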
Importance of Choosing Initial Centroids …
[Figure: scatter plots showing K-means assignments at Iteration 1 and Iteration 2 for different choices of initial centroids]
Solutions to Initial Centroids Problem
• Multiple runs
– For each run, compute the SSE and keep the clustering with the
minimum SSE (see the sketch after this list).
• Sample the data and use hierarchical clustering to determine initial
centroids.
• Take the centroid of all the points, then select the point farthest
from this initial centroid as another centroid.
• Select more than k initial centroids and then choose among them
– Select the most widely separated ones.
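A sketch of the multiple-runs strategy, reusing the kmeans() and sse() sketches defined earlier in this section (scikit-learn's KMeans does essentially the same thing internally via its n_init parameter):

```python
def best_of_runs(X, k, n_runs=10):
    """Run K-means several times from different random initial
    centroids and keep the clustering with the minimum SSE."""
    best_err, best_result = float("inf"), None
    for seed in range(n_runs):
        centroids, labels = kmeans(X, k, seed=seed)
        err = sse(X, labels, centroids)
        if err < best_err:
            best_err, best_result = err, (centroids, labels)
    return best_result
```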
The K-Medoids Clustering Method
• Find representative objects, called medoids, in clusters
• PAM (Kaufman & Rousseeuw, 1987) [Partitioning Around Medoids]
– starts from an initial set of medoids and iteratively replaces one of the
medoids by one of the non-medoids if it improves the total distance of
the resulting clustering
– PAM works effectively for small data sets, but does not scale well for
large data sets
• CLARA (Kaufman & Rousseeuw, 1990) [Clustering LARge Applications]
• CLARANS (Ng & Han, 1994): Randomized sampling [Clustering
Large Applications based upon RANdomized Search]
The K-Medoids Clustering Method
[Figure: four cases of the cost change for an item j when medoid i is swapped with non-medoid h (t is another medoid); j may stay with t, move to h, or move between i, t, and h]
PAM Cost Calculation
• At each step of the algorithm, a medoid is swapped with a
non-medoid if the overall cost is improved.
• C_jih is the cost change for an item t_j associated with swapping
medoid t_i with non-medoid t_h; the total cost of a swap is the sum
of C_jih over all items t_j.
PAM Algorithm
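A minimal sketch of a PAM-style swap loop, assuming Euclidean distance; this is a greedy variant that accepts any cost-improving swap rather than the best one, and the function names are illustrative, not from the slides. The cost change tested for each candidate swap is exactly the sum of the per-item costs C_jih from the previous slide.

```python
import numpy as np

def total_cost(D, medoids):
    """Sum over all items of the distance to the nearest medoid."""
    return D[:, medoids].min(axis=1).sum()

def pam(X, k, seed=None):
    """PAM sketch: swap a medoid with a non-medoid whenever the swap
    lowers the total cost; stop when no swap improves it."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Precompute all pairwise Euclidean distances.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = list(rng.choice(n, size=k, replace=False))
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                # Try swapping medoid i with non-medoid h.
                candidate = medoids[:i] + [h] + medoids[i + 1:]
                # Accept the swap if the total cost decreases
                # (equivalently, if the summed C_jih is negative).
                if total_cost(D, candidate) < total_cost(D, medoids):
                    medoids, improved = candidate, True
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels
```

Each pass examines K(n - K) candidate swaps and every cost evaluation touches all n items, which is why PAM does not scale to large data sets and why CLARA and CLARANS resort to sampling.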