
What is Cluster Analysis?

• Finding groups of objects such that the objects in a group will
be similar (or related) to one another and different from (or
unrelated to) the objects in other groups

(Figure: in a good clustering, intra-cluster distances are minimized
and inter-cluster distances are maximized.)
What is Cluster Analysis?
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no predefined classes
• Clustering is used:
– As a stand-alone tool to get insight into data distribution
• Visualization of clusters may unveil important information
– As a preprocessing step for other algorithms
• Efficient indexing or compression often relies on clustering
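As a quick, concrete illustration of the stand-alone use above, here is a
minimal sketch (assuming scikit-learn and NumPy are installed; the data set
is synthetic and purely illustrative):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three underlying groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Group the points into three clusters (K-means is introduced below)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Inspect how the points are distributed across the clusters
print(np.bincount(labels))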
Some Applications of Clustering
• Pattern Recognition
• Image Processing
– cluster images based on their visual content
• Bio-informatics
• WWW and IR
– document classification
– cluster Weblog data to discover groups of similar access patterns
Applications of Cluster Analysis
• Understanding
– Group related documents for browsing, group genes and proteins that
have similar functionality, or group stocks with similar price
fluctuations

Discovered clusters and their industry groups:
1. Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN,
   Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN,
   INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down,
   Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN,
   Sun-DOWN  →  Technology1-DOWN
2. Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN,
   Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN,
   Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN,
   Microsoft-DOWN, Scientific-Atl-DOWN  →  Technology2-DOWN
3. Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN,
   Morgan-Stanley-DOWN  →  Financial-DOWN
4. Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP,
   Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP,
   Schlumberger-UP  →  Oil-UP

• Summarization
– Reduce the size of large data sets
(Figure: clustering precipitation in Australia)
What is not Cluster Analysis?
• Supervised classification
– Have class label information

• Simple segmentation
– Dividing students into different registration groups alphabetically,
by last name

• Results of a query
– Groupings are a result of an external specification
Notion of a Cluster can be Ambiguous

(Figure: the same set of points can plausibly be grouped into two, four,
or six clusters. How many clusters are there?)


What Is Good Clustering?
• A good clustering method will produce high quality clusters
with
– high intra-class similarity
– low inter-class similarity
• The quality of a clustering result depends on both the
similarity measure used by the method and its implementation
(i.e., the algorithm used).
• The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns.
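As a quick numeric check of these two criteria, here is a minimal sketch
(NumPy only; X is a matrix of points and labels holds their assigned
clusters, both assumed given):

import numpy as np

def intra_inter(X, labels):
    """Average within-cluster vs. between-cluster pairwise distance."""
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(X), dtype=bool)
    intra = d[same & off_diag].mean()   # high intra-class similarity: small
    inter = d[~same].mean()             # low inter-class similarity: large
    return intra, inter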
Major Clustering Approaches
• Partitioning algorithms: Construct random partitions and then iteratively
refine them by some criterion
• Hierarchical algorithms: Create a hierarchical decomposition of the set of
data (or objects) using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
• Model-based: A model is hypothesized for each of the clusters and the
idea is to find the best fit of the data to the given model
Partitioning Algorithms: Basic Concept

• Partitioning method: Construct a partition of a database D of n
objects into a set of k clusters
– k-means (MacQueen’67): Each cluster is represented by the center of the
cluster (the center may or may not be an object of the cluster)
– k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects in the
cluster
K-means Clustering

• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple
K-Means Algorithm

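The algorithm listing from the original slide is not reproduced here; the
following is a minimal NumPy sketch of the basic loop described above
(function and parameter names are illustrative, not from the source):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic K-means: assign points to the nearest centroid, recompute means."""
    rng = np.random.default_rng(seed)
    # 1. Select k points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign each point to the cluster with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        # 4. Stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids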
K-means Clustering – Details
• Initial centroids are often chosen randomly.
– Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• ‘Closeness’ is measured by Euclidean distance, cosine similarity,
correlation, etc.
• Most of the convergence happens in the first few iterations.
– Often the stopping condition is changed to ‘Until relatively few points
change clusters’
• Complexity is O( n * K * I * d )
– n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
K-Means Example
• Given: {2,4,10,12,3,20,30,11,25}, k=2
• Randomly assign means: m1=3,m2=4
• K1={2,3}, K2={4,10,12,20,30,11,25},
m1=2.5,m2=16
• K1={2,3,4},K2={10,12,20,30,11,25}, m1=3,m2=18
• K1={2,3,4,10},K2={12,20,30,11,25},
m1=4.75,m2=19.6
• K1={2,3,4,10,11,12},K2={20,30,25}, m1=7,m2=25
• Stop as the clusters with these means are the
same.

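The trace above can be reproduced directly; a minimal sketch (NumPy; the
two initial means are fixed to 3 and 4 rather than chosen at random):

import numpy as np

X = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25], dtype=float)
m = np.array([3.0, 4.0])                    # initial means m1, m2

while True:
    # Assign each value to the cluster with the nearest mean
    labels = np.abs(X[:, None] - m[None, :]).argmin(axis=1)
    new_m = np.array([X[labels == j].mean() for j in range(2)])
    print("K1 =", sorted(X[labels == 0]), " K2 =", sorted(X[labels == 1]),
          " means =", new_m)
    if np.array_equal(new_m, m):            # stop when the means repeat
        break
    m = new_m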
Evaluating K-means Clusters
– For each point, the error is the distance to the nearest cluster centroid.
– To get the SSE (sum of squared errors), we square these errors and sum
them:

    SSE = Σ_{i=1..K} Σ_{x ∈ Ci} dist²(mi, x)

– x is a data point in cluster Ci and mi is the representative point for
cluster Ci
• one can show that mi corresponds to the center (mean) of the cluster
– Given two clusterings, we can choose the one with the smaller error
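A minimal sketch of the SSE computation (NumPy; X, labels, and centroids
as produced by the kmeans sketch above):

import numpy as np

def sse(X, labels, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    diffs = X - centroids[labels]     # per-point error vectors
    return float((diffs ** 2).sum())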
Limitations of K-means
• K-means has problems when clusters are of differing
– Sizes
– Densities
– Non-spherical shapes
• The resulting clusters depend on the initial centroids chosen.
• K-means has problems when the data contains outliers.
Limitations of K-means: Differing Sizes

(Figure: original points vs. the K-means result with 3 clusters.)


Limitations of K-means: Differing Density

(Figure: original points vs. the K-means result with 3 clusters.)


Limitations of K-means: Non-globular Shapes

(Figure: original points vs. the K-means result with 2 clusters.)


Importance of Choosing Initial Centroids
(Figure: snapshots of K-means over iterations 1-6 for one choice of
initial centroids, shown in the x-y plane.)
Importance of Choosing Initial Centroids …

(Figure: snapshots of K-means over iterations 1-5 for a different choice
of initial centroids, converging to a different final clustering.)
Solutions to Initial Centroids Problem

• Multiple runs
– For each run, compute SSE and consider the case with minimum
SSE.
• Sample and use hierarchical clustering to determine initial
centroids.
• Take the centroid of all the points. Then select the point farthest
from this initial centroid as another centroid.
• Select more than k initial centroids and then select among these
initial centroids
– Select most widely separated
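A minimal sketch of the farthest-point idea from the list above (NumPy;
this version repeats the farthest-point step until k centroids have been
chosen; names are illustrative):

import numpy as np

def farthest_point_init(X, k):
    """Pick initial centroids by repeatedly taking the farthest point."""
    centroids = [X.mean(axis=0)]      # start from the centroid of all points
    for _ in range(k - 1):
        # Distance from each point to its nearest already-chosen centroid
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])   # farthest point becomes a centroid
    return np.array(centroids)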
The K-Medoids Clustering Method
• Find representative objects, called medoids, in clusters
• PAM (Kaufmann & Rousseeuw 1987) [Partitioning Around
Medoids]
– starts from an initial set of medoids and iteratively replaces one of the
medoids by one of the non-medoids if it improves the total distance of
the resulting clustering
– PAM works effectively for small data sets, but does not scale well for
large data sets
• CLARA (Kaufmann & Rousseeuw, 1990) [Clustering LARge Applications]:
applies PAM to samples of the data
• CLARANS (Ng & Han, 1994): Randomized sampling [Clustering
Large Applications based upon RANdomized Search]
The K-Medoids Clustering Method

PAM (Kaufman and Rousseeuw, 1987)


• Arbitrarily select k objects as medoids.
• Assign each data object in the given data set to the most similar medoid.
• Randomly select a non-medoid object O’.
• Compute the total cost, S, of swapping a medoid with O’ (cost as the total
sum of absolute error).
• If S<0, then swap the medoid with the new one.
• Repeat until there is no change in the medoids.

(Each of the k medoids is compared pair-wise with each of the (n-k)
non-medoid instances.)



PAM Clustering: Total swapping cost TC_ih = Σ_j C_jih
• i is a current medoid, h is a non-selected object
• Assume that i is replaced by h in the set of
medoids
• TC_ih = 0
• For each non-selected object j ≠ h:
– TC_ih += d(j, new_med_j) - d(j, prev_med_j)
• new_med_j = the closest medoid to j after i is replaced by h
• prev_med_j = the closest medoid to j before i is replaced by h
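A minimal sketch of this computation (NumPy; D is a precomputed pairwise
distance matrix and medoids is the current list of medoid indices; the
function name is illustrative):

import numpy as np

def swap_cost(D, medoids, i, h):
    """Total cost TC_ih of replacing medoid i with non-medoid h."""
    new_medoids = [m for m in medoids if m != i] + [h]
    tc = 0.0
    for j in range(len(D)):
        if j in medoids or j == h:
            continue                  # consider non-selected objects only
        prev_med = min(medoids, key=lambda m: D[j, m])       # before the swap
        new_med = min(new_medoids, key=lambda m: D[j, m])    # after the swap
        tc += D[j, new_med] - D[j, prev_med]
    return tc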
PAM Clustering: Total swapping cost TC_ih = Σ_j C_jih

(Figure: the four reassignment cases for a non-selected object j when
medoid i is swapped with non-medoid h; t denotes another current medoid.)
• j is reassigned from i to h:   C_jih = d(j, h) - d(j, i)
• j stays with its medoid t:     C_jih = 0
• j is reassigned from i to t:   C_jih = d(j, t) - d(j, i)
• j is reassigned from t to h:   C_jih = d(j, h) - d(j, t)

PAM
• Partitioning Around Medoids (PAM) (K-
Medoids)
• Handles outliers well.
• Ordering of input does not impact results.
• Does not scale well.
• Each cluster represented by one item, called
the medoid.
• Initial set of k medoids randomly chosen.

PAM Cost Calculation
• At each step in algorithm, medoids are changed if
the overall cost is improved.
• Cjih – cost change for an item tj associated with
swapping medoid ti with non-medoid th.

PAM Algorithm

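The algorithm listing from the original slide is not reproduced here; the
following is a minimal sketch of the PAM loop described above (NumPy; it
reuses the swap_cost sketch from earlier, and names are illustrative):

import numpy as np

def pam(D, k, seed=0):
    """PAM: keep swapping medoids with non-medoids while the cost improves."""
    rng = np.random.default_rng(seed)
    n = len(D)
    medoids = list(rng.choice(n, size=k, replace=False))  # arbitrary initial medoids
    improved = True
    while improved:                     # repeat until no change in the medoids
        improved = False
        for i in list(medoids):
            for h in range(n):
                if h in medoids:
                    continue
                if swap_cost(D, medoids, i, h) < 0:   # S < 0: the swap improves
                    medoids[medoids.index(i)] = h
                    improved = True
                    break
            if improved:
                break
    # Assign each object to its most similar (closest) medoid
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels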
