
Clustering

Sridhar S
Department of IST
Anna University

Clustering
Definition

Clustering is the process of organizing objects into groups whose members are similar in some way.
A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters.
In data mining, clustering is also known as class discovery.

Why Clustering?

Learning from data: unsupervised learning
What does it do?
Pattern detection
Simplification
Concept construction

Supervised vs. Unsupervised Classification

Supervised: a set of classes (clusters) is given; assign each new pattern (point) to the proper cluster and label it with the label of its cluster.
Examples: classifying bacteria to select the proper antibiotic, assigning a signature to a book and placing it on the proper shelf.
Unsupervised: for a given set of patterns (the training set), discover a set of clusters and assign additional patterns to the proper cluster.
Examples: buying behavior, stock groups, closed groups of researchers citing each other, and more.

Examples

Clustering of documents or research groups by citation index, evolution of papers.
Problem: find the minimal set of papers describing the essential ideas.
Clustering of items from excavations according to cultural epochs.
Clustering of tooth fragments in anthropology.
Carl von Linné: Systema Naturae (1735), botany, later zoology.
etc.

The Use of Clustering

Data Mining
Information Retrieval
Text Mining
Web Analysis
Marketing
Medical Diagnostics
Image Analysis
Bioinformatics

Image Analysis: MST Clustering

[Figure: an image file before/after color clustering using HEMST (hierarchical EMST clustering algorithm) and SEMST (standard EMST clustering algorithm) - omitted.]

Overview of Clustering

Feature Selection: identifying the most effective subset of the original features to use in clustering.
Feature Extraction: transformations of the input features to produce new salient features.
Inter-pattern Similarity: measured by a distance function defined on pairs of patterns.
Grouping: methods to group similar patterns into the same cluster.

Conducting Cluster Analysis


Formulate the Problem
Select a Distance Measure
Select a Clustering Procedure
Decide on the Number of Clusters
Interpret and Profile Clusters
Assess the Validity of Clustering

Formulating the Problem

Most important is selecting the variables on which the clustering is based.
Inclusion of even one or two irrelevant variables may distort a clustering solution.

Select a Similarity Measure

A similarity measure can be based on correlations or distances.
The most commonly used measure of similarity is the Euclidean distance. The city-block distance is also used.
If variables are measured in vastly different units, the data must be standardized. Outliers should also be eliminated.
Use of different similarity/distance measures may lead to different clustering results. Hence, it is advisable to use different measures and compare the results.
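As a minimal illustration of why standardization matters before computing distances (a sketch, assuming NumPy; the variables and toy values are invented for the demo):

import numpy as np

def standardize(X):
    # Rescale each variable to zero mean and unit variance (z-scores),
    # so no single variable dominates the distance computation.
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Two variables on very different scales: income (thousands) and age (years).
X = np.array([[120.0, 25.0],
              [30.0, 60.0],
              [125.0, 30.0]])

Z = standardize(X)
# Before standardization the income column dominates the distance;
# afterwards both variables contribute comparably.
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(Z[0] - Z[1]))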

Distance Measures

The Euclidean distance between p = (x, y) and q = (s, t) is defined as:

De(p, q) = [(x - s)^2 + (y - t)^2]^(1/2)

D4 Distance

The D4 distance (also called city-block distance) between p and q is defined as:

D4(p, q) = |x - s| + |y - t|

D8 Distance

The D8 distance (also called chessboard distance) between p and q is defined as:

D8(p, q) = max(|x - s|, |y - t|)

Example
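The three distance measures can be compared directly in code; a minimal sketch (the pixel pair is borrowed from the distance tables in the linkage sections below):

import math

def d_euclidean(p, q):
    (x, y), (s, t) = p, q
    return math.sqrt((x - s) ** 2 + (y - t) ** 2)

def d4(p, q):
    # city-block distance
    (x, y), (s, t) = p, q
    return abs(x - s) + abs(y - t)

def d8(p, q):
    # chessboard distance
    (x, y), (s, t) = p, q
    return max(abs(x - s), abs(y - t))

p, q = (4, 4), (15, 8)
print(round(d_euclidean(p, q), 1))  # 11.7, as in the tables below
print(d4(p, q))                     # 15
print(d8(p, q))                     # 11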

Metric Distances

What properties should a similarity distance have?

D(A,B) = D(B,A)                    Symmetry
D(A,A) = 0                         Constancy of self-similarity
D(A,B) >= 0                        Positivity
D(A,B) <= D(A,C) + D(B,C)          Triangle inequality

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree.
Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.

[Figure: an example dendrogram with merge heights on the vertical axis - omitted.]

Hierarchical Clustering

Two main types of hierarchical clustering:

Agglomerative:
Start with the points as individual clusters.
At each step, merge the closest pair of clusters, until only one cluster (or k clusters) is left.

Divisive:
Start with one, all-inclusive cluster.
At each step, split a cluster, until each cluster contains a single point (or there are k clusters).

Single Linkage Algorithm

Distance between two clusters:
The single-link distance between clusters Ci and Cj is the minimum distance between any object in Ci and any object in Cj.
The distance is defined by the two most similar objects.

Dsl(Ci, Cj) = min{ d(x, y) : x in Ci, y in Cj }

Initial Table

Pixel    Coordinates
1        (4, 4)
2        (8, 4)
3        (15, 8)
4        (24, 4)
5        (24, 12)

First Iteration

       1      2      3      4      5
1      -      4.0    11.7   20.0   21.5
2      4.0    -      8.1    16.0   17.9
3      11.7   8.1    -      9.8    9.8
4      20.0   16.0   9.8    -      8.0
5      21.5   17.9   9.8    8.0    -

The smallest distance is 4.0, so objects 1 and 2 should be clustered.

Second Iteration

        {1,2}   3      4      5
{1,2}   -       8.1    16.0   17.9
3       8.1     -      9.8    9.8
4       16.0    9.8    -      8.0
5       17.9    9.8    8.0    -

The smallest distance is 8.0, so objects 4 and 5 should be clustered.

Third Iteration

        {1,2}   3      {4,5}
{1,2}   -       8.1    16.0
3       8.1     -      9.8
{4,5}   16.0    9.8    -

The smallest distance is 8.1, so clusters {1,2} and {3} are merged.

Dendrogram

[Figure: the single-linkage dendrogram for this example - omitted.]
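The merge sequence above can be reproduced with SciPy's hierarchical clustering routines; a sketch, assuming scipy is installed (changing method to 'complete' or 'average' reproduces the two sections that follow):

import numpy as np
from scipy.cluster.hierarchy import linkage

# The five pixels from the worked example.
pixels = np.array([(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)], dtype=float)

# Single-link agglomerative clustering.
Z = linkage(pixels, method='single', metric='euclidean')
print(Z)  # each row: the two clusters merged, their distance, new cluster size
# The rows show merges at distances 4.0, 8.0, 8.1 - the same order as the
# iterations above; scipy.cluster.hierarchy.dendrogram(Z) draws the tree.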

Complete Linkage Algorithm

The complete-link distance between clusters Ci and Cj is the maximum distance between any object in Ci and any object in Cj:

Dcl(Ci, Cj) = max{ d(x, y) : x in Ci, y in Cj }

Initial Table

Pixel    Coordinates
1        (4, 4)
2        (8, 4)
3        (15, 8)
4        (24, 4)
5        (24, 12)

First Iteration

       1      2      3      4      5
1      -      4.0    11.7   20.0   21.5
2      4.0    -      8.1    16.0   17.9
3      11.7   8.1    -      9.8    9.8
4      20.0   16.0   9.8    -      8.0
5      21.5   17.9   9.8    8.0    -

The smallest distance is 4.0, so objects 1 and 2 should be clustered.

Second Iteration

        {1,2}   3      4      5
{1,2}   -       11.7   20.0   21.5
3       11.7    -      9.8    9.8
4       20.0    9.8    -      8.0
5       21.5    9.8    8.0    -

The smallest distance is 8.0, so objects 4 and 5 should be clustered.

Third Iteration

        {1,2}   3      {4,5}
{1,2}   -       11.7   21.5
3       11.7    -      9.8
{4,5}   21.5    9.8    -

The smallest distance is 9.8, so clusters {3} and {4,5} are merged.

Dendrogram

[Figure: the complete-linkage dendrogram for this example - omitted.]

Average Linkage Algorithm

Distance between two clusters:
The group-average distance between clusters Ci and Cj is the average distance between any object in Ci and any object in Cj:

Davg(Ci, Cj) = (1 / (|Ci| * |Cj|)) * sum over x in Ci, y in Cj of d(x, y)

Initial Table

Pixel    Coordinates
1        (4, 4)
2        (8, 4)
3        (15, 8)
4        (24, 4)
5        (24, 12)

First Iteration

       1      2      3      4      5
1      -      4.0    11.7   20.0   21.5
2      4.0    -      8.1    16.0   17.9
3      11.7   8.1    -      9.8    9.8
4      20.0   16.0   9.8    -      8.0
5      21.5   17.9   9.8    8.0    -

The smallest distance is 4.0, so objects 1 and 2 should be clustered.

Second Iteration

        {1,2}   3      4      5
{1,2}   -       9.9    18.0   19.7
3       9.9     -      9.8    9.8
4       18.0    9.8    -      8.0
5       19.7    9.8    8.0    -

The smallest distance is 8.0, so objects 4 and 5 should be clustered.

Third Iteration

        {1,2}   3      {4,5}
{1,2}   -       9.9    18.9
3       9.9     -      9.8
{4,5}   18.9    9.8    -

The smallest distance is 9.8, so clusters {3} and {4,5} are merged.

Dendrogram

[Figure: the average-linkage dendrogram for this example - omitted.]

Hierarchical Clustering: Group Average

Compromise between single and complete link.

Strengths:
Less susceptible to noise and outliers.

Limitations:
Biased towards globular clusters.

Ward's Algorithm

Cluster Similarity: Ward's Method
Based on minimum variance.
For example, the mean vector of (4,4) and (8,4) is ((4+8)/2, (4+4)/2) = (6, 4).
The squared error of merging objects 1 and 2 is
(4-6)^2 + (4-4)^2 + (8-6)^2 + (4-4)^2 = 8
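A small sketch of this squared-error computation (assuming NumPy; squared_error is a hypothetical helper name):

import numpy as np

def squared_error(points):
    # Sum of squared distances of the points to their centroid,
    # i.e. the quantity Ward's method tries to keep small when merging.
    pts = np.asarray(points, dtype=float)
    return float(((pts - pts.mean(axis=0)) ** 2).sum())

print(squared_error([(4, 4), (8, 4)]))    # 8.0: merging objects 1 and 2
print(squared_error([(4, 4), (24, 12)]))  # 232.0: the {1,5} merge in the table below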

First Iteration: For 4 Clusters

Clusters             Squared Error
{1,2},{3},{4},{5}    8.0
{1,3},{2},{4},{5}    68.5
{1,4},{2},{3},{5}    200
{1,5},{2},{3},{4}    232
{2,3},{1},{4},{5}    32.5
{2,4},{1},{3},{5}    128
{2,5},{1},{3},{4}    160
{3,4},{1},{2},{5}    48.5
{3,5},{1},{2},{4}    48.5
{4,5},{1},{2},{3}    32.0

The merge {1,2} gives the smallest squared error, 8.0.

Second Iteration: For 3 Clusters

Clusters           Squared Error
{1,2,3},{4},{5}    72.7
{1,2,4},{3},{5}    224
{1,2,5},{3},{4}    266.7
{1,2},{3,4},{5}    56.5
{1,2},{3,5},{4}    56.5
{1,2},{4,5},{3}    40

The partition {1,2},{4,5},{3} gives the smallest squared error, 40.

Third Iteration: For 2 Clusters

Clusters          Squared Error
{1,2,3},{4,5}     104.7
{1,2,4,5},{3}     380.0
{1,2},{3,4,5}     94

The partition {1,2},{3,4,5} gives the smallest squared error, 94.

Dendrogram

[Figure: the Ward's-method dendrogram for this example - omitted.]

Hierarchical Clustering: Time and Space Requirements

O(N^2) space, since it uses the proximity matrix (N is the number of points).
O(N^3) time in many cases: there are N steps, and at each step the N^2 proximity matrix must be updated and searched.
Complexity can be reduced to O(N^2 log N) time for some approaches.

Overall Assessment

Once a decision is made to combine two clusters, it cannot be undone.
No objective function is directly minimized.
Different schemes have problems with one or more of the following:
Sensitivity to noise and outliers
Difficulty handling different-sized clusters and convex shapes
Breaking large clusters

Divisive Hierarchical Algorithms

Spanning Tree

A spanning tree of a graph is a tree and a subgraph that contains all the vertices.
A graph may have many spanning trees; for example, the complete graph on four vertices has sixteen spanning trees:

Spanning Tree (cont.)

[Figure: the sixteen spanning trees of the complete graph on four vertices - omitted.]

Minimum Spanning Trees

Suppose that the edges of the graph have weights or lengths. The weight of a tree is the sum of the weights of its edges.
Based on the example, we can see that different trees have different lengths.
The question is: how do we find the minimum-length spanning tree?

Minimum Spanning Trees

This question can be solved by many different algorithms; here are two classical minimum spanning tree algorithms:

Kruskal's Algorithm
Prim's Algorithm

Kruskal's Algorithm

Joseph Bernard Kruskal, Jr.

Kruskal's approach: select the minimum-weight edge that does not form a cycle.

Kruskal's Algorithm:
sort the edges of G in increasing order by length
keep a subgraph S of G, initially empty
for each edge e in sorted order:
    if the endpoints of e are disconnected in S:
        add e to S
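A compact runnable sketch of this procedure with a union-find structure (the example graph and its weights are invented for illustration):

def kruskal(n, edges):
    # edges: list of (weight, u, v); vertices are 0 .. n-1
    parent = list(range(n))

    def find(v):
        # Find the representative of v's component, with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    mst = []
    for w, u, v in sorted(edges):          # increasing order by length
        ru, rv = find(u), find(v)
        if ru != rv:                       # endpoints disconnected in S
            parent[ru] = rv                # union the two components
            mst.append((w, u, v))
    return mst

edges = [(4, 0, 1), (8, 0, 2), (2, 1, 2), (6, 1, 3), (3, 2, 3), (9, 3, 4), (7, 2, 4)]
print(kruskal(5, edges))  # [(2, 1, 2), (3, 2, 3), (4, 0, 1), (7, 2, 4)]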

Kruskal's Algorithm - Example

[Figures: step-by-step construction of the MST by Kruskal's algorithm - omitted.]

Prim's Algorithm

Robert Clay Prim

Prim's approach:
Choose an arbitrary start node v.
At any point in time, we have a connected component N containing v and the other nodes V-N.
Choose the minimum-weight edge from N to V-N.

Prim's Algorithm:
let T be a single vertex x
while (T has fewer than n vertices)
{
    find the smallest edge connecting T to G-T
    add it to T
}
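A matching sketch of Prim's algorithm with a priority queue (same invented graph as in the Kruskal sketch, given as an adjacency list):

import heapq

def prim(adj, start=0):
    # Grow the tree T from start, always adding the smallest edge
    # that connects T to a vertex outside T.
    in_tree = {start}
    heap = [(w, start, v) for v, w in adj[start]]
    heapq.heapify(heap)
    mst = []
    while heap and len(in_tree) < len(adj):
        w, u, v = heapq.heappop(heap)
        if v in in_tree:
            continue                       # edge no longer crosses the cut
        in_tree.add(v)
        mst.append((w, u, v))
        for x, wx in adj[v]:
            if x not in in_tree:
                heapq.heappush(heap, (wx, v, x))
    return mst

adj = {0: [(1, 4), (2, 8)],
       1: [(0, 4), (2, 2), (3, 6)],
       2: [(0, 8), (1, 2), (3, 3), (4, 7)],
       3: [(1, 6), (2, 3), (4, 9)],
       4: [(2, 7), (3, 9)]}
print(prim(adj))  # the same tree Kruskal finds, built edge by edge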

Prim's Algorithm - Example

[Figures: nine snapshots of Prim's algorithm growing the tree one minimum-weight edge at a time, on a graph with edge weights 1, 2, 8, 9, 10, 11, 12, 13, 14, 20, 40, and 50 - omitted.]

MST: Divisive Hierarchical Clustering

Build an MST (Minimum Spanning Tree):
Start with a tree that consists of any point.
In successive steps, look for the closest pair of points (p, q) such that one point (p) is in the current tree but the other (q) is not.
Add q to the tree and put an edge between p and q.

MST: Divisive Hierarchical Clustering

Use the MST for constructing a hierarchy of clusters.
Partitional Clustering

Introduction

Partitional Clustering:
Forgy's Algorithm
k-means Algorithm
Isodata Algorithm

Partitional Clustering

Forgy's Algorithm
Uses k samples called seed points, where k is the number of clusters to be constructed.

Partitional Clustering

Forgy's Algorithm
1. Initialize the cluster centroids to the seed points.
2. For each sample, find the cluster centroid nearest to it. Put the sample in the cluster identified with this nearest cluster centroid.
3. If no sample changed clusters in step 2, stop.
4. Compute the centroids of the resulting clusters and go to step 2.
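A direct sketch of these four steps (assuming NumPy; it also assumes no cluster ever goes empty, which holds for the example data):

import numpy as np

def forgy(samples, seeds):
    X = np.asarray(samples, dtype=float)
    centroids = np.asarray(seeds, dtype=float)
    assignment = None
    while True:
        # Step 2: assign each sample to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        # Step 3: stop when no sample changed clusters.
        if assignment is not None and (new_assignment == assignment).all():
            return centroids, new_assignment
        assignment = new_assignment
        # Step 4: recompute the centroids of the resulting clusters.
        centroids = np.array([X[assignment == j].mean(axis=0)
                              for j in range(len(centroids))])

samples = [(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)]
print(forgy(samples, seeds=[(4, 4), (8, 4)]))
# converges to centroids (6, 4) and (21, 8), as in the iteration tables below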

Partitional Clustering

Forgy's Algorithm: Initialization
The seed points are (4,4) and (8,4), so k = 2.

Sample     Nearest Cluster Centroid
(4,4)      (4,4)
(8,4)      (8,4)
(15,8)     -
(24,4)     -
(24,12)    -

Partitional Clustering

Forgy's Algorithm: First Iteration

Sample     Nearest Cluster Centroid
(4,4)      (4,4)
(8,4)      (8,4)
(15,8)     (8,4)
(24,4)     (8,4)
(24,12)    (8,4)

Partitional Clustering

Forgy's Algorithm: Second Iteration

Sample     Nearest Cluster Centroid
(4,4)      (4,4)
(8,4)      (4,4)
(15,8)     (17.75, 7)
(24,4)     (17.75, 7)
(24,12)    (17.75, 7)

Partitional Clustering

Forgy's Algorithm: Third Iteration

Sample     Nearest Cluster Centroid
(4,4)      (6,4)
(8,4)      (6,4)
(15,8)     (21, 8)
(24,4)     (21, 8)
(24,12)    (21, 8)

Partitional Clustering

Forgy's Algorithm
It has been proved that Forgy's algorithm terminates: eventually no sample changes clusters.
If the number of samples is large, it may take the algorithm considerable time to produce stable clusters.

k-means Algorithm

Partitional Clustering:
Forgy's Algorithm
k-means Algorithm
Isodata Algorithm

k-means Algorithm

Similar to Forgy's algorithm; k is the number of clusters to be constructed.
Differences:
The centroids of the clusters are recomputed as soon as a sample joins a cluster.
It makes only two passes through the data set, whereas Forgy's algorithm is iterative.

k-means Algorithm

Begin with k clusters, each consisting of one of the first k samples. For each of the remaining n-k samples, find the centroid nearest to it and put the sample in the cluster identified with this nearest centroid. After each sample is assigned, recompute the centroid of the altered cluster.
Go through the data a second time. For each sample, find the centroid nearest to it and put the sample in the cluster identified with that nearest centroid. (During this step, do not recompute any centroid.)

k-means Algorithm

Set k = 2 and assume that the data are ordered so that the first two samples are (8,4) and (24,4).

Distances for use by the k-means algorithm

Sample     Distance to Centroid (9, 5.3)    Distance to Centroid (24, 8)
(8,4)      1.6                              16.5
(24,4)     15.1                             4.0
(15,8)     6.6                              9.0
(4,4)      5.2                              20.4
(24,12)    16.4                             4.0
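A sketch of this two-pass variant (assuming NumPy; kmeans_two_pass is a hypothetical name), run on the ordering used above; it reproduces the centroids (9, 5.3) and (24, 8) from the table:

import numpy as np

def kmeans_two_pass(samples, k):
    X = [np.asarray(s, dtype=float) for s in samples]
    centroids = [X[i].copy() for i in range(k)]   # the first k samples seed the clusters
    counts = [1] * k
    # Pass 1: assign each remaining sample, updating the altered centroid at once.
    for x in X[k:]:
        j = min(range(k), key=lambda c: np.linalg.norm(x - centroids[c]))
        centroids[j] = (centroids[j] * counts[j] + x) / (counts[j] + 1)
        counts[j] += 1
    # Pass 2: final assignment with the centroids held fixed.
    labels = [min(range(k), key=lambda c: np.linalg.norm(x - centroids[c]))
              for x in X]
    return centroids, labels

data = [(8, 4), (24, 4), (15, 8), (4, 4), (24, 12)]
centroids, labels = kmeans_two_pass(data, k=2)
print(centroids)  # approximately [9.0, 5.33] and [24.0, 8.0]
print(labels)     # [0, 1, 0, 0, 1]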

Isodata Algorithm

Partitional Clustering:
Forgy's Algorithm
k-means Algorithm
Isodata Algorithm

Isodata Algorithm

An enhancement of the approach taken by Forgy's algorithm and the k-means algorithm:
k is allowed to range over an interval.
Discard clusters with too few elements.
Merge clusters when they are too close together or when there are too many clusters.
Split clusters when there are too few clusters or when a cluster contains very dissimilar samples.

Model-Based Clustering

Assume the data are generated from k probability distributions.
Goal: find the distribution parameters.
Algorithm: Expectation Maximization (EM).
Output: distribution parameters and a soft assignment of points to clusters.

Model-Based Clustering

Assume k probability distributions with parameters (θ1, ..., θk).
Given data X, compute (θ1, ..., θk) such that Pr(X | θ1, ..., θk) [likelihood] or ln Pr(X | θ1, ..., θk) [log-likelihood] is maximized.
Every point x in X need not be generated by a single distribution; it can be generated by multiple distributions with some probability [soft clustering].

EM Algorithm

Initialize k distribution parameters (θ1, ..., θk); each distribution parameter corresponds to a cluster center.
Iterate between two steps:
Expectation step: (probabilistically) assign points to clusters.
Maximization step: estimate model parameters that maximize the likelihood for the given assignment of points.

EM Algorithm

Initialize k cluster centers.
Iterate between two steps:

Expectation step: assign points to clusters.

Pr(xi in Ck) = wk * Pr(xi | Ck) / sum over j of wj * Pr(xi | Cj)

Maximization step: estimate model parameters.

rk = (1/n) * sum(i = 1 to n) of [ Pr(xi in Ck) / sum over j of Pr(xi in Cj) ]
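A minimal one-dimensional sketch of this loop for a mixture of k Gaussians (assuming NumPy; the initialization and the synthetic data are invented for the demo):

import numpy as np

def em_gmm_1d(x, k=2, iters=100, seed=0):
    x = np.asarray(x, dtype=float)
    n = len(x)
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)   # initialize centers from the data
    sigma = np.full(k, x.std())
    w = np.full(k, 1.0 / k)                     # mixing weights
    for _ in range(iters):
        # E-step: soft assignment, proportional to w_k * Pr(x_i | C_k).
        dens = (w * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2)
                / (sigma * np.sqrt(2 * np.pi)))
        resp = dens / dens.sum(axis=1, keepdims=True)   # Pr(x_i in C_k)
        # M-step: parameters that maximize the likelihood for this assignment.
        nk = resp.sum(axis=0)
        w = nk / n
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return w, mu, sigma

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(6, 1, 200)])
print(em_gmm_1d(data))  # weights near 0.5/0.5, means near 0 and 6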

Validity of Clusters

Why validity of clusters?
Given some data, any clustering algorithm generates clusters.
So we need to make sure the clustering results are valid and meaningful.

Measuring the validity of clustering results usually involves:
Optimality of clusters
Verification of the biological meaning of clusters

Optimality of Clusters

Optimal clusters should:
minimize distance within clusters (intra-cluster)
maximize distance between clusters (inter-cluster)

Example of an intra-cluster measure: the squared error se,

se = sum(i = 1 to k) of sum(p in ci) of |p - mi|^2

where mi is the mean of all instances in cluster ci.
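A short sketch of this measure (assuming NumPy), evaluated on the five-pixel example with the {1,2,3} / {4,5} partition:

import numpy as np

def total_squared_error(X, labels):
    # se = sum over clusters of squared distances to the cluster mean.
    X = np.asarray(X, dtype=float)
    se = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        se += ((pts - pts.mean(axis=0)) ** 2).sum()
    return se

pixels = np.array([(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)], dtype=float)
labels = np.array([0, 0, 0, 1, 1])
print(round(total_squared_error(pixels, labels), 1))  # 104.7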

References

A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys, 31(3), pp. 264-323, 1999.
T. R. Golub et al., "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring," Science, 286(5439), pp. 531-537, 1999.
A. P. Gasch and M. B. Eisen, "Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering," Genome Biol., 3, 122, 2002.
M. Eisen et al., "Cluster Analysis and Display of Genome-Wide Expression Patterns," Proc Natl Acad Sci USA, 95, pp. 14863-14868, 1998.
