Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
May 3, 2023 Data Mining: Concepts and Techniques 1
What is Cluster Analysis?
- Cluster: a collection of data objects
  - Similar to one another within the same cluster
  - Dissimilar to the objects in other clusters
- Cluster analysis
  - Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
- Unsupervised learning: no predefined classes
- Typical applications
  - As a stand-alone tool to get insight into data distribution
  - As a preprocessing step for other algorithms
Clustering: Rich Applications and Multidisciplinary Efforts
- Pattern recognition
- Spatial data analysis
  - Create thematic maps in GIS by clustering feature spaces
  - Detect spatial clusters, or serve other spatial mining tasks
- Image processing
- Economic science (especially market research)
- WWW
  - Document classification
  - Cluster Weblog data to discover groups of similar access patterns
Examples of Clustering Applications
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Land use: identification of areas of similar land use in an earth observation database
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continental faults
Quality: What Is Good Clustering?
- Scalability
- Ability to deal with different types of attributes
- Ability to handle dynamic data
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- Ability to handle high dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability
Data Structures
- Data matrix (two modes): n objects described by p variables

      [ x_11  ...  x_1f  ...  x_1p ]
      [ ...   ...  ...   ...  ...  ]
      [ x_i1  ...  x_if  ...  x_ip ]
      [ ...   ...  ...   ...  ...  ]
      [ x_n1  ...  x_nf  ...  x_np ]

- Dissimilarity matrix (one mode): pairwise distances d(i, j)

      [ 0                            ]
      [ d(2,1)  0                    ]
      [ d(3,1)  d(3,2)  0            ]
      [ :       :       :            ]
      [ d(n,1)  d(n,2)  ...  ...  0  ]
- Interval-scaled variables
- Binary variables
- Nominal, ordinal, and ratio variables
- Variables of mixed types
- Standardize data
  - Calculate the mean absolute deviation, where m_f is the mean of variable f:

        s_f = (1/n) (|x_1f − m_f| + |x_2f − m_f| + … + |x_nf − m_f|)
- Minkowski distance: d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q)^(1/q)
- If q = 2, d is the Euclidean distance:

      d(i, j) = sqrt(|x_i1 − x_j1|² + |x_i2 − x_j2|² + … + |x_ip − x_jp|²)
- Properties
  - d(i, j) ≥ 0
  - d(i, i) = 0
  - d(i, j) = d(j, i)
  - d(i, j) ≤ d(i, k) + d(k, j)
- Also, one can use weighted distance, parametric Pearson product-moment correlation, or other dissimilarity measures
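A minimal sketch of these measures (the function names are illustrative, not from the text):

```python
def mean_absolute_deviation(values):
    # s_f = (1/n) * (|x_1f - m_f| + ... + |x_nf - m_f|), m_f the mean of variable f
    m = sum(values) / len(values)
    return sum(abs(v - m) for v in values) / len(values)

def minkowski(x, y, q=2):
    # Minkowski distance between two objects; q = 2 gives the Euclidean distance
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)
```

Standardizing a variable then amounts to dividing each centered value x_if − m_f by s_f.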
- f is ordinal or ratio-scaled
- Partitioning approach:
  - Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  - Typical methods: k-means, k-medoids, CLARANS
- Hierarchical approach:
  - Create a hierarchical decomposition of the set of data (or objects) using some criterion
  - Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
- Density-based approach:
  - Based on connectivity and density functions
  - Typical methods: DBSCAN, OPTICS, DenClue
- Centroid: the "middle" of a cluster

      C_m = (Σ_{i=1}^{N} t_i) / N

- Radius: square root of the average squared distance from any point of the cluster to its centroid

      R_m = sqrt( Σ_{i=1}^{N} (t_i − C_m)² / N )

- Diameter: square root of the average squared distance between all pairs of points in the cluster

      D_m = sqrt( Σ_{i=1}^{N} Σ_{j=1}^{N} (t_i − t_j)² / (N (N − 1)) )
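A small sketch of the centroid, radius, and diameter computations, assuming points are given as equal-length numeric tuples (names are illustrative):

```python
import math

def centroid(points):
    n, dim = len(points), len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dim)]

def radius(points):
    # R_m: sqrt of the average squared distance from the points to the centroid
    c, n = centroid(points), len(points)
    return math.sqrt(sum((p[d] - c[d]) ** 2
                         for p in points for d in range(len(c))) / n)

def diameter(points):
    # D_m: sqrt of the average squared distance over all N(N-1) ordered pairs
    n, dim = len(points), len(points[0])
    total = sum((points[i][d] - points[j][d]) ** 2
                for i in range(n) for j in range(n) if i != j
                for d in range(dim))
    return math.sqrt(total / (n * (n - 1)))
```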
- The sum of squared errors for a partition into clusters K_1, …, K_k with centers C_1, …, C_k:

      E = Σ_{m=1}^{k} Σ_{t_mi ∈ K_m} (C_m − t_mi)²

- Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
  - Global optimum: exhaustively enumerate all partitions
  - Heuristic methods: k-means and k-medoids algorithms
  - k-means (MacQueen'67): each cluster is represented by the center of the cluster
  - k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster
- Example (K = 2): arbitrarily choose K objects as the initial cluster means; assign each object to the most similar center; update the cluster means; reassign and repeat until no reassignment occurs
  [Figure: scatter plots of the successive assign / update-the-means / reassign steps of k-means.]
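The assign/update iteration above could be implemented along these lines (an illustrative sketch, not the book's code; the random initialization and the empty-cluster handling are assumptions):

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Assign each object to the most similar center, update the cluster
    means, and repeat until no reassignment changes the means."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]   # arbitrary initial means
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                                 # assignment step
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        new_centers = [[sum(p[d] for p in cl) / len(cl) for d in range(len(points[0]))]
                       if cl else centers[i]             # keep an empty cluster's mean
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:                       # no change: converged
            break
        centers = new_centers
    return centers, clusters
```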
- Dissimilarity calculations
  [Figure: scatter plots of the objects used in the dissimilarity calculations.]
- Example (total cost = 20): arbitrarily choose k objects as the initial medoids, then assign each remaining object to the nearest medoid
  [Figure: scatter plots of the initial medoid choice and the first assignment.]
- Do loop, until no change: select a non-medoid object O_random, compute the total cost of swapping a medoid with O_random, and perform the swap if the quality is improved
  [Figure: scatter plots of the swapping step.]
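A naive sketch of the PAM loop above, swapping a medoid for a non-medoid whenever the swap lowers the total cost (illustrative; real PAM computes swap costs incrementally rather than from scratch):

```python
def total_cost(points, medoids, dist):
    # sum over all objects of the distance to the nearest medoid
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k, dist):
    medoids = list(points[:k])                    # arbitrary initial medoids
    improved = True
    while improved:                               # "do loop until no change"
        improved = False
        for m in list(medoids):
            for h in points:
                if h in medoids:
                    continue
                candidate = [h if x == m else x for x in medoids]
                if total_cost(points, candidate, dist) < total_cost(points, medoids, dist):
                    medoids, improved = candidate, True   # swap improves quality
                    break
            if improved:
                break
    return medoids
```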
  [Figure: the four cases of the reassignment cost for an object j when a current medoid i is swapped with a non-medoid h, with t another medoid.]
- Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8), the clustering feature is CF = (5, (16,30), (54,190)): the count, the linear sum, and the squared sum of the points
  [Figure: scatter plot of the five points.]
- Clustering feature:
  - Summary of the statistics for a given subcluster: the 0th, 1st, and 2nd moments of the subcluster from the statistical point of view
  - Registers crucial measurements for computing clusters and utilizes storage efficiently
- A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
  - A nonleaf node in a tree has descendants or "children"
  - The nonleaf nodes store sums of the CFs of their children
- A CF tree has two parameters
  - Branching factor: specifies the maximum number of children
  - Threshold: maximum diameter of subclusters stored at the leaf nodes
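The additive (N, LS, SS) summary can be sketched as follows (an illustration; merging two CFs is exactly how a nonleaf entry summarizes its children):

```python
class CF:
    """BIRCH clustering feature (N, LS, SS): the 0th, 1st, and 2nd
    statistical moments of the points in a subcluster."""
    def __init__(self, dim):
        self.n = 0                 # number of points (0th moment)
        self.ls = [0.0] * dim      # per-dimension linear sum (1st moment)
        self.ss = [0.0] * dim      # per-dimension squared sum (2nd moment)

    def add(self, point):
        self.n += 1
        for d, v in enumerate(point):
            self.ls[d] += v
            self.ss[d] += v * v

    def merge(self, other):
        # CFs are additive: a nonleaf entry is the sum of its children's CFs
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b for a, b in zip(self.ss, other.ss)]
```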
The CF Tree Structure
- Branching factor B = 7, leaf capacity L = 6
  [Figure: root node with entries CF1, CF2, CF3, …, CF6 and child pointers child1 … child6; each non-leaf node with entries CF1, CF2, CF3, …, CF5 and child pointers child1 … child5.]
ROCK: Clustering Categorical Data
- Major ideas
  - Use links to measure similarity/proximity
  - Not distance-based
  - Computational complexity: O(n² + n·m_m·m_a + n² log n), with m_m the maximum and m_a the average number of neighbours
- Algorithm: sampling-based clustering
  - Draw random sample
- Experiments
  - Congressional voting, mushroom data
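One way to compute links, assuming neighbours are defined by a Jaccard-similarity threshold theta (an assumption for illustration; the text does not fix the similarity measure):

```python
def jaccard(t1, t2):
    # similarity between two transactions (sets of items)
    return len(t1 & t2) / len(t1 | t2)

def links(transactions, theta):
    # link(i, j): number of common neighbours, where q is a neighbour of r
    # when sim(q, r) >= theta
    n = len(transactions)
    nbrs = [{j for j in range(n)
             if j != i and jaccard(transactions[i], transactions[j]) >= theta}
            for i in range(n)]
    return {(i, j): len(nbrs[i] & nbrs[j])
            for i in range(n) for j in range(i + 1, n)}
```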
  [Figure: CHAMELEON's overall framework: data set, construct sparse graph, partition the graph, merge partitions, final clusters.]
- Handle noise
- One scan
Density-Based Clustering: Basic Concepts
- Two parameters:
  - Eps: maximum radius of the neighbourhood
  - MinPts: minimum number of points in an Eps-neighbourhood of that point
- N_Eps(p): {q belongs to D | dist(p, q) <= Eps}
- Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
  - p belongs to N_Eps(q)
  - core point condition: |N_Eps(q)| >= MinPts
  [Figure: q a core point with p in its Eps-neighbourhood; MinPts = 5, Eps = 1 cm.]
- Density-reachable:
  - A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn with p1 = q and pn = p such that p_{i+1} is directly density-reachable from p_i
  [Figure: chain from q through p1 to p.]
- Density-connected:
  - A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
  [Figure: p and q both density-reachable from o.]
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
- Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
- Discovers clusters of arbitrary shape in spatial databases with noise
  [Figure: core, border, and outlier points; Eps = 1 cm, MinPts = 5.]
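A compact sketch of the procedure these definitions suggest (illustrative, not DBSCAN's original pseudocode; the label bookkeeping is simplified):

```python
def dbscan(points, eps, min_pts, dist):
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(points)

    def region(i):
        # Eps-neighbourhood of point i (includes i itself)
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        seeds = region(i)
        if len(seeds) < min_pts:             # not a core point
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:                         # collect density-reachable points
            j = queue.pop()
            if labels[j] == NOISE:           # border point, absorbed by the cluster
                labels[j] = cluster
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            nbrs = region(j)
            if len(nbrs) >= min_pts:         # j is a core point: keep expanding
                queue.extend(nbrs)
        cluster += 1
    return labels
```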
OPTICS: Some Extension from DBSCAN
- Index-based:
  - k = number of dimensions
  - N = 20
  - p = 75%
  - M = N(1 − p) = 5
  - Complexity: O(kN²)
- Core distance of an object o: the smallest radius that makes o a core point (undefined if o is not a core point w.r.t. Eps, MinPts)
- Reachability distance of p from o: max(core-distance(o), d(o, p))
  [Figure: with MinPts = 5 and e = 3 cm, r(p1, o) = 2.8 cm and r(p2, o) = 4 cm.]
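The two distances can be sketched as follows (illustrative names; `dist` is any metric over the data set):

```python
def core_distance(o, points, eps, min_pts, dist):
    # smallest radius that makes o a core point; None (undefined) if o is
    # not a core point w.r.t. eps, MinPts
    ds = sorted(dist(o, p) for p in points)      # includes d(o, o) = 0
    if len(ds) < min_pts or ds[min_pts - 1] > eps:
        return None
    return ds[min_pts - 1]

def reachability_distance(p, o, points, eps, min_pts, dist):
    # r(p, o) = max(core-distance(o), d(o, p)); undefined if o is not core
    cd = core_distance(o, points, eps, min_pts, dist)
    return None if cd is None else max(cd, dist(o, p))
```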
Cluster-Order of the Objects
  [Figure: reachability plot: the reachability distance (undefined for some objects) on the y-axis against the cluster order of the objects on the x-axis, for neighbourhood radii e and e'.]
Density-Based Clustering: OPTICS & Its Applications
  [Figure: OPTICS applied to sample data sets.]

DENCLUE: using statistical density functions
- Gaussian influence function:

      f_Gaussian(x, y) = e^(−d(x, y)² / (2σ²))

- Density function:

      f^D_Gaussian(x) = Σ_{i=1}^{N} e^(−d(x, x_i)² / (2σ²))

- Gradient:

      ∇f^D_Gaussian(x, x_i) = Σ_{i=1}^{N} (x_i − x) · e^(−d(x, x_i)² / (2σ²))

- Major features:
  - Complexity: O(N)
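The Gaussian density function and its gradient, as used for hill-climbing toward density attractors, can be sketched as (illustrative names):

```python
import math

def gaussian_density(x, data, sigma, dist):
    # f^D_Gaussian(x) = sum_i exp(-d(x, x_i)^2 / (2 sigma^2))
    return sum(math.exp(-dist(x, xi) ** 2 / (2 * sigma ** 2)) for xi in data)

def gaussian_gradient(x, data, sigma, dist):
    # gradient of the density function, pointing toward denser regions
    grad = [0.0] * len(x)
    for xi in data:
        w = math.exp(-dist(x, xi) ** 2 / (2 * sigma ** 2))
        for d in range(len(x)):
            grad[d] += (xi[d] - x[d]) * w
    return grad
```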
- Maximization step:
  - Estimation of model parameters
- Conceptual clustering
  - A form of clustering in machine learning
  - Produces a classification scheme for a set of unlabeled objects
  - Finds characteristic description for each concept (class)
- COBWEB (Fisher'87)
  - A popular and simple method of incremental conceptual learning
  - Creates a hierarchical clustering in the form of a classification tree
  - Each node refers to a concept and contains a probabilistic description of that concept
- Competitive learning
- Partition the data space and find the number of points that lie inside each cell of the partition
- Identify the subspaces that contain clusters using the Apriori principle
- Identify clusters
  - Determine dense units in all subspaces of interest
  - Determine connected dense units in all subspaces of interest
- Generate minimal description for the clusters
  - Determine maximal regions that cover a cluster of connected dense units for each cluster
  - Determine the minimal cover for each cluster
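A sketch of the first, one-dimensional pass (equal-width cells per dimension are an assumption for illustration; CLIQUE then joins dense units Apriori-style into higher-dimensional candidates):

```python
from collections import Counter

def dense_units(points, intervals, tau):
    # one-dimensional pass: partition each dimension into equal-width cells
    # and keep the cells containing at least tau points
    units = {}
    for dim in range(len(points[0])):
        lo = min(p[dim] for p in points)
        hi = max(p[dim] for p in points)
        width = (hi - lo) / intervals or 1.0   # guard against a flat dimension
        cells = Counter(min(int((p[dim] - lo) / width), intervals - 1)
                        for p in points)
        units[dim] = {cell for cell, count in cells.items() if count >= tau}
    return units
```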
  [Figure: CLIQUE example with τ = 3: counting points within unit cells of the (salary, age) and (vacation (week), age) planes, age 20–60; the dense units intersect for age roughly in 30–50, yielding a candidate cluster in the (age, salary, vacation) space.]
- Strength
  - Automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
  - Insensitive to the order of records in input and does not presume some canonical data distribution
- Where

      d_iJ = (1/|J|) Σ_{j∈J} d_ij        d_Ij = (1/|I|) Σ_{i∈I} d_ij        d_IJ = (1/(|I| |J|)) Σ_{i∈I, j∈J} d_ij
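These three submatrix means can be computed directly (an illustrative sketch; `d` is the data matrix, `I` and `J` the row and column index sets):

```python
def submatrix_means(d, I, J):
    # d_iJ: row means, d_Ij: column means, d_IJ: overall mean of submatrix (I, J)
    d_iJ = {i: sum(d[i][j] for j in J) / len(J) for i in I}
    d_Ij = {j: sum(d[i][j] for i in I) / len(I) for j in J}
    d_IJ = sum(d[i][j] for i in I for j in J) / (len(I) * len(J))
    return d_iJ, d_Ij, d_IJ
```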
- Customer segmentation
- Medical analysis
- Drawbacks
  - Most tests are for a single attribute
  - In many cases, the data distribution may not be known
- Distance-based outlier: a DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lie at a distance greater than D from O
- Algorithms for mining distance-based outliers
  - Index-based algorithm
  - Nested-loop algorithm
  - Cell-based algorithm
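The nested-loop approach follows directly from the DB(p, D) definition (a quadratic-time sketch for illustration):

```python
def db_outliers(points, p, D, dist):
    # O is a DB(p, D)-outlier when at least a fraction p of the objects
    # lie at a distance greater than D from O
    n = len(points)
    outliers = []
    for o in points:
        far = sum(1 for q in points if dist(o, q) > D)
        if far >= p * n:
            outliers.append(o)
    return outliers
```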