dm_cl2a
continued
1. BIRCH skipped
BIRCH (1996)
BIRCH: Balanced Iterative Reducing and Clustering using
Hierarchies, by Zhang, Ramakrishnan, and Livny
(SIGMOD’96)
Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
Phase 1: scan DB to build an initial in-memory CF tree
(a multi-level compression of the data that tries to
preserve the inherent clustering structure of the data)
Phase 2: use an arbitrary clustering algorithm to
cluster the leaf nodes of the CF-tree
Scales linearly: finds a good clustering with a single scan
and improves the quality with a few additional scans
Weakness: handles only numeric data, and is sensitive to
the order of the data records.
Clustering Feature Vector
[Figure: scatter plot of the five points (3,4), (2,6), (4,5), (4,7), (3,8); both axes run from 0 to 10]
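The definition of the CF vector did not survive on this slide, so the following is only a minimal sketch under an assumption: it uses the common textbook form CF = (N, LS, SS), where N is the number of points, LS the per-dimension linear sum, and SS the per-dimension sum of squares. The function names are illustrative and do not come from the slides.

```python
def clustering_feature(points):
    """Return (N, LS, SS) for a non-empty list of equal-length numeric tuples."""
    n = len(points)
    dims = range(len(points[0]))
    ls = tuple(sum(p[d] for p in points) for d in dims)       # linear sum
    ss = tuple(sum(p[d] ** 2 for p in points) for d in dims)  # sum of squares
    return n, ls, ss


def merge_cf(a, b):
    """CF vectors are additive: merging two sub-clusters adds their entries."""
    return (
        a[0] + b[0],
        tuple(x + y for x, y in zip(a[1], b[1])),
        tuple(x + y for x, y in zip(a[2], b[2])),
    )


def centroid(cf):
    """The centroid follows from N and LS alone, without the raw points."""
    n, ls, _ = cf
    return tuple(x / n for x in ls)


points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
cf = clustering_feature(points)
print(cf)            # (5, (16, 30), (54, 190))
print(centroid(cf))  # (3.2, 6.0)
```

Because the entries simply add, a CF tree node can summarize its children without revisiting the data, which is what makes the single-scan construction possible.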
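CF Tree
[Figure: CF tree. The root and each non-leaf node hold CF entries (CF1, CF2, CF3, CF5 shown), each paired with a pointer to a child node (child1, child2, child3, child5).]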
Chapter 8. Cluster Analysis
DBSCAN
Two parameters:
Eps: Maximum radius of the neighbourhood
MinPts: Minimum number of points in an Eps-neighbourhood
of that point
NEps(p): {q belongs to D | dist(p,q) <= Eps}
Directly density-reachable: A point p is directly
density-reachable from a point q wrt. Eps, MinPts if
1) p belongs to NEps(q)
2) core point condition: |NEps(q)| >= MinPts
[Figure: p lies in the Eps-neighbourhood of core point q; MinPts = 5, Eps = 1 cm]
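The two conditions can be checked directly in code. This is a minimal sketch, assuming Euclidean distance and points stored as tuples; the function names and the sample data are illustrative, and only Eps = 1 and MinPts = 5 are taken from the slide.

```python
import math


def eps_neighbourhood(data, p, eps):
    """N_Eps(p): every q in data with dist(p, q) <= eps (p itself included)."""
    return [q for q in data if math.dist(p, q) <= eps]


def is_core(data, q, eps, min_pts):
    """Core point condition: |N_Eps(q)| >= MinPts."""
    return len(eps_neighbourhood(data, q, eps)) >= min_pts


def directly_density_reachable(data, p, q, eps, min_pts):
    """True if p is directly density-reachable from q."""
    return math.dist(p, q) <= eps and is_core(data, q, eps, min_pts)


q = (0.0, 0.0)
p = (0.9, 0.0)
data = [q, p, (0.5, 0.0), (0.0, 0.5), (-0.5, 0.0), (0.0, -0.5)]

print(directly_density_reachable(data, p, q, eps=1.0, min_pts=5))  # True
print(directly_density_reachable(data, q, p, eps=1.0, min_pts=5))  # False
```

The relation is not symmetric: here q is a core point and p is not, so p is directly density-reachable from q but not the other way round.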
Density-Based Clustering: Background (II)
Density-reachable:
A point p is density-reachable from a point q wrt. Eps, MinPts if
there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is
directly density-reachable from pi
Density-connected:
A point p is density-connected to a point q wrt. Eps, MinPts if there
is a point o such that both p and q are density-reachable from o
wrt. Eps and MinPts.
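Both definitions amount to a graph search over core points. The following is a minimal sketch, assuming Euclidean distance and a brute-force neighbourhood lookup; the function names and the sample data are illustrative only.

```python
import math
from collections import deque


def neighbours(data, p, eps):
    """Eps-neighbourhood of p, including p itself."""
    return [q for q in data if math.dist(p, q) <= eps]


def reachable_from(data, q, eps, min_pts):
    """Set of points density-reachable from q (empty if q is not a core point)."""
    if len(neighbours(data, q, eps)) < min_pts:
        return set()
    reached, queue = {q}, deque([q])
    while queue:
        current = queue.popleft()
        nbrs = neighbours(data, current, eps)
        if len(nbrs) < min_pts:
            continue  # border point: reached, but a chain cannot pass through it
        for n in nbrs:
            if n not in reached:
                reached.add(n)
                queue.append(n)
    return reached


def density_connected(data, p, q, eps, min_pts):
    """True if some point o reaches both p and q."""
    return any({p, q} <= reachable_from(data, o, eps, min_pts) for o in data)


# Five points on a line plus one isolated point; the two end points of the
# line are border points, the three middle points are core points.
data = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (10, 0)]

print((4, 0) in reachable_from(data, (1, 0), eps=1, min_pts=3))    # True
print(reachable_from(data, (4, 0), eps=1, min_pts=3))              # set()
print(density_connected(data, (0, 0), (4, 0), eps=1, min_pts=3))   # True
```

The two border points cannot reach each other, yet they are density-connected through any of the core points between them, which is why clusters are defined via density-connectivity.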
DBSCAN: Density Based Spatial Clustering of Applications with Noise
[Figure: core and border points of a cluster, with Eps = 1 cm and MinPts = 5]
DBSCAN: The Algorithm
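Only the title of this slide survives, so the following is a minimal sketch of the commonly described DBSCAN procedure rather than a reproduction of the slide. It assumes Euclidean distance and a brute-force neighbourhood search (quadratic time); the names `dbscan`, `region`, and `NOISE` are illustrative.

```python
import math

NOISE = -1


def dbscan(data, eps, min_pts):
    """Return one cluster label per point; NOISE marks outliers."""
    labels = [None] * len(data)

    def region(i):
        # Indices of all points in the Eps-neighbourhood of data[i].
        return [j for j in range(len(data))
                if math.dist(data[i], data[j]) <= eps]

    cluster = -1
    for i in range(len(data)):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point of the current cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = region(j)
            if len(nbrs) >= min_pts:  # only core points extend the cluster
                queue.extend(nbrs)
    return labels


data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (5, 5)]
print(dbscan(data, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

A spatial index for the neighbourhood query would be the usual way to avoid the quadratic scan; it is left out here to keep the sketch short.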
DENCLUE Influence Function and its Gradient
Example
\[ f_{Gaussian}(x, y) = e^{-\frac{d(x,y)^2}{2\sigma^2}} \]
\[ f^{D}_{Gaussian}(x) = \sum_{i=1}^{N} e^{-\frac{d(x,x_i)^2}{2\sigma^2}} \]
\[ \nabla f^{D}_{Gaussian}(x, x_i) = \sum_{i=1}^{N} (x_i - x) \cdot e^{-\frac{d(x,x_i)^2}{2\sigma^2}} \]
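The Gaussian influence function, the density function, and its gradient can be evaluated numerically as follows. This is a minimal sketch, assuming d is the Euclidean distance and sigma is a user-chosen smoothing parameter; the function names and the two sample points are illustrative.

```python
import math


def gaussian_influence(x, y, sigma):
    """Influence of data point y on location x."""
    return math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2))


def density(x, data, sigma):
    """Density at x: the sum of the influences of all data points."""
    return sum(gaussian_influence(x, xi, sigma) for xi in data)


def density_gradient(x, data, sigma):
    """Gradient at x: influence-weighted sum of the vectors (xi - x)."""
    grad = [0.0] * len(x)
    for xi in data:
        weight = gaussian_influence(x, xi, sigma)
        for d in range(len(x)):
            grad[d] += (xi[d] - x[d]) * weight
    return grad


data = [(0.0, 0.0), (2.0, 0.0)]
midpoint = (1.0, 0.0)
print(density(midpoint, data, sigma=1.0))           # 2 * exp(-0.5)
print(density_gradient(midpoint, data, sigma=1.0))  # [0.0, 0.0]
```

At the midpoint of two equally weighted points the pulls cancel and the gradient vanishes; following the gradient uphill from any other location is the hill-climbing step that leads to a density attractor.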
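Example: Density Computation
D={x1,x2,x3,x4}
[Figure: data points x1, x2, x3, x4 and two locations x and y; the edges from the data points to x are labelled with the influence values 0.04, 0.06, 0.08 and 0.6]
Density at x = sum of the four influences = 0.04 + 0.06 + 0.08 + 0.6 = 0.78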
Remark: the density value of y would be larger than the one for x
Density Attractor
Examples of DENCLUE Clusters
Basic Steps DENCLUE Algorithms
Advantages of Grid-based Clustering Algorithms
Fast:
No distance computations
Easy to determine which clusters are neighboring
Cluster shapes are limited to unions of rectangular
grid cells
Grid-Based Clustering Methods
Several interesting methods (in addition to the
basic grid-based algorithm)
STING (a STatistical INformation Grid approach)
by Wang, Yang and Muntz (1997)
CLIQUE: Agrawal, et al. (SIGMOD’98)
STING: A Statistical Information Grid Approach
Wang, Yang and Muntz (VLDB’97)
The spatial area is divided into rectangular
cells
There are several levels of cells corresponding to
different levels of resolution
STING: A Statistical Information Grid Approach (2)
Main contribution of STING is the proposal of a data
structure that can be used for many purposes (e.g.
SCMRG uses it, and BIRCH uses a similar structure)
The data structure is used to form clusters based on
queries
Each cell at a high level is partitioned into a number of
smaller cells in the next lower level
Statistical info of each cell is calculated and stored
beforehand and is used to answer queries
Parameters of higher-level cells can be easily calculated
from the parameters of lower-level cells:
count, mean, s (standard deviation), min, max
type of distribution: normal, uniform, etc.
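The roll-up of numeric parameters from child cells to a parent cell can be sketched as follows. This is a minimal illustration under stated assumptions: s is taken to be the population standard deviation, each cell is a plain dictionary, and the distribution type is left out. The names `cell_stats` and `merge_stats` are hypothetical.

```python
import math


def cell_stats(values):
    """Statistics of a non-empty bottom-level cell, computed from raw values."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return {"count": n, "mean": mean, "s": s,
            "min": min(values), "max": max(values)}


def merge_stats(children):
    """Parent statistics derived from the children's statistics only."""
    n = sum(c["count"] for c in children)
    if n == 0:
        return {"count": 0, "mean": 0.0, "s": 0.0, "min": None, "max": None}
    mean = sum(c["count"] * c["mean"] for c in children) / n
    # For each child, the mean of x^2 equals s^2 + mean^2.
    second = sum(c["count"] * (c["s"] ** 2 + c["mean"] ** 2)
                 for c in children) / n
    s = math.sqrt(max(second - mean ** 2, 0.0))
    nonempty = [c for c in children if c["count"] > 0]
    return {"count": n, "mean": mean, "s": s,
            "min": min(c["min"] for c in nonempty),
            "max": max(c["max"] for c in nonempty)}


parent = merge_stats([cell_stats([1, 2, 3]), cell_stats([4, 6])])
direct = cell_stats([1, 2, 3, 4, 6])
print(parent["count"], parent["min"], parent["max"])  # 5 1 6
```

The parent computed from the two children matches the statistics computed directly from all five values (up to floating-point rounding), so higher levels never need to touch the raw data.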
Use a top-down approach to answer spatial data queries
Clusters are formed by merging cells that match a given query
STING: Query Processing (3)
Uses a top-down approach to answer spatial data queries
1. Start from a pre-selected layer, typically one with a small
number of cells
2. From the pre-selected layer down to the bottom
layer, do the following:
For each cell in the current level, compute the confidence
interval indicating the cell’s relevance to the given query;
If it is relevant, include the cell in a cluster
If it is irrelevant, remove the cell from further consideration
Otherwise, look for relevant cells at the next lower layer
3. Combine relevant cells into relevant regions (based on
grid-neighborhood) and return the resulting clusters
as your answers.
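The three steps above can be sketched as a layer-by-layer traversal followed by a connected-components pass. This is a simplified illustration, not STING itself: the hierarchy is assumed to be a quadtree over a square grid of counts, and the hypothetical `relevance` function replaces the confidence-interval test with a plain count threshold. All names are illustrative.

```python
from collections import deque


def build(counts, r0, c0, size):
    """Quadtree cell covering counts[r0:r0+size][c0:c0+size] (size a power of 2)."""
    if size == 1:
        return {"box": (r0, c0, size), "count": counts[r0][c0], "children": []}
    h = size // 2
    kids = [build(counts, r, c, h) for r in (r0, r0 + h) for c in (c0, c0 + h)]
    return {"box": (r0, c0, size),
            "count": sum(k["count"] for k in kids), "children": kids}


def relevance(cell, min_count):
    """Hypothetical stand-in for the confidence-interval test."""
    if cell["count"] == 0:
        return "irrelevant"            # nothing below can match the query
    if not cell["children"]:
        return "relevant" if cell["count"] >= min_count else "irrelevant"
    return "undecided"                 # inspect the next lower layer


def merge(cells):
    """Step 3: group relevant bottom-level cells by 4-neighbourhood."""
    regions, seen = [], set()
    for start in sorted(cells):
        if start in seen:
            continue
        seen.add(start)
        region, queue = [], deque([start])
        while queue:
            r, c = queue.popleft()
            region.append((r, c))
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in cells and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        regions.append(sorted(region))
    return regions


def query(root, min_count):
    relevant, layer = set(), [root]    # step 1: start at the chosen layer
    while layer:                       # step 2: descend layer by layer
        next_layer = []
        for cell in layer:
            label = relevance(cell, min_count)
            if label == "relevant":
                r0, c0, size = cell["box"]
                relevant.update((r, c) for r in range(r0, r0 + size)
                                for c in range(c0, c0 + size))
            elif label == "undecided":
                next_layer.extend(cell["children"])
        layer = next_layer
    return merge(relevant)


counts = [[5, 6, 0, 0],
          [0, 0, 0, 0],
          [0, 0, 0, 7],
          [0, 0, 8, 9]]
regions = query(build(counts, 0, 0, 4), min_count=5)
print(regions)  # [[(0, 0), (0, 1)], [(2, 3), (3, 2), (3, 3)]]
```

The two empty quadrants are discarded at the second layer without visiting their bottom-level cells, which is where the top-down approach saves work.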
STING: A Statistical Information Grid Approach (3)
Advantages:
Query-independent, easy to parallelize,
incremental update
Complexity: O(K), where K is the number of grid cells at
the lowest level
Can be used in conjunction with a grid-based
clustering algorithm
Disadvantages:
All the cluster boundaries are either
horizontal or vertical, and no diagonal
boundary is detected
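Subspace Clustering
[Figure: CLIQUE example. Two panels plot Salary (10,000) against age and Vacation (week) against age, with age running from 20 to 60 and the vertical axes from 0 to 7; density threshold τ = 3. A third panel shows Salary, Vacation, and age together.]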
Strength and Weakness of CLIQUE
Strength
It automatically finds subspaces of the highest
Self-organizing feature maps (SOMs)