UNIT-5 PPT
Cluster Analysis
Cluster Analysis: Basic Concepts and Algorithms
Overview: What Is Cluster Analysis? Different Types of Clustering; Different Types of Clusters
K-means: The Basic K-means Algorithm; Additional Issues; Bisecting K-means; Strengths and Weaknesses
Agglomerative Hierarchical Clustering: Basic Agglomerative Hierarchical Clustering Algorithm
DBSCAN: Traditional Density (Center-Based Approach); The DBSCAN Algorithm; Strengths and Weaknesses
Cluster Analysis
Cluster: a collection of data objects that are
similar to one another within the same cluster, and
dissimilar to the objects in other clusters
Cluster analysis: finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
Unsupervised learning: no predefined classes
Typical applications:
As a stand-alone tool to gain insight into the data distribution
As a preprocessing step for other algorithms
Clustering Applications
Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Land use: identification of areas of similar land use in an earth observation database
Insurance: identifying groups of motor insurance policy holders with a high average claim cost
City planning: identifying groups of houses according to their house type, value, and geographical location
Earthquake studies: observed earthquake epicenters should be clustered along continental faults
Clustering Applications and Multidisciplinary Efforts
Pattern recognition
Spatial data analysis
Create thematic maps in GIS by clustering feature spaces
Detect spatial clusters, or support other spatial mining tasks
Image processing
Economic science (especially market research)
WWW
Document classification
Cluster Weblog data to discover groups of similar access patterns
Quality: Good Clustering
A good clustering method produces clusters with high intra-cluster similarity and low inter-cluster similarity.
Different Types of Clusters
Well-separated clusters
Prototype-based clusters
Graph-based clusters
Density-based clusters
Different Types of Clusters
Well-separated clusters: each point is closer to all of the points in its cluster than to any point in another cluster.
Graph-based clusters: each point is closer to at least one point in its cluster than to any point in another cluster.
Shared-property (conceptual) clusters: points in a cluster share some general property that derives from the entire set of points.
Requirements of Clustering in Data Mining
Scalability
Ability to deal with different types of attributes
Ability to handle dynamic data
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Ability to deal with noise and outliers
Insensitivity to the order of input records
Ability to handle high dimensionality
Incorporation of user-specified constraints
Interpretability and usability
Major Clustering Approaches
Partitioning approach:
Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using some criterion
Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
Density-based approach:
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS, DenClue
Major Clustering Approaches
Grid-based approach:
Based on a multiple-level granularity structure
Typical methods: STING, WaveCluster, CLIQUE
Model-based approach:
A model is hypothesized for each of the clusters, and the method tries to find the best fit of that model to the data
Typical methods: EM, SOM, COBWEB
Frequent pattern-based approach:
Based on the analysis of frequent patterns
Typical methods: pCluster
User-guided or constraint-based approach:
Clustering by considering user-specified or application-specific constraints
Typical methods: COD (obstacles), constrained clustering
Typical Alternatives to Calculate the Distance between Clusters
Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = min dist(tip, tjq)
Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = max dist(tip, tjq)
Average: average distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = avg dist(tip, tjq)
Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj)
Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj), where a medoid is one chosen, centrally located object in the cluster
(A small code sketch of these measures follows.)
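As a quick illustration of these measures, the following Python sketch (using NumPy and SciPy, on two hypothetical 2-D clusters) computes the single-link, complete-link, average, and centroid distances:

import numpy as np
from scipy.spatial.distance import cdist

Ki = np.array([[1.0, 2.0], [2.0, 2.0], [2.0, 3.0]])   # cluster K_i (example points)
Kj = np.array([[6.0, 7.0], [7.0, 8.0]])               # cluster K_j (example points)

D = cdist(Ki, Kj)                 # all pairwise distances between the two clusters
print("single link:  ", D.min())  # smallest pairwise distance
print("complete link:", D.max())  # largest pairwise distance
print("average:      ", D.mean()) # average pairwise distance
print("centroid:     ", np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0)))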
Centroid, Radius and Diameter of a Cluster (for numerical data sets)
Centroid: the "middle" of a cluster:
$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$
Radius: square root of the average squared distance from any point of the cluster to its centroid:
$R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$
Diameter: square root of the average squared distance between all pairs of points in the cluster:
$D_m = \sqrt{\frac{\sum_{i=1}^{N}\sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$
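A minimal NumPy sketch of these three quantities, assuming a small hypothetical 2-D cluster:

import numpy as np

pts = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
N = len(pts)

centroid = pts.sum(axis=0) / N                        # C_m
radius = np.sqrt(((pts - centroid) ** 2).sum() / N)   # R_m
# squared pairwise distances; the diagonal (i == j) contributes zero
sq = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
diameter = np.sqrt(sq.sum() / (N * (N - 1)))          # D_m
print(centroid, radius, diameter)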
K-means
Partitioning Algorithms: Basic Concept
Partitioning method: construct a partition of a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized.
Basic K-means algorithm:
1: Select K points as initial centroids.
2: repeat
3: Form K clusters by assigning each point to its closest centroid.
4: Recompute the centroid of each cluster.
5: until Centroids do not change.
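The loop above translates almost line for line into NumPy. This is a minimal sketch (random data-point initialization; ties and empty clusters are not handled):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # 1: pick K points
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):                                    # 2: repeat
        # 3: assign each point to its closest centroid
        labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :],
                                axis=2).argmin(axis=1)
        # 4: recompute the centroid of each cluster (empty clusters not handled)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):                        # 5: until no change
            break
        centroids = new
    return centroids, labels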
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps:
1. Partition the objects into k non-empty subsets.
2. Compute the seed points as the centroids (mean points) of the clusters of the current partition.
3. Assign each object to the cluster with the nearest seed point.
4. Go back to step 2; stop when no more reassignments occur.
Example
[Figure: K-means iteration with K = 2. Arbitrarily choose K objects as the initial cluster means; assign each object to the most similar center; update the cluster means; reassign objects and repeat until the assignments no longer change.]
Example
Cluster the following eight points (with (x, y) representing locations) into three clusters:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
The initial cluster centers are A1(2, 10), A4(5, 8), and A7(1, 2). (A code sketch of this example follows.)
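The example can be checked with scikit-learn by passing the stated initial centers explicitly; this sketch is an illustration, not part of the original slides:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
              [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)   # A1..A8
init = X[[0, 3, 6]]                                           # A1, A4, A7

km = KMeans(n_clusters=3, init=init, n_init=1).fit(X)
print(km.labels_)            # final cluster index of each point A1..A8
print(km.cluster_centers_)   # final centroids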
K-Means
Advantages
1. Easy to implement.
2. With a large number of variables, K-means may be computationally faster than hierarchical clustering (if K is small).
Disadvantages
1. The number of clusters K must be specified in advance, and results depend on the initial choice of centroids.
2. Sensitive to noise and outliers.
3. Has difficulty with clusters of different sizes and densities, and with non-globular shapes.
Reducing the SSE with Postprocessing
Two strategies that decrease the total SSE by increasing the number of clusters are the following:
Split a cluster: the cluster with the largest SSE is usually chosen, but one could also split the cluster with the largest standard deviation for one particular attribute.
Introduce a new cluster centroid: often the point that is farthest from any cluster center is chosen.
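Here SSE is the usual K-means objective: the total squared distance from each point x to the centroid c_i of its cluster C_i,
$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \operatorname{dist}(c_i, x)^2$$
so adding a cluster can only keep the optimum SSE the same or lower it.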
Bisecting K-means
The bisecting K-means algorithm is a straightforward extension of the basic K-means algorithm.
To obtain K clusters, split the set of all points into two clusters, select one of these clusters to split, and so on, until K clusters have been produced.
Bisecting K-means
1: Initialize the list of clusters to contain the cluster consisting of all points.
2: repeat
3: Select a cluster from the list of clusters.
4: Bisect the selected cluster using basic K-means (possibly over several trial bisections, keeping the pair with the lowest total SSE).
5: Add the two resulting clusters to the list of clusters.
6: until The list of clusters contains K clusters.
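A rough Python sketch of this procedure; choosing the cluster with the largest SSE to split at each step is one common selection rule, not the only one:

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, seed=0):
    clusters = [X]                                    # start: one cluster of all points
    while len(clusters) < k:
        # select the cluster with the largest SSE to split
        sse = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        target = clusters.pop(int(np.argmax(sse)))
        # bisect it with basic 2-means
        half = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(target)
        clusters += [target[half == 0], target[half == 1]]
    return clusters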
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree; eventually all nodes belong to the same cluster.
[Figure: points merged into progressively larger clusters over successive stages.]
Dendrogram: Shows How the Clusters are Merged
A dendrogram records the sequence of merges and the proximity at which each merge took place; cutting the dendrogram at a desired level yields a clustering.
Defining Proximity between Clusters
Cluster proximity is typically defined as MIN (single link), MAX (complete link), or the group average of the pairwise distances between points in different clusters; Ward's method and centroid methods use centroid-based measures. (A code sketch follows.)
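For reference, SciPy implements these proximity definitions directly; a small sketch on hypothetical points:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5], [3, 4]], dtype=float)

for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)                     # full merge history
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut into two clusters
    print(method, labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the merge tree (needs matplotlib)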
Agglomerative Hierarchical Clustering: Basic Algorithm
1: Compute the proximity matrix, if necessary.
2: repeat
3: Merge the closest two clusters.
4: Update the proximity matrix to reflect the proximity between the new cluster and the original clusters.
5: until Only one cluster remains.
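A naive Python sketch of this algorithm using single-link proximity; it is O(n^3) and stops at k clusters rather than one, for illustration only:

import numpy as np
from scipy.spatial.distance import cdist

def single_link_agglomerative(X, k=1):
    clusters = [[i] for i in range(len(X))]       # every point starts as its own cluster
    D = cdist(X, X)                               # 1: proximity matrix
    while len(clusters) > k:                      # 2: repeat
        # 3: find the closest pair of clusters under single-link proximity
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].min()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # 4: merge; single link needs no matrix update beyond the member lists
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters                               # 5: until k clusters remain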
Complete Link
In complete-link (MAX) clustering, the proximity of two clusters is the largest distance between any point in one cluster and any point in the other; it is less susceptible to noise than single link, but tends to break large clusters and favors globular shapes.
DIANA (Divisive Analysis)
Works in the inverse order of AGNES: it starts with all points in one cluster and splits top-down, until eventually each point forms a cluster of its own.
[Figure: one cluster successively split into smaller clusters.]
Ward's Method and Centroid Methods
In Ward's method, the proximity between two clusters is defined as the increase in the squared error that results when the two clusters are merged.
Although Ward's method is somewhat distinct from the other hierarchical techniques, it can be shown mathematically that it is very similar to the group average method when the proximity between two points is taken to be the square of the distance between them.
Centroid methods calculate the proximity between two clusters as the distance between the cluster centroids, similar to K-means.
Strengths and Weaknesses
Agglomerative hierarchical algorithms are typically used when a hierarchy is required, and some studies suggest they can produce better-quality clusters than K-means.
Agglomerative hierarchical clustering algorithms are expensive in terms of their computational and storage requirements.
The fact that all merges are final can also cause trouble for noisy, high-dimensional data, such as document data.
These two problems can be addressed to some degree by first partially clustering the data using another technique, such as K-means, as sketched below.
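A sketch of that hybrid idea on placeholder data: run K-means to obtain many small clusters, then cluster the resulting centroids hierarchically and map each point through its centroid:

import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(1000, 2))           # placeholder data

km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(X)  # partial clustering
Z = linkage(km.cluster_centers_, method="average")            # hierarchy over centroids
centroid_labels = fcluster(Z, t=5, criterion="maxclust")      # 5 final clusters
point_labels = centroid_labels[km.labels_]                    # label each original point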
Density-Based Clustering Methods
Clustering based on density (a local cluster criterion), such as density-connected points.
Major features:
Discovers clusters of arbitrary shape
Handles noise
Needs only one scan of the data
Needs density parameters as a termination condition
A representative study: DBSCAN (Ester et al., KDD'96).
Density-Based Clustering: Basic Concepts
Two parameters:
Eps: maximum radius of the neighbourhood
MinPts: minimum number of points in an Eps-neighbourhood of that point
NEps(p) = {q belongs to D | dist(p, q) <= Eps}
Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
p belongs to NEps(q), and
q satisfies the core point condition: |NEps(q)| >= MinPts
[Figure: p directly density-reachable from q, with MinPts = 5 and Eps = 1 cm]
Density-reachable: p is density-reachable from q if there is a chain of points p1 = q, ..., pn = p such that each pi+1 is directly density-reachable from pi.
Density-connected: a point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.
[Figure: p and q density-connected through o]
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points.
Discovers clusters of arbitrary shape in spatial databases with noise.
[Figure: core, border, and outlier (noise) points, with Eps = 1 cm and MinPts = 5]
Example
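A sketch of DBSCAN in scikit-learn on synthetic two-moons data; the eps and min_samples values here are illustrative, not prescribed by the slides:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.08, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # Eps and MinPts in the slides' notation
print(np.unique(db.labels_))                 # cluster ids; -1 marks noise/outliers
print(len(db.core_sample_indices_))          # number of core points found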
DBSCAN: The Algorithm
Arbitrarily select a point p.
Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
Continue the process until all of the points have been processed.
Outlier Discovery
Outliers are objects considerably dissimilar from the remainder of the data (e.g., exceptional performers such as Wayne Gretzky, ...).
Problem: define and find outliers in large data sets.
Applications:
Credit card fraud detection
Customer segmentation
Medical analysis