Clustering Methods
Clustering Methods
3
Clustering for Data Understanding and
Applications
Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
Information retrieval: document clustering
Land use: Identification of areas of similar land use in an earth
observation database
Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
City-planning: Identifying groups of houses according to their house
type, value, and geographical location
Earth-quake studies: Observed earth quake epicenters should be
clustered along continent faults
Climate: understanding earth climate, find patterns of atmospheric
and ocean
Economic Science: market resarch
4
Clustering as a Preprocessing Tool (Utility)
Summarization:
Preprocessing for regression, PCA, classification, and
association analysis
Compression:
Image processing: vector quantization
Finding K-nearest Neighbors
Localizing search to one or a small number of clusters
Outlier detection
Outliers are often viewed as those “far away” from any
cluster
5
Quality: What Is Good Clustering?
6
Measure the Quality of Clustering
Dissimilarity/Similarity metric
Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical,
ordinal ratio, and vector variables
Weights should be associated with different variables
based on applications and data semantics
Quality of clustering:
There is usually a separate “quality” function that
measures the “goodness” of a cluster.
It is hard to define “similar enough” or “good enough”
The answer is typically highly subjective
7
Considerations for Cluster Analysis
Partitioning criteria
Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
Separation of clusters
Exclusive (e.g., one customer belongs to only one region) vs. non-
exclusive (e.g., one document may belong to more than one
class)
Similarity measure
Distance-based (e.g., Euclidian, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
Clustering space
Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering)
8
Requirements and Challenges
Scalability
Clustering all the data instead of only on samples
these
Constraint-based clustering
User may give inputs on constraints
Use domain knowledge to determine input parameters
Interpretability and usability
Others
Discovery of clusters with arbitrary shape
High dimensionality
9
Major Clustering Approaches (I)
Partitioning approach:
Construct various partitions and then evaluate them by some
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects)
Density-based approach:
Based on connectivity and density functions
Grid-based approach:
based on a multiple-level granularity structure
10
Major Clustering Approaches (II)
Model-based:
A model is hypothesized for each of the clusters and tries to find
Frequent pattern-based:
Based on the analysis of frequent patterns
User-guided or constraint-based:
Clustering by considering user-specified or application-specific
constraints
Typical methods: COD (obstacles), constrained clustering
Link-based clustering:
Objects are often linked together in various ways
11
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
12
Partitioning Algorithms: Basic Concept
E ik1 pCi ( p ci ) 2
Given k, find a partition of k clusters that optimizes the chosen
partitioning criterion
Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented
by the center of the cluster
k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
13
The K-Means Clustering Method
14
An Example of K-Means Clustering
K=2