Unit - 4 - Modified
Syllabus
• Cluster Analysis: Introduction
• Requirements and overview of different categories
• Partitioning method: Introduction
• k-means
• k-medoids
• Hierarchical method: Introduction
• Agglomerative vs. Divisive method
• Distance measures in algorithmic methods
• BIRCH technique
• DBSCAN technique
• STING technique
• CLIQUE technique
• Evaluation of clustering techniques
Session 1
Cluster Analysis: Introduction
Requirements and overview of different categories
• Clustering is the process of grouping a set of data objects into multiple
groups or clusters
• so that objects within a cluster have high similarity, but are very
dissimilar to objects in other clusters.
• Dissimilarities and similarities are assessed based on the attribute
values describing the objects and often involve distance measures.
• Clustering as a data mining tool has its roots in many application
areas such as biology, security, business intelligence, and Web search.
Cluster Analysis
• Cluster: A collection of data objects
• similar (or related) to one another within the same group
• dissimilar (or unrelated) to the objects in other groups
• Cluster analysis (or clustering, data segmentation, …)
• Finding similarities between data according to the characteristics found in the
data and grouping similar data objects into clusters
• Unsupervised learning: no predefined classes (i.e., learning by observations vs.
learning by examples: supervised)
• Typical applications
• As a stand-alone tool to get insight into data distribution
• As a preprocessing step for other algorithms
Applications of Cluster Analysis
• Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species
• Information retrieval: document clustering
• Land use: Identification of areas of similar land use in an earth observation database
• Marketing: Help marketers discover distinct groups in their customer bases, and then use this
knowledge to develop targeted marketing programs
• City-planning: Identifying groups of houses according to their house type, value, and
geographical location
• Earthquake studies: observed earthquake epicenters should be clustered along continental faults
• Climate: understanding Earth's climate by finding patterns in atmospheric and ocean data
• Economic Science: market research
• Owing to the huge amounts of data collected in databases, cluster analysis has recently become
a highly active topic in data mining research.
• distance-based cluster analysis
• Unsupervised learning // the class label is not present in the data D
• Learning by observation // classification is learning by example
Clustering as a Preprocessing Tool (Utility)
• Summarization:
• Preprocessing for regression, PCA, classification, and association analysis
• Compression:
• Image processing: vector quantization
• Finding K-nearest Neighbors
• Localizing search to one or a small number of clusters
• Outlier detection
• Outliers are often viewed as those “far away” from any cluster
Quality: What Is Good Clustering?
• Partitioning criteria
• Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable) (e.g., mining_University)
• Separation of clusters
• Clusters may or may not be exclusive
• Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may
belong to more than one class)
• Similarity measure
• Distance-based (e.g., Euclidian, road network, vector) vs. connectivity-based (e.g., density or contiguity)
• Clustering space
• Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)
• May lead to unreliable similarity measurements
Conclusion
• Clustering algorithms have several requirements, including scalability and the ability to deal with different types of attributes, noisy data, incremental updates, clusters of arbitrary shape, and constraints. Interpretability and usability are also important.
• They also differ with respect to the partitioning level, whether or not clusters are mutually exclusive, the similarity measures used, and whether or not subspace clustering is performed.
• We also study the types of data that often occur in cluster analysis and how to preprocess them for such an analysis.
• Suppose that a data set to be clustered contains n objects, which may
represent persons, houses, documents, countries, and so on. Main
memory-based clustering algorithms typically operate on either of the
following two data structures.
Types of Data in Cluster Analysis
Data matrix (or object-by-variable structure):
This represents n objects, such as persons, with p variables (also
called measurements or attributes), such as age, height, weight, gender,
and so on.
• The structure is in the form of a relational table, or n-by-p matrix (n objects × p variables)
Dissimilarity matrix (or object-by-object structure):
This stores a collection of proximities for all pairs of the n objects, often as an n-by-n table.
• d(i, j) is a nonnegative number that is close to 0 when objects i and j are highly similar or "near" each other, and becomes larger the more they differ.
• For example, changing measurement units from meters to inches for height, or from kilograms
to pounds for weight, may lead to a very different clustering structure.
• To avoid dependence on the choice of measurement units, the data should be standardized.
• Standardizing measurements attempts to give all variables an equal weight.
• Data Transformation by Normalization
• The measurement unit used can affect the data analysis.
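As a minimal illustration of such standardization (a sketch, not from the original slides), the following Python snippet standardizes a numeric variable using the mean absolute deviation, z_if = (x_if − m_f) / s_f, which is commonly preferred over the standard deviation because it is less sensitive to outliers:

```python
import numpy as np

def standardize(values):
    """Standardize one variable: z = (x - mean) / mean absolute deviation."""
    x = np.asarray(values, dtype=float)
    m = x.mean()                      # mean value m_f of the variable
    s = np.abs(x - m).mean()          # mean absolute deviation s_f
    return (x - m) / s

# Hypothetical heights recorded in meters vs. the same heights in inches
heights_m = [1.55, 1.70, 1.85, 1.60]
heights_in = [h * 39.37 for h in heights_m]

# After standardization both unit choices give the same z-scores,
# so the clustering structure no longer depends on the measurement unit.
print(standardize(heights_m))
print(standardize(heights_in))
```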
• A binary variable is symmetric if both of its states are equally valuable and carry the same weight.
• Dissimilarity that is based on symmetric binary variables is called symmetric binary dissimilarity.
• A binary variable is asymmetric if the outcomes of the states are not equally important, such as the
positive and negative outcomes of a disease test.
• A binary variable contains two possible outcomes: 1 (positive/present) or 0
(negative/absent). If there is no preference for which outcome should be coded as
0 and which as 1, the binary variable is called symmetric.
• For example, the binary variable "is evergreen?" for a plant has the possible states
"loses leaves in winter" and "does not lose leaves in winter." Both are equally
valuable and carry the same weight when a proximity measure is computed.
• If the outcomes of a binary variable are not equally important, the binary variable
is called asymmetric.
• An example of such a variable is the presence or absence of a relatively rare
attribute, such as "is color-blind" for a human being.
• While you say that two people who are color-blind have something in common,
you cannot say that people who are not color-blind have something in common.
Jaccard Coefficient
• The number of negative matches, t, is considered unimportant and thus is
ignored in the computation, as
• we can measure the distance between two binary variables based on the
notion of similarity instead of dissimilarity.
• Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(Jack, Jim) = (1 + 1) / (1 + 1 + 1) = 0.67
d(Jim, Mary) = (1 + 2) / (1 + 1 + 2) = 0.75
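A small Python sketch that reproduces these values (assuming, as in the example, that Y and P are coded as 1, N as 0, and the symmetric attribute gender is ignored):

```python
def asym_binary_dissim(a, b):
    """Asymmetric binary dissimilarity: (r + s) / (q + r + s),
    where negative matches (both 0) are ignored."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # positive matches
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    return (r + s) / (q + r + s)

# Fever, Cough, Test-1..Test-4 with Y/P coded as 1 and N as 0
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(asym_binary_dissim(jack, mary), 2))  # 0.33
print(round(asym_binary_dissim(jack, jim), 2))   # 0.67
print(round(asym_binary_dissim(jim, mary), 2))   # 0.75
```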
"How can we compute the dissimilarity between objects described by categorical, ordinal, and ratio-scaled variables?"
Categorical, Ordinal, and Ratio-Scaled
Variables
• A categorical variable is a generalization of the binary variable in that it can take on more than
two states.
• For example, map color is a categorical variable that may have, say, five states: red, yellow, green,
pink, and blue.
• Let the number of states of a categorical variable be M. The states can be denoted by letters, symbols, or a set of integers, such as 1, 2, ..., M.
• The dissimilarity between two objects i and j can be computed based on the ratio of mismatches (Eqn 7.3):
d(i, j) = (p − m) / p
• where m is the number of matches (i.e., the number of variables for which i and j are in the same state), and p is the total number of variables.
Suppose that we have the sample data of Table 7.3, except that only the object identifier and the variable (or attribute) test-1 are available, where test-1 is categorical. Let's compute the dissimilarity matrix.
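As a sketch (assuming, as in the book's Table 7.3, that the four objects take the test-1 values code-A, code-B, code-C, and code-A), the mismatch-ratio dissimilarity matrix can be computed as:

```python
import numpy as np

def categorical_dissim(values):
    """d(i, j) = (p - m) / p for a single categorical variable (p = 1)."""
    n = len(values)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            d[i, j] = 0.0 if values[i] == values[j] else 1.0
    return d

# Assumed test-1 values for objects 1-4
test1 = ["code-A", "code-B", "code-C", "code-A"]
print(categorical_dissim(test1))
# Off the diagonal, only d(1, 4) = d(4, 1) = 0, since objects 1 and 4 share the state code-A.
```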
3. Ordinal Variables
• An ordinal variable can be discrete or continuous. (we need to convert ordinal
into ratio scale)
• Order is important, e.g. rank (junior, senior)
• Can be treated like interval-scaled
• Replace an ordinal variable value by its rank:
• The distance can be calculated by treating ordinal as quantitative
• Map the range of each variable onto [0.0, 1.0] by replacing the rank r_if of the i-th object in the f-th variable by the normalized rank
z_if = (r_if − 1) / (M_f − 1)
• There are three states for test-2, namely fair, good, and excellent; that is, M_f = 3.
• In step 1, if we replace each value for test-2 by its rank, the four objects are assigned the ranks 3, 1, 2, and 3, respectively.
• Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0.
• For step 3, we can use, say, the Euclidean distance, which results in
the following dissimilarity matrix:
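A minimal Python sketch of these three steps (assuming the test-2 values excellent, fair, good, excellent for objects 1 to 4, consistent with the ranks 3, 1, 2, 3 above):

```python
import numpy as np

# Assumed test-2 values for objects 1-4
test2 = ["excellent", "fair", "good", "excellent"]
order = {"fair": 1, "good": 2, "excellent": 3}   # M_f = 3 ordered states

# Step 1: replace each value by its rank r_if
ranks = np.array([order[v] for v in test2], dtype=float)      # [3, 1, 2, 3]

# Step 2: normalize ranks onto [0, 1]: z_if = (r_if - 1) / (M_f - 1)
z = (ranks - 1) / (len(order) - 1)                            # [1.0, 0.0, 0.5, 1.0]

# Step 3: Euclidean distance on the normalized values (1-D case)
d = np.abs(z[:, None] - z[None, :])
print(d)   # e.g., d(2,1) = 1.0, d(3,1) = 0.5, d(4,1) = 0.0
```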
Ratio-Scaled Variables
• A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an approximately exponential scale Ae^(Bt) or Ae^(−Bt), where A and B are positive constants and t typically represents time. Examples: the growth of a bacteria population or the decay of a radioactive element.
Ratio-Scaled Variables
• There are three common ways to handle ratio-scaled variables when computing dissimilarity:
• Treat ratio-scaled variables like interval-scaled variables (usually not a good choice, since the scale may be distorted).
• Apply a logarithmic transformation, y_if = log(x_if), and treat the transformed values as interval-scaled.
• Treat x_if as continuous ordinal data and use its rank as an interval-scaled value.
• The latter two methods are the most effective, although the choice of method used may depend on the given application.
Dissimilarity between ratio-scaled variables
4. Variables of Mixed Types
• how can we compute the dissimilarity between objects of mixed
variable types?”
• One approach is to group each kind of variable together, performing a
separate cluster analysis for each variable type.
• A more preferable approach is to process all variable types together,
performing a single cluster analysis.
• Suppose that the data set contains p variables of mixed type. The dissimilarity d(i, j) between objects i and j is defined as
d(i, j) = Σ_{f=1}^{p} δ_ij^(f) d_ij^(f) / Σ_{f=1}^{p} δ_ij^(f)
where the indicator δ_ij^(f) = 0 if x_if or x_jf is missing (or if x_if = x_jf = 0 and variable f is asymmetric binary), and δ_ij^(f) = 1 otherwise; d_ij^(f) is the contribution of variable f, computed according to its type.
Variables of Mixed Types
Variables of Mixed Types
• Apply a logarithmic transformation to its values. Based on the transformed values of 2.65, 1.34, 2.21, and 3.08 obtained for the objects 1 to 4,
• max_h x_h = 3.08 and min_h x_h = 1.34.
• Then normalize the values in the dissimilarity matrix obtained in Example 7.5 by dividing each one by (3.08 − 1.34) = 1.74.
• We can now use the dissimilarity matrices for the three variables in
our computation.
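As a sketch (assuming raw test-3 values of 445, 22, 164, and 1210, which are consistent with the transformed values 2.65, 1.34, 2.21, and 3.08 quoted above), the log transform and normalization can be written as:

```python
import numpy as np

test3 = np.array([445.0, 22.0, 164.0, 1210.0])   # assumed ratio-scaled values
logged = np.log10(test3)                          # approx. [2.65, 1.34, 2.21, 3.08]

# Pairwise dissimilarity on the transformed values, normalized to [0, 1]
# by dividing by max - min = 3.08 - 1.34 = 1.74
diff = np.abs(logged[:, None] - logged[None, :])
normalized = diff / (logged.max() - logged.min())
print(np.round(normalized, 2))
```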
Vector Objects
• There are several ways to define such a similarity function, s(x, y), to
compare two vectors x and y.
• One popular way is to define the similarity function as a cosine measure:
s(x, y) = (x^T · y) / (‖x‖ ‖y‖)
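A small sketch of the cosine measure in Python (the two example vectors are hypothetical term-frequency vectors):

```python
import numpy as np

def cosine_similarity(x, y):
    """s(x, y) = x.y / (||x|| * ||y||); 1 means same direction, 0 means orthogonal."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Hypothetical term-frequency vectors for two documents
doc1 = [5, 0, 3, 0, 2]
doc2 = [3, 0, 2, 0, 1]
print(round(cosine_similarity(doc1, doc2), 3))
```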
Session 2
Partitioning Method: Introduction
K-Means Algorithm
Partitioning Algorithms: Basic Concept
• Partitioning method: Partitioning a database D of n objects into a set of k clusters, such that the
sum of squared distances is minimized (where ci is the centroid or medoid of cluster Ci)
E = Σ_{i=1}^{k} Σ_{p ∈ C_i} dist(p, c_i)²
• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
• Global optimal: exhaustively enumerate all partitions
• Heuristic methods: k-means and k-medoids algorithms
• k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the center of the cluster
• k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is
represented by one of the objects in the cluster
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in four steps:
• Partition objects into k nonempty subsets
• Compute seed points as the centroids of the clusters of the current
partitioning (the centroid is the center, i.e., mean point, of the cluster)
• Assign each object to the cluster with the nearest seed point
• Go back to Step 2, stop when the assignment does not change
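As a sketch of these steps in practice (using scikit-learn, which is assumed to be available; the sample points are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# k-means with k = 2: iteratively assigns points to the nearest centroid
# and recomputes centroids until the assignment no longer changes.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster index of each point
print(km.cluster_centers_)  # final centroids (mean points of each cluster)
```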
K-Means Clustering-
K-Means clustering is an unsupervised iterative clustering technique.
It partitions the given data set into k predefined distinct clusters.
A cluster is defined as a collection of data points exhibiting certain similarities.
• Point-01:
• It is relatively efficient with time complexity O(nkt) where-
• n = number of instances
• k = number of clusters
• t = number of iterations
• Point-02:
• K-means is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
• K-Medoids: Instead of taking the mean value of the object in a cluster as a reference point,
medoids can be used, which is the most centrally located object in a cluster
• Another variant to k-means is the k-modes method, which extends the k-means
paradigm to cluster categorical data by replacing the means of clusters with
modes, using new dissimilarity measures to deal with categorical objects and a
frequency-based method to update modes of clusters. The k-means and the k-
modes methods can be integrated to cluster data with mixed numeric and
categorical values.
• The EM (Expectation-Maximization) algorithm extends the k-means paradigm in a
different way. Whereas the k-means algorithm assigns each object to a cluster,
• In EM, each object is assigned to each cluster according to a weight representing its probability of membership.
• In other words, there are no strict boundaries between clusters. Therefore, new
means are computed based on weighted measures.
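A sketch of this soft assignment using scikit-learn's Gaussian mixture model, which is fitted by EM (the data points are hypothetical):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical 1-D data with two loose groups
X = np.array([[1.0], [1.2], [0.8], [5.0], [5.3], [4.7], [3.0]])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Unlike k-means, each object gets a membership weight for every cluster,
# and the new means are computed from these weighted assignments.
print(np.round(gm.predict_proba(X), 3))
print(gm.means_.ravel())
```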
"How can we make the k-means algorithm more scalable?"
A recent approach to scaling the k-means algorithm is based on the idea of identifying three
kinds of regions in data:
1. regions that are compressible,
2. regions that must be maintained in main memory,
3. and regions that are discardable.
An object is discardable if its membership in a cluster is ascertained.
An object is compressible if it is not discardable but belongs to a tight subcluster.
A data structure known as a clustering feature is used to summarize objects that have been
discarded or compressed.
If an object is neither discardable nor compressible, then it should be retained in main memory.
To achieve scalability, the iterative clustering algorithm includes only the clustering features of the compressible objects and the objects that must be retained in main memory,
• thereby turning a secondary-memory-based algorithm into a main-memory-based algorithm.
• An alternative approach to scaling the k-means algorithm explores the microclustering idea, which first groups nearby objects into "microclusters" and then performs k-means clustering on the microclusters.
Cluster the following eight points (with (x, y) representing locations) into three clusters:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as
ρ(a, b) = |x2 – x1| + |y2 – y1|
Use K-Means Algorithm to find the three cluster centers after the second iteration.
We calculate the distance of each point from each of the centers of the three clusters, using the given distance function.
For example, the distance between A1(2, 10) and C2(5, 8) is |5 − 2| + |8 − 10| = 3 + 2 = 5.
• Calculate new cluster centers.
• After the second iteration, the centers of the three clusters are:
• C1(3, 9.5)
• C2(6.5, 5.25)
• C3(1.5, 3.5)
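A short Python sketch of this worked example (a from-scratch k-means using the Manhattan distance given above), which reproduces the centers after the second iteration:

```python
import numpy as np

points = np.array([(2, 10), (2, 5), (8, 4), (5, 8),
                   (7, 5), (6, 4), (1, 2), (4, 9)], dtype=float)  # A1..A8
centers = np.array([(2, 10), (5, 8), (1, 2)], dtype=float)        # A1, A4, A7

for iteration in range(2):
    # Assign each point to the cluster with the nearest center (Manhattan distance)
    dist = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = dist.argmin(axis=1)
    # Recompute each center as the mean of its assigned points
    centers = np.array([points[labels == k].mean(axis=0) for k in range(3)])

print(centers)   # [[3.   9.5 ], [6.5  5.25], [1.5  3.5 ]]
```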
Practice Problem
Session 3
K-Medoids
Hierarchical Method: Introduction
Quality of clustering: measured by the variation within clusters, e.g., the within-cluster sum of squared errors
E = Σ_{i=1}^{k} Σ_{p ∈ C_i} dist(p, c_i)²
“How can we modify the k-means algorithm to diminish such sensitivity to
outliers?”
• Instead of taking the mean value of the objects in a cluster as a reference point,
• pick actual objects to represent the clusters, using one representative object per
cluster.
• Each remaining object is assigned to the cluster whose representative object it is most similar to.
• The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object p and its corresponding representative object.
• That is, an absolute-error criterion is used, defined as
E = Σ_{i=1}^{k} Σ_{p ∈ C_i} dist(p, o_i)
where o_i is the representative object (medoid) of cluster C_i.
The K-Medoid Clustering Method
• K-Medoids Clustering: Find representative objects (medoids) in clusters
• Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the
non-medoids if it improves the total distance of the resulting clustering
• PAM works effectively for small data sets, but does not scale well for large data sets (due to the
computational complexity)
[Figure: PAM on a 10 × 10 grid of sample points — arbitrarily choose k objects as the initial medoids, then assign each remaining object to the nearest medoid.]
[Figure, continued: repeat — compute the total cost of swapping a medoid O with a non-medoid O_random, and perform the swap if the quality of the clustering is improved — until no change.]
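A minimal sketch of this swap-based PAM loop in Python (not the slides' own code; it uses Manhattan distance and small hypothetical data):

```python
import numpy as np
from itertools import product

def total_cost(points, medoid_idx):
    """Sum of distances from each point to its nearest medoid (absolute-error E)."""
    d = np.abs(points[:, None, :] - points[medoid_idx][None, :, :]).sum(axis=2)
    return d.min(axis=1).sum()

def pam(points, k):
    medoids = list(range(k))                      # arbitrarily pick the first k objects
    best = total_cost(points, medoids)
    improved = True
    while improved:                               # repeat until no change
        improved = False
        for m, o in product(range(k), range(len(points))):
            if o in medoids:
                continue
            candidate = medoids.copy()
            candidate[m] = o                      # try swapping medoid m with non-medoid o
            cost = total_cost(points, candidate)
            if cost < best:                       # keep the swap if quality improves
                medoids, best, improved = candidate, cost, True
    return medoids, best

# Hypothetical 2-D points
points = np.array([(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
                   (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)], dtype=float)
print(pam(points, k=2))
```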
• K-medoids is more robust than k-means in the presence of noise and outliers.
• Time complexity: each iteration of PAM costs O(k(n − k)²), which is expensive for large n and k.
• "How can we scale up the k-medoids method?"
• CLARA (Clustering LARge Applications) works on random samples of the data rather than the whole data set.
• In some situations we may want to partition our data into groups at
different levels such as in a hierarchy.
• A hierarchical clustering method works by grouping data objects into
a hierarchy or “tree” of clusters.
• Representing data objects in the form of a hierarchy is useful for data
summarization and visualization.
• Examples: handwriting recognition, hierarchies of species (animals, birds, etc.), employee hierarchies, and game trees (e.g., chess).
• Agglomerative versus divisive hierarchical clustering,
• which organize objects into a hierarchy using a bottom-up or top-
down strategy, respectively.
• Agglomerative methods start with individual objects as clusters,
which are iteratively merged to form larger clusters.
• Conversely, divisive methods initially let all the given objects form one cluster, which they iteratively split into smaller clusters.
• Hierarchical clustering methods can encounter difficulties regarding
the selection of merge or split points. Such a decision is critical,
• merge or split decisions, if not well chosen, may lead to low-quality
clusters.
Moreover, the methods do not scale well because each decision of
merge or split needs to examine and evaluate many objects or clusters.
Solution: they can be combined with other clustering techniques to form multiphase clustering (e.g., BIRCH).
Hierarchical Clustering
DBSCAN technique
STING technique
CLIQUE technique
• Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
DIANA (Divisive Analysis)
Distance Between Clusters
• Single Link: smallest distance between points
• Complete Link: largest distance between points
• Average Link: average distance between points
• Centroid: distance between centroids
Distance between Clusters
• Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq) // used when updating the distance matrix
• Complete link: largest distance between an element in one cluster and an element
in the other, i.e., dist(Ki, Kj) = max(tip, tjq)
• Average: avg distance between an element in one cluster and an element in the
other, i.e., dist(Ki, Kj) = avg(tip, tjq)
• Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
• Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
• Medoid: a chosen, centrally located object in the cluster
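A small sketch computing these inter-cluster distances for two hypothetical 1-D clusters:

```python
import numpy as np

K1 = np.array([1.0, 2.0, 3.0])      # hypothetical cluster 1
K2 = np.array([6.0, 7.0, 9.0])      # hypothetical cluster 2

pairwise = np.abs(K1[:, None] - K2[None, :])   # all |tip - tjq| pairs

print("single link  :", pairwise.min())                 # 3.0 (closest pair: 3 and 6)
print("complete link:", pairwise.max())                 # 8.0 (farthest pair: 1 and 9)
print("average link :", pairwise.mean())                # average over all pairs
print("centroid     :", abs(K1.mean() - K2.mean()))     # distance between cluster means
```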
Centroid, Radius and Diameter of a Cluster (for numerical data
sets)
• Centroid: the "middle" of a cluster, C_m = (Σ_{i=1}^{N} t_ip) / N
• Radius: square root of the average squared distance from any point of the cluster to its centroid, R_m = sqrt( Σ_{i=1}^{N} (t_ip − C_m)² / N )
• Diameter: square root of the average squared distance between all pairs of points in the cluster, D_m = sqrt( Σ_{i=1}^{N} Σ_{j=1}^{N} (t_ip − t_jq)² / (N(N − 1)) )
[Figure: a dendrogram over objects A, B, C, D, E, cut at a threshold level to obtain clusters.]
Problem: For the one-dimensional data set {7, 10, 20, 28, 35}, perform hierarchical clustering and plot the dendrogram to visualize it.
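A sketch of this problem using SciPy's hierarchical clustering utilities (single linkage assumed, since the problem does not specify one):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

data = np.array([7, 10, 20, 28, 35], dtype=float).reshape(-1, 1)

# Agglomerative clustering with single (min) linkage
Z = linkage(data, method="single")
print(Z)   # each row: the two clusters merged and the distance at which they merge

dendrogram(Z, labels=["7", "10", "20", "28", "35"])
plt.show()
```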
Extensions to Hierarchical Clustering
• Major weaknesses of agglomerative clustering methods: they can never undo what was done previously, and they do not scale well (time complexity of at least O(n²), where n is the number of objects).
[Figure: BIRCH clustering feature example — five points (3,4), (2,6), (4,5), (4,7), (3,8) plotted on a 10 × 10 grid; with the linear sum LS = Σ_{i=1}^{N} X_i and the square sum SS = Σ_{i=1}^{N} X_i², the clustering feature is CF = (5, (16, 30), (54, 190)).]
CF-Tree in BIRCH
• Clustering feature:
• Summary of the statistics for a given subcluster: the 0-th, 1st, and 2nd moments of the
subcluster from the statistical point of view
• Registers crucial measurements for computing clusters and utilizes storage efficiently
• A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
[Figure: a non-leaf node of a CF tree, holding entries CF1, CF2, CF3, …, CF5 with child pointers child1, child2, child3, …, child5.]
The Birch Algorithm
• Cluster diameter: D = sqrt( Σ_{i≠j} (x_i − x_j)² / (n(n − 1)) )
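A sketch of a clustering feature CF = (N, LS, SS) and the diameter computed from it, using the five points from the figure above:

```python
import numpy as np

points = np.array([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)], dtype=float)

# Clustering feature: 0th, 1st and 2nd moments of the subcluster
N = len(points)                       # 5
LS = points.sum(axis=0)               # linear sum per dimension -> [16. 30.]
SS = (points ** 2).sum(axis=0)        # square sum per dimension -> [54. 190.]
print(N, LS, SS)

# Diameter from the CF alone, using the identity
# sum over i != j of ||xi - xj||^2 == 2 * (N * sum(SS) - ||LS||^2)
ss_total = SS.sum()
diameter = np.sqrt(2 * (N * ss_total - (LS ** 2).sum()) / (N * (N - 1)))
print(round(float(diameter), 3))
```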
Partitioning and hierarchical methods are
designed to find spherical-shaped clusters.
Session 6
DBSCAN
Density-based clusters are dense areas in the data
space separated from each other by sparser areas.
Given such data, partitioning and hierarchical methods would likely identify convex regions inaccurately, with noise or outliers included in the clusters.
• Two parameters:
• Eps: Maximum radius of the neighbourhood
• MinPts: Minimum number of points in an Eps-
neighbourhood of that point
• NEps(p): {q belongs to D | dist(p,q) ≤ Eps}
• Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
• p belongs to NEps(q), and
• q satisfies the core point condition: |NEps(q)| ≥ MinPts
[Figure: p inside the Eps-neighborhood of core point q, with MinPts = 5 and Eps = 1 cm.]
Density-Reachable and Density-Connected
• Density-reachable:
• A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi
[Figure: a chain q = p1 → … → p, each link directly density-reachable.]
• Density-connected:
• A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
[Figure: p and q both density-reachable from a common point o.]
DBSCAN: Density-Based Spatial Clustering of
Applications with Noise
• Relies on a density-based notion of cluster: A cluster is defined as
a maximal set of density-connected points
• Discovers clusters of arbitrary shape in spatial databases with
noise
[Figure: core, border, and outlier (noise) points, with Eps = 1 cm and MinPts = 5.]
DBSCAN: The Algorithm
• Arbitrarily select a point p
• Retrieve all points density-reachable from p w.r.t. Eps and
MinPts
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable from p
and DBSCAN visits the next point of the database
• Continue the process until all of the points have been
processed
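A sketch of DBSCAN in practice using scikit-learn (assumed to be available; eps and min_samples correspond to Eps and MinPts, and the sample data is hypothetical):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one far-away noise point
rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0, 0), 0.3, size=(20, 2)),
               rng.normal((5, 5), 0.3, size=(20, 2)),
               [[10.0, 10.0]]])

db = DBSCAN(eps=1.0, min_samples=5).fit(X)
print(db.labels_)   # cluster index per point; -1 marks noise/outliers
```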
DBSCAN: Sensitive to Parameters
Session 7
STING
Grid-Based Clustering Method
• Using multi-resolution grid data structure
• Several interesting methods
• STING (a STatistical INformation Grid approach) by Wang,
Yang and Muntz (1997)
• WaveCluster by Sheikholeslami, Chatterjee, and Zhang
(VLDB’98)
• A multi-resolution clustering approach using wavelet method
• CLIQUE: Agrawal, et al. (SIGMOD’98)
• Both grid-based and subspace clustering
STING: A Statistical Information Grid Approach
[Figure: the STING hierarchical grid structure — the 1st (top) layer, the (i−1)-st layer, and the i-th layer; each cell at a layer is split into smaller cells at the next lower layer.]
The STING Clustering Method
• Each cell at a high level is partitioned into a number of smaller
cells in the next lower level
• Statistical info of each cell is calculated and stored beforehand
and is used to answer queries
• Parameters of higher level cells can be easily calculated from
parameters of lower level cell
• count, mean, standard deviation (s), min, max
• type of distribution—normal, uniform, etc.
• Use a top-down approach to answer spatial data queries
• Start from a pre-selected layer—typically with a small number of
cells
• For each cell in the current level compute the confidence
interval
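A small sketch of how a higher-level cell's statistics can be derived from its child cells' parameters without rescanning the data (the child statistics below are hypothetical):

```python
# Each child cell stores (count, mean, min, max), computed beforehand.
children = [
    (100, 20.0, 5.0, 40.0),   # hypothetical child cell 1
    (50, 30.0, 10.0, 60.0),   # hypothetical child cell 2
    (150, 25.0, 2.0, 55.0),   # hypothetical child cell 3
]

count = sum(c[0] for c in children)
mean = sum(c[0] * c[1] for c in children) / count   # count-weighted mean
lo = min(c[2] for c in children)
hi = max(c[3] for c in children)

print(count, round(mean, 2), lo, hi)   # parent-cell parameters
```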
STING Algorithm and Its Analysis
• Remove the irrelevant cells from further consideration
• When we finish examining the current layer, proceed to the next lower level
• Repeat this process until the bottom layer is reached
• Advantages:
• Query-independent, easy to parallelize, incremental update
• O(K), where K is the number of grid cells at the lowest level
• Disadvantages:
• All the cluster boundaries are either horizontal or vertical,
and no diagonal boundary is detected
Session 8
CLIQUE
CLIQUE (Clustering In QUEst)
• Partition the data space and find the number of points that lie
inside each cell of the partition.
• Identify the subspaces that contain clusters using the Apriori
principle
• Identify clusters
• Determine dense units in all subspaces of interest
• Determine connected dense units in all subspaces of interest
• Generate minimal description for the clusters
• Determine maximal regions that cover a cluster of connected
dense units for each cluster
• Determination of minimal cover for each cluster
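A minimal sketch of the grid-and-density step (hypothetical 2-D data over age and salary; a unit is dense if it holds at least `threshold` points, and a 2-D candidate is only checked if both of its 1-D projections are dense, in the spirit of the Apriori principle):

```python
from collections import Counter
from itertools import product

# Hypothetical (age, salary-in-thousands) records
data = [(25, 30), (27, 32), (26, 31), (45, 70), (47, 72), (46, 71), (30, 65)]
width = 10          # grid cell width for both dimensions
threshold = 3       # minimum number of points for a unit to be dense

def cell(value):
    return value // width

# 1-D dense units per dimension
dense_1d = []
for dim in range(2):
    counts = Counter(cell(row[dim]) for row in data)
    dense_1d.append({c for c, n in counts.items() if n >= threshold})

# Apriori-style pruning: only 2-D candidates whose 1-D projections are both dense
candidates = set(product(dense_1d[0], dense_1d[1]))
counts_2d = Counter((cell(a), cell(s)) for a, s in data)
dense_2d = {u for u in candidates if counts_2d[u] >= threshold}

print(dense_1d)   # dense intervals along age and salary
print(dense_2d)   # dense 2-D units, i.e., clusters in the (age, salary) subspace
```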
[Figure: CLIQUE example — the data plotted in the (age, salary) and (age, vacation) subspaces with density threshold τ = 3; the dense units found in each 2-D subspace intersect to form a candidate dense region in the (age, salary, vacation) space for ages of roughly 30–50.]
Strength and Weakness of CLIQUE
• Strength
• automatically finds subspaces of the highest dimensionality
such that high density clusters exist in those subspaces
• insensitive to the order of records in input and does not
presume some canonical data distribution
• scales linearly with the size of input and has good scalability
as the number of dimensions in the data increases
• Weakness
• The accuracy of the clustering result may be degraded at the
expense of simplicity of the method
Session 9
Evaluation of Clustering Techniques
Assessing Clustering Tendency
• Assess if non-random structure exists in the data by measuring the probability that the data is
generated by a uniform data distribution
• Test spatial randomness by a statistical test: the Hopkins statistic
• Given a dataset D regarded as a sample of a random variable o, determine how far away o is
from being uniformly distributed in the data space
• Sample n points, p1, …, pn, uniformly from D. For each pi, find its nearest neighbor in D: xi =
min{dist (pi, v)} where v in D
• Sample n points, q1, …, qn, uniformly from D. For each qi, find its nearest neighbor in D – {qi}:
yi = min{dist (qi, v)} where v in D and v ≠ qi
• Calculate the Hopkins statistic: H = Σ_{i=1}^{n} y_i / (Σ_{i=1}^{n} x_i + Σ_{i=1}^{n} y_i)
• If D is uniformly distributed, ∑ xi and ∑ yi will be close to each other and H is close to 0.5. If D
is highly skewed, H is close to 0
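A sketch of the Hopkins statistic in Python (as an assumption of this sketch, the p_i are drawn uniformly from the bounding box of the data space, which is a common way to implement the first sampling step):

```python
import numpy as np

def hopkins(D, n=20, rng=np.random.default_rng(0)):
    D = np.asarray(D, dtype=float)
    lo, hi = D.min(axis=0), D.max(axis=0)

    # p_i: points drawn uniformly from the data space (bounding box of D);
    # x_i = distance from p_i to its nearest neighbor in D
    P = rng.uniform(lo, hi, size=(n, D.shape[1]))
    x = np.array([np.linalg.norm(D - p, axis=1).min() for p in P])

    # q_i: actual data points sampled from D;
    # y_i = distance from q_i to its nearest neighbor in D - {q_i}
    idx = rng.choice(len(D), size=n, replace=False)
    y = np.array([np.sort(np.linalg.norm(D - D[i], axis=1))[1] for i in idx])

    return y.sum() / (x.sum() + y.sum())

uniform = np.random.default_rng(1).uniform(0, 10, size=(200, 2))
clustered = np.vstack([np.random.default_rng(2).normal(c, 0.2, size=(100, 2))
                       for c in [(2, 2), (8, 8)]])
print(round(hopkins(uniform), 2))    # close to 0.5 for uniformly distributed data
print(round(hopkins(clustered), 2))  # close to 0 for highly clustered (skewed) data
```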
Determine the Number of Clusters
• Empirical method
• # of clusters ≈ √(n/2) for a data set of n points
• Elbow method
• Use the turning point in the curve of sum of within cluster variance w.r.t
the # of clusters
• Cross validation method
• Divide a given data set into m parts
• Use m – 1 parts to obtain a clustering model
• Use the remaining part to test the quality of the clustering
• E.g., For each point in the test set, find the closest centroid, and use
the sum of squared distance between all points in the test set and the
closest centroids to measure how well the model fits the test set
• For any k > 0, repeat it m times, compare the overall quality measure w.r.t.
different k’s, and find # of clusters that fits the data the best
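A sketch of the elbow method using scikit-learn (hypothetical data; inertia_ is the within-cluster sum of squared distances):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in [(0, 0), (5, 0), (2, 4)]])

# Within-cluster sum of squares for k = 1..8; look for the "elbow" (turning point)
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
```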
Measuring Clustering Quality
• Two methods: extrinsic vs. intrinsic
• Extrinsic: supervised, i.e., the ground truth is available
• Compare a clustering against the ground truth using certain
clustering quality measure
• Ex. BCubed precision and recall metrics
• Intrinsic: unsupervised, i.e., the ground truth is unavailable
• Evaluate the goodness of a clustering by considering how well
the clusters are separated, and how compact the clusters are
• Ex. Silhouette coefficient
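A sketch of intrinsic evaluation with the silhouette coefficient via scikit-learn (hypothetical data; values near +1 indicate compact, well-separated clusters):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, size=(50, 2)) for c in [(0, 0), (6, 6)]])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(round(silhouette_score(X, labels), 3))   # average silhouette over all points
```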
Measuring Clustering Quality: Extrinsic Methods
• Clustering quality measure: Q(C, Cg), for a clustering C given the ground truth Cg.
• Q is good if it satisfies the following 4 essential criteria
• Cluster homogeneity: the purer, the better
• Cluster completeness: objects belonging to the same category in the ground truth should be assigned to the same cluster
• Rag bag: putting a heterogeneous object into a pure cluster should be
penalized more than putting it into a rag bag (i.e., “miscellaneous” or “other”
category)
• Small cluster preservation: splitting a small category into pieces is more
harmful than splitting a large category into pieces
References
• Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", 3rd Edition, Morgan Kaufmann Publishers, 2011.
• http://ccs1.hnue.edu.vn/hungtd/DM2012/DataMining_BOOK.pdf