Clustering Classification and Intro Neural Network
CLASSIFICATION AND
CLUSTERING WITH
NEURAL NETWORK
Partitioning Algorithms: Basic
Concepts
• Partitioning method: Discovering the groupings in the data by
optimizing a specific objective function and iteratively improving
the quality of partitions
• K-partitioning method: Partitioning a dataset D of n objects into a
set of K clusters so that an objective function is optimized (e.g.,
the sum of squared distances is minimized, where 𝑐𝑘 is the
centroid or medoid of cluster 𝐶𝑘 )
• A typical objective function: Sum of Squared Errors (SSE)
$SSE(C) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - c_k \rVert^2$
• Problem definition: Given K, find a partition of K clusters that
optimizes the chosen partitioning criterion
• Global optimum: needs to exhaustively enumerate all partitions
• Heuristic methods (i.e., greedy algorithms): K-Means, K-Medians, K-
Medoids, etc.
The K-Means Clustering Method
• K-Means (MacQueen’67, Lloyd’57/’82)
• Each cluster is represented by the center of the cluster
• Given K, the number of clusters, the K-Means
clustering algorithm is outlined as follows
• Select K points as initial centroids
• Repeat
• Form K clusters by assigning each point to its closest centroid
• Re-compute the centroids (i.e., mean point) of each cluster
• Until convergence criterion is satisfied
• Different kinds of measures can be used
• Manhattan distance (L1 norm), Euclidean distance (L2
norm), Cosine similarity
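To make the loop above concrete, here is a minimal NumPy sketch of Lloyd-style K-Means using Euclidean distance. The function name, random initialization scheme, and convergence tolerance are illustrative assumptions, not something prescribed on these slides.

```python
import numpy as np

def kmeans(X, K, max_iter=100, tol=1e-6, seed=0):
    """Minimal Lloyd-style K-Means sketch (Euclidean distance)."""
    rng = np.random.default_rng(seed)
    # Select K points as initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Form K clusters by assigning each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-compute the centroids (mean point) of each cluster
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        # Convergence criterion: centroids stop moving
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids
```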
Example: K-Means Clustering
[Figure: execution of the K-Means clustering algorithm with K = 2: the original data points with randomly selected centroids, assign points to clusters, re-compute cluster centers, redo point assignment]
Select K points as initial centroids
Repeat
• Form K clusters by assigning each point to its
closest centroid
• Re-compute the centroids (i.e., mean point) of
each cluster
Discussion on the K-Means
Method
• Efficiency: O(tKn) where n: # of objects, K: # of clusters, and t: # of
iterations
• Normally, K, t << n; thus, an efficient method
• K-Means clustering often terminates at a local optimum
• Initialization can be important for finding high-quality clusters
• Need to specify K, the number of clusters, in advance
• There are ways to automatically determine the “best” K
• In practice, one often runs the algorithm for a range of K values and selects the “best” one
• Sensitive to noisy data and outliers
• Variations: Using K-medians, K-medoids, etc.
• K-means is applicable only to objects in a continuous n-dimensional
space
• Using the K-modes for categorical data
• Not suitable to discover clusters with non-convex shapes
• Using density-based clustering, kernel K-means, etc.
Example: Poor Initialization May Lead to Poor
Clustering
[Figure: a K-Means run from poorly chosen initial centroids: assign points to clusters, re-compute cluster centers]
[Figure: hierarchical clustering of objects a, b, c, d, e; the agglomerative direction merges {a, b}, {d, e}, {c, d, e}, and finally {a, b, c, d, e} across Steps 0 to 4, while divisive (DIANA) splits in the reverse direction]
Dendrogram: How Clusters are
Merged
• Dendrogram: Decompose a set of data objects into
a tree of clusters by multi-level nested partitioning
• A clustering of the data objects is obtained by
cutting the dendrogram at the desired level, then
each connected component forms a cluster
Hierarchical clustering
generates a dendrogram
(a hierarchy of clusters)
Agglomerative Clustering
Algorithm
• AGNES (AGglomerative NESting) (Kaufmann and
Rousseeuw, 1990)
• Use the single-link method and the dissimilarity matrix
• Continuously merge nodes that have the least dissimilarity
• Eventually all nodes belong to the same cluster
• Agglomerative clustering varies on different similarity measures
among clusters
• Single link (nearest neighbor)
• Complete link (diameter)
• Average link (group average)
• Centroid link (centroid similarity)
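As a hedged sketch of these linkage options, assuming SciPy is available, `scipy.cluster.hierarchy.linkage` supports single, complete, average, and centroid linkage; the toy points below are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points (illustrative data, not from the slides)
X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.9], [9.0, 1.0]])

# Single-link (nearest neighbor) agglomerative clustering, as in AGNES
Z = linkage(X, method="single", metric="euclidean")

# Other similarity measures among clusters: "complete", "average", "centroid"
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 clusters
print(labels)
```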
Agglomerative Clustering
Algorithm
[Figure: three snapshots of agglomerative clustering merging 2-D sample points, axes 0 to 10]
Single Link vs. Complete Link in
Hierarchical Clustering
[Figure: two clusters Ca with Na objects and Cb with Nb objects]
• Agglomerative clustering with average link
• Average link: The average distance between an element in
one cluster and an element in the other (i.e., all pairs in two
clusters)
• Expensive to compute
Divisive Clustering Is a Top-down
Approach
• The process starts at the root with all the points as
one cluster
• It recursively splits the higher level clusters to build
the dendrogram
• Can be considered as a global approach
• More efficient when compared with agglomerative clustering
More on Algorithm Design for
Divisive Clustering
• Choosing which cluster to split
• Check the sums of squared errors of the clusters and
choose the one with the largest value
• Splitting criterion: Determining how to split
• One may use Ward’s criterion and choose the split that yields the greatest reduction in SSE
• For categorical data, Gini-index can be used
• Handling the noise
• Use a threshold to determine the termination criterion (do not generate clusters that are too small, because they would contain mainly noise)
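A minimal sketch of the cluster-selection rule above, in the spirit of bisecting K-Means and assuming scikit-learn is available: pick the cluster with the largest SSE and split it with 2-means. The helper names are illustrative, and the 2-means split stands in for Ward's criterion; this is not the exact algorithm from the slides.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_sse(points):
    """Sum of squared errors of one cluster around its centroid."""
    centroid = points.mean(axis=0)
    return float(((points - centroid) ** 2).sum())

def divisive_step(clusters):
    """Pick the cluster with the largest SSE and split it into two."""
    # Choose which cluster to split: the one with the largest SSE
    idx = max(range(len(clusters)), key=lambda i: cluster_sse(clusters[i]))
    worst = clusters.pop(idx)
    # Split it with 2-means (one of several possible splitting criteria)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(worst)
    clusters.extend([worst[labels == 0], worst[labels == 1]])
    return clusters
```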
Extensions to Hierarchical
Clustering
• Weakness of the agglomerative & divisive hierarchical
clustering methods
• No revisit: cannot undo any merge/split decisions made
before
• Scalability bottleneck: Each merge/split needs to examine
many possible options
• Time complexity: at least O(n²), where n is the total number of objects
• Several other hierarchical clustering algorithms
• BIRCH (1996): Use CF-tree and incrementally adjust the
quality of sub-clusters
• CURE (1998): Represent a cluster using a set of well-scattered
representative points
• CHAMELEON (1999): Use graph partitioning methods on the
K-nearest neighbor graph of the data
Evaluation of Clustering: Basic
Concepts
• Evaluation of clustering
• Assess the feasibility of clustering analysis on a data set
• Evaluate the quality of the results generated by a clustering
method
• Major issues on clustering assessment and validation
• Clustering tendency: assessing the suitability of clustering:
whether the data has any inherent grouping structure
• Determining the Number of Clusters: determining for a
dataset the right number of clusters that may lead to a good
quality clustering
• Clustering quality evaluation: evaluating the quality of the
clustering results
Clustering Tendency: Whether the
Data Contains Inherent Grouping
Structure
• Assess the suitability of clustering
• Whether the data has any “inherent grouping structure” — non-
random structure that may lead to meaningful clusters
• Determine clustering tendency or clusterability
• A hard task because there are so many different definitions of
clusters
• Different definitions: Partitioning, hierarchical, density-based and graph-
based
• Even fixing a type, still hard to define an appropriate null model for
a data set
• There are some clusterability assessment methods, such as
• Spatial histogram: Contrast the histogram of the data with that
generated from random samples
• Distance distribution: Compare the pairwise point distance from
the data with those from the randomly generated samples
• Hopkins Statistic: A sparse sampling test for spatial randomness
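As one example, here is a minimal sketch of the Hopkins statistic in one common formulation; the function name and default sample size are assumptions. With this orientation, values near 0.5 suggest spatial randomness, while values close to 1 suggest clusterable structure.

```python
import numpy as np

def hopkins_statistic(X, m=None, seed=0):
    """Sparse-sampling sketch of the Hopkins statistic (one common formulation)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or max(1, n // 10)          # sample size for the sparse test
    # Distances from m real sample points to their nearest other data point
    sample_idx = rng.choice(n, size=m, replace=False)
    w = []
    for i in sample_idx:
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf             # exclude the point itself
        w.append(dists.min())
    # Distances from m uniformly random points (in the bounding box) to the data
    lo, hi = X.min(axis=0), X.max(axis=0)
    U = rng.uniform(lo, hi, size=(m, d))
    u = [np.linalg.norm(X - p, axis=1).min() for p in U]
    # Near 0.5: spatially random; close to 1 (this orientation): clusterable
    return float(np.sum(u) / (np.sum(u) + np.sum(w)))
```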
Testing Clustering Tendency: A
Spatial Histogram Approach
• Spatial Histogram Approach: Contrast the d-
dimensional histogram of the input dataset D with
the histogram generated from random samples
• Dataset D is clusterable if the distributions of two
histograms are rather different
[Figure: (a) input dataset vs. (b) data generated from random samples]
Testing Clustering Tendency: A
Spatial Histogram Approach
• Method outline
• Divide each dimension into equiwidth bins, count how many points lie
in each cell, and obtain the empirical joint probability mass function
(EPMF)
• Do the same for the randomly sampled data
• Compute how much they differ using the Kullback-Leibler (KL)
divergence value
• Kullback-Leibler (KL) divergence, also known as relative entropy, is a widely used measure for comparing two probability distributions. Practical applications in data science and machine learning include monitoring data drift, loss functions for neural networks, variational auto-encoder optimization, and generative adversarial networks. Note that KL divergence is asymmetric: given two distributions P and Q, the divergence from P to Q is generally not the same as from Q to P
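A minimal sketch of the outline above, with illustrative function names and made-up data: build equi-width histograms for the input data and for a uniform random sample over the same bounding box, normalize them into EPMFs, and compare them with KL divergence.

```python
import numpy as np

def epmf(X, bins=10, ranges=None):
    """Empirical joint PMF over an equi-width grid (spatial histogram)."""
    hist, edges = np.histogramdd(X, bins=bins, range=ranges)
    return hist / hist.sum(), edges

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); eps avoids log(0). Note: asymmetric in p and q."""
    p = p.ravel() + eps
    q = q.ravel() + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Usage sketch: compare the data's EPMF with that of a random sample
# drawn uniformly over the same bounding box.
X = np.random.default_rng(0).normal(size=(500, 2))          # illustrative data
lo, hi = X.min(axis=0), X.max(axis=0)
R = np.random.default_rng(1).uniform(lo, hi, size=X.shape)  # random "null" sample
ranges = list(zip(lo, hi))
p, _ = epmf(X, bins=8, ranges=ranges)
q, _ = epmf(R, bins=8, ranges=ranges)
print(kl_divergence(p, q))  # larger values suggest non-random (clusterable) structure
```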
Determining the Number of
Clusters
• The appropriate number of clusters controls the
proper granularity of cluster analysis
• Finding a good balance between compressibility and
accuracy in cluster analysis
• Two undesirable extremes
• The whole data set is one cluster: No value of clustering
• Treating each point as a cluster: No data summarization
Determining the Number of
Clusters
• The right number of clusters often depends on the
distribution's shape and scale in the data set, as
well as the clustering resolution required by the
user
• Methods for determining the number of clusters
• An empirical method
• Number of clusters: $k \approx \sqrt{n/2}$ for a dataset of n points (e.g., n = 200, k = 10)
• Each cluster is expected to have about $\sqrt{2n}$ points
Finding the Number of Clusters:
the Elbow Method
• Use the turning point in the curve of the sum of
within cluster variance with respect to the # of
clusters
• Increasing the # of clusters can help reduce the sum of
within-cluster variance of each cluster
• But splitting a cohesive cluster gives only a small
reduction
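A minimal sketch of the elbow method, assuming scikit-learn is available and using made-up data; `inertia_` is scikit-learn's name for the within-cluster sum of squared distances.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))   # illustrative data

# Sum of within-cluster variance (SSE) for a range of K values
sse = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)   # within-cluster sum of squared distances

# Inspect where the curve "turns": beyond the elbow, increasing K
# gives only a small reduction in SSE.
for k, v in zip(range(1, 11), sse):
    print(k, round(v, 2))
```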
Finding K, the Number of Clusters:
A Cross Validation Method
• Divide a given data set into m parts, and use m – 1
parts to obtain a clustering model
• Use the remaining part to test the quality of the
clustering
• For example, for each point in the test set, find the
closest centroid, and use the sum of squared distance
between all points in the test set and their closest
centroids to measure how well the model fits the test
set
• For any k > 0, repeat it m times, compare the
overall quality measure w.r.t. different k’s, and find
# of clusters that fits the data the best
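A minimal sketch of this m-fold procedure, assuming scikit-learn is available; the helper name and data are illustrative. Each fold's score is the sum of squared distances from held-out points to their closest trained centroid.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold

def cv_score_for_k(X, k, m=5, seed=0):
    """Average held-out SSE to the closest trained centroid over m folds."""
    scores = []
    for train_idx, test_idx in KFold(n_splits=m, shuffle=True, random_state=seed).split(X):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[train_idx])
        # Squared distances between test points and their closest centroids
        d = np.linalg.norm(X[test_idx][:, None, :] - km.cluster_centers_[None, :, :], axis=2)
        scores.append(float((d.min(axis=1) ** 2).sum()))
    return float(np.mean(scores))

X = np.random.default_rng(0).normal(size=(300, 2))   # illustrative data
for k in range(2, 8):
    print(k, round(cv_score_for_k(X, k), 2))          # compare quality across k values
```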
Measuring Clustering Quality
• Clustering Evaluation: Evaluating how good the clustering
results are
• No commonly recognized best suitable measure in practice
• Extrinsic vs. intrinsic methods: depending on whether
ground truth is used
• Ground truth: the ideal clustering built by using human experts
• Extrinsic: Supervised, employ criteria not inherent to the
dataset
• Compare a clustering against prior or expert-specified knowledge
(i.e., the ground truth) using certain clustering quality measure
• Intrinsic: Unsupervised, criteria derived from data itself
• Evaluate the goodness of a clustering by considering how well the
clusters are separated and how compact the clusters are (e.g.,
silhouette coefficient)
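As a hedged example of an intrinsic measure, assuming scikit-learn is available and using made-up data, `silhouette_score` averages the silhouette coefficient, which trades off cluster compactness against separation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(300, 2))      # illustrative data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Intrinsic evaluation: how well separated and how compact the clusters are.
# Values range from -1 to 1; larger is better.
print(silhouette_score(X, labels))
```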
General Criteria for Measuring
Clustering Quality with Extrinsic
Methods
• Given the ground truth Cg, Q(C, Cg) is the quality measure of a clustering C with respect to Cg
[Figure: ground-truth partitioning G1, G2 vs. clusters C1, C2, C3]
• Other methods:
• maximum matching; F-measure
Information Theory-Based Methods (I)
Conditional Entropy
• A clustering can be regarded as a compressed
representation of a given set of objects
• The better the clustering results approach the
ground-truth, the less amount of information is
needed
• This idea leads to the use of conditional entropy
[Figure: ground-truth groups G1, G2 vs. clusters C1, C2, C3]
Information Theory-Based Methods (I)
Conditional Entropy
• Entropy of clustering C: $H(C) = -\sum_{i=1}^{m} \frac{|C_i|}{n} \log \frac{|C_i|}{n}$
• Entropy of ground truth G: $H(G) = -\sum_{i=1}^{l} \frac{|G_i|}{n} \log \frac{|G_i|}{n}$
• Conditional entropy of G given cluster $C_i$: $H(G \mid C_i) = -\sum_{j=1}^{l} \frac{|C_i \cap G_j|}{|C_i|} \log \frac{|C_i \cap G_j|}{|C_i|}$
• Conditional entropy of G given clustering C: $H(G \mid C) = \sum_{i=1}^{m} \frac{|C_i|}{n} H(G \mid C_i) = -\sum_{i=1}^{m} \sum_{j=1}^{l} \frac{|C_i \cap G_j|}{n} \log \frac{|C_i \cap G_j|}{|C_i|}$
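A minimal sketch of H(G | C) computed directly from label vectors, following the formulas above; the function name and example labels are illustrative, and the natural logarithm is used.

```python
import numpy as np

def conditional_entropy(labels_c, labels_g):
    """H(G | C) from cluster labels C and ground-truth labels G."""
    labels_c, labels_g = np.asarray(labels_c), np.asarray(labels_g)
    n = len(labels_c)
    h = 0.0
    for c in np.unique(labels_c):
        mask = labels_c == c
        for g in np.unique(labels_g[mask]):
            n_cg = np.sum(mask & (labels_g == g))        # |C_i ∩ G_j|
            h -= (n_cg / n) * np.log(n_cg / mask.sum())  # -(|C_i∩G_j|/n) log(|C_i∩G_j|/|C_i|)
    return float(h)

# Usage sketch with made-up labels
print(conditional_entropy([0, 0, 1, 1, 2, 2], [0, 0, 0, 1, 1, 1]))
```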
Example
• Consider 11 objects
[Figure: ground-truth groups G1, G2 vs. clusters C1, C2, C3 over the 11 objects]
Note: conditional entropy cannot detect the issue that C1 splits the objects in G into two clusters
Information Theory-Based Methods
(II)
Normalized Mutual Information (NMI)
• Mutual information: $I(C, G) = \sum_{i=1}^{r} \sum_{j=1}^{k} p_{ij} \log \frac{p_{ij}}{p_{C_i} p_{G_j}}$
• Quantify the amount of shared info between the clustering C and
the ground-truth partitioning G
• Measure the dependency between the observed joint probability
𝑝𝑖𝑗 of C and G, and the expected joint probability 𝑝𝐶𝑖 𝑝𝐺𝑗 under the
independence assumption
• When C and G are independent, 𝑝𝑖𝑗 = 𝑝𝐶𝑖 𝑝𝐺𝑗 , I(C, G) = 0
• However, there is no upper bound on the mutual information
• Normalized mutual information: $NMI(C, G) = \sqrt{\frac{I(C,G)}{H(C)} \cdot \frac{I(C,G)}{H(G)}} = \frac{I(C,G)}{\sqrt{H(C)\, H(G)}}$
• Value range of NMI: [0,1]
• Value close to 1 indicates a good clustering
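A minimal sketch of NMI from label vectors, following the definitions above; the function name and example labels are illustrative. As a cross-check, scikit-learn's `normalized_mutual_info_score` with `average_method="geometric"` uses the same normalization.

```python
import numpy as np

def nmi(labels_c, labels_g):
    """NMI(C, G) = I(C, G) / sqrt(H(C) * H(G)), computed from label vectors."""
    labels_c, labels_g = np.asarray(labels_c), np.asarray(labels_g)
    n = len(labels_c)
    cs, gs = np.unique(labels_c), np.unique(labels_g)
    # Marginal probabilities of clusters and ground-truth groups
    p_c = np.array([(labels_c == c).mean() for c in cs])
    p_g = np.array([(labels_g == g).mean() for g in gs])
    mi = 0.0
    for i, c in enumerate(cs):
        for j, g in enumerate(gs):
            p_ij = np.sum((labels_c == c) & (labels_g == g)) / n
            if p_ij > 0:
                mi += p_ij * np.log(p_ij / (p_c[i] * p_g[j]))  # I(C, G) term
    h_c = -np.sum(p_c * np.log(p_c))
    h_g = -np.sum(p_g * np.log(p_g))
    return float(mi / np.sqrt(h_c * h_g))

print(nmi([0, 0, 1, 1, 2, 2], [0, 0, 0, 1, 1, 1]))  # value in [0, 1]; closer to 1 is better
```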
Pairwise Comparison-Based
Methods: Jaccard Coefficient
• Pairwise comparison: treat each group in the ground truth as a class
• For each pair of objects (oi, oj) in D, if they are assigned to the same
cluster/group, the assignment is regarded as positive; otherwise,
negative
• Depending on assignments, we have four possible cases: