CLASSIFICATION AND CLUSTERING WITH NEURAL NETWORK
Partitioning Algorithms: Basic
Concepts
• Partitioning method: Discovering the groupings in the data by
optimizing a specific objective function and iteratively improving
the quality of partitions
• K-partitioning method: Partitioning a dataset D of n objects into a
set of K clusters so that an objective function is optimized (e.g.,
the sum of squared distances is minimized, where 𝑐𝑘 is the
centroid or medoid of cluster 𝐶𝑘 )
• A typical objective function: the Sum of Squared Errors (SSE)

  SSE(C) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - c_k \rVert^2
• Problem definition: Given K, find a partition of K clusters that
optimizes the chosen partitioning criterion
• Global optimal: Needs to exhaustively enumerate all partitions
• Heuristic methods (i.e., greedy algorithms): K-Means, K-Medians, K-
Medoids, etc.
The K-Means Clustering Method
• K-Means (MacQueen’67, Lloyd’57/’82)
• Each cluster is represented by the center of the cluster
• Given K, the number of clusters, the K-Means
clustering algorithm is outlined as follows
• Select K points as initial centroids
• Repeat
• Form K clusters by assigning each point to its closest centroid
• Re-compute the centroids (i.e., mean point) of each cluster
• Until convergence criterion is satisfied
• Different kinds of measures can be used
• Manhattan distance (L1 norm), Euclidean distance (L2
norm), Cosine similarity
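The outline above translates directly into a short NumPy loop. Below is a minimal sketch (function and variable names are illustrative; empty clusters and ties are not handled) using Euclidean distance.

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Minimal K-Means sketch: X is an (n, d) array; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Select K points as initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # Form K clusters by assigning each point to its closest centroid (L2 norm)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-compute the centroids (i.e., mean point) of each cluster
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Convergence criterion: centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```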
Example: K-Means Clustering
[Figure: execution of the K-Means clustering algorithm on the original data points with K = 2 randomly selected centroids: assign points to clusters, re-compute cluster centers, redo the point assignment, and repeat until convergence.]
Discussion on the K-Means
Method
• Efficiency: O(tKn) where n: # of objects, K: # of clusters, and t: # of
iterations
• Normally, K, t << n; thus, an efficient method
• K-means clustering often terminates at a local optimum
• Initialization can be important to find high-quality clusters
• Need to specify K, the number of clusters, in advance
• There are ways to automatically determine the “best” K
• In practice, one often runs the algorithm with a range of K values and selects the “best” one
• Sensitive to noisy data and outliers
• Variations: Using K-medians, K-medoids, etc.
• K-means is applicable only to objects in a continuous n-dimensional
space
• Using the K-modes for categorical data
• Not suitable to discover clusters with non-convex shapes
• Using density-based clustering, kernel K-means, etc.
Example: Poor Initialization May Lead to Poor Clustering
[Figure: a rerun of K-Means on the same data points with another random selection of K seeds; points are assigned and cluster centers re-computed as before, but this run generates a poor-quality clustering.]
Drawback of standard K-means algorithm
• One disadvantage of the K-means algorithm is that it is sensitive to the
initialization of the centroids or the mean points. So, if a centroid is initialized to
be a “far-off” point, it might just end up with no points associated with it, and
at the same time, more than one cluster might end up linked with a single
centroid. Similarly, more than one centroid might be initialized into the same
cluster, resulting in poor clustering, as illustrated in the poor-initialization example above.
Variations of K-Means
• Choosing better initial centroid estimates
• K-means++, Intelligent K-Means, Genetic K-Means
• Choosing different representative prototypes for
the clusters
• K-Medoids, K-Medians, K-Modes
• Applying feature transformation techniques
• Weighted K-Means, Kernel K-Means
Initialization of K-Means
• Different initializations may generate rather different
clustering results (some could be far from optimal)
• Original proposal (MacQueen’67): Select K seeds randomly
• Need to run the algorithm multiple times using different seeds
• There are many methods proposed for better initialization of k seeds
• K-Means++ (Arthur & Vassilvitskii’07):
• The first centroid is selected at random
• The next centroid selected is the one that is farthest
from the currently selected (selection is based on a
weighted probability score)
• The selection continues until K centroids are obtained
K-Means++
• This algorithm ensures a smarter initialization of the centroids and
improves the quality of the clustering. Apart from initialization, the
rest of the algorithm is the same as the standard K-means algorithm;
that is, K-means++ is the standard K-means algorithm coupled with a
smarter initialization of the centroids. The steps involved are:
• Algorithm:
1. Randomly select the first centroid from the data points.
2. For each data point compute its distance from the nearest,
previously chosen centroid.
3. Select the next centroid from the data points such that the
probability of choosing a point as centroid is directly proportional
to its distance from the nearest, previously chosen centroid. (i.e.
the point having maximum distance from the nearest centroid is
most likely to be selected next as a centroid)
4. Repeat steps 2 and 3 until k centroids have been sampled
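A minimal NumPy sketch of this seeding step (names are illustrative). Note that the steps above weight by distance, while the original K-means++ paper weights by squared distance (D²); the sketch below uses the squared-distance weighting.

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """K-means++ style seeding: returns a (K, d) array of initial centroids."""
    rng = np.random.default_rng(seed)
    # 1. Randomly select the first centroid from the data points
    centroids = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # 2. For each data point, squared distance to the nearest previously chosen centroid
        d2 = np.min(((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2), axis=1)
        # 3. Sample the next centroid with probability proportional to that distance
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    # 4. Repeat until K centroids have been sampled
    return np.array(centroids)
```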
Applications of k-means++ algorithm
• Image segmentation: K-means++ can be used to segment images
into different regions based on their color or texture features. This is
useful in computer vision applications, such as object recognition or
tracking.
• Customer segmentation: K-means++ can be used to group
customers into different segments based on their purchasing habits,
demographic data, or other characteristics. This is useful in
marketing and advertising applications, as it can help businesses
target their marketing efforts more effectively.
• Anomaly detection: K-means++ can be used to identify outliers or
anomalies in a dataset. This is useful in fraud detection, network
intrusion detection, and other security applications.
• Document clustering: K-means++ can be used to group similar
documents together based on their content. This is useful in natural
language processing applications, such as text classification or
sentiment analysis.
• Recommender systems: K-means++ can be used to recommend
products or services to users based on their past purchases or
preferences. This is useful in e-commerce and online advertising
applications.
Handling Outliers: From K-Means
to K-Medoids
• The K-Means algorithm is sensitive to outliers
• An object with an extremely large value may substantially distort
the distribution of the data
• K-Medoids: Instead of taking the mean value of the objects
in a cluster as a reference point, a medoid can be used,
which is the most centrally located object in a cluster
• The K-Medoids clustering algorithm:
• Select K points as the initial representative objects (i.e.,
as initial K medoids)
• Repeat
• Assign each point to the cluster with the closest medoid
• Randomly select a non-representative object 𝑜𝑖
• Compute the total cost S of swapping the medoid m with 𝑜𝑖
• If S < 0, then swap m with 𝑜𝑖 to form the new set of medoids
• Until convergence criterion is satisfied
Example
1. Initialize: select k random points out of the n data points as the medoids.
2. Associate each data point with the closest medoid using any common distance metric.
3. While the cost decreases: for each medoid m and for each data point o which is not a medoid:
   1. Swap m and o, associate each data point with the closest medoid, and recompute the cost.
   2. If the total cost is more than that in the previous step, undo the swap.
Step 1: Let k = 2 and let the two randomly selected medoids be C1 = (4, 5) and C2 = (8, 5).
Step 2: Calculate the cost. The dissimilarity of each non-medoid point with the medoids is
calculated and tabulated. Here the Manhattan distance formula is used to compute the
distances between medoid and non-medoid points: Distance = |X1 − X2| + |Y1 − Y2|.
Each point is assigned to the cluster of the medoid whose dissimilarity is smaller.
Points 1, 2, and 5 go to cluster C1 and points 0, 3, 6, 7, 8 go to cluster C2. The cost =
(3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20.
Step 3: Randomly select one non-medoid point and recalculate the cost. Let the randomly
selected point be (8, 4). The dissimilarity of each non-medoid point with the medoids
C1 (4, 5) and C2 (8, 4) is calculated and tabulated. Each point is assigned to the cluster
whose medoid has the smaller dissimilarity. So, points 1, 2, and 5 go to cluster C1 and
points 0, 3, 6, 7, 8 go to cluster C2. The new cost = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3) = 22.
Swap cost = new cost − previous cost = 22 − 20 = 2 > 0. As the swap cost is not less than
zero, we undo the swap. Hence (4, 5) and (8, 5) remain the final medoids and the clustering
stays as above. The time complexity of each iteration is O(k · (n − k)²).
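A minimal Python sketch of the Manhattan-distance assignment cost evaluated in this example; the `points` array is a placeholder for the example's coordinates, which are not reproduced here.

```python
import numpy as np

def assignment_cost(points, medoids):
    """Total Manhattan-distance cost of assigning each point to its closest medoid."""
    # |X1 - X2| + |Y1 - Y2| for every (point, medoid) pair
    d = np.abs(points[:, None, :] - medoids[None, :, :]).sum(axis=2)
    return d.min(axis=1).sum()

# Hypothetical usage: compare the cost before and after the candidate swap in the example
# cost_before = assignment_cost(points, np.array([[4, 5], [8, 5]]))
# cost_after  = assignment_cost(points, np.array([[4, 5], [8, 4]]))
# Keep the swap only if cost_after - cost_before < 0
```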
Hierarchical Clustering: Basic
Concepts
• Hierarchical clustering
• Generate a clustering hierarchy (drawn as a dendrogram)
• Not required to specify K, the number of clusters
• More deterministic
• No iterative refinement
• Two categories of algorithms
• Agglomerative: Start with singleton clusters,
continuously merge two clusters at a time to build a
bottom-up hierarchy of clusters
• Divisive: Start with a huge macro-cluster, split it
continuously into two groups, generating a top-down
hierarchy of clusters
Agglomerative vs. Divisive Clustering
[Figure: five objects a, b, c, d, e. AGNES (agglomerative) runs bottom-up from step 0 to step 4, merging a and b into ab, d and e into de, then c with de into cde, and finally ab with cde into abcde. DIANA (divisive) runs the same steps in reverse order, top-down from abcde back to the singletons.]
Dendrogram: How Clusters are
Merged
• Dendrogram: Decompose a set of data objects into
a tree of clusters by multi-level nested partitioning
• A clustering of the data objects is obtained by
cutting the dendrogram at the desired level, then
each connected component forms a cluster

Hierarchical clustering
generates a dendrogram
(a hierarchy of clusters)
Agglomerative Clustering
Algorithm
• AGNES (AGglomerative NESting) (Kaufmann and
Rousseeuw, 1990)
• Use the single-link method and the dissimilarity matrix
• Continuously merge nodes that have the least dissimilarity
• Eventually all nodes belong to the same cluster
• Agglomerative clustering varies on different similarity measures
among clusters
• Single link (nearest neighbor)
• Complete link (diameter)
• Average link (group average)
• Centroid link (centroid similarity)
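These linkage options are available off the shelf; a minimal sketch with SciPy (the data array and parameter values are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.default_rng(0).random((20, 2))      # 20 illustrative 2-D points
Z = linkage(X, method='single')                   # 'complete', 'average', 'centroid' also work
labels = fcluster(Z, t=3, criterion='maxclust')   # cut the dendrogram into 3 clusters
# dendrogram(Z) draws the merge hierarchy (plotting requires matplotlib)
```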
Agglomerative Clustering Algorithm
[Figure: three scatter plots (axes 0–10) showing the data points being progressively merged into larger clusters by AGNES.]
Single Link vs. Complete Link in
Hierarchical Clustering
• Single link (nearest neighbor)
• The similarity between two clusters is the similarity
between their most similar (nearest neighbor) members
• Local similarity-based: Emphasizing more on close
regions, ignoring the overall structure of the cluster
• Capable of clustering non-elliptical shaped groups of
objects
• Sensitive to noise and outliers
Complete Link (Diameter)

• Complete link (diameter)


• The similarity between two clusters is the similarity between their most
dissimilar members
• Merge two clusters to form one with the smallest diameter
• Nonlocal in behavior, obtaining compact shaped clusters
• Sensitive to outliers
• At each step, the two clusters separated by the shortest
distance are combined. The definition of 'shortest distance' is
what differentiates between the different agglomerative
clustering methods. In complete-linkage clustering, the link
between two clusters contains all element pairs, and the
distance between clusters equals the distance between those
two elements (one in each cluster) that are farthest away from
each other. The shortest of these links that remains at any step
causes the fusion of the two clusters whose elements are
involved.
• Mathematically, the complete linkage function, that is, the distance
D(X, Y) between clusters X and Y, is described by the following expression:

  D(X, Y) = \max_{x \in X, \, y \in Y} d(x, y)
Average Linkage
• In average linkage, we define the distance between
two clusters to be the average distance between
data points in the first cluster and data points in the
second cluster. On the basis of this definition of
distance between clusters, at each stage of the
process we combine the two clusters that have the
smallest average linkage distance.
• At each step, the nearest two clusters, say 𝑖 and 𝑗,
are combined into a higher-level cluster 𝑖 ∪ 𝑗. Then, its
distance to another cluster 𝑘 is simply the arithmetic mean
of the average distances between members of 𝑘 and 𝑖 and of 𝑘 and 𝑗:

  d(i \cup j, k) = \frac{d(i, k) + d(j, k)}{2}
Agglomerative Clustering: Average vs. Centroid Links
(Clusters 𝐶𝑎 and 𝐶𝑏 with 𝑁𝑎 and 𝑁𝑏 objects, respectively)
• Agglomerative clustering with average link
• Average link: The average distance between an element in
one cluster and an element in the other (i.e., all pairs in two
clusters)
• Expensive to compute
• Agglomerative clustering with centroid link


• Centroid link: The distance between the centroids of two
clusters
• Group Averaged Agglomerative Clustering (GAAC)
• Let two clusters 𝐶𝑎 and 𝐶𝑏 be merged into 𝐶𝑎 ∪ 𝐶𝑏
• The new centroid is c_{a \cup b} = \frac{N_a c_a + N_b c_b}{N_a + N_b}, where 𝑁𝑎 and 𝑐𝑎 are the
cardinality and centroid of cluster 𝐶𝑎, respectively
• The similarity measure for GAAC is the average of their distances
Divisive Clustering
• DIANA (Divisive Analysis) (Kaufmann and
Rousseeuw,1990)
• Implemented in some statistical analysis packages, e.g.,
Splus
• Inverse order of AGNES: Eventually each node
forms a cluster on its own
[Figure: three scatter plots (axes 0–10) showing DIANA splitting one all-inclusive cluster into progressively smaller clusters.]
Divisive Clustering Is a Top-down
Approach
• The process starts at the root with all the points as
one cluster
• It recursively splits the higher level clusters to build
the dendrogram
• Can be considered as a global approach
• More efficient when compared with agglomerative
clustering
[Figure: scatter plots (axes 0–10) illustrating the recursive top-down splits.]
More on Algorithm Design for
Divisive Clustering
• Choosing which cluster to split
• Check the sums of squared errors of the clusters and
choose the one with the largest value
• Splitting criterion: Determining how to split
• One may use Ward’s criterion: favor the split that yields
the greatest reduction in the SSE criterion
• For categorical data, the Gini index can be used
• Handling noise
• Use a threshold to determine the termination criterion
(do not generate clusters that are too small because
they contain mainly noise)
Extensions to Hierarchical
Clustering
• Weakness of the agglomerative & divisive hierarchical
clustering methods
• No revisit: cannot undo any merge/split decisions made
before
• Scalability bottleneck: Each merge/split needs to examine
many possible options
• Time complexity: at least O(n²), where n is the total number of objects
• Several other hierarchical clustering algorithms
• BIRCH (1996): Use CF-tree and incrementally adjust the
quality of sub-clusters
• CURE (1998): Represent a cluster using a set of well-scattered
representative points
• CHAMELEON (1999): Use graph partitioning methods on the
K-nearest neighbor graph of the data
Evaluation of Clustering: Basic
Concepts
• Evaluation of clustering
• Assess the feasibility of clustering analysis on a data set
• Evaluate the quality of the results generated by a clustering
method
• Major issues on clustering assessment and validation
• Clustering tendency: assessing the suitability of clustering:
whether the data has any inherent grouping structure
• Determining the Number of Clusters: determining for a
dataset the right number of clusters that may lead to a good
quality clustering
• Clustering quality evaluation: evaluating the quality of the
clustering results
Clustering Tendency: Whether the
Data Contains Inherent Grouping
Structure
• Assess the suitability of clustering
• Whether the data has any “inherent grouping structure” — non-
random structure that may lead to meaningful clusters
• Determine clustering tendency or clusterability
• A hard task because there are so many different definitions of
clusters
• Different definitions: Partitioning, hierarchical, density-based and graph-
based
• Even fixing a type, still hard to define an appropriate null model for
a data set
• There are some clusterability assessment methods, such as
• Spatial histogram: Contrast the histogram of the data with that
generated from random samples
• Distance distribution: Compare the pairwise point distance from
the data with those from the randomly generated samples
• Hopkins Statistic: A sparse sampling test for spatial randomness
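As one concrete illustration, here is a minimal sketch of the Hopkins statistic (names and the sample size m are illustrative): sample m uniformly random points over the data's bounding box and m actual data points, compare nearest-neighbor distances, and interpret values near 0.5 as spatial randomness and values near 1 as evidence of clusterability (conventions vary across texts).

```python
import numpy as np
from scipy.spatial import cKDTree

def hopkins(X, m=50, seed=0):
    """Hopkins statistic H = sum(u) / (sum(u) + sum(w)); ~0.5 suggests spatial randomness."""
    rng = np.random.default_rng(seed)
    tree = cKDTree(X)
    # u: nearest-data-point distances for m uniform random points in the bounding box
    U = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, X.shape[1]))
    u, _ = tree.query(U, k=1)
    # w: nearest-neighbor distances for m sampled data points (skip the point itself)
    sample = X[rng.choice(len(X), size=m, replace=False)]
    d, _ = tree.query(sample, k=2)
    w = d[:, 1]
    return u.sum() / (u.sum() + w.sum())
```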
Testing Clustering Tendency: A
Spatial Histogram Approach
• Spatial Histogram Approach: Contrast the d-
dimensional histogram of the input dataset D with
the histogram generated from random samples
• Dataset D is clusterable if the distributions of two
histograms are rather different

[Figure: (a) input dataset; (b) data generated from random samples]
Testing Clustering Tendency: A
Spatial Histogram Approach
• Method outline
• Divide each dimension into equiwidth bins, count how many points lie
in each cell, and obtain the empirical joint probability mass function
(EPMF)
• Do the same for the randomly sampled data
• Compute how much they differ using the Kullback-Leibler (KL)
divergence value
• Kullback-Leibler (KL) divergence, also known as relative entropy,
is a powerful metric used to compare two probability
distributions. Let’s explore some of its practical applications in the
field of data science and machine learning: Monitoring Data
Drift, Loss Function for Neural Networks, Variational Auto-
Encoder Optimization, Generative Adversarial Networks. Also,
KL divergence is asymmetric—given two distributions, the
divergence from P to Q may not be the same as from Q to P
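A minimal sketch of this contrast (bin count, smoothing constant, and names are illustrative choices, not part of the method's definition); scipy.stats.entropy(p, q) computes the KL divergence D(p || q) between two distributions.

```python
import numpy as np
from scipy.stats import entropy

def spatial_histogram_kl(X, bins=8, seed=0):
    """KL divergence between the EPMF of the data and that of a uniform random sample."""
    rng = np.random.default_rng(seed)
    R = rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)   # random data over the same range
    edges = [np.linspace(X[:, j].min(), X[:, j].max(), bins + 1) for j in range(X.shape[1])]
    p, _ = np.histogramdd(X, bins=edges)   # equiwidth bins: counts per cell for the data
    q, _ = np.histogramdd(R, bins=edges)   # same for the randomly generated sample
    # small smoothing so that no bin has zero probability
    p = (p + 1e-9).ravel()
    q = (q + 1e-9).ravel()
    return entropy(p / p.sum(), q / q.sum())   # large value: distributions differ, data is clusterable
```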
Determining the Number of
Clusters
• The appropriate number of clusters controls the
proper granularity of cluster analysis
• Finding a good balance between compressibility and
accuracy in cluster analysis
• Two undesirable extremes
• The whole data set is one cluster: No value of clustering
• Treating each point as a cluster: No data summarization
Determining the Number of
Clusters
• The right number of clusters often depends on the
distribution's shape and scale in the data set, as
well as the clustering resolution required by the
user
• Methods for determining the number of clusters
• An empirical method
• # of clusters: k \approx \sqrt{n/2} for a dataset of n points (e.g., n =
200 gives k = 10)
• Each cluster is expected to have about \sqrt{2n} points
Finding the Number of Clusters:
the Elbow Method
• Use the turning point in the curve of the sum of
within cluster variance with respect to the # of
clusters
• Increasing the # of clusters can help reduce the sum of
within-cluster variance of each cluster
• But splitting a cohesive cluster gives only a small
reduction
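A minimal sketch of the elbow curve with scikit-learn (the data and the range of K are illustrative); the "elbow" is the K after which the within-cluster sum of squares (inertia) stops dropping sharply.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((300, 2))        # illustrative data
sse = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 11)}
for k, v in sse.items():
    print(k, round(v, 2))     # plot k vs. SSE and pick the turning point (the "elbow")
```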
Finding K, the Number of Clusters:
A Cross Validation Method
• Divide a given data set into m parts, and use m – 1
parts to obtain a clustering model
• Use the remaining part to test the quality of the
clustering
• For example, for each point in the test set, find the
closest centroid, and use the sum of squared distance
between all points in the test set and their closest
centroids to measure how well the model fits the test
set
• For any k > 0, repeat it m times, compare the
overall quality measure w.r.t. different k’s, and find
# of clusters that fits the data the best
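A minimal sketch of this m-fold procedure with scikit-learn (fold count, data, and names are illustrative): for each candidate K, fit on m − 1 folds and score by the squared distance of held-out points to their closest centroid.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold

def cv_score(X, k, m=5, seed=0):
    """Average held-out SSE for a given number of clusters k."""
    scores = []
    for train_idx, test_idx in KFold(n_splits=m, shuffle=True, random_state=seed).split(X):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[train_idx])
        # squared distance from each test point to its closest centroid
        d = np.linalg.norm(X[test_idx][:, None, :] - km.cluster_centers_[None, :, :], axis=2)
        scores.append((d.min(axis=1) ** 2).sum())
    return np.mean(scores)

# Compare cv_score(X, k) over a range of k and pick the value that fits the data best
```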
Measuring Clustering Quality
• Clustering Evaluation: Evaluating how good the clustering
results are
• No commonly recognized best suitable measure in practice
• Extrinsic vs. intrinsic methods: depending on whether
ground truth is used
• Ground truth: the ideal clustering built by using human experts
• Extrinsic: Supervised, employ criteria not inherent to the
dataset
• Compare a clustering against prior or expert-specified knowledge
(i.e., the ground truth) using certain clustering quality measure
• Intrinsic: Unsupervised, criteria derived from data itself
• Evaluate the goodness of a clustering by considering how well the
clusters are separated and how compact the clusters are (e.g.,
silhouette coefficient)
General Criteria for Measuring
Clustering Quality with Extrinsic
Methods
• Given the ground truth Cg, Q(C, Cg) is the quality
measure for a clustering C
[Figure: a ground-truth partitioning into groups G1 and G2 overlaid with clusters C1 and C2.]
• Q(C, Cg) is good if it satisfies the following four essential
criteria
• Cluster homogeneity: the purer, the better
• Cluster completeness: assign objects belonging to the same
category in the ground truth to the same cluster
• Rag bag better than alien: putting a heterogeneous object
into a pure cluster should be penalized more than putting it
into a rag bag (i.e., “miscellaneous” or “other” category)
• Small cluster preservation: splitting a small category into
pieces is more harmful than splitting a large category into
pieces
Commonly Used Extrinsic
Methods
• Matching-based methods
• Examine how well the clustering results match the ground
truth in partitioning the objects in the data set
• Information theory-based methods
• Compare the distribution of the clustering results and that of
the ground truth
• Information theory (e.g., entropy) used to quantify the
comparison
• Ex. Conditional entropy, normalized mutual information (NMI)
• Pairwise comparison-based methods
• Treat each group in the ground truth as a class, and then
check the pairwise consistency of the objects in the clustering
results
• Ex. Four possibilities: TP, FN, FP, TN; Jaccard coefficient
Matching-Based Methods
• The matching-based methods compare clusters in the
clustering results and the groups in the ground truth
• Suppose a clustering method partitions D = {o1, …, on}
into m clusters C = {C1, …, Cm}. The ground truth G
partitions D into l groups G = {G1, …, Gl}
• Purity: the extent to which cluster 𝐶𝑖 contains points only
from one (ground truth) partition
• Purity for cluster 𝐶𝑖: \frac{|C_i \cap G_j|}{|C_i|}, where 𝐺𝑗 is the matching group that
maximizes |C_i \cap G_j|
• Total purity of clustering C:

  purity = \sum_{i=1}^{m} \frac{|C_i|}{n} \max_{j=1}^{l} \frac{|C_i \cap G_j|}{|C_i|} = \frac{1}{n} \sum_{i=1}^{m} \max_{j=1}^{l} |C_i \cap G_j|
Matching-Based Methods: Example
• Consider 11 objects and two clusterings, C1 and C2, evaluated against
the ground-truth groups G1 and G2
[Figure: contingency tables of cluster vs. ground-truth counts for the two clusterings.]
• Purity for clustering C1 = 1/11 (4 + 2 + 4 + 1) = 11/11 = 1
• Purity for clustering C2 = 1/11 (2 + 3 + 1) = 6/11
• Other methods: maximum matching; F-measure
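A minimal sketch of the total purity computation from a contingency matrix; the matrix values below are illustrative placeholders, not the counts from this example.

```python
import numpy as np

def purity(contingency):
    """Total purity: (1/n) * sum over clusters of the largest overlap with any ground-truth group."""
    contingency = np.asarray(contingency)     # rows = clusters C_i, columns = groups G_j
    return contingency.max(axis=1).sum() / contingency.sum()

# Illustrative 2-cluster / 2-group contingency matrix
print(purity([[5, 1],
              [2, 3]]))                       # (5 + 3) / 11
```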
Information Theory-Based Methods (I)
Conditional Entropy
• A clustering can be regarded as a compressed
representation of a given set of objects
• The better the clustering results approach the
ground-truth, the less amount of information is
needed
• This idea leads to the use of conditional entropy

Information Theory-Based Methods (I)
Conditional Entropy
• Entropy of clustering C: H(C) = -\sum_{i=1}^{m} \frac{|C_i|}{n} \log \frac{|C_i|}{n}
• Entropy of ground truth G: H(G) = -\sum_{j=1}^{l} \frac{|G_j|}{n} \log \frac{|G_j|}{n}
• Conditional entropy of G given cluster 𝐶𝑖:

  H(G \mid C_i) = -\sum_{j=1}^{l} \frac{|C_i \cap G_j|}{|C_i|} \log \frac{|C_i \cap G_j|}{|C_i|}

• Conditional entropy of G given clustering C:

  H(G \mid C) = \sum_{i=1}^{m} \frac{|C_i|}{n} H(G \mid C_i) = -\sum_{i=1}^{m} \sum_{j=1}^{l} \frac{|C_i \cap G_j|}{n} \log \frac{|C_i \cap G_j|}{|C_i|}
Example
• Revisiting the 11-object example above: purity for clustering C1 =
1/11 (4 + 2 + 4 + 1) = 1, and purity for clustering C2 = 1/11 (2 + 3 + 1) = 6/11
• Note: conditional entropy cannot detect the issue that C1 splits the
objects of a group in G into two clusters
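A minimal sketch of H(G | C) computed from the same kind of contingency matrix (values are illustrative):

```python
import numpy as np

def conditional_entropy(contingency):
    """H(G | C) = -sum_ij (|C_i ∩ G_j| / n) * log(|C_i ∩ G_j| / |C_i|)."""
    M = np.asarray(contingency, dtype=float)        # rows = clusters, columns = ground-truth groups
    n = M.sum()
    cluster_sizes = M.sum(axis=1, keepdims=True)
    with np.errstate(divide='ignore', invalid='ignore'):
        terms = np.where(M > 0, (M / n) * np.log(M / cluster_sizes), 0.0)
    return -terms.sum()

print(conditional_entropy([[5, 1],
                           [2, 3]]))
```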
Information Theory-Based Methods
(II)
Normalized Mutual Information (NMI)
• Mutual information: I(C, G) = \sum_{i=1}^{m} \sum_{j=1}^{l} p_{ij} \log \frac{p_{ij}}{p_{C_i} \, p_{G_j}},
where p_{ij} = |C_i \cap G_j| / n, p_{C_i} = |C_i| / n, and p_{G_j} = |G_j| / n
• Quantify the amount of shared information between the clustering C and
the ground-truth partitioning G
• Measure the dependency between the observed joint probability
𝑝𝑖𝑗 of C and G, and the expected joint probability 𝑝𝐶𝑖 𝑝𝐺𝑗 under the
independence assumption
• When C and G are independent, 𝑝𝑖𝑗 = 𝑝𝐶𝑖 𝑝𝐺𝑗 , and I(C, G) = 0
• However, there is no upper bound on the mutual information
• Normalized mutual information:

  NMI(C, G) = \frac{I(C, G)}{\sqrt{H(C) \, H(G)}}
• Value range of NMI: [0,1]
• Value close to 1 indicates a good clustering
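scikit-learn provides this measure directly; a minimal sketch with illustrative label vectors:

```python
from sklearn.metrics import normalized_mutual_info_score

ground_truth = [0, 0, 0, 1, 1, 1, 2, 2]   # illustrative ground-truth group labels
clustering   = [0, 0, 1, 1, 1, 1, 2, 2]   # illustrative cluster labels
# average_method='geometric' matches the sqrt(H(C) * H(G)) normalization above
print(normalized_mutual_info_score(ground_truth, clustering, average_method='geometric'))
```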
Pairwise Comparison-Based
Methods: Jaccard Coefficient
• Pairwise comparison: treat each group in the ground truth as a class
• For each pair of objects (oi, oj) in D, if they are assigned to the same
cluster/group, the assignment is regarded as positive; otherwise,
negative
• Depending on the assignments, we have four possible cases: TP, FN, FP, TN
• Note: the total number of pairs of points is N = TP + FN + FP + TN = \binom{n}{2}
• Jaccard coefficient: Ignoring the true negatives (thus asymmetric)
• Jaccard = TP/(TP + FN + FP) [i.e., denominator ignores TN]
• Jaccard = 1 if perfect clustering
• Many other measures are based on the pairwise comparison statistics:
• Rand statistic
• Fowlkes-Mallows measure
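A minimal sketch of the pair-counting statistics and the Jaccard coefficient (the label vectors are illustrative):

```python
from itertools import combinations

def jaccard_coefficient(truth, pred):
    """Jaccard = TP / (TP + FN + FP) over all pairs of objects; TN pairs are ignored."""
    tp = fn = fp = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth, same_pred = truth[i] == truth[j], pred[i] == pred[j]
        if same_truth and same_pred:
            tp += 1                     # pair together in both ground truth and clustering
        elif same_truth and not same_pred:
            fn += 1                     # together in ground truth, separated by clustering
        elif same_pred and not same_truth:
            fp += 1                     # together in clustering, separated in ground truth
    return tp / (tp + fn + fp)

print(jaccard_coefficient([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))
```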
Intrinsic Methods (I): Dunn Index
• Intrinsic methods (i.e., no ground truth) examine how compact
clusters are and how well clusters are separated, based on
similarity/distance measure between objects
• Dunn Index:
• The compactness of clusters: the maximum distance between two
points that belong to the same cluster: \Delta = \max_{C(o_i) = C(o_j)} d(o_i, o_j)
• The degree of separation among different clusters: the minimum
distance between two points that belong to different clusters:
\delta = \min_{C(o_i) \neq C(o_j)} d(o_i, o_j)
• The Dunn index is simply the ratio DI = \delta / \Delta; the larger the ratio, the
farther apart the clusters are relative to the compactness of the clusters
• Dunn index uses the extreme distances to measure the cluster
compactness and inter-cluster separation and it can be affected
by outliers
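A minimal brute-force sketch of the Dunn index (O(n²) over all pairs; names are illustrative):

```python
import numpy as np

def dunn_index(X, labels):
    """DI = (min inter-cluster pair distance) / (max intra-cluster pair distance)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # all pairwise distances
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(X), dtype=bool)
    intra_max = D[same & off_diag].max()        # Δ: compactness (farthest pair within a cluster)
    inter_min = D[~same].min()                  # δ: separation (closest pair across clusters)
    return inter_min / intra_max
```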
Intrinsic Methods (II): Silhouette
Coefficient
• Suppose D is partitioned into k clusters: 𝐶1 , … , 𝐶𝑘
• For each object o in D (with o ∈ 𝐶𝑖), we calculate
• a(o) = \frac{\sum_{o' \in C_i, o \neq o'} dist(o, o')}{|C_i| - 1}: the average distance between o and all other
objects in the cluster to which o belongs; it reflects the compactness of
the cluster to which o belongs
• b(o) = \min_{C_j: 1 \le j \le k, j \neq i} \frac{\sum_{o' \in C_j} dist(o, o')}{|C_j|}: the minimum average distance from o to
all clusters to which o does not belong; it captures the degree to which o
is separated from other clusters
• Silhouette coefficient: s(o) = \frac{b(o) - a(o)}{\max\{a(o), b(o)\}}, value range (-1, 1)
• When s(o) approaches 1, the cluster containing o is compact
and o is far away from other clusters, which is the preferable case
• When the value is negative (i.e., b(o) < a(o)), o is closer to the objects in
another cluster than to the objects in the same cluster as o: a bad
situation to be avoided
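scikit-learn computes this coefficient directly; a minimal sketch (the data and the range of K are illustrative) that also shows its common use for choosing K:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).random((200, 2))      # illustrative data
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # mean s(o) over all objects; closer to 1 means compact, well-separated clusters
    print(k, round(silhouette_score(X, labels), 3))
```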
Neural Networks
Representation Power of Multi
Layer Sigmoid Neuron
Feed Forward Neural Networks
Learning Parameters of Feed
Forward NN
