Lec10 Clustering

Recap

Learning to rank for zone scoring

Given query q and document d, weighted zone scoring assigns to the pair (q,d) a
score in the interval [0,1] by computing a linear combination of document zone
scores, where each zone contributes a value.
Consider a set of documents, each of which has l zones

For 1 ≤ i ≤ l, let si be the Boolean score denoting a match (or non-match) between
q and the ith zone
si = 1 if a query term occurs in zone i, 0 otherwise

Learning to rank approach: learn the weights gi from training data
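As a quick illustration, here is a minimal sketch of weighted zone scoring; the three zones and the weights g_i are made-up values for the example (nonnegative weights summing to 1 keep the score in [0, 1]):

    def weighted_zone_score(s, g):
        """s: Boolean zone scores s_i (1 if a query term occurs in zone i, else 0);
        g: zone weights g_i, assumed nonnegative and summing to 1."""
        return sum(g_i * s_i for g_i, s_i in zip(g, s))

    # hypothetical zones (title, body, anchor); query term found in title and body
    print(weighted_zone_score([1, 1, 0], [0.5, 0.3, 0.2]))  # 0.8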

Summary of learning to rank approach

The problem of making a binary relevant/nonrelevant judgment is cast as a classification or regression problem, based on a training set of query-document pairs and associated relevance judgments.
In principle, any method for learning a classifier (including least squares regression) can be used to find this decision boundary.
Big advantage of learning to rank: we can avoid hand-tuning scoring functions
and simply learn them from training data.
Bottleneck of learning to rank: the cost of maintaining a representative set of
training examples whose relevance assessments must be made by humans.
LTR features used by Microsoft Research (1)

Zones: body, anchor, title, url, whole document


Features derived from standard IR models: query term number, query term ratio,
length, idf, sum of term frequency, min of term frequency, max of term frequency,
mean of term frequency, variance of term frequency, sum of length normalized
term frequency, min of length normalized term frequency, max of length
normalized term frequency, mean of length normalized term frequency, variance
of length normalized term frequency, sum of tf-idf, min of tf-idf, max of tf-idf,
mean of tf-idf, variance of tf-idf, boolean model, BM25
Language model features: LMIR.ABS, LMIR.DIR, LMIR.JM
Web-specific features: number of slashes in url, length of url, inlink number,
outlink number, PageRank, SiteRank
Spam features: QualityScore
Usage-based features: query-url click count, url click count, url dwell time
Ranking SVMs

Vector of feature differences: Φ(di, dj, q) = ψ(di, q) − ψ(dj, q)
By hypothesis, one of di and dj has been judged more relevant.
Notation: We write di ≺ dj for “di precedes dj in the results ordering”.
If di is judged more relevant than dj , then we will assign the vector Φ(di , dj , q)
the class yijq = +1; otherwise −1.
This gives us a training set of pairs of vectors and “precedence indicators”. Each
of the vectors is computed as the difference of two document-query vectors.
We can then train an SVM on this training set with the goal of obtaining a classifier that returns
w⃗ᵀΦ(di, dj, q) > 0 iff di ≺ dj
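A minimal toy sketch of this construction (assumptions: random stand-in feature vectors ψ(d, q), sklearn's LinearSVC playing the role of the SVM; nothing here is the MSR system):

    import numpy as np
    from sklearn.svm import LinearSVC

    def pairwise_training_set(psi, judged_pairs):
        # psi: (n_docs, n_features), one row psi(d, q) per document
        # judged_pairs: (i, j) pairs with document i judged more relevant than j
        X, y = [], []
        for i, j in judged_pairs:
            X.append(psi[i] - psi[j]); y.append(+1)   # Phi(di, dj, q), class +1
            X.append(psi[j] - psi[i]); y.append(-1)   # mirrored pair, class -1
        return np.array(X), np.array(y)

    psi = np.random.default_rng(0).random((4, 3))     # 4 docs, 3 toy features
    X, y = pairwise_training_set(psi, [(0, 1), (0, 2), (3, 2)])
    svm = LinearSVC(C=1.0).fit(X, y)
    # di precedes dj iff w^T Phi(di, dj, q) > 0
    print(svm.decision_function([psi[0] - psi[1]]) > 0)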
Clustering: Introduction

(Document) clustering is the process of grouping a set of documents into clusters of similar documents.
Documents within a cluster should be similar.
Documents from different clusters should be dissimilar.
Clustering is the most common form of unsupervised learning.
Unsupervised = there are no labeled or annotated data.

Classification vs. Clustering

Classification: supervised learning. Clustering: unsupervised learning.
Classification: Classes are human-defined and part of the input to the learning algorithm.
Clustering: Clusters are inferred from the data without human input.
However, there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, . . .

Clustering in IR

The cluster hypothesis


Cluster hypothesis. Documents in the same cluster behave similarly with respect
to relevance to information needs.
All applications of clustering in IR are based (directly or indirectly) on the cluster
hypothesis.
Van Rijsbergen’s original wording (1979): “closely associated documents tend to
be relevant to the same requests”.
Global navigation: Yahoo
Global navigation: MESH (upper level)

Navigational hierarchies: Manual vs. automatic creation

Note: Yahoo/MESH are not examples of clustering.


But they are well-known examples of using a global hierarchy for navigation.
Some examples for global navigation/exploration based on clustering:
Cartia
Themescapes
Google News
Desiderata for clustering

General goal: put related docs in the same cluster, put unrelated docs in different
clusters.
We’ll see different ways of formalizing this.
The number of clusters should be appropriate for the data set we are clustering.
Initially, we will assume the number of clusters K is given. Later: Semiautomatic methods
for determining K
Secondary goals in clustering:
Avoid very small and very large clusters
Define clusters that are easy to explain to the user
Many others . . .
Flat vs. Hierarchical clustering

Flat algorithms
Usually start with a random (partial) partitioning of docs into groups
Refine iteratively
Main algorithm: K-means
Hierarchical algorithms
Create a hierarchy
Bottom-up, agglomerative; top-down, divisive
Hard vs. Soft clustering

Hard clustering: Each document belongs to exactly one cluster.


More common and easier to do
Soft clustering: A document can belong to more than one cluster.
Makes more sense for applications like creating browsable hierarchies
You may want to put sneakers in two clusters:
sports apparel
shoes
You can only do that with a soft clustering approach.
This class: flat, hard clustering
Next time: hierarchical, hard clustering
Next week: latent semantic indexing, a form of soft clustering
Flat algorithms

Flat algorithms compute a partition of N documents into a set of K clusters.


Given: a set of documents and the number K
Find: a partition into K clusters that optimizes the chosen partitioning criterion
Global optimization: exhaustively enumerate partitions, pick optimal one
Not tractable
Effective heuristic method: K-means algorithm
K-means

Perhaps the best known clustering algorithm


Simple, works well in many cases
Use as default / baseline for clustering documents

Document representations in clustering


Vector space model
As in vector space classification, we measure relatedness between vectors by Euclidean distance . . .
. . . which is almost equivalent to cosine similarity. Almost: centroids are not length-normalized.
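A quick numeric check of this near-equivalence (a sketch, not part of the slides): for length-normalized vectors, ‖x⃗ − y⃗‖² = 2 − 2 cos(x⃗, y⃗), so ranking by Euclidean distance equals ranking by cosine similarity.

    import numpy as np

    # two example vectors, length-normalized
    x = np.array([3.0, 4.0]); x /= np.linalg.norm(x)
    y = np.array([1.0, 2.0]); y /= np.linalg.norm(y)

    squared_dist = np.sum((x - y) ** 2)
    cosine = float(x @ y)
    print(squared_dist, 2 - 2 * cosine)   # the two values coincide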

K-means: Basic idea


Each cluster in K-means is defined by a centroid.
Objective/partitioning criterion: minimize the average squared difference from the centroid.
Recall the definition of the centroid:
μ⃗(ω) = (1/|ω|) Σ_{x⃗∈ω} x⃗
where we use ω to denote a cluster.
We try to find the minimum average squared difference by iterating two steps:
reassignment: assign each vector to its closest centroid
recomputation: recompute each centroid as the average of the vectors that were assigned to it in reassignment

K-means pseudocode (μk is centroid of ωk)
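The pseudocode itself appeared as a figure and does not survive the extraction. The following is a minimal Python sketch of the algorithm as just described (random seed selection, then alternating reassignment and recomputation until the assignment reaches a fixed point); all names are my own.

    import numpy as np

    def kmeans(X, K, max_iter=100, rng=None):
        """X: (N, M) array of document vectors; returns (assignments, centroids)."""
        if rng is None:
            rng = np.random.default_rng(0)
        X = np.asarray(X, dtype=float)
        # random seed selection: K documents become the initial centroids mu_k
        mu = X[rng.choice(len(X), size=K, replace=False)].copy()
        assign = None
        for _ in range(max_iter):
            # reassignment: move each vector to its closest centroid
            dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
            new_assign = dists.argmin(axis=1)
            if assign is not None and np.array_equal(new_assign, assign):
                break                              # fixed point: converged
            assign = new_assign
            # recomputation: each centroid becomes the mean of its cluster
            for k in range(K):
                if np.any(assign == k):            # keep old centroid if cluster empty
                    mu[k] = X[assign == k].mean(axis=0)
        return assign, mu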


Worked Example

Exercise: (i) Guess what the optimal clustering into two clusters is in this case;
(ii) compute the centroids of the clusters
Worked Example: Random selection of initial centroids
Worked Example: Assignment (assign points to the closest centroid)
Worked Example: Recompute cluster centroids
(The two steps above repeat over several iterations; the accompanying figures do not survive the extraction.)
Worked Example: Centroids and assignments after convergence


K-means is guaranteed to converge: Proof
RSS = sum of all squared distances between each document vector and its closest centroid
RSS decreases during each reassignment step, because each vector is moved to a closer centroid.
RSS decreases during each recomputation step (see next slide).
There is only a finite number of clusterings.
Thus: We must reach a fixed point.
Assumption: Ties are broken consistently.
Finite set & monotonically decreasing → convergence
Recomputation decreases average distance

RSS_k(v⃗) = Σ_{x⃗∈ωk} ‖v⃗ − x⃗‖² = Σ_{x⃗∈ωk} Σ_{m=1..M} (v_m − x_m)²
Setting the partial derivatives to zero: ∂RSS_k(v⃗)/∂v_m = Σ_{x⃗∈ωk} 2(v_m − x_m) = 0, so
v_m = (1/|ωk|) Σ_{x⃗∈ωk} x_m
The last line is the componentwise definition of the centroid!
We minimize RSS_k when the old centroid is replaced with the new centroid. RSS, the sum of the RSS_k, must then also decrease during recomputation.
K-means is guaranteed to converge.
But we don't know how long convergence will take!
If we don't care about a few docs switching back and forth, then convergence is usually fast (< 10–20 iterations). However, complete convergence can take many more iterations.
Optimality of K-means

Convergence ≠ optimality
Convergence does not mean that we converge to the optimal clustering!
This is the great weakness of K-means.
If we start with a bad set of seeds, the resulting clustering can be horrible.
Exercise: Suboptimal clustering

What is the optimal clustering for K = 2?


Do we converge on this clustering for arbitrary seeds di, dj?
Initialization of K-means
Random seed selection is just one of many ways K-means can be initialized.
Random seed selection is not very robust: It’s easy to get a suboptimal clustering.
Better ways of computing initial centroids:
Select seeds not randomly, but using some heuristic (e.g., filter out outliers or find a set of seeds that has "good coverage" of the document space)
Use hierarchical clustering to find good seeds
Select i (e.g., i = 10) different random sets of seeds, do a K-means clustering for each, and select the clustering with the lowest RSS (sketched below)
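A minimal sketch of the last strategy, reusing the hypothetical kmeans function from the pseudocode section plus an RSS helper:

    import numpy as np

    def rss(X, assign, mu):
        # residual sum of squares: squared distance of each vector to its centroid
        return float(((X - mu[assign]) ** 2).sum())

    def kmeans_restarts(X, K, restarts=10):
        # run K-means from several random seed sets; keep the lowest-RSS clustering
        best = None
        for seed in range(restarts):
            assign, mu = kmeans(X, K, rng=np.random.default_rng(seed))
            cost = rss(X, assign, mu)
            if best is None or cost < best[0]:
                best = (cost, assign, mu)
        return best   # (rss, assignments, centroids)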
Time complexity of K-means
Computing one distance between two vectors is O(M).
Reassignment step: O(KNM) (we need to compute KN document-centroid distances)
Recomputation step: O(NM) (we need to add each of the document's < M values to one of the centroids)
Assume the number of iterations is bounded by I.
Overall complexity: O(IKNM) – linear in all important dimensions
However: This is not a real worst-case analysis.
In pathological cases, complexity can be worse than linear.
Evaluation
What is a good clustering?
Internal criteria
Example of an internal criterion: RSS in K-means
But an internal criterion often does not evaluate the actual utility of a clustering in
the application.
Alternative: External criteria
Evaluate with respect to a human-defined classification
External criteria for clustering quality
Based on a gold standard data set, e.g., the Reuters collection we also used for the evaluation of classification
Goal: Clustering should reproduce the classes in the gold standard
(But we only want to reproduce how documents are divided into groups, not the
class labels.)
First measure for how well we were able to reproduce the classes: purity
External criterion: Purity

Ω = {ω1, ω2, . . . , ωK} is the set of clusters and C = {c1, c2, . . . , cJ} is the set of classes.
For each cluster ωk: find the class cj with the most members nkj in ωk. Sum all the nkj and divide by the total number of points:
purity(Ω, C) = (1/N) Σk maxj |ωk ∩ cj|
To compute purity in the o/⋄/x example: 5 = maxj |ω1 ∩ cj| (class x, cluster 1);
4 = maxj |ω2 ∩ cj| (class o, cluster 2); and
3 = maxj |ω3 ∩ cj| (class ⋄, cluster 3). Purity is (1/17) × (5 + 4 + 3) ≈ 0.71.
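A small executable sketch of this computation; the cluster compositions are reconstructed from the counts the example gives (clusters of 6, 6, and 5 points with majorities 5, 4, and 3), with "d" standing in for ⋄:

    from collections import Counter

    def purity(clusters):
        # clusters: one list of gold-class labels per cluster
        n = sum(len(c) for c in clusters)
        majority = sum(Counter(c).most_common(1)[0][1] for c in clusters)
        return majority / n

    clusters = [["x"] * 5 + ["o"],          # cluster 1: majority class x (5)
                ["o"] * 4 + ["x", "d"],     # cluster 2: majority class o (4)
                ["d"] * 3 + ["x"] * 2]      # cluster 3: majority class d (3)
    print(purity(clusters))                 # (5 + 4 + 3) / 17 ≈ 0.71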
Another external criterion: Rand index
Purity can be increased easily by increasing K – a measure that does not have this
problem: Rand index.
Definition: RI = (TP + TN) / (TP + FP + FN + TN)
Based on a 2×2 contingency table of all pairs of documents:

                        same cluster           different clusters
    same class          true positives (TP)    false negatives (FN)
    different classes   false positives (FP)   true negatives (TN)
TP + FN + FP + TN is the total number of pairs.
TP + FN + FP + TN = N(N − 1)/2 for N documents.

Example: 17 · 16/2 = 136 in the o/⋄/x example

Each pair is either positive or negative (the clustering puts the two documents in
the same or in different clusters) . . .
. . . and either “true” (correct) or “false” (incorrect): the clustering decision is
correct or incorrect.
Rand Index: Example

As an example, we compute RI for the o/⋄/x example. We first compute TP + FP.
The three clusters contain 6, 6, and 5 points, respectively, so the total number of "positives" or pairs of documents that are in the same cluster is (writing C(n,2) = n(n − 1)/2 for the number of pairs among n points):
TP + FP = C(6,2) + C(6,2) + C(5,2) = 15 + 15 + 10 = 40
Of these, the x pairs in cluster 1, the o pairs in cluster 2, the ⋄ pairs in cluster 3, and the x pair in cluster 3 are true positives:
TP = C(5,2) + C(4,2) + C(3,2) + C(2,2) = 10 + 6 + 3 + 1 = 20
Thus, FP = 40 − 20 = 20.
FN and TN are computed similarly.
Rand measure for the o/⋄/x example

RI is then (20 + 72)/(20 + 20 + 24 + 72) = 92/136 ≈ 0.68.
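The same example as executable code (a sketch; the per-document labels are reconstructed from the cluster compositions above):

    from itertools import combinations

    def rand_index(classes, clusters):
        # count TP/FP/FN/TN over all pairs of documents
        tp = fp = fn = tn = 0
        for i, j in combinations(range(len(classes)), 2):
            same_class = classes[i] == classes[j]
            same_cluster = clusters[i] == clusters[j]
            if same_cluster and same_class:
                tp += 1
            elif same_cluster:
                fp += 1
            elif same_class:
                fn += 1
            else:
                tn += 1
        return (tp + tn) / (tp + fp + fn + tn)

    classes  = ["x"]*5 + ["o"] + ["o"]*4 + ["x", "d"] + ["d"]*3 + ["x"]*2
    clusters = [1]*6 + [2]*6 + [3]*5
    print(rand_index(classes, clusters))   # 92/136 ≈ 0.68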


Two other external evaluation measures

Normalized mutual information (NMI)
How much information does the clustering contain about the classification?
Singleton clusters (number of clusters = number of docs) have maximum MI
Therefore: normalize by the entropy of clusters and classes
F measure
Like Rand, but “precision” and “recall” can be weighted
Evaluation results for the o/⋄/x example

All four measures range from 0 (really bad clustering) to 1 (perfect clustering).

How many clusters?

Number of clusters K is given in many applications.


E.g., there may be an external constraint on K. Example: In the case of Scatter-Gather, it
was hard to show more than 10–20 clusters on a monitor in the 90s.
What if there is no external constraint? Is there a "right" number of clusters?
One way to go: define an optimization criterion
Given docs, find the K for which the optimum is reached.
What optimization criterion can we use?
We can't use RSS or average squared distance from centroid as the criterion: it always chooses K = N clusters.
Your job is to develop the clustering algorithms for a competitor to news.google.com
You want to use K-means clustering. How would you determine K?
Simple objective function for K: Basic idea

Start with 1 cluster (K = 1)


Keep adding clusters (= keep increasing K)
Add a penalty for each new cluster
Then trade off cluster penalties against average squared distance from centroid
Choose the value of K with the best tradeoff
Simple objective function for K: Formalization
Given a clustering, define the cost for a document as the (squared) distance to its centroid.
Define the total distortion RSS(K) as the sum of all individual document costs (corresponds to average distance).
Then: penalize each cluster with a cost λ.
Thus for a clustering with K clusters, the total cluster penalty is Kλ.
Define the total cost of a clustering as distortion plus total cluster penalty: RSS(K) + Kλ.
Select the K that minimizes RSS(K) + Kλ (sketched below).
Still need to determine a good value for λ . . .
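A minimal sketch of this selection rule, reusing the hypothetical kmeans and rss helpers sketched earlier; λ = 2.0 and the K range are arbitrary illustration values:

    def choose_k(X, k_max=10, lam=2.0):
        # total cost of a clustering = distortion RSS(K) + cluster penalty K * lambda
        costs = {}
        for k in range(1, k_max + 1):
            assign, mu = kmeans(X, k)
            costs[k] = rss(X, assign, mu) + k * lam
        return min(costs, key=costs.get)   # the K with the best tradeoff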
Finding the “knee” in the curve

Pick the number of clusters where the curve "flattens". Here: 4 or 9.
