Lec10 Clustering
Given query q and document d, weighted zone scoring assigns to the pair (q,d) a
score in the interval [0,1] by computing a linear combination of document zone
scores, where each zone contributes a value.
Consider a set of documents, each of which has l zones.
For 1 ≤ i ≤ l, let s_i be the Boolean score denoting a match (or non-match) between
q and the ith zone:
s_i = 1 if a query term occurs in zone i, and 0 otherwise.
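A minimal sketch of weighted zone scoring, for illustration only. It assumes each zone i has a weight g_i in [0,1] and that the weights sum to 1, which keeps the linear combination inside [0,1]. The function name, the whitespace tokenization, and the example zones and weights are hypothetical and are not taken from the lecture.

```python
def weighted_zone_score(query_terms, zones, weights):
    """Return sum_i g_i * s_i for one (query, document) pair.

    zones   -- list of zone texts for one document (hypothetical layout)
    weights -- list of zone weights g_i, assumed to sum to 1
    """
    if len(zones) != len(weights):
        raise ValueError("need exactly one weight per zone")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("zone weights are assumed to sum to 1")

    query = {t.lower() for t in query_terms}
    score = 0.0
    for zone_text, g in zip(zones, weights):
        zone_tokens = set(zone_text.lower().split())
        # Boolean zone score s_i: 1 if any query term occurs in the zone.
        s = 1 if query & zone_tokens else 0
        score += g * s
    return score


# Hypothetical document with three zones: title, body, author.
doc_zones = [
    "clustering in information retrieval",
    "k-means is a flat clustering algorithm",
    "anonymous",
]
zone_weights = [0.3, 0.5, 0.2]

print(weighted_zone_score(["k-means"], doc_zones, zone_weights))
print(weighted_zone_score(["clustering"], doc_zones, zone_weights))
```

With these made-up weights, a query that matches only the body scores lower than one that matches both the title and the body.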
3 Clustering in IR
General goal: put related docs in the same cluster, put unrelated docs in different
clusters.
We’ll see different ways of formalizing this.
The number of clusters should be appropriate for the data set we are clustering.
Initially, we will assume the number of clusters K is given.
Later: semiautomatic methods for determining K
Secondary goals in clustering
Avoid very small and very large clusters
Define clusters that are easy to explain to the user
Many others . . .
Flat vs. Hierarchical clustering
Flat algorithms
Usually start with a random (partial) partitioning of docs into groups
Refine iteratively
Main algorithm: K-means
Hierarchical algorithms
Create a hierarchy
Bottom-up, agglomerative
Top-down, divisive
Hard vs. Soft clustering
Exercise: (i) Guess what the optimal clustering into two clusters is in this case;
(ii) compute the centroids of the clusters
Worked Example: Random selection of initial centroids
Convergence ≠ optimality
Convergence does not mean that we converge to the optimal
clustering!
This is the great weakness of K-means.
If we start with a bad set of seeds, the resulting clustering can be horrible.
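The effect of bad seeds can be reproduced with a small sketch. The code below is a plain K-means loop (assign each point to its nearest centroid, recompute centroids, stop when assignments no longer change). The four points, the two seed choices, and all function names are made up for illustration; quality is measured by the residual sum of squares (RSS).

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, seeds, max_iters=100):
    """Run K-means from the given seeds; return (assignment, centroids, rss)."""
    centroids = list(seeds)
    assignment = None
    for _ in range(max_iters):
        # Assignment step: nearest centroid for every point.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = centroid(members)
    rss = sum(squared_distance(p, centroids[a])
              for p, a in zip(points, assignment))
    return assignment, centroids, rss


# Four points on a wide rectangle: the natural clusters are left and right.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

good = kmeans(points, seeds=[(0, 0), (4, 0)])  # one seed on each side
bad = kmeans(points, seeds=[(0, 0), (0, 1)])   # both seeds on the left

print("good seeds:", good)
print("bad seeds: ", bad)
```

Both runs converge, but the second one stops at a top/bottom split with a much larger RSS than the left/right split, which is why K-means is often restarted from several different seed sets.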
Exercise: Suboptimal clustering
Each pair is either positive or negative (the clustering puts the two documents in
the same or in different clusters) . . .
. . . and either “true” (correct) or “false” (incorrect): the clustering decision is
correct or incorrect.
Rand Index: Example
Thus, FP = 40 − 20 = 20, where 40 is the total number of positives (TP + FP) and 20 is the number of true positives (TP).
FN and TN are computed similarly.
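The pair counting can be checked with a short sketch. The cluster contents below are an assumption: they follow the usual 17-document o/⋄/x textbook example (cluster 1: five x and one o; cluster 2: one x, four o, one ⋄; cluster 3: two x and three ⋄), which is not spelled out in these notes but is consistent with FP = 40 − 20 = 20. The function names are hypothetical, and "d" stands in for ⋄.

```python
from itertools import combinations


def pair_counts(cluster_labels, class_labels):
    """Count TP, FP, FN, TN over all unordered document pairs."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(cluster_labels)), 2):
        same_cluster = cluster_labels[i] == cluster_labels[j]  # positive decision
        same_class = class_labels[i] == class_labels[j]        # pair is truly similar
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def rand_index(cluster_labels, class_labels):
    tp, fp, fn, tn = pair_counts(cluster_labels, class_labels)
    return (tp + tn) / (tp + fp + fn + tn)


# Assumed o/⋄/x example: 17 documents in three clusters.
clusters = [1] * 6 + [2] * 6 + [3] * 5
classes = list("xxxxxo") + list("xooood") + list("xxddd")

tp, fp, fn, tn = pair_counts(clusters, classes)
print(tp, fp, fn, tn)
print(round(rand_index(clusters, classes), 2))
```

Under this assumed data the counts are TP = 20, FP = 20, FN = 24, TN = 72, so the Rand index is (20 + 72) / 136 ≈ 0.68.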
Rand measure for the o/⋄/x example
All four measures range from 0 (really bad clustering) to 1 (perfect clustering).