Lec10 Clustering

Recap

Learning to rank for zone scoring

Given query q and document d, weighted zone scoring assigns to the pair (q,d) a
score in the interval [0,1] by computing a linear combination of document zone
scores, where each zone contributes a value.
Consider a set of documents, each of which has l zones

For 1 ≤ i ≤ l, let si be the Boolean score denoting a match (or non-match) between
q and the ith zone
si = 1 if a query term occurs in zone i, 0 otherwise

Learning to rank approach: learn the weights gi from training data
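As a quick illustration, here is a minimal sketch of weighted zone scoring; the three zones and the weights g_i are made-up values for the example (nonnegative weights summing to 1 keep the score in [0, 1]):

    def weighted_zone_score(s, g):
        """s: Boolean zone scores s_i (1 if a query term occurs in zone i, else 0);
        g: zone weights g_i, assumed nonnegative and summing to 1."""
        return sum(g_i * s_i for g_i, s_i in zip(g, s))

    # hypothetical zones (title, body, anchor); query term found in title and body
    print(weighted_zone_score([1, 1, 0], [0.5, 0.3, 0.2]))  # 0.8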

Summary of learning to rank approach

The problem of making a binary relevant/nonrelevant judgment is cast as a classification or regression problem, based on a training set of query-document pairs and associated relevance judgments.
In principle, any method for learning a classifier (including least squares regression) can be used to find this decision boundary.
Big advantage of learning to rank: we can avoid hand-tuning scoring functions
and simply learn them from training data.
Bottleneck of learning to rank: the cost of maintaining a representative set of
training examples whose relevance assessments must be made by humans.
LTR features used by Microsoft Research (1)

Zones: body, anchor, title, url, whole document


Features derived from standard IR models: query term number, query term ratio,
length, idf, sum of term frequency, min of term frequency, max of term frequency,
mean of term frequency, variance of term frequency, sum of length normalized
term frequency, min of length normalized term frequency, max of length
normalized term frequency, mean of length normalized term frequency, variance
of length normalized term frequency, sum of tf-idf, min of tf-idf, max of tf-idf,
mean of tf-idf, variance of tf-idf, boolean model, BM25
Language model features: LMIR.ABS, LMIR.DIR, LMIR.JM
Web-specific features: number of slashes in url, length of url, inlink number,
outlink number, PageRank, SiteRank
Spam features: QualityScore
Usage-based features: query-url click count, url click count, url dwell time
Ranking SVMs

Vector of feature differences: Φ(di, dj, q) = ψ(di, q) − ψ(dj, q)
By hypothesis, one of di and dj has been judged more relevant.
Notation: We write di ≺ dj for “di precedes dj in the results ordering”.
If di is judged more relevant than dj , then we will assign the vector Φ(di , dj , q)
the class yijq = +1; otherwise −1.
This gives us a training set of pairs of vectors and “precedence indicators”. Each
of the vectors is computed as the difference of two document-query vectors.
We can then train an SVM on this training set with the goal of obtaining a classifier that returns
w⃗ᵀΦ(di, dj, q) > 0 iff di ≺ dj
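A minimal toy sketch of this construction (assumptions: random stand-in feature vectors ψ(d, q), sklearn's LinearSVC playing the role of the SVM; nothing here is the MSR system):

    import numpy as np
    from sklearn.svm import LinearSVC

    def pairwise_training_set(psi, judged_pairs):
        # psi: (n_docs, n_features), one row psi(d, q) per document
        # judged_pairs: (i, j) pairs with document i judged more relevant than j
        X, y = [], []
        for i, j in judged_pairs:
            X.append(psi[i] - psi[j]); y.append(+1)   # Phi(di, dj, q), class +1
            X.append(psi[j] - psi[i]); y.append(-1)   # mirrored pair, class -1
        return np.array(X), np.array(y)

    psi = np.random.default_rng(0).random((4, 3))     # 4 docs, 3 toy features
    X, y = pairwise_training_set(psi, [(0, 1), (0, 2), (3, 2)])
    svm = LinearSVC(C=1.0).fit(X, y)
    # di precedes dj iff w^T Phi(di, dj, q) > 0
    print(svm.decision_function([psi[0] - psi[1]]) > 0)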
Clustering: Introduction

(Document) clustering is the process of grouping a set of documents into clusters of similar documents.
Documents within a cluster should be similar.
Documents from different clusters should be dissimilar.
Clustering is the most common form of unsupervised learning.
Unsupervised = there are no labeled or annotated data.

Classification vs. Clustering

Classification: supervised learning. Clustering: unsupervised learning.
Classification: Classes are human-defined and part of the input to the learning algorithm.
Clustering: Clusters are inferred from the data without human input.
However, there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, . . .

Clustering in IR

The cluster hypothesis


Cluster hypothesis. Documents in the same cluster behave similarly with respect
to relevance to information needs.
All applications of clustering in IR are based (directly or indirectly) on the cluster
hypothesis.
Van Rijsbergen’s original wording (1979): “closely associated documents tend to
be relevant to the same requests”.
Global navigation: Yahoo
Global navigation: MESH (upper level)

Navigational hierarchies: Manual vs. automatic creation

Note: Yahoo/MESH are not examples of clustering.


But they are well-known examples of using a global hierarchy for navigation.
Some examples for global navigation/exploration based on clustering:
Cartia
Themescapes
Google News
Desiderata for clustering

General goal: put related docs in the same cluster, put unrelated docs in different
clusters.
We’ll see different ways of formalizing this.
The number of clusters should be appropriate for the data set we are clustering.
Initially, we will assume the number of clusters K is given. Later: Semiautomatic methods
for determining K
Secondary goals in clustering:
Avoid very small and very large clusters
Define clusters that are easy to explain to the user
Many others . . .
Flat vs. Hierarchical clustering

Flat algorithms
Usually start with a random (partial) partitioning of docs into groups
Refine iteratively
Main algorithm: K-means
Hierarchical algorithms
Create a hierarchy
Bottom-up, agglomerative; top-down, divisive
Hard vs. Soft clustering

Hard clustering: Each document belongs to exactly one cluster.


More common and easier to do
Soft clustering: A document can belong to more than one cluster.
Makes more sense for applications like creating browsable hierarchies
You may want to put sneakers in two clusters:
sports apparel
shoes
You can only do that with a soft clustering approach.
This class: flat, hard clustering
Next time: hierarchical, hard clustering
Next week: latent semantic indexing, a form of soft clustering
Flat algorithms

Flat algorithms compute a partition of N documents into a set of K clusters.


Given: a set of documents and the number K
Find: a partition into K clusters that optimizes the chosen partitioning criterion
Global optimization: exhaustively enumerate partitions, pick optimal one
Not tractable
Effective heuristic method: K-means algorithm
K-means

Perhaps the best known clustering algorithm


Simple, works well in many cases
Use as default / baseline for clustering documents

Document representations in clustering


Vector space model
As in vector space classification, we measure relatedness between vectors by Euclidean distance . . .
. . . which is almost equivalent to cosine similarity. Almost: centroids are not length-normalized.
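A quick numeric check of this near-equivalence (a sketch, not part of the slides): for length-normalized vectors, ‖x⃗ − y⃗‖² = 2 − 2 cos(x⃗, y⃗), so ranking by Euclidean distance equals ranking by cosine similarity.

    import numpy as np

    # two example vectors, length-normalized
    x = np.array([3.0, 4.0]); x /= np.linalg.norm(x)
    y = np.array([1.0, 2.0]); y /= np.linalg.norm(y)

    squared_dist = np.sum((x - y) ** 2)
    cosine = float(x @ y)
    print(squared_dist, 2 - 2 * cosine)   # the two values coincide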

K-means: Basic idea


Each cluster in K-means is defined by a centroid.
Objective/partitioning criterion: minimize the average squared difference from the centroid.
Recall the definition of the centroid:
μ⃗(ω) = (1/|ω|) Σ_{x⃗∈ω} x⃗
where we use ω to denote a cluster.
We try to find the minimum average squared difference by iterating two steps:
reassignment: assign each vector to its closest centroid
recomputation: recompute each centroid as the average of the vectors that were assigned to it in reassignment

K-means pseudocode (μk is centroid of ωk)
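The pseudocode itself appeared as a figure and does not survive the extraction. The following is a minimal Python sketch of the algorithm as just described (random seed selection, then alternating reassignment and recomputation until the assignment reaches a fixed point); all names are my own.

    import numpy as np

    def kmeans(X, K, max_iter=100, rng=None):
        """X: (N, M) array of document vectors; returns (assignments, centroids)."""
        if rng is None:
            rng = np.random.default_rng(0)
        X = np.asarray(X, dtype=float)
        # random seed selection: K documents become the initial centroids mu_k
        mu = X[rng.choice(len(X), size=K, replace=False)].copy()
        assign = None
        for _ in range(max_iter):
            # reassignment: move each vector to its closest centroid
            dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
            new_assign = dists.argmin(axis=1)
            if assign is not None and np.array_equal(new_assign, assign):
                break                              # fixed point: converged
            assign = new_assign
            # recomputation: each centroid becomes the mean of its cluster
            for k in range(K):
                if np.any(assign == k):            # keep old centroid if cluster empty
                    mu[k] = X[assign == k].mean(axis=0)
        return assign, mu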


Worked Example

Exercise: (i) Guess what the optimal clustering into two clusters is in this case;
(ii) compute the centroids of the clusters
Worked Example: Random selection of initial centroids
Worked Example: Assignment (assign points to the closest centroid)
Worked Example: Recompute cluster centroids
(The two steps above repeat over several iterations; the accompanying figures do not survive the extraction.)
Worked Example: Centroids and assignments after convergence


K-means is guaranteed to converge: Proof
RSS = sum of all squared distances between each document vector and its closest centroid
RSS decreases during each reassignment step, because each vector is moved to a closer centroid.
RSS decreases during each recomputation step (see next slide).
There is only a finite number of clusterings.
Thus: We must reach a fixed point.
Assumption: Ties are broken consistently.
Finite set & monotonically decreasing → convergence
Recomputation decreases average distance

RSS_k(v⃗) = Σ_{x⃗∈ωk} ‖v⃗ − x⃗‖² = Σ_{x⃗∈ωk} Σ_{m=1..M} (v_m − x_m)²
Setting the partial derivatives to zero: ∂RSS_k(v⃗)/∂v_m = Σ_{x⃗∈ωk} 2(v_m − x_m) = 0, so
v_m = (1/|ωk|) Σ_{x⃗∈ωk} x_m
The last line is the componentwise definition of the centroid!
We minimize RSS_k when the old centroid is replaced with the new centroid. RSS, the sum of the RSS_k, must then also decrease during recomputation.
K-means is guaranteed to converge.
But we don't know how long convergence will take!
If we don't care about a few docs switching back and forth, then convergence is usually fast (< 10–20 iterations). However, complete convergence can take many more iterations.
Optimality of K-means

Convergence ≠ optimality
Convergence does not mean that we converge to the optimal clustering!
This is the great weakness of K-means.
If we start with a bad set of seeds, the resulting clustering can be horrible.
Exercise: Suboptimal clustering

What is the optimal clustering for K = 2?


Do we converge on this clustering for arbitrary seeds di, dj?
Initialization of K-means
Random seed selection is just one of many ways K-means can be initialized.
Random seed selection is not very robust: It’s easy to get a suboptimal clustering.
Better ways of computing initial centroids:
Select seeds not randomly, but using some heuristic (e.g., filter out outliers or find a set of seeds that has "good coverage" of the document space)
Use hierarchical clustering to find good seeds
Select i (e.g., i = 10) different random sets of seeds, do a K-means clustering for each, and select the clustering with the lowest RSS (sketched below)
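A minimal sketch of the last strategy, reusing the hypothetical kmeans function from the pseudocode section plus an RSS helper:

    import numpy as np

    def rss(X, assign, mu):
        # residual sum of squares: squared distance of each vector to its centroid
        return float(((X - mu[assign]) ** 2).sum())

    def kmeans_restarts(X, K, restarts=10):
        # run K-means from several random seed sets; keep the lowest-RSS clustering
        best = None
        for seed in range(restarts):
            assign, mu = kmeans(X, K, rng=np.random.default_rng(seed))
            cost = rss(X, assign, mu)
            if best is None or cost < best[0]:
                best = (cost, assign, mu)
        return best   # (rss, assignments, centroids)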
Time complexity of K-means
Computing one distance between two vectors is O(M).
Reassignment step: O(KNM) (we need to compute KN document-centroid distances)
Recomputation step: O(NM) (we need to add each of the document's < M values to one of the centroids)
Assume the number of iterations is bounded by I.
Overall complexity: O(IKNM) – linear in all important dimensions
However: This is not a real worst-case analysis.
In pathological cases, complexity can be worse than linear.
Evaluation
What is a good clustering?
Internal criteria
Example of an internal criterion: RSS in K-means
But an internal criterion often does not evaluate the actual utility of a clustering in
the application.
Alternative: External criteria
Evaluate with respect to a human-defined classification
External criteria for clustering quality
Based on a gold standard data set, e.g., the Reuters collection we also used for the evaluation of classification
Goal: Clustering should reproduce the classes in the gold standard
(But we only want to reproduce how documents are divided into groups, not the
class labels.)
First measure for how well we were able to reproduce the classes: purity
External criterion: Purity

Ω = {ω1, ω2, . . . , ωK} is the set of clusters and C = {c1, c2, . . . , cJ} is the set of classes.
For each cluster ωk: find the class cj with the most members nkj in ωk. Sum all the nkj and divide by the total number of points:
purity(Ω, C) = (1/N) Σk maxj |ωk ∩ cj|
To compute purity in the o/⋄/x example: 5 = maxj |ω1 ∩ cj| (class x, cluster 1);
4 = maxj |ω2 ∩ cj| (class o, cluster 2); and
3 = maxj |ω3 ∩ cj| (class ⋄, cluster 3). Purity is (1/17) × (5 + 4 + 3) ≈ 0.71.
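A small executable sketch of this computation; the cluster compositions are reconstructed from the counts the example gives (clusters of 6, 6, and 5 points with majorities 5, 4, and 3), with "d" standing in for ⋄:

    from collections import Counter

    def purity(clusters):
        # clusters: one list of gold-class labels per cluster
        n = sum(len(c) for c in clusters)
        majority = sum(Counter(c).most_common(1)[0][1] for c in clusters)
        return majority / n

    clusters = [["x"] * 5 + ["o"],          # cluster 1: majority class x (5)
                ["o"] * 4 + ["x", "d"],     # cluster 2: majority class o (4)
                ["d"] * 3 + ["x"] * 2]      # cluster 3: majority class d (3)
    print(purity(clusters))                 # (5 + 4 + 3) / 17 ≈ 0.71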
Another external criterion: Rand index
Purity can be increased easily by increasing K – a measure that does not have this
problem: Rand index.
Definition: RI = (TP + TN) / (TP + FP + FN + TN)
Based on a 2×2 contingency table of all pairs of documents:

                        same cluster           different clusters
    same class          true positives (TP)    false negatives (FN)
    different classes   false positives (FP)   true negatives (TN)
TP + FN + FP + TN is the total number of pairs.
TP + FN + FP + TN = N(N − 1)/2 for N documents.

Example: 17 · 16/2 = 136 in the o/⋄/x example

Each pair is either positive or negative (the clustering puts the two documents in
the same or in different clusters) . . .
. . . and either “true” (correct) or “false” (incorrect): the clustering decision is
correct or incorrect.
Rand Index: Example

As an example, we compute RI for the o/⋄/x example. We first compute TP + FP.
The three clusters contain 6, 6, and 5 points, respectively, so the total number of "positives" or pairs of documents that are in the same cluster is (writing C(n,2) = n(n − 1)/2 for the number of pairs among n points):
TP + FP = C(6,2) + C(6,2) + C(5,2) = 15 + 15 + 10 = 40
Of these, the x pairs in cluster 1, the o pairs in cluster 2, the ⋄ pairs in cluster 3, and the x pair in cluster 3 are true positives:
TP = C(5,2) + C(4,2) + C(3,2) + C(2,2) = 10 + 6 + 3 + 1 = 20
Thus, FP = 40 − 20 = 20.
FN and TN are computed similarly.
Rand measure for the o/⋄/x example

RI is then (20 + 72)/(20 + 20 + 24 + 72) = 92/136 ≈ 0.68.
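The same example as executable code (a sketch; the per-document labels are reconstructed from the cluster compositions above):

    from itertools import combinations

    def rand_index(classes, clusters):
        # count TP/FP/FN/TN over all pairs of documents
        tp = fp = fn = tn = 0
        for i, j in combinations(range(len(classes)), 2):
            same_class = classes[i] == classes[j]
            same_cluster = clusters[i] == clusters[j]
            if same_cluster and same_class:
                tp += 1
            elif same_cluster:
                fp += 1
            elif same_class:
                fn += 1
            else:
                tn += 1
        return (tp + tn) / (tp + fp + fn + tn)

    classes  = ["x"]*5 + ["o"] + ["o"]*4 + ["x", "d"] + ["d"]*3 + ["x"]*2
    clusters = [1]*6 + [2]*6 + [3]*5
    print(rand_index(classes, clusters))   # 92/136 ≈ 0.68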


Two other external evaluation measures

Normalized mutual information (NMI)
How much information does the clustering contain about the classification?
Singleton clusters (number of clusters = number of docs) have maximum MI
Therefore: normalize by the entropy of clusters and classes
F measure
Like Rand, but “precision” and “recall” can be weighted
Evaluation results for the o/⋄/x example

All four measures range from 0 (really bad clustering) to 1 (perfect clustering).

How many clusters?

Number of clusters K is given in many applications.


E.g., there may be an external constraint on K. Example: In the case of Scatter-Gather, it
was hard to show more than 10–20 clusters on a monitor in the 90s.
What if there is no external constraint? Is there a "right" number of clusters?
One way to go: define an optimization criterion
Given docs, find the K for which the optimum is reached.
What optimization criterion can we use?
We can't use RSS or average squared distance from centroid as the criterion: it always chooses K = N clusters.
Your job is to develop the clustering algorithms for a competitor to news.google.com
You want to use K-means clustering. How would you determine K?
Simple objective function for K: Basic idea

Start with 1 cluster (K = 1)


Keep adding clusters (= keep increasing K)
Add a penalty for each new cluster
Then trade off cluster penalties against average squared distance from centroid
Choose the value of K with the best tradeoff
Simple objective function for K: Formalization
Given a clustering, define the cost for a document as the (squared) distance to its centroid.
Define the total distortion RSS(K) as the sum of all individual document costs (corresponds to average distance).
Then: penalize each cluster with a cost λ.
Thus for a clustering with K clusters, the total cluster penalty is Kλ.
Define the total cost of a clustering as distortion plus total cluster penalty: RSS(K) + Kλ.
Select the K that minimizes RSS(K) + Kλ (sketched below).
Still need to determine a good value for λ . . .
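A minimal sketch of this selection rule, reusing the hypothetical kmeans and rss helpers sketched earlier; λ = 2.0 and the K range are arbitrary illustration values:

    def choose_k(X, k_max=10, lam=2.0):
        # total cost of a clustering = distortion RSS(K) + cluster penalty K * lambda
        costs = {}
        for k in range(1, k_max + 1):
            assign, mu = kmeans(X, k)
            costs[k] = rss(X, assign, mu) + k * lam
        return min(costs, key=costs.get)   # the K with the best tradeoff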
Finding the “knee” in the curve

Pick the number of clusters where the curve "flattens". Here: 4 or 9.
