Relevance Feedback & Query Expansion
SEEM5680
Relevance Feedback

Similar pages
Relevance Feedback: Example

Results for Initial Query

Relevance Feedback

Results after Relevance Feedback
Ad hoc results for query canine
source: Fernando Diaz

User feedback: Select what is relevant
source: Fernando Diaz

Results after relevance feedback
source: Fernando Diaz
Initial query/results
Initial query: New space satellite applications ("+" marks documents the user judged relevant)
+ 1. 0.539, 08/13/91, NASA Hasn't Scrapped Imaging Spectrometer
+ 2. 0.533, 07/09/91, NASA Scratches Environment Gear From Satellite Plan
3. 0.528, 04/04/90, Science Panel Backs NASA Satellite Plan, But Urges Launches of Smaller Probes
4. 0.526, 09/09/91, A NASA Satellite Project Accomplishes Incredible Feat: Staying Within Budget
5. 0.525, 07/24/90, Scientist Who Exposed Global Warming Proposes Satellites for Climate Research
6. 0.524, 08/22/90, Report Provides Support for the Critics Of Using Big Satellites to Study Climate
7. 0.516, 04/13/87, Arianespace Receives Satellite Launch Pact From Telesat Canada
+ 8. 0.509, 12/02/87, Telecommunications Tale of Two Companies
Key concept: Centroid
The centroid is the center of mass of a set of points.
Recall that we represent documents as points in a high-dimensional space.
Definition:
    μ(C) = (1/|C|) Σ_{d ∈ C} d
where C is a set of documents.
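The centroid definition above can be sketched in a few lines of Python; the document vectors here are made-up toy weights, not from any real collection.

```python
def centroid(docs):
    """Component-wise mean of a set of document vectors."""
    n, dims = len(docs), len(docs[0])
    return [sum(d[i] for d in docs) / n for i in range(dims)]

# Toy 3-dimensional document vectors (hypothetical term weights).
C = [[1.0, 0.0, 2.0],
     [3.0, 0.0, 0.0],
     [2.0, 3.0, 1.0]]
print(centroid(C))  # [2.0, 1.0, 1.0]
```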
Rocchio Algorithm
The Rocchio algorithm uses the vector space model to pick a relevance feedback query.
Rocchio seeks the query q_opt that maximizes
    q_opt = argmax_q [cos(q, μ(C_r)) − cos(q, μ(C_nr))]
[Figure: the optimal query separates the relevant documents (o) from the non-relevant documents (x)]
Rocchio Algorithm (SMART)
Used in practice:
    q_m = α q_0 + β (1/|D_r|) Σ_{d_j ∈ D_r} d_j − γ (1/|D_nr|) Σ_{d_j ∈ D_nr} d_j
where D_r and D_nr are the sets of known relevant and known non-relevant documents, and α, β, γ are weights attached to each component.
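A minimal sketch of the SMART formula in Python, assuming the commonly used weights α = 1.0, β = 0.75, γ = 0.15; negative resulting weights are clipped to zero, as is common in practice.

```python
def centroid(docs, dims):
    """Component-wise mean of a set of document vectors."""
    if not docs:
        return [0.0] * dims
    return [sum(d[i] for d in docs) / len(docs) for i in range(dims)]

def rocchio(q0, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    """q_m = alpha*q0 + beta*centroid(rel) - gamma*centroid(nonrel)."""
    cr = centroid(rel, len(q0))
    cnr = centroid(nonrel, len(q0))
    qm = [alpha * q0[i] + beta * cr[i] - gamma * cnr[i]
          for i in range(len(q0))]
    return [max(0.0, w) for w in qm]  # negative term weights make little sense

# Toy example: one known relevant and one known non-relevant document.
print(rocchio([1.0, 0.0, 0.0], [[0.0, 1.0, 0.0]], [[0.0, 0.0, 1.0]]))
# [1.0, 0.75, 0.0]
```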
Relevance feedback on initial query
[Figure: the initial query is moved toward the known relevant documents (o) and away from the known non-relevant documents (x), giving the revised query]
Relevance Feedback in vector spaces
Positive vs Negative Feedback
Relevance Feedback: Assumptions
Pseudo relevance feedback
Pseudo-relevance feedback automates the
“manual” part of true relevance feedback.
Pseudo-relevance algorithm:
Retrieve a ranked list of hits for the user’s query
Assume that the top k documents are relevant.
Do relevance feedback (e.g., Rocchio)
Works very well on average
But can go horribly wrong for some queries.
Several iterations can cause query drift.
Why?
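The pseudo-relevance algorithm above can be sketched as one Rocchio round with no non-relevant set (γ = 0); the vectors and the choice k = 2 are illustrative only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def prf(q0, docs, k=2, alpha=1.0, beta=0.75):
    """One round of pseudo-relevance feedback: rank by cosine,
    assume the top k hits are relevant, move the query toward
    their centroid (Rocchio with gamma = 0)."""
    ranked = sorted(docs, key=lambda d: cosine(q0, d), reverse=True)
    top = ranked[:k]
    cr = [sum(d[i] for d in top) / k for i in range(len(q0))]
    return [alpha * q0[i] + beta * cr[i] for i in range(len(q0))]

# The revised query picks up weight on dimension 1 from the top-ranked hits.
print(prf([1.0, 0.0], [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]))
# [1.75, 0.375]
```

If a top-k document is in fact off topic, each further iteration pulls the query toward it, which is the query drift mentioned above.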
Query Expansion
Query assist
How do we augment the user query?
Manual thesaurus
  E.g. MedLine: physician, syn: doc, doctor, MD, medico
  Can add related terms, not just synonyms
Global Analysis: (static; of all documents in collection)
  Automatically derived thesaurus (co-occurrence statistics)
  Refinements based on query log mining
    Common on the web
Local Analysis: (dynamic)
  Analysis of documents in result set
Example of manual thesaurus
Thesaurus-based query expansion
For each term t in a query, expand the query with synonyms and related words of t from the thesaurus
  feline → feline cat
May weight added terms less than original query terms.
Generally increases recall.
Widely used in many science/engineering fields.
May significantly decrease precision, particularly with ambiguous terms.
  "interest rate" → "interest rate fascinate evaluate"
There is a high cost of manually producing a thesaurus
  And of updating it for scientific changes.
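A sketch of thesaurus-based expansion with down-weighted added terms; the thesaurus fragment and the weight 0.5 are hypothetical illustrations.

```python
# Hypothetical fragment of a hand-built thesaurus.
THESAURUS = {
    "physician": ["doc", "doctor", "md", "medico"],
    "feline": ["cat"],
}

def expand(query_terms, thesaurus, weight=0.5):
    """Add thesaurus entries for each query term,
    weighting added terms less than the originals."""
    expanded = {t: 1.0 for t in query_terms}
    for t in query_terms:
        for syn in thesaurus.get(t, []):
            expanded.setdefault(syn, weight)
    return expanded

print(expand(["feline"], THESAURUS))
# {'feline': 1.0, 'cat': 0.5}
```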
Automatic Thesaurus Generation
Attempt to generate a thesaurus automatically by analyzing the collection of documents
Fundamental notion: similarity between two words
Definition 1: Two words are similar if they co-occur with similar words.
Definition 2: Two words are similar if they occur in a given grammatical relation with the same words.
  You can harvest, peel, eat, prepare, etc. apples and pears, so apples and pears must be similar.
Co-occurrence Thesaurus
Simplest way to compute one is based on the term-term similarities in C = A·Aᵀ, where A is the M × N term-document matrix (rows t_i are terms, columns d_j are documents).
What does C contain if A is a term-document incidence (0/1) matrix?
For each t_i, pick the terms with the highest values in its row of C.
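The product C = A·Aᵀ can be sketched in pure Python; the 3 × 4 incidence matrix is a toy example. With a 0/1 matrix, entry C[i][j] counts the documents in which terms i and j both occur (and C[i][i] is term i's document frequency), which answers the question above.

```python
def cooccurrence(A):
    """C = A A^T for a term-document matrix A (terms as rows)."""
    M, N = len(A), len(A[0])
    return [[sum(A[i][k] * A[j][k] for k in range(N)) for j in range(M)]
            for i in range(M)]

# Toy 3-term x 4-document incidence matrix (hypothetical).
A = [[1, 1, 0, 0],   # t0
     [1, 1, 1, 0],   # t1
     [0, 0, 1, 1]]   # t2
C = cooccurrence(A)
print(C[0][1], C[1][2], C[0][2])  # 2 1 0
```

t0 and t1 co-occur in two documents, so they would be each other's strongest thesaurus candidates here.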
Automatic Thesaurus Generation: Example

Automatic Thesaurus Generation: Discussion