IR Presentation 1
Information Retrieval
Evaluation of information retrieval systems
Information Needs and Queries
• A query can represent very different information needs
• May require different search techniques and ranking algorithms
to produce the best rankings
WordNet
• A more detailed database of semantic relationships between
English words.
• Developed by famous cognitive psychologist George Miller and a
team at Princeton University.
• About 144,000 English words.
• Nouns, adjectives, verbs, and adverbs grouped into about 109,000
synonym sets called synsets.
WordNet Synset Relationships
• Antonym: front → back
• Attribute: benevolence → good (noun to adjective)
• Pertainym: alphabetical → alphabet (adjective to noun)
• Similar: unquestioning → absolute
• Cause: kill → die
• Entailment: breathe → inhale
• Holonym: chapter → text (part to whole)
• Meronym: computer → CPU (whole to part)
• Hyponym: plant → tree (specialization)
• Hypernym: apple → fruit (generalization)
WordNet Query Expansion
• Add synonyms in the same synset.
• Add hyponyms to add specialized terms.
• Add hypernyms to generalize a query.
• Add other related terms to expand query.
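A minimal sketch of this synset-based expansion using NLTK's WordNet interface (NLTK and its wordnet corpus are assumed to be installed; the query terms and the choice of relations to follow are illustrative):

```python
from nltk.corpus import wordnet as wn

def expand_term(term, use_hyponyms=False, use_hypernyms=False):
    """Collect candidate expansion terms for one query term."""
    expansions = set()
    for synset in wn.synsets(term):
        # Synonyms: all lemmas in the same synset
        expansions.update(lemma.name() for lemma in synset.lemmas())
        if use_hyponyms:            # more specialized terms
            for hypo in synset.hyponyms():
                expansions.update(lemma.name() for lemma in hypo.lemmas())
        if use_hypernyms:           # more general terms
            for hyper in synset.hypernyms():
                expansions.update(lemma.name() for lemma in hyper.lemmas())
    expansions.discard(term)
    return expansions

# Expand the query term by term, adding synonyms and hyponyms
query = ["tropical", "fish"]
expanded = set(query)
for t in query:
    expanded.update(expand_term(t, use_hyponyms=True))
print(sorted(expanded))
```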
Statistical Thesaurus
• Existing human-developed thesauri are not easily available in all
languages.
• Human thesauri are limited in the type and range of synonymy and
semantic relations they represent.
• Semantically related terms can be discovered from statistical
analysis of corpora.
Automatic Global Analysis
• Determine term similarity through a pre-computed statistical
analysis of the complete corpus.
• Compute association matrices which quantify term correlations
in terms of how frequently they co-occur.
• Expand queries with statistically most similar terms.
Association Matrix
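A sketch of how such an association matrix could be built, assuming the common correlation factor c_ij = Σ_k f_ik · f_jk (the sum over documents of the product of the two terms' frequencies); the toy corpus below is made up:

```python
import numpy as np

# Toy corpus (illustrative only)
docs = [
    "apple computer laptop",
    "apple fruit juice",
    "computer laptop keyboard",
]

# Term-document frequency matrix F, with f_ik = frequency of term i in document k
vocab = sorted({t for d in docs for t in d.split()})
index = {t: i for i, t in enumerate(vocab)}
F = np.zeros((len(vocab), len(docs)))
for k, d in enumerate(docs):
    for t in d.split():
        F[index[t], k] += 1

# Association (correlation) matrix: c_ij = sum_k f_ik * f_jk, i.e. C = F F^T
C = F @ F.T
print(vocab)
print(C)
```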
Normalized Association Matrix
• Frequency-based correlation factor favors more frequent terms.
• Normalize association scores:
s_ij = c_ij / (c_ii + c_jj − c_ij)
• Normalized score is 1 if two terms have the same frequency in all
documents.
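A small sketch of applying that normalization elementwise to an association matrix C (the example matrix is made up; division by zero for terms that never occur is not handled):

```python
import numpy as np

def normalize_association(C):
    """s_ij = c_ij / (c_ii + c_jj - c_ij) for every term pair (i, j)."""
    diag = np.diag(C)
    denom = diag[:, None] + diag[None, :] - C
    return C / denom

# Illustrative 2-term association matrix (c_ii is term i's self-correlation)
C = np.array([[4.0, 2.0],
              [2.0, 3.0]])
S = normalize_association(C)
print(S)   # S[0, 1] == 2 / (4 + 3 - 2) == 0.4; the diagonal is 1.0
```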
Metric Correlation Matrix
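As a rough sketch of the underlying idea, assuming a metric correlation factor that sums the reciprocal word distances between occurrences of the two terms, so that terms appearing close together score higher (the tokenization and example sentence are made up):

```python
def metric_correlation(doc_tokens, term_i, term_j):
    """c_ij = sum over occurrence pairs of 1 / (word distance between them)."""
    pos_i = [p for p, t in enumerate(doc_tokens) if t == term_i]
    pos_j = [p for p, t in enumerate(doc_tokens) if t == term_j]
    return sum(1.0 / abs(pi - pj) for pi in pos_i for pj in pos_j if pi != pj)

tokens = "the laptop has a fast cpu and the laptop screen is bright".split()
# "laptop" occurs at positions 1 and 8, "cpu" at 5: score = 1/4 + 1/3
print(metric_correlation(tokens, "laptop", "cpu"))
```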
Query Expansion with Correlation Matrix
• For each term i in the query, expand the query with the n terms j that have the
highest values of c_ij (or s_ij).
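A minimal sketch of that selection step, assuming a precomputed similarity matrix S over the vocabulary (the vocabulary and matrix values below are made up) and n expansion terms per query term:

```python
import numpy as np

def expand_query(query_terms, vocab, S, n=2):
    """For each query term i, add the n vocabulary terms j with the highest S[i, j]."""
    index = {t: i for i, t in enumerate(vocab)}
    expanded = list(query_terms)
    for term in query_terms:
        if term not in index:
            continue
        i = index[term]
        # Rank candidate terms by similarity, skipping the term itself
        ranked = np.argsort(-S[i])
        added = [vocab[j] for j in ranked if j != i][:n]
        expanded.extend(t for t in added if t not in expanded)
    return expanded

vocab = ["apple", "computer", "laptop", "fruit"]
S = np.array([[1.0, 0.6, 0.4, 0.7],
              [0.6, 1.0, 0.8, 0.1],
              [0.4, 0.8, 1.0, 0.1],
              [0.7, 0.1, 0.1, 1.0]])
print(expand_query(["apple"], vocab, S, n=2))  # ['apple', 'fruit', 'computer']
```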
Automatic Local Analysis
• At query time, dynamically determine similar terms based on
analysis of top-ranked retrieved documents.
• Base correlation analysis on only the “local” set of retrieved
documents for a specific query.
• Avoids ambiguity by determining similar (correlated) terms only
within relevant documents.
• “Apple computer” → “Apple computer Powerbook laptop”
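A minimal sketch of the local variant, here simplified to adding the terms that occur most often in the top-ranked ("local") documents rather than running a full correlation analysis; the documents, stopword list, and function name are illustrative:

```python
from collections import Counter

STOPWORDS = {"a", "the", "is", "by", "of", "and"}

def local_expansion(query_terms, top_docs, n=2):
    """Expand the query with the n most frequent non-query terms
    found in the top-ranked ("local") documents for this query."""
    counts = Counter()
    for doc in top_docs:
        for token in doc.lower().split():
            if token not in query_terms and token not in STOPWORDS:
                counts[token] += 1
    return list(query_terms) + [t for t, _ in counts.most_common(n)]

top_docs = [
    "Apple Computer released a new PowerBook laptop",
    "the PowerBook laptop is a computer made by Apple",
]
print(local_expansion(["apple", "computer"], top_docs, n=2))
# ['apple', 'computer', 'powerbook', 'laptop']
```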
Global vs. Local Analysis
• Global analysis requires intensive term correlation computation
only once at system development time.
• The system modifies the query using terms from those (top-ranked) documents and
re-ranks the documents
• an example of a simple machine learning algorithm that uses training data, but
with very little training data
Revise Rankings
[Diagram: a query is submitted to the IR system, which returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...); feedback on those documents drives query reformulation, and the re-ranked documents are returned (1. Doc2, 2. Doc4, 3. Doc5, ...).]
Relevance Feedback
• Top 10 documents for “tropical fish”
Relevance Feedback
Relevance feedback searching over images. (top) The user views the initial query results for a query of “bike”, selects the first, third, and fourth results in the top row and the fourth result in the bottom row as relevant, and submits this feedback. (bottom) The user sees the revised result set. Precision is greatly improved.
Query Reformulation
• Revise query to account for feedback:
• Add the vectors for the relevant documents to the query vector.
• Subtract the vectors for the irrelevant docs from the query vector.
• This adds both positively and negatively weighted terms to the query, as well as
reweighting the initial terms.
Optimal Query
q_m = q + (1/|D_r|) Σ_{d_j ∈ D_r} d_j − (1/|D_n|) Σ_{d_j ∈ D_n} d_j
where D_r is the set of relevant documents and D_n is the set of non-relevant documents identified in the feedback.
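A small sketch of that reformulation over term-weight vectors, using the unweighted form of the formula above; real systems usually add tuning coefficients (e.g. Rocchio's alpha, beta, gamma), and the dict-based vectors and example weights here are illustrative:

```python
from collections import defaultdict

def reformulate(query_vec, relevant_docs, nonrelevant_docs):
    """q_m = q + (1/|Dr|) * sum of relevant doc vectors
             - (1/|Dn|) * sum of non-relevant doc vectors."""
    new_q = defaultdict(float, query_vec)
    for d in relevant_docs:
        for term, w in d.items():
            new_q[term] += w / len(relevant_docs)
    for d in nonrelevant_docs:
        for term, w in d.items():
            new_q[term] -= w / len(nonrelevant_docs)
    # Keep all nonzero weights; some systems clip negative weights to zero
    return {t: w for t, w in new_q.items() if w != 0.0}

q = {"tropical": 1.0, "fish": 1.0}
relevant = [{"fish": 2.0, "aquarium": 1.0}]
nonrelevant = [{"fish": 1.0, "recipe": 2.0}]
print(reformulate(q, relevant, nonrelevant))
# {'tropical': 1.0, 'fish': 2.0, 'aquarium': 1.0, 'recipe': -2.0}
```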