IR Presentation 1
Uploaded by Jawad Abid

Introduction to

Information Retrieval
Evaluation of information retrieval systems
Information Needs and Queries
• A query can represent very different information needs
• May require different search techniques and ranking algorithms
to produce the best rankings

• A query can be a poor representation of the information need


• User may find it difficult to express the information need
• User is encouraged to enter short queries both by the search
engine interface and by the fact that long queries do not work well
Interaction
• Interaction with the system occurs
• during query formulation and reformulation
• while browsing the result
• Key aspect of effective retrieval
• users cannot change the ranking algorithm but can change
results through interaction
• helps refine description of information need
• how does the user describe what they do not know?
Keyword Queries
• Simple, natural language queries were
designed to enable everyone to search
• Current search engines do not perform
well (in general) with natural language
queries
• People are trained to use keywords
• Compare average of about 2.3
words/web query to average of 30
words/CQA query
• What would you answer someone
approaching you just asking “Portland
Maine”?
• Keyword selection is not always easy
• Query refinement techniques can
help
Query Refinement
• Refinement process aims to produce a query that is a better
representation of the information need
• spelling correction
• query expansion
• relevance feedback

• The initial stages of processing a text query should mirror the
processing steps that are used for documents
• Words in the query text must be transformed into the same terms that
were produced from document texts, or there will be errors in the ranking
Spell Checking
• Basic approach: suggest corrections for words not found in the
spelling dictionary
• Suggestions found by comparing the word to words in the dictionary
using a similarity measure
• Most common similarity measure is edit distance
• the number of operations required to transform one word into the other
• the edits can be insertions, deletions, or replacements, and each
operation is assigned a cost
Edit Distance
• Number of techniques used to speed up calculation of
edit distances
• restrict to words starting with same character
• restrict to words of same or similar length
• restrict to words that sound the same
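The edit-distance definition above can be sketched as a standard dynamic-programming computation. This is a minimal illustration with unit cost for every insertion, deletion, and replacement; a real spell checker would combine it with the dictionary restrictions listed above.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and replacements
    needed to turn string a into string b (unit cost per edit)."""
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion from a
                            curr[j - 1] + 1,     # insertion into a
                            prev[j - 1] + cost)) # replacement (or match)
        prev = curr
    return prev[-1]
```

Suggestions for a misspelled word can then be found by keeping the dictionary words within a small distance, e.g. `[w for w in dictionary if edit_distance(word, w) <= 2]`.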
Spelling Correction Issues
Query Expansion
• The Thesaurus
• particularly useful for query expansion
• adding synonyms or more specific terms using query operators based on the thesaurus
• improves search effectiveness
Thesaurus-based Query Expansion
• For each term, t, in a query, expand the query with synonyms and
related words of t from the thesaurus.
• May weight added terms less than original query terms.
• Generally increases recall.
• May significantly decrease precision, particularly with ambiguous
terms.
• “interest rate” → “interest rate fascinate evaluate”
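The expansion step can be sketched as follows. The tiny THESAURUS dictionary is a stand-in for a real resource such as WordNet, and the down-weighting of added terms follows the bullet above.

```python
# Toy thesaurus; a real system would draw on a resource such as WordNet.
THESAURUS = {
    "interest": ["fascinate"],
    "rate": ["evaluate"],
}

def expand_query(query, added_weight=0.5):
    """Expand each query term with its thesaurus entries,
    weighting the added terms less than the original terms."""
    weighted = []
    for term in query.lower().split():
        weighted.append((term, 1.0))
        for related in THESAURUS.get(term, []):
            weighted.append((related, added_weight))
    return weighted
```

This reproduces the ambiguity problem from the example: "interest rate" expands to include "fascinate" and "evaluate", which is why precision can drop.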

WordNet
• A more detailed database of semantic relationships between
English words.
• Developed by famous cognitive psychologist George Miller and a
team at Princeton University.
• About 144,000 English words.
• Nouns, adjectives, verbs, and adverbs grouped into about 109,000
synonym sets called synsets.

WordNet Synset Relationships
• Antonym: front → back
• Attribute: benevolence → good (noun to adjective)
• Pertainym: alphabetical → alphabet (adjective to noun)
• Similar: unquestioning → absolute
• Cause: kill → die
• Entailment: breathe → inhale
• Holonym: chapter → text (part to whole)
• Meronym: computer → cpu (whole to part)
• Hyponym: plant → tree (specialization)
• Hypernym: apple → fruit (generalization)

WordNet Query Expansion
• Add synonyms in the same synset.
• Add hyponyms to add specialized terms.
• Add hypernyms to generalize a query.
• Add other related terms to expand query.

Statistical Thesaurus
• Existing human-developed thesauri are not easily available in all
languages.
• Human-built thesauri are limited in the type and range of synonymy and
semantic relations they represent.
• Semantically related terms can be discovered from statistical
analysis of corpora.

Automatic Global Analysis
• Determine term similarity through a pre-computed statistical
analysis of the complete corpus.
• Compute association matrices which quantify term correlations
in terms of how frequently they co-occur.
• Expand queries with statistically most similar terms.

Association Matrix

c_ij: correlation factor between term i and term j

    c_ij = Σ_{d_k ∈ D} f_ik · f_jk

f_ik: frequency of term i in document k
Normalized Association Matrix
• Frequency-based correlation factor favors more frequent terms.
• Normalize association scores:

    s_ij = c_ij / (c_ii + c_jj − c_ij)

• Normalized score is 1 if two terms have the same frequency in all
documents.
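The two formulas can be sketched together, with documents as token lists. This is a minimal illustration of the definitions, not an efficient implementation.

```python
from collections import Counter

def association_matrix(docs):
    """c_ij = sum over all documents of f_ik * f_jk,
    where f_ik is the frequency of term i in document k."""
    vocab = sorted({t for d in docs for t in d})
    c = {(i, j): 0 for i in vocab for j in vocab}
    for d in docs:
        f = Counter(d)
        for i in vocab:
            for j in vocab:
                c[(i, j)] += f[i] * f[j]
    return c

def normalized_score(c, i, j):
    """s_ij = c_ij / (c_ii + c_jj - c_ij); equals 1.0 when terms i and j
    have the same frequency in every document."""
    return c[(i, j)] / (c[(i, i)] + c[(j, j)] - c[(i, j)])
```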
Metric Correlation Matrix

• Association correlation does not account for the proximity of
terms in documents, just co-occurrence frequencies within
documents.
• Metric correlations account for term proximity:

    c_ij = Σ_{k_u ∈ V_i} Σ_{k_v ∈ V_j} 1 / r(k_u, k_v)

V_i: set of all occurrences of term i in any document.
r(k_u, k_v): distance in words between word occurrences k_u and k_v
(∞ if k_u and k_v are occurrences in different documents).
Normalized Metric Correlation Matrix
• Normalize scores to account for term frequencies:

    s_ij = c_ij / (|V_i| · |V_j|)
Query Expansion with Correlation Matrix

• For each term i in the query, expand the query with the n terms j that
have the highest value of c_ij (or s_ij).
• This adds semantically related terms in the “neighborhood” of the
query terms.
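A minimal sketch of this selection step, assuming a precomputed correlation matrix c keyed by term pairs (as produced by an association-matrix routine):

```python
def expand_with_matrix(query_terms, c, n=1):
    """For each query term i, add the n terms j with the highest c[(i, j)]."""
    expanded = list(query_terms)
    for i in query_terms:
        candidates = sorted(
            (j for (a, j) in c if a == i and j != i and j not in expanded),
            key=lambda j: c[(i, j)],
            reverse=True,
        )
        expanded.extend(candidates[:n])
    return expanded
```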
Automatic Local Analysis
• At query time, dynamically determine similar terms based on
analysis of top-ranked retrieved documents.
• Base correlation analysis on only the “local” set of retrieved
documents for a specific query.
• Avoids ambiguity by determining similar (correlated) terms only
within relevant documents.
• “Apple computer” → “Apple computer Powerbook laptop”
Global vs. Local Analysis
• Global analysis requires intensive term correlation computation
only once, at system development time.
• Local analysis requires intensive term correlation computation for
every query at run time (although the number of terms and
documents is smaller than in global analysis).
• But local analysis gives better results.


Global Analysis Refinements
• Only expand the query with terms that are similar to all terms in
the query:

    sim(k_i, Q) = Σ_{k_j ∈ Q} c_ij

• “fruit” not added to “Apple computer” since it is far from “computer.”
• “fruit” added to “apple pie” since “fruit” is close to both “apple” and “pie.”
• Use more sophisticated term weights (instead of just frequency)
when computing term correlations.
Query Expansion
• A variety of automatic or semi-automatic query expansion
techniques have been developed
• goal is to improve effectiveness by matching related terms
• semi-automatic techniques require user interaction to select
best expansion terms
• Query suggestion is a related technique
• alternative queries, not necessarily more terms
• Expansion of queries with related terms can improve
performance, particularly recall.
• However, one must select similar terms very carefully to
avoid problems, such as loss of precision.
Relevance Feedback
• User identifies relevant (and maybe non-relevant) documents in
the initial result list
• System modifies the query using terms from those documents and
re-ranks documents
• example of a simple machine learning algorithm using training data, but
with very little training data
• Pseudo-relevance feedback just assumes the top-ranked documents
are relevant – no user input
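The pseudo-relevance-feedback idea can be sketched as follows. This is an illustrative expansion step only: documents are token lists, stopword removal and term weighting are omitted, and the choices of k and n_new are arbitrary.

```python
from collections import Counter

def pseudo_feedback(query_terms, ranked_docs, k=2, n_new=2):
    """Pseudo-relevance feedback sketch: assume the top-k ranked documents
    are relevant and expand the query with their most frequent unseen terms.
    ranked_docs: list of token lists, best-ranked first."""
    counts = Counter(t for doc in ranked_docs[:k] for t in doc)
    new_terms = [t for t, _ in counts.most_common() if t not in query_terms]
    return query_terms + new_terms[:n_new]
```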
Relevance Feedback Architecture
[Diagram: the query string is submitted to the IR system, which searches the
document corpus and returns ranked documents (Doc1, Doc2, Doc3, …); the user
marks each as relevant or non-relevant; query reformulation uses this feedback
to produce a revised query, and the re-ranked documents are returned.]
Relevance Feedback
• Top 10 documents for “tropical fish”
Relevance Feedback
Relevance feedback searching over images. (top) The user views the initial
query results for a query of “bike”, selects the first, third and fourth
results in the top row and the fourth result in the bottom row as relevant,
and submits this feedback. (bottom) The user sees the revised result set.
Precision is greatly improved.
Query Reformulation
• Revise the query to account for feedback:
• Query Expansion: add new terms to the query from relevant documents.
• Term Reweighting: increase the weight of terms in relevant documents and
decrease the weight of terms in non-relevant documents.
• Several algorithms exist for query reformulation.
Query Reformulation for VSR

• Change the query vector using vector algebra.
• Add the vectors for the relevant documents to the query vector.
• Subtract the vectors for the non-relevant documents from the query vector.
• This adds both positively and negatively weighted terms to the
query, as well as reweighting the initial terms.
Optimal Query

• Optimal Query: a query vector that maximizes similarity with
relevant documents while minimizing similarity with non-relevant
documents
Optimal Query
• Assume that the relevant set of documents C_r is known.
• Then the best query, the one that ranks all and only the relevant
documents at the top, is:

    q_opt = (1/|C_r|) Σ_{d_j ∈ C_r} d_j − (1/(N − |C_r|)) Σ_{d_j ∉ C_r} d_j

  where N is the total number of documents.
• The optimal query is the vector difference between the
centroids of the relevant and non-relevant documents.
Standard Rocchio Method
• Since the full relevant set is unknown, just use the known
relevant (D_r) and non-relevant (D_n) sets of documents, and include the
initial query q:

    q_m = α·q + (β/|D_r|) Σ_{d_j ∈ D_r} d_j − (γ/|D_n|) Σ_{d_j ∈ D_n} d_j

α: tunable weight for the initial query.
β: tunable weight for relevant documents.
γ: tunable weight for non-relevant documents.
Standard Rocchio Method
• α, β, and γ control the balance between trusting the judged
document set versus the original query:
• with many judged documents, a higher β and γ are appropriate.
• Starting from the initial query q, the new query moves some distance
toward the centroid of the relevant documents and some distance away
from the centroid of the non-relevant documents.
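The Rocchio update can be sketched with term-weight vectors as plain lists. The default α, β, γ values below are illustrative choices, not taken from the slides, and clipping negative weights to zero is a common practical step rather than part of the formula.

```python
def rocchio(query, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update:
    q_m = alpha*q + (beta/|Dr|)*sum(rel) - (gamma/|Dn|)*sum(nonrel).
    All vectors are equal-length lists of term weights."""
    def centroid(docs):
        n = len(docs)
        return [sum(d[i] for d in docs) / n for i in range(len(docs[0]))]

    q_m = [alpha * w for w in query]
    if rel_docs:
        q_m = [w + beta * c for w, c in zip(q_m, centroid(rel_docs))]
    if nonrel_docs:
        q_m = [w - gamma * c for w, c in zip(q_m, centroid(nonrel_docs))]
    return [max(w, 0.0) for w in q_m]  # negative weights commonly clipped to 0
```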
Optimal Query

An application of Rocchio’s algorithm: some documents have been labeled
as relevant and non-relevant, and the initial query vector is moved in
response to this feedback.
Context and Personalization
• If a query has the same words as another query, the results will be
the same, regardless of:
• who submitted the query
• why the query was submitted
• where the query was submitted
• what other queries were submitted in the same session
• These other factors (the context) could have a significant impact on
relevance
• difficult to incorporate into ranking
User Models (personalized search)
• Generate user profiles based on documents that the person
looks at
• such as web pages visited, email messages, or word-processing
documents on the desktop
• Modify queries using words from the profile
• Generally not effective for retrieval
• imprecise profiles; information needs can change significantly
• privacy issues
Query logs
• Query logs provide important contextual information that can be used
effectively
• Context in this case is
• previous queries that are the same
• previous queries that are similar
• query sessions including the same query
• Query history for individuals could be used for caching
• Role of Cookies
(Geographic) Local Search
• Location is context
• Local search uses geographic information to modify the ranking of search
results
• location derived from the query text
• location of the device where the query originated
Local Search
