IR Presentation 1
Uploaded by Jawad Abid

Introduction to

Information Retrieval
Evaluation of information retrieval systems
Information Needs and Queries
• A query can represent very different information needs
• May require different search techniques and ranking algorithms
to produce the best rankings

• A query can be a poor representation of the information need


• User may find it difficult to express the information need
• User is encouraged to enter short queries both by the search
engine interface and by the fact that long queries do not work well
Interaction
• Interaction with the system occurs
• during query formulation and reformulation
• while browsing the result
• Key aspect of effective retrieval
• users cannot change the ranking algorithm but can change
results through interaction
• helps refine description of information need
• how does the user describe what they do not know?
Keyword Queries
• Simple, natural language queries were
designed to enable everyone to search
• Current search engines do not perform
well (in general) with natural language
queries
• People are trained to use keywords
• Compare average of about 2.3
words/web query to average of 30
words/CQA query
• What would you answer someone
approaching you just asking “Portland
Maine”?
• Keyword selection is not always easy
• Query refinement techniques can
help
Query Refinement
• Refinement process aims to produce a query that is a better
representation of the information need
• spelling correction
• query expansion
• relevance feedback

• The initial stages of processing a text query should mirror the
processing steps that are used for documents
• Words in the query text must be transformed into the same terms that
were produced from document texts, or there will be errors in the ranking
Spell Checking
• Basic approach: suggest corrections for words not found in the
spelling dictionary
• Suggestions found by comparing the word to words in the dictionary
using a similarity measure
• Most common similarity measure is edit distance
• the number of operations required to transform one word into the other
• the edits can be insertions, deletions, or replacements, and each
operation is assigned a cost
Edit Distance
• Number of techniques used to speed up calculation of
edit distances
• restrict to words starting with same character
• restrict to words of same or similar length
• restrict to words that sound the same
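The edit-distance definition above can be sketched as a standard dynamic-programming computation. This is a minimal illustration with unit cost for every insertion, deletion, and replacement; a real spell checker would combine it with the dictionary restrictions listed above.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and replacements
    needed to turn string a into string b (unit cost per edit)."""
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion from a
                            curr[j - 1] + 1,     # insertion into a
                            prev[j - 1] + cost)) # replacement (or match)
        prev = curr
    return prev[-1]
```

Suggestions for a misspelled word can then be found by keeping the dictionary words within a small distance, e.g. `[w for w in dictionary if edit_distance(word, w) <= 2]`.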
Spelling Correction Issues
Query Expansion
• The Thesaurus
• particularly useful for query expansion
• adding synonyms or more specific terms using query operators based on the thesaurus
• improves search effectiveness
Thesaurus-based Query Expansion
• For each term, t, in a query, expand the query with synonyms and
related words of t from the thesaurus.
• May weight added terms less than original query terms.
• Generally increases recall.
• May significantly decrease precision, particularly with ambiguous
terms.
• “interest rate” → “interest rate fascinate evaluate”
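The expansion step can be sketched as follows. The tiny THESAURUS dictionary is a stand-in for a real resource such as WordNet, and the down-weighting of added terms follows the bullet above.

```python
# Toy thesaurus; a real system would draw on a resource such as WordNet.
THESAURUS = {
    "interest": ["fascinate"],
    "rate": ["evaluate"],
}

def expand_query(query, added_weight=0.5):
    """Expand each query term with its thesaurus entries,
    weighting the added terms less than the original terms."""
    weighted = []
    for term in query.lower().split():
        weighted.append((term, 1.0))
        for related in THESAURUS.get(term, []):
            weighted.append((related, added_weight))
    return weighted
```

This reproduces the ambiguity problem from the example: "interest rate" expands to include "fascinate" and "evaluate", which is why precision can drop.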

WordNet
• A more detailed database of semantic relationships between
English words.
• Developed by famous cognitive psychologist George Miller and a
team at Princeton University.
• About 144,000 English words.
• Nouns, adjectives, verbs, and adverbs grouped into about 109,000
synonym sets called synsets.

WordNet Synset Relationships
• Antonym: front → back
• Attribute: benevolence → good (noun to adjective)
• Pertainym: alphabetical → alphabet (adjective to noun)
• Similar: unquestioning → absolute
• Cause: kill → die
• Entailment: breathe → inhale
• Holonym: chapter → text (part to whole)
• Meronym: computer → cpu (whole to part)
• Hyponym: plant → tree (specialization)
• Hypernym: apple → fruit (generalization)

WordNet Query Expansion
• Add synonyms in the same synset.
• Add hyponyms to add specialized terms.
• Add hypernyms to generalize a query.
• Add other related terms to expand query.

Statistical Thesaurus
• Existing human-developed thesauri are not easily available in all
languages.
• Human-built thesauri are limited in the type and range of synonymy and
semantic relations they represent.
• Semantically related terms can be discovered from statistical
analysis of corpora.

Automatic Global Analysis
• Determine term similarity through a pre-computed statistical
analysis of the complete corpus.
• Compute association matrices which quantify term correlations
in terms of how frequently they co-occur.
• Expand queries with statistically most similar terms.

Association Matrix

c_ij: correlation factor between term i and term j

    c_ij = Σ_{d_k ∈ D} f_ik · f_jk

f_ik: frequency of term i in document k
Normalized Association Matrix
• Frequency-based correlation factor favors more frequent terms.
• Normalize association scores:

    s_ij = c_ij / (c_ii + c_jj − c_ij)

• Normalized score is 1 if two terms have the same frequency in all
documents.
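The two formulas can be sketched together, with documents as token lists. This is a minimal illustration of the definitions, not an efficient implementation.

```python
from collections import Counter

def association_matrix(docs):
    """c_ij = sum over all documents of f_ik * f_jk,
    where f_ik is the frequency of term i in document k."""
    vocab = sorted({t for d in docs for t in d})
    c = {(i, j): 0 for i in vocab for j in vocab}
    for d in docs:
        f = Counter(d)
        for i in vocab:
            for j in vocab:
                c[(i, j)] += f[i] * f[j]
    return c

def normalized_score(c, i, j):
    """s_ij = c_ij / (c_ii + c_jj - c_ij); equals 1.0 when terms i and j
    have the same frequency in every document."""
    return c[(i, j)] / (c[(i, i)] + c[(j, j)] - c[(i, j)])
```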
Metric Correlation Matrix

• Association correlation does not account for the proximity of
terms in documents, just co-occurrence frequencies within
documents.
• Metric correlations account for term proximity:

    c_ij = Σ_{k_u ∈ V_i} Σ_{k_v ∈ V_j} 1 / r(k_u, k_v)

V_i: set of all occurrences of term i in any document.
r(k_u, k_v): distance in words between word occurrences k_u and k_v
(∞ if k_u and k_v are occurrences in different documents).
Normalized Metric Correlation Matrix
• Normalize scores to account for term frequencies:

    s_ij = c_ij / (|V_i| · |V_j|)
Query Expansion with Correlation Matrix

• For each term i in the query, expand the query with the n terms j that
have the highest value of c_ij (or s_ij).
• This adds semantically related terms in the “neighborhood” of the
query terms.
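A minimal sketch of this selection step, assuming a precomputed correlation matrix c keyed by term pairs (as produced by an association-matrix routine):

```python
def expand_with_matrix(query_terms, c, n=1):
    """For each query term i, add the n terms j with the highest c[(i, j)]."""
    expanded = list(query_terms)
    for i in query_terms:
        candidates = sorted(
            (j for (a, j) in c if a == i and j != i and j not in expanded),
            key=lambda j: c[(i, j)],
            reverse=True,
        )
        expanded.extend(candidates[:n])
    return expanded
```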
Automatic Local Analysis
• At query time, dynamically determine similar terms based on
analysis of top-ranked retrieved documents.
• Base correlation analysis on only the “local” set of retrieved
documents for a specific query.
• Avoids ambiguity by determining similar (correlated) terms only
within relevant documents.
• “Apple computer” → “Apple computer Powerbook laptop”
Global vs. Local Analysis
• Global analysis requires intensive term correlation computation
only once, at system development time.
• Local analysis requires intensive term correlation computation for
every query at run time (although the number of terms and
documents is smaller than in global analysis).
• But local analysis gives better results.


Global Analysis Refinements
• Only expand the query with terms that are similar to all terms in
the query:

    sim(k_i, Q) = Σ_{k_j ∈ Q} c_ij

• “fruit” not added to “Apple computer” since it is far from “computer.”
• “fruit” added to “apple pie” since “fruit” is close to both “apple” and “pie.”
• Use more sophisticated term weights (instead of just frequency)
when computing term correlations.
Query Expansion
• A variety of automatic or semi-automatic query expansion
techniques have been developed
• goal is to improve effectiveness by matching related terms
• semi-automatic techniques require user interaction to select
best expansion terms
• Query suggestion is a related technique
• alternative queries, not necessarily more terms
• Expansion of queries with related terms can improve
performance, particularly recall.
• However, one must select similar terms very carefully to
avoid problems, such as loss of precision.
Relevance Feedback
• User identifies relevant (and maybe non-relevant) documents in
the initial result list
• System modifies the query using terms from those documents and
re-ranks documents
• example of a simple machine learning algorithm using training data, but
with very little training data
• Pseudo-relevance feedback just assumes the top-ranked documents
are relevant – no user input
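The pseudo-relevance-feedback idea can be sketched as follows. This is an illustrative expansion step only: documents are token lists, stopword removal and term weighting are omitted, and the choices of k and n_new are arbitrary.

```python
from collections import Counter

def pseudo_feedback(query_terms, ranked_docs, k=2, n_new=2):
    """Pseudo-relevance feedback sketch: assume the top-k ranked documents
    are relevant and expand the query with their most frequent unseen terms.
    ranked_docs: list of token lists, best-ranked first."""
    counts = Counter(t for doc in ranked_docs[:k] for t in doc)
    new_terms = [t for t, _ in counts.most_common() if t not in query_terms]
    return query_terms + new_terms[:n_new]
```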
Relevance Feedback Architecture
[Diagram: the query string is submitted to the IR system, which searches the
document corpus and returns ranked documents (Doc1, Doc2, Doc3, …); the user
marks each as relevant or non-relevant; query reformulation uses this feedback
to produce a revised query, and the re-ranked documents are returned.]
Relevance Feedback
• Top 10 documents for “tropical fish”
Relevance Feedback
Relevance feedback searching over images. (top) The user views the initial
query results for a query of “bike”, selects the first, third and fourth
results in the top row and the fourth result in the bottom row as relevant,
and submits this feedback. (bottom) The user sees the revised result set.
Precision is greatly improved.
Query Reformulation
• Revise the query to account for feedback:
• Query Expansion: add new terms to the query from relevant documents.
• Term Reweighting: increase the weight of terms in relevant documents and
decrease the weight of terms in non-relevant documents.
• Several algorithms exist for query reformulation.
Query Reformulation for VSR

• Change the query vector using vector algebra.
• Add the vectors for the relevant documents to the query vector.
• Subtract the vectors for the non-relevant documents from the query vector.
• This adds both positively and negatively weighted terms to the
query, as well as reweighting the initial terms.
Optimal Query

• Optimal Query: a query vector that maximizes similarity with
relevant documents while minimizing similarity with non-relevant
documents
Optimal Query
• Assume that the relevant set of documents C_r is known.
• Then the best query, the one that ranks all and only the relevant
documents at the top, is:

    q_opt = (1/|C_r|) Σ_{d_j ∈ C_r} d_j − (1/(N − |C_r|)) Σ_{d_j ∉ C_r} d_j

  where N is the total number of documents.
• The optimal query is the vector difference between the
centroids of the relevant and non-relevant documents.
Standard Rocchio Method
• Since the full relevant set is unknown, just use the known
relevant (D_r) and non-relevant (D_n) sets of documents, and include the
initial query q:

    q_m = α·q + (β/|D_r|) Σ_{d_j ∈ D_r} d_j − (γ/|D_n|) Σ_{d_j ∈ D_n} d_j

α: tunable weight for the initial query.
β: tunable weight for relevant documents.
γ: tunable weight for non-relevant documents.
Standard Rocchio Method
• α, β, and γ control the balance between trusting the judged
document set versus the original query:
• with many judged documents, a higher β and γ are appropriate.
• Starting from the initial query q, the new query moves some distance
toward the centroid of the relevant documents and some distance away
from the centroid of the non-relevant documents.
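The Rocchio update can be sketched with term-weight vectors as plain lists. The default α, β, γ values below are illustrative choices, not taken from the slides, and clipping negative weights to zero is a common practical step rather than part of the formula.

```python
def rocchio(query, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update:
    q_m = alpha*q + (beta/|Dr|)*sum(rel) - (gamma/|Dn|)*sum(nonrel).
    All vectors are equal-length lists of term weights."""
    def centroid(docs):
        n = len(docs)
        return [sum(d[i] for d in docs) / n for i in range(len(docs[0]))]

    q_m = [alpha * w for w in query]
    if rel_docs:
        q_m = [w + beta * c for w, c in zip(q_m, centroid(rel_docs))]
    if nonrel_docs:
        q_m = [w - gamma * c for w, c in zip(q_m, centroid(nonrel_docs))]
    return [max(w, 0.0) for w in q_m]  # negative weights commonly clipped to 0
```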
Optimal Query

An application of Rocchio’s algorithm: some documents have been labeled
as relevant and non-relevant, and the initial query vector is moved in
response to this feedback.
Context and Personalization
• If a query has the same words as another query, the results will be
the same, regardless of:
• who submitted the query
• why the query was submitted
• where the query was submitted
• what other queries were submitted in the same session
• These other factors (the context) could have a significant impact on
relevance
• difficult to incorporate into ranking
User Models (personalized search)
• Generate user profiles based on documents that the person
looks at
• such as web pages visited, email messages, or word-processing
documents on the desktop
• Modify queries using words from the profile
• Generally not effective for retrieval
• imprecise profiles; information needs can change significantly
• privacy issues
Query logs
• Query logs provide important contextual information that can be used
effectively
• Context in this case is
• previous queries that are the same
• previous queries that are similar
• query sessions including the same query
• Query history for individuals could be used for caching
• Role of Cookies
(Geographic) Local Search
• Location is context
• Local search uses geographic information to modify the ranking of search
results
• location derived from the query text
• location of the device where the query originated
Local Search
