
IR models

• Why IR models?
• Boolean IR Model
• Vector space IR model
• Probabilistic IR model
What is Information Retrieval?
• Information retrieval is the process of searching a large unstructured corpus for relevant documents that satisfy the user's information need.
• It is a tool that finds and selects, from a collection of items, a subset that serves the user's purpose.
• Much IR research focuses more specifically on text retrieval. But
there are many other interesting areas:
 Cross-language vs. multilingual information retrieval,
 Multimedia (audio, video & image) information retrieval (QBIC, WebSeek,
SaFe)
 Question-answering (AskJeeves, Answerbus).
 Digital and virtual libraries
Information Retrieval Serves as a Bridge
• An Information Retrieval System serves as a bridge between the world of authors and the world of readers/users.
• That is, writers present a set of ideas in a document using a set of concepts; users then query the IR system for relevant documents that satisfy their information need.

[Figure: the IR system as a black box between the user and the documents]

Typical IR System Architecture
• Inputs: a query string from the user and a document corpus.
• Output: a ranked list of relevant documents (1. Doc1, 2. Doc2, 3. Doc3, …).
Our focus during IR system design
• Improving Effectiveness of the system
• The concern here is retrieving more relevant documents for the user's query
• Effectiveness of the system is measured in terms of precision, recall, …
• Main emphasis: text operations (such as stemming, stopword removal, normalization, etc.), weighting schemes, matching algorithms, …
• Improving Efficiency of the system
• The concern here is
• enhancing searching time, indexing time, access time, …
• reducing the storage space requirement of the system
• space-time tradeoffs
• Main emphasis:
• Compression
• Index term selection (free text or content-bearing terms)
• indexing structures
Subsystems of an IR system
The two subsystems of an IR system: Indexing and Searching
• Indexing:
• is an offline process of organizing documents using keywords extracted from the collection
• speeds up access to the desired information in the document collection for a given user query
• Searching:
• is an online process that scans the document corpus to find relevant documents that match the user's query
Indexing Subsystem

documents
  → assign document identifier (documents + document IDs)
  → tokenization (tokens)
  → stopword removal (non-stoplist tokens)
  → stemming & normalization (stemmed terms)
  → term weighting (weighted index terms)
  → index file
Searching Subsystem

query
  → parse query (query tokens)
  → stopword removal (non-stoplist tokens)
  → stemming & normalization (stemmed terms)
  → term weighting (query terms)
  → similarity measure against the index terms in the index file
  → ranking
  → ranked relevant document set
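The indexing pipeline above can be sketched end to end in a few lines. This is a toy sketch: the stoplist and the suffix-stripping "stemmer" are simplified stand-ins for real components (a production system would use a full stoplist and a real stemmer such as Porter's).

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "the", "of", "in", "is", "for", "to"}  # toy stoplist

def stem(token):
    # crude suffix stripping, a stand-in for a real stemmer (e.g. Porter)
    for suffix in ("ment", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def index_documents(docs):
    """docs: {doc_id: text}. Returns an inverted index {term: {doc_id: tf}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        tokens = re.findall(r"[a-z]+", text.lower())           # tokenization
        terms = [stem(t) for t in tokens if t not in STOPWORDS]
        for term in terms:                                     # count raw tf per doc
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

docs = {1: "Shipment of gold damaged in a fire",
        2: "Delivery of silver arrived in a silver truck"}
idx = index_documents(docs)
```

The searching subsystem then applies the same tokenize/stop/stem steps to the query before looking terms up in `idx`.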
IR Models - Basic Concepts
• IR systems usually adopt index terms to index and retrieve documents
• Each document is represented by a set of representative keywords or index terms (called a Bag of Words)
• An index term is a word useful for capturing the document's main themes
• Not all terms are equally useful for representing the document contents:
• less frequent terms allow identifying a narrower set of documents
• But no ordering information is attached to the Bag of Words identified from the document collection.
IR Models - Basic Concepts
• One central problem of IR systems is predicting the degree of relevance of documents for a given query
• Such a decision usually depends on a ranking algorithm that attempts to establish a simple ordering of the retrieved documents
• Documents appearing at the top of this ordering are considered more likely to be relevant
• Thus ranking algorithms are at the core of IR systems
• The IR model determines the prediction of what is relevant and what is not, based on the notion of relevance implemented by the system
IR models

[Figure: taxonomy of classic IR models — Boolean (set-theoretic), Vector space (algebraic), and Probabilistic (based on probabilistic relevance)]
How to find relevant documents for a query?
• Step 1: Map documents & queries into the term-document vector space. Note that queries are treated as short documents.
• Represent both documents & queries as N-dimensional vectors in a term-document matrix, which shows the occurrence of terms in the document collection or query:

  dj = (t1,j, t2,j, …, tN,j);   qk = (t1,k, t2,k, …, tN,k)

        T1   T2   …   TN
  D1    …    …    …   …
  D2    …    …    …   …
  :     …    …    …   …
  DM    …    …    …   …
  Qi    …    …    …   …

• The document collection is mapped to a term-by-document matrix.
• Each document is viewed as a vector in a multidimensional space; nearby vectors are related.
How to find relevant documents for a query?
• Step 2: Queries and documents are represented as weighted vectors, wij
• Why do we need weighting techniques?
• To measure the importance of a term in describing the content of a given document.
• There are binary and non-binary weighting techniques. What is the difference between the two?
• What method would you recommend to compute the weights for term i in document j and in query q, i.e. wij and wiq?

        T1    T2    …   TN
  D1    w11   w12   …   w1N
  D2    w21   w22   …   w2N
  :     :     :         :
  DM    wM1   wM2   …   wMN
  Qi    wi1   wi2   …   wiN

• An entry in the matrix corresponds to the "weight" of a term in the document; zero means the term does not occur in the document.
• Normalize for vector length to avoid the effect of document length.
How to find relevant documents for a query?
• Step 3: Rank documents (in decreasing order of score) based on their closeness to the query.
• Documents are ranked by the degree of their closeness to the query.
• How is the closeness of a document to the query measured?
• It is determined by a similarity/dissimilarity score, most commonly the cosine similarity:

  sim(dj, q) = (dj · q) / (|dj| |q|)
             = Σi wi,j wi,q / ( sqrt(Σi wi,j²) · sqrt(Σi wi,q²) ),  i = 1..n

• How many matching (similarity/dissimilarity) measures do you know? Which one is best for IR?
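The cosine formula above translates directly into a small function (a minimal sketch; vectors are plain Python lists of weights over the same term dimensions):

```python
import math

def cosine_sim(doc, query):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(doc, query))
    norm_d = math.sqrt(sum(w * w for w in doc))
    norm_q = math.sqrt(sum(w * w for w in query))
    if norm_d == 0 or norm_q == 0:
        return 0.0                    # empty vector: define similarity as 0
    return dot / (norm_d * norm_q)

print(cosine_sim([1, 2, 0], [2, 4, 0]))   # parallel vectors: similarity ≈ 1.0
print(cosine_sim([1, 0], [0, 1]))         # orthogonal vectors: similarity 0.0
```

Because cosine measures the angle between vectors rather than their lengths, a long document and a short document about the same topic score similarly.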
How to evaluate Models?
• We need to investigate what procedures the IR Models follow and what
techniques they use:
• What is the weighting technique used by the IR Models for measuring importance of
terms in documents?
• Are they using binary or non-binary weight?
• What is the matching technique used by the IR models?
• Are they measuring similarity or dissimilarity?
• Are they applying exact matching or partial matching in the course of finding relevant
documents for a given query?
• Are they applying best matching principle to measure the degree of relevance of
documents to display in ranked-order?
• Is there any Ranking mechanism applied before displaying relevant documents for the users?
The Boolean Model
• The Boolean model is a simple model based on set theory
• It imposes a binary criterion for deciding relevance
• Documents must exactly match the query
• Terms are either present or absent. Thus, wij ∈ {0, 1}

  sim(q, dj) = 1  if the document satisfies the boolean query
               0  otherwise

• Note that no weights are assigned between 0 and 1; only the values 0 or 1 are used.

        T1    T2    …   TN
  D1    w11   w12   …   w1N
  D2    w21   w22   …   w2N
  :     :     :         :
  DM    wM1   wM2   …   wMN
The Boolean Model: Example
Given the following three documents, construct the term-document matrix and find the relevant documents retrieved by the Boolean model for the query "gold silver truck":
• D1: "Shipment of gold damaged in a fire"
• D2: "Delivery of silver arrived in a silver truck"
• D3: "Shipment of gold arrived in a truck"
The table below shows the document-term (ti) matrix:

         arrive  damage  deliver  fire  gold  silver  ship  truck
  D1       0       1        0       1     1     0       1     0
  D2       1       0        1       0     0     1       0     1
  D3       1       0        0       0     1     0       1     1
  query    0       0        0       0     1     1       0     1

Also find the documents relevant for the queries:
(a) gold ∧ delivery;  (b) ship ∧ gold;  (c) silver ∧ truck
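A minimal Boolean-retrieval sketch for this example. The document term sets are the stemmed content-bearing terms from the matrix above; treating the multi-word queries as AND queries is an assumption (a strict Boolean AND is the usual exact-match reading).

```python
# Binary term-document matrix for the example (stemmed content-bearing terms)
docs = {
    "D1": {"damage", "fire", "gold", "ship"},
    "D2": {"arrive", "deliver", "silver", "truck"},
    "D3": {"arrive", "gold", "ship", "truck"},
}

def boolean_and(*terms):
    """Documents containing ALL the given terms (exact-match AND query)."""
    return sorted(d for d, dt in docs.items() if all(t in dt for t in terms))

def boolean_or(*terms):
    """Documents containing ANY of the given terms."""
    return sorted(d for d, dt in docs.items() if any(t in dt for t in terms))

print(boolean_and("gold", "silver", "truck"))  # [] -- no doc has all three
print(boolean_or("gold", "silver", "truck"))   # all three docs match
print(boolean_and("silver", "truck"))          # only D2
```

This illustrates the model's main weakness: under AND the query "gold silver truck" returns nothing, while under OR it returns everything, with no ranking in between.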
The Boolean Model: Further Example
• Given the following, determine the documents retrieved by a Boolean-model-based IR system
• Index terms: K1, …, K8
• Documents:
1. D1 = {K1, K2, K3, K4, K5}
2. D2 = {K1, K2, K3, K4}
3. D3 = {K2, K4, K6, K8}
4. D4 = {K1, K3, K5, K7}
5. D5 = {K4, K5, K6, K7, K8}
6. D6 = {K1, K2, K3, K4}
• Query: K1 ∧ (K2 ∨ ¬K3)
• Answer: {D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5})
        = {D1, D2, D4, D6} ∩ {D1, D2, D3, D5, D6}
        = {D1, D2, D6}
Exercise
Given the following four documents with the following contents:
• D1 = “computer information retrieval”
• D2 = “computer retrieval”
• D3 = “information”
• D4 = “computer information”
• What are the relevant documents retrieved for the queries:
• Q1 = “information ∧ retrieval”
• Q2 = “information ∧ ¬computer”
Drawbacks of the Boolean Model
•Retrieval based on binary decision criteria with no notion of
partial matching
•No ranking of the documents is provided (absence of a
grading scale)
•Information need has to be translated into a Boolean
expression which most users find awkward
•The Boolean queries formulated by the users are most often
too simplistic
 As a consequence, the Boolean model frequently returns
either too few or too many documents in response to a
user query
Vector-Space Model
• This is the most commonly used strategy for measuring the relevance of documents for a given query, because:
• Use of binary weights is too limiting
• Non-binary weights provide consideration for partial matches
• These term weights are used to compute a degree of similarity between a query and each document
• A ranked set of documents provides for better matching
• The idea behind VSM is that
• the meaning of a document is conveyed by the words used in that document
• VSM represents documents and queries as vectors in a multi-dimensional space, where each dimension corresponds to a unique term (word) from the entire corpus. The weights of the terms in these vectors are typically derived from their frequency in the document and their importance in the corpus.
Vector-Space Model
To find relevant documents for a given query:
• First, map documents and queries into the term-document vector space. Note that queries are treated as short documents.
• Second, in the vector space, queries and documents are represented as weighted vectors, wij
• There are different weighting techniques; the most widely used one computes a TF*IDF weight for each term
• Third, a similarity measure is used to rank documents by the closeness of their vectors to the query.
• To measure the closeness of documents to the query, the cosine similarity score is used by most search engines
Computing weights

  wij = ( freq(i,j) / max_k freq(k,j) ) * log(N / ni)

  wiq = ( 0.5 + 0.5 * freq(i,q) / max_k freq(k,q) ) * log(N / ni)

where N is the number of documents in the collection and ni is the number of documents containing term i.
Example: Computing weights
• A collection includes 10,000 documents
• The term tA appears 20 times in a particular document j
• The maximum frequency of any term tk in document j is 50
• The term tA appears in 2,000 of the documents in the collection
• Compute the TF*IDF weight of term tA in document j:
• tf(A,j) = freq(A,j) / max(freq(k,j)) = 20/50 = 0.4
• idf(A) = log(N/DFA) = log2(10,000/2,000) = log2(5) = 2.32
• wAj = tf(A,j) * idf(A) = 0.4 * 2.32 = 0.928
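The computation above can be checked with a one-liner helper (a sketch; the base-2 logarithm is an assumption inferred from the worked example, where log(5) = 2.32):

```python
import math

def tf_idf(freq_ij, max_freq_j, N, df_i, log_base=2):
    """Normalized TF times IDF, following the weighting scheme above."""
    tf = freq_ij / max_freq_j              # 20/50 = 0.4
    idf = math.log(N / df_i, log_base)     # log2(10000/2000) = log2(5)
    return tf * idf

w = tf_idf(freq_ij=20, max_freq_j=50, N=10_000, df_i=2_000)
print(round(w, 3))   # 0.929 (the slide rounds the idf to 2.32 first, giving 0.928)
```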
Similarity Measure
• A similarity measure is a function that computes the degree of similarity (or dissimilarity) between document dj and the user's query q:

  sim(dj, q) = (dj · q) / (|dj| |q|)
             = Σi wi,j wi,q / ( sqrt(Σi wi,j²) · sqrt(Σi wi,q²) ),  i = 1..n

• Using a similarity score between the query and each document:
• It is possible to apply best matching, so that documents are ranked for retrieval in order of presumed relevance.
• It is possible to enforce a threshold so that we can control the size of the retrieved set of documents.
Vector Space with Term Weights and Cosine Similarity Measure

  Di = (d1i, w1di; d2i, w2di; …; dti, wtdi)
  Q  = (q1i, w1qi; q2i, w2qi; …; qti, wtqi)

[Figure: Q, D1, and D2 plotted in the two-dimensional space of Term A vs Term B]

Example in a two-term space (Term A, Term B):
  Q  = (0.4, 0.8)
  D1 = (0.8, 0.3)
  D2 = (0.2, 0.7)

  sim(Q, Di) = Σj wjq wjdi / ( sqrt(Σj wjq²) · sqrt(Σj wjdi²) )

  sim(Q, D2) = (0.4*0.2 + 0.8*0.7) / ( sqrt(0.4² + 0.8²) * sqrt(0.2² + 0.7²) )
             = 0.64 / (0.894 * 0.728) = 0.64 / 0.651 ≈ 0.98

  sim(Q, D1) = (0.4*0.8 + 0.8*0.3) / ( sqrt(0.4² + 0.8²) * sqrt(0.8² + 0.3²) )
             = 0.56 / (0.894 * 0.854) = 0.56 / 0.764 ≈ 0.73
Example: Vector-Space Model
• Suppose the user queries for Q = "gold silver truck". The database collection consists of three documents with the following content:
D1: "Shipment of gold damaged in a fire"
D2: "Delivery of silver arrived in a silver truck"
D3: "Shipment of gold arrived in a truck"
• Show the retrieval results in ranked order:
1. Assume that full-text terms are used during indexing, without removing common terms/stop words, and that no terms are stemmed.
2. Assume that content-bearing terms are selected during indexing.
3. Also compare your results with and without normalizing term frequency.
Example VSM: Weighting

  Terms    | Counts (TF)      | DF | IDF   | Wi = TF*IDF
           | Q   D1  D2  D3   |    |       | Q      D1     D2     D3
  arrive   | 0   0   1   1    | 2  | 0.176 | 0      0      0.176  0.176
  damage   | 0   1   0   0    | 1  | 0.477 | 0      0.477  0      0
  deliver  | 0   0   1   0    | 1  | 0.477 | 0      0      0.477  0
  fire     | 0   1   0   0    | 1  | 0.477 | 0      0.477  0      0
  gold     | 1   1   0   1    | 2  | 0.176 | 0.176  0.176  0      0.176
  silver   | 1   0   2   0    | 1  | 0.477 | 0.477  0      0.954  0
  ship     | 0   1   0   1    | 2  | 0.176 | 0      0.176  0      0.176
  truck    | 1   0   1   1    | 2  | 0.176 | 0.176  0      0.176  0.176
Example VSM: Weighting

  Terms    | Q      D1     D2     D3
  arrive   | 0      0      0.176  0.176
  damage   | 0      0.477  0      0
  deliver  | 0      0      0.477  0
  fire     | 0      0.477  0      0
  gold     | 0.176  0.176  0      0.176
  silver   | 0.477  0      0.954  0
  ship     | 0      0.176  0      0.176
  truck    | 0.176  0      0.176  0.176
Example VSM: Similarity Measure
• Compute the cosine similarity between the query and each document.
• First, for each document and the query, compute the vector lengths (zero terms ignored):

  |d1| = sqrt(0.477² + 0.477² + 0.176² + 0.176²) = sqrt(0.517) = 0.719
  |d2| = sqrt(0.176² + 0.477² + 0.954² + 0.176²) = sqrt(1.1996) = 1.095
  |d3| = sqrt(0.176² + 0.176² + 0.176² + 0.176²) = sqrt(0.124) = 0.352
  |q|  = sqrt(0.176² + 0.477² + 0.176²) = sqrt(0.2896) = 0.538

• Next, compute the dot products (zero products ignored):

  q · d1 = 0.176*0.176 = 0.0310
  q · d2 = 0.954*0.477 + 0.176*0.176 = 0.4862
  q · d3 = 0.176*0.176 + 0.176*0.176 = 0.0620
Example VSM: Ranking
Now, compute the similarity scores:
  sim(q,d1) = 0.0310 / (0.538 * 0.719) = 0.080
  sim(q,d2) = 0.4862 / (0.538 * 1.095) = 0.825
  sim(q,d3) = 0.0620 / (0.538 * 0.352) = 0.327
Finally, we sort and rank the documents in descending order of similarity score:
  Rank 1: Doc 2 = 0.825
  Rank 2: Doc 3 = 0.327
  Rank 3: Doc 1 = 0.080
• Exercise: using normalized TF, rank the documents with the cosine similarity measure. Hint: normalize the TF of term i in doc j using the maximum frequency of any term k in document j.
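The whole worked example can be reproduced with a short script (a sketch under the same assumptions as the tables: raw TF, IDF = log10(N/df), cosine similarity):

```python
import math
from collections import Counter

# The three documents, reduced to stemmed content-bearing terms
docs = {
    "d1": ["ship", "gold", "damage", "fire"],
    "d2": ["deliver", "silver", "arrive", "silver", "truck"],
    "d3": ["ship", "gold", "arrive", "truck"],
}
query = ["gold", "silver", "truck"]

N = len(docs)
df = Counter(t for terms in docs.values() for t in set(terms))
idf = {t: math.log10(N / df[t]) for t in df}

def weights(terms):
    tf = Counter(terms)
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}   # raw TF * IDF

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    return dot / (math.sqrt(sum(w * w for w in u.values())) *
                  math.sqrt(sum(w * w for w in v.values())))

qw = weights(query)
scores = {d: cosine(qw, weights(terms)) for d, terms in docs.items()}
for d, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(d, round(s, 3))   # ranking: d2, then d3, then d1
```

Running it reproduces the slide's ranking (d2 ≈ 0.825, d3 ≈ 0.327, d1 ≈ 0.080), which confirms the corrected q·d1 arithmetic above.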
Vector-Space Model
• Advantages:
• Term weighting improves the quality of the answer set, since it helps display relevant documents in ranked order
• Partial matching allows retrieval of documents that approximate the query conditions
• The cosine ranking formula sorts documents according to their degree of similarity to the query
• Disadvantages:
• Assumes independence of index terms; it does not relate one term to another, which makes relevance ranking that captures semantic relationships challenging
• Computationally expensive, since it measures the similarity between each document and the query
Exercise 1
Suppose the database collection consists of the following documents.
c1: Human machine interface for Lab ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user-perceived response time to error measure
m1: The generation of random, binary, unordered trees
m2: The intersection graph of paths in trees
m3: Graph minors: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey
Query:
Find documents relevant to "human computer interaction"
Exercise 2
• Consider these documents:
Doc 1 breakthrough drug for schizophrenia
Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia
Doc 4 new hopes for schizophrenia patients
• Draw the term-document incidence matrix for this document collection.
• Draw the inverted index representation for this collection.
• For the document collection shown above, what are the returned results for the queries:
• schizophrenia AND drug
• for AND NOT(drug OR approach)
Probabilistic Model
• IR is an uncertain process:
• Mapping an information need to a query is not perfect
• Mapping documents to index terms is only a logical representation
• Query terms and index terms often mismatch
• This situation leads to several statistical approaches: probability theory, fuzzy logic, theory of evidence, language modeling, etc.
• The probabilistic retrieval model is a rigorous formal model that attempts to predict the probability that a given document will be relevant to a given query, i.e. Prob(R | q, di)
• It uses probability to estimate the "odds" of relevance of a query to a document.
• It relies on accurate estimates of probabilities
Probability Ranking Principle
• The relevance of a given document to a user's query can be determined by a probability score
• A high probability P(R | di, q) means users are more likely to find relevant information by reading document di
• A probabilistic retrieval model follows the probability ranking principle:
• You have a collection of documents
• A set of relevant documents needs to be returned for queries issued by users
• Intuitively, we want the "best" document to be ranked first, the second best second, etc.
• According to the probability ranking principle, documents are ranked in decreasing order of probability of relevance to the user's information need
Term Existence in Relevant Documents
N = the total number of documents in the collection
n = the total number of documents that contain term ti
R = the total number of relevant documents retrieved
r = the total number of relevant documents retrieved that contain term ti

Document relevance contingency table for term ti:

                               Relevant docs | Non-relevant docs | Total
  Docs containing term ti           r        |       n - r       |   n
  Docs not containing term ti     R - r      |  N - R - (n - r)  | N - n
  Total                             R        |       N - R       |   N

Computing term probabilities (the relevance weight of term ti):

  wi = log [ (r + 0.5)(N - n - R + r + 0.5) / ((n - r + 0.5)(R - r + 0.5)) ]
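The relevance weight above (the Robertson/Sparck Jones formula with 0.5 smoothing) can be sketched as a function of the four contingency-table counts. Using log base 10 is an assumption, chosen to match the worked examples in these slides:

```python
import math

def relevance_weight(N, n, R, r):
    """Term relevance weight from the contingency table:
    N docs in total, n contain the term, R are relevant, r of those contain it."""
    num = (r + 0.5) * (N - n - R + r + 0.5)
    den = (n - r + 0.5) * (R - r + 0.5)
    return math.log10(num / den)

# e.g. N=3 docs, term in n=2 of them, R=1 relevant doc which contains it (r=1)
print(round(relevance_weight(N=3, n=2, R=1, r=1), 3))   # 0.477
```

With no relevance information (R = r = 0) the formula reduces to log[(N - n + 0.5) / (n + 0.5)], an IDF-like weight.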
Probabilistic Model Example

Document vectors <tfd,t> over six documents (d = 1..6) and the terms:
  cold  day  eat  hot  lot  nine  old  pea  pizza  pot
[per-document incidence rows not recoverable from the source: d1 has 4 terms, d2-d4 have 3 each, d5-d6 have 2 each]

wt: cold 0.26, day 0.56, eat 0.56, hot 0.26, lot 0.56, nine 0.56, old 0.56, pea 0.0, pizza 0.0, pot 0.26

• q1 = eat
• q2 = eat pizza
• q4 = eat hot pizza
Improving the Ranking
• Now, suppose
• we have shown the initial ranking to the user
• the user has labeled some of the documents as relevant ("relevance feedback")
• We now have
• N documents in collection, R are known relevant documents
• ni documents containing ti, out of which ri are relevant
Relevance-Weighted Example

Document vectors <tfd,t> over the same six documents and terms:
  cold  day  eat  hot  lot  nine  old  pea  pizza  pot
[per-document incidence rows as in the previous example; document 2 = {pea, pizza, pot} is labeled R (relevant), all other documents NR]

wt: cold -0.33, day 0.00, eat 0.00, hot -0.33, lot 0.00, nine 0.00, old 0.00, pea 0.62, pizza 0.62, pot 0.95

• query = hot pizza
• Document 2 is relevant
Probabilistic Retrieval Example
• D1: "Cost of paper is up." (relevant)
• D2: "Cost of jellybeans is up." (not relevant)
• D3: "Salaries of CEO's are up." (not relevant)
• D4: "Paper: CEO's labor cost up." (????)

        cost   paper  jellybean  salary  CEO     labor  up      Relevance
  D1     1      1       0          0       0       0      1      R
  D2     1      0       1          0       0       0      1      NR
  D3     0      0       0          1       1       0      1      NR
  D4     1      1       0          0       1       1      1      ??
  wij   0.477  1.176  -0.477     -0.477  -0.477  0.222  -0.222

(weights computed over the judged documents D1-D3, with R = 1 relevant document)

Scoring each document by summing the weights of the terms it contains:
• D1 = 0.477 + 1.176 - 0.222 = 1.431
• D2 = 0.477 - 0.477 - 0.222 = -0.222
• D3 = -0.477 - 0.477 - 0.222 = -1.176
• D4 = 1.176 - 0.477 + 0.222 + 0.477 - 0.222 = 1.176
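A sketch that reproduces these scores end to end (weights estimated from the judged documents D1-D3 with R = 1; log base 10 assumed, matching the table):

```python
import math

def rsj_weight(N, n, R, r):
    # Robertson/Sparck Jones relevance weight with 0.5 smoothing
    return math.log10(((r + 0.5) * (N - n - R + r + 0.5)) /
                      ((n - r + 0.5) * (R - r + 0.5)))

judged = {                                    # the three judged documents
    "D1": ({"cost", "paper", "up"}, True),    # relevant
    "D2": ({"cost", "jellybean", "up"}, False),
    "D3": ({"salary", "ceo", "up"}, False),
}
vocab = {"cost", "paper", "jellybean", "salary", "ceo", "labor", "up"}
N = len(judged)
R = sum(rel for _, rel in judged.values())

w = {}
for t in vocab:
    n = sum(t in terms for terms, _ in judged.values())
    r = sum(t in terms and rel for terms, rel in judged.values())
    w[t] = rsj_weight(N, n, R, r)

d4 = {"cost", "paper", "ceo", "labor", "up"}  # the unjudged document
score_d4 = sum(w[t] for t in d4)
print(round(score_d4, 3))   # 1.176
```

Note how "labor", which occurs in no judged document, still gets a positive weight from the 0.5 smoothing terms.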
Exercise
• Consider the collection below. The collection has 5 documents, and each document is described by two terms. The initial guess of relevance to a particular query Q is given in the table below. Assuming the query Q has a total of 2 relevant documents in this collection, answer the following questions:
Document T1 T2 Relevance
D1 1 1 R
D2 0 1 NR
D3 1 0 NR
D4 1 0 R
D5 0 1 NR
• Using the probabilistic term-weighting formula, calculate the new weight for each of the query terms in Q
• Rank the documents according to their probability of relevance to the new query
Probabilistic model
• Probabilistic model uses probability theory to model the uncertainty in the
retrieval process
• Assumptions are made explicit
• Term weight without relevance information is IDF
• Relevance feedback can improve the ranking by giving better term probability
estimates
• Advantages of the probabilistic model over vector-space
• Strong theoretical basis
• Since the base is probability theory, it is very well understood
• Easy to extend
• Disadvantages
• Models are often complicated
• No term frequency weighting
• Which is better: vector-space or probabilistic?
• Both are approximately as good as each other
• Depends on the collection, the query, and other factors
Thank you
