4 IR Models
• Why IR models?
• Boolean IR Model
• Vector space IR model
• Probabilistic IR model
What is Information Retrieval?
• Information retrieval is the process of searching a large, unstructured corpus for relevant documents that satisfy the user's information need.
• It is a tool that finds and selects, from a collection of items, a subset that serves the user's purpose.
• Much IR research focuses specifically on text retrieval, but there are many other interesting areas:
Cross-language vs. multilingual information retrieval
Multimedia (audio, video & image) information retrieval (QBIC, WebSeek, SaFe)
Question answering (AskJeeves, Answerbus)
Digital and virtual libraries
Information Retrieval Serves as a Bridge
• An information retrieval system serves as a bridge between the world of authors and the world of readers/users.
• That is, writers present a set of ideas in a document using a set of concepts. Users then query the IR system for relevant documents that satisfy their information need.
[Figure: the IR system as a black box between the user and the documents]
Typical IR System Architecture
[Figure: a query string and the document corpus are the inputs to the IR system; the output is a ranked list of relevant documents (1. Doc1, 2. Doc2, 3. Doc3, ...)]
Our focus during IR system design
• Improving the effectiveness of the system
• The concern here is retrieving more documents that are relevant to the user's query
• Effectiveness of the system is measured in terms of precision, recall, …
• Main emphasis: text operations (such as stemming, stopword removal, normalization, etc.), weighting schemes, matching algorithms, …
• Improving the efficiency of the system
• The concern here is
• reducing search time, indexing time, access time, …
• reducing the storage space requirement of the system
• space-time tradeoffs
• Main emphasis:
• Compression
• Index term selection (free text or content-bearing terms)
• Indexing structures
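As a minimal sketch of the effectiveness measures named above, the following Python snippet computes precision and recall for a single query. The document IDs and the relevance judgments are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document IDs returned by the system
    relevant:  document IDs judged relevant (ground truth)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 judged relevant, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)
```

Precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that were retrieved.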
Subsystems of IR system
The two subsystems of an IR system: Indexing and Searching
• Indexing:
• An offline process of organizing documents using keywords extracted from the collection
• Indexing is used to speed up access to the desired information in the document collection for a user's query
• Searching:
• An online process that scans the document corpus to find relevant documents that match the user's query
Indexing Subsystem
[Figure: indexing pipeline. documents → assign document identifier → document IDs → tokenization → tokens → stopword removal → non-stoplist tokens → stemming & normalization → stemmed terms → term weighting → weighted index terms → index file]
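The indexing steps above can be sketched in a few lines of Python. This is only an illustration: the stoplist is a tiny hypothetical one, `naive_stem` is a toy suffix stripper standing in for a real stemmer, and raw term frequency stands in for the term-weighting step.

```python
from collections import defaultdict

STOPWORDS = {"a", "in", "of", "the"}  # hypothetical, tiny stoplist


def naive_stem(token):
    """Toy suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def build_index(docs):
    """Map each stemmed term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        tokens = text.lower().split()                        # tokenization
        tokens = [t for t in tokens if t not in STOPWORDS]   # stopword removal
        for term in map(naive_stem, tokens):                 # stemming
            # Raw frequency is used here in place of a real weighting scheme.
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


docs = {
    1: "Shipment of gold damaged in a fire",
    2: "Delivery of silver arrived in a silver truck",
}
index = build_index(docs)
print(index["silver"])
```

The resulting dictionary plays the role of the index file: each term points to the documents that contain it.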
Searching Subsystem
[Figure: searching pipeline. query → parse query → query tokens → stopword removal → non-stoplist tokens → stemming & normalization → stemmed terms → term weighting → query terms → similarity measure against the index terms from the index → relevant document set → ranking → ranked document set]
IR Models - Basic Concepts
IR systems usually adopt index terms to index and retrieve documents
Each document is represented by a set of representative keywords or index terms (called a Bag of Words)
• An index term is a word useful for remembering the document's main themes
• Not all terms are equally useful for representing the document contents: less frequent terms allow identifying a narrower set of documents
• But no ordering information is attached to the Bag of Words identified from the document collection.
IR Models - Basic Concepts
• One central problem regarding IR systems is the issue of predicting the degree of relevance of documents for a given query
Such a decision is usually dependent on a ranking algorithm which attempts to establish a simple ordering of the documents retrieved
Documents appearing at the top of this ordering are considered more likely to be relevant
• Thus ranking algorithms are at the core of IR systems
The IR models determine the predictions of what is relevant and what is not, based on the notion of relevance implemented by the system
IR models
[Figure: taxonomy of IR models; only the label "Probabilistic relevance" is recoverable]
How to find relevant documents for a query?
• Step 1: Map documents & queries into the term-document vector space. Note that queries are treated as short documents
• Represent both documents & queries as N-dimensional vectors in a term-document matrix, which shows the occurrence of terms in the document collection or query
$$d_j = (t_{1,j}, t_{2,j}, \ldots, t_{N,j}); \quad q_k = (t_{1,k}, t_{2,k}, \ldots, t_{N,k})$$
– The document collection is mapped to a term-by-document matrix
– View each document as a vector in a multidimensional space
• Nearby vectors are related

      T1   T2   …   TN
D1    …    …    …   …
D2    …    …    …   …
:     :    :        :
DM    …    …    …   …
Qi    …    …    …   …
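A minimal sketch of Step 1, assuming whitespace tokenization and raw term counts as the matrix entries. The three documents and the query are the ones used in the Boolean model example later in these slides; the function name is hypothetical.

```python
def term_document_matrix(docs, query):
    """Return (vocabulary, {doc name: count vector}, query count vector)."""
    texts = list(docs.values()) + [query]
    vocab = sorted({t for text in texts for t in text.lower().split()})

    def to_vector(text):
        tokens = text.lower().split()
        return [tokens.count(term) for term in vocab]  # one entry per term

    matrix = {name: to_vector(text) for name, text in docs.items()}
    return vocab, matrix, to_vector(query)


docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
vocab, matrix, q_vec = term_document_matrix(docs, "gold silver truck")
for name, row in matrix.items():
    print(name, row)
print("Q ", q_vec)
```

Every document and the query end up as vectors of the same length N, the size of the vocabulary.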
How to find relevant documents for a query?
• Step 2: Queries and documents are represented as weighted vectors, wij
Why do we need weighting techniques?
To know the importance of a term in describing the content of a given document.
There are binary and non-binary weighting techniques. Is there any difference between the two?
What method do you recommend to compute the weights of term i in document j and in query q, wij and wiq?
• An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term doesn’t exist in the document.
• Normalize for vector length to avoid the effect of document length

      T1    T2    …   TN
D1    w11   w12   …   w1N
D2    w21   w22   …   w2N
:     :     :         :
DM    wM1   wM2   …   wMN
Qi    wi1   wi2   …   wiN
How to find relevant documents for a query?
• Step 3: Rank documents (in increasing or decreasing order) based on their closeness to the query.
Documents are ranked by the degree of their closeness to the query.
How is the closeness of a document to the query measured?
It is determined by a similarity/dissimilarity score calculation
How many matching (similarity/dissimilarity) measures do you know? Which one is best for IR?
$$sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|} = \frac{\sum_{i=1}^{n} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2} \, \sqrt{\sum_{i=1}^{n} w_{i,q}^2}}$$
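The cosine similarity used in Step 3 can be sketched as follows. The weight vectors in the example are hypothetical and only serve to show the ranking.

```python
import math


def cosine_similarity(d, q):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an all-zero vector shares no terms with anything
    return dot / (norm_d * norm_q)


def rank(doc_vectors, q):
    """Return (name, score) pairs sorted by decreasing similarity to q."""
    scores = {name: cosine_similarity(vec, q) for name, vec in doc_vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical weights over a three-term vocabulary.
doc_vectors = {"D1": [1, 0, 1], "D2": [0, 1, 0], "D3": [1, 1, 0]}
query = [1, 1, 0]
ranking = rank(doc_vectors, query)
for name, score in ranking:
    print(name, round(score, 3))
```

Dividing by the vector lengths is what removes the effect of document length: a long document does not score higher simply because it contains more terms.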
How to evaluate Models?
• We need to investigate what procedures the IR Models follow and what
techniques they use:
• What is the weighting technique used by the IR Models for measuring importance of
terms in documents?
• Are they using binary or non-binary weight?
• What is the matching technique used by the IR models?
• Are they measuring similarity or dissimilarity?
• Are they applying exact matching or partial matching in the course of finding relevant
documents for a given query?
• Are they applying best matching principle to measure the degree of relevance of
documents to display in ranked-order?
• Is there any Ranking mechanism applied before displaying relevant documents for the users?
The Boolean Model
• The Boolean model is a simple model based on set theory
The Boolean model imposes a binary criterion for deciding relevance
Documents must exactly match the query
• Terms are either present or absent. Thus, $w_{ij} \in \{0, 1\}$
• $sim(q, d_j) = 1$ if the document satisfies the Boolean query, $0$ otherwise
- Note that no weights between 0 and 1 are assigned; the only values are 0 or 1

      T1    T2    …   TN
D1    w11   w12   …   w1N
D2    w21   w22   …   w2N
:     :     :         :
DM    wM1   wM2   …   wMN
The Boolean Model: Example
Given the following three documents, construct the term-document matrix and find the relevant documents retrieved by the Boolean model for the query “gold silver truck”
• D1: “Shipment of gold damaged in a fire”
• D2: “Delivery of silver arrived in a silver truck”
• D3: “Shipment of gold arrived in a truck”
Table below shows document –term (ti) matrix
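A minimal sketch of this example in Python, assuming whitespace tokenization. The source does not say which Boolean operators connect the query terms, so both the AND and the OR reading are shown; the helper names are hypothetical.

```python
docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
query_terms = ["gold", "silver", "truck"]

# Binary incidence: 1 if the term occurs in the document, else 0.
doc_terms = {name: set(text.lower().split()) for name, text in docs.items()}
incidence = {
    term: {name: int(term in terms) for name, terms in doc_terms.items()}
    for term in query_terms
}


def boolean_and(terms):
    """Documents containing every term (assumed AND reading)."""
    return [d for d in docs if all(incidence[t][d] for t in terms)]


def boolean_or(terms):
    """Documents containing at least one term (assumed OR reading)."""
    return [d for d in docs if any(incidence[t][d] for t in terms)]


for term, row in incidence.items():
    print(term, row)
print("AND:", boolean_and(query_terms))
print("OR: ", boolean_or(query_terms))
```

Under the AND reading no document qualifies, because none contains all three terms; under the OR reading every document qualifies. In both cases the result is an unranked set, which is the main limitation of exact matching.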
• These term weights are used to compute a degree of similarity between a query and each document
Ranked set of documents provides for better matching
$$w_{ij} = \frac{freq(i,j)}{\max_k freq(k,j)} \times \log(N/n_i)$$
$$w_{iq} = \left(0.5 + 0.5 \times \frac{freq(i,q)}{\max_k freq(k,q)}\right) \times \log(N/n_i)$$
Example: Computing weights
• A collection includes 10,000 documents
The term tA appears 20 times in a particular document j
The maximum frequency of any term tk in document j is 50
The term tA appears in 2,000 documents of the collection.
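The numbers in this example can be plugged into the document weight formula (normalized term frequency times log(N/n_i)). The sketch below assumes a base-10 logarithm, which the source does not state explicitly; the function name is hypothetical.

```python
import math


def tf_idf_weight(freq, max_freq, n_docs, doc_freq):
    """Normalized term frequency times inverse document frequency."""
    tf = freq / max_freq                  # 20 / 50 = 0.4
    idf = math.log10(n_docs / doc_freq)   # log10(10000 / 2000) = log10(5)
    return tf * idf


w = tf_idf_weight(freq=20, max_freq=50, n_docs=10_000, doc_freq=2_000)
print(round(w, 3))
```

With these inputs the weight comes out at roughly 0.28. A different logarithm base would scale every weight by the same constant and leave the ranking unchanged.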
• First, for each document and query, compute all vector lengths (zero terms ignored)
$$|d_1| = \sqrt{0.477^2 + 0.477^2 + 0.176^2 + 0.176^2} = \sqrt{0.517} = 0.719$$
$$|d_2| = \sqrt{0.176^2 + 0.477^2 + 0.954^2 + 0.176^2} = \sqrt{1.1996} = 1.095$$
$$|d_3| = \sqrt{0.176^2 + 0.176^2 + 0.176^2 + 0.176^2} = \sqrt{0.124} = 0.352$$
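The lengths above appear consistent with weights of raw term frequency times log10(N/n_i) on the three-document gold/silver/truck collection from the Boolean model example (0.477 ≈ log10 3, 0.176 ≈ log10 1.5). That reading is an assumption inferred from the numbers, not something the source states. Under it, the lengths can be recomputed as a check:

```python
import math

docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
N = len(docs)
tokens = {name: text.lower().split() for name, text in docs.items()}
vocab = sorted({t for toks in tokens.values() for t in toks})

# Document frequency and (assumed base-10) inverse document frequency.
df = {t: sum(t in toks for toks in tokens.values()) for t in vocab}
idf = {t: math.log10(N / df[t]) for t in vocab}


def weights(toks):
    """Assumed weighting: raw term frequency times idf."""
    return {t: toks.count(t) * idf[t] for t in vocab}


def length(w):
    """Euclidean length of a weight vector."""
    return math.sqrt(sum(v * v for v in w.values()))


lengths = {name: length(weights(toks)) for name, toks in tokens.items()}
for name, value in lengths.items():
    print(name, round(value, 3))
```

Terms that occur in all three documents ("of", "in", "a") get an idf of zero and drop out, which matches the "zero terms ignored" note. The last digit can differ slightly from the slide values because the slide rounds each weight to three decimals before squaring.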
• Disadvantages:
• Assumes independence of index terms. It doesn’t relate one term to another, which makes it hard to capture semantic relationships in relevance ranking
• Computationally expensive since it measures the similarity between each document and the query
Exercise 1
Suppose the database collection consists of the following documents.
c1: Human machine interface for Lab ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user-perceived response time to error measure
M1: The generation of random, binary, unordered trees
M2: The intersection graph of paths in trees
M3: Graph minors: Widths of trees and well-quasi-ordering
M4: Graph minors: A survey
Query:
Find documents relevant to "human computer interaction"
Exercise 2
• Consider these documents:
Doc 1 breakthrough drug for schizophrenia
Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia
Doc 4 new hopes for schizophrenia patients
• Draw the term-document incidence matrix for this document collection.
• Draw the inverted index representation for this collection.
• For the document collection shown above, what are the returned results
for the queries:
• schizophrenia AND drug
• for AND NOT(drug OR approach)
Probabilistic Model
• IR is an uncertain process
• Mapping Information need to Query is not perfect
• Mapping Documents to index terms is a logical representation
• Query terms and index terms mostly mismatch
• Using the probabilistic term weighting formula, calculate the new weight for each of the query terms in Q
• Rank the documents according to their probability of relevance with the new query
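The source does not spell out the probabilistic term weighting formula. One commonly used choice is the Robertson-Sparck Jones relevance weight with 0.5 smoothing; the sketch below uses it purely as an assumption, and the collection statistics in the example are hypothetical.

```python
import math


def rsj_weight(N, n, R=0, r=0):
    """Robertson-Sparck Jones term weight with 0.5 smoothing (assumed formula).

    N: number of documents in the collection
    n: number of documents containing the term
    R: number of documents known to be relevant (from feedback)
    r: number of relevant documents containing the term
    """
    return math.log(
        ((r + 0.5) * (N - n - R + r + 0.5))
        / ((n - r + 0.5) * (R - r + 0.5))
    )


# Hypothetical statistics: 1,000 documents, the term occurs in 100 of them.
no_feedback = rsj_weight(N=1000, n=100)
# After feedback: 10 documents judged relevant, 8 of them contain the term.
with_feedback = rsj_weight(N=1000, n=100, R=10, r=8)
print(round(no_feedback, 3), round(with_feedback, 3))
```

Without relevance information (R = r = 0) the weight reduces to log((N - n + 0.5) / (n + 0.5)), an IDF-like quantity. When feedback shows the term concentrated in relevant documents, its weight increases.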
Probabilistic model
• Probabilistic model uses probability theory to model the uncertainty in the
retrieval process
• Assumptions are made explicit
• Term weight without relevance information is IDF
• Relevance feedback can improve the ranking by giving better term probability
estimates
• Advantages of the probabilistic model over vector‐space
• Strong theoretical basis
• Since the base is probability theory, it is very well understood
• Easy to extend
• Disadvantages
• Models are often complicated
• No term frequency weighting
• Which is better: vector‐space or probabilistic?
• Both are approximately as good as each other
• Depends on collection, query, and other factors
Thank you