Boolean Retrieval
Ayan Das
Classic IR models
Boolean model
Vector Space model
Probabilistic model
1 ki - an index term
2 dj - a document
3 t - total number of index terms
4 K = {k1, k2, ..., kt} - set of all index terms
5 wij - weight associated with (ki, dj); wij = 0 indicates absence of ki in dj
6 vec(dj) = (w1j, w2j, ..., wtj) - weight vector holding the weights associated
with the index terms of dj
7 gi(vec(dj)) - function returning the weight associated with (ki, dj), i.e., gi(vec(dj)) = wij
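In the Boolean model the index-term weights are binary, so the notation above specializes as follows (a standard restatement of the definitions, written in LaTeX notation):

w_{ij} \in \{0, 1\}, \qquad \vec{d_j} = (w_{1j}, w_{2j}, \ldots, w_{tj}) \in \{0,1\}^t, \qquad g_i(\vec{d_j}) = w_{ij}

A document is retrieved for a query exactly when its binary weights satisfy the query's Boolean expression.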
Term-Document Matrix
Inverted Index
A possible solution
A linear scan of documents (BRUTE FORCE).
1 grep for all plays containing the words BRUTUS and CAESAR.
2 From them, strip out all the plays containing the word CALPURNIA.
Cons
1 Slow for large data collections (e.g., the web, which contains billions or
trillions of words)
A better solution: organize and index the documents into a better
representation that enables more efficient search.
Answering Brutus AND Caesar AND (NOT Calpurnia) with bitwise operations on the term-document incidence vectors:

Brutus                                  110100
Caesar                                  110111
Calpurnia                               010000

Brutus AND Caesar                       110100
NOT Calpurnia                           101111
Brutus AND Caesar AND (NOT Calpurnia)   100100
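A minimal sketch of the same evaluation in Python, using integers as bit vectors over the six plays (the 0/1 patterns are the ones shown above; everything else is illustrative):

# Incidence vectors from the example: one bit per play, leftmost bit = play 1.
incidence = {
    "Brutus":    int("110100", 2),
    "Caesar":    int("110111", 2),
    "Calpurnia": int("010000", 2),
}

n_docs = 6
mask = (1 << n_docs) - 1  # keep NOT within the 6-document universe

# Brutus AND Caesar AND (NOT Calpurnia)
result = incidence["Brutus"] & incidence["Caesar"] & (~incidence["Calpurnia"] & mask)

print(format(result, f"0{n_docs}b"))
# -> 100100, i.e. plays 1 and 4 satisfy the query.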
Figure: structure of an inverted index - the dictionary of terms, each pointing to its postings list; every posting records a document containing the term, and the retrieval result is read off the postings.
Preprocessing
1 Collect documents to be indexed
2 Tokenize the text, turning each document into a list of tokens
3 Identify the index terms to form the vocabulary
4 Do linguistic pre-processing, producing a list of normalized tokens,
which are the indexing terms (steps 2-4 are sketched below)
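A minimal sketch of steps 2-4 in Python, assuming simple regex tokenization and a toy suffix-stripping rule that stands in for real linguistic preprocessing (the sample sentence is made up for illustration):

import re

def preprocess(text):
    # Step 2: tokenize on runs of letters/digits, lowercasing everything.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Steps 3-4: normalize each token; a toy rule stripping a trailing
    # "-es"/"-s" stands in for proper stemming or lemmatization.
    terms = []
    for tok in tokens:
        if tok.endswith("es"):
            tok = tok[:-2]
        elif tok.endswith("s"):
            tok = tok[:-1]
        terms.append(tok)
    return terms

print(preprocess("New vaccines: new hope against Covid."))
# -> ['new', 'vaccin', 'new', 'hope', 'against', 'covid']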
Term-docID pairs in document order    Sorted by term, then docID
breakthrough 1                        against 3
vaccin 1                              approach 3
covid 1                               breakthrough 1
new 2                                 covid 1
covid 2                               covid 2
vaccin 2                              covid 3
new 3                                 covid 4
approach 3                            hope 4
vaccin 3                              new 2
against 3                             new 3
covid 3                               new 4
new 4                                 patient 4
hope 4                                vaccin 1
covid 4                               vaccin 2
patient 4                             vaccin 3
Building Inverted Index
Multiple term entries in a single document are merged.
Split into Dictionary and Postings
Document frequency information is added to dictionary entries.
Sorted term-docID pairs: against 3, approach 3, breakthrough 1, covid 1,
covid 2, covid 3, covid 4, hope 4, new 2, new 3, new 4, patient 4,
vaccin 1, vaccin 2, vaccin 3

Resulting dictionary (term, document frequency) and postings:
against       1  ->  3
approach      1  ->  3
breakthrough  1  ->  1
covid         4  ->  1 2 3 4
hope          1  ->  4
new           3  ->  2 3 4
patient       1  ->  4
vaccin        3  ->  1 2 3
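Putting the pieces together, a compact sketch of inverted-index construction in Python (the terms and docIDs follow the running example; the data-structure choices are illustrative):

from collections import defaultdict

# docID -> index terms, after tokenization and normalization (running example).
docs = {
    1: ["breakthrough", "vaccin", "covid"],
    2: ["new", "covid", "vaccin"],
    3: ["new", "approach", "vaccin", "against", "covid"],
    4: ["new", "hope", "covid", "patient"],
}

postings = defaultdict(set)            # term -> set of docIDs (merges duplicates)
for doc_id, terms in docs.items():
    for term in terms:
        postings[term].add(doc_id)

# Dictionary with document frequency; postings kept sorted by docID.
index = {term: sorted(ids) for term, ids in postings.items()}
for term in sorted(index):
    print(f"{term:<13} df={len(index[term])}  ->  {index[term]}")
# e.g.  covid  df=4 -> [1, 2, 3, 4]   vaccin df=3 -> [1, 2, 3]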
1 Term vocabulary
2 Skip pointers
3 Phrase queries
4 Dictionary structures
Tokenization
Stop words
Normalization
Stemming and Lemmatization
Hyphens
Hewlett-Packard
Hewlett and Packard as two tokens?
state-of-the-art
co-education
lowercase, lower-case, lower case
White Space
San Francisco: one token or two?
red herring: one token or two?
Issues in Tokenization
Asymmetric expansion
Perform the expansion during index construction, e.g., when a document
contains automobile, we also index it under car, and vice versa (a sketch follows).
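A minimal sketch of this index-time expansion, using a hypothetical hand-made equivalence table (only the car/automobile pairing comes from the example; everything else is illustrative):

# Hypothetical equivalence classes, used only for illustration.
expansions = {
    "automobile": ["car"],
    "car": ["automobile"],
}

def index_terms(tokens):
    # Index each token under itself and under its expansions,
    # so a query for either form matches the document.
    terms = []
    for tok in tokens:
        terms.append(tok)
        terms.extend(expansions.get(tok, []))
    return terms

print(index_terms(["new", "automobile", "models"]))
# -> ['new', 'automobile', 'car', 'models']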
Token normalization
Stemming refers to a crude heuristic process that chops off the ends
of words, often removing derivational affixes.
It commonly collapses derivationally related words.
Lemmatization refers to doing things properly with the use of a
vocabulary and morphological analysis of words, normally aiming to
remove inflectional endings only and to return the base or dictionary
form of a word, which is known as the lemma.
It only collapses the different inflectional forms into the corresponding
root forms.
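For instance, with NLTK (assuming the nltk package and its WordNet data are available; any stemmer/lemmatizer pair would show the same contrast):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()   # needs the WordNet corpus: nltk.download("wordnet")

words = ["operating", "operative", "operational"]

print([stemmer.stem(w) for w in words])
# Porter stemming collapses derivationally related forms, here all to "oper".

print([lemmatizer.lemmatize(w, pos="v") for w in words])
# Lemmatization only strips inflectional endings: "operating" -> "operate",
# while "operative" and "operational" are left (essentially) unchanged.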
Points to consider
Where to place the skip pointers?
How to do efficient merging using skip pointers?
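One common answer to both questions, sketched in Python: place a skip every sqrt(L) postings, and follow a skip only while it does not overshoot the other list's current docID (the docIDs below are illustrative):

import math

def add_skips(postings):
    # skip[i] = index we may jump to from position i, or None.
    step = int(math.sqrt(len(postings))) or 1
    return [i + step if i % step == 0 and i + step < len(postings) else None
            for i in range(len(postings))]

def intersect_with_skips(p1, p2):
    s1, s2 = add_skips(p1), add_skips(p2)
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if s1[i] is not None and p1[s1[i]] <= p2[j]:
                i = s1[i]          # jump ahead via the skip pointer
            else:
                i += 1             # otherwise advance one posting at a time
        else:
            if s2[j] is not None and p2[s2[j]] <= p1[i]:
                j = s2[j]
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128],
                           [1, 2, 3, 8, 11, 17, 21, 31]))
# -> [2, 8]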
Nouns and noun groups (N) are usually more significant in queries
than words with other parts of speech (X).
For any string of terms of the form N X* N, the pair of nouns (NN) forms an
extended bi-word, which is indexed as a term in the dictionary.
Extended bi-words
e.g., the phrase cost overruns on a power plant yields the extended bi-words:
1 cost overruns
2 overruns power
3 power plant
Query: (cost overruns) AND (overruns power) AND (power plant)
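A sketch of extended bi-word extraction, assuming the text is already POS-tagged and the tags are collapsed to N (noun) or X (other); the tagger itself is outside this sketch:

def extended_biwords(tagged):
    # tagged: list of (term, tag) with tag "N" or "X".
    # For every N X* N run, emit the pair of nouns as one dictionary term.
    nouns = [i for i, (_, tag) in enumerate(tagged) if tag == "N"]
    # Between two consecutive nouns every term is X by construction,
    # so each consecutive noun pair is an extended bi-word.
    return [f"{tagged[a][0]} {tagged[b][0]}" for a, b in zip(nouns, nouns[1:])]

sentence = [("cost", "N"), ("overruns", "N"), ("on", "X"),
            ("a", "X"), ("power", "N"), ("plant", "N")]
print(extended_biwords(sentence))
# -> ['cost overruns', 'overruns power', 'power plant']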
Store in the posting the positions where the term appears in the
document.
Format: <term, number of docs containing the term;
         doc1: frequency of the term in doc1; pos1, pos2, ...;
         doc2: frequency of the term in doc2; pos1, pos2, ...;
         ...>

to, 993427:
  <1, 6: <7, 18, 33, 72, 86, 231>;
   2, 5: <1, 17, 74, 222, 255>;
   4, 5: <8, 16, 190, 429, 433>;
   5, 2: <363, 367>;
   7, 3: <13, 23, 191>; ...>

be, 178239:
  <1, 2: <17, 25>;
   4, 5: <17, 191, 291, 430, 434>;
   5, 3: <14, 19, 101>; ...>
Query: to be or not to be
Start from the postings lists of the terms in increasing order of
document frequency.
Consider to and be
1 Find the documents containing both terms
2 In those lists, look for positions where be occurs exactly one
position after an occurrence of to
3 For the full phrase, look for a second such occurrence at token
positions 4 higher than the first (the repeated to be)
to: <...; 4: <..., 429, 433>; ...>
be: <...; 4: <..., 430, 434>; ...>
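A sketch of the positional check for the phrase to be, assuming each positional posting is a mapping docID -> sorted positions (the doc-4 positions are taken from the example above):

def phrase_matches(pos_a, pos_b, offset=1):
    # Return {docID: positions of the first word} where word B occurs
    # exactly `offset` positions after word A.
    matches = {}
    for doc in pos_a.keys() & pos_b.keys():      # step 1: docs containing both
        b_positions = set(pos_b[doc])
        hits = [p for p in pos_a[doc] if p + offset in b_positions]  # step 2
        if hits:
            matches[doc] = hits
    return matches

# Positions from the running example (only doc 4 shown in full).
to_postings = {4: [8, 16, 190, 429, 433]}
be_postings = {4: [17, 191, 291, 430, 434]}

print(phrase_matches(to_postings, be_postings))
# -> {4: [16, 190, 429, 433]}   "to be" starts at 16, 190, 429, 433

Step 3 then keeps only pairs of hits whose positions differ by exactly 4; here 429 and 433 qualify, matching to be ... to be.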
Cons
1 Minor variants may be mapped to distant integers (color/colour)
2 No prefix search (free/freely/freedom)
3 Expanding vocabulary may necessitate redesigning the hash function

Figure: query terms k1, k2, k3 are hashed (with collision resolution) to
positions p1, p2, p3 among the dictionary entries t1, t2, ..., t221, ...,
t548, ..., t1024.
Figure: binary search tree over the dictionary; internal nodes split the terms
into ranges (a-hu, hy-m, n-sh, si-z), with the terms themselves (e.g. aardvark,
huygens, sickle, zygote) at the leaves.
Efficient search: O(log M) time if the tree is balanced (M = size of the vocabulary)
Allows prefix search
Balanced: at each node, the depths of the left and right subtrees differ by
at most 1
Insertion and deletion can unbalance the tree
A costly rebalancing step is then required to maintain balance
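The prefix search that the tree enables can be sketched with a sorted vocabulary and binary search, the array analogue of descending the tree to the subtree covering the prefix (the vocabulary below is illustrative):

import bisect

vocabulary = sorted(["against", "approach", "breakthrough", "covid",
                     "free", "freedom", "freely", "hope", "new",
                     "patient", "vaccin"])

def prefix_search(prefix):
    # Terms sharing a prefix form a contiguous range in sorted order,
    # which is exactly what a (balanced) search tree exploits.
    lo = bisect.bisect_left(vocabulary, prefix)
    hi = bisect.bisect_right(vocabulary, prefix + "\uffff")  # upper bound of the range
    return vocabulary[lo:hi]

print(prefix_search("free"))
# -> ['free', 'freedom', 'freely']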
B-tree