Lec2 BooleanRetrieval 1

Information Retrieval (CSD510)

Boolean Retrieval

Ayan Das
Classic IR models

Boolean model
Vector Space model
Probabilistic model

Boolean Retrieval (Ayan Das) Information Retrieval (CSD510) 2 / 60


Basic concepts (Terminology.)

1 k_i is an index term.
2 d_j is a document.
3 t is the total number of index terms.
4 K = {k_1, k_2, ..., k_t} is the set of all index terms.
5 w_ij is the weight associated with the pair (k_i, d_j); w_ij = 0 indicates that k_i is absent from d_j.
6 vec(d_j) = (w_1j, w_2j, ..., w_tj) is the weight vector holding the weights
of all index terms in d_j.
7 g_i(vec(d_j)) is a function returning the weight associated with (k_i, d_j), i.e. g_i(vec(d_j)) = w_ij.



Boolean model
Simple model based on set theory and Boolean algebra.
Documents are sets of terms
Queries are Boolean expressions on terms
Terms are either present or absent: w_ij ∈ {0, 1}.
There are three connectives used
AND (∧): the intersection of two sets
OR (∨): the union of two sets
NOT (¬): the complement of a set (or, combined with AND, set difference)
Document: A set of words (indexing terms) present in a document
each term is either present (1) or absent (0)
Query: A Boolean expression.
Effective terms are index terms.
Operation: Boolean algebra over sets of terms and sets of
documents.
Relevant: A document is relevant to a query expression if it satisfies
the query expression
Boolean Retrieval

Term-Document Matrix
Inverted Index



Example: Boolean retrieval

Document set: All plays of Shakespeare.


Query: BRUTUS AND CAESAR AND NOT CALPURNIA
Task: Find all Shakespeare’s plays that satisfy the query

A possible solution
A linear scan of documents (BRUTE FORCE).
1 grep for all plays containing the words BRUTUS and CAESAR.
2 From them, strip out all the plays containing the word CALPURNIA.
Cons
1 Slow for large collections (e.g., the web, which contains billions or
trillions of words)
A better solution: organize and index the documents into a better
representation that enables more efficient search.



Term-Document Incidence Matrix

Two dimensional: Terms and documents


Matrix element (t, d) = 1 if term t appears in document d

Brutus AND Caesar AND NOT Calpurnia



Term-Document Incidence Matrix

Brutus                                   110100
Caesar                                   110111
Calpurnia                                010000
Brutus AND Caesar                        110100
NOT Calpurnia                            101111
Brutus AND Caesar AND (NOT Calpurnia)    100100
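
The bitwise evaluation above can be reproduced with a minimal sketch that stores each incidence row as a Python integer. The names N_DOCS and MASK are assumptions introduced only for this illustration; the mask is needed because Python integers are unbounded, so NOT must be clipped to the six document positions.

```python
# Incidence vectors from the slide, one bit per play (leftmost bit = first play).
N_DOCS = 6
MASK = (1 << N_DOCS) - 1  # keeps the result of NOT within 6 bits

brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000

# Brutus AND Caesar AND (NOT Calpurnia)
result = brutus & caesar & (~calpurnia & MASK)
bits = format(result, "06b")
print(bits)  # 100100

# 1-based positions (left to right) of the plays that satisfy the query.
matches = [i + 1 for i, bit in enumerate(bits) if bit == "1"]
print(matches)  # [1, 4]
```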
Retrieval result

The incidence matrices are usually sparse.


Difficult to build for a very large document corpus.



Bigger collections

Consider N = 1 million documents, each with about 1000 words.


Avg 6 bytes/word including spaces/punctuation
6GB of data in the documents.
Say there are M = 500K distinct terms among these.



Can’t build the matrix

500K x 1M matrix has half-a-trillion 0’s and 1’s.


But it has no more than one billion 1’s.
The matrix is extremely sparse.
What’s a better representation?
The solution is to record only the 1 positions, i.e., only the documents in which each term appears.
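
The sparsity argument can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted on the slides (1 million documents, about 1000 words each, 500K distinct terms); the variable names are chosen for this illustration.

```python
N = 1_000_000          # documents
WORDS_PER_DOC = 1_000  # average number of word tokens per document
M = 500_000            # distinct terms

cells = M * N                 # entries in the term-document matrix
max_ones = N * WORDS_PER_DOC  # each token can set at most one cell to 1
density = max_ones / cells    # upper bound on the fraction of 1's

print(cells)     # 500000000000
print(max_ones)  # 1000000000
print(density)   # 0.002
```

So at most 0.2% of the cells can be 1, which is why storing only the 1 positions pays off.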



Inverted Index

[Figure: an inverted index, labelling the dictionary, a postings list, and an individual posting]



Building inverted index

Preprocessing
1 Collect documents to be indexed
2 Tokenize the text, turning each document into a list of tokens
3 Identify the index terms to form the vocabulary
4 Do linguistic pre-processing, producing a list of normalized tokens,
which are the indexing terms

Inverted index construction


1 Identify each document by a unique identifier (docID).
2 For each term t in the vocabulary
prepare a list of documents in which the term appears.
sort the list on the docIDs.
3 Postings lists can be implemented using either singly linked lists or variable-length
arrays



Inverted index
Consider the following documents
Doc 1: Breakthrough vaccine for Covid
Doc 2: New Covid vaccine
Doc 3: A new approach to vaccination against Covid
Doc 4: New hopes for Covid patients
Tokens: Breakthrough, vaccine, for, Covid, New, A, new, approach,
to, vaccination, against, hopes, patients
Case normalization: breakthrough, vaccine, for, covid, a, new,
approach, to, vaccination, against, hopes, patients
Stopword removal (a, for, to removed): breakthrough, vaccine, covid, new, approach,
vaccination, against, hopes, patients
Stemming: breakthrough, vaccin, covid, new, approach, against,
hope, patient
Index terms: breakthrough, vaccin, covid, new, approach, against,
hope, patient
Inverted index
Sort by docID              Sort by terms
term          docID        term          docID
breakthrough  1            against       3
vaccin        1            approach      3
covid         1            breakthrough  1
new           2            covid         1
covid         2            covid         2
vaccin        2            covid         3
new           3            covid         4
approach      3            hope          4
vaccin        3            new           2
against       3            new           3
covid         3            new           4
new           4            patient       4
hope          4            vaccin        1
covid         4            vaccin        2
patient       4            vaccin        3
Building Inverted Index
Multiple term entries in a single document are merged.
Split into Dictionary and Postings
Document frequency information is added to dictionary entries.
The sorted (term, docID) pairs are grouped by term:

term          doc. freq.   postings (docIDs)
against       1            3
approach      1            3
breakthrough  1            1
covid         4            1 2 3 4
hope          1            4
new           3            2 3 4
patient       1            4
vaccin        3            1 2 3
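
The construction steps for the four Covid documents can be sketched as below. This is an illustration under stated assumptions, not a definitive implementation: tokenization is a plain whitespace split, the stop list is the three words removed on the slide, and TOY_STEMS is a hypothetical lookup table standing in for a real stemmer.

```python
from collections import defaultdict

docs = {
    1: "Breakthrough vaccine for Covid",
    2: "New Covid vaccine",
    3: "A new approach to vaccination against Covid",
    4: "New hopes for Covid patients",
}

STOPWORDS = {"a", "for", "to"}
# Toy lookup standing in for a real stemmer (an assumption for this sketch).
TOY_STEMS = {"vaccine": "vaccin", "vaccination": "vaccin",
             "hopes": "hope", "patients": "patient"}

def index_terms(text):
    """Tokenize on whitespace, lowercase, drop stop words, apply toy stems."""
    tokens = [t.lower() for t in text.split()]
    return [TOY_STEMS.get(t, t) for t in tokens if t not in STOPWORDS]

def build_index(docs):
    """Map each term to its postings list, sorted by docID."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            postings[term].add(doc_id)  # repeated terms in one document merge
    # Dictionary sorted by term; document frequency is the list length.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

index = build_index(docs)
for term, plist in index.items():
    print(f"{term:<13} df={len(plist)}  {plist}")
```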



Boolean Retrieval

Processing Boolean queries


Term vocabulary and postings lists



Practical considerations

For a practical IR system handling a huge corpus


Postings lists will be stored on disk.
Ideally, retrieve (from disk) only those postings lists that are needed
to answer a query.



Processing Boolean Queries

Consider the query: Brutus AND Calpurnia


1 Locate Brutus in the Dictionary
2 Retrieve its postings
3 Locate Calpurnia in the Dictionary
4 Retrieve its postings
5 Intersect the two postings lists



Intersecting two postings lists (a “merge” algorithm)
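
A minimal sketch of the merge-style intersection, assuming both postings lists are Python lists already sorted by docID. The docIDs in the example are made up for illustration and are not taken from the slides.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(len(p1) + len(p2))."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID present in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the pointer at the smaller docID
        else:
            j += 1
    return answer

# Illustrative postings lists (made-up docIDs).
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```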



Query processing
Query: Brutus AND Calpurnia AND Caesar
For each of the n terms, get its postings, then AND them together.



Query optimization

Process in order of increasing frequency:


start with smallest set, then keep cutting further.

Execute the query as (Calpurnia AND Brutus) AND Caesar.


If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings sorted by docID.
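
The ordering heuristic can be sketched as follows: sort the query terms by postings length (document frequency) and intersect from the shortest list outward. The function intersect_terms and the example postings are assumptions made for this illustration.

```python
def intersect(p1, p2):
    """Linear merge of two postings lists sorted by docID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def intersect_terms(terms, index):
    """AND together the postings of all terms, shortest list first."""
    if not terms:
        return []
    ordered = sorted(terms, key=lambda t: len(index.get(t, [])))
    result = index.get(ordered[0], [])
    for term in ordered[1:]:
        if not result:  # intermediate result already empty, stop early
            break
        result = intersect(result, index.get(term, []))
    return result

# Made-up postings lists for illustration.
index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}
print(intersect_terms(["brutus", "calpurnia", "caesar"], index))  # [2]
```

With these lists the terms are processed as (calpurnia AND brutus) AND caesar, so the intermediate result never grows beyond the shortest list.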



Query processing

(wind OR fire) AND (thunder OR lightning)


Get doc. frequencies for all terms.
Estimate the size of each OR by the sum of its document frequencies.
Process in increasing order of OR sizes.



Query processing

Given the following postings list sizes:

Term            Postings size
eyes            213312
kaleidoscope     87009
marmalade       107913
skies           271658
tangerine        46653
trees           316812

Recommend a query processing order for the following two queries
1 (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
2 (tangerine AND (NOT trees)) AND (NOT marmalade)



Query processing

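Estimated sizes of the OR groups in query 1 (sum of document frequencies):
(tangerine OR trees): 363,465
(marmalade OR skies): 379,571
(kaleidoscope OR eyes): 300,321

Recommended order for query 1:
((kaleidoscope OR eyes) AND (tangerine OR trees)) AND (marmalade OR skies)

Query 2:
(tangerine AND (NOT trees)) AND (NOT marmalade)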
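
The size estimates for query 1 can be recomputed with a short sketch: each OR group is bounded above by the sum of its document frequencies, and the groups are then processed in increasing order of that estimate. The variable names are chosen for this illustration.

```python
df = {
    "eyes": 213312, "kaleidoscope": 87009, "marmalade": 107913,
    "skies": 271658, "tangerine": 46653, "trees": 316812,
}
or_groups = [("tangerine", "trees"), ("marmalade", "skies"), ("kaleidoscope", "eyes")]

# |A OR B| is at most df(A) + df(B), so the sum serves as a conservative estimate.
estimates = {group: sum(df[t] for t in group) for group in or_groups}
order = sorted(or_groups, key=estimates.get)

for group in order:
    print(" OR ".join(group), estimates[group])
# kaleidoscope OR eyes 300321
# tangerine OR trees 363465
# marmalade OR skies 379571
```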



Limitations of Boolean model

Retrieval is based on a binary decision criterion with no notion of partial
matching
No ranking of the documents is provided (absence of a grading scale)
The information need has to be translated into a Boolean expression,
which most users find awkward
Binary term weights are extremely limited in expressiveness and cannot
capture relations among contextual words.



Lecture outline

1 Term vocabulary
2 Skip pointers
3 Phrase queries
4 Dictionary structures





Term Vocabulary and Postings List

Pre-processing to form the Term vocabulary


Documents
Tokenization
Indexing
Postings
Faster merges: skip lists
Positional postings and phrase queries



Document interpretation

Obtaining the character sequence in a document.


Choosing the document features
We need to deal with format and language of each document.
What format is it in? pdf, word, excel, html etc.
Language of the document



Document processing steps for vocabulary generation

Tokenization
Stop words
Normalization
Stemming and Lemmatization



Tokenization

Token: An instance of a sequence of characters in some particular
document that are grouped together as a useful semantic unit for processing.
Type: A type is the class of all tokens containing the same character
sequence.
Term: A term is a type that is included in the IR system’s dictionary.
Tokenization is the task of chopping a document into smaller units,
called tokens, possibly discarding certain characters, such as punctuation, at the same time.
Example of tokenization
Input: “Friends, Romans, Countrymen”
Output: Friends, Romans, Countrymen
Each such token is now a candidate for an index entry, after further
processing.
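
A minimal tokenizer sketch, assuming that a token is any maximal run of word characters. This is a deliberately naive rule chosen for illustration; real tokenizers need language-specific handling of apostrophes, hyphens and similar cases.

```python
import re

def tokenize(text):
    """Return runs of word characters, discarding punctuation and spaces."""
    return re.findall(r"\w+", text)

print(tokenize("Friends, Romans, Countrymen"))
# ['Friends', 'Romans', 'Countrymen']

# The same naive rule splits at apostrophes, which is rarely what is wanted.
print(tokenize("O'Neill aren't"))
# ['O', 'Neill', 'aren', 't']
```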



Issues in Tokenization
What are the correct tokens to use?
Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t
amusing.

Hyphens
Hewlett-Packard
Hewlett and Packard as two tokens?
state-of-the-art
co-education
lowercase, lower-case, lower case
White Space
San Francisco: one token or two?
red herring: one token or two?
Issues in Tokenization

Different character sequences


email addresses (jblack@mail.yahoo.com)
Web URLs (http://stuff.big.com/new/specials.html)
numeric IP addresses (142.32.48.231)
package tracking numbers (1Z9999W99845399981)
Often have embedded spaces
Older IR systems may not index numbers
But often very useful:
looking up error codes/stack traces on the web
Date of an email
Will often index “meta-data” separately: creation date, format, etc.



Tokenization
Tokenization: language issues
French
L’ensemble one token or two?
L ? L’ ? Le ?
German noun compounds are not segmented
Lebensversicherungsgesellschaftsangestellter
‘life insurance company employee’
Chinese and Japanese have no spaces between words
莎拉波娃现在居住在美国东南部的佛罗里达。
Not always guaranteed a unique tokenization
Arabic (or Hebrew) is written right to left, but with certain items like
numbers written left to right
Use rule-based or machine learning-based compound-splitters or word
segmentation tools to tokenize long compound words or languages
where explicit separators are not used to indicate word boundaries.
Stop words

Common words that appear to be of little value in helping select
documents matching a user’s need.
With a stop list, the most common words are excluded from the dictionary
entirely.
They have little semantic content
the, a, and, to, be
To build a stop list, sort the terms by collection frequency and then take the most
frequent ones.



Issues in removing stop words

Some special query types are disproportionately affected.


Phrase queries:
“King of Denmark”
“President of the United States”, President AND “United States”
Various song titles, etc.:
“Let it be”, “To be or not to be”
“Relational” queries:
“flights to London”: if “to” is removed, the query can no longer distinguish “flights to London”
from “flights from London”
The trend has been from standard use of quite large stop lists (200–300 terms) to very small
stop lists (7–12 terms)



Token normalization

Token normalization is the process of canonicalizing tokens so that
matches occur despite superficial differences in the character
sequences of the tokens
match U.S.A. and USA
A term is a (normalized) word type, which is an entry in the IR
system dictionary
The most standard way to normalize is to implicitly create equivalence classes, which are normally named
after one member of the set.
deleting periods to form a term
U.S.A., USA
deleting hyphens to form a term
anti-discriminatory, antidiscriminatory
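
The two deletion rules can be sketched as a single mapping function, so that tokens in the same equivalence class normalize to the same term. The function name normalize is an assumption for this illustration, and a real system would apply such rules more selectively.

```python
def normalize(token):
    """Map a token to its equivalence-class name by deleting periods and hyphens."""
    return token.replace(".", "").replace("-", "")

print(normalize("U.S.A."))               # USA
print(normalize("USA"))                  # USA
print(normalize("anti-discriminatory"))  # antidiscriminatory
```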



Token normalization
Alternatives to creating equivalence classes are
to maintain relations between unnormalized tokens.
to do asymmetric expansion.
Example: Microsoft Windows, Rear Window, glass window
Query term entered    Terms searched
window                window, windows
windows               Windows, windows, window
Windows               Windows

Maintain relations between unnormalized tokens


1 Index unnormalized tokens.
2 Maintain a query expansion list of multiple vocabulary entries to
consider for a certain query term.
3 A query term is then effectively a disjunction of several postings lists.

Asymmetric expansion
Perform the expansion during index construction, e.g., when the document
contains automobile, we index it under car as well (and vice versa).
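
The query expansion list from the window/windows/Windows example can be sketched as a lookup table, with each query term answered by the union (disjunction) of several postings lists. The postings in INDEX are made-up docIDs used only for illustration.

```python
# Expansion list from the example: query term -> vocabulary entries to search.
EXPANSIONS = {
    "window": ["window", "windows"],
    "windows": ["Windows", "windows", "window"],
    "Windows": ["Windows"],
}

# Hypothetical postings for the unnormalized tokens (docIDs are made up).
INDEX = {"window": [3, 7], "windows": [2, 7], "Windows": [1, 2]}

def search(term):
    """Union (OR) of the postings lists of every entry the term expands to."""
    doc_ids = set()
    for entry in EXPANSIONS.get(term, [term]):
        doc_ids.update(INDEX.get(entry, []))
    return sorted(doc_ids)

print(search("window"))   # [2, 3, 7]
print(search("windows"))  # [1, 2, 3, 7]
print(search("Windows"))  # [1, 2]
```

The expansion is asymmetric: a query for Windows does not retrieve documents that only contain window, while a query for windows retrieves all three forms.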
Token normalization

Accents and diacritics: naïve vs. naive; peña (a cliff) vs. pena (sorrow).

Case folding and truecasing
Case folding: reduce all letters to lower case
The simplest truecasing heuristic is to convert to lowercase
words at the beginning of a sentence
all words in a title that is all uppercase or in which most or all words are
capitalized
exception: words capitalized in mid-sentence are left capitalized



Text normalization

Handling synonyms and homonyms


e.g., by hand-constructed equivalence classes
car = automobile; color = colour
We can rewrite to form equivalence-class terms
When the document contains automobile, index it under
car-automobile (and vice-versa)
Or we can expand a query
When the query contains automobile, look for car as well
Spelling mistakes
One approach is Soundex, which forms equivalence classes of words
based on phonetic heuristics



Stemming and Lemmatization

To reduce inflectional forms and sometimes derivationally related
forms of a word to a common base form.
Example:
am, is, are → be
car, cars, car’s, cars’ → car
the boy’s cars are different colors → the boy car be different color



Stemming and Lemmatization

Stemming refers to a crude heuristic process that chops off the ends
of words and removes the derivational affixes.
It commonly collapses derivationally related words
Lemmatization refers to doing things properly with the use of a
vocabulary and morphological analysis of words, normally aiming to
remove inflectional endings only and to return the base or dictionary
form of a word, which is known as the lemma.
It only collapses the different inflectional forms into the corresponding
root forms.



Stemming

Reduce terms to their common basic form before indexing.


“Stemming” suggests crude affix chopping
language dependent
e.g., automate(s), automatic, automation all reduced to automat



Porter’s Stemmer
The most common algorithm for stemming English.
Results suggest it’s at least as good as other stemming options
Algorithm
1 5 phases of reductions
2 phases applied sequentially
3 each phase has various conventions to select rules
4 sample convention: Of the rules in a compound command, select the
one that applies to the longest suffix.
Phase 1
Rule         Example
SSES → SS    caresses → caress
IES → I      ponies → poni
SS → SS      caress → caress
S →          cats → cat
Phase 2
Later rules use a measure of the word that loosely checks the number of syllables, to decide
whether a matched ending is really a suffix or part of the stem of the word.
Example: the rule (m > 1) EMENT → maps replacement → replac, but NOT cement → c
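
The four Phase 1 rules and the longest-suffix convention can be sketched as below. This covers only the rule group shown on the slide, not the full Porter algorithm; the names RULES and phase1 are assumptions for this illustration.

```python
# Rules from the slide, as (suffix, replacement) pairs.
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def phase1(word):
    """Apply the rule matching the longest suffix; at most one rule fires."""
    for suffix, replacement in sorted(RULES, key=lambda r: len(r[0]), reverse=True):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", phase1(w))
# caresses -> caress
# ponies -> poni
# caress -> caress
# cats -> cat
```

The SS → SS rule looks redundant, but it is what stops the shorter S → rule from turning caress into cares.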
Lemmatizer

A tool from Natural Language Processing (NLP)
Does full morphological analysis to accurately identify the lemma for
each word.
Full morphological analysis
is usually a more time-consuming and elaborate process.
produces at most very modest benefits for retrieval.



Faster postings list access

If the lengths of the postings lists are m and n, then the intersection operation
takes O(m + n) time.
The speed of intersection may be increased by using skip pointers
Skip pointers are shortcuts that bypass parts of postings lists that will
not appear in the search result



Skip pointers

Points to consider
Where to place the skip pointers?
How to do efficient merging using skip pointers?



Skip pointers



Where to place the skip pointers?

More skips → shorter skip spans
1 more likely to skip
2 increased number of skip pointer comparisons
3 more successful skips
Fewer skips → longer skip spans
1 fewer pointer comparisons
2 fewer successful skips

Simple heuristic: for a postings list of length P, use √P evenly-spaced
skip pointers.
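
A sketch of intersection with skip pointers, under the assumption that skips are implicit: in a list of length P, every position that is a multiple of floor(√P) carries a skip to the position floor(√P) entries ahead. The helper names and the example lists are introduced for this illustration only.

```python
import math

def skip_step(postings):
    """Spacing that gives roughly sqrt(P) evenly spaced skip pointers."""
    return max(1, math.isqrt(len(postings)))

def has_skip(i, step, n):
    """Position i carries a skip pointer if it is a multiple of step and the target exists."""
    return i % step == 0 and i + step < n

def intersect_with_skips(p1, p2):
    s1, s2 = skip_step(p1), skip_step(p2)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            if has_skip(i, s1, len(p1)) and p1[i + s1] <= p2[j]:
                # Follow skips while the skip target does not overshoot p2[j].
                while has_skip(i, s1, len(p1)) and p1[i + s1] <= p2[j]:
                    i += s1
            else:
                i += 1
        else:
            if has_skip(j, s2, len(p2)) and p2[j + s2] <= p1[i]:
                while has_skip(j, s2, len(p2)) and p2[j + s2] <= p1[i]:
                    j += s2
            else:
                j += 1
    return answer

# Made-up postings lists for illustration.
a = [2, 4, 8, 16, 19, 23, 28, 43]
b = [1, 2, 3, 5, 8, 41, 51, 60, 71]
print(intersect_with_skips(a, b))  # [2, 8]
```

A skip is taken only when its target is still less than or equal to the other list's current docID, so no common docID can be jumped over.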



Phrase queries

Consider the query “Stanford University” as a phrase
The following documents are false positives when the query is processed as Stanford AND University
1 “I went to university at Stanford”
2 “The inventor Stanford Ovshinsky never went to university”
Postings lists that record only which documents contain the individual terms are
not sufficient to handle such queries.
Approaches for phrase queries
1 Biword Indexes
2 Positional Indexes



Biword Indexes

Index every consecutive pair of terms in the text as a phrase


Query: “Friends, Romans, Countrymen”
Pairs of consecutive words indexed as dictionary terms
friends romans
romans countrymen
For longer queries consecutive word pairs are ANDed
1 Query: “Friends, Romans, Countrymen”
(friends romans) AND (romans countrymen)
2 stanford university palo alto
(stanford university) AND (university palo) AND (palo alto)
Disadvantage: false positives. The biwords may each appear in a retrieved document
without the whole phrase appearing contiguously.
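
A small sketch of a biword index and the AND-of-biwords phrase query. The two-document collection is hypothetical and was built so that the second document contains every biword of the query without containing the phrase itself, which illustrates the false-positive problem.

```python
from collections import defaultdict

def biwords(text):
    """Return consecutive lowercase word pairs (commas are treated as spaces)."""
    words = text.lower().replace(",", " ").split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

def build_biword_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text):
            index[bw].add(doc_id)
    return index

def phrase_query(phrase, index):
    """AND the postings of every biword in the phrase (may give false positives)."""
    sets = [index.get(bw, set()) for bw in biwords(phrase)]
    return sorted(set.intersection(*sets)) if sets else []

# Hypothetical two-document collection.
docs = {
    1: "stanford university palo alto",
    2: "stanford university is near university palo and palo alto college",
}
index = build_biword_index(docs)

print(biwords("Friends, Romans, Countrymen"))
# ['friends romans', 'romans countrymen']
print(phrase_query("stanford university palo alto", index))
# [1, 2]  (document 2 is a false positive)
```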



Extended biwords

Nouns and noun groups (N) are usually more significant in queries
than words with other parts of speech (X).
For any string of terms of the form N X* N, the word pair
formed by the two nouns (N N)
forms an extended biword
indexed in the dictionary

cost  overruns  on  a  power  plant
N     N         X   X  N      N

Extended biwords
1 cost overruns
2 overruns power
3 power plant
Query: (cost overruns) AND (overruns power) AND (power plant)



Positional indexes

Store in each posting the positions at which the term appears in the
document.

Format:
<term, # docs containing term;
 doc1: freq. of the term; pos1, pos2, ···;
 doc2: freq. of the term; pos1, pos2, ···;
 etc.>

Example:
to, 993427:
(1, 6: (7, 18, 33, 72, 86, 231);
 2, 5: (1, 17, 74, 222, 255);
 4, 5: (8, 16, 190, 429, 433);
 5, 2: (363, 367);
 7, 3: (13, 23, 191); ···)
be, 178239:
(1, 2: (17, 25);
 4, 5: (17, 191, 291, 430, 434);
 5, 3: (14, 19, 101); ···)



Proximity intersection

Query: to be or not to be
Start from the postings lists of the terms in increasing order of
document frequency.
Consider to and be
1 Find the documents containing both terms
2 Look for positions in the lists where be occurs at a token position one
greater than an occurrence of to
3 Then look for another occurrence of both words with token positions 4 higher than
the first occurrence
to: < ···; 4: < ···, 429, 433 >; ··· >
be: < ···; 4: < ···, 430, 434 >; ··· >



Dictionaries

Dictionary data structures


Tolerant retrieval
1 Wild-card queries
2 Spelling correction
3 Phonetic correction
Develop techniques that are robust to typographical errors in the
query, as well as alternative spellings.



Search structures for dictionaries

The dictionary data structure stores the term vocabulary, document
frequency, and pointers to each postings list.
Explore the data structures for the dictionary.
[Figure: an inverted index, labelling the dictionary, a postings list, and an individual posting]



A simple dictionary
An array of structures
term         document frequency    pointer to postings list
a            656,265               →
aardvark     65                    →
···          ···                   ···
zulu         221                   →
char[20]     int                   Postings
20 bytes     4/8 bytes             4/8 bytes
Storage and retrieval are not efficient
Points to be considered:
1 Number of terms in the dictionary
2 Whether the keys remain static or change dynamically
3 The relative frequencies with which various keys will be accessed
Two choices
1 Hashtables
2 Trees
Hashing

Query terms (keys) are mapped to integers from a space big enough to
make collisions unlikely.
Collision resolution is done by auxiliary structures
O(1) search complexity

Cons
1 Minor variants may be mapped to distant integers (color/colour)
2 No prefix search (free/freely/freedom)
3 Expanding vocab may necessitate redesigning the hash function.

[Figure: query keys k1, k2, k3 pass through a hash function to positions p1, p2, p3 in the dictionary (entries t1, t2, ..., t221, ..., t548, ..., t1024), with collision resolution]



Binary trees
[Figure: binary tree over the dictionary. ROOT splits into a-m and n-z; a-m splits into a-hu and hy-m; n-z splits into n-sh and si-z]
Search time is O(log M) if the tree is balanced, where M is the number of terms
Allows prefix search
Balanced: at each node, the depths of the left and right
subtrees differ by at most 1.
Insertion and deletion unbalance a tree
Costly rebalancing step required to maintain balance
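
Prefix search over an ordered dictionary can be sketched with binary search on a sorted array. This is a stand-in for a balanced search tree, not a tree implementation: it gives the same O(log M) lookup on static keys but does not support cheap insertion. The sentinel character "\uffff" is an assumption that no term contains that code point.

```python
from bisect import bisect_left

def prefix_search(sorted_terms, prefix):
    """Return all terms starting with `prefix` using two binary searches."""
    lo = bisect_left(sorted_terms, prefix)
    hi = bisect_left(sorted_terms, prefix + "\uffff")  # just past the last match
    return sorted_terms[lo:hi]

terms = sorted(["free", "freedom", "freely", "fried", "color", "colour", "zulu"])
print(prefix_search(terms, "free"))  # ['free', 'freedom', 'freely']
print(prefix_search(terms, "colo"))  # ['color', 'colour']
```

Because terms sharing a prefix are contiguous in sorted order, the answer is a single slice, which is the property a hash table cannot offer.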
B-tree

To mitigate rebalancing, B-trees may be used
Each internal node of a B-tree has a variable number of children within a
fixed range.
Each branch under an internal node represents a test for a range of
character sequences.
