5 The Term Vocabulary & Posting List
VOCABULARY AND
POSTINGS LISTS
INFORMATION RETRIEVAL
AND WEB SEARCH
LECTURE 8
Outline
Elaborate basic indexing
Preprocessing to form the term vocabulary
Documents
Tokenization
What terms do we put in the index?
Recall the basic indexing pipeline
Tokenizer: tokenize the text.
Token stream: Friends Romans Countrymen
Linguistic modules: do linguistic pre-processing of tokens.
Modified tokens: friend roman countryman
Indexer: builds the inverted index, e.g. friend → 2, 4
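The pipeline can be sketched in a few lines of Python. This is only an illustration: the function name build_inverted_index, the sample documents, and the use of lowercasing as a stand-in for the linguistic modules are assumptions, not part of the lecture.

```python
import re
from collections import defaultdict


def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each term to the sorted list of document IDs that contain it."""
    index: dict[str, list[int]] = defaultdict(list)
    for doc_id in sorted(docs):                      # ascending IDs keep postings sorted
        tokens = re.findall(r"\w+", docs[doc_id])    # tokenizer
        terms = {token.lower() for token in tokens}  # stand-in for linguistic modules
        for term in terms:
            index[term].append(doc_id)               # indexer
    return dict(index)


# Hypothetical two-document collection with IDs 2 and 4.
docs = {2: "Friends, Romans, Countrymen", 4: "friends and neighbours"}
index = build_inverted_index(docs)
print(index["friends"])  # [2, 4]
```

Lowercasing alone does not reduce "friends" to "friend"; that step needs the linguistic processing discussed later.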
Introduction to
Information Retrieval
What exactly is a document?
Before indexing, we need to understand the content of the digital document.
Documents are stored in digital formats, so they are called digital documents.
A document is the input to the indexing process.
It arrives as a sequence of bytes in a file, or possibly on a web server (the file can be of any type).
Parsing a Document
File/document (byte sequence) → process into a character sequence → tokenize
2) What is its language?
Plain English, Hindi, Bengali, etc.
The tokenization and linguistic pre-processing steps depend on the language, i.e.:
The kind of tokenization you do
The kind of linguistic pre-processing you do
3) Which character set is it using?
ASCII, Unicode UTF-8, or any other vendor-specific standard.
Working out the format, language, and character set in these three tasks is a classification problem, which can be handled with supervised learning.
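The byte-sequence to character-sequence step can be illustrated as follows. This is a minimal sketch that assumes the encoding is already known; the helper name decode_bytes and the sample string are made up for illustration.

```python
def decode_bytes(raw: bytes, encoding: str = "utf-8") -> str:
    """Turn the bytes of a file into a character sequence."""
    return raw.decode(encoding)


# Hypothetical file content: 10 characters stored as 12 UTF-8 bytes.
raw = "naïve café".encode("utf-8")

text = decode_bytes(raw, "utf-8")     # correct character set
garbled = raw.decode("latin-1")       # wrong guess: decodes without error, but the text is wrong

print(len(raw), len(text), len(garbled))
```

A wrong guess about the character set does not necessarily raise an error; it can silently produce different characters, which is why the character set has to be identified before tokenization.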
Complications: Formats/Language
Documents being indexed can include documents from many different languages.
Example: a combination of English and Hindi.
A single index may have to contain terms of several languages.
Some documents are in Hindi, some in French, some in German, and so on; each needs its own tokenization and linguistic processing.
Complications: Formats/Language
Sometimes a document or its components can contain multiple languages/formats.
Example: an English email with French PDF attachments.
A file?
An email? (Perhaps one of many in an mbox.)
An email with 5 attachments?
A group of files (PPT or LaTeX as HTML pages)?
What is a unit document?
A file?
A Word file, a PDF file, an email, a PPT, and so on.
If there are 10 plays of Shakespeare, then we could have 10 documents.
Assign IDs to those documents.
Example of Documents
Individual Documents
Unit Document
Example (Issue)
What could be done?
Instead, index each chapter or paragraph as a mini-document.
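One possible way to form paragraph-level mini-documents is sketched below. It assumes paragraphs are separated by blank lines; the function name and the sample text are hypothetical.

```python
def split_into_mini_documents(text: str) -> dict[int, str]:
    """Treat each blank-line-separated paragraph as its own document with its own ID."""
    paragraphs = (p.strip() for p in text.split("\n\n"))
    non_empty = [p for p in paragraphs if p]
    return {doc_id: p for doc_id, p in enumerate(non_empty, start=1)}


# Hypothetical long document with three paragraphs.
book = "First paragraph.\n\nSecond paragraph.\n\n\n\nThird."
mini_docs = split_into_mini_documents(book)
print(mini_docs)
```

Each mini-document gets its own ID, so a query can point to the matching paragraph rather than to the whole file.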
Vocabulary of Terms
Tokenization
Word – A delimited string of characters as it appears in the
text.
Token – An instance of a word or term occurring in a
document.
Example: Friends, Romans, Countrymen.
Tokens: Friends, Romans, Countrymen
If there are two occurrences of the word “Friends” in the document, then two tokens are generated, each with the same string “Friends”.
Vocabulary of Terms
Tokenization
Term – A “normalized” word (case, morphology, spelling, etc.); an equivalence class of words.
It is an entry in the dictionary of the inverted index.
So both instances of “Friends” map to the same term in the index, and that term would be “friend”.
Vocabulary of Terms
Tokenization
Input: “Friends, Romans and Countrymen”
Output: tokens, with white space and punctuation marks removed from the input
Friends
Romans
Countrymen
Each such token is now a candidate for an index entry, after further processing (linguistic processing).
But what are valid tokens to emit?
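A very simple tokenizer can be sketched as below. It assumes that a token is any run of letters, digits, or underscores, which is only one of many possible answers to the question of what to emit; the function name tokenize is illustrative.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split on white space and punctuation; keep runs of word characters."""
    return re.findall(r"\w+", text)


tokens = tokenize("Friends, Romans, Countrymen. Friends!")
print(tokens)            # ['Friends', 'Romans', 'Countrymen', 'Friends']
print(len(set(tokens)))  # 3 distinct strings
```

The string "Friends" occurs twice, so it yields two tokens, while the set of distinct strings has only three members.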
Vocabulary of Terms
Tokenization
Issues in tokenization:
Finland’s capital: how would you tokenize this word?
Finland? Finlands? Finland’s?
The way you deal with apostrophes can impact the performance of the system.
Dropping the ’s and keeping “Finland” makes sense here.
Aren’t: short form of “are not”.
If we use the same conversion as above, we end up with “aren”.
Kevin O’Brien: Irish cricketer.
Suppose we apply the same technique.
We would get two separate tokens, Kevin and O,
as if everything after the O’ had been removed.
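The effect of one fixed apostrophe rule on these three cases can be checked with a short sketch. The rule shown (cut the token at the first apostrophe) and the function name are assumptions chosen to mirror the discussion above, not a recommended tokenizer.

```python
def strip_after_apostrophe(token: str) -> str:
    """Keep only the part of the token before the first apostrophe (straight or curly)."""
    for mark in ("'", "’"):
        token = token.split(mark, 1)[0]
    return token


for word in ["Finland’s", "Aren’t", "O’Brien"]:
    print(word, "->", strip_after_apostrophe(word))
```

The same rule that gives a sensible result for Finland’s produces “Aren” and “O” for the other two words, which is why apostrophe handling needs care.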
Vocabulary of Terms
Tokenization
Suppose Hewlett-Packard is treated as two separate tokens, Hewlett and Packard.
Then, if a query contains Hewlett-Packard, documents containing Hewlett or Packard would be returned.
Vocabulary of Terms
When both words start with a capital letter, the phrase may be the name of a person or a place.
Vocabulary of Terms
Numbers
3/20/91, Mar. 12, 1991, 20/3/91
55 B.C.
B-52 (need to preserve the hyphen)
My PGP key is 324a3df234cb23e
(800) 234-2333 (need to preserve the hyphen)
Often have embedded spaces
Older IR systems may not index numbers
But they are often very useful: think about looking up error codes/stack traces on the web, e.g. the “404 not found” error
(One answer is using n-grams)
Will often index “meta-data” separately
Creation date, format, etc.
Vocabulary of Terms
[Figure: a sentence with arrows (← → ←→ ←) marking the reading direction of each segment]
‘Algeria achieved its independence in 1962 after 132 years of French occupation.’
With Unicode, the surface presentation is complex, but the stored form is straightforward.
Vocabulary of Terms
Stop words
With a stop list, you exclude from the dictionary
entirely the commonest words. Intuition:
They have little semantic content: the, a, and, to, be
There are a lot of them: ~30% of postings for top 30 words
Stop words
Collection frequency, along with document frequency and term frequency
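The three statistics can be computed as in the sketch below. It uses the usual definitions (collection frequency: total occurrences of a term in the collection; document frequency: number of documents containing the term; term frequency: occurrences within one document). The function name and the toy collection are assumptions.

```python
from collections import Counter


def frequency_stats(docs: dict[int, list[str]]):
    """Return (collection frequency, document frequency, per-document term frequency)."""
    collection_freq: Counter = Counter()
    doc_freq: Counter = Counter()
    term_freq: dict[int, Counter] = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        term_freq[doc_id] = counts
        collection_freq.update(counts)       # add every occurrence
        doc_freq.update(counts.keys())       # add 1 per document containing the term
    return collection_freq, doc_freq, term_freq


# Hypothetical tokenized collection.
docs = {1: ["the", "cat", "the"], 2: ["the", "dog"]}
cf, df, tf = frequency_stats(docs)
print(cf["the"], df["the"], tf[1]["the"])  # 3 2 2

# Candidate stop list: the terms with the highest collection frequency.
stop_list = [term for term, _ in cf.most_common(1)]
```

Sorting terms by collection frequency and taking the top of the list is one common way to pick stop-word candidates.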
Vocabulary of Terms
Normalization to terms
We need to “normalize” words in indexed text as well as query words into the same form.
We want to match U.S.A. and USA.
Result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary.
We most commonly implicitly define equivalence classes of terms by, e.g.,
deleting periods to form a term
U.S.A., USA → USA
deleting hyphens to form a term
anti-discriminatory, antidiscriminatory → antidiscriminatory
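The two deletion rules above can be sketched as a single function applied to both document tokens and query tokens. The name normalize_term is illustrative.

```python
def normalize_term(token: str) -> str:
    """Form an equivalence class by deleting periods and hyphens."""
    return token.replace(".", "").replace("-", "")


# The same function is applied to indexed text and to the query.
print(normalize_term("U.S.A.") == normalize_term("USA"))
print(normalize_term("anti-discriminatory"))
```

Because the index and the query go through the same function, “U.S.A.” in a document matches “USA” in a query.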
Vocabulary of Terms
Case folding
Reduce all letters to lower case
exception: upper case in mid-sentence?
e.g., General Motors
Fed vs. fed (Federal Reserve System)
SAIL vs. sail
Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization…
Google example:
Query C.A.T.
#1 result is for “cat” (well, Lolcats), not Caterpillar Inc.
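A small sketch shows what is lost when everything is lowercased. Combining case folding with period deletion here is an assumption made to reproduce the C.A.T. example; the function name is hypothetical.

```python
def fold(token: str) -> str:
    """Delete periods, then reduce all letters to lower case."""
    return token.replace(".", "").lower()


for pair in [("Fed", "fed"), ("SAIL", "sail"), ("C.A.T.", "cat")]:
    print(pair, "->", fold(pair[0]) == fold(pair[1]))
```

After folding, each pair collapses to a single term, so the index can no longer tell the Federal Reserve from the verb, or the query C.A.T. from the animal.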
Vocabulary of Terms
Lemmatization
Reduce inflectional/variant forms to base form
The name comes from the word “lemma”, which refers to the root (dictionary) form of a word
This is a sophisticated NLP technique
E.g.,
am, are, is → be
car, cars, car's, cars' → car
the boy's cars are different colors → the boy car be different color
Plural forms are converted into singular form
Lemmatization implies doing “proper” reduction to dictionary headword form
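The examples above can be reproduced with a toy lookup table. This is only a sketch: a real lemmatizer relies on a full dictionary and morphological analysis, and the table LEMMA_TABLE below is a hand-made assumption covering just these words.

```python
# Hand-made lookup table covering only the examples in this section.
LEMMA_TABLE = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
    "boy's": "boy", "colors": "color",
}


def lemmatize(tokens: list[str]) -> list[str]:
    """Replace each token by its dictionary headword if the table knows it."""
    return [LEMMA_TABLE.get(token.lower(), token.lower()) for token in tokens]


sentence = "the boy's cars are different colors".split()
print(" ".join(lemmatize(sentence)))  # the boy car be different color
```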
Skip Pointers
Brutus: 2 4 8 41 48 64 128
Caesar: 1 2 3 8 11 17 21 31
Intersection: 2 8
Can we do better?
Yes (if index isn’t changing too fast).
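For reference, the plain intersection of two sorted postings lists without skip pointers can be sketched as follows. The function name intersect is illustrative.

```python
def intersect(p1: list[int], p2: list[int]) -> list[int]:
    """Intersect two sorted postings lists by walking both at the same time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID is in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the list with the smaller docID
        else:
            j += 1
    return answer


brutus = [2, 4, 8, 41, 48, 64, 128]
caesar = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect(brutus, caesar))  # [2, 8]
```

Each step advances at least one pointer, so the work is linear in the combined length of the two lists. Skip pointers aim to do better than that.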
Skip Pointers
Brutus: 2 4 8 41 48 64 128
Caesar: 1 2 3 8 11 17 21 31 (skip pointers lead to 11 and 31)
Intersection so far: 2 8
The pointer p2 steps along the Caesar list, and so on.
With a skip pointer from 11 to 31, the intermediate entries between 11 and 31 can be passed over.
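The walk-through can be sketched in code. Skip pointers are represented here as a dictionary from the index of an entry to the index its skip pointer leads to; this representation, the function name, and the assumption that the Caesar list has skips 1 → 11 and 11 → 31 while the Brutus list has none are all illustrative choices.

```python
def intersect_with_skips(p1, p2, skips1, skips2):
    """Intersect sorted lists; follow a skip when its target does not overshoot the other list."""
    answer, skips_followed = [], 0
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            target = skips1.get(i)
            if target is not None and p1[target] <= p2[j]:
                i = target             # jump over the intermediate entries
                skips_followed += 1
            else:
                i += 1
        else:
            target = skips2.get(j)
            if target is not None and p2[target] <= p1[i]:
                j = target
                skips_followed += 1
            else:
                j += 1
    return answer, skips_followed


brutus = [2, 4, 8, 41, 48, 64, 128]
caesar = [1, 2, 3, 8, 11, 17, 21, 31]
caesar_skips = {0: 4, 4: 7}   # assumed: 1 -> 11 and 11 -> 31
result, followed = intersect_with_skips(brutus, caesar, {}, caesar_skips)
print(result, followed)  # [2, 8] 1
```

After 8 is matched, the lists stand at 41 and 11. Since the skip target 31 is still not larger than 41, the entries 17 and 21 are never examined.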
Skip Pointers
Tradeoff:
More skips → shorter skip spans ⇒ more likely to skip. But lots of comparisons to skip pointers.
Fewer skips → few pointer comparisons, but then long skip spans ⇒ few successful skips.
Skip Pointers
Placing skips
Simple heuristic: for postings of length L, use √L evenly-spaced skip pointers.
i.e., if the total length of the postings list is L, place a skip pointer every √L entries, so that each skip spans √L entries.
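The heuristic can be sketched as a function that returns skip pointers as a mapping from entry index to target index. Rounding √L down with math.isqrt and letting the last pointer end on the final entry are assumptions of this sketch, as is the name place_skips.

```python
import math


def place_skips(length: int) -> dict[int, int]:
    """Place a skip pointer every floor(sqrt(length)) entries of a postings list."""
    if length < 2:
        return {}
    span = max(1, math.isqrt(length))
    # The last pointer is clamped so that it lands on the final entry.
    return {i: min(i + span, length - 1) for i in range(0, length - 1, span)}


print(place_skips(16))  # {0: 4, 4: 8, 8: 12, 12: 15}
```

For a list of 16 entries this gives 4 pointers, each spanning 4 entries, except the last one, which spans the remaining 3.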
Skip Pointers
Important Points
Consider the case where the index is small and fits entirely into memory (both the dictionary and the postings lists fit into main memory).
Skips help only AND queries.
They do not work with OR queries. Why?
Skip Pointers
Exercise Problems
Problem 1:
We have a two-word query. For one term the postings list consists of the following 16 entries:
[4, 6, 10, 12, 14, 16, 18, 20, 22, 32, 47, 81, 120, 122, 157, 180]
and for the other it is the one-entry postings list
[47]
Work out how many comparisons would be done to intersect the two postings lists with the following two strategies. Briefly justify your answer.
(a) Using standard postings lists.
(b) Using postings lists stored with skip pointers, with a skip length of √L.
Skip Pointers
Problem 1 Solution
(a) The number of comparisons would be 11, as shown:
(4,47), (6,47), (10,47), (12,47), (14,47), (16,47), (18,47), (20,47), (22,47), (32,47), (47,47)
(b) Total length of the postings list: L = 16
Skip length: √L = √16 = 4
Skip pointers: 4 → 14, 14 → 22, 22 → 120, 120 → 180
4 6 10 12 14 16 18 20 22 32 47 81 120 122 157 180
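The counts can be checked mechanically with the sketch below. The counting convention is an assumption: every comparison of a docID from the long list (the current entry or a skip target) against 47 counts as one. A hand count that does not compare again the entry a skip lands on gives a smaller number for (b), so the figure printed for the skip case is specific to this convention.

```python
import math


def count_comparisons(postings: list[int], target: int, use_skips: bool) -> int:
    """Count docID comparisons needed to intersect `postings` with the one-entry list [target]."""
    span = math.isqrt(len(postings)) if use_skips else 0
    last = len(postings) - 1
    comparisons = 0
    i = 0
    while i <= last:
        comparisons += 1                      # compare current entry with target
        if postings[i] >= target:
            break                             # match found, or target cannot occur later
        if span and i % span == 0 and i < last:
            skip_to = min(i + span, last)
            comparisons += 1                  # compare skip target with target
            if postings[skip_to] <= target:
                i = skip_to                   # follow the skip pointer
                continue
        i += 1
    return comparisons


postings = [4, 6, 10, 12, 14, 16, 18, 20, 22, 32, 47, 81, 120, 122, 157, 180]
print(count_comparisons(postings, 47, use_skips=False))  # 11
print(count_comparisons(postings, 47, use_skips=True))
```

Without skips the function reproduces the 11 comparisons listed in (a). With skips it follows 4 → 14 and 14 → 22, rejects 22 → 120 because 120 > 47, and then steps through 32 to 47.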
Skip Pointers
Problem 2
Skip Pointers
Problem 2 Solution
(a) The skip pointer is followed only once: 24 → 75.
Skip Pointers
Assignment - II
Why are skip pointers not useful for queries of the form
x OR y?
Exercise 1.6, 1.11, 1.9, 2.2, 2.3