[Figure: the indexer produces an inverted index. Dictionary terms with their postings: friend → 2, 4; roman → 1, 2; countryman → 13, 16]
Introduction to Information Retrieval Sec. 2.1
Sec. 2.2.1
Tokenization
Input: “Friends, Romans and Countrymen”
Output: Tokens
  Friends
  Romans
  Countrymen
A token is an instance of a sequence of characters
Each such token is now a candidate for an index entry, after further processing
  Described below
But what are valid tokens to emit?

Tokenization
Issues in tokenization:
  Finland’s capital → Finland AND s? Finlands? Finland’s?
  Hewlett-Packard → Hewlett and Packard as two tokens?
    state-of-the-art: break up hyphenated sequence.
    co-education
    lowercase, lower-case, lower case?
    It can be effective to get the user to put in possible hyphens
  San Francisco: one token or two?
    How do you decide it is one token?
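
The tokenization choices above can be tried out with a small sketch. The function name `tokenize` and its two flags, `split_hyphens` and `keep_apostrophes`, are hypothetical and exist only for illustration. Neither setting is presented as the right answer, and unlike the slide's example output this sketch keeps every word, including "and".

```python
import re

def tokenize(text, split_hyphens=False, keep_apostrophes=True):
    """Split text into tokens; both flags are illustrative policy choices."""
    # Characters allowed to join word pieces into a single token.
    joiners = ""
    if not split_hyphens:
        joiners += "-"
    if keep_apostrophes:
        joiners += "'’"
    if joiners:
        # A run of word characters, optionally continued across a joiner.
        pattern = r"\w+(?:[" + re.escape(joiners) + r"]\w+)*"
    else:
        pattern = r"\w+"
    return re.findall(pattern, text)

print(tokenize("Friends, Romans and Countrymen"))
print(tokenize("Hewlett-Packard"))                      # kept as one token
print(tokenize("Hewlett-Packard", split_hyphens=True))  # split into two
print(tokenize("Finland's capital", keep_apostrophes=False))
```

Multiword names such as San Francisco are not handled here: deciding that two whitespace-separated words form one token needs extra knowledge (for example a phrase list), which this sketch does not attempt.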
Terms
The things indexed in an IR system

Sec. 2.2.2
Stop words
With a stop list, you exclude from the dictionary entirely the commonest words. Intuition:
  They have little semantic content: the, a, and, to, be
  There are a lot of them: ~30% of postings for top 30 words
But the trend is away from doing this:
  Good compression techniques (IIR 5) mean the space for including stop words in a system is very small
  Good query optimization techniques (IIR 7) mean you pay little at query time for including stop words.
  You need them for:
    Phrase queries: “King of Denmark”
    Various song titles, etc.: “Let it be”, “To be or not to be”
    “Relational” queries: “flights to London”
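
To see why dropping stop words hurts the queries listed above, here is a minimal sketch. The stop list contains only the five example words from the slide, not a realistic stop list, and `remove_stop_words` is a hypothetical helper name.

```python
# Illustrative stop list: only the example words named on the slide.
STOP_WORDS = {"the", "a", "and", "to", "be"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token whose lowercase form is in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

print(remove_stop_words("flights to London".split()))   # ['flights', 'London']
print(remove_stop_words("To be or not to be".split()))  # ['or', 'not']
print(remove_stop_words("Let it be".split()))           # ['Let', 'it']
```

After filtering, "flights to London" can no longer be told apart from "flights from London", and the song and play titles lose most of their words.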
Sec. 2.2.3
Potentially more powerful, but less efficient

What about spelling mistakes?
  One approach is Soundex, which forms equivalence classes of words based on phonetic heuristics
More in IIR 3 and IIR 9
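
As an illustration of phonetic equivalence classes, the sketch below implements one common simplified formulation of Soundex: keep the first letter, map the remaining letters to digits, collapse repeated digits, drop zeros, and pad to four characters. Soundex variants differ in details (for example how H and W are treated, and whether the first letter's own code suppresses the next one), so treat this as an assumption-laden sketch and not a reference implementation.

```python
# Letter-to-digit table for a simplified Soundex variant.
CODES = {}
for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                       ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return a four-character code: one letter followed by three digits."""
    letters = [c for c in word.upper() if c in CODES]  # ignore non-letters
    if not letters:
        return ""
    digits = [CODES[c] for c in letters[1:]]
    # Collapse each run of identical consecutive digits to a single digit.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    # Drop the zeros (vowel-like letters), then pad and truncate.
    kept = [d for d in collapsed if d != "0"]
    return (letters[0] + "".join(kept) + "000")[:4]

print(soundex("Hermann"), soundex("Herman"))  # H655 H655
print(soundex("Robert"), soundex("Rupert"))   # R163 R163
```

Words that share a code fall into the same equivalence class, so a query for one spelling can retrieve documents containing the other.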
Stemming and Lemmatization

Sec. 2.2.4
Lemmatization
Reduce inflectional/variant forms to base form
E.g.,
  am, are, is → be
  car, cars, car's, cars' → car
  the boy's cars are different colors → the boy car be different color
Lemmatization implies doing “proper” reduction to dictionary headword form
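
The lemmatization example can be reproduced with a toy lookup table. The table `LEMMAS` and the function `lemmatize` are hypothetical and cover only the forms shown on the slide; a real lemmatizer relies on morphological analysis and a full dictionary of headwords.

```python
# Toy lemma table: only the word forms that appear in the slide's examples.
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
    "boy's": "boy", "colors": "color",
}

def lemmatize(tokens, lemmas=LEMMAS):
    """Map each token to its headword; unknown tokens pass through unchanged."""
    return [lemmas.get(t, t) for t in tokens]

sentence = "the boy's cars are different colors".split()
print(" ".join(lemmatize(sentence)))  # the boy car be different color
```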
Faster postings merges:
Skip pointers/Skip lists

[Figure: merging two postings lists]
  Brutus: 2 4 8 41 48 64 128
  Caesar: 1 2 3 8 11 17 21 31
  Result: 2 8
If the list lengths are m and n, the merge takes O(m+n) operations.
Can we do better?
Yes (if the index isn’t changing too fast).
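
The O(m+n) merge can be sketched as a two-pointer walk over the sorted lists. The function name `intersect` is an illustrative choice; the postings are the Brutus and Caesar lists shown above.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists with at most m + n steps."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same document ID in both lists: it is part of the result.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller current ID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 41, 48, 64, 128]
caesar = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect(brutus, caesar))  # [2, 8]
```

Each loop iteration advances at least one pointer, which is where the O(m+n) bound comes from.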
Sec. 2.3
[Figure: postings list 1 2 3 8 11 17 21 31 augmented with skip pointers to 11 and 31]
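
One way skip pointers can speed up the merge is sketched below. Skip pointers are modelled as a dictionary from a list index to a later index, which is an assumed representation chosen for brevity. The Caesar skips (1 → 11 → 31) follow the figure; the Brutus skips (2 → 41 → 128) are an assumption, since that part of the figure did not survive.

```python
def intersect_with_skips(p1, skips1, p2, skips2):
    """Intersect sorted lists; skipsN maps an index in pN to a later index."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Take a skip only if its target does not overshoot p2[j];
            # every entry jumped over is then smaller than p2[j].
            if i in skips1 and p1[skips1[i]] <= p2[j]:
                while i in skips1 and p1[skips1[i]] <= p2[j]:
                    i = skips1[i]
            else:
                i += 1
        else:
            if j in skips2 and p2[skips2[j]] <= p1[i]:
                while j in skips2 and p2[skips2[j]] <= p1[i]:
                    j = skips2[j]
            else:
                j += 1
    return answer

brutus = [2, 4, 8, 41, 48, 64, 128]
caesar = [1, 2, 3, 8, 11, 17, 21, 31]
brutus_skips = {0: 3, 3: 6}  # assumed: 2 -> 41 -> 128
caesar_skips = {0: 4, 4: 7}  # from the figure: 1 -> 11 -> 31
print(intersect_with_skips(brutus, brutus_skips, caesar, caesar_skips))  # [2, 8]
```

In this run, once Brutus is at 41 and Caesar at 11, the skip from 11 to 31 is taken because 31 ≤ 41, so 17 and 21 are never compared. Because the pointers are fixed positions in the list, they are cheap to maintain only while the index is not changing too fast.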