Term Vocabulary and Postings List
Introduction to
Information Retrieval
Overview
❶ Recap
❷ Documents
❸ Terms
General + Non-English
English
❹ Skip pointers
❺ Phrase queries
Inverted Index
dictionary postings
Take-away
Documents
Parsing a document
Format/Language: Complications
A single index usually contains terms of several languages.
Sometimes a document or its components contain multiple
languages/formats.
French email with Spanish pdf attachment
What is the document unit for indexing?
A file?
An email?
An email with 5 attachments?
A group of files (ppt or latex in HTML)?
Upshot: Answering the question “what is a document?” is not
trivial and requires some design decisions.
Also: XML
Definitions
Normalization
Need to “normalize” terms in indexed text as well as query
terms into the same form.
Example: We want to match U.S.A. and USA
We most commonly implicitly define equivalence classes of
terms.
Alternatively: do asymmetric expansion
window → window, windows
windows → Windows, windows
Windows (no expansion)
More powerful, but less efficient
Why don’t you want to put window, Window, windows, and
Windows in the same equivalence class?
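The two strategies above can be contrasted in a minimal sketch. The function names and the period-stripping rule are assumptions chosen for illustration; the expansion table is copied from the window/windows/Windows example.

```python
def normalize(term: str) -> str:
    """Equivalence classing (assumed rule): drop periods so U.S.A. and USA coincide."""
    return term.replace(".", "")

# Asymmetric expansion: each query term maps to the index terms to search.
# Entries are taken from the example above; the dict itself is a hypothetical structure.
EXPANSIONS = {
    "window": ["window", "windows"],
    "windows": ["Windows", "windows"],
    "Windows": ["Windows"],  # no expansion
}

def expand_query_term(term: str) -> list[str]:
    """Return the index terms to look up for a query term (itself if unlisted)."""
    return EXPANSIONS.get(term, [term])

print(normalize("U.S.A.") == normalize("USA"))
print(expand_query_term("windows"))
```

Equivalence classing needs one lookup per query term, while expansion needs one lookup per expanded term, which is one way to read "more powerful, but less efficient".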
Input:
Output:
Exercises
In June, the dog likes to chase the cat in the barn.
How many word tokens? How many word types?
Why tokenization is difficult
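One way to check the counting exercise is the sketch below. It assumes a token is a maximal run of letters (punctuation dropped), and it reports types both with and without case folding, since the answer depends on that choice.

```python
import re

sentence = "In June, the dog likes to chase the cat in the barn."

# Assumption: a token is a maximal run of ASCII letters; punctuation is discarded.
tokens = re.findall(r"[A-Za-z]+", sentence)

# A type is a distinct token; "In" and "in" merge only if we fold case.
types_case_sensitive = set(tokens)
types_case_folded = {t.lower() for t in tokens}

print(len(tokens), len(types_case_sensitive), len(types_case_folded))  # 12 10 9
```

That the type count moves with a single normalization decision is a small instance of why tokenization is difficult.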
Numbers
3/20/91
20/3/91
Mar 20, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333
Older IR systems may not index numbers . . .
. . . but generally it’s a useful feature.
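To see why these strings are awkward, the sketch below runs a deliberately naive tokenizer over them. The rule (split on every non-alphanumeric character) is an assumption picked to show the failure, not a recommended design.

```python
import re

examples = [
    "3/20/91", "20/3/91", "Mar 20, 1991", "B-52",
    "100.2.86.144", "(800) 234-2333", "800.234.2333",
]

def naive_tokenize(text: str) -> list[str]:
    """Split on every non-alphanumeric character (deliberately naive)."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

for text in examples:
    print(f"{text!r:18} -> {naive_tokenize(text)}")
```

Under this rule the date, the aircraft name, the IP address and the phone number all break into fragments that no longer identify the original item, so an index built from them cannot match the string as a unit.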
Chinese: No whitespace
Japanese
Arabic script
[Figure: an Arabic sentence read in alternating directions (← → ← → ←, beginning at START)]
‘Algeria achieved its independence in 1962 after 132 years of French occupation.’
Case folding
Stop words
Lemmatization
Stemming
Porter algorithm
Most common algorithm for stemming English
Results suggest that it is at least as good as other stemming
options
Conventions + 5 phases of reductions
Phases are applied sequentially
Each phase consists of a set of commands.
Sample command: Delete final ement if what remains is longer
than 1 character
replacement → replac
cement → cement
Sample convention: Of the rules in a compound command,
select the one that applies to the longest suffix.
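The sample command can be sketched as below. This follows the slide's wording literally (the remainder must be longer than 1 character); the function name is hypothetical, and the full Porter algorithm states its conditions differently, so treat this as an illustration only.

```python
def strip_ement(word: str) -> str:
    """Delete a final 'ement' if what remains is longer than 1 character."""
    suffix = "ement"
    remainder = word[: -len(suffix)]
    if word.endswith(suffix) and len(remainder) > 1:
        return remainder
    return word

print(strip_ement("replacement"))  # replac
print(strip_ement("cement"))       # cement
```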
Rule          Example
SSES → SS     caresses → caress
IES  → I      ponies   → poni
SS   → SS     caress   → caress
S    →        cats     → cat
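The four rules in the table, combined with the longest-suffix convention, can be sketched as follows. The function and variable names are assumptions, lowercase input is assumed, and only this one rule group is covered, not the whole stemmer.

```python
# The table's rules as (suffix, replacement) pairs.
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def apply_rule_group(word: str) -> str:
    """Apply the rule whose suffix is the longest match; leave the word alone if none match."""
    matching = [(suffix, repl) for suffix, repl in RULES if word.endswith(suffix)]
    if not matching:
        return word
    suffix, repl = max(matching, key=lambda rule: len(rule[0]))
    return word[: len(word) - len(suffix)] + repl

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", apply_rule_group(w))
```

The SS → SS rule looks like a no-op, but under the longest-suffix convention it is what stops the S rule from turning caress into cares.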
Skip pointers
Basic idea
Phrase queries
We want to answer a query such as [stanford university] – as
a phrase.
Thus “The inventor Stanford Ovshinsky never went to
university” should not be a match.
The concept of phrase query has proven easily understood by
users.
About 10% of web queries are phrase queries.
Consequence for inverted index: it no longer suffices to store
docIDs in postings lists.
Two ways of extending the inverted index:
biword index
positional index
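A minimal sketch of why docIDs alone fall short, and how the first extension (a biword index) helps. Document 1 is a hypothetical example, document 2 is the counterexample sentence from above, and all function names are assumptions.

```python
from collections import defaultdict

# Hypothetical two-document collection (lowercased, whitespace-tokenized).
docs = {
    1: "stanford university is in california",
    2: "the inventor stanford ovshinsky never went to university",
}

def build_docid_index(docs):
    """Plain inverted index: term -> set of docIDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

def build_biword_index(docs):
    """Biword index: (word, next word) -> set of docIDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.split()
        for pair in zip(words, words[1:]):
            index[pair].add(doc_id)
    return index

docid_index = build_docid_index(docs)
biword_index = build_biword_index(docs)

# AND over docIDs cannot tell the phrase from two scattered words.
and_hits = sorted(docid_index["stanford"] & docid_index["university"])
# The biword lookup only matches adjacent occurrences.
phrase_hits = sorted(biword_index[("stanford", "university")])

print(and_hits)     # [1, 2]
print(phrase_hits)  # [1]
```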
Biword indexes
Extended biwords
Parse each document and perform part-of-speech tagging
Bucket the terms into (say) nouns (N) and
articles/prepositions (X)
Now deem any string of terms of the form NX*N to be an
extended biword
Examples: catcher in the rye
N X X N
king of Denmark
N X N
Include extended biwords in the term vocabulary
Queries are processed accordingly
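The NX*N pattern can be sketched with a regular expression over the tag sequence. The tags here are supplied by hand (a real system would get them from a part-of-speech tagger), tags are assumed to be single characters, and "O" is a hypothetical tag for anything that is neither N nor X.

```python
import re

def extended_biwords(tagged):
    """Return every term sequence whose tags match N X* N.

    tagged: list of (term, tag) pairs, tag being 'N', 'X' or 'O' (other).
    """
    tags = "".join(tag for _, tag in tagged)
    found = []
    # A lookahead finds overlapping matches: the N that closes one
    # extended biword may also open the next one.
    for m in re.finditer(r"(?=(NX*N))", tags):
        start, end = m.start(1), m.end(1)
        found.append(" ".join(term for term, _ in tagged[start:end]))
    return found

print(extended_biwords([("catcher", "N"), ("in", "X"), ("the", "X"), ("rye", "N")]))
print(extended_biwords([("king", "N"), ("of", "X"), ("Denmark", "N")]))
```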
Positional indexes
Proximity search
We just saw how to use a positional index for phrase
searches.
We can also use it for proximity search.
For example: employment /4 place
Find all documents that contain EMPLOYMENT and PLACE within
4 words of each other.
“Employment agencies that place healthcare workers are
seeing growth” is a hit.
“Employment agencies that have learned to adapt now place
healthcare workers” is not a hit.
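The /4 query above can be sketched against a small positional index. The sketch assumes whitespace tokenization, case folding, and that "within k words" means the two positions differ by at most k. It compares all position pairs by brute force, which is simple rather than efficient.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Positional index: term -> {docID -> list of word positions}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def within_k(index, term1, term2, k):
    """DocIDs in which term1 and term2 occur at most k positions apart."""
    postings1 = index.get(term1, {})
    postings2 = index.get(term2, {})
    hits = []
    for doc_id in sorted(postings1.keys() & postings2.keys()):
        if any(abs(p1 - p2) <= k
               for p1 in postings1[doc_id]
               for p2 in postings2[doc_id]):
            hits.append(doc_id)
    return hits

docs = {
    1: "Employment agencies that place healthcare workers are seeing growth",
    2: "Employment agencies that have learned to adapt now place healthcare workers",
}
index = build_positional_index(docs)
print(within_k(index, "employment", "place", 4))  # [1]
```

In document 1 the two terms are 3 positions apart, in document 2 they are 8 apart, which matches the hit and non-hit given above.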
“Proximity” intersection
Combination scheme
Biword indexes and positional indexes can be profitably
combined.
Many biwords are extremely frequent: Michael Jackson,
Britney Spears, etc.
For these biwords, the speedup over positional postings
intersection is substantial.
Combination scheme: Include frequent biwords as vocabulary
terms in the index. Do all other phrases by positional
intersection.
Williams et al. (2004) evaluate a more sophisticated mixed
indexing scheme. Faster than a positional index, at a cost of
26% more space for index.
Take-away
Resources
Chapter 2 of IIR
Resources at http://ifnlp.org/ir
Porter stemmer