IR Summary Lec 1 - Introduction

Introducing Information Retrieval and Web Search
Information Retrieval

• Information Retrieval (IR) is finding material (usually documents) of an
  unstructured nature (usually text) that satisfies an information need from
  within large collections (usually stored on computers).
[Figure: Unstructured (text) vs. structured (database) data in the mid-nineties]
[Figure: Unstructured (text) vs. structured (database) data today]
The classic search model (Sec. 1.1)

• User task: get rid of mice in a politically correct way (misconception?)
• Info need: info about removing mice without killing them (misformulation?)
• Query: "how trap mice alive", issued to a search engine over a collection
• Results may send the user back into query refinement
• Evaluating the results (Sec. 1.1): answers may be right or wrong, and
  relevant documents retrieved or not.
Term-document incidence matrices (Sec. 1.1)
• One could grep all of Shakespeare's plays for Brutus and Caesar, then
  strip out lines containing Calpurnia.
• Why is that not the answer?
  – Slow (for large corpora)
  – Queries on term positions (e.g., Romans near countrymen) are not trivial
  – The linear scan must be repeated for each query (takes too long)
  – No ranked retrieval (returning the best documents, e.g., by how often
    each word is repeated in a doc)
Term-document incidence matrix (Sec. 1.1):

              Antony and  Julius  The      Hamlet  Othello  Macbeth
              Cleopatra   Caesar  Tempest
  Antony         1          1       0        0       0        1
  Brutus         1          1       0        1       0        0
  Caesar         1          1       0        1       1        1
  Calpurnia      0          1       0        0       0        0
  Cleopatra      1          0       0        0       0        0
  mercy          1          0       1        1       1        1
  worser         1          0       1        1       1        0

Entry (t, d) is 1 if play d contains word t, and 0 otherwise.
Incidence vectors (Sec. 1.1)

• So we have a 0/1 vector for each term (one bit per play).
• To answer the query Brutus AND Caesar AND NOT Calpurnia: take the vectors
  for Brutus, Caesar and Calpurnia (complemented), then bitwise AND them:

      110100 (Brutus)
  AND 110111 (Caesar)
  AND 101111 (Calpurnia, complemented)
  =   100100

• Answers to the query: Antony and Cleopatra, and Hamlet.
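A minimal sketch of this Boolean query in Python, holding each term's 0/1
vector in an integer (the bit layout mirrors the printed vectors above;
everything else is just this toy example):

```python
# Boolean retrieval over the toy term-document incidence matrix above.
# Each term's 0/1 vector is a Python int; the leftmost printed bit
# corresponds to the first play.

plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
N = len(plays)

def vec(bits: str) -> int:
    """Parse a printed 0/1 incidence vector into an integer."""
    return int(bits, 2)

brutus, caesar, calpurnia = vec("110100"), vec("110111"), vec("010000")

mask = (1 << N) - 1                      # all-ones vector, for complement
result = brutus & caesar & (mask & ~calpurnia)

print(format(result, f"0{N}b"))          # 100100
print([plays[i] for i in range(N) if result & (1 << (N - 1 - i))])
# ['Antony and Cleopatra', 'Hamlet']
```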
Bigger collections (Sec. 1.1)

• Consider, say, 1 million documents of about 1000 words each
  (1000 * 1 million word occurrences).
• The term-document matrix is extremely sparse: most entries are 0
  (about 99.8% in this example).
• What's a better representation?
  – We only record the 1 positions.
The Inverted Index: the key data structure underlying modern IR
Inverted index (Sec. 1.2)

• For each term t, we must store a list of all documents that contain t.
  – Identify each doc by a docID, a document serial number.
• Can we use fixed-size arrays for this?

    Brutus    -> 1  2  4  11  31  45  173  174
    Caesar    -> 1  2  4  5   6   16  57   132
    Calpurnia -> 2  31 54 101

• What happens if the word Caesar is added to document 14?
Inverted index (Sec. 1.2)

• We need variable-size postings lists.
  – On disk, a continuous run of postings is normal and best.
  – In memory, we can use linked lists or variable-length arrays; each
    entry in a postings list is called a posting.
• There are tradeoffs in size vs. ease of insertion.

    Dictionary   Postings
    Brutus    -> 1  2  4  11  31  45  173  174
    Caesar    -> 1  2  4  5   6   16  57   132
    Calpurnia -> 2  31 54 101

• Postings are sorted by docID (more later on why).
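A minimal in-memory sketch of this structure, assuming toy documents and
whitespace tokenization (real indexers handle tokenization and scale very
differently):

```python
# Build an inverted index: a dictionary mapping each term to a sorted,
# variable-length postings list of docIDs. Toy data for illustration.

from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar",
    2: "So let it be with Caesar",
}

index = defaultdict(list)        # term -> postings list, sorted by docID
for doc_id in sorted(docs):      # visiting docs in docID order keeps
    for term in set(docs[doc_id].lower().split()):  # postings sorted
        index[term].append(doc_id)

print(index["caesar"])           # [1, 2]

# Appending to a Python list handles the "Caesar added to document 14"
# case that fixed-size arrays cannot: postings lists grow as needed.
```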
Inverted index construction (Sec. 1.2):

  Documents -> Tokenizer -> Linguistic modules -> Indexer -> Inverted index
  e.g.  friend -> 2, 4;   roman -> 1, 2;   countryman -> 13, 16
Initial stages of text processing (sketched in code below)

• Tokenization
  – Cut the character sequence into word tokens
  – Deal with cases like "John's", a state-of-the-art solution
• Normalization
  – Map text and query terms to the same form
  – You want U.S.A. and USA to match
• Stemming
  – We may wish different forms of a root to match
  – authorize, authorization
• Stop words
  – We may omit very common words (or not)
  – the, a, to, of
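A rough sketch of these stages; the regex tokenizer, the crude
suffix-stripping stemmer, and the tiny stop list are illustrative
assumptions, not the course's prescribed methods:

```python
# Toy text-processing pipeline: tokenize -> normalize -> stop -> stem.

import re

STOP_WORDS = {"the", "a", "to", "of"}

def tokenize(text: str) -> list[str]:
    return re.findall(r"[A-Za-z.']+", text)    # cut into word tokens

def normalize(token: str) -> str:
    return token.lower().replace(".", "")      # e.g. U.S.A. -> usa

def stem(token: str) -> str:
    for suffix in ("ization", "ize", "s"):     # crude suffix stripping
        if token.endswith(suffix):
            return token[: -len(suffix)]
    return token

def preprocess(text: str) -> list[str]:
    tokens = (normalize(t) for t in tokenize(text))
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The U.S.A. may authorize authorization"))
# ['usa', 'may', 'author', 'author']
```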
The resulting index (Sec. 1.2)

• The dictionary stores the terms and their counts; from each term a
  pointer leads to its postings list, a list of docIDs.
• IR system implementation questions:
  – How do we index efficiently?
  – How much storage do we need?
Query processing with an inverted index (Sec. 1.3)
Intersecting two postings lists (a "merge" algorithm, Sec. 1.3)

• Walk through the two postings lists simultaneously, in time linear in
  the total number of postings entries:

    Brutus -> 2  4  8  16  32  64  128
    Caesar -> 1  2  3  5   8   13  21  34
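The merge in Python, following the algorithm described above directly:

```python
# Intersect two sorted postings lists by walking both simultaneously,
# in time linear in the total number of postings entries.

def intersect(p1: list[int], p2: list[int]) -> list[int]:
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])      # docID present in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                    # advance the pointer on the
        else:                         # smaller docID
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))      # [2, 8]
```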
Quiz

• When a search engine returns 30 pages, only 20 of which are relevant,
  while failing to return 40 additional relevant pages, its precision
  = ............... while its recall = ...............
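A worked answer, using the standard definitions (precision = relevant
retrieved / retrieved; recall = relevant retrieved / all relevant): 20 of
the 30 returned pages are relevant, so precision = 20/30 ≈ 0.67; there are
20 + 40 = 60 relevant pages in total, so recall = 20/60 ≈ 0.33.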
The Boolean Retrieval Model & Extended Boolean Models
Phrase queries (Sec. 2.4)

• We want to be able to answer queries such as "stanford university" –
  as a phrase.
• Thus the sentence "I went to university at Stanford" is not a match.
  – The concept of phrase queries has proven easily understood by users;
    it is one of the few "advanced search" ideas that works.
• For this, it no longer suffices to store only <term : docs> entries.
Positional index example (Sec. 2.4.1)

    <be: 993427;
       1: 7, 18, 33, 72, 86, 231;
       2: 3, 149;
       4: 17, 191, 291, 430, 434;
       ...>

  (term: document frequency; then, per document, docID: positions of the
  term in that document)

• Which of docs 1, 2, 4, 5 could contain "to be or not to be"?
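A minimal sketch of answering a two-word phrase query from a positional
index; the dictionary layout and the toy positions below are illustrative
assumptions, not the slide's data:

```python
# positional_index maps term -> {docID: sorted positions in that doc}.

positional_index = {
    "stanford":   {1: [3, 40], 2: [7]},
    "university": {1: [4, 19], 3: [2]},
}

def phrase_docs(t1: str, t2: str) -> list[int]:
    """Docs where t2 occurs immediately after t1."""
    p1, p2 = positional_index[t1], positional_index[t2]
    hits = []
    for doc in sorted(p1.keys() & p2.keys()):   # docs containing both terms
        positions2 = set(p2[doc])
        if any(pos + 1 in positions2 for pos in p1[doc]):
            hits.append(doc)
    return hits

print(phrase_docs("stanford", "university"))    # [1]
```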
Rules of thumb

• A positional index is 2–4 times as large as a non-positional index.
Combination schemes

• The biword and positional approaches can be profitably combined (a
  biword sketch follows below).
  – For particular phrases ("Michael Jackson", "Britney Spears") it is
    inefficient to keep merging positional postings lists, so popular
    biwords are indexed as single terms.
  – Even more so for phrases like "The Who", where intersection is very
    expensive in a positional index (both words have large postings lists).
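A rough sketch of the biword idea under these assumptions: every adjacent
word pair in a toy collection is indexed as a single dictionary term, so a
popular phrase becomes one postings lookup (a real system would index only
selected frequent biwords):

```python
# Biword index: two-word phrases as dictionary terms.

from collections import defaultdict

docs = {1: "michael jackson thriller", 2: "jackson michael"}

biword_index = defaultdict(list)
for doc_id, text in sorted(docs.items()):
    tokens = text.split()
    for w1, w2 in zip(tokens, tokens[1:]):      # each adjacent pair
        biword_index[f"{w1} {w2}"].append(doc_id)

print(biword_index["michael jackson"])   # [1] -- a single postings lookup
```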
Query optimization

• What is the best order for query processing?
• Consider a query that is an AND of n terms.
• For each of the n terms, get its postings, then AND them together.

    Brutus    -> 2  4  8  16  32  64  128
    Caesar    -> 1  2  3  5   8   16  21  34
    Calpurnia -> 13 16

• Heuristic: process terms in order of increasing postings-list size,
  starting with the rarest term (here Calpurnia), so that intermediate
  results stay small.
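A sketch of this heuristic, assuming the postings above; for
self-containment each AND step uses a set-membership filter rather than
the linear merge from the earlier sketch:

```python
# AND query processed in order of increasing postings-list length.

postings = {
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Caesar":    [1, 2, 3, 5, 8, 16, 21, 34],
    "Calpurnia": [13, 16],
}

def and_query(terms: list[str]) -> list[int]:
    ordered = sorted(terms, key=lambda t: len(postings[t]))
    result = postings[ordered[0]]            # start with the rarest term
    for term in ordered[1:]:
        docs = set(postings[term])
        result = [d for d in result if d in docs]   # result only shrinks
    return result

print(and_query(["Brutus", "Caesar", "Calpurnia"]))   # [16]
```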
Exercise

• Recommend a query processing order for a given AND query, using the
  terms' postings-list sizes.
Structured vs. Unstructured Data
IR vs. databases: structured vs. unstructured data

• Structured data tends to refer to information in "tables":

    Employee  Manager  Salary
    Smith     Jones    50000
    Chang     Smith    60000
    Ivy       Smith    50000
Semi-structured data

• In fact, almost no data is truly "unstructured".
• E.g., this slide has distinctly identified zones such as the Title and
  Bullets
• ... to say nothing of linguistic structure.
• This facilitates "semi-structured" search (sketched below) such as
  – Title contains data AND Bullets contain search
• Or even
  – Title is about Object Oriented Programming AND Author something like
    stro*rup
  – where * is the wildcard operator
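A minimal sketch of such zone-based search; the toy slide corpus, the
field names, and the use of fnmatch for the * wildcard are illustrative
assumptions:

```python
# Fielded (zone) search: queries constrain individual zones of a document.

from fnmatch import fnmatch

slides = [
    {"title": "Semistructured data", "bullets": "search over zones",
     "author": "stroustrup"},
    {"title": "Plain data", "bullets": "nothing here", "author": "smith"},
]

def zone_query(doc: dict) -> bool:
    # Title contains "data" AND Bullets contain "search"
    # AND Author matches the wildcard pattern stro*rup
    return ("data" in doc["title"].lower()
            and "search" in doc["bullets"].lower()
            and fnmatch(doc["author"].lower(), "stro*rup"))

print([d["title"] for d in slides if zone_query(d)])
# ['Semistructured data']
```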