C1 Intro
Introduction to Information Retrieval
Introducing Information Retrieval and Web Search
Information Retrieval
• Information Retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from
within large collections (usually stored on computers).
– These days we frequently think first of web search, but there are many
other cases:
• E-mail search
• Searching your laptop
• Corporate knowledge bases
• Legal information retrieval
IR vs. databases:
Unstructured (text) vs. structured (database)
Structured vs unstructured data
• Structured data tends to refer to information
in “tables”
Employee   Manager   Salary
Smith      Jones     50000
Chang      Smith     60000
Ivy        Smith     50000
11/25/2015
Sec. 1.1
• Goal: Retrieve documents with information that is relevant to the user’s
information need and helps the user complete a task
[Figure: Info need (info about removing mice without killing them)
→ Misformulation? → Query (how trap mice alive)
→ Search engine (over the Collection) → Results
→ Query refinement (back to Query)]
            Antony and   Julius   The
            Cleopatra    Caesar   Tempest   Hamlet   Othello   Macbeth
Antony      1            1        0         0        0         1
Brutus      1            1        0         1        0         0
Caesar      1            1        0         1        1         1
Calpurnia   0            1        0         0        0         0
Cleopatra   1            0        0         0        0         0
mercy       1            0        1         1        1         1
worser      1            0        1         1        1         0
• Bitwise AND of the rows for Brutus, Caesar, and the complement of Calpurnia:
– 110100 AND
– 110111 AND
– 101111 =
– 100100
• Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar I was killed i’ the
Capitol; Brutus killed me.
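The bit-vector computation 110100 AND 110111 AND 101111 = 100100 can be reproduced with integer bitwise operators. The following is a minimal sketch, not a definitive implementation: the names `PLAYS`, `INCIDENCE`, `MASK`, and `matching_plays` are illustrative assumptions, and the leftmost bit is taken to stand for the first play in the matrix.

```python
# Incidence vectors taken from the matrix; leftmost bit = first play.
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
INCIDENCE = {
    "Brutus": 0b110100,
    "Caesar": 0b110111,
    "Calpurnia": 0b010000,
}
# Mask that keeps the complement within 6 bits (one bit per play).
MASK = (1 << len(PLAYS)) - 1


def matching_plays(bits):
    """Return the plays whose bit is set, reading bits left to right."""
    n = len(PLAYS)
    return [play for i, play in enumerate(PLAYS)
            if (bits >> (n - 1 - i)) & 1]


# Brutus AND Caesar AND NOT Calpurnia
not_calpurnia = ~INCIDENCE["Calpurnia"] & MASK   # 101111
result = INCIDENCE["Brutus"] & INCIDENCE["Caesar"] & not_calpurnia

print(format(result, "06b"))      # 100100
print(matching_plays(result))     # ['Antony and Cleopatra', 'Hamlet']
```

The mask matters because Python integers are unbounded: without it, `~` would yield a negative number rather than a 6-bit complement.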
Sec. 1.2
Inverted index
• For each term t, we must store a list of all documents
that contain t.
– Identify each doc by a docID, a document serial number
• Can we use fixed-size arrays for this?
Brutus: 1 2 4 11 31 45 173 174
Caesar: 1 2 4 5 6 16 57 132
Calpurnia: 2 31 54 101
Linguistic modules → Modified tokens: friend roman countryman
Indexer → friend: 2 4
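The idea of storing, for each term, the list of docIDs that contain it can be sketched with a dictionary of variable-length lists. This is an illustration under stated assumptions: the three-document collection `docs` is hypothetical (chosen only so that "friend" ends up with postings 2 and 4), and lowercasing plus whitespace splitting stands in for real linguistic modules.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted list of the docIDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Stand-in for tokenization and linguistic normalization.
        for token in text.lower().split():
            postings[token].add(doc_id)
    # Postings are kept sorted by docID so that lists can be merged later.
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical toy collection: docID -> text.
docs = {
    1: "Roman countryman",
    2: "Friend Roman",
    4: "friend countryman",
}

index = build_inverted_index(docs)
for term in sorted(index):
    print(term, index[term])
```

Because each list grows with the number of documents containing the term, variable-length lists fit the data more naturally than fixed-size arrays.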
Why frequency?
Will discuss later.
[Figure: terms and counts, with pointers to lists of docIDs]
IR system implementation
• How do we index efficiently?
• How much storage do we need?
Query processing with an inverted index
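Processing an AND query over an inverted index comes down to intersecting two postings lists. A minimal sketch of the standard linear merge follows, assuming both lists are sorted by increasing docID; the function name `intersect` is illustrative, and the two lists are the Brutus and Calpurnia postings shown in this section.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by increasing docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID present in both lists: keep it and advance both.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1   # advance the list with the smaller docID
        else:
            j += 1
    return answer


brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))   # [2, 31]
```

Each pointer only moves forward, so the merge takes time proportional to the combined length of the two lists. That bound relies on the postings being kept sorted.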
Sec. 1.3
5
11/25/2015
Sec. 1.3
& Extended Boolean Models
– Perhaps the simplest model to build an IR system on
• Primary commercial retrieval tool for 3 decades.
• Many search systems you still use are Boolean:
– Email, library catalog, Mac OS X Spotlight
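Query: Brutus AND Calpurnia AND Caesar
• Process terms in order of increasing document frequency, so that
intermediate results stay small: execute the query as
(Calpurnia AND Brutus) AND Caesar.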
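The ordering (Calpurnia AND Brutus) AND Caesar can be derived by sorting the query terms by postings-list length. The sketch below is illustrative only: `and_query` is a hypothetical helper, the lengths of the lists as printed in this section stand in for true document frequencies, and Python sets are used for the intersection to keep the example short.

```python
def and_query(terms, index):
    """Evaluate a conjunctive query, shortest postings list first."""
    # Stable sort: terms with equal list lengths keep their query order.
    ordered = sorted(terms, key=lambda t: len(index[t]))
    result = set(index[ordered[0]])
    for term in ordered[1:]:
        result &= set(index[term])
        if not result:
            break   # nothing left to intersect
    return ordered, sorted(result)


# Postings as listed in this section (assumed complete for the example).
index = {
    "Brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}

order, doc_ids = and_query(["Brutus", "Calpurnia", "Caesar"], index)
print(order)     # ['Calpurnia', 'Brutus', 'Caesar']
print(doc_ids)   # [2]
```

Starting from the shortest list means no intermediate result can be larger than that list.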
Exercises
• Consider the following collection:
Exercise
• For a collection of 500000 documents, recommend a query
processing order for
(tangerine OR trees) AND
(marmalade OR skies) AND
(kaleidoscope OR eyes)
Term           Freq
eyes           213312
kaleidoscope   87009
marmalade      107913
skies          271658
tangerine      46653
trees          316812
Exercises
• Consider a biword index. Give an example of a document returned for the
query “bibliothèque scientifique d’Evry” that is in fact a false positive.
• Write the biword index for the preceding collection.
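One common heuristic for a query of this shape is to estimate the size of each OR group by the sum of its term frequencies (an upper bound, capped at the collection size) and to process the groups in increasing order of that estimate. The sketch below applies the heuristic to the frequency table; the names `FREQ`, `QUERY`, `estimated_size`, and `processing_order` are illustrative, and this is one possible approach rather than the only acceptable answer.

```python
N_DOCS = 500000

FREQ = {
    "eyes": 213312,
    "kaleidoscope": 87009,
    "marmalade": 107913,
    "skies": 271658,
    "tangerine": 46653,
    "trees": 316812,
}

# Each tuple is one OR group; the groups are ANDed together.
QUERY = [
    ("tangerine", "trees"),
    ("marmalade", "skies"),
    ("kaleidoscope", "eyes"),
]


def estimated_size(or_group, freq, n_docs=N_DOCS):
    """Upper bound on the result size of an OR over the group's terms."""
    return min(sum(freq[t] for t in or_group), n_docs)


def processing_order(query, freq):
    """Order OR groups by increasing estimated size."""
    return sorted(query, key=lambda group: estimated_size(group, freq))


for group in processing_order(QUERY, FREQ):
    print(" OR ".join(group), estimated_size(group, FREQ))
```

The sum is only a conservative estimate, since documents containing both terms of a group are counted twice.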
Exercises
• Consider the following positional index: