CSCI 7000 Modern Information Retrieval: Lecture 1: Introduction
Lecture 1: Introduction
Information Retrieval
Information retrieval is the science of searching for information in
documents, searching for documents themselves, searching for
metadata which describe documents, or searching within databases,
whether relational stand-alone databases or hypertextually-networked
databases such as the World Wide Web.
Wikipedia
Finding material of an unstructured nature that satisfies an information
need from within large collections.
Manning et al. 2008
The study of methods and structures used to represent and access
information.
Witten et al.
The IR definition can be found in this book.
Salton
IR deals with the representation, storage, organization of, and access to
information items.
Salton
Information retrieval is the term conventionally, though somewhat
inaccurately, applied to the type of activity discussed in this volume.
van Rijsbergen
IR is now largely what Google does…
But...
Other Hot Topics
Image search
How to index images
With and without additional information like captions
Multilingual issues
Cross-language search and indexing
Spoken language issues
ASR for indexing videos
Manning…
Unstructured (text) vs. structured (database) data in 1996
Boulder players
Course Plan
Last year...
Go to the web
Term-Document Matrix
           Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     1                     1              0            0       0        1
Brutus     1                     1              0            1       0        0
Caesar     1                     1              0            1       1        1
Calpurnia  0                     1              0            0       0        0
Cleopatra  1                     0              0            0       0        0
mercy      1                     0              1            1       1        1
worser     1                     0              1            1       1        0
Incidence vectors
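A minimal sketch of how incidence vectors can answer a Boolean query: each row of the matrix above is treated as a bit vector, and a play matches when every required term has a 1 and every excluded term has a 0 in that column. The query used here (Brutus AND Caesar AND NOT Calpurnia) and the function name `matching_plays` are illustrative assumptions, not taken from the slide.

```python
# Columns of the term-document matrix, in slide order.
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# One incidence vector per term, copied from the matrix.
INCIDENCE = {
    "Antony":    [1, 1, 0, 0, 0, 1],
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
    "Cleopatra": [1, 0, 0, 0, 0, 0],
    "mercy":     [1, 0, 1, 1, 1, 1],
    "worser":    [1, 0, 1, 1, 1, 0],
}

def matching_plays(required, excluded=()):
    """Return plays containing all `required` terms and no `excluded` term."""
    hits = []
    for i, play in enumerate(PLAYS):
        has_all = all(INCIDENCE[t][i] for t in required)
        has_none = not any(INCIDENCE[t][i] for t in excluded)
        if has_all and has_none:
            hits.append(play)
    return hits

# Illustrative query: Brutus AND Caesar AND NOT Calpurnia
print(matching_plays(["Brutus", "Caesar"], ["Calpurnia"]))
# ['Antony and Cleopatra', 'Hamlet']
```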
Answers to query
Bigger corpora
Can’t build the matrix explicitly
Inverted index
Brutus 2 4 8 16 32 64 128
Calpurnia 1 2 3 5 8 13 21 34
Caesar 13 16
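One way to see why sorted postings lists are useful is a linear-time intersection for AND queries. The sketch below walks two sorted lists in step and keeps the document IDs found in both. It uses the postings shown above; the function name `intersect` is an assumption made for illustration.

```python
def intersect(p1, p2):
    """Merge two sorted postings lists, keeping doc IDs present in both."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller doc ID
        else:
            j += 1
    return answer

# Postings lists from the slide: term -> sorted doc IDs.
postings = {
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Calpurnia": [1, 2, 3, 5, 8, 13, 21, 34],
    "Caesar":    [13, 16],
}

print(intersect(postings["Brutus"], postings["Calpurnia"]))  # [2, 8]
print(intersect(postings["Brutus"], postings["Caesar"]))     # [16]
```

Each pointer only moves forward, so the work is proportional to the combined length of the two lists.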
Inverted index
Tokenizer
    Token stream: Friends Romans Countrymen
Linguistic modules (more on these later)
    Modified tokens: friend roman countryman
Indexer
    Inverted index:
        friend      2 4
        roman       1 2
        countryman  13 16
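The pipeline above can be sketched in a few lines, under heavy simplification. The tokenizer here is a plain regular expression, and the linguistic modules are replaced by lowercasing plus a tiny hand-written lookup table (`TOY_LEMMAS`), which is a hypothetical stand-in and not a real stemmer or lemmatizer. All function names are assumptions for illustration.

```python
import re

# Hypothetical stand-in for the linguistic modules: a hand-written lookup
# that covers only the three words on the slide.
TOY_LEMMAS = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}

def tokenize(text):
    """Tokenizer: split text into word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z']+", text)

def normalize(token):
    """Linguistic modules (toy version): lowercase, then look up a base form."""
    lowered = token.lower()
    return TOY_LEMMAS.get(lowered, lowered)

def index_terms(doc_id, text):
    """Produce the (modified token, doc ID) pairs handed to the indexer."""
    return [(normalize(tok), doc_id) for tok in tokenize(text)]

print(tokenize("Friends, Romans, Countrymen"))
# ['Friends', 'Romans', 'Countrymen']
print(index_terms(1, "Friends, Romans, Countrymen"))
# [('friend', 1), ('roman', 1), ('countryman', 1)]
```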
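Indexer steps
Sequence of (Modified token, Document ID) pairs, from Doc 1 then Doc 2.

Term        Doc #
I           1
did         1
enact       1
julius      1
caesar      1
I           1
was         1
killed      1
i'          1
the         1
capitol     1
brutus      1
killed      1
me          1
so          2
let         2
it          2
be          2
with        2
caesar      2
the         2
noble       2
brutus      2
hath        2
told        2
you         2
caesar      2
was         2
ambitious   2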
Sort by terms.
Core indexing step.

Term        Doc #        Term        Doc #
I           1            ambitious   2
did         1            be          2
enact       1            brutus      1
julius      1            brutus      2
caesar      1            capitol     1
I           1            caesar      1
was         1            caesar      2
killed      1            caesar      2
i'          1            did         1
the         1            enact       1
capitol     1            hath        2
brutus      1            I           1
killed      1            I           1
me          1            i'          1
so          2            it          2
let         2            julius      1
it          2            killed      1
be          2            killed      1
with        2            let         2
caesar      2            me          1
the         2            noble       2
noble       2            so          2
brutus      2            the         1
hath        2            the         2
told        2            told        2
you         2            you         2
caesar      2            was         1
was         2            was         2
ambitious   2            with        2
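The two steps so far (emit pairs, then sort them) can be sketched as follows. The token lists are the modified tokens from the slides; the function names and the case-insensitive sort key are assumptions for illustration.

```python
# Modified tokens per document, in the order they appear on the slides.
DOCS = {
    1: "I did enact julius caesar I was killed i' the capitol brutus killed me".split(),
    2: "so let it be with caesar the noble brutus hath told you caesar was ambitious".split(),
}

def term_doc_pairs(docs):
    """Emit one (term, doc ID) pair per token, document by document."""
    return [(term, doc_id) for doc_id in sorted(docs) for term in docs[doc_id]]

def sort_pairs(pairs):
    """Core indexing step: sort by term (ignoring case), then by doc ID."""
    return sorted(pairs, key=lambda pair: (pair[0].lower(), pair[1]))

pairs = term_doc_pairs(DOCS)
sorted_pairs = sort_pairs(pairs)
print(len(pairs))        # 29
print(sorted_pairs[:4])  # [('ambitious', 2), ('be', 2), ('brutus', 1), ('brutus', 2)]
```

This sorts strictly alphabetically, so caesar lands before capitol and was before you, which differs slightly from the ordering shown on the slide.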
Multiple term entries in a single document are merged.
Frequency information is added.
We’ll see why frequency matters later.

Term        Doc #        Term        Doc #   Term freq
ambitious   2            ambitious   2       1
be          2            be          2       1
brutus      1            brutus      1       1
brutus      2            brutus      2       1
capitol     1            capitol     1       1
caesar      1            caesar      1       1
caesar      2            caesar      2       2
caesar      2            did         1       1
did         1            enact       1       1
enact       1            hath        2       1
hath        2            I           1       2
I           1            i'          1       1
I           1            it          2       1
i'          1            julius      1       1
it          2            killed      1       2
julius      1            let         2       1
killed      1            me          1       1
killed      1            noble       2       1
let         2            so          2       1
me          1            the         1       1
noble       2            the         2       1
so          2            told        2       1
the         1            you         2       1
the         2            was         1       1
told        2            was         2       1
you         2            with        2       1
was         1
was         2
with        2
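Because the pairs are already sorted, duplicates sit next to each other, so merging is a single pass that counts each run of identical pairs. A minimal sketch using `itertools.groupby` on an excerpt of the sorted list; the function name `merge_sorted_pairs` is an assumption for illustration.

```python
from itertools import groupby

# Excerpt of the sorted (term, doc ID) list from the slide.
SORTED_PAIRS = [
    ("brutus", 1), ("brutus", 2), ("capitol", 1),
    ("caesar", 1), ("caesar", 2), ("caesar", 2), ("did", 1),
]

def merge_sorted_pairs(sorted_pairs):
    """Collapse adjacent identical pairs into (term, doc ID, term freq)."""
    merged = []
    for (term, doc_id), run in groupby(sorted_pairs):
        merged.append((term, doc_id, len(list(run))))
    return merged

for row in merge_sorted_pairs(SORTED_PAIRS):
    print(row)
# ('brutus', 1, 1)
# ('brutus', 2, 1)
# ('capitol', 1, 1)
# ('caesar', 1, 1)
# ('caesar', 2, 2)
# ('did', 1, 1)
```

`groupby` only merges adjacent equal items, which is why the sort has to come first.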
Storage costs?

Terms (dictionary)                   Pointers (postings)
Term        N docs  Coll freq        (Doc #, Freq)
ambitious   1       1                (2, 1)
be          1       1                (2, 1)
brutus      2       2                (1, 1) (2, 1)
capitol     1       1                (1, 1)
caesar      2       3                (1, 1) (2, 2)
did         1       1                (1, 1)
enact       1       1                (1, 1)
hath        1       1                (2, 1)
I           1       2                (1, 2)
i'          1       1                (1, 1)
it          1       1                (2, 1)
julius      1       1                (1, 1)
killed      1       2                (1, 2)
let         1       1                (2, 1)
me          1       1                (1, 1)
noble       1       1                (2, 1)
so          1       1                (2, 1)
the         2       2                (1, 1) (2, 1)
told        1       1                (2, 1)
you         1       1                (2, 1)
was         2       2                (1, 1) (2, 1)
with        1       1                (2, 1)
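A rough sketch of the split into a dictionary and postings, built from the same two documents. The dictionary records, per term, the number of documents containing it and its total collection frequency; the postings hold (doc ID, term frequency) entries. The field names `n_docs` and `coll_freq` and the function name `build_index` are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Modified tokens per document, as on the earlier slides.
DOCS = {
    1: "I did enact julius caesar I was killed i' the capitol brutus killed me".split(),
    2: "so let it be with caesar the noble brutus hath told you caesar was ambitious".split(),
}

def build_index(docs):
    """Return (dictionary, postings) for a small in-memory collection."""
    postings = defaultdict(list)  # term -> [(doc ID, term freq), ...]
    for doc_id in sorted(docs):
        for term, tf in Counter(docs[doc_id]).items():
            postings[term].append((doc_id, tf))
    dictionary = {
        term: {"n_docs": len(plist), "coll_freq": sum(tf for _, tf in plist)}
        for term, plist in postings.items()
    }
    return dictionary, dict(postings)

dictionary, postings = build_index(DOCS)
print(dictionary["caesar"])  # {'n_docs': 2, 'coll_freq': 3}
print(postings["caesar"])    # [(1, 1), (2, 2)]
print(len(dictionary))       # 22
```

The storage cost grows with the number of distinct terms in the dictionary and with the number of postings entries, one per (term, document) pair.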
Distributed Systems
Administrivia
Work/Grading:
Problem sets and programming exercises 50%
Quizzes 20%
Group Project 30%
Textbooks:
Introduction to Information Retrieval --- Manning, Raghavan and Schütze
Programming Collective Intelligence --- Toby Segaran
Next time