IR-Lec1 - Ch1-2023
Information Retrieval
Course Outline
• Course No: IS418
• Course Title: Information Storage and Retrieval
• Hrs/week: 2 Lect, 2 Lab
• Year: 2023/2024 – 4th Year
• Semester: First
• Exam Hours: 2
Assessment Methods:
• Assessment weight
• Total 100 %
Course Resources
• Textbook:
– Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze,
“Introduction to Information Retrieval”, Cambridge University Press,
Cambridge, England, 2009.
• Additional Materials:
– Lecture Slides.
Sec. 1.1
Course Content
Introduction to Information Retrieval
Introducing Information Retrieval and Web Search
Introduction
Sec.
1.1
• Information Need is the topic about which the user desires to know
more and is differentiated from a query
[Figure: the user's query is submitted to the IR system, which retrieves an answer list from the document collection]
Information Retrieval
• Information Retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from
within large collections (usually stored on computers).
– These days we frequently think first of web search, but there are many
other cases:
• E-mail search
Main issues in IR
• System evaluation
– How good is a system?
– Are the retrieved documents relevant? (precision)
– Are all the relevant documents retrieved? (recall)
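The two evaluation measures above can be sketched in a few lines of Python (the docID sets here are made-up examples, not from the lecture):

```python
# Hypothetical example: the system retrieved docs {1, 3, 5, 7}; a human judge
# says docs {1, 2, 3, 4} are the relevant ones.
retrieved = {1, 3, 5, 7}
relevant = {1, 2, 3, 4}

hits = retrieved & relevant             # relevant documents we actually retrieved
precision = len(hits) / len(retrieved)  # fraction of retrieved docs that are relevant
recall = len(hits) / len(relevant)      # fraction of relevant docs that were retrieved

print(precision)  # 0.5
print(recall)     # 0.5
```

A system can trivially get perfect recall by returning everything, which is why the two measures are always reported together.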
Structured vs. Unstructured Data
IR vs. databases: Structured vs. unstructured data
• Structured data tends to refer to information in “tables”
– Typically allows numerical range and exact match (for text) queries, e.g.,
Salary < 60000 AND Manager = Smith.
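The structured query above can be sketched with Python's built-in sqlite3 module (the employees table and its rows are hypothetical, invented for illustration):

```python
import sqlite3

# Build a tiny in-memory "employees" table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER, manager TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Alice", 55000, "Smith"), ("Bob", 70000, "Smith"), ("Carol", 48000, "Jones")],
)

# Structured data supports numerical range and exact-match queries directly:
rows = conn.execute(
    "SELECT name FROM employees WHERE salary < 60000 AND manager = 'Smith'"
).fetchall()
print([r[0] for r in rows])  # -> ['Alice']
```

This is exactly what a text collection lacks: there is no schema to match `Salary` or `Manager` against, which is the gap IR fills.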
Semi-structured data
• E.g., this slide has distinctly identified zones such as the Title and Bullets
• This facilitates “semi-structured” search such as
– Title is about Object Oriented Programming AND Author something like stro*rup
Unstructured (text) vs. structured (database) data in 2009
[Chart slide]
Unstructured (text) vs. structured (database) data today
[Chart slide]
Term-document incidence matrices
• Which plays of Shakespeare contain the words Brutus AND Caesar but
NOT Calpurnia?
• One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip
out lines containing Calpurnia; but that is slow for large corpora, and
handling NOT Calpurnia is non-trivial.
– Other operations (e.g., find the word Romans near countrymen) not feasible
• The way to avoid linearly scanning the texts for each query is to index the
documents in advance.
Incidence vectors
• So we have a 0/1 vector for each term.
• To answer the query: take the vectors for Brutus, Caesar and
Calpurnia (complemented for the last) 🡺 bitwise AND.
– 110100 AND 110111 AND 101111 = 100100
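The bitwise AND above can be reproduced directly on integers in Python (the vectors are the ones from the slide; masking to 6 bits keeps the complement within the six plays):

```python
# Term-document incidence vectors over six Shakespeare plays:
# a 1 means the term appears in that play.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

# Brutus AND Caesar AND NOT Calpurnia: complement Calpurnia's vector,
# then bitwise-AND all three. Mask to 6 bits so ~ stays in range.
mask = 0b111111
result = brutus & caesar & (~calpurnia & mask)
print(format(result, "06b"))  # -> 100100
```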
Sec.
1.1
Answers to query
• 100100: Antony and Cleopatra and Hamlet contain Brutus and Caesar but not Calpurnia.
Bigger collections
• Consider a corpus of N = 1 million documents,
• each document with about 1,000 words,
• each word averaging 6 bytes, including spaces/punctuation.
• Size of corpus = 1 million × 1,000 × 6 bytes = 6 GB
• Number of distinct terms: M = 500,000 distinct terms among these documents.
• Number of cells in the term-document matrix = 1 million × 500,000
= 0.5 trillion (too much for memory).
• Can we cut down on the space?
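The arithmetic above checks out; a quick Python sketch also shows why the dense matrix is hopeless even at one bit per cell (the 1-bit-per-cell extension is my addition, not from the slide):

```python
# Back-of-the-envelope numbers from the slide.
N_DOCS = 1_000_000
WORDS_PER_DOC = 1_000
BYTES_PER_WORD = 6
DISTINCT_TERMS = 500_000

corpus_bytes = N_DOCS * WORDS_PER_DOC * BYTES_PER_WORD
matrix_cells = N_DOCS * DISTINCT_TERMS

print(corpus_bytes / 1e9)  # -> 6.0 (GB of raw text)
print(matrix_cells)        # -> 500000000000 (0.5 trillion cells)

# Even stored at one bit per cell, the dense matrix needs ~62.5 GB,
# ten times the corpus itself:
print(matrix_cells / 8 / 1e9)  # -> 62.5
```

The matrix is also extremely sparse (at most 1,000 distinct terms per document), which is exactly what the inverted index exploits.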
Inverted index
• It is sometimes called an inverted file.
• It keeps a dictionary of terms (sometimes referred to as a vocabulary or lexicon).
• We use dictionary for the data structure and vocabulary for the set of terms.
• Postings list (inverted list): a list that records which documents the term
occurs in.
• All the postings lists taken together are referred to as the postings.
• Posting: each item in the list, which records that a term appeared in a
document.
• The dictionary is sorted alphabetically and each postings list is sorted by
document ID.
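A minimal inverted index can be sketched as a Python dict mapping each term to a sorted list of docIDs (the three sample documents are invented for illustration):

```python
from collections import defaultdict

# Toy document collection (hypothetical texts, keyed by docID).
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "new motor home sales",
}

index = defaultdict(list)
for doc_id in sorted(docs):                 # visit documents in docID order...
    for term in set(docs[doc_id].split()):  # ...each distinct term once per doc
        index[term].append(doc_id)          # so every postings list stays sorted

print(index["home"])   # -> [1, 2, 3]
print(index["sales"])  # -> [1, 2, 3]
print(index["new"])    # -> [1, 3]
```

Because docIDs are appended in increasing order, each postings list comes out sorted by document ID with no extra sort, matching the invariant stated above.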
Sec.
1.2
Inverted index
• For each term t, we must store a list of all documents that
contain t.
– Identify each doc by a docID, a document serial number
– DocID a unique number for each document, known as the
document identifier
• Can we use fixed-size arrays for this?
Inverted index
• We need variable-size postings lists.
– On disk, a contiguous run of postings is normal and best.
– In memory, can use linked lists or variable-length arrays.
• Some tradeoffs in size/ease of insertion.
[Figure: the indexing pipeline runs documents through the linguistic modules
and the indexer to build the inverted index: a dictionary whose entries point
to postings lists sorted by docID (more on this later), e.g.
friend → 2 → 4
roman → 1 → 2
countryman → 13 → 16]
Initial stages of text processing
• Tokenization
– Cut character sequence into word tokens
• Deal with “John’s”, a state-of-the-art solution
• Normalization
– Map text and query term to same form
• You want U.S.A. and USA to match
• Stemming
– We may wish different forms of a root to match
• authorize, authorization
• Stop words
– We may omit very common words (or not)
• the, a, to, of
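The four stages above can be sketched as a toy pipeline (the tokenizer, normalizer, and stemmer below are deliberately naive stand-ins for real components such as the Porter stemmer, not the lecture's algorithms):

```python
# Stop words from the slide's example list.
STOP_WORDS = {"the", "a", "to", "of"}

def tokenize(text):
    # Cut the character sequence into word tokens (naive: lowercase + split).
    return text.lower().split()

def normalize(token):
    # Map variant forms to the same form, e.g. U.S.A. -> usa.
    return token.strip(".,;").replace(".", "")

def stem(token):
    # Crude suffix-stripping stand-in for a real stemmer.
    for suffix in ("ization", "ation", "ize", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = [normalize(t) for t in tokenize(text)]
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The U.S.A. authorization of the query"))
# -> ['usa', 'author', 'query']
```

The same pipeline must be applied to both documents and queries, otherwise U.S.A. in a document would never match USA in a query.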
Doc 1: I did enact Julius Caesar I was killed i’ the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
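The indexer's core steps on these two documents can be sketched as: generate (term, docID) pairs, sort by term then docID, and merge duplicates into postings lists (a simplified sketch; tokenization here is a plain lowercase split with punctuation pre-stripped):

```python
# The two example documents, already lowercased and stripped of punctuation.
doc1 = "i did enact julius caesar i was killed i the capitol brutus killed me"
doc2 = "so let it be with caesar the noble brutus hath told you caesar was ambitious"

# Step 1: emit a (term, docID) pair for every token.
pairs = [(term, 1) for term in doc1.split()] + [(term, 2) for term in doc2.split()]

# Step 2: sort by term, then by docID.
pairs.sort()

# Step 3: merge into postings lists, collapsing repeats within a document.
postings = {}
for term, doc_id in pairs:
    plist = postings.setdefault(term, [])
    if not plist or plist[-1] != doc_id:
        plist.append(doc_id)

print(postings["brutus"])   # -> [1, 2]
print(postings["caesar"])   # -> [1, 2]
print(postings["capitol"])  # -> [1]
```

Sorting is the heart of the indexer: once the pairs are ordered, each postings list can be built in a single linear merge pass.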
• Multiple term entries in a single document are merged, and the result is
split into a dictionary and postings.
• Document frequency information is added (stored with each term's
postings list).
– Why frequency? Will discuss later.
[Figure: the final index. The dictionary stores terms and counts, with
pointers to the lists of docIDs, i.e. the postings lists.]
IR system implementation
• How do we index efficiently?
• How much storage do we need?
• We can use a hybrid scheme, with a linked list of fixed-length arrays for
each term.
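The hybrid scheme can be sketched as follows (the block size and class names are my own illustrative choices, not from the lecture): each term's postings live in a chain of fixed-capacity blocks, so most docIDs sit in contiguous arrays while appends only ever touch the tail block.

```python
BLOCK_SIZE = 4  # illustrative fixed block capacity

class Block:
    """One fixed-length array of docIDs, linked to the next block."""
    def __init__(self):
        self.doc_ids = []
        self.next = None

class PostingsList:
    """A linked list of fixed-length blocks for one term's postings."""
    def __init__(self):
        self.head = self.tail = Block()

    def append(self, doc_id):
        if len(self.tail.doc_ids) == BLOCK_SIZE:  # tail full: chain a new block
            self.tail.next = Block()
            self.tail = self.tail.next
        self.tail.doc_ids.append(doc_id)

    def __iter__(self):
        block = self.head
        while block:
            yield from block.doc_ids
            block = block.next

plist = PostingsList()
for d in [1, 2, 4, 11, 31, 45]:
    plist.append(d)
print(list(plist))  # -> [1, 2, 4, 11, 31, 45]
```

This trades a little pointer overhead (one link per block instead of per posting) for cheap insertion at the tail, the size/ease-of-insertion tradeoff mentioned above.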