
Introduction to Information Retrieval
Course Outline

Course No: IS418
Course Title: Information Storage and Retrieval
Hrs/week: 2 Lect, 2 Lab
Year: 2023/2024 – 4th Year
Semester: First
Exam Hours: 2
Assessment Methods (assessment weights):

• Midterm Exam: 20%
• Oral Examination & Lab: 10%
• Practical Examination: 10%
• Final-term Examination: 60%
• Total: 100%
Course Resources
• Textbook:
  – Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, “An Introduction to Information Retrieval”, Cambridge University Press, Cambridge, England, 2009

• Additional Materials:
– Lecture Slides.

Course Content

• Chap. 1: Introducing Information Retrieval and Web Search

• Chap. 2: The term vocabulary and postings lists

• Chap. 3: Dictionaries and tolerant retrieval

• Chap. 6: Scoring, term weighting and the vector space model

• Chap. 8: Evaluation in information retrieval

Introduction to Information Retrieval
Introducing Information Retrieval and Web Search
Introduction

[Figure: Google and the Web]
Basic assumptions of Information Retrieval (Sec. 1.1)

• Collection: a set of documents over which we perform retrieval
  – Sometimes referred to as a corpus
  – Assume it is a static collection for the moment
• Information need: the topic about which the user desires to know more; it is differentiated from a query
• Query: what the user conveys to the computer in an attempt to communicate the information need
• Relevance: a document is relevant if the user perceives it as containing information of value with respect to their personal information need
The problem ???
• Goal = find documents relevant to the user’s information need from a large document set

[Diagram: information need → query → IR system → answer list, drawing on the document collection]
Information Retrieval
• Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
  – These days we frequently think first of web search, but there are many other cases:
    • E-mail search
    • Searching your laptop
    • Legal information retrieval
Possible Approaches of Information Retrieval (Sec. 1.1)
• Grep: the simplest form of document retrieval is to have the computer do this sort of linear scan through the documents (this is called grepping through text)
  – grep is a Unix command which performs this process
• String matching (a linear search through the documents)
  – Slow
  – Difficult to improve
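
As a rough illustration of grepping, the following minimal sketch does a linear scan over the raw text of every document; the toy collection and query are made up for this example.

# Minimal sketch of grep-style retrieval: a linear scan over every document.
# The toy collection below is purely illustrative.
docs = {
    1: "Brutus and Caesar met in the Capitol",
    2: "Calpurnia warned Caesar about the Ides of March",
    3: "Antony spoke after Brutus",
}

def grep(query, docs):
    """Return the IDs of documents whose raw text contains the query string."""
    return [doc_id for doc_id, text in docs.items() if query in text]

print(grep("Caesar", docs))   # [1, 2]

Every query rescans the full text of every document, which is why this approach is slow for large collections and hard to improve.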

Main issues in IR

• Query evaluation (or retrieval process)
  – To what extent does a document correspond to a query?
• System evaluation
  – How good is a system?
  – Are the retrieved documents relevant? (precision)
  – Are all the relevant documents retrieved? (recall)
How good are the retrieved docs? (Sec. 1.1)


▪ Effectiveness: the quality of an IR system’s search results
▪ A user usually wants to know two key statistics about the system’s results for a query:
▪ Precision: the fraction of retrieved docs that are relevant to the user’s information need
▪ Recall: the fraction of relevant docs in the collection that are retrieved
▪ More precise definitions and measurements to follow later
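
Concretely, both statistics can be computed from the set of retrieved documents and the set of relevant documents for a query; the doc-ID sets below are hypothetical.

# Precision and recall for one query, given retrieved and relevant doc IDs (made-up sets).
retrieved = {1, 2, 5, 7}
relevant = {2, 5, 8, 9, 11}

hits = retrieved & relevant                 # relevant documents that were actually retrieved
precision = len(hits) / len(retrieved)      # 2 / 4 = 0.5
recall = len(hits) / len(relevant)          # 2 / 5 = 0.4
print(precision, recall)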

Introduction to Information Retrieval
Structured vs. Unstructured Data
IR vs. databases: Structured vs. unstructured data
• Structured data tends to refer to information in “tables”

  Employee    Manager    Salary
  Smith       Jones      50000
  Chang       Smith      60000
  Ivy         Smith      50000

• Typically allows numerical range and exact match (for text) queries, e.g.,
  Salary < 60000 AND Manager = Smith.
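
For contrast with free-text retrieval, the structured query above can be answered with an exact-match and range filter over records; this sketch simply re-expresses the slide’s example in code.

# The employee table from the slide, and the query "Salary < 60000 AND Manager = Smith".
employees = [
    {"employee": "Smith", "manager": "Jones", "salary": 50000},
    {"employee": "Chang", "manager": "Smith", "salary": 60000},
    {"employee": "Ivy",   "manager": "Smith", "salary": 50000},
]

matches = [e["employee"] for e in employees
           if e["salary"] < 60000 and e["manager"] == "Smith"]
print(matches)   # ['Ivy']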
Unstructured data
• Typically refers to free text
• Allows
  – Keyword queries including operators
  – More sophisticated “concept” queries, e.g.,
    • find all web pages dealing with drug abuse
• Classic model for searching text documents
Semi-structured data
• In fact, almost no data is “unstructured”
• E.g., this slide has distinctly identified zones such as the Title and Bullets
• … to say nothing of linguistic structure
• IR is also used to facilitate “semi-structured” search such as
  – Finding a document where the
    Title contains data AND Bullets contain search
    Title contains Java AND Body contains threading
• Or even
  – Title is about Object Oriented Programming AND Author something like stro*rup
  – where * is the wild-card operator (a sketch of wild-card matching follows)

Unstructured (text) vs. structured (database) data in 1996 [chart not reproduced]
Unstructured (text) vs. structured (database) data in 2009 [chart not reproduced]
Unstructured (text) vs. structured (database) data today [chart not reproduced]
Introduction to Information Retrieval
Term-document incidence matrices
An example information retrieval problem (Sec. 1.1)

• Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
• One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia
• Why is that not the answer?
  – Slow (for large corpora)
  – NOT Calpurnia is non-trivial
  – Other operations (e.g., find the word Romans near countrymen) are not feasible
  – Ranked retrieval (best documents to return) is not supported
An example information retrieval solution (Sec. 1.1)

• The way to avoid linearly scanning the texts for each query is to INDEX the documents in advance
• The index lets us introduce the basics of the Boolean retrieval model
Boolean retrieval model (Sec. 1.1)


• Suppose we record for each document (a play of Shakespeare’s) whether it contains each word out of all the words Shakespeare used (Shakespeare used about 32,000 different words)
• Boolean retrieval model: a model for information retrieval in which we can pose any query in the form of a Boolean expression of terms, i.e., terms combined with the operators AND, OR, and NOT
  – The model views each document as a set of words
• The result is a binary term-document “incidence matrix”
• Terms are the indexed units
  – Terms are usually words
  – Some terms are phrases
Term-document incidence matrices (Sec. 1.1)

[Term-document incidence matrix over Shakespeare’s plays: entry is 1 if the play contains the word, 0 otherwise. Which plays match Brutus AND Caesar BUT NOT Calpurnia?]
Incidence vectors (Sec. 1.1)
• So we have a 0/1 vector for each term.
• To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complement the last), then bitwise AND them.
  – 110100 AND 110111 AND 101111 = 100100
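
As a quick check of the bit arithmetic, the three operands from the slide (with the Calpurnia vector already complemented) can be ANDed directly:

# AND together the three 6-bit vectors from the slide, one bit per play.
vectors = ["110100", "110111", "101111"]   # Brutus, Caesar, complemented Calpurnia

result = int(vectors[0], 2)
for v in vectors[1:]:
    result &= int(v, 2)
print(format(result, "06b"))   # 100100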

Answers to query (Sec. 1.1)

• Antony and Cleopatra, Act III, Scene ii


Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.

• Hamlet, Act III, Scene ii


Lord Polonius: I did enact Julius Caesar I was killed i’ the
Capitol; Brutus killed me.

Bigger collections (Sec. 1.1)
• Consider a corpus with N = 1 million documents
• Each document has about 1000 words
• Average 6 bytes/word, including spaces and punctuation
• Size of corpus = 1 million x 1000 x 6 bytes ≈ 6 GB
• Number of distinct terms: M = 500,000 distinct terms among these documents
• Number of cells in the term-document matrix = 1 million x 500,000 = 0.5 trillion (too much for memory)
• Can we cut down on the space?
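
The back-of-the-envelope numbers on this slide can be reproduced directly:

# Back-of-the-envelope sizes from the slide.
N = 1_000_000            # documents
words_per_doc = 1_000
bytes_per_word = 6       # including spaces/punctuation
M = 500_000              # distinct terms

corpus_bytes = N * words_per_doc * bytes_per_word
matrix_cells = N * M
print(corpus_bytes / 10**9)   # 6.0 -> about 6 GB
print(matrix_cells)           # 500000000000 -> half a trillion cells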
Can’t build the term-document matrix (Sec. 1.1)


• A 500K x 1M matrix has half a trillion 0’s and 1’s (at most 0.2% of the cells can have a 1).
• Too many to fit in a computer’s memory.
• But it has no more than one billion 1’s. Why?
  – The collection has only 1 million x 1000 = 1 billion word occurrences in total.
  – The matrix is extremely sparse.
  – A minimum of 99.8% of the cells are zero.
• What’s a better representation?
  – Record only the things that do occur.
  – We only record the 1 positions.
Introduction to Information Retrieval
The Inverted Index: the key data structure underlying modern IR
Inverted index (Sec. 1.2)
• It is sometimes called an inverted file
• It keeps a dictionary of terms (sometimes referred to as a vocabulary or lexicon)
  – We use “dictionary” for the data structure and “vocabulary” for the set of terms
• Postings list (inverted list): a list that records which documents the term occurs in
• All the postings lists taken together are referred to as the postings
• Posting: each item in the list, which records that a term appeared in a document (possibly with its position in the document)
• The dictionary is sorted alphabetically and each postings list is sorted by document ID
Inverted index (Sec. 1.2)
• For each term t, we must store a list of all documents that contain t.
  – Identify each doc by a docID, a unique serial number known as the document identifier
• Can we use fixed-size arrays for this?

  Brutus    → 1  2  4  11  31  45  173  174
  Caesar    → 1  2  4  5   6   16  57   132
  Calpurnia → 2  31 54 101

• What happens if the word Caesar is added to document 14?


Inverted index (Sec. 1.2)
• We need variable-size postings lists
  – On disk, a continuous run of postings is normal and best
  – In memory, can use linked lists or variable-length arrays
    • Some tradeoffs in size / ease of insertion (a short sketch of in-memory insertion follows)

  Dictionary        Postings (sorted by docID; more on this later)
  Brutus    → 1  2  4  11  31  45  173  174
  Caesar    → 1  2  4  5   6   16  57   132
  Calpurnia → 2  31 54 101
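
For the in-memory variable-length-array option, a Python list can serve as a sketch; inserting a new docID (say Caesar appearing in document 14, the case raised two slides back) keeps the postings sorted. The code is illustrative, not the book’s implementation.

import bisect

# In-memory postings as variable-length arrays (Python lists), kept sorted by docID.
postings = {
    "Brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}

# Suppose the word Caesar is added to document 14: insert 14 in sorted order.
bisect.insort(postings["Caesar"], 14)
print(postings["Caesar"])   # [1, 2, 4, 5, 6, 14, 16, 57, 132]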
Inverted index construction (Sec. 1.2)


1. Collect the documents to be indexed
2. Tokenize the text, turning each document into a list of tokens
3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms
4. Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings
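
Putting the four steps together, a bare-bones indexer might look like the sketch below; the tokenization and normalization are deliberately naive, and the two example documents are the ones used on the later “Indexer steps” slides.

import re
from collections import defaultdict

# Toy collection: docID -> raw text (taken from the "Indexer steps" slides).
docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    """Step 2: cut the character sequence into word tokens (very naive)."""
    return re.findall(r"[a-z']+", text.lower())

def normalize(tokens):
    """Step 3: linguistic preprocessing; here only lowercasing (already done by
    tokenize), so this is a placeholder for stemming, stop-word removal, etc."""
    return tokens

def build_index(docs):
    """Step 4: map each term to the sorted list of docIDs it occurs in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in normalize(tokenize(text)):
            index[term].add(doc_id)
    # Dictionary sorted alphabetically, each postings list sorted by docID.
    return {term: sorted(ids) for term, ids in sorted(index.items())}

index = build_index(docs)
print(index["caesar"])   # [1, 2]
print(index["brutus"])   # [1, 2]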
Inverted index construction (Sec. 1.2)


Documents to be indexed:  Friends, Romans, countrymen.
        ↓ Tokenizer
Token stream:             Friends  Romans  Countrymen
        ↓ Linguistic modules
Modified tokens:          friend  roman  countryman
        ↓ Indexer
Inverted index:           friend → 2, 4
                          roman → 1, 2
                          countryman → 13, 16
Initial stages of text processing
• Tokenization
– Cut character sequence into word tokens
• Deal with “John’s”, a state-of-the-art solution
• Normalization
– Map text and query term to same form
• You want U.S.A. and USA to match
• Stemming
– We may wish different forms of a root to match
• authorize, authorization
• Stop words
– We may omit very common words (or not)
• the, a, to, of
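
A very rough sketch of these stages is shown below; the lowercasing, acronym handling, suffix stripping, and stop-word list are all toy choices standing in for the real techniques discussed later, not the book’s algorithms.

# Illustrative text-processing stages; the stop list and suffix rules are toy choices.
STOP_WORDS = {"the", "a", "to", "of"}

def tokenize(text):
    return text.split()

def normalize(token):
    # Lowercase and drop periods so "U.S.A." and "USA" map to the same form.
    return token.lower().replace(".", "").strip(",;:'\"")

def stem(token):
    # Extremely crude stand-in for a real stemmer (e.g., Porter).
    for suffix in ("ization", "ation", "ize", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    terms = [stem(normalize(t)) for t in tokenize(text)]
    return [t for t in terms if t and t not in STOP_WORDS]

print(preprocess("You want U.S.A. and USA to match"))   # both acronyms become 'usa'
print(preprocess("authorize authorization"))            # ['author', 'author']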
Indexer steps: Token sequence (Sec. 1.2)


• Sequence of (Modified token, Document ID) pairs.

  Doc 1: I did enact Julius Caesar I was killed i’ the Capitol; Brutus killed me.
  Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Indexer steps: Sort (Sec. 1.2)


• Sort by terms
  – and then by docID
This is the core indexing step.


Indexer steps: Dictionary & Postings (Sec. 1.2)

• Multiple term entries in a single document are merged.
• Split into Dictionary and Postings.
• Doc. frequency information is added.
• Doc. frequency: the number of documents which contain each term (which is the length of each postings list).

Why frequency? Will discuss later.
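
A compact sketch of these steps over the sorted (term, docID) pairs: duplicate entries within a document are merged, and each dictionary entry keeps the document frequency next to its postings list. The handful of pairs shown is a subset of those produced from Doc 1 and Doc 2 above.

from itertools import groupby

# A few sorted (term, docID) pairs from Doc 1 and Doc 2.
pairs = [
    ("ambitious", 2), ("brutus", 1), ("brutus", 2), ("caesar", 1),
    ("caesar", 2), ("caesar", 2), ("killed", 1), ("killed", 1),
]

dictionary = {}   # term -> document frequency
postings = {}     # term -> sorted list of docIDs (duplicates within a doc merged)
for term, group in groupby(pairs, key=lambda p: p[0]):
    doc_ids = sorted({doc_id for _, doc_id in group})
    dictionary[term] = len(doc_ids)      # doc. frequency = length of the postings list
    postings[term] = doc_ids

print(dictionary["caesar"], postings["caesar"])   # 2 [1, 2]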
Where do we pay in storage? (Sec. 1.2)

[Figure: the dictionary stores terms and counts; pointers lead to the postings, which are lists of docIDs]
• IR system implementation questions:
  – How do we index efficiently?
  – How much storage do we need?
What data structure should be used for postings lists? (Sec. 1.2)

• A fixed-length array would be wasteful: some words occur in many documents, others in very few.
• Two good alternatives are linked lists or variable-length arrays.
• We can use a hybrid scheme, with a linked list of fixed-length arrays for each term (a sketch follows).
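
A minimal in-memory sketch of such a hybrid, assuming a hypothetical block size and class names: each term owns a singly linked list whose nodes are fixed-capacity arrays of docIDs.

# Hypothetical sketch of a hybrid postings list: a linked list of fixed-length
# blocks (arrays) of docIDs. The block size and names are illustrative.
BLOCK_SIZE = 4

class Block:
    def __init__(self):
        self.doc_ids = []     # holds at most BLOCK_SIZE docIDs
        self.next = None      # link to the next block

class PostingsList:
    def __init__(self):
        self.head = self.tail = Block()

    def append(self, doc_id):
        """Append a docID (assumed to arrive in increasing order)."""
        if len(self.tail.doc_ids) == BLOCK_SIZE:
            new_block = Block()
            self.tail.next = new_block
            self.tail = new_block
        self.tail.doc_ids.append(doc_id)

    def __iter__(self):
        block = self.head
        while block is not None:
            yield from block.doc_ids
            block = block.next

plist = PostingsList()
for doc_id in [1, 2, 4, 11, 31, 45, 173, 174]:   # Brutus's postings from the earlier slide
    plist.append(doc_id)
print(list(plist))   # [1, 2, 4, 11, 31, 45, 173, 174]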
