
Introduction to Information Retrieval
Course Outline

Course No: IS418
Course Title: Information Storage and Retrieval
Hrs/week: 2 Lect, 2 Lab
Year: 2023/2024 – 4th Year
Semester: First
Exam Hours: 2
Assessment Methods (assessment weights):

• Midterm Exam: 20%
• Oral Examination & Lab: 10%
• Practical Examination: 10%
• Final-term Examination: 60%
• Total: 100%
Course Resources
• Textbook:
  – Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, “An Introduction to Information Retrieval”, Cambridge University Press, Cambridge, England, 2009

• Additional Materials:
– Lecture Slides.

Course Content

• Chap. 1: Introducing Information Retrieval and Web Search

• Chap. 2: The term vocabulary and postings lists

• Chap. 3: Dictionaries and tolerant retrieval

• Chap. 6: Scoring, term weighting and the vector space model

• Chap. 8: Evaluation in information retrieval

Introduction to Information Retrieval
Introducing Information Retrieval and Web Search
Introduction

[Figure: Google and the Web]
Basic assumptions of Information Retrieval (Sec. 1.1)

• Collection: a set of documents over which we perform retrieval
  – Sometimes referred to as a corpus
  – Assume it is a static collection for the moment
• Information need: the topic about which the user desires to know more; it is differentiated from a query
• Query: what the user conveys to the computer in an attempt to communicate the information need
• Relevance: a document is relevant if the user perceives it as containing information of value with respect to their personal information need
The problem ???
• Goal = find documents relevant to the user’s information need from a large document set

[Diagram: information need → query → IR system → answer list, drawing on the document collection]
Information Retrieval
• Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
  – These days we frequently think first of web search, but there are many other cases:
    • E-mail search
    • Searching your laptop
    • Legal information retrieval
Possible Approaches of Information Retrieval (Sec. 1.1)
• Grep: the simplest form of document retrieval is to have the computer do this sort of linear scan through the documents (this is called grepping through text)
  – grep is a Unix command which performs this process
• String matching (a linear search through the documents)
  – Slow
  – Difficult to improve
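
As a rough illustration of grepping, the following minimal sketch does a linear scan over the raw text of every document; the toy collection and query are made up for this example.

# Minimal sketch of grep-style retrieval: a linear scan over every document.
# The toy collection below is purely illustrative.
docs = {
    1: "Brutus and Caesar met in the Capitol",
    2: "Calpurnia warned Caesar about the Ides of March",
    3: "Antony spoke after Brutus",
}

def grep(query, docs):
    """Return the IDs of documents whose raw text contains the query string."""
    return [doc_id for doc_id, text in docs.items() if query in text]

print(grep("Caesar", docs))   # [1, 2]

Every query rescans the full text of every document, which is why this approach is slow for large collections and hard to improve.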

Main issues in IR

• Query evaluation (or retrieval process)
  – To what extent does a document correspond to a query?
• System evaluation
  – How good is a system?
  – Are the retrieved documents relevant? (precision)
  – Are all the relevant documents retrieved? (recall)
How good are the retrieved docs? (Sec. 1.1)


▪ Effectiveness: the quality of an IR system’s search results
▪ A user usually wants to know two key statistics about the system’s results for a query:
▪ Precision: the fraction of retrieved docs that are relevant to the user’s information need
▪ Recall: the fraction of relevant docs in the collection that are retrieved
▪ More precise definitions and measurements to follow later
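
Concretely, both statistics can be computed from the set of retrieved documents and the set of relevant documents for a query; the doc-ID sets below are hypothetical.

# Precision and recall for one query, given retrieved and relevant doc IDs (made-up sets).
retrieved = {1, 2, 5, 7}
relevant = {2, 5, 8, 9, 11}

hits = retrieved & relevant                 # relevant documents that were actually retrieved
precision = len(hits) / len(retrieved)      # 2 / 4 = 0.5
recall = len(hits) / len(relevant)          # 2 / 5 = 0.4
print(precision, recall)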

Introduction to Information Retrieval
Structured vs. Unstructured Data
IR vs. databases: Structured vs. unstructured data
• Structured data tends to refer to information in “tables”

  Employee    Manager    Salary
  Smith       Jones      50000
  Chang       Smith      60000
  Ivy         Smith      50000

• Typically allows numerical range and exact match (for text) queries, e.g.,
  Salary < 60000 AND Manager = Smith.
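
For contrast with free-text retrieval, the structured query above can be answered with an exact-match and range filter over records; this sketch simply re-expresses the slide’s example in code.

# The employee table from the slide, and the query "Salary < 60000 AND Manager = Smith".
employees = [
    {"employee": "Smith", "manager": "Jones", "salary": 50000},
    {"employee": "Chang", "manager": "Smith", "salary": 60000},
    {"employee": "Ivy",   "manager": "Smith", "salary": 50000},
]

matches = [e["employee"] for e in employees
           if e["salary"] < 60000 and e["manager"] == "Smith"]
print(matches)   # ['Ivy']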
Unstructured data
• Typically refers to free text
• Allows
  – Keyword queries including operators
  – More sophisticated “concept” queries, e.g.,
    • find all web pages dealing with drug abuse
• Classic model for searching text documents
Semi-structured data
• In fact, almost no data is “unstructured”
• E.g., this slide has distinctly identified zones such as the Title and Bullets
• … to say nothing of linguistic structure
• IR is also used to facilitate “semi-structured” search such as
  – Finding a document where the
    Title contains data AND Bullets contain search
    Title contains Java AND Body contains threading
• Or even
  – Title is about Object Oriented Programming AND Author something like stro*rup
  – where * is the wild-card operator (a sketch of wild-card matching follows)

Unstructured (text) vs. structured (database) data in 1996 [chart not reproduced]
Unstructured (text) vs. structured (database) data in 2009 [chart not reproduced]
Unstructured (text) vs. structured (database) data today [chart not reproduced]
Introduction to Information Retrieval
Term-document incidence matrices
An example information retrieval problem (Sec. 1.1)

• Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
• One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia
• Why is that not the answer?
  – Slow (for large corpora)
  – NOT Calpurnia is non-trivial
  – Other operations (e.g., find the word Romans near countrymen) are not feasible
  – Ranked retrieval (best documents to return) is not supported
An example information retrieval solution (Sec. 1.1)

• The way to avoid linearly scanning the texts for each query is to INDEX the documents in advance
• The index lets us introduce the basics of the Boolean retrieval model
Boolean retrieval model (Sec. 1.1)


• Suppose we record for each document (a play of Shakespeare’s) whether it contains each word out of all the words Shakespeare used (Shakespeare used about 32,000 different words)
• Boolean retrieval model: a model for information retrieval in which we can pose any query in the form of a Boolean expression of terms, i.e., terms combined with the operators AND, OR, and NOT
  – The model views each document as a set of words
• The result is a binary term-document “incidence matrix”
• Terms are the indexed units
  – Terms are usually words
  – Some terms are phrases
Term-document incidence matrices (Sec. 1.1)

[Term-document incidence matrix over Shakespeare’s plays: entry is 1 if the play contains the word, 0 otherwise. Which plays match Brutus AND Caesar BUT NOT Calpurnia?]
Incidence vectors (Sec. 1.1)
• So we have a 0/1 vector for each term.
• To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complement the last), then bitwise AND them.
  – 110100 AND 110111 AND 101111 = 100100
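
As a quick check of the bit arithmetic, the three operands from the slide (with the Calpurnia vector already complemented) can be ANDed directly:

# AND together the three 6-bit vectors from the slide, one bit per play.
vectors = ["110100", "110111", "101111"]   # Brutus, Caesar, complemented Calpurnia

result = int(vectors[0], 2)
for v in vectors[1:]:
    result &= int(v, 2)
print(format(result, "06b"))   # 100100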

Answers to query (Sec. 1.1)

• Antony and Cleopatra, Act III, Scene ii


Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.

• Hamlet, Act III, Scene ii


Lord Polonius: I did enact Julius Caesar I was killed i’ the
Capitol; Brutus killed me.

Bigger collections (Sec. 1.1)
• Consider a corpus with N = 1 million documents
• Each document has about 1000 words
• Average 6 bytes/word, including spaces and punctuation
• Size of corpus = 1 million x 1000 x 6 bytes ≈ 6 GB
• Number of distinct terms: M = 500,000 distinct terms among these documents
• Number of cells in the term-document matrix = 1 million x 500,000 = 0.5 trillion (too much for memory)
• Can we cut down on the space?
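
The back-of-the-envelope numbers on this slide can be reproduced directly:

# Back-of-the-envelope sizes from the slide.
N = 1_000_000            # documents
words_per_doc = 1_000
bytes_per_word = 6       # including spaces/punctuation
M = 500_000              # distinct terms

corpus_bytes = N * words_per_doc * bytes_per_word
matrix_cells = N * M
print(corpus_bytes / 10**9)   # 6.0 -> about 6 GB
print(matrix_cells)           # 500000000000 -> half a trillion cells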
Can’t build the term-document matrix (Sec. 1.1)


• A 500K x 1M matrix has half a trillion 0’s and 1’s (at most 0.2% of the cells can have a 1).
• Too many to fit in a computer’s memory.
• But it has no more than one billion 1’s. Why?
  – The collection has only 1 million x 1000 = 1 billion word occurrences in total.
  – The matrix is extremely sparse.
  – A minimum of 99.8% of the cells are zero.
• What’s a better representation?
  – Record only the things that do occur.
  – We only record the 1 positions.
Introduction to Information Retrieval
The Inverted Index: the key data structure underlying modern IR
Inverted index (Sec. 1.2)
• It is sometimes called an inverted file
• It keeps a dictionary of terms (sometimes referred to as a vocabulary or lexicon)
  – We use “dictionary” for the data structure and “vocabulary” for the set of terms
• Postings list (inverted list): a list that records which documents the term occurs in
• All the postings lists taken together are referred to as the postings
• Posting: each item in the list, which records that a term appeared in a document (possibly with its position in the document)
• The dictionary is sorted alphabetically and each postings list is sorted by document ID
Inverted index (Sec. 1.2)
• For each term t, we must store a list of all documents that contain t.
  – Identify each doc by a docID, a unique serial number known as the document identifier
• Can we use fixed-size arrays for this?

  Brutus    → 1  2  4  11  31  45  173  174
  Caesar    → 1  2  4  5   6   16  57   132
  Calpurnia → 2  31 54 101

• What happens if the word Caesar is added to document 14?


Inverted index (Sec. 1.2)
• We need variable-size postings lists
  – On disk, a continuous run of postings is normal and best
  – In memory, can use linked lists or variable-length arrays
    • Some tradeoffs in size / ease of insertion (a short sketch of in-memory insertion follows)

  Dictionary        Postings (sorted by docID; more on this later)
  Brutus    → 1  2  4  11  31  45  173  174
  Caesar    → 1  2  4  5   6   16  57   132
  Calpurnia → 2  31 54 101
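
For the in-memory variable-length-array option, a Python list can serve as a sketch; inserting a new docID (say Caesar appearing in document 14, the case raised two slides back) keeps the postings sorted. The code is illustrative, not the book’s implementation.

import bisect

# In-memory postings as variable-length arrays (Python lists), kept sorted by docID.
postings = {
    "Brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}

# Suppose the word Caesar is added to document 14: insert 14 in sorted order.
bisect.insort(postings["Caesar"], 14)
print(postings["Caesar"])   # [1, 2, 4, 5, 6, 14, 16, 57, 132]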
Inverted index construction (Sec. 1.2)


1. Collect the documents to be indexed
2. Tokenize the text, turning each document into a list of tokens
3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms
4. Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings
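
Putting the four steps together, a bare-bones indexer might look like the sketch below; the tokenization and normalization are deliberately naive, and the two example documents are the ones used on the later “Indexer steps” slides.

import re
from collections import defaultdict

# Toy collection: docID -> raw text (taken from the "Indexer steps" slides).
docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    """Step 2: cut the character sequence into word tokens (very naive)."""
    return re.findall(r"[a-z']+", text.lower())

def normalize(tokens):
    """Step 3: linguistic preprocessing; here only lowercasing (already done by
    tokenize), so this is a placeholder for stemming, stop-word removal, etc."""
    return tokens

def build_index(docs):
    """Step 4: map each term to the sorted list of docIDs it occurs in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in normalize(tokenize(text)):
            index[term].add(doc_id)
    # Dictionary sorted alphabetically, each postings list sorted by docID.
    return {term: sorted(ids) for term, ids in sorted(index.items())}

index = build_index(docs)
print(index["caesar"])   # [1, 2]
print(index["brutus"])   # [1, 2]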
Inverted index construction (Sec. 1.2)


Documents to be indexed:  Friends, Romans, countrymen.
        ↓ Tokenizer
Token stream:             Friends  Romans  Countrymen
        ↓ Linguistic modules
Modified tokens:          friend  roman  countryman
        ↓ Indexer
Inverted index:           friend → 2, 4
                          roman → 1, 2
                          countryman → 13, 16
Initial stages of text processing
• Tokenization
– Cut character sequence into word tokens
• Deal with “John’s”, a state-of-the-art solution
• Normalization
– Map text and query term to same form
• You want U.S.A. and USA to match
• Stemming
– We may wish different forms of a root to match
• authorize, authorization
• Stop words
– We may omit very common words (or not)
• the, a, to, of
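
A very rough sketch of these stages is shown below; the lowercasing, acronym handling, suffix stripping, and stop-word list are all toy choices standing in for the real techniques discussed later, not the book’s algorithms.

# Illustrative text-processing stages; the stop list and suffix rules are toy choices.
STOP_WORDS = {"the", "a", "to", "of"}

def tokenize(text):
    return text.split()

def normalize(token):
    # Lowercase and drop periods so "U.S.A." and "USA" map to the same form.
    return token.lower().replace(".", "").strip(",;:'\"")

def stem(token):
    # Extremely crude stand-in for a real stemmer (e.g., Porter).
    for suffix in ("ization", "ation", "ize", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    terms = [stem(normalize(t)) for t in tokenize(text)]
    return [t for t in terms if t and t not in STOP_WORDS]

print(preprocess("You want U.S.A. and USA to match"))   # both acronyms become 'usa'
print(preprocess("authorize authorization"))            # ['author', 'author']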
Indexer steps: Token sequence (Sec. 1.2)


• Sequence of (Modified token, Document ID) pairs.

  Doc 1: I did enact Julius Caesar I was killed i’ the Capitol; Brutus killed me.
  Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Indexer steps: Sort (Sec. 1.2)


• Sort by terms
  – and then by docID
This is the core indexing step.


Indexer steps: Dictionary & Postings (Sec. 1.2)

• Multiple term entries in a single document are merged.
• Split into Dictionary and Postings.
• Doc. frequency information is added.
• Doc. frequency: the number of documents which contain each term (which is the length of each postings list).

Why frequency? Will discuss later.
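
A compact sketch of these steps over the sorted (term, docID) pairs: duplicate entries within a document are merged, and each dictionary entry keeps the document frequency next to its postings list. The handful of pairs shown is a subset of those produced from Doc 1 and Doc 2 above.

from itertools import groupby

# A few sorted (term, docID) pairs from Doc 1 and Doc 2.
pairs = [
    ("ambitious", 2), ("brutus", 1), ("brutus", 2), ("caesar", 1),
    ("caesar", 2), ("caesar", 2), ("killed", 1), ("killed", 1),
]

dictionary = {}   # term -> document frequency
postings = {}     # term -> sorted list of docIDs (duplicates within a doc merged)
for term, group in groupby(pairs, key=lambda p: p[0]):
    doc_ids = sorted({doc_id for _, doc_id in group})
    dictionary[term] = len(doc_ids)      # doc. frequency = length of the postings list
    postings[term] = doc_ids

print(dictionary["caesar"], postings["caesar"])   # 2 [1, 2]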
Where do we pay in storage? (Sec. 1.2)

[Figure: the dictionary stores terms and counts; pointers lead to the postings, which are lists of docIDs]
• IR system implementation questions:
  – How do we index efficiently?
  – How much storage do we need?
What data structure should be used for postings lists? (Sec. 1.2)

• A fixed-length array would be wasteful: some words occur in many documents, others in very few.
• Two good alternatives are linked lists or variable-length arrays.
• We can use a hybrid scheme, with a linked list of fixed-length arrays for each term (a sketch follows).
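
A minimal in-memory sketch of such a hybrid, assuming a hypothetical block size and class names: each term owns a singly linked list whose nodes are fixed-capacity arrays of docIDs.

# Hypothetical sketch of a hybrid postings list: a linked list of fixed-length
# blocks (arrays) of docIDs. The block size and names are illustrative.
BLOCK_SIZE = 4

class Block:
    def __init__(self):
        self.doc_ids = []     # holds at most BLOCK_SIZE docIDs
        self.next = None      # link to the next block

class PostingsList:
    def __init__(self):
        self.head = self.tail = Block()

    def append(self, doc_id):
        """Append a docID (assumed to arrive in increasing order)."""
        if len(self.tail.doc_ids) == BLOCK_SIZE:
            new_block = Block()
            self.tail.next = new_block
            self.tail = new_block
        self.tail.doc_ids.append(doc_id)

    def __iter__(self):
        block = self.head
        while block is not None:
            yield from block.doc_ids
            block = block.next

plist = PostingsList()
for doc_id in [1, 2, 4, 11, 31, 45, 173, 174]:   # Brutus's postings from the earlier slide
    plist.append(doc_id)
print(list(plist))   # [1, 2, 4, 11, 31, 45, 173, 174]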
