L3L4 IRSW Boolean Retrieval

Uploaded by

Saurabh Mor

INFORMATION RETRIEVAL AND

SEMANTIC WEB
16B1NCI648

Lecture 3 and 4
CONTENTS TO BE COVERED

• Information retrieval Models

• Term Document Matrix
• Boolean retrieval incidence model
• Inverted Index Creation
THE INFORMATION RETRIEVAL
CYCLE
[Figure: the information retrieval cycle. Stages, each followed by what it produces:
Source Selection (Resource), Query Formulation (Query), Search (Ranked List),
Selection (Documents), Examination (Documents), Delivery.
Feedback loops: system discovery, vocabulary discovery, concept discovery,
document discovery, source reselection.]
WHAT IS A MODEL?

• A model is a construct designed to help us understand a complex system
• A particular way of “looking at things”

• Models inevitably make simplifying assumptions

• What are the limitations of the model?

• Different types of models:

• Conceptual models
• Physical analog models
• Mathematical models
• …
THE CENTRAL PROBLEM IN IR
[Figure: two parallel chains.
Information Seeker: Concepts, expressed as Query Terms
Authors: Concepts, expressed as Document Terms]

Do these represent the same concepts?

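THE IR BLACK BOX

[Figure: the Query and the Documents each pass through a Representation Function.
The query side yields a Query Representation; the document side yields a
Document Representation, stored in the Index. A Comparison Function matches the
Query Representation against the Index and returns Hits.]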
TYPES OF MODELS

• Boolean model
• Based on the notion of sets
• Documents are retrieved only if they satisfy Boolean
conditions specified in the query
• Does not impose a ranking on retrieved documents
• Exact match
TYPES OF MODELS

• Vector space model

• Based on geometry, the notion of vectors in high
dimensional space
• Documents are ranked based on their similarity to the
query
• Best/partial match
MODELS (CONTD..)

• Probabilistic Model (Language model)

• Based on the notion of probabilities and processes
for generating text
• Documents are ranked based on the probability that
they generated the query
• Best/partial match
REPRESENTING TEXT
[Figure: the IR black box. Query and Documents each pass through a
Representation Function, giving a Query Representation and a Document
Representation (Index); a Comparison Function produces the Hits.]
HOW DO WE REPRESENT TEXT?

• How do we represent the complexities of language?

• Keeping in mind that computers don’t “understand” documents or
queries

• Simple, yet effective approach: “bag of words”

• Treat all the words in a document as index terms for that document
• Disregard order, structure, meaning, etc. of the words
SIMPLE INFORMATION RETRIEVAL
SYSTEM
• Let's consider a simple example: a collection of 5
documents with the following contents
• d1: IIIT ALLAHABAD
• d2: IIIT DELHI
• d3: IIIT GUWAHATI
• d4: IIIT KANCHIPURAM
• d5: IIIT SRI CITY
• Query is
• IIIT SRI CITY
• Which document will you match and why?
APPROACH FOLLOWED (LINEAR
SCANNING)
• First match the term IIIT.
• Keep only the documents that contain this term.
• Next match the term Sri.
• Keep only the documents that contain this term.
• Next match the term City.
• Keep only the documents that contain this term.

• Three iterations!
• Quiz: Can we do better?
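
The linear-scanning procedure above can be sketched in Python. This is only an illustration: the names `docs` and `linear_scan` are hypothetical, and the lowercase, whitespace-based tokenization is an assumption, not something the slide specifies.

```python
# Toy collection from the slide: docID -> contents.
docs = {
    "d1": "IIIT ALLAHABAD",
    "d2": "IIIT DELHI",
    "d3": "IIIT GUWAHATI",
    "d4": "IIIT KANCHIPURAM",
    "d5": "IIIT SRI CITY",
}

def linear_scan(docs, query):
    """Return the IDs of the documents that contain every query term.

    Each query term triggers one pass over the surviving candidates,
    so the work grows with (number of terms) x (collection size).
    """
    candidates = list(docs)
    for term in query.lower().split():
        # Keep only the candidates whose text contains this term.
        candidates = [d for d in candidates
                      if term in docs[d].lower().split()]
    return candidates

result = linear_scan(docs, "IIIT SRI CITY")
print(result)
```

Every query re-reads the document text, which is the cost the later slides avoid by building an index once.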
ANOTHER EXAMPLE

• Which plays of Shakespeare contain the words
Brutus AND Caesar but NOT Calpurnia?
• The naive approach is a linear scan through the documents
• One could grep all the plays for Brutus
and Caesar, then strip out the plays containing
Calpurnia
ISSUES IN THE LINEAR
SCANNING

– Slow (for large corpora)

– NOT Calpurnia is non-trivial
– Other operations (e.g., find the word Romans near
countrymen) not feasible
– Ranked retrieval not possible
BOOLEAN RETRIEVAL SYSTEM

• Weights assigned to terms are either “0” or “1”

• “0” represents “absence”: term isn’t in the document
• “1” represents “presence”: term is in the document

• Build queries by combining terms with Boolean operators
• AND, OR, NOT

• The system returns all documents that satisfy the query
Why do we say that Boolean retrieval is “set-based”?
AND/OR/NOT
[Figure: Venn diagram of three sets A, B, and C drawn inside the set of all documents.]
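
One way to see why Boolean retrieval is called set-based: each term stands for the set of documents that contain it, and AND, OR, and NOT become intersection, union, and complement relative to the set of all documents. The sketch below uses Python's built-in `set`; the document IDs in `A`, `B`, and `C` are invented for illustration.

```python
# Universe of document IDs and three term sets (all values are made up).
all_docs = {1, 2, 3, 4, 5, 6}
A = {1, 2, 3}   # documents containing term A
B = {2, 3, 4}   # documents containing term B
C = {3, 5}      # documents containing term C

a_and_b = A & B                # AND -> intersection
a_or_b = A | B                 # OR  -> union
not_c = all_docs - C           # NOT -> complement within all documents
a_and_b_not_c = (A & B) - C    # A AND B AND NOT C

print(a_and_b, a_or_b, not_c, a_and_b_not_c)
```

NOT is only meaningful relative to the full collection, which is why the diagram draws the sets inside "All documents".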
REPRESENTING DOCUMENTS AS A TERM-DOCUMENT INCIDENCE
MATRIX

1 if document contains word, 0 otherwise

                     Documents
Terms            d1   d2   d3   d4   d5
IIIT              1    1    1    1    1
ALLAHABAD         1    0    0    0    0
DELHI             0    1    0    0    0
GUWAHATI          0    0    1    0    0
KANCHIPURAM       0    0    0    1    0
SRI               0    0    0    0    1
CITY              0    0    0    0    1

• Query is IIIT SRI CITY
• Answer: Document d5 is suitable
TERM-DOCUMENT INCIDENCE MATRICES

                                   Documents
Terms        Antony and   Julius   The       Hamlet   Othello   Macbeth
             Cleopatra    Caesar   Tempest
Antony           1           1        0         0        0         1
Brutus           1           1        0         1        0         0
Caesar           1           1        0         1        1         1
Calpurnia        0           1        0         0        0         0
Cleopatra        1           0        0         0        0         0
mercy            1           0        1         1        1         1
worser           1           0        1         1        1         0

“Brutus and Caesar and not Calpurnia”

What is the best way to get to the answer?
TERM-DOCUMENT INCIDENCE
MATRICES

                                   Documents
Terms        Antony and   Julius   The       Hamlet   Othello   Macbeth
             Cleopatra    Caesar   Tempest
Antony           1           1        0         0        0         1
Brutus           1           1        0         1        0         0
Caesar           1           1        0         1        1         1
Calpurnia        0           1        0         0        0         0
Cleopatra        1           0        0         0        0         0
mercy            1           0        1         1        1         1
worser           1           0        1         1        1         0

“Brutus and Caesar and not Calpurnia”

We take the vectors for Brutus and Caesar and the complement of the vector for Calpurnia:
110100 AND 110111 AND 101111 = 100100
The two 1 bits correspond to Antony and Cleopatra and Hamlet.
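
The bitwise computation above can be reproduced with Python integers. This is a sketch under one assumption that the slide leaves implicit: the leftmost bit corresponds to the first play in the column order. The variable names are hypothetical.

```python
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Rows of the incidence matrix, leftmost bit = first play.
brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000

# Python integers are unbounded, so the complement must be masked
# back to the six bits that correspond to actual plays.
mask = (1 << len(plays)) - 1
not_calpurnia = ~calpurnia & mask

answer = brutus & caesar & not_calpurnia

# Translate the 1 bits back into play titles.
matches = [play for i, play in enumerate(plays)
           if (answer >> (len(plays) - 1 - i)) & 1]

print(format(answer, "06b"), matches)
```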
BOOLEAN RETRIEVAL MODEL

Boolean Retrieval Model is the simplest model to build an IR system.

• Boolean queries use AND, OR and NOT to join query terms.

• The model views each document as a set of words.

• A document either matches or does not match. The results are not ranked.
BOOLEAN RETRIEVAL

• Advantages
• Results are predictable, relatively easy to explain
• Many different features can be incorporated
• Efficient processing since many documents can be
eliminated from search

• Disadvantages
• Effectiveness depends entirely on user
• Complex queries are difficult
QUESTION 1

D1: “Information Retrieval”

D2: “Information Theory”
D3: “Modern Information Retrieval: Theory and Practice”
D4: “Text Compression”

Query: ((text ∨ information)∧ retrieval ∧¬theory)

QUESTION 2

query :
(nuclear AND treaty) OR ((NOT treaty) AND (nonproliferation OR Iran))
EXTENDED BOOLEAN
RETRIEVAL MODEL
A REALISTIC EXAMPLE

• Consider N = 1 million documents.

• Number of distinct terms, T = 500,000
• Suppose we create the term-document incidence matrix
Total number of cells in matrix M = 500,000 * 1,000,000
= 0.5 * 10^12, i.e. approx. 500 GB at one byte per cell
Holding a matrix of this size in memory is infeasible.
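
The size estimate can be checked with a few lines of arithmetic. The storage cost per cell is an assumption here, since the slide does not say how a cell is stored: the 500 GB figure corresponds to one byte per cell, and packing each cell into a single bit would still need tens of gigabytes.

```python
N = 1_000_000   # documents
T = 500_000     # distinct terms

cells = N * T   # one cell per (term, document) pair

# Assumption 1: one byte per cell (matches the 500 GB figure).
gb_one_byte_per_cell = cells / 10**9

# Assumption 2: one bit per cell, eight cells packed into a byte.
gb_one_bit_per_cell = cells / 8 / 10**9

print(cells, gb_one_byte_per_cell, gb_one_bit_per_cell)
```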
BOOLEAN RETRIEVAL MODEL ISSUE:
CAN’T BUILD THE MATRIX

• In addition matrix M will have half-a-trillion 0’s and 1’s.

• Matrix will be extremely sparse.

What’s a better representation?

• Solution:
• We only record the 1 positions.
• This idea is central to the first major concept in information
retrieval, the inverted index.
INVERTED INDEX

• Here we maintain a dictionary of terms (also known as the lexicon)

• For each term t, we store a list of all documents that contain t.
Each entry in this list is known as a posting.
• Each document is identified by its document id.
• Each term has its own postings list.
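
A minimal sketch of this structure, using the five-document IIIT collection from the earlier example: the dictionary is a Python `dict` and each postings list is a list of document ids in increasing order. The function name `build_index`, the integer ids, and the lowercase whitespace tokenization are assumptions made for illustration.

```python
from collections import defaultdict

# Document id -> contents (integer ids are an assumption of this sketch).
docs = {
    1: "IIIT ALLAHABAD",
    2: "IIIT DELHI",
    3: "IIIT GUWAHATI",
    4: "IIIT KANCHIPURAM",
    5: "IIIT SRI CITY",
}

def build_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(list)
    # Visiting ids in increasing order keeps every postings list sorted.
    for doc_id in sorted(docs):
        # set() records each term once per document.
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)

index = build_index(docs)
print(index["iiit"], index["sri"], index["city"])
```

Only the 1 positions of the incidence matrix are stored: seven terms and eleven postings instead of 35 cells.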
INVERTED INDEX EXAMPLE
                                   Documents
Terms        Antony and   Julius   The       Hamlet   Othello   Macbeth
             Cleopatra    Caesar   Tempest
Antony           1           1        0         0        0         1
Brutus           1           1        0         1        0         0
Caesar           1           1        0         1        1         1
Calpurnia        0           1        0         0        0         0
Cleopatra        1           0        0         0        0         0
mercy            1           0        1         1        1         1
worser           1           0        1         1        1         0

Postings lists:

Brutus       1  2  4  11  31  45  173  174

Caesar       1  2  4  5  6  16  57  132

Calpurnia    2  31  54  101
INVERTED INDEX EXAMPLE

How do we maintain these postings lists in memory?

• Fixed-size array: wastes storage space when the list does not fill it.
• Linked list: requires additional space for pointers.
• Variable-size array: insertion is difficult.

Linked lists are preferred when postings are inserted dynamically.

Variable-size arrays are preferred when fast search matters.
INVERTED INDEX

• We need variable-size postings lists

• In memory, can use linked lists or variable length arrays
• Some tradeoffs in size/ease of insertion
Dictionary       Postings (sorted by docID)

Brutus           1  2  4  11  31  45  173  174

Caesar           1  2  4  5  6  16  57  132

Calpurnia        2  31  54  101

Each docID in a list is one posting.

INVERTED INDEX CONSTRUCTION

Documents to be indexed:   Friends, Romans, countrymen…..
      | Tokenizer
Token stream:              Friends  Romans  Countrymen
      | Linguistic modules
Modified tokens:           friend  roman  countryman
      | Indexer
Inverted index:            friend       2  4
                           roman        1  2
                           countryman   13  16
INDEXER STEPS: TOKEN SEQUENCE

• Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
INDEXER STEPS: SORT

• Sort by terms
• And then docID

Core indexing step

INDEXER STEPS: DICTIONARY &
POSTINGS

• Multiple term entries in a single document are merged.
• Split into Dictionary and Postings
• Doc. frequency information is added.
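
The three indexer steps (token sequence, sort, dictionary and postings) can be sketched on the two example documents. The `tokenize` helper is a hypothetical stand-in for the tokenizer and linguistic modules: it only lowercases and strips a few punctuation marks, which is an assumption of this sketch.

```python
from itertools import groupby

doc1 = ("I did enact Julius Caesar I was killed i' the Capitol; "
        "Brutus killed me.")
doc2 = ("So let it be with Caesar. The noble Brutus hath told you "
        "Caesar was ambitious")

def tokenize(text):
    # Crude normalization: lowercase and strip trailing punctuation.
    return [word.strip(".,;").lower() for word in text.split()]

# Step 1: sequence of (modified token, document ID) pairs.
pairs = [(t, 1) for t in tokenize(doc1)] + [(t, 2) for t in tokenize(doc2)]

# Step 2: sort by term, then by docID (tuples compare element-wise).
pairs.sort()

# Step 3: merge repeated entries, then split into postings and a
# dictionary that records each term's document frequency.
postings = {}
for term, group in groupby(pairs, key=lambda pair: pair[0]):
    postings[term] = sorted({doc_id for _, doc_id in group})
doc_freq = {term: len(ids) for term, ids in postings.items()}

print(postings["caesar"], doc_freq["caesar"], postings["killed"])
```

"killed" occurs twice in Doc 1 but contributes a single posting, which is the merging step described above.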
WHERE DO WE PAY IN STORAGE?

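[Figure: the index annotated with its three storage costs]
• Terms and counts (the dictionary)
• Pointers (from the dictionary to the postings)
• Lists of docIDs (the postings)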
PRACTICE QUESTION

Question I1: Consider these documents:

Doc 1 breakthrough drug for schizophrenia

Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia
Doc 4 new hopes for schizophrenia patients
a. Draw the term-document incidence matrix for this document collection.
b. Draw the inverted index representation for this collection.

c. Answer the queries:

• a. schizophrenia AND drug
• b. for AND NOT(drug OR approach)
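TERM DOCUMENT MATRIX

Terms            Doc 1   Doc 2   Doc 3   Doc 4
approach           0       0       1       0
breakthrough       1       0       0       0
drug               1       1       0       0
for                1       0       1       1
hopes              0       0       0       1
new                0       1       1       1
of                 0       0       1       0
patients           0       0       0       1
schizophrenia      1       1       1       1
treatment          0       0       1       0

INVERTED INDEX

approach         3
breakthrough     1
drug             1  2
for              1  3  4
hopes            4
new              2  3  4
of               3
patients         4
schizophrenia    1  2  3  4
treatment        3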
C. Query answers
a. schizophrenia AND drug
Doc 1 and Doc 2

b. for AND NOT(drug OR approach)
Doc 4
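
These answers can be checked mechanically. The sketch below builds a small index over the four practice documents, with Python sets as postings, and evaluates both queries. Whitespace tokenization and the variable names are assumptions for illustration.

```python
docs = {
    1: "breakthrough drug for schizophrenia",
    2: "new schizophrenia drug",
    3: "new approach for treatment of schizophrenia",
    4: "new hopes for schizophrenia patients",
}

# term -> set of IDs of the documents that contain it
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

all_docs = set(docs)

# a. schizophrenia AND drug
query_a = index["schizophrenia"] & index["drug"]

# b. for AND NOT(drug OR approach)
query_b = index["for"] & (all_docs - (index["drug"] | index["approach"]))

print(sorted(query_a), sorted(query_b))
```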
PROCESSING THE BOOLEAN
QUERIES

• How do we process a query using an inverted index and basic Boolean retrieval
model?
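
One common answer for an AND query, sketched here as an illustration rather than as the method this lecture goes on to present: walk the two postings lists in parallel, which works because both are sorted by docID. The function name `intersect` is a choice made for this sketch, and the lists are the example postings shown earlier.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID.

    Each step advances at least one pointer, so the cost is linear
    in the combined length of the two lists.
    """
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID occurs in both lists: part of the answer.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1   # p1 is behind, move it forward
        else:
            j += 1   # p2 is behind, move it forward
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
calpurnia = [2, 31, 54, 101]

print(intersect(brutus, caesar))
print(intersect(brutus, calpurnia))
```

The merge relies on the postings being kept sorted by docID, as in the dictionary and postings layout above.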
REFERENCES

• Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, “An Introduction to Information Retrieval”, 2013, Cambridge University Press.
• Stefan Büttcher, Charles L. A. Clarke and Gordon V. Cormack, “Information Retrieval”, 2010, MIT Press.
