L3L4 IRSW Boolean Retrieval

Uploaded by

Saurabh Mor

INFORMATION RETRIEVAL AND SEMANTIC WEB
16B1NCI648

Lecture 3 and 4
CONTENTS TO BE COVERED

• Information retrieval Models


• Term Document Matrix
• Boolean retrieval incidence model
• Inverted Index Creation
THE INFORMATION RETRIEVAL CYCLE
[Figure: Source Selection → (Resource) → Query Formulation → (Query) → Search → (Ranked List) → Selection → (Documents) → Examination → (Documents) → Delivery. Feedback loops: system discovery, vocabulary discovery, concept discovery, document discovery, source reselection.]
WHAT IS A MODEL?

• A model is a construct designed to help us understand a complex system
• A particular way of “looking at things”

• Models inevitably make simplifying assumptions


• What are the limitations of the model?

• Different types of models:


• Conceptual models
• Physical analog models
• Mathematical models
• …
THE CENTRAL PROBLEM IN IR
Information Seeker: Concepts → Query Terms
Authors: Concepts → Document Terms

Do these represent the same concepts?

THE IR BLACK BOX
[Figure: Query → Representation Function → Query Representation. Documents → Representation Function → Document Representation → Index. The Comparison Function matches the Query Representation against the Index and returns Hits.]
TYPES OF MODELS

• Boolean model
• Based on the notion of sets
• Documents are retrieved only if they satisfy Boolean
conditions specified in the query
• Does not impose a ranking on retrieved documents
• Exact match
TYPES OF MODELS

• Vector space model


• Based on geometry, the notion of vectors in high
dimensional space
• Documents are ranked based on their similarity to the
query
• Best/partial match
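To make the geometric view concrete, here is a minimal sketch of ranking by vector similarity. It assumes raw term counts as vector components and cosine similarity as the measure; the slide does not name a specific weighting or similarity function, and the two toy documents are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy collection, used only for illustration.
docs = {
    "d1": "boolean retrieval model",
    "d2": "vector space model",
}
vectors = {d: Counter(text.split()) for d, text in docs.items()}
query = Counter("vector model".split())

# Rank documents by decreasing similarity to the query (partial matches still score).
ranking = sorted(docs, key=lambda d: cosine(query, vectors[d]), reverse=True)
print(ranking)
```

Unlike the Boolean model, d1 is not discarded for missing the term "vector"; it simply ranks lower than d2.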
MODELS (CONTD..)

• Probabilistic Model (Language model)


• Based on the notion of probabilities and processes
for generating text
• Documents are ranked based on the probability that
they generated the query
• Best/partial match
REPRESENTING TEXT
[Figure: the IR black box pipeline. Query → Representation Function → Query Representation. Documents → Representation Function → Document Representation → Index. Comparison Function → Hits.]
HOW DO WE REPRESENT TEXT?

• How do we represent the complexities of language?


• Keeping in mind that computers don’t “understand” documents or
queries

• Simple, yet effective approach: “bag of words”


• Treat all the words in a document as index terms for that document
• Disregard order, structure, meaning, etc. of the words
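A minimal sketch of the bag-of-words idea follows. The tokenizer (lowercasing and a simple regular expression) is an assumption made for illustration; the slides do not prescribe one.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map a text to its multiset of lowercase word tokens (order is discarded)."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)

a = bag_of_words("Brutus killed Caesar")
b = bag_of_words("Caesar killed Brutus")

# Both sentences get the same representation even though their meanings differ.
print(a == b)
```

This shows the price of the simplification: who killed whom is lost, but the index terms are identical.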
SIMPLE INFORMATION RETRIEVAL SYSTEM
• Let's consider a simple example: a collection of 5 documents with the following contents
• d1: IIIT ALLAHABAD
• d2: IIIT DELHI
• d3: IIIT GUWAHATI
• d4: IIIT KANCHIPURAM
• d5: IIIT SRI CITY
• Query is
• IIIT SRI CITY
• Which document will you match and why?
APPROACH FOLLOWED (LINEAR SCANNING)
• First match the term IIIT.
• Keep only the documents that contain this term.
• Next match the term SRI.
• Keep only the documents that contain this term.
• Next match the term CITY.
• Keep only the documents that contain this term.

• Three iterations!
• Quiz: Can we do better?
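The scanning procedure can be sketched as below. The function name `linear_scan` is hypothetical, and treating the query as a conjunction of its terms is an assumption consistent with the example.

```python
docs = {
    "d1": "IIIT ALLAHABAD",
    "d2": "IIIT DELHI",
    "d3": "IIIT GUWAHATI",
    "d4": "IIIT KANCHIPURAM",
    "d5": "IIIT SRI CITY",
}

def linear_scan(docs, query):
    """One pass over the surviving documents per query term."""
    candidates = dict(docs)
    for term in query.split():
        # Keep only the documents whose text contains the current term.
        candidates = {d: text for d, text in candidates.items()
                      if term in text.split()}
    return sorted(candidates)

print(linear_scan(docs, "IIIT SRI CITY"))
```

Every query rereads document text, which is what makes this approach slow on a large collection.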
ANOTHER EXAMPLE

• Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
• The naive approach performs a linear scan through the documents
• One could grep all the documents to find Brutus and Caesar, then strip out lines containing Calpurnia?
ISSUES WITH LINEAR SCANNING

– Slow (for large corpora)
– NOT Calpurnia is non-trivial
– Other operations (e.g., find the word Romans near countrymen) are not feasible
– Ranked retrieval is not possible
BOOLEAN RETRIEVAL SYSTEM

• Weights assigned to terms are either “0” or “1”


• “0” represents “absence”: term isn’t in the document
• “1” represents “presence”: term is in the document

• Build queries by combining terms with Boolean operators
• AND, OR, NOT

• The system returns all documents that satisfy the query
Why do we say that Boolean retrieval is “set-based”?
AND/OR/NOT
[Figure: Venn diagram of three sets A, B and C inside the set of all documents.]
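One way to see why the model is called set-based: each term stands for the set of documents that contain it, and the three operators map to set intersection, union and complement. The docIDs below are hypothetical.

```python
all_docs = {1, 2, 3, 4, 5, 6}   # the universe: every document in the collection
A = {1, 2, 4}                    # documents containing term A
B = {1, 2, 4, 5, 6}              # documents containing term B
C = {2}                          # documents containing term C

a_and_b = A & B                  # AND is intersection
a_or_c = A | C                   # OR is union
not_c = all_docs - C             # NOT is complement relative to all documents
a_and_b_not_c = A & B & not_c

print(sorted(a_and_b_not_c))
```

Note that NOT needs the universe of all documents, which is one reason negation is awkward for a linear scan.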
REPRESENTING DOCUMENTS AS A TERM-DOCUMENT INCIDENCE MATRIX

1 if document contains word, 0 otherwise

Terms (rows) × Documents (columns)

Term         d1  d2  d3  d4  d5
IIIT          1   1   1   1   1
ALLAHABAD     1   0   0   0   0
DELHI         0   1   0   0   0
GUWAHATI      0   0   1   0   0
KANCHIPURAM   0   0   0   1   0
SRI           0   0   0   0   1
CITY          0   0   0   0   1

• Query is IIIT SRI CITY
• Answer: document d5 is suitable
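A short sketch that builds this incidence matrix and answers the query by AND-ing the rows of the query terms column by column. Interpreting the query as IIIT AND SRI AND CITY is an assumption; the helper name `and_query` is hypothetical.

```python
docs = {
    "d1": "IIIT ALLAHABAD",
    "d2": "IIIT DELHI",
    "d3": "IIIT GUWAHATI",
    "d4": "IIIT KANCHIPURAM",
    "d5": "IIIT SRI CITY",
}
doc_ids = sorted(docs)
terms = sorted({t for text in docs.values() for t in text.split()})

# matrix[term] is the row of 0/1 entries, one per document in doc_ids order.
matrix = {t: [1 if t in docs[d].split() else 0 for d in doc_ids] for t in terms}

def and_query(query):
    """Return the documents whose column is 1 in every query-term row."""
    rows = [matrix[t] for t in query.split()]
    return [d for d, column in zip(doc_ids, zip(*rows)) if all(column)]

print(and_query("IIIT SRI CITY"))
```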
TERM-DOCUMENT INCIDENCE MATRICES

Terms (rows) × Documents (columns)

            Antony and  Julius  The
            Cleopatra   Caesar  Tempest  Hamlet  Othello  Macbeth
Antony      1           1       0        0       0        1
Brutus      1           1       0        1       0        0
Caesar      1           1       0        1       1        1
Calpurnia   0           1       0        0       0        0
Cleopatra   1           0       0        0       0        0
mercy       1           0       1        1       1        1
worser      1           0       1        1       1        0

“Brutus and Caesar and not Calpurnia”

What is the best way to get to the answer?
TERM-DOCUMENT INCIDENCE MATRICES

“Brutus and Caesar and not Calpurnia”

Take the row vectors for Brutus and Caesar and the complement of the vector for Calpurnia, then AND them bitwise:
110100 AND 110111 AND 101111 = 100100
The 1 bits select Antony and Cleopatra and Hamlet.
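The bitwise computation can be checked with Python integers as bit vectors. The mask is needed because `~` on a Python int produces a negative number rather than a 6-bit complement; the variable names are illustrative.

```python
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# One bit per play, leftmost bit = first play, as in the incidence matrix rows.
vectors = {"Brutus": 0b110100, "Caesar": 0b110111, "Calpurnia": 0b010000}

mask = (1 << len(plays)) - 1                      # 0b111111
not_calpurnia = ~vectors["Calpurnia"] & mask      # 0b101111
result = vectors["Brutus"] & vectors["Caesar"] & not_calpurnia

bits = format(result, "06b")
answer = [play for play, bit in zip(plays, bits) if bit == "1"]
print(bits, answer)
```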
BOOLEAN RETRIEVAL MODEL

The Boolean retrieval model is the simplest model on which to build an IR system.

• Boolean queries use AND, OR and NOT to join query terms.

• Views each document as a set of words.

• Either there is a match or there is no match. We do not rank the results.
BOOLEAN RETRIEVAL

• Advantages
• Results are predictable, relatively easy to explain
• Many different features can be incorporated
• Efficient processing since many documents can be
eliminated from search

• Disadvantages
• Effectiveness depends entirely on the user
• Complex queries are difficult
QUESTION 1

D1: “Information Retrieval”


D2: “Information Theory”
D3: “Modern Information Retrieval: Theory and Practice”
D4: “Text Compression”

Query: ((text ∨ information) ∧ retrieval ∧ ¬theory)
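One way to work through this question is to treat each term as the set of documents whose title contains it and evaluate the query with set operations. The case-insensitive tokenizer and the helper name `postings` are assumptions for this sketch.

```python
import re

docs = {
    "D1": "Information Retrieval",
    "D2": "Information Theory",
    "D3": "Modern Information Retrieval: Theory and Practice",
    "D4": "Text Compression",
}
all_docs = set(docs)

def postings(term):
    """Set of documents whose lowercase word tokens include the term."""
    return {d for d, text in docs.items()
            if term in re.findall(r"[a-z]+", text.lower())}

# ((text OR information) AND retrieval AND NOT theory)
result = ((postings("text") | postings("information"))
          & postings("retrieval")
          & (all_docs - postings("theory")))
print(sorted(result))
```

D3 contains "retrieval" but is excluded by the NOT theory clause, and D4 fails the retrieval clause, so only D1 remains.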


QUESTION 2

Query: (nuclear AND treaty) OR ((NOT treaty) AND (nonproliferation OR Iran))
EXTENDED BOOLEAN RETRIEVAL MODEL
A REALISTIC EXAMPLE

• Consider N = 1 million documents.
• Number of distinct terms, T = 500,000
• Suppose we create the term-document incidence matrix
Total number of cells in matrix M = 500,000 × 1,000,000 = 0.5 × 10^12
At one byte per cell this is approx. 500 GB (about 62.5 GB even at one bit per cell).
Holding a matrix of this size in memory for query execution is infeasible.
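The arithmetic can be reproduced in a few lines. The storage cost per cell is an assumption: the figure of roughly 500 GB corresponds to one byte per cell, and the one-bit case is shown for comparison.

```python
n_docs = 1_000_000      # N
n_terms = 500_000       # T

cells = n_docs * n_terms            # number of 0/1 entries in the matrix

gb = 10**9                          # decimal gigabyte
size_one_byte_per_cell = cells / gb         # in GB
size_one_bit_per_cell = cells / 8 / gb      # in GB

print(cells, size_one_byte_per_cell, size_one_bit_per_cell)
```

Either way the matrix is far larger than what a typical machine holds in memory, and almost all of it would be zeros.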
BOOLEAN RETRIEVAL MODEL ISSUE:
CAN’T BUILD THE MATRIX

• In addition, matrix M will have half a trillion 0’s and 1’s.
• The matrix will be extremely sparse.

What’s a better representation?

• Solution:
• We only record the 1 positions.
• This idea is central to the first major concept in information retrieval, the inverted index.
INVERTED INDEX

• Here we maintain a dictionary of terms (also known as the lexicon)
• For each term t, we store a list of all documents that contain t, known as its postings list.
• Each document is identified by its document ID.
• Each term has its own postings list.
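A minimal sketch of this structure uses a Python dict as the dictionary and a sorted list of docIDs as each postings list. The function name `build_index` and the three toy documents are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the sorted list of docIDs of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):                       # visit docs in docID order
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)                # each docID appended once
    return dict(index)

# Hypothetical toy collection keyed by document ID.
docs = {1: "new home sales", 2: "home sales rise", 3: "new sales"}
index = build_index(docs)

for term in sorted(index):
    print(term, index[term])
```

Only the positions that would be 1 in the incidence matrix are stored, so space grows with the number of term occurrences rather than with terms × documents.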
INVERTED INDEX EXAMPLE

Instead of a full row of 0’s and 1’s per term, each term keeps only the docIDs of the documents that contain it:

Brutus      1 2 4 11 31 45 173 174
Caesar      1 2 4 5 6 16 57 132
Calpurnia   2 31 54 101
INVERTED INDEX EXAMPLE

How do we maintain these postings lists in memory?

• Fixed-size array: wastes storage space if unfilled.
• Linked list: requires additional pointers.
• Variable-size array: insertion is difficult.

Linked lists are preferred when insertions are dynamic.
Variable-size arrays are preferred for fast search.
INVERTED INDEX

• We need variable-size postings lists
• In memory, can use linked lists or variable length arrays
• Some tradeoffs in size/ease of insertion

Dictionary   Postings (sorted by docID)
Brutus       1 2 4 11 31 45 173 174
Caesar       1 2 4 5 6 16 57 132
Calpurnia    2 31 54 101

Each docID entry in a list is a posting.

INVERTED INDEX CONSTRUCTION

Documents to be indexed:   Friends, Romans, countrymen…
        ↓ Tokenizer
Token stream:              Friends  Romans  Countrymen
        ↓ Linguistic modules
Modified tokens:           friend  roman  countryman
        ↓ Indexer
Inverted index:            friend      2 4
                           roman       1 2
                           countryman  13 16
INDEXER STEPS: TOKEN SEQUENCE

• Sequence of (modified token, document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
INDEXER STEPS: SORT

• Sort by terms
• And then docID

Core indexing step


INDEXER STEPS: DICTIONARY & POSTINGS

• Multiple term entries in a single document are merged.
• Split into Dictionary and Postings
• Doc. frequency information is added.
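The three indexer steps (token sequence, sort, dictionary and postings) can be sketched on the two example documents. Lowercasing stands in for the linguistic modules here, which is a simplifying assumption; the variable names are illustrative.

```python
import re
from itertools import groupby

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: sequence of (modified token, docID) pairs.
pairs = [(token, doc_id)
         for doc_id, text in docs.items()
         for token in re.findall(r"[a-z']+", text.lower())]

# Step 2: sort by term, then by docID (tuple ordering does both).
pairs.sort()

# Step 3: merge repeated entries, then split into postings and dictionary.
postings = {}
for term, group in groupby(pairs, key=lambda pair: pair[0]):
    postings[term] = sorted({doc_id for _, doc_id in group})

# The dictionary records each term's document frequency.
dictionary = {term: len(plist) for term, plist in postings.items()}

print(dictionary["caesar"], postings["caesar"])
```

"caesar" occurs twice in Doc 2, but the merge step leaves a single posting for that document.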
WHERE DO WE PAY IN STORAGE?

• Terms and counts
• Pointers
• Lists of docIDs
PRACTICE QUESTION

Question I1: Consider these documents:

Doc 1: breakthrough drug for schizophrenia
Doc 2: new schizophrenia drug
Doc 3: new approach for treatment of schizophrenia
Doc 4: new hopes for schizophrenia patients

a. Draw the term-document incidence matrix for this document collection.
b. Draw the inverted index representation for this collection.
c. Answer the queries:
• a. schizophrenia AND drug
• b. for AND NOT(drug OR approach)
TERM DOCUMENT MATRIX

INVERTED INDEX
c. Query answers
a. schizophrenia AND drug: Doc 1 and Doc 2
b. for AND NOT(drug OR approach): Doc 4
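As a cross-check of the practice question, the following sketch builds both representations for the four documents and evaluates the two queries. The helper name `p` and the printed layout are illustrative choices, not the layout used in the lecture.

```python
docs = {
    1: "breakthrough drug for schizophrenia",
    2: "new schizophrenia drug",
    3: "new approach for treatment of schizophrenia",
    4: "new hopes for schizophrenia patients",
}
doc_ids = sorted(docs)
terms = sorted({t for text in docs.values() for t in text.split()})

# (a) Term-document incidence matrix: one 0/1 row per term.
matrix = {t: [int(t in docs[d].split()) for d in doc_ids] for t in terms}

# (b) Inverted index: term -> sorted list of docIDs.
index = {t: [d for d in doc_ids if t in docs[d].split()] for t in terms}

for t in terms:
    print(f"{t:<14}", *matrix[t], "  postings:", index[t])

# (c) Boolean queries evaluated on the postings as sets.
all_docs = set(doc_ids)

def p(term):
    return set(index.get(term, []))

query_a = sorted(p("schizophrenia") & p("drug"))
query_b = sorted(p("for") & (all_docs - (p("drug") | p("approach"))))
print(query_a, query_b)
```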
PROCESSING THE BOOLEAN QUERIES

• How do we process a query using an inverted index and the basic Boolean retrieval model?
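One common approach, sketched here as an illustration rather than as the lecture's stated answer, is to walk two docID-sorted postings lists in parallel and keep the docIDs that appear in both. The function names `intersect` and `and_not` are hypothetical; the postings are the ones from the inverted index example.

```python
def intersect(p1, p2):
    """AND of two postings lists sorted by docID, in O(len(p1) + len(p2))."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1          # advance the list with the smaller docID
        else:
            j += 1
    return answer

def and_not(p1, p2):
    """Postings in p1 that do not occur in p2 (p1 AND NOT p2)."""
    excluded = set(p2)
    return [doc_id for doc_id in p1 if doc_id not in excluded]

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
calpurnia = [2, 31, 54, 101]

both = intersect(brutus, caesar)
print(both, and_not(both, calpurnia))
```

The merge relies on the postings being sorted by docID, which is why the indexer sorts them at construction time.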
REFERENCES

• Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, “An Introduction to Information Retrieval”, Cambridge University Press, 2013.
• Stefan Büttcher, Charles L. A. Clarke and Gordon V. Cormack, “Information Retrieval”, MIT Press, 2010.
