L3L4 IRSW Boolean Retrieval
SEMANTIC WEB
16B1NCI648
Lecture 3 and 4
CONTENTS TO BE COVERED
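[Figure: the information retrieval process. Stages and their outputs: Query Formulation (Query), Selection (Documents), Examination (Documents), Delivery. Feedback loops: system discovery, vocabulary discovery, concept discovery, document discovery, source reselection.]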
WHAT IS A MODEL?
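[Figure: concepts are expressed as a Query on one side and as Documents on the other. Each side passes through a Representation Function; the document side produces the Index. A Comparison Function matches the query representation against the Index and returns Hits.]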
TYPES OF MODELS
• Boolean model
• Based on the notion of sets
• Documents are retrieved only if they satisfy Boolean
conditions specified in the query
• Does not impose a ranking on retrieved documents
• Exact match
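The set-based view above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical four-document collection; the terms and docIDs are made up and do not come from the lecture.

```python
# Hypothetical toy collection: each term maps to the set of docIDs that contain it.
postings = {
    "information": {1, 2, 4},
    "retrieval": {2, 3, 4},
    "semantic": {1, 3},
}
all_docs = {1, 2, 3, 4}

# AND is set intersection, OR is set union, NOT is complement against the collection.
both = postings["information"] & postings["retrieval"]
either = postings["information"] | postings["semantic"]
without = all_docs - postings["semantic"]

# Exact match: a document is either in the result set or not; there is no ranking.
print(sorted(both), sorted(either), sorted(without))
```

The result is an unordered set, which is why the Boolean model by itself gives no ranking of the retrieved documents.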
HOW DO WE REPRESENT TEXT?
• Three iterations!
• Quiz: Can we do better?
ANOTHER EXAMPLE
[Figure: diagram with regions labeled A, B, and C]
REPRESENTING DOCUMENTS AS A TERM-DOCUMENT INCIDENCE MATRIX
Term        Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Caesar      1                      1               0             1        1         1
Calpurnia   0                      1               0             0        0         0
Cleopatra   1                      0               0             0        0         0
mercy       1                      0               1             1        1         1
worser      1                      0               1             1        1         0
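A minimal sketch of how such an incidence matrix could be built. The three short documents and the helper name `incidence_matrix` are hypothetical, chosen only to keep the example small; whitespace splitting stands in for real tokenization.

```python
# Hypothetical toy documents (not from the lecture).
docs = {
    "d1": "caesar shows mercy",
    "d2": "calpurnia warns caesar",
    "d3": "mercy for cleopatra",
}

def incidence_matrix(docs):
    """Return (terms, doc_ids, matrix); matrix[i][j] is 1 if term i occurs in doc j."""
    doc_ids = sorted(docs)
    tokenized = {d: set(text.lower().split()) for d, text in docs.items()}
    terms = sorted(set().union(*tokenized.values()))
    matrix = [[1 if t in tokenized[d] else 0 for d in doc_ids] for t in terms]
    return terms, doc_ids, matrix

terms, doc_ids, matrix = incidence_matrix(docs)
for term, row in zip(terms, matrix):
    print(f"{term:<10}", *row)
```

Each row is a term and each column a document, so a cell records only whether the term occurs, not how often.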
• Advantages
• Results are predictable, relatively easy to explain
• Many different features can be incorporated
• Efficient processing since many documents can be
eliminated from search
• Disadvantages
• Effectiveness depends entirely on user
• Complex queries are difficult
QUESTION 1
Query:
(nuclear AND treaty) OR ((NOT treaty) AND (nonproliferation OR Iran))
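One way to check an answer to this query is to evaluate it with set operations. The sketch below assumes a hypothetical six-document collection; the postings are invented for illustration and are not the data intended by the question.

```python
# Hypothetical postings: term -> set of docIDs containing it.
postings = {
    "nuclear": {1, 2, 5},
    "treaty": {1, 3, 5},
    "nonproliferation": {2, 3, 4},
    "iran": {4, 6},
}
all_docs = {1, 2, 3, 4, 5, 6}

# (nuclear AND treaty)
left = postings["nuclear"] & postings["treaty"]

# ((NOT treaty) AND (nonproliferation OR Iran))
not_treaty = all_docs - postings["treaty"]
right = not_treaty & (postings["nonproliferation"] | postings["iran"])

result = left | right
print(sorted(result))  # [1, 2, 4, 5, 6]
```

Document 3 is excluded: it contains "treaty" without "nuclear", so it fails the left branch, and the NOT treaty condition rules it out of the right branch.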
EXTENDED BOOLEAN RETRIEVAL MODEL
A REALISTIC EXAMPLE
Term        Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony      1                      1               0             0        0         1
Brutus      1                      1               0             1        0         0
Caesar      1                      1               0             1        1         1
Calpurnia   0                      1               0             0        0         0
Cleopatra   1                      0               0             0        0         0
mercy       1                      0               1             1        1         1
worser      1                      0               1             1        1         0
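A Boolean query over this matrix can be answered with bitwise operations on the term rows. The sketch below assumes the Brutus, Caesar, and Calpurnia rows read as 110100, 110111, and 010000, and uses the query Brutus AND Caesar AND NOT Calpurnia, which is chosen here for illustration.

```python
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Rows of the incidence matrix, one 0/1 entry per play.
matrix = {
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}

def to_bits(row):
    """Pack a 0/1 row into an int, first column as the most significant bit."""
    bits = 0
    for cell in row:
        bits = (bits << 1) | cell
    return bits

n = len(plays)
mask = (1 << n) - 1  # keeps the complement within n bits

# Brutus AND Caesar AND NOT Calpurnia
answer = (to_bits(matrix["Brutus"])
          & to_bits(matrix["Caesar"])
          & (~to_bits(matrix["Calpurnia"]) & mask))

hits = [plays[i] for i in range(n) if (answer >> (n - 1 - i)) & 1]
print(format(answer, "06b"), hits)  # 100100 ['Antony and Cleopatra', 'Hamlet']
```

The mask matters because Python integers are unbounded, so a bare `~` would produce a negative number rather than a six-bit complement.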
INVERTED INDEX EXAMPLE
Caesar 1 2 4 5 6 16 57 132
Calpurnia 2 31 54 101
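Because each postings list is sorted by docID, an AND query can be answered by walking both lists in step. A minimal sketch of this merge-style intersection, using the two lists above; the function name `intersect` is an illustrative choice.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])  # docID present in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1             # advance the list with the smaller docID
        else:
            j += 1
    return out

index = {
    "Caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}

print(intersect(index["Caesar"], index["Calpurnia"]))  # [2]
```

Each pointer only moves forward, so the work is proportional to the combined length of the two lists.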
[Figure: inverted index construction pipeline]
Tokenizer
Token stream: Friends Romans Countrymen
Linguistic modules
Indexer
Inverted index:
friend       2 4
roman        1 2
countryman   13 16
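The pipeline above (tokenizer, linguistic modules, indexer) can be sketched as three small functions. Everything here is an assumption for illustration: the `LEMMAS` lookup table is a toy stand-in for real linguistic modules, and the five documents are invented so that the resulting postings match the figure.

```python
import re

# Toy stand-in for linguistic modules; a real system would use a stemmer or lemmatizer.
LEMMAS = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}

def tokenize(text):
    """Split text into alphabetic tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)

def normalize(tokens):
    """Lowercase each token and map it to its base form when one is known."""
    return [LEMMAS.get(tok.lower(), tok.lower()) for tok in tokens]

def build_index(docs):
    """Map each normalized term to a sorted list of docIDs containing it."""
    index = {}
    for doc_id in sorted(docs):
        for term in set(normalize(tokenize(docs[doc_id]))):
            index.setdefault(term, []).append(doc_id)
    return index

# Hypothetical documents keyed by docID.
docs = {1: "Romans", 2: "Friends, Romans", 4: "Friends",
        13: "Countrymen", 16: "countrymen"}

print(normalize(tokenize("Friends, Romans, countrymen.")))
print(build_index(docs))
```

Processing documents in docID order keeps every postings list sorted without a separate sort step.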
INDEXER STEPS: TOKEN SEQUENCE
[Figure: token sequence extracted from Doc 1 and Doc 2]
• Sort by term
• Then by docID
[Figure: terms and counts, with pointers to lists of docIDs]
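The indexer steps can be sketched as: emit (term, docID) pairs, sort them by term and then docID, and group them into a term entry with a count and a list of docIDs. The two documents below are hypothetical, and reading "counts" as the number of documents containing the term is an assumption.

```python
from itertools import groupby

# Hypothetical two-document collection (not the lecture's Doc 1 and Doc 2).
docs = {1: "caesar was killed", 2: "caesar was ambitious brutus said"}

# Step 1: token sequence of (term, docID) pairs.
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in text.split()]

# Step 2: sort by term, then by docID (tuple ordering does both at once).
pairs.sort()

# Step 3: group by term; keep a document count and the sorted list of docIDs.
index = {}
for term, group in groupby(pairs, key=lambda pair: pair[0]):
    postings = sorted({doc_id for _, doc_id in group})
    index[term] = (len(postings), postings)

for term, (count, postings) in index.items():
    print(f"{term:<10} {count}  {postings}")
```

Here a plain dict plays the role of the term table, and each value plays the role of the pointer to that term's list of docIDs.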
PRACTICE QUESTION
INVERTED INDEX
C. Query answers
a. schizophrenia AND drug: Doc 1 and Doc 2