IR Summary Lec 1 - Introduction

Introducing Information Retrieval and Web Search
Information Retrieval

• Information Retrieval (IR) is finding material (usually documents) of an
  unstructured nature (usually text) that satisfies an information need from
  within large collections (usually stored on computers).
[Figure: Unstructured (text) vs. structured (database) data in the mid-nineties]
[Figure: Unstructured (text) vs. structured (database) data today]
The classic search model (Sec. 1.1)

• User task: get rid of mice in a politically correct way (misconception?)
• Info need: info about removing mice without killing them (misformulation?)
• Query: "how trap mice alive", issued to a search engine over a collection
• Results may send the user back into query refinement
• Evaluating the results (Sec. 1.1): answers may be right or wrong, and
  relevant documents retrieved or not.
Term-document incidence matrices (Sec. 1.1)
• One could grep all of Shakespeare's plays for Brutus and Caesar, then
  strip out lines containing Calpurnia.
• Why is that not the answer?
  – Slow (for large corpora)
  – Queries on term positions (e.g., Romans near countrymen) are not trivial
  – The linear scan must be repeated for each query (takes too long)
  – No ranked retrieval (returning the best documents, e.g., by how often
    each word is repeated in a doc)
Term-document incidence matrix (Sec. 1.1):

              Antony and  Julius  The      Hamlet  Othello  Macbeth
              Cleopatra   Caesar  Tempest
  Antony         1          1       0        0       0        1
  Brutus         1          1       0        1       0        0
  Caesar         1          1       0        1       1        1
  Calpurnia      0          1       0        0       0        0
  Cleopatra      1          0       0        0       0        0
  mercy          1          0       1        1       1        1
  worser         1          0       1        1       1        0

Entry (t, d) is 1 if play d contains word t, and 0 otherwise.
Incidence vectors (Sec. 1.1)

• So we have a 0/1 vector for each term (one bit per play).
• To answer the query Brutus AND Caesar AND NOT Calpurnia: take the vectors
  for Brutus, Caesar and Calpurnia (complemented), then bitwise AND them:

      110100 (Brutus)
  AND 110111 (Caesar)
  AND 101111 (Calpurnia, complemented)
  =   100100

• Answers to the query: Antony and Cleopatra, and Hamlet.
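A minimal sketch of this Boolean query in Python, holding each term's 0/1
vector in an integer (the bit layout mirrors the printed vectors above;
everything else is just this toy example):

```python
# Boolean retrieval over the toy term-document incidence matrix above.
# Each term's 0/1 vector is a Python int; the leftmost printed bit
# corresponds to the first play.

plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
N = len(plays)

def vec(bits: str) -> int:
    """Parse a printed 0/1 incidence vector into an integer."""
    return int(bits, 2)

brutus, caesar, calpurnia = vec("110100"), vec("110111"), vec("010000")

mask = (1 << N) - 1                      # all-ones vector, for complement
result = brutus & caesar & (mask & ~calpurnia)

print(format(result, f"0{N}b"))          # 100100
print([plays[i] for i in range(N) if result & (1 << (N - 1 - i))])
# ['Antony and Cleopatra', 'Hamlet']
```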
Bigger collections (Sec. 1.1)

• Consider, say, 1 million documents of about 1000 words each
  (1000 * 1 million word occurrences).
• The term-document matrix is extremely sparse: most entries are 0
  (about 99.8% in this example).
• What's a better representation?
  – We only record the 1 positions.
The Inverted Index: the key data structure underlying modern IR
Inverted index (Sec. 1.2)

• For each term t, we must store a list of all documents that contain t.
  – Identify each doc by a docID, a document serial number.
• Can we use fixed-size arrays for this?

    Brutus    -> 1  2  4  11  31  45  173  174
    Caesar    -> 1  2  4  5   6   16  57   132
    Calpurnia -> 2  31 54 101

• What happens if the word Caesar is added to document 14?
Inverted index (Sec. 1.2)

• We need variable-size postings lists.
  – On disk, a continuous run of postings is normal and best.
  – In memory, we can use linked lists or variable-length arrays; each
    entry in a postings list is called a posting.
• There are tradeoffs in size vs. ease of insertion.

    Dictionary   Postings
    Brutus    -> 1  2  4  11  31  45  173  174
    Caesar    -> 1  2  4  5   6   16  57   132
    Calpurnia -> 2  31 54 101

• Postings are sorted by docID (more later on why).
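A minimal in-memory sketch of this structure, assuming toy documents and
whitespace tokenization (real indexers handle tokenization and scale very
differently):

```python
# Build an inverted index: a dictionary mapping each term to a sorted,
# variable-length postings list of docIDs. Toy data for illustration.

from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar",
    2: "So let it be with Caesar",
}

index = defaultdict(list)        # term -> postings list, sorted by docID
for doc_id in sorted(docs):      # visiting docs in docID order keeps
    for term in set(docs[doc_id].lower().split()):  # postings sorted
        index[term].append(doc_id)

print(index["caesar"])           # [1, 2]

# Appending to a Python list handles the "Caesar added to document 14"
# case that fixed-size arrays cannot: postings lists grow as needed.
```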
Inverted index construction (Sec. 1.2):

  Documents -> Tokenizer -> Linguistic modules -> Indexer -> Inverted index
  e.g.  friend -> 2, 4;   roman -> 1, 2;   countryman -> 13, 16
Initial stages of text processing (sketched in code below)

• Tokenization
  – Cut the character sequence into word tokens
  – Deal with cases like "John's", a state-of-the-art solution
• Normalization
  – Map text and query terms to the same form
  – You want U.S.A. and USA to match
• Stemming
  – We may wish different forms of a root to match
  – authorize, authorization
• Stop words
  – We may omit very common words (or not)
  – the, a, to, of
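A rough sketch of these stages; the regex tokenizer, the crude
suffix-stripping stemmer, and the tiny stop list are illustrative
assumptions, not the course's prescribed methods:

```python
# Toy text-processing pipeline: tokenize -> normalize -> stop -> stem.

import re

STOP_WORDS = {"the", "a", "to", "of"}

def tokenize(text: str) -> list[str]:
    return re.findall(r"[A-Za-z.']+", text)    # cut into word tokens

def normalize(token: str) -> str:
    return token.lower().replace(".", "")      # e.g. U.S.A. -> usa

def stem(token: str) -> str:
    for suffix in ("ization", "ize", "s"):     # crude suffix stripping
        if token.endswith(suffix):
            return token[: -len(suffix)]
    return token

def preprocess(text: str) -> list[str]:
    tokens = (normalize(t) for t in tokenize(text))
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The U.S.A. may authorize authorization"))
# ['usa', 'may', 'author', 'author']
```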
The resulting index (Sec. 1.2)

• The dictionary stores the terms and their counts; from each term a
  pointer leads to its postings list, a list of docIDs.
• IR system implementation questions:
  – How do we index efficiently?
  – How much storage do we need?
Query processing with an inverted index (Sec. 1.3)
Intersecting two postings lists (a "merge" algorithm, Sec. 1.3)

• Walk through the two postings lists simultaneously, in time linear in
  the total number of postings entries:

    Brutus -> 2  4  8  16  32  64  128
    Caesar -> 1  2  3  5   8   13  21  34
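The merge in Python, following the algorithm described above directly:

```python
# Intersect two sorted postings lists by walking both simultaneously,
# in time linear in the total number of postings entries.

def intersect(p1: list[int], p2: list[int]) -> list[int]:
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])      # docID present in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                    # advance the pointer on the
        else:                         # smaller docID
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))      # [2, 8]
```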
Quiz

• When a search engine returns 30 pages, only 20 of which are relevant,
  while failing to return 40 additional relevant pages, its precision
  = ............... while its recall = ...............
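A worked answer, using the standard definitions (precision = relevant
retrieved / retrieved; recall = relevant retrieved / all relevant): 20 of
the 30 returned pages are relevant, so precision = 20/30 ≈ 0.67; there are
20 + 40 = 60 relevant pages in total, so recall = 20/60 ≈ 0.33.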
The Boolean Retrieval Model & Extended Boolean Models
Phrase queries (Sec. 2.4)

• We want to be able to answer queries such as "stanford university" –
  as a phrase.
• Thus the sentence "I went to university at Stanford" is not a match.
  – The concept of phrase queries has proven easily understood by users;
    it is one of the few "advanced search" ideas that works.
• For this, it no longer suffices to store only <term : docs> entries.
Positional index example (Sec. 2.4.1)

    <be: 993427;
       1: 7, 18, 33, 72, 86, 231;
       2: 3, 149;
       4: 17, 191, 291, 430, 434;
       ...>

  (term: document frequency; then, per document, docID: positions of the
  term in that document)

• Which of docs 1, 2, 4, 5 could contain "to be or not to be"?
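A minimal sketch of answering a two-word phrase query from a positional
index; the dictionary layout and the toy positions below are illustrative
assumptions, not the slide's data:

```python
# positional_index maps term -> {docID: sorted positions in that doc}.

positional_index = {
    "stanford":   {1: [3, 40], 2: [7]},
    "university": {1: [4, 19], 3: [2]},
}

def phrase_docs(t1: str, t2: str) -> list[int]:
    """Docs where t2 occurs immediately after t1."""
    p1, p2 = positional_index[t1], positional_index[t2]
    hits = []
    for doc in sorted(p1.keys() & p2.keys()):   # docs containing both terms
        positions2 = set(p2[doc])
        if any(pos + 1 in positions2 for pos in p1[doc]):
            hits.append(doc)
    return hits

print(phrase_docs("stanford", "university"))    # [1]
```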
Rules of thumb

• A positional index is 2–4 times as large as a non-positional index.
Combination schemes

• The biword and positional approaches can be profitably combined (a
  biword sketch follows below).
  – For particular phrases ("Michael Jackson", "Britney Spears") it is
    inefficient to keep merging positional postings lists, so popular
    biwords are indexed as single terms.
  – Even more so for phrases like "The Who", where intersection is very
    expensive in a positional index (both words have large postings lists).
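A rough sketch of the biword idea under these assumptions: every adjacent
word pair in a toy collection is indexed as a single dictionary term, so a
popular phrase becomes one postings lookup (a real system would index only
selected frequent biwords):

```python
# Biword index: two-word phrases as dictionary terms.

from collections import defaultdict

docs = {1: "michael jackson thriller", 2: "jackson michael"}

biword_index = defaultdict(list)
for doc_id, text in sorted(docs.items()):
    tokens = text.split()
    for w1, w2 in zip(tokens, tokens[1:]):      # each adjacent pair
        biword_index[f"{w1} {w2}"].append(doc_id)

print(biword_index["michael jackson"])   # [1] -- a single postings lookup
```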
Query optimization

• What is the best order for query processing?
• Consider a query that is an AND of n terms.
• For each of the n terms, get its postings, then AND them together.

    Brutus    -> 2  4  8  16  32  64  128
    Caesar    -> 1  2  3  5   8   16  21  34
    Calpurnia -> 13 16

• Heuristic: process terms in order of increasing postings-list size,
  starting with the rarest term (here Calpurnia), so that intermediate
  results stay small.
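A sketch of this heuristic, assuming the postings above; for
self-containment each AND step uses a set-membership filter rather than
the linear merge from the earlier sketch:

```python
# AND query processed in order of increasing postings-list length.

postings = {
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Caesar":    [1, 2, 3, 5, 8, 16, 21, 34],
    "Calpurnia": [13, 16],
}

def and_query(terms: list[str]) -> list[int]:
    ordered = sorted(terms, key=lambda t: len(postings[t]))
    result = postings[ordered[0]]            # start with the rarest term
    for term in ordered[1:]:
        docs = set(postings[term])
        result = [d for d in result if d in docs]   # result only shrinks
    return result

print(and_query(["Brutus", "Caesar", "Calpurnia"]))   # [16]
```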
Exercise

• Recommend a query processing order for a given AND query, using the
  terms' postings-list sizes.
Structured vs. Unstructured Data
IR vs. databases: structured vs. unstructured data

• Structured data tends to refer to information in "tables":

    Employee  Manager  Salary
    Smith     Jones    50000
    Chang     Smith    60000
    Ivy       Smith    50000
Semi-structured data

• In fact, almost no data is truly "unstructured".
• E.g., this slide has distinctly identified zones such as the Title and
  Bullets
• ... to say nothing of linguistic structure.
• This facilitates "semi-structured" search (sketched below) such as
  – Title contains data AND Bullets contain search
• Or even
  – Title is about Object Oriented Programming AND Author something like
    stro*rup
  – where * is the wildcard operator
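A minimal sketch of such zone-based search; the toy slide corpus, the
field names, and the use of fnmatch for the * wildcard are illustrative
assumptions:

```python
# Fielded (zone) search: queries constrain individual zones of a document.

from fnmatch import fnmatch

slides = [
    {"title": "Semistructured data", "bullets": "search over zones",
     "author": "stroustrup"},
    {"title": "Plain data", "bullets": "nothing here", "author": "smith"},
]

def zone_query(doc: dict) -> bool:
    # Title contains "data" AND Bullets contain "search"
    # AND Author matches the wildcard pattern stro*rup
    return ("data" in doc["title"].lower()
            and "search" in doc["bullets"].lower()
            and fnmatch(doc["author"].lower(), "stro*rup"))

print([d["title"] for d in slides if zone_query(d)])
# ['Semistructured data']
```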