
THE TERM VOCABULARY AND POSTINGS LISTS
INFORMATION RETRIEVAL AND WEB SEARCH

LECTURE 8

Outline
• Recap of the basic indexing pipeline
• Preprocessing to form the term vocabulary
  • Documents
  • Tokenization
  • What terms do we put in the index?

Recall the basic indexing pipeline

Documents to be indexed:      Friends, Romans, countrymen.
  (collect the documents to be indexed)
        |
    Tokenizer
        |
Token stream:                 Friends  Romans  Countrymen
        |
 Linguistic modules
        |
Modified tokens:              friend  roman  countryman
        |
     Indexer
        |
Inverted index:               friend     → 2, 4
                              roman      → 1, 2
                              countryman → 13, 16
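The whole pipeline fits in a few lines of code. Below is a minimal sketch (the function names are mine, not the lecture's): the tokenizer is a crude whitespace/punctuation splitter, and the "linguistic module" only case-folds, where a real system would also stem or lemmatize (e.g., Friends → friend).

    from collections import defaultdict

    def tokenize(text):
        # Crude tokenizer: split on whitespace, strip surrounding punctuation.
        return [t.strip(".,;:!?") for t in text.split()]

    def normalize(token):
        # Placeholder linguistic module: case folding only.
        return token.lower()

    def build_index(docs):
        # docs maps doc_id -> text; returns term -> sorted postings list.
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for token in tokenize(text):
                index[normalize(token)].add(doc_id)
        return {term: sorted(ids) for term, ids in index.items()}

    index = build_index({2: "Friends, Romans, countrymen.", 4: "Friends of Rome"})
    print(index["friends"])  # [2, 4]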

OBTAINING THE CHARACTER SEQUENCE IN A DOCUMENT
What exactly is a document?
• Before indexing, we need to understand the content of the digital document
• Documents exist in digital formats, which is why they are called digital documents
• A document is the input to the indexing process
• It takes the form of bytes in a file, which may sit on a web server

What exactly is a document?
• In order to work with documents, we need to extract the character sequence they contain
• A document is a sequence of bytes in a file (that's how it is stored on the machine)
• We take this sequence of bytes and extract a character sequence from it
• This sequence of characters is what we tokenize
What exactly is a document?

File/Document → byte sequence → (process into) character sequence → tokenize

What exactly is a document?
• We need to understand certain properties of a document before we can recover the character sequence from the byte sequence
• That step is called parsing a document
Parsing a Document (bytes in file → character sequence)
• We need to understand the format of the document

1) What is its format?
• Documents come in different formats:
  • PDF
  • Word
  • Excel
  • HTML, etc.
• Character sequences are extracted from each of these kinds of documents using different techniques

Parsing a Document
2) What is its language?
• Plain English, Hindi, Bengali, etc.
• The tokenization and linguistic pre-processing steps depend on the language, i.e.,
  • the kind of tokenization you do
  • the kind of linguistic pre-processing you do
• We should know the language of a document in advance, so that we can use the appropriate tokenization scheme
Parsing a Document
3) Which character set is it using?
• ASCII, Unicode (UTF-8), or some other vendor-specific standard

To answer these three questions (format, language, character set), we need to solve a classification problem; the solution is supervised learning.
Supervised Learning
• Classification: a machine is trained to classify items into classes
• Here the classes correspond to
  • different document formats
  • different languages
  • different character sets
• In practice, these classification tasks are often done heuristically
• With format, language, and character set identified, we can extract the character sequence out of the bytes in a file

Complications: Formats/Language
• Documents being indexed can come from many different languages
  • Example: a combination of English and Hindi
• A single index may have to contain terms of several languages
  • Some documents in Hindi, some in French, some in German, and so on...
• Tokenization and the linguistic processes need to be run separately for documents in different languages
Complications: Formats/Language
• Sometimes a document or its components can contain multiple languages/formats
  • e.g., an English email with French PDF attachments
• These are some of the complications you have to deal with if you build an information retrieval system for the web
• This problem is usually solved by licensing a software library that handles decoding document formats and character encodings
• Everything discussed above concerns the extraction of the character sequence

What is a unit document?
• A file?
• An email? (Perhaps one of many in an mbox.)
• An email with 5 attachments?
• A group of files (PPT or LaTeX as HTML pages)?
What is a unit document?
• A file?
  • A Word file, a PDF file, an email, a PPT, and so on
  • If there are 10 plays of Shakespeare, we could have 10 documents
  • Assign IDs to those documents
• An email? (Perhaps one of many in an mbox.)
  • A traditional Unix (mbox-format) email file stores a sequence of email messages (an email folder) in one file
  • But we may want each email message as a separate document

Example of Documents
• An email with attachments?
  • Many email messages now contain attached documents, and we might then want to regard the email message and each contained attachment as separate documents
  • e.g., an email with 5 attachments yields 6 documents in total: a single document is split into multiple documents
  • If an email message has an attached zip file, you might want to decode/unzip the zip file and regard each file it contains as a separate document
Example of Documents
• A group of files (PPT or LaTeX as HTML)?
  • If there are 30 slides in a PPT, and 30 HTML pages are generated for these slides and stored as separate files, we don't treat each HTML page as a separate document
  • We might instead combine the 30 pages into a single document
  • Multiple files are merged into a single document

Example of a Unit Document
• Let's say you have a huge book in PDF format
  • Split the book into individual chapters, and treat each chapter as a separate document

Book → Chapter 1, Chapter 2, ......... Chapter n (individual documents)
Unit Document
• Why would you want to do such a thing?
  • Why would it sometimes make sense to split an entire book into a number of documents instead of treating it as a single document?
  • What are the advantages and disadvantages of doing that?

Let's go through an example to understand this

Issues of Indexing Granularity
• Granularity: the scale or level of detail in a set of data
• For a collection of books, it would usually be a bad idea to index an entire book as a document
Example (Issue)
• Consider a document (book) on the "Middle Ages in Europe"
  • The book contains terms such as Christ (appearing in many places)
  • During this time there was also the rise of universities in Europe, so you would see University appearing in some of those chapters
• A search for the terms "Christ" and "University" may list this book
  • even though we know the query has no real relevance to this document/book

Example (Issue)
• For example, take the query "Christ University"
  • Is the "Middle Ages in Europe" book relevant to us? Probably not
• Document size is too large (a whole book)
  • Precision: low
  • Recall: high
What could be done?
• Instead, index each chapter or paragraph as a mini-document
• Matches are then more likely to be relevant
• Since the documents are smaller, it is much easier for the user to find the relevant passages in the document
• Document size is small
  • Precision: high
  • Recall: low
Solution
• The problems with large document units can be alleviated by the use of explicit or implicit proximity search
  • Proximity search looks for documents where two or more separately matching term occurrences are within a specified distance (a number of intervening words or characters)
• An IR system should be designed to offer choices of granularity
Solution
• For this choice to be made well, the person who is deploying the system must have
  • a good understanding of the document collection,
  • the users and their information needs, and
  • usage patterns.
VOCABULARY OF TERMS: TOKENS AND TERMS

Tokenization
• Word: a delimited string of characters as it appears in the text
• Token: an instance of a word or term occurring in a document
  • Example: "Friends, Romans, Countrymen." yields the tokens
    • Friends
    • Romans
    • Countrymen
  • If there are two occurrences of the word "Friends" in the document, then two tokens are generated, each with the same string "Friends"
• A token is the output of the tokenizer
Tokenization
• Term: a "normalized" word (case, morphology, spelling, etc.); an equivalence class of words
  • A term is an entry in the dictionary of the inverted index
  • Both instances of "Friends" map to the same term in the index, and that term would be "friend"
• A term is what is stored in the dictionary

Tokenization
• Input: "Friends, Romans and Countrymen"
• Output: tokens (white space and punctuation marks are removed from the input)
  • Friends
  • Romans
  • Countrymen
• Each such token is now a candidate for an index entry, after further linguistic processing
• But what are valid tokens to emit?
Processing done on documents must also be done on queries

Tokenization
• Issues in tokenization:
  • Finland's capital: how would you tokenize this? Finland? Finlands? Finland's?
    • The way you deal with apostrophes can impact the performance of the system
  • Aren't: short form of "are not"
    • If we use the same conversion as above (cut at the apostrophe), we end up with "aren"
    • If your query is "aren't", would this document be returned? Yes: if the query is processed in the same way, "aren't" is also converted into "aren", so documents containing "aren" would be in the result

Tokenization
• Kevin O'Brien (Irish cricketer)
  • Suppose we apply the same technique: we would get the two tokens Kevin and O, as if we tokenize everything and drop what follows the apostrophe
  • That would be a disaster
• So this type of generalization does not work with names
• Dealing with apostrophes is a non-trivial problem
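To make the failure mode concrete, here is a toy illustration of the cut-at-the-apostrophe rule discussed above (the function name is mine; this is not a recommended tokenizer):

    import re

    def cut_at_apostrophe(text):
        # Toy rule: keep only what precedes the first apostrophe in each token.
        return [t.split("'")[0] for t in re.findall(r"[A-Za-z']+", text)]

    print(cut_at_apostrophe("Finland's capital"))  # ['Finland', 'capital']
    print(cut_at_apostrophe("aren't they"))        # ['aren', 'they']
    print(cut_at_apostrophe("Kevin O'Brien"))      # ['Kevin', 'O']  <- breaks on names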
Processing done on documents must also be done on queries

Tokenization
• How would you tokenize Hewlett-Packard? The question is how you deal with the hyphen (-)
• Hewlett-Packard → Hewlett and Packard as two tokens?
  • Will there be any consequence of splitting (or not splitting) them?
  • If the document is indexed under the single token Hewlett-Packard but the query contains Hewlett Packard (with a space), the document may not be returned (i.e., would not be considered relevant)

Tokenization
• Suppose instead Hewlett-Packard is tokenized as the two separate tokens Hewlett and Packard
  • Then documents would be returned whether the query contains Hewlett-Packard or Hewlett Packard
Processing done on documents must also be done on queries

Tokenization
• Hewlett-Packard → Hewlett and Packard as two tokens?
  • state-of-the-art: break up the hyphenated sequence
  • co-education: you may want to preserve the hyphen in this case
  • lowercase, lower-case, lower case?
  • It can be effective to get the user to put in possible hyphens
• San Francisco: one token or two?
  • How do you decide it is one token?
    • You may have a list of city names available to you
    • Both starting letters are capitalized, so it may be the name of a person or a place
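One pragmatic compromise (my assumption, not a scheme prescribed in the lecture) is to index a hyphenated token both as written and as its split parts, so either query form can match:

    def hyphen_variants(token):
        # Emit the hyphenated form plus each component so either query form matches.
        if "-" in token:
            return [token] + token.split("-")
        return [token]

    print(hyphen_variants("Hewlett-Packard"))  # ['Hewlett-Packard', 'Hewlett', 'Packard']
    print(hyphen_variants("Romans"))           # ['Romans']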

Numbers
• 3/20/91   Mar. 12, 1991   20/3/91
• 55 B.C.
• B-52 (need to preserve the hyphen)
• My PGP key is 324a3df234cb23e
• (800) 234-2333 (need to preserve the hyphens)
• Numbers often have embedded spaces
• Older IR systems may not index numbers
  • But numbers are often very useful: think about looking up error codes/stacktraces on the web ("404 not found" error)
  • (One answer is using n-grams)
• "Metadata" (creation date, format, etc.) will often be indexed separately
Tokenization: language issues
• French
  • L'ensemble: one token or two?
    • L ? L' ? Le ?
    • We want l'ensemble to match un ensemble
    • Until at least 2003, it didn't on Google
      • Internationalization!
• German noun compounds are not segmented
  • Lebensversicherungsgesellschaftsangestellter
    • 'life insurance company employee'
  • German retrieval systems benefit greatly from a compound splitter module
    • It can give a 15% performance boost for German

Tokenization: language issues
• Chinese and Japanese have no spaces between words:
  • 莎拉波娃现在居住在美国东南部的佛罗里达。
  • A unique tokenization is not always guaranteed
• Further complicated in Japanese, with multiple alphabets intermingled
  • Dates/amounts in multiple formats
  • フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
  • (Katakana, Hiragana, Kanji, and Romaji intermingled)
• An end-user can express a query entirely in hiragana!
Tokenization: language issues
• Arabic (or Hebrew) is basically written right to left, but with certain items, like numbers, written left to right
• Words are separated, but letter forms within a word form complex ligatures
• Example (reading direction alternates: ← → ← → ←):
  • 'Algeria achieved its independence in 1962 after 132 years of French occupation.'
• With Unicode, the surface presentation is complex, but the stored form is straightforward

Stop words
• With a stop list, you exclude the commonest words from the dictionary entirely. Intuition:
  • They have little semantic content: the, a, and, to, be
  • There are a lot of them: ~30% of postings are for the top 30 words
• Dropping common words (a, an, and, are, as, ...) has little value in helping select the documents
• The general strategy for determining a stop list is to sort the terms by collection frequency
  • Collection frequency: the number of times a term t appears across the whole document collection

Stop words
• When you build an index, you can also keep track of collection frequency, along with document frequency and term frequency; collection frequency is thus a third kind of frequency
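A sketch of that strategy (the helper name is mine, and the cutoff k=30 merely echoes the "top 30 words" statistic above): count the collection frequency of every term and take the most frequent ones as stop-word candidates.

    from collections import Counter

    def stopword_candidates(tokenized_docs, k=30):
        # Collection frequency: total occurrences of each term over all documents.
        cf = Counter()
        for tokens in tokenized_docs:
            cf.update(tokens)
        return [term for term, _ in cf.most_common(k)]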

Stop words
• But the trend is away from doing this:
  • Good compression techniques mean the space for including stop words in a system is very small
  • Good query optimization techniques mean you pay little at query time for including stop words
• You need them for:
  • Phrase queries: "King of Denmark"
  • Various song titles, etc.: "Let it be", "To be or not to be"
  • "Relational" queries: "flights to London"
Normalization to terms
• We need to "normalize" words in indexed text as well as query words into the same form
  • We want to match U.S.A. and USA
• The result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary
• We most commonly implicitly define equivalence classes of terms by, e.g.,
  • deleting periods to form a term
    • U.S.A., USA → USA
  • deleting hyphens to form a term
    • anti-discriminatory, antidiscriminatory → antidiscriminatory
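A minimal sketch of such implicit equivalence classing, using exactly the two mapping rules above (the function name is mine):

    def to_term(token):
        # Delete periods and hyphens; case-fold to merge variant spellings.
        return token.replace(".", "").replace("-", "").lower()

    assert to_term("U.S.A.") == to_term("USA") == "usa"
    assert to_term("anti-discriminatory") == to_term("antidiscriminatory")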

Normalization is heavily language-dependent

Normalization: other languages
• Accents: e.g., French résumé vs. resume
• Umlauts: e.g., German Tuebingen vs. Tübingen
  • These should be equivalent
• Most important criterion:
  • How do users like to write their queries for these words?
• Even in languages that standardly have accents, users often may not type them
  • Often best to normalize to a de-accented term
    • Tuebingen, Tübingen, Tubingen → Tubingen
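A common way to de-accent, sketched with Python's standard unicodedata module (note it handles Tübingen → Tubingen, but mapping the transliteration Tuebingen to Tubingen would still need a language-specific rule):

    import unicodedata

    def deaccent(s):
        # Decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", s)
        return "".join(c for c in decomposed if not unicodedata.combining(c))

    print(deaccent("Tübingen"))  # Tubingen
    print(deaccent("résumé"))    # resume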
Normalization: other languages
• Normalization of things like date forms
  • 7月30日 vs. 7/30
• Japanese use of kana vs. Chinese characters
• Tokenization and normalization may depend on the language, and so are intertwined with language detection
  • Morgen will ich in MIT ...   Is this German "mit"?
• Crucial: we need to "normalize" indexed text as well as query terms into the same form

Case folding
• Reduce all letters to lower case
  • Exception: upper case in mid-sentence?
    • e.g., General Motors
    • Fed vs. fed (Federal Reserve System)
    • SAIL vs. sail
  • Often best to lowercase everything, since users will use lowercase regardless of 'correct' capitalization...
• Google example:
  • Query: C.A.T.
  • #1 result is for "cat" (well, Lolcats), not Caterpillar Inc.
Normalization to terms
• An alternative to equivalence classing is to do asymmetric expansion (query expansion)
• An example of where this may be useful:
  • Enter: window    Search: window, windows
  • Enter: windows   Search: Windows, windows, window
  • Enter: Windows   Search: Windows
• Potentially more powerful, but less efficient
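A sketch of that asymmetric table as a lookup (the dictionary literally encodes the three example rows above; unlisted terms expand to themselves):

    EXPANSIONS = {
        "window":  {"window", "windows"},
        "windows": {"Windows", "windows", "window"},
        "Windows": {"Windows"},
    }

    def expand(query_term):
        # Asymmetric: what a query term matches depends on how it was entered.
        return EXPANSIONS.get(query_term, {query_term})

    print(expand("window"))   # {'window', 'windows'}
    print(expand("Windows"))  # {'Windows'}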

Thesauri and soundex
• Do we handle synonyms and homonyms?
  • E.g., by hand-constructed equivalence classes
    • car = automobile    color = colour
  • We can rewrite to form equivalence-class terms
    • When the document contains automobile, index it under car-automobile (and vice versa)
  • Or we can expand a query
    • When the query contains automobile, look under car as well
• What about spelling mistakes?
  • One approach is Soundex, which forms equivalence classes of words based on phonetic heuristics
  • We will see this in coming lectures
Lemmatization
• Reduce inflectional/variant forms to the base form
  • "Lemmatization" is derived from the word lemma, which refers to the root form of a particular word
  • This is a sophisticated NLP technique
• E.g.,
  • am, are, is → be
  • car, cars, car's, cars' → car
  • the boy's cars are different colors → the boy car be different color
  • (plural forms are converted into singular form)
• Lemmatization implies doing "proper" reduction to the dictionary headword form
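For illustration, NLTK's WordNet lemmatizer reproduces the examples above (this assumes NLTK and its WordNet data are installed; it is not necessarily the tool the lecture has in mind):

    from nltk.stem import WordNetLemmatizer  # pip install nltk; nltk.download('wordnet')

    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("cars"))          # car
    print(lemmatizer.lemmatize("are", pos="v"))  # be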

Stemming
• Stemming is a cruder form of normalization
• Reduce terms to their "roots" before indexing
• "Stemming" suggests crude affix chopping
  • It is language-dependent
  • e.g., automate(s), automatic, automation are all reduced to automat
• Example: "for example compressed and compression are both accepted as equivalent to compress"
  • stems to: "for exampl compress and compress ar both accept as equival to compress"
Porter's algorithm (developed by Martin Porter)
• The commonest algorithm for stemming English
  • Results suggest it's at least as good as other stemming options
• The algorithm has 5 phases of reductions
  • Phases are applied sequentially
  • Each phase consists of a set of commands
  • Sample convention: of the rules in a compound command, select the one that applies to the longest suffix
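NLTK ships an implementation of Porter's algorithm, which reproduces the stemmed example text above (shown only as a convenient reference implementation, assuming NLTK is installed):

    from nltk.stem import PorterStemmer  # pip install nltk

    stemmer = PorterStemmer()
    for word in ["example", "compressed", "compression", "accepted", "equivalent"]:
        print(word, "->", stemmer.stem(word))
    # example -> exampl, compressed -> compress, compression -> compress,
    # accepted -> accept, equivalent -> equival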

FASTER POSTINGS MERGES: SKIP POINTERS/SKIP LISTS
Skip Pointers
• Faster postings merges via skip pointers/skip lists
• An extension to the postings list data structure
• A way to increase the efficiency of using postings lists

Recall the basic merge
• Walk through the two postings lists simultaneously, in time linear in the total number of postings entries

  Brutus:  2 → 4 → 8 → 41 → 48 → 64 → 128
  Caesar:  1 → 2 → 3 → 8 → 11 → 17 → 21 → 31
  Result:  2 → 8

• If the list lengths are m and n, the merge takes O(m+n) operations
• Can we do better? Yes (if the index isn't changing too fast)
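A sketch of the basic merge in code (my transcription of the standard two-pointer intersection, not a listing from the lecture):

    def intersect(p1, p2):
        # Two-pointer walk over sorted postings lists: O(m + n) comparisons.
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    print(intersect([2, 4, 8, 41, 48, 64, 128], [1, 2, 3, 8, 11, 17, 21, 31]))  # [2, 8]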

28
Skip Pointers

Recall basic merge


Can we do better?
Yes (if index isn’t changing too fast).
 i.e.,
 There are not new entries been added or
deleted from the posting list

 Use skip list by augmenting posting lists


with skip pointers (at indexing time)
57

Skip pointers: an example
• A skip pointer is a pointer that points from a particular node to some other node farther ahead in the same list

  2 → 4 → 8 → 41 → 48 → 64 → 128    (skip pointers: 2 → 41, 41 → 128)
  1 → 2 → 3 → 8 → 11 → 17 → 21 → 31    (skip pointers: 1 → 11, 11 → 31)
Benefits of adding skip pointers
• Let's see how we can use skip pointers to speed up our search, and how we add them

  2 → 4 → 8 → 41 → 48 → 64 → 128    (skip pointers: 2 → 41, 41 → 128)
  Intervening entries that are not useful for the answer can be skipped over

Augment postings with skip pointers (at indexing time)

  2 → 4 → 8 → 41 → 48 → 64 → 128    (skip pointers: 2 → 41, 41 → 128)
  1 → 2 → 3 → 8 → 11 → 17 → 21 → 31    (skip pointers: 1 → 11, 11 → 31)
  Result so far: 2 → 8
  (e.g., the intermediate entries between 11 and 31 need not be examined)

• Two questions need to be answered:
  • Where do we place skip pointers?
  • How do we do efficient merging using skip pointers?
Where do we place skips? How many skip pointers should we add?
• Tradeoff:
  • More skips → shorter skip spans → more likely to skip, but lots of comparisons against skip pointers
  • Fewer skips → few pointer comparisons, but long skip spans → few successful skips

Placing skips
• Simple heuristic: for postings of length L, use √L evenly-spaced skip pointers
• Easy if the index is relatively static; harder if L keeps changing because of updates (deleting/inserting elements)
  • A static index is the best case
Important points
• If the index is small, it fits entirely into memory (both the dictionary and the postings lists fit into main memory)
• If the corpus is large, the postings may have to be stored on disk, while the dictionary is kept in memory
• If your index is entirely in memory, using skip pointers will help
  • Because you end up doing fewer operations to traverse a particular postings list if you follow the skip pointers

Skips
• Useful only for AND queries
• Do not work with OR queries. Why?
Algorithm: postings lists intersection with skip pointers
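The algorithm itself was shown as a figure in the lecture; below is a sketch in its spirit (my own simulation: the √L evenly-spaced skip pointers are modeled as jumps of √L positions within a sorted Python list rather than as explicit pointers):

    import math

    def intersect_with_skips(p1, p2):
        # Skip length: sqrt(L) evenly-spaced skips, per the placement heuristic.
        skip1 = int(math.sqrt(len(p1))) or 1
        skip2 = int(math.sqrt(len(p2))) or 1
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                    i += skip1  # skip target still <= the other docID: follow the skip
                else:
                    i += 1
            else:
                if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                    j += skip2
                else:
                    j += 1
        return answer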

Exercise Problems
• Problem 1:
  • We have a two-word query. For one term the postings list consists of the following 16 entries:
    [4, 6, 10, 12, 14, 16, 18, 20, 22, 32, 47, 81, 120, 122, 157, 180]
  • and for the other it is the one-entry postings list:
    [47]
  • Work out how many comparisons would be done to intersect the two postings lists with the following two strategies. Briefly justify your answer.
  • (a) Using standard postings lists.
  • (b) Using postings lists stored with skip pointers, with a skip length of √L.
Problem 1 Solution
• (a) The number of comparisons would be 11:
  • (4,47), (6,47), (10,47), (12,47), (14,47), (16,47), (18,47), (20,47), (22,47), (32,47), (47,47)
• (b) Total length of the postings list is L = 16
  • Skip length √L = √16 = 4
  • Skip pointers: 4 → 14, 14 → 22, 22 → 120, 120 → 180

Problem 1 Solution (continued)
• Skip pointers: 4 → 14, 14 → 22, 22 → 120, 120 → 180
• 14 < 47 and 22 < 47, so those skips are followed; 120 > 47, so the skip out of 22 is not followed and we step through 32 and 47
• The number of comparisons would be 6:
  • (4,47), (14,47), (22,47), (120,47), (32,47), (47,47)
Problem 2
• (The two postings lists for this problem were given in a figure that is not reproduced here.)

Problem 2 Solution
• (a) The skip pointer is followed only once: 24 → 75
• (b) 18 postings comparisons will be made by the algorithm in total (with skip pointers):
  • (3,3), (5,5), (9,89), (15,89), (24,89), (75,89), (92,89), (81,89), (84,89), (89,89), (95,92), (95,115), (95,96), (97,96), (97,97), (99,100), (100,100), (101,115)
• (c) 19 postings comparisons would be made if the postings lists were intersected without the use of skip pointers:
  • (3,3), (5,5), (89,9), (89,15), (89,24), (89,39), (89,60), (89,68), (89,75), (89,81), (89,84), (89,89), (95,92), (95,96), (97,96), (97,97), (99,100), (100,100), (101,115)
Assignment - II
• Why are skip pointers not useful for queries of the form x OR y?
• Exercises 1.6, 1.11, 1.9, 2.2, 2.3
