9 Dictionaries and Tolerant Retrieval
Outline
• Dictionary Data Structures
• “Tolerant” Retrieval
• Wild-Card Queries
• Spell Correction: Document Correction, Query Misspellings
• Isolated Word Correction, Context-Sensitive Spell
Correction
• Soundex
Term Vocabulary
Document Frequency
Storing Dictionaries
• For each term, we need to store a few items:
• The term itself
• Document frequency
• Pointer to the postings list
• Assume for the time being that we can store this information in a fixed-length entry.
A naïve dictionary
• Simplest way to implement a dictionary: an array of structures.
• Think of storing all the terms in a sorted array of structs, one fixed-length entry per term.
• To search for a particular term in the dictionary, we can use binary search (since the entries are in a sorted array of structures).
• Binary search takes O(log n) time, where n is the size of the array.
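The sorted-array lookup described above can be sketched as follows. This is a minimal illustration, not the slides' implementation: the `Entry` fields, the sample terms, and the numeric values are assumptions made for the example.

```python
import bisect
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    term: str          # the vocabulary term
    doc_freq: int      # number of documents containing the term
    postings_ptr: int  # hypothetical offset of the postings list on disk


def lookup(terms: list[str], entries: list[Entry], term: str) -> Optional[Entry]:
    """Binary search over the sorted term column: O(log n) comparisons."""
    i = bisect.bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return entries[i]
    return None


# Toy data (made up for illustration); tuples sort by their first field, the term.
entries = sorted([
    Entry("zulu", 221, 300),
    Entry("aachen", 65, 100),
    Entry("a", 656265, 0),
])
terms = [e.term for e in entries]  # parallel sorted list of keys

print(lookup(terms, entries, "aachen"))
print(lookup(terms, entries, "missing"))
```

Keeping a separate sorted list of keys lets `bisect` compare plain strings; a real fixed-width layout would search the packed array directly.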
How do we store a dictionary in memory efficiently?
• The naïve dictionary takes 36 bytes per entry, so its total size in bytes is 36 × (number of terms in the index).
• In Chapter 5, we will revisit the naïve implementation and then see how we can compress the dictionary and also compress the postings lists.
Dictionary Data Structures
• Two main choices of data structures:
• Hash Tables
• Trees
• Some IR systems use hashes, some use search trees
• Criteria for when to use hashes vs. trees:
• Is there a fixed number of terms or will it keep growing?
• What are the relative frequencies with which various keys will be accessed?
• How many terms are we likely to have?
Hash Tables
• Each vocabulary term is hashed into an integer.
• A hash table has a hash function, which takes a key (here, a term) and maps that key to an integer.
• At query time: hash the query term, locate its entry in a fixed-width array
• (We assume you have seen hash tables before.)
Hash Tables
• Pros:
• Lookup in a hash table is faster than lookup in a tree: O(1)
• Cons:
• No easy way to find minor variants:
• judgment/judgement (American English vs. British English)
• resume vs. résumé
• Example: suppose I want to search for the word ‘judgement’ in the dictionary. We know that there are two variants, but a hash table will not let me get both of them at the same time, because the value of the hash function is different for the strings ‘judgment’ and ‘judgement’.
Hash Tables
• No prefix search (all terms starting with automat) [tolerant retrieval]
• automatic and automation are both stemmed to automat (an equivalence class)
• With hashing we do not store the equivalence classes obtained after stemming. We store automatic and automation in different memory locations, and they correspond to different postings lists.
Tree: binary tree
• This is a way of storing the prefixes.
• Root: splits the vocabulary into the ranges a-m and n-z.
• a-m branch: all terms beginning with a, b, c, …, m.
• n-z branch: all terms beginning with n, o, p, …, z.
• Leaf nodes contain the exact terms.
Binary Tree
• Terms are stored in main memory, and the corresponding postings lists are stored in secondary memory (since the postings list for each term can be very large).
Example
• Query: retrieve the documents corresponding to the terms beginning with ‘G’.
• Suppose two terms in the tree match; these two terms are my results. Now the question is how to find the relevant documents.
• Look up the postings lists of these two terms in secondary memory.
• Will it be the AND or the OR of the postings lists?
Tree: B-tree
• Definition: every internal node has a number of children in the interval [a, b], where a and b are appropriate natural numbers, e.g., [2, 4].
• With [2, 4], the number of children can vary from 2 to 4: an internal node could have 2, 3 or 4 children, but we will never have a case where a node has exactly one child or 5 children.
• B-trees are the usual way to implement the dictionary.
• One of the reasons for creating this range [a, b]
Trees
• Simplest: binary tree
• More usual: B-trees
• Trees require a standard ordering of characters and hence of strings, but we typically have one
• Pros:
• Solves the prefix problem (terms starting with hyp)
• Cons:
• Slower: O(log M), where M is the number of terms [and this requires a balanced tree]
• Rebalancing binary trees is expensive
• But B-trees mitigate the rebalancing problem
Wild-Card Queries
Wild-card queries: *
• Wildcard queries are used in any of the following situations:
1. The user is uncertain of the spelling of a query term (e.g., Sydney vs.
Sidney, which leads to the wildcard query S*dney);
2. The user is aware of multiple variants of spelling a term and (consciously)
seeks documents containing any of the variants (e.g., color vs. colour);
3. The user seeks documents containing variants of a term that would be
caught by stemming, but is unsure whether the search engine performs
stemming (e.g., judicial vs. judiciary, leading to the wildcard query
judicia*);
4. The user is uncertain of the correct rendition of a foreign word or phrase
(e.g., the query Universit* Stuttgart).
Wild-card queries: *
• There are two types of wild cards
• trailing wildcard query – prefix query
• leading wildcard queries – suffix query
A Trailing Wildcard
• A query such as mon* is known as a trailing wildcard query, because the * symbol
occurs only once, at the end of the search string
• mon*: find all docs containing any word beginning with “mon” (a trailing wild-card query)
• Matching words include, for example, monday, money, monitor, monkey, …: any word starting with ‘mon’
A Trailing Wildcard
• A search tree on the dictionary is a convenient way of handling trailing wildcard queries: they are easy with a binary tree (or B-tree) lexicon.
• We walk down the tree following the symbols m, o and n in turn, at which point we can enumerate the set W of terms in the dictionary with the prefix mon, i.e., retrieve all words w that lie in the range mon ≤ w < moo.
• Finally, we use |W| lookups on the standard inverted index to retrieve all documents containing any term in W: retrieve the postings list for each term individually and perform OR on them.
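The range lookup mon ≤ w < moo can be sketched with a sorted list standing in for the tree. This is only an illustration under that assumption; the function name `prefix_range` and the toy lexicon are not from the slides.

```python
import bisect


def prefix_range(sorted_terms: list[str], prefix: str) -> list[str]:
    """Return all terms w with prefix <= w < upper, where upper is the prefix
    with its last character incremented (e.g. 'mon' -> 'moo')."""
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, upper)
    return sorted_terms[lo:hi]


# Toy lexicon, kept sorted as a tree-ordered dictionary would be.
lexicon = sorted(["moan", "monday", "money", "monitor", "monkey",
                  "month", "moon", "mop"])

W = prefix_range(lexicon, "mon")
print(W)
```

Each term in `W` would then be looked up in the standard inverted index, and the resulting postings lists combined with OR.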
Exercise
• How can we enumerate all terms meeting the wild-card query pro*cent?
• Maintain both trees in the dictionary: the normal B-tree and the reverse B-tree (terms written backwards).
• Create the result for pro* using the normal B-tree: this gives the terms starting with pro.
• Create the result for *cent using the reverse B-tree: this gives the terms ending with cent.
• Intersect (AND) the two term sets to get the terms that match pro*cent.
• Then get the individual postings list for each matching term and perform OR on them to get the result.
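A rough sketch of the two-tree idea for pro*cent, using two sorted lists in place of the normal and reverse B-trees. The helper names and the toy lexicon are assumptions for illustration; the key step is intersecting the two term sets.

```python
import bisect


def prefix_range(sorted_terms: list[str], prefix: str) -> list[str]:
    """All terms in a sorted list that start with the given prefix."""
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, upper)
    return sorted_terms[lo:hi]


def match_prefix_suffix(terms: list[str], prefix: str, suffix: str) -> list[str]:
    forward = sorted(terms)                    # stands in for the normal B-tree
    backward = sorted(t[::-1] for t in terms)  # stands in for the reverse B-tree
    starts = set(prefix_range(forward, prefix))
    ends = {t[::-1] for t in prefix_range(backward, suffix[::-1])}
    # Intersect the term sets; the length check stops prefix and suffix overlapping.
    return sorted(t for t in starts & ends
                  if len(t) >= len(prefix) + len(suffix))


# Toy lexicon (made up for the example).
lexicon = ["procent", "process", "percent", "prominent", "proficient", "crescent"]
print(match_prefix_suffix(lexicon, "pro", "cent"))
```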
Exercise
• Query: Jayasur*a. We don’t know the spelling: it may be ‘iy’ or ‘y’.
B-trees handle *’s at the end of a query term
Permuterm index
• Here we employ an extra index, in addition to the standard inverted index, called the permuterm index.
Permuterm index
• Step 2: Create the permuterm index
1. Look at every term that goes into the standard inverted index
2. Create the rotations of that term
• Ex: hello
• For the term hello, index under:
• hello$, ello$h, llo$he, lo$hel, o$hell, $hello
• The permuterm index is going to have each of these rotated versions of the original term
• $ is a special symbol that tells us where the term ends
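Generating the rotations is a one-liner; the sketch below assumes `$` as the end marker, as in the slide, and the function name is made up for the example.

```python
def permuterm_rotations(term: str, end_marker: str = "$") -> list[str]:
    """Append the end marker, then list every rotation of the result."""
    s = term + end_marker
    return [s[i:] + s[:i] for i in range(len(s))]


print(permuterm_rotations("hello"))
```

A term of length n yields n + 1 rotations, which is why the permuterm dictionary is several times larger than the ordinary one.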
Permuterm index
• Step 3: Put all the rotations into the dictionary part of the permuterm index
• The postings list of each permuterm entry has a single entry: the original term

Rotation   Posting
hello$     hello
ello$h     hello
llo$he     hello
lo$hel     hello
o$hell     hello
$hello     hello
Permuterm index
• Queries:
• X: lookup on X$
• X*: lookup on $X* (use B-tree lookup)
• *X: lookup on X$*
• X*Y: lookup on Y$X*
• X*Y*Z: ??? Exercise!
• Example: query = hel*o
• X = hel, Y = o
• Lookup o$hel*
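The rotation rules for queries with a single * can be sketched as below. A sorted list of rotations stands in for the B-tree, and all function names are assumptions for this example; the X*Y*Z case is left out, as it is in the slide.

```python
import bisect


def rotations(term: str) -> list[str]:
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]


def permuterm_key(query: str) -> str:
    """Rotate a query with at most one '*' so that the '*' ends up last."""
    if query.count("*") > 1:
        raise ValueError("only single-wildcard queries are handled in this sketch")
    if "*" not in query:
        return query + "$"          # X    -> X$
    x, y = query.split("*")
    return y + "$" + x + "*"        # X*Y  -> Y$X*  (also covers X* and *X)


def wildcard_lookup(query: str, index: dict[str, str], keys: list[str]) -> set[str]:
    key = permuterm_key(query)
    if not key.endswith("*"):
        return {index[key]} if key in index else set()
    prefix = key[:-1]               # never empty: it always contains '$'
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, upper)
    return {index[k] for k in keys[lo:hi]}


# Toy lexicon; each rotation maps back to its original term.
lexicon = ["hello", "help", "halo", "hero", "sing", "song", "sting", "sin"]
index = {r: t for t in lexicon for r in rotations(t)}
keys = sorted(index)

print(permuterm_key("hel*o"), wildcard_lookup("hel*o", index, keys))
print(permuterm_key("s*ng"), wildcard_lookup("s*ng", index, keys))
```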
Exercise
• Write down the entries in the permuterm index dictionary that are generated
by the term “mama”
• Solution
• mama$
• ama$m
• ma$ma
• a$mam
• $mama
• If you wanted to search for s*ng in a permuterm wildcard index, what key(s) would you do the lookup on?
• Solution
• ng$s*
Bigram (k-gram) indexes
• Example: april
• If k = 2 we get the bigrams of april:
• $a, ap, pr, ri, il, l$
• E.g., from the text “April is the cruelest month” we get the 2-grams (bigrams, k = 2):
$a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru,ue,el,le,es,st,t$,$m,mo,on,nt,th,h$
• $ is a special word boundary symbol
• Maintain a second inverted index from bigrams to the dictionary terms that match each bigram.
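A small sketch of k-gram generation and of the second inverted index from k-grams to terms. The function names and the three-word lexicon are assumptions for illustration.

```python
from collections import defaultdict


def kgrams(term: str, k: int = 2) -> list[str]:
    """k-grams of a term, with '$' marking both word boundaries."""
    s = "$" + term + "$"
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def build_kgram_index(terms: list[str], k: int = 2) -> dict[str, set[str]]:
    """Map each k-gram to the set of dictionary terms that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for term in terms:
        for gram in kgrams(term, k):
            index[gram].add(term)
    return index


print(kgrams("april"))

index = build_kgram_index(["april", "apple", "map"])
print(sorted(index["ap"]))
```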
Bigram (k-gram) indexes
• Ex: the bigram ‘ap’ is found in the terms april, apple, map, …, so the postings list for ‘ap’ contains those terms.
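Processing wild-cards
• Ex: mon*
• Generate the bigrams $m, mo, on. Since * is at the end, we don’t put $ at the end (* is not a character, it’s a wildcard).
• Run the Boolean query $m AND mo AND on against the bigram index:
• $m: the entry in the bigram index whose postings list contains all terms that start with m, e.g., mace, madden
• mo: the entry whose postings list contains all terms that contain mo, e.g., among, amortize
• on: the entry whose postings list contains all terms that contain on, e.g., along, among
• The intersection can still contain false positives (e.g., moon contains all three bigrams), so the candidates must be filtered against the original query.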
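The bigram lookup for mon* can be sketched as: intersect the postings of the query bigrams, then post-filter the candidates against the query pattern. The helper names and the toy lexicon below are assumptions made for this example.

```python
import re
from collections import defaultdict


def kgrams(s: str, k: int = 2) -> list[str]:
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def build_index(terms: list[str], k: int = 2) -> dict[str, set[str]]:
    index: dict[str, set[str]] = defaultdict(set)
    for t in terms:
        for g in kgrams("$" + t + "$", k):
            index[g].add(t)
    return index


def candidates(query: str, terms: list[str], index: dict[str, set[str]], k: int = 2) -> set[str]:
    """AND together the postings of every k-gram found in the query pieces."""
    pieces = ("$" + query + "$").split("*")   # 'mon*' -> ['$mon', '$']
    result = set(terms)
    for piece in pieces:
        for g in kgrams(piece, k):            # pieces shorter than k give no grams
            result &= index.get(g, set())
    return result


def wildcard(query: str, terms: list[str], index: dict[str, set[str]], k: int = 2) -> list[str]:
    """Post-filter the candidates with a regex built from the query."""
    pattern = re.compile(".*".join(re.escape(p) for p in query.split("*")))
    return sorted(t for t in candidates(query, terms, index, k) if pattern.fullmatch(t))


lexicon = ["mace", "madden", "among", "amortize", "along", "moon", "month", "monday"]
index = build_index(lexicon)

print(sorted(candidates("mon*", lexicon, index)))  # still includes the false positive 'moon'
print(wildcard("mon*", lexicon, index))
```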
Processing wild-card queries
• As before, we must execute a Boolean query for each enumerated, filtered term.
• Wild-cards can result in expensive query execution (very large disjunctions…)
• gen* AND universit*
• (geneva AND university) OR (geneva AND université) OR (genève AND university) OR (genève AND université) OR (general AND universities) OR …
• This is very expensive and requires query optimization.
• If you encourage “laziness” people will respond!
Search
Type your search terms, use ‘*’ if you need to.
E.g., Alex* will match Alexander.
• Does Google allow wildcard queries? (Which web search engines allow
wildcard queries?)
Spelling Correction
Spell Correction
• Two principal uses
• Correcting document(s) being indexed
• Correcting user queries to retrieve “right” answers
• We will be looking at spelling correction mainly in the context of misspelled queries. In an IR system, spelling correction is only ever run on queries.
• Two main flavors:
• Isolated word
• Check each word on its own for misspelling
• Will not catch typos resulting in correctly spelled words
• e.g., from → form
• Context-sensitive
• Look at surrounding words,
• e.g., I flew form Heathrow to Narita.
Document Correction
• General IR philosophy: don’t change the document
Query mis-spellings
• More common in IR than document correction
• typos in queries are common
• People are in a hurry
• Users often look for things they don’t know much about
Query mis-spellings
• Our principal focus here
• E.g., the query Alanis Morisett, searched in Google
Query mis-spellings
• Strategies:
• Retrieve documents indexed by the correct spelling, OR
• Return several suggested alternative queries with the correct
spelling
Did you mean … ?
Query mis-spellings
• Example: Wikipedia
Query mis-spellings
• In Google it was giving you a choice, so multiple variations are possible:
• Search with the correct spelling and retrieve the documents
• Search with the incorrect spelling and show the documents
• This is something you have to decide beforehand, because your algorithms are going to depend on it.
Isolated Word Correction
•Premise 1: there is a lexicon (“correct words”) from
which the correct spellings come
•Premise 2: We have a way of computing the distance
between a misspelled word and a correct word.
•Simple spelling correction algorithm: return the
“correct” word that has the smallest distance to the
misspelled word.
• Example: informaton → information
• We measure closeness by calculating the distance between the query term and the various terms in the dictionary (the lexicon).
Isolated Word Correction: Premise 2
•There are multiple ways to define the distance measure
between two words
• We’ll study several alternatives
• Edit distance (Levenshtein distance)
• Weighted edit distance
• n-gram overlap
Edit Distance
• The minimum edit distance between two strings
• Given two strings S1 and S2, the minimum number of operations to convert one (S1) to
the other (S2)
• Operations are typically character-level
• Insert, Delete, Replace, (Transposition)
• E.g., the edit distance from dof to dog is 1
• From cat to act is 2 (Just 1 with transpose.)
• from cat to dog is 3.
• Generally found by dynamic programming.
Edit Distance: worked example, ABCFG into ADCEG
• Columns are the characters of ABCFG; rows are the characters of ADCEG.
• Base case: the NULL row and the NULL column are filled with 0, 1, 2, 3, 4, 5.
• If row character = column character: copy the value of the diagonal (upper-left) cell.
• If row character ≠ column character: min(insert, replace, delete) + 1, where insert is the cell to the left, replace is the diagonal cell, and delete is the cell above.

        NULL   A   B   C   F   G
NULL     0     1   2   3   4   5
A        1     0   1   2   3   4
D        2     1   1   2   3   4
C        3     2   2   1   2   3
E        4     3   3   2   2   3
G        5     4   4   3   3   2

• The bottom-right cell is the edit distance: 2.
• Tracing back from the bottom-right cell, the two edits are both replacements (R): B → D and F → E.
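The table-filling rule (base case, copy the diagonal on a match, otherwise min of insert, replace, delete plus 1) can be sketched as follows. The function names are made up for this example; transposition is not counted as a single operation here.

```python
def edit_table(source: str, target: str) -> list[list[int]]:
    """Fill the DP table with target characters as rows and source
    characters as columns, matching the layout of the worked example."""
    rows, cols = len(target) + 1, len(source) + 1
    d = [[0] * cols for _ in range(rows)]
    for j in range(cols):           # base case: NULL row
        d[0][j] = j
    for i in range(rows):           # base case: NULL column
        d[i][0] = i
    for i in range(1, rows):
        for j in range(1, cols):
            if target[i - 1] == source[j - 1]:
                d[i][j] = d[i - 1][j - 1]          # characters match: copy diagonal
            else:
                d[i][j] = 1 + min(d[i][j - 1],      # insert (left)
                                  d[i - 1][j - 1],  # replace (diagonal)
                                  d[i - 1][j])      # delete (above)
    return d


def edit_distance(source: str, target: str) -> int:
    return edit_table(source, target)[-1][-1]


for row in edit_table("ABCFG", "ADCEG"):
    print(row)
print(edit_distance("dof", "dog"), edit_distance("cat", "act"), edit_distance("cat", "dog"))
```

The table costs O(|S1| × |S2|) time and space; keeping only the previous row would reduce the space if the trace-back is not needed.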
Exercise
1. SPARTAN into PART
2. PLASMA into ALTRUISM
3. RELEVANT into ELEPHANT
4. SITTING into KITTEN
5. SATURDAY into SUNDAY
6. MONEY into MONKEY
7. LEVENSHTEIN into MEILENSTEIN
8. LAWN into FLAW
9. HONDA into HUNDAI
(Blank edit-distance grids for the first three exercises: SPARTAN into PART, PLASMA into ALTRUISM, RELEVANT into ELEPHANT.)
Weighted Edit Distance
• The weight of an operation depends on the character(s)
involved
• Meant to capture OCR or keyboard errors, e.g. m more likely to
be mistyped as n than as q
• Therefore, replacing m by n is a smaller edit distance than by q
• This may be formulated as a probability model
• Requires weight matrix as input
• Modify dynamic programming to handle weights
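One way to modify the dynamic program for weights is to replace the constant cost of 1 with a cost function. In the sketch below the weights are hypothetical (m ↔ n costs 0.5, every other operation costs 1.0); a real system would take them from a weight matrix estimated from OCR or keyboard error data.

```python
def substitution_cost(a: str, b: str) -> float:
    """Hypothetical weight matrix: confusing m and n is cheaper than other swaps."""
    if a == b:
        return 0.0
    if {a, b} == {"m", "n"}:
        return 0.5
    return 1.0


def weighted_edit_distance(source: str, target: str,
                           sub_cost=substitution_cost,
                           ins_cost: float = 1.0,
                           del_cost: float = 1.0) -> float:
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + del_cost      # delete all of source
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + ins_cost      # insert all of target
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + del_cost,
                d[i][j - 1] + ins_cost,
                d[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]),
            )
    return d[-1][-1]


print(weighted_edit_distance("mat", "nat"))  # m -> n: the cheap substitution
print(weighted_edit_distance("mat", "qat"))  # m -> q: full cost
```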
n-gram overlap
• Enumerate all the n-grams in the query string as well as in the lexicon
• Ex: query hello, n = 3: hel, ell, llo
• Use the n-gram index (recall wild-card search) to retrieve all lexicon terms matching any of the query n-grams
• That is, use the k-gram index to retrieve all the terms in the dictionary of the standard inverted index that have a good overlap with the query term
• Ex.: if most of the k-grams of a lexicon term match the k-grams of the query term, we call that a good overlap
n-gram overlap
• One way to decide whether something is a good overlap is to come up with a threshold
• Threshold by number of matching n-grams
• Variants: weight by keyboard layout, etc.
• E.g., if 5 trigrams match between the query term and a term in the dictionary, then that dictionary term has a good overlap with the query term
One option – Jaccard Coefficient
• A commonly-used measure of overlap
• It is a quantitative way of measuring the overlap
• For two sets X and Y: J(X, Y) = |X ∩ Y| / |X ∪ Y|
• Example:
• X = {nov, ove, vem, emb, mbe, ber}, |X| = 6
• Y = {dec, ece, cem, emb, mbe, ber}, |Y| = 6
• J(X, Y) = 3 / (6 + 6 − 3) = 3/9
• Equals 1 when X and Y have the same elements and 0 when they are disjoint
• X and Y don’t have to be of the same size
• Always assigns a number between 0 and 1
• Now threshold to decide if you have a match
• E.g., if J.C. > 0.8, declare a match
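The coefficient is easy to check in code. The sketch below assumes trigrams without boundary symbols, which matches the X and Y sets in the example; the function names are made up.

```python
def trigrams(term: str) -> set[str]:
    """Set of 3-grams of a term, without word boundary symbols."""
    return {term[i:i + 3] for i in range(len(term) - 2)}


def jaccard(x: set[str], y: set[str]) -> float:
    """|X ∩ Y| / |X ∪ Y|; two empty sets are treated as identical."""
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


X = trigrams("november")
Y = trigrams("december")
print(sorted(X & Y), len(X | Y), jaccard(X, Y))
```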
Matching Trigrams
• How can we adapt the procedure to compute the JC between the query term and the current term?
• X = set of bigrams in the query term
• Y = set of bigrams in the current term
• Threshold: if JC(query, current term) > 0.8 (some value near 1)
• You append the current term to the answer list
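A possible sketch of the thresholding procedure: scan the postings of each query bigram, count how many bigrams each dictionary term shares with the query, and compute the JC from the counts as shared / (|X| + |Y| − shared). The lexicon, the function names, and the threshold used in the demo are assumptions for illustration, and bigrams are taken without boundary symbols.

```python
from collections import Counter, defaultdict


def bigrams(term: str) -> set[str]:
    return {term[i:i + 2] for i in range(len(term) - 1)}


def build_index(lexicon: list[str]) -> dict[str, set[str]]:
    index: dict[str, set[str]] = defaultdict(set)
    for term in lexicon:
        for gram in bigrams(term):
            index[gram].add(term)
    return index


def close_terms(query: str, index: dict[str, set[str]], threshold: float) -> list[tuple[str, float]]:
    x = bigrams(query)
    shared = Counter()                      # term -> number of bigrams shared with the query
    for gram in x:
        for term in index.get(gram, ()):
            shared[term] += 1
    answers = []
    for term, hits in shared.items():
        jc = hits / (len(x) + len(bigrams(term)) - hits)
        if jc > threshold:
            answers.append((term, jc))      # append current term to the answer list
    return sorted(answers, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical lexicon; a lower threshold than 0.8 is used so the demo returns something.
lexicon = ["aboard", "about", "boardroom", "border", "lord", "morbid", "sordid", "ardent"]
index = build_index(lexicon)
print(close_terms("bord", index, threshold=0.45))
```

Only terms that share at least one bigram with the query are ever scored, which is the point of going through the k-gram index rather than scanning the whole lexicon.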
Problem 1
• Compute the Jaccard coefficients between the query bord and each of the terms in the k-gram index given below.
Context-Sensitive Spell Correction
• First idea: retrieve dictionary terms close (in weighted edit distance) to each query term
• Now try all possible resulting phrases with one word “fixed” at a time
• flew from heathrow
• fled form heathrow
• flea form heathrow
Exercise
• Suppose that for “flew form Heathrow” we have 7 alternatives for flew, 19 for form and 3 for Heathrow.
Another Approach
•Break phrase query into a conjunction of biwords.
Soundex
• It has existed since the late 1800s and originally was used by the U.S. Census
Bureau.
• The Soundex algorithm generates four-character codes based upon the
pronunciation of English words.
• These codes can be used to compare two words to determine whether they
sound alike.
• This can be very useful when searching for information in a database or text
file, particularly when looking for names that are commonly misspelled.
• http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm#Top
Soundex – typical algorithm
1. Retain the first letter of the word.
2. Change all occurrences of the following letters to '0' (zero):
'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.
3. Change letters to digits as follows:
• B, F, P, V → 1
• C, G, J, K, Q, S, X, Z → 2
• D, T → 3
• L → 4
• M, N → 5
• R → 6
Soundex continued
4. Repeatedly remove one out of each pair of consecutive identical digits.
5. Remove all zeros from the resulting string.
6. Pad the resulting string with trailing zeros and return the first four positions, which will be of the form <uppercase letter> <digit> <digit> <digit>.
E.g., Herman becomes H655.
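The six steps can be sketched as below. This follows the variant described in these slides, under the assumption that only the letters after the first are converted to digits; other Soundex variants (for example, the ones built into databases) differ in details such as how H and W are treated, so their codes may not always agree with this sketch.

```python
# Steps 2 and 3: letter -> digit table.
CODES: dict[str, str] = {}
for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                       ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    letters = [c for c in word.upper() if c in CODES]   # ignore anything outside A-Z
    if not letters:
        raise ValueError("word must contain at least one letter A-Z")
    first, rest = letters[0], letters[1:]               # step 1: retain the first letter
    digits = [CODES[c] for c in rest]                   # steps 2-3
    collapsed: list[str] = []
    for d in digits:                                    # step 4: collapse runs of equal digits
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    code = first + "".join(d for d in collapsed if d != "0")  # step 5: drop zeros
    return (code + "000")[:4]                           # step 6: pad and truncate


print(soundex("Herman"))
print(soundex("Smith"), soundex("Smythe"))
```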
Soundex Example
• An example of the use of Soundex is the search function of a customer
database.
• When performing a text search for the surname, "Smith", people with the
name, "Smythe", would not be found.
• However, as the Soundex code for both surnames is "S530", a phonetic
Soundex-based search would find both customers.
• The codes and data could also be used to ask the user,
• "Did you mean Smythe?".
Problem 1
• Find Soundex code for the following
• Matt and Matthew
• Catherine and Katherine
• Johnathan and Jonathan
• Teresa and Theresa
• Smith and Smyth
• Robert and Rupert
Soundex
• Soundex is the classic algorithm, provided not just by IR systems but by most databases (Oracle, Microsoft, …)
• How useful is Soundex?
• Not very, for information retrieval
• Because it lowers precision
• The web contains a great many different terms, and many of them map to the same Soundex code, so a match on the code is not very meaningful.
Assignment V
• Draw yourself a diagram showing the various indexes in a search engine
incorporating all the functionality we have talked about
• Given a query, how would you parse and dispatch sub-queries to the various
indexes?