9 Dictionaries and Tolerant Retrieval

Uploaded by Ajitesh Thawait

Dictionaries and Tolerant Retrieval

Outline
• Dictionary Data Structures
• “Tolerant” Retrieval
• Wild-Card Queries
• Spell Correction: Document Correction, Query Misspellings
• Isolated Word Correction, Context-Sensitive Spell Correction
• Soundex
2
Dictionaries and Tolerant Retrieval

• Here we will develop techniques that are robust to:
• Typographical errors in queries
• Alternative spellings of query terms

Dictionary Data Structure for Inverted Index

• Let's see how to implement the dictionary in more detail.
• Recall two terms:
• Term vocabulary: the terms that are stored in the dictionary (the data)
• Dictionary: the actual implementation of the vocabulary (the data structure for storing the term vocabulary)
• We will look into the implementation of the dictionary.
• The dictionary data structure stores:
• the term vocabulary
• the document frequency
• pointers to the postings lists
• We want to store the dictionary in main memory.
4
Dictionary Data Structure for Inverted Index
• The dictionary data structure stores the term vocabulary, document frequency,
pointers to each postings list … in what data structure?

The question is: how do we actually implement a dictionary?

Standard Inverted Index

Later we will look at other kinds of indexes.
5

Storing Dictionaries
• For each term, we need to store a few items:
• The term itself
• Document frequency
• Pointer to the postings list
• Assume for the time being that we can store this information in a fixed-length entry.

6
A naïve dictionary
• Simplest way to implement a dictionary: an array of structures.
• Think of storing all the terms in a sorted array, i.e., an array of structs.
• Each struct stores:
• the term (string)
• the document frequency (integer)
• a pointer to the postings list
• Terms are sorted lexicographically.
7

A naïve dictionary
• If you are searching for a particular term in the dictionary:
• we can use something like binary search (if you are implementing it as an array of structures);
• this takes about O(log n) time, where n is the size of the array.

How do we store a dictionary in memory efficiently?

How do we quickly look up elements at query time?

8
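The naive dictionary above can be sketched in a few lines of Python. This is a minimal illustration with a made-up toy vocabulary and in-memory postings standing in for pointers; the `Entry` record and `lookup` helper are hypothetical names, not part of any real IR library.

```python
import bisect
from dataclasses import dataclass

# Naive dictionary: a lexicographically sorted array of fixed-shape
# entries (term, document frequency, pointer to the postings list).

@dataclass
class Entry:
    term: str
    doc_freq: int
    postings: list  # stand-in for a pointer to the postings list

entries = sorted(
    [Entry("cat", 2, [1, 4]), Entry("dog", 1, [2]), Entry("fish", 3, [1, 2, 5])],
    key=lambda e: e.term,
)
terms = [e.term for e in entries]  # parallel sorted key array for bisect

def lookup(term):
    """Binary search over the sorted array: O(log n)."""
    i = bisect.bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return entries[i]
    return None

print(lookup("dog").doc_freq)   # 1
print(lookup("zebra"))          # None
```

Following the pointer in `postings` then yields the documents for the term.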
How do we store a dictionary in memory efficiently?

char[20] int Postings *
Space needed: 20 bytes 4/8 bytes 4/8 bytes
• Assume each term is approx. 20 bytes, or at most 20 bytes, so we use char[20].
• A 4-byte integer for the document frequency can hold at most 2^32 ≈ 4 billion (the maximum integer that fits into a 4-byte word).
• Web corpus:
• will have more than 4 billion documents;
• the value of a document frequency can be much larger than 4 billion;
• so we have to go for an 8-byte integer to represent it.
9

How do we store a dictionary in memory efficiently?

• From the above array of structures:
• the size of the dictionary is 20 bytes + 8 bytes + 8 bytes = 36 bytes for each row;
• 36 bytes times the number of terms in the index gives the number of bytes taken up by this implementation (the naive dictionary).
10
How do we store a dictionary in memory efficiently?

• In Chapter 5, we will revisit this naive implementation and see how we can compress the dictionary and also compress the postings lists.

11

How do we quickly look up elements at query time?

Step 1: Do binary search.
Step 2: Narrow the search down to the term of interest.
Step 3: Follow the pointer to the postings list for that term.

Remember: these dictionaries can be huge, so scanning is not an option.

Let's look at more sophisticated ways of implementing a dictionary.

12
Dictionary Data Structures
• Two main choices of data structures:
• Hash Tables
• Trees
• Some IR systems use hashes, some use search trees
• Criteria for when to use hashes vs. trees:
• Is there a fixed number of terms or will it keep growing?
• What are the relative frequencies with which various keys will be accessed?
• How many terms are we likely to have?

Let’s look at Hashes 13

Hash Tables
• Each vocabulary term is hashed into an integer.
• A hash table uses a hash function, which takes in a key and maps that key to an integer.
• At query time: hash query term, locate entry in fixed-width array
• (assume you’ve seen hash tables before)

14
Hash Tables
• Pros:
• Lookup in a hash is faster than lookup in a tree: O(1)
• Cons:
• No easy way to find minor variants:
• judgment (American English) / judgement (British English)
• resume vs. résumé
• Suppose I want to search for the word 'judgement' in the dictionary: the hash table will not let me get both variants at the same time, because the hash values of the strings 'judgment' and 'judgement' are different.

15

Hash Tables
• No prefix search (all terms starting with automat) [needed for tolerant retrieval]
• automatic, automation → stemmed to automat (equivalence class)
• With hashing we do not store the equivalence classes obtained after stemming: automatic and automation are stored in different memory locations, each with its own postings list.
• If the vocabulary keeps growing, we need to occasionally do the expensive operation of rehashing everything.
• Say the dictionary has 7 terms and the hash function is (key mod 10): the 7 terms can be mapped to distinct memory locations.
• Suppose the vocabulary grows to 200 terms with the same (key mod 10) function. Do we get unique memory locations? NO: this leads to collisions, i.e., more than one term mapped to the same memory location.

So we have to redesign our hash function.


16
Trees
• Trees solve the prefix problem (find all terms starting with automat).
• Simplest tree: binary tree
• Binary trees are problematic:
• Only balanced trees allow efficient retrieval
• Rebalancing binary trees is expensive
• Search is slightly slower than in hashes: O(log M), where M is the size of the
vocabulary. O(log M) only holds for balanced trees.
• Use B-trees:
• B-trees mitigate the rebalancing problem
• Solve the prefix problem
• Still slower than hashing: O(log M), and this requires a balanced tree

17

Tree: Binary tree


• Implement the dictionary with search trees.
• Binary search tree:
• internal nodes have at most two children (0, 1, or 2 children);
• leaves contain the terms.
18
Tree: binary tree
• This is a way of storing the prefixes.
• The root splits the terms into two ranges: a-m (all terms beginning with a, b, c, ..., m) and n-z (all terms beginning with n, o, p, ..., z).
• These ranges are further subdivided: a-m into a-hu and hy-m, and n-z into n-sh and si-z.
• All the internal nodes contain the prefixes (ranges); the leaf nodes contain the exact terms.
• The design of the binary tree is up to you: you can define the ranges as per your requirements (a subjective decision). Generally we take ranges into which the terms fall.
19

Binary Tree
• Terms are stored in main memory and the corresponding postings lists are stored in secondary memory (since the postings list for each term can be very large).

20
Example
Query: retrieve the documents corresponding to the terms beginning with 'G'.
• The inclusion of ranges is a subjective decision: it depends on your dictionary. Whatever your dictionary terms are, the ranges are picked accordingly.
• If some term is not in the dictionary, the corresponding prefix will not be there in the tree.
• Suppose two terms are the result. How do we find the relevant documents? Perform a lookup into the postings list of each of the two terms in secondary memory. Will it be AND or OR of the postings lists? OR: we want the documents containing any of the matching terms.
21

Tree: B-tree
• A B-tree decreases the number of levels: what the binary tree did in two levels, the B-tree does in one level (e.g., children a-hu, hy-m, n-z directly under the root).
• The number of comparisons per level is higher, i.e., at the first level alone we already do 3 comparisons.
• We can say that a B-tree is a flattened version of a binary tree.
• Definition: every internal node has a number of children in the interval [a, b], where a and b are appropriate natural numbers, e.g., [2, 4].
• [2, 4] means every internal node has at least 2 and at most 4 children. The number can vary (2, 3, or 4), but we will never have a node with exactly one child or five children.
22
Tree: B-tree
• B-trees are the usual way to implement the dictionary.
• The flexibility of the range [a, b] is one of the reasons.

23

Trees
• Simplest: binary tree
• More usual: B-trees
• Trees require a standard ordering of characters and hence strings … but we standardly have
one
• Pros:
• Solves the prefix problem (terms starting with hyp)
• Cons:
• Slower: O(log M) [and this requires balanced tree]
• Rebalancing binary trees is expensive
• But B-trees mitigate the rebalancing problem

24
Wild-Card
Queries

25

Wild-card queries: *
• Wildcard queries are used in any of the following situations:
1. The user is uncertain of the spelling of a query term (e.g., Sydney vs.
Sidney, which leads to the wildcard query S*dney);
2. The user is aware of multiple variants of spelling a term and (consciously)
seeks documents containing any of the variants (e.g., color vs. colour);
3. The user seeks documents containing variants of a term that would be
caught by stemming, but is unsure whether the search engine performs
stemming (e.g., judicial vs. judiciary, leading to the wildcard query
judicia*);
4. The user is uncertain of the correct rendition of a foreign word or phrase
(e.g., the query Universit* Stuttgart).

26
Wild-card queries: *
• There are two types of wild cards
• trailing wildcard query – prefix query
• leading wildcard queries – suffix query

27

A Trailing Wildcard
• A query such as mon* is known as a trailing wildcard query, because the * symbol
occurs only once, at the end of the search string
• mon*: find all docs containing any word beginning with “mon”. (trailing
wild-card queries)
• Example words: monday, money, monitor, monkey, ..., i.e., words starting with 'mon'.

28
A Trailing Wildcard
• A search tree on the dictionary is a convenient way of handling trailing wildcard queries: easy with a binary tree (or B-tree) lexicon.
• We walk down the tree following the symbols m, o and n in turn, at which point we can enumerate the set W of terms in the dictionary with the prefix mon, i.e., retrieve all words w in the range mon ≤ w < moo.
• Finally, we use |W| lookups on the standard inverted index to retrieve all documents containing any term in W: retrieve the postings list for each term individually and perform OR on them.
29
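The range trick mon ≤ w < moo can be sketched with a sorted list standing in for the B-tree. This is a toy illustration: the lexicon, postings, and the `prefix_range` helper are made up for the example.

```python
import bisect

# Trailing wildcard mon*: with a sorted lexicon (a stand-in for the B-tree),
# enumerate all terms w with mon <= w < moo, then OR the postings lists.

lexicon = sorted(["diet", "monday", "money", "monitor", "moon", "trial"])
postings = {"monday": {1, 3}, "money": {2}, "monitor": {3, 5}, "moon": {4},
            "diet": {1}, "trial": {2}}

def prefix_range(prefix):
    """All terms w with prefix <= w < (prefix with its last char bumped)."""
    hi = prefix[:-1] + chr(ord(prefix[-1]) + 1)   # "mon" -> "moo"
    lo_i = bisect.bisect_left(lexicon, prefix)
    hi_i = bisect.bisect_left(lexicon, hi)
    return lexicon[lo_i:hi_i]

W = prefix_range("mon")                          # ['monday', 'money', 'monitor']
docs = set().union(*(postings[t] for t in W))    # OR of the |W| postings lists
print(W, sorted(docs))
```

Note that moon falls outside the range (moo ≤ moon), so it is correctly excluded.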

Leading Wildcard Queries


• First, consider leading wildcard queries, or queries of the form *mon
• *mon: find words ending in “mon”: harder
• Consider a reverse B-tree on the dictionary – one in which each root-to-leaf
path of the B-tree corresponds to a term in the dictionary written backwards:
thus, the term lemon would, in the B-tree, be represented by the path root-n-
o-m-e-l.
• Maintain an additional B-tree for terms written backwards (a reverse B-tree).
• Ex: monday is stored in the reverse B-tree as 'yadnom'.
• A walk down the reverse B-tree then enumerates all terms R in the
vocabulary with a given prefix.
Can retrieve all words in range: nom ≤ w < non.
30
MORE GENERAL CASE: WILDCARD QUERIES

• In fact, using a regular B-tree together with a reverse B-tree, we can


handle an even more general case: wildcard queries in which there is
a single * symbol, such as se*mon.
• To do this, we use the regular B-tree to enumerate the set W of
dictionary terms beginning with the prefix se, then the reverse B-tree to
enumerate the set R of terms ending with the suffix mon.
• Next, we take the intersection W ∩ R of these two sets, to arrive at the set of terms that begin with the prefix se and end with the suffix mon.

31

MORE GENERAL CASE: WILDCARD QUERIES

• Finally, we use the standard inverted index to retrieve all documents


containing any terms in this intersection.
• We can thus handle wildcard queries that contain a single * symbol using two
B-trees, the normal B-tree and a reverse B-tree.

Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent ?

32
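The two-tree scheme for a single-* query like se*mon can be sketched with two sorted lists (forward and reversed terms) standing in for the two B-trees. The lexicon and `with_prefix` helper here are hypothetical, for illustration only.

```python
# Single-* wildcard se*mon: W = terms with prefix se (normal lexicon),
# R = terms with suffix mon (lexicon of reversed terms), answer = W ∩ R.

lexicon = sorted(["salmon", "seminar", "sermon", "summon", "semon"])
rev_lexicon = sorted(t[::-1] for t in lexicon)   # stand-in for the reverse B-tree

def with_prefix(sorted_terms, prefix):
    # a real B-tree would walk down the tree; a linear scan suffices here
    return [t for t in sorted_terms if t.startswith(prefix)]

W = set(with_prefix(lexicon, "se"))                      # prefix se
R = {t[::-1] for t in with_prefix(rev_lexicon, "nom")}   # suffix mon (reversed: nom)
print(sorted(W & R))
```

The postings lists of the surviving terms are then ORed, as with any wildcard term.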
Exercise
• How can we enumerate all terms meeting the wild-card query pro*cent?
• You should maintain both trees for the dictionary:
• create the result for pro* using the normal B-tree;
• create the result for *cent using the reverse B-tree.
• You will get the terms starting with pro and the terms ending with cent; intersect the two sets.
• For document retrieval: get the individual postings list for each surviving term and perform OR on them to get the result.

33

Exercise
• Query: Jayasur*a. We don't know the spelling (the missing part may be 'iy' or 'y').
• Normal B-tree search for the prefix (Jayasur); backward B-tree search for the suffix (a).
• Get all terms having the prefix and all terms having the suffix.
• Perform an intersection of both these lists.


34
Query processing
• We have an enumeration of all terms in the dictionary that match the
wild-card query.

• We still have to look up the postings for each enumerated term.

• E.g., consider the query:


se*ate AND fil*er
This may result in the execution of many Boolean AND queries.
35

General wild-card queries


• Two strategies:
• Permuterm indexes
• k-Gram indexes

36
B-trees handle *’s at the end of a query term

• How can we handle *’s in the middle of query term?


• co*tion
• We could look up co* AND *tion in a B-tree and intersect the two
term sets
• Expensive
• The solution: transform wild-card queries so that the *’s occur at the
end
• This gives rise to the Permuterm Index.

37

Permuterm index
• Here we employ an extra index, in addition to the standard inverted index, called the permuterm index.

• Step 1: Standard Inverted Index steps


1. Collect documents to be indexed
2. Tokenize
3. Linguistic processing
4. Index the documents that each term occurs in by creating an inverted index
• consisting of a dictionary & postings

38
Permuterm index
• Step 2: Create Permuterm Index
1. Look at every term that goes into the Standard Inverted Index
2. Create all rotations of that term.
• Ex: for the term hello, index under:
• hello$, ello$h, llo$he, lo$hel, o$hell, $hello: the permuterm index is going to have each of these rotated versions of the original term,
• where $ is a special symbol that tells us the term ends at that point.

39

Permuterm index
• Step 3: Put all the rotations into the dictionary; they form the dictionary of the permuterm index.
• The postings list of each rotation in the permuterm index has a single entry: the original term.
• hello$ → hello
• ello$h → hello
• llo$he → hello
• lo$hel → hello
• o$hell → hello
• $hello → hello
40
Permuterm index
• Queries:
• X lookup on X$
• X* lookup on $X* (use B-tree look up)
• *X lookup on X$*
• X*Y lookup on Y$X*
• X*Y*Z ??? Exercise!

• The dictionary of the permuterm index can again be implemented as a B-tree; the strings just carry the extra $ symbol.
• Example: query = hel*o. With X = hel and Y = o, look up o$hel*.
41
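The rotation scheme can be sketched directly. A toy illustration with a made-up three-term vocabulary; `rotations` and `wildcard_lookup` are hypothetical helper names, and the B-tree prefix walk is approximated by a scan over the rotation keys.

```python
# Permuterm index sketch: for each term, index every rotation of term+"$";
# each rotation points back to the single original term. A wildcard X*Y is
# rotated so the * falls at the end: look up Y$X*.

def rotations(term):
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

permuterm = {}
for term in ["hello", "help", "halo"]:
    for rot in rotations(term):
        permuterm[rot] = term

def wildcard_lookup(query):
    """Handle a query with a single '*', e.g. hel*o -> look up o$hel*."""
    x, y = query.split("*")
    key = y + "$" + x                      # rotate so the * is trailing
    return sorted({t for rot, t in permuterm.items() if rot.startswith(key)})

print(wildcard_lookup("hel*o"))   # ['hello']
print(wildcard_lookup("h*o"))     # ['halo', 'hello']
```

A pure prefix query X* becomes a lookup on $X*, and a suffix query *X a lookup on X$*, using the same machinery.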

Permuterm query processing


• Rotate query wild-card to the right

• Now use B-tree lookup as before.

• Permuterm problem: ≈ quadruples lexicon size


Empirical observation for English.

42
Exercise
• Write down the entries in the permuterm index dictionary that are generated
by the term “mama”
• Solution
• mama$
• ama$m
• ma$ma
• a$mam
• $mama
• If you want to search for s*ng in the permuterm wildcard index, what key(s) would one look up?
• Solution
• ng$s*

43

Bigram (k-gram) indexes


• More space-efficient than the permuterm index.
• Enumerate all k-grams (sequences of k characters) occurring in any term.
• More widely used than the permuterm index.
• Also used in the context of spelling correction.
• We will have two indexes:
• the standard inverted index
• the k-gram index

44
Bigram (k-gram) indexes
• Example: april
• If k=2  get bigrams
• $a,ap,pr,ri,il,l$  bigrams of april

• e.g., from text “April is the cruelest month” we get the 2-grams (bigrams) k=2

$a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru,ue,el,le,es,st,t$,$m,mo,on,nt,h$
• $ is a special word boundary symbol
• Maintain a second inverted index from bigrams to dictionary terms that match
each bigram.
45

Bigram (k-gram) indexes


• The dictionary of the k-gram index contains all possible k-grams generated from the text.
• The postings list for a particular k-gram contains all the terms containing that k-gram.

46
Bigram (k-gram) indexes
Ex: if the bigram 'ap' is found in the terms april, apple, map, ...:

ap → apple, april, map

The terms in the postings list are sorted in lexicographic order.

47

Bigram (k-gram) indexes


• The k-gram index finds terms based on a query consisting of k-grams (here
k=2).
Query: mon*

Bigrams for the query: $m, mo, on

Since the * is at the end, we don't put a $ at the end (the * is not a character, it's a wildcard).

48
Processing wild-cards
Ex: mon* → generate bigrams: $m, mo, on
• The entry for $m in the bigram index is a postings list of all terms starting with m, e.g., $m → mace, madden, ...
• The entry for mo contains all terms containing mo, e.g., mo → among, amortize, ...
• The entry for on contains all terms containing on, e.g., on → along, among, ...
49

$m → mace, madden, ...
mo → among, amortize, ...
on → along, among, ...

Processing wild-cards
• The query mon* can now be run as:
• $m AND mo AND on
• This gets the terms that match the AND version of our wildcard query.
• Retrieve their corresponding postings lists and perform OR on those postings lists.
• But we'd also enumerate moon, which is a false positive: the query was mon* and moon does not start with mon.
• Must post-filter these terms against the query.
• The surviving enumerated terms are then looked up in the term-document inverted index.
• Fast, space-efficient (compared to permuterm).

50
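The whole pipeline, build the bigram index, intersect the query bigrams, then post-filter the moon false positive, can be sketched as follows. The five-term vocabulary and helper names are made up for illustration.

```python
from collections import defaultdict

# Bigram (k=2) index: map each bigram (with $ as the word boundary) to the
# set of vocabulary terms containing it; answer mon* as $m AND mo AND on,
# then post-filter candidates that do not actually start with mon.

def bigrams(term):
    s = "$" + term + "$"
    return {s[i:i+2] for i in range(len(s) - 1)}

vocab = ["moon", "monday", "money", "among", "morbid"]
index = defaultdict(set)
for t in vocab:
    for bg in bigrams(t):
        index[bg].add(t)

def trailing_wildcard(prefix):
    s = "$" + prefix                                   # "mon" -> "$mon"
    qgrams = [s[i:i+2] for i in range(len(s) - 1)]     # ['$m', 'mo', 'on']
    candidates = set.intersection(*(index[g] for g in qgrams))
    # post-filter: moon contains $m, mo and on, yet does not start with mon
    return sorted(t for t in candidates if t.startswith(prefix))

print(trailing_wildcard("mon"))   # ['monday', 'money']
```

Here the AND step yields {moon, monday, money}; the post-filter removes moon before any postings are fetched.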
Processing wild-card queries
• As before, we must execute a Boolean query for each enumerated, filtered
term.
• Wild-cards can result in expensive query execution (very large disjunctions…)
• gen* AND universit*
• (geneva AND university) OR (geneva AND université) OR (genève AND university)
OR (genève AND université) OR (general AND universities) OR . . . . . .
Very Expensive Requires query optimization
• If you encourage “laziness” people will respond!

51

Processing wild-card queries


• Do we need to support wildcard queries?
• Users are lazy! If wildcards are allowed → Users will love it (?)

Search
Type your search terms, use ‘*’ if you need to.
E.g., Alex* will match Alexander.

• Does Google allow wildcard queries? (Which web search engines allow
wildcard queries?)

52
Spelling
Correction

53

Spell Correction
• Two principal uses:
• correcting document(s) being indexed;
• correcting user queries to retrieve the "right" answers.
• We will be looking at spelling correction mainly in the context of misspelled queries; in an IR system, spelling correction is only ever run on queries.
• Two main flavors:
• Isolated word
• Check each word on its own for misspelling
• Will not catch typos resulting in correctly spelled words
• e.g., from  form
• Context-sensitive
• Look at surrounding words,
• e.g., I flew form Heathrow to Narita.

54
General IR philosophy: don’t change the
Document Correction document

• Especially needed for OCR’ed documents


• Correction algorithms are tuned for this, e.g., rn/m (OCR misreads "rn" as "m")
• Can use domain-specific knowledge
• E.g., OCR can confuse O and D more often than it would confuse O and I (adjacent on the
QWERTY keyboard, so more likely interchanged in typing).
• But also: web pages and even printed material have typos
• Goal: the dictionary contains fewer misspellings and better matching
• But often we don’t change the documents but aim to fix the query-document
mapping

55

Query mis-spellings
• More common in IR than document correction
• typos in queries are common
• People are in a hurry
• Users often look for things they don’t know much about

56
Query mis-spellings
• Our principal focus here
• E.g., search the query Alanis Morisett in Google:
• Google shows results for the corrected spelling; it is not retrieving documents corresponding to the actual query. It retrieves documents corresponding to the corrected query, and it also gives you the option to search the original, so you can get both sets of documents.

57

Query mis-spellings
• Strategies:
• Retrieve documents indexed by the correct spelling, OR
• Return several suggested alternative queries with the correct
spelling
Did you mean … ?

58
Query mis-spellings
• Example:Wikipedia

59

Query mis-spellings
• Google gives you a choice:
• search with the correct spelling and retrieve those documents, or
• search with the incorrect spelling and show those documents.
• So multiple variations are possible; this is something you have to decide beforehand, because your algorithms will depend on it.

60
Isolated Word Correction
•Premise 1: there is a lexicon (“correct words”) from
which the correct spellings come
•Premise 2: We have a way of computing the distance
between a misspelled word and a correct word.
•Simple spelling correction algorithm: return the
“correct” word that has the smallest distance to the
misspelled word.
• Example: informaton → information
61

Isolated Word Correction: Premise 1


• Premise 1 (the fundamental premise): there is a lexicon ("correct words") from which the correct spellings come.
• Two basic choices for this standard dictionary of correct spellings:
• a standard lexicon, such as Webster's English Dictionary;
• an "industry-specific", hand-maintained lexicon (for domain-specific IR).
• Advantage: correct entries only!
• This is a reference set of correct spellings against which you match a particular word to see whether it corresponds to a correct spelling.
62
Isolated Word Correction
• Another alternative is to just treat all the words in the documents of your corpus as making up the reference dictionary.
• The lexicon of the indexed corpus:
• e.g., all words on the web: better coverage;
• all names, acronyms, etc.;
• (including the misspellings).
• Misspellings will appear relatively rarely compared to correct spellings.
1. Treat all the terms appearing in the corpus as your lexicon.
2. Either give equal weight to every term, or compute weights for all terms based on frequencies: terms with higher collection frequency will have higher weight.
63

Isolated Word Correction: Premise 2


• Premise 2: there is a way to calculate the distance between any two words.
• Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q.
• What's "closest"? What is the distance between the query term and a standard dictionary term?
• When we are given a query term, we look up the lexicon and determine which words in the lexicon are closest to the query term.
• The way we measure closeness: calculate the distance between the query term and the various terms in the dictionary or lexicon.
64
Isolated Word Correction: Premise 2
•There are multiple ways to define the distance measure
between two words
• We’ll study several alternatives
• Edit distance (Levenshtein distance)
• Weighted edit distance
• n-gram overlap

65

Edit Distance
Isolated Word Correction

66
Edit Distance
• The minimum edit distance between two strings
• Given two strings S1 and S2, the minimum number of operations to convert one (S1) to
the other (S2)
• Operations are typically character-level
• Insert, Delete, Replace, (Transposition)
• E.g., the edit distance from dof to dog is 1
• From cat to act is 2 (Just 1 with transpose.)
• from cat to dog is 3.
• Generally found by dynamic programming.

67

Example: edit distance, ABCFG into ADCEG

We fill a dynamic-programming table with the source ABCFG along the columns and the target ADCEG along the rows (NULL denotes the empty string).

Base case: the NULL row and NULL column just count characters, i.e., D[i][0] = i (deletions) and D[0][j] = j (insertions).

If the row character = the column character:
D[i][j] = D[i-1][j-1] (copy the diagonal value)
If the row character ≠ the column character:
D[i][j] = min(insert, delete, replace) + 1 = min(D[i-1][j], D[i][j-1], D[i-1][j-1]) + 1

Filling the table row by row gives:

        NULL  A   B   C   F   G
NULL     0    1   2   3   4   5
A        1    0   1   2   3   4
D        2    1   1   2   3   4
C        3    2   2   1   2   3
E        4    3   3   2   2   3
G        5    4   4   3   3   2

The bottom-right cell gives the minimum edit distance: 2. Tracing back through the table, both operations are replacements (marked R on the slides): B → D and F → E.
83
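The table above follows the standard dynamic program, which can be written down directly. A minimal sketch; `edit_distance` is just an illustrative name for the Levenshtein computation.

```python
# Minimum edit distance (Levenshtein) by dynamic programming:
# D[i][j] = D[i-1][j-1] if the characters match, otherwise
# 1 + min(delete, insert, replace).

def edit_distance(source, target):
    m, n = len(source), len(target)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                         # base case: i deletions
    for j in range(n + 1):
        D[0][j] = j                         # base case: j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if source[i - 1] == target[j - 1]:
                D[i][j] = D[i - 1][j - 1]              # match: copy diagonal
            else:
                D[i][j] = 1 + min(D[i - 1][j],         # delete
                                  D[i][j - 1],         # insert
                                  D[i - 1][j - 1])     # replace
    return D[m][n]

print(edit_distance("ABCFG", "ADCEG"))   # 2 (replace B->D, F->E)
print(edit_distance("cat", "dog"))       # 3
print(edit_distance("dof", "dog"))       # 1
```

The same scaffold answers the exercises below, e.g., SITTING into KITTEN.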

Exercise
1. SPARTAN into PART
2. PLASMA into ALTRUISM
3. RELEVANT into ELEPHANT
4. SITTING into KITTEN
5. SATURDAY into SUNDAY
6. MONEY into MONKEY
7. LEVENSHTEIN into MEILENSTEIN
8. LAWN into FLAW
9. HONDA into HUNDAI

84
(Blank DP tables for the first three exercises, e.g., SPARTAN into PART with columns NULL, P, A, R, T, were provided on the slides for you to fill in.)
Weighted Edit Distance
• The weight of an operation depends on the character(s)
involved
• Meant to capture OCR or keyboard errors, e.g. m more likely to
be mistyped as n than as q
• Therefore, replacing m by n is a smaller edit distance than by q
• This may be formulated as a probability model
• Requires weight matrix as input
• Modify dynamic programming to handle weights
89

Using Edit Distances


• Given a query, first enumerate all character sequences within a preset (weighted) edit distance (e.g., 2).
• Intersect this set with a list of "correct" words.
• Show the terms you found to the user as suggestions.
• Alternatively:
• we can look up all possible corrections in our inverted index and return all docs ... slow;
• we can run with the single most likely correction.
90
Edit distance to all dictionary terms?
• Given a (misspelt) query – do we compute its edit distance
to every dictionary term?
• Expensive and slow
• Alternative?
• How do we cut the set of candidate dictionary terms?

• One possibility is to use n-gram overlap for this

• This can also be used by itself for spelling correction.


91

n-gram overlap
• Enumerate all the n-grams in the query string as well as in the lexicon.
• Ex: query hello, n = 3 → hel, ell, llo
• Use the n-gram index (recall wild-card search) to retrieve all lexicon terms matching any of the query n-grams.
• That is, use the k-gram index to retrieve all the terms in the dictionary of the standard inverted index that have a good overlap with the query term.
• How do we define a good overlap? E.g., if most of the k-grams of a lexicon term match the k-grams of the query term, we call it a good overlap.
92
n-gram overlap
• One way to decide whether something is a good overlap is to set a threshold on the number of matching n-grams.
• Variants: weight by keyboard layout, etc.
• Suppose 5 trigrams match between the query term and a term in the dictionary: then that dictionary term has a good overlap with the query term, and it is a good candidate to suggest as a spelling correction.

93

Example with trigrams


• Suppose the text is november
• Trigrams are nov, ove, vem, emb, mbe, ber.

• The query is december


• Trigrams are dec, ece, cem, emb, mbe, ber.

• So 3 trigrams overlap (of 6 in each term)

• How can we turn this into a normalized measure of overlap?

94
One option – Jaccard Coefficient
X nov, ove, vem, emb, mbe, ber 6
• A commonly-used measure of overlap
• It is a quantitative way of measuring the overlap Y dec, ece, cem, emb, mbe, ber 6

• Let X and Y be two sets; then the J.C. is


3

X Y / X Y = 3/6+6-3 = 3/9
• Equals 1 when X and Y have the same elements and 0 when they are disjoint
• X and Y don’t have to be of the same size
• Always assigns a number between 0 and 1
• Now threshold to decide if you have a match
• E.g., if J.C. > 0.8, declare a match

95
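The november/december calculation can be reproduced in a few lines; `kgrams` and `jaccard` are illustrative helper names.

```python
# Jaccard coefficient over k-gram sets: |X ∩ Y| / |X ∪ Y|.
# For november vs. december with trigrams, 3 of 9 distinct trigrams overlap.

def kgrams(term, k=3):
    return {term[i:i+k] for i in range(len(term) - k + 1)}

def jaccard(x, y):
    return len(x & y) / len(x | y)

X = kgrams("november")   # {nov, ove, vem, emb, mbe, ber}
Y = kgrams("december")   # {dec, ece, cem, emb, mbe, ber}
print(jaccard(X, Y))     # 3/9 = 0.333...
```

Note this version omits the $ boundary symbols; adding them would also credit matching first and last characters.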

One option – Jaccard Coefficient


• Note: if somebody makes a spelling error while typing a particular term, it is likely that most of the k-grams remain unchanged; only the k-grams involved in the error are affected.

• Jaccard Coefficient can be used in a way that is pretty fast.


• No need to write dynamic programming algorithm to compute the
Jaccard coefficient between particular query term and dictionary
term.
96
Matching Bigrams
• Consider the query lord: we wish to identify words matching at least 2 of its 3 bigrams (lo, or, rd).

lo → alone, lore, sloth
or → border, lore, morbid
rd → ardent, border, card

• A standard postings "merge" over these lists (alone, ardent, border, card, lore, morbid, sloth) will enumerate the terms matching at least 2 bigrams: border and lore.
• Adapt this to using the Jaccard (or another) measure.


97
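The postings "merge" on this slide can be sketched as a counting scan over the query's bigram postings (the `bigrams` and `match_at_least` helper names are my own):

```python
from collections import Counter

# The slide's small bigram index: k-gram -> postings list of vocabulary terms
index = {
    "lo": ["alone", "lore", "sloth"],
    "or": ["border", "lore", "morbid"],
    "rd": ["ardent", "border", "card"],
}

def bigrams(term: str) -> set:
    return {term[i:i + 2] for i in range(len(term) - 1)}

def match_at_least(query: str, index: dict, threshold: int = 2) -> set:
    """Terms appearing in at least `threshold` of the query's bigram postings lists."""
    counts = Counter()
    for gram in bigrams(query):
        for term in index.get(gram, ()):
            counts[term] += 1
    return {term for term, count in counts.items() if count >= threshold}

print(match_at_least("lord", index))  # {'lore', 'border'}
```

`lore` matches lo and or, `border` matches or and rd; every other term matches only one of the query's bigrams.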

Matching Bigrams
• How can we adapt the procedure to compute the JC
between the query term & each current term?
• Threshold: if JC(query, current term) > 0.8 (some
value near 1)
• Append the current term to the answer list
• X = set of bigrams in a query term
• Y = set of bigrams in current term
98
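One way the procedure above might be adapted to a Jaccard threshold (a sketch; with this toy index a threshold as high as 0.8 would reject every candidate, so an illustrative lower value is used):

```python
def bigrams(term: str) -> set:
    return {term[i:i + 2] for i in range(len(term) - 1)}

def jaccard_candidates(query: str, index: dict, threshold: float = 0.3) -> list:
    """Scan the postings of the query's bigrams; keep terms whose JC exceeds threshold."""
    q = bigrams(query)
    answers, seen = [], set()
    for gram in sorted(q):          # deterministic scan order
        for term in index.get(gram, ()):
            if term in seen:        # score each candidate term only once
                continue
            seen.add(term)
            t = bigrams(term)
            if len(q & t) / len(q | t) > threshold:
                answers.append(term)
    return answers

index = {
    "lo": ["alone", "lore", "sloth"],
    "or": ["border", "lore", "morbid"],
    "rd": ["ardent", "border", "card"],
}
print(jaccard_candidates("lord", index))  # ['lore', 'border']
```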
Problem 1
• Compute the Jaccard coefficients between the query bord
and each of the terms in k-grams index given below.

99

Context-Sensitive Spell Correction


• Text: I flew from Heathrow to Narita.

• Consider the phrase query “flew form Heathrow”

• We’d like to respond


Did you mean “flew from Heathrow”?
because no docs matched the query phrase.
100
Context-sensitive correction
• Need surrounding context to catch this.

• First idea: retrieve dictionary terms close (in weighted edit distance) to
each query term

• Now try all possible resulting phrases with one word “fixed” at a time
• flew from heathrow
• fled form heathrow
• flea form heathrow

• Hit-based spelling correction:


Suggest the alternative that has lots of hits. 101
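The "one word fixed at a time" enumeration might be sketched as follows (the alternatives dictionary is hypothetical; in a real system the alternatives come from an edit-distance lookup, and each candidate phrase is then run against the index so the one with the most hits can be suggested):

```python
# Hypothetical alternatives per query term (in practice: dictionary terms
# within a small weighted edit distance of each term)
alternatives = {
    "flew": ["fled", "flea"],
    "form": ["from", "farm"],
    "heathrow": [],
}

def one_word_fixed(terms: list) -> list:
    """All phrases obtained by replacing exactly one term with one of its alternatives."""
    phrases = []
    for i, term in enumerate(terms):
        for alt in alternatives.get(term, []):
            phrases.append(terms[:i] + [alt] + terms[i + 1:])
    return phrases

for phrase in one_word_fixed(["flew", "form", "heathrow"]):
    print(" ".join(phrase))
# fled form heathrow / flea form heathrow / flew from heathrow / flew farm heathrow
```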

Exercise
•Suppose that for “flew form Heathrow” we have 7
alternatives for flew, 19 for form and 3 for Heathrow.

How many “corrected” phrases will we enumerate in


this scheme?

102
Another Approach
•Break phrase query into a conjunction of biwords.

•Look for biwords that need only one term corrected.

•Enumerate phrase matches and … rank them!

103
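A sketch of the biword decomposition, flagging biwords that are absent from a (hypothetical) biword index and therefore need a one-term correction:

```python
def suspect_biwords(terms: list, biword_index: set) -> list:
    """Biwords of the phrase that are absent from the biword index --
    candidates for correcting exactly one of their two terms."""
    return [bw for bw in zip(terms, terms[1:]) if bw not in biword_index]

# Hypothetical biword index built from the collection
known = {("flew", "from"), ("from", "heathrow")}

print(suspect_biwords(["flew", "form", "heathrow"], known))
# [('flew', 'form'), ('form', 'heathrow')] -- both contain the misspelling 'form'
print(suspect_biwords(["flew", "from", "heathrow"], known))
# [] -- the corrected phrase produces no suspect biwords
```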

General Issues in Spell Correction


• We enumerate multiple alternatives for “Did you mean?”
• Need to figure out which to present to the user
• Use heuristics
• The alternative hitting most docs
• Query log analysis + tweaking
• For especially popular, topical queries
• Spell-correction is computationally expensive
• Avoid running routinely on every query?
• Run only on queries that matched few docs
104

Soundex

106
Soundex
• It has existed since the late 1800s and originally was used by the U.S. Census
Bureau.
• The Soundex algorithm generates four-character codes based upon the
pronunciation of English words.
• These codes can be used to compare two words to determine whether they
sound alike.
• This can be very useful when searching for information in a database or text
file, particularly when looking for names that are commonly misspelled.

107

Soundex – typical algorithm


• Turn every token to be indexed into a 4-character reduced form

• Do the same with query terms

• Build and search an index on the reduced forms


• (when the query calls for a soundex match)

• http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm#Top

108
Soundex – typical algorithm
1. Retain the first letter of the word.
2. Change all occurrences of the following letters to '0' (zero):
'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.
3. Change letters to digits as follows:
• B, F, P, V → 1
• C, G, J, K, Q, S, X, Z → 2
• D, T → 3
• L → 4
• M, N → 5
• R → 6

109

Soundex continued
4. Repeatedly remove one out of each pair of consecutive identical digits.
5. Remove all zeros from the resulting string.
6. Pad the resulting string with trailing zeros and return the first four positions,
which will be of the form <uppercase letter> <digit> <digit> <digit>.
i.e., return the first four characters, padded on the right with zeros if
there are fewer than four
E.g., Herman becomes H655.

Will hermann generate the same code?

110
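Steps 1–6 above can be sketched directly in Python (a sketch of the slide's algorithm, not a production implementation):

```python
def soundex(term: str) -> str:
    """Four-character Soundex code, following the slide's six steps."""
    term = term.upper()
    # Step 2: vowels and H, W, Y map to '0'; step 3: consonant groups map to digits
    codes = {**dict.fromkeys("AEIOUHWY", "0"),
             **dict.fromkeys("BFPV", "1"),
             **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"),
             "L": "4",
             **dict.fromkeys("MN", "5"),
             "R": "6"}
    # Step 1: retain the first letter; encode the rest
    first, rest = term[0], term[1:]
    digits = "".join(codes.get(ch, "") for ch in rest)
    # Step 4: collapse each run of consecutive identical digits to one digit
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    # Step 5: drop zeros; step 6: pad with trailing zeros, keep four characters
    result = first + "".join(d for d in collapsed if d != "0")
    return (result + "000")[:4]

print(soundex("Herman"))   # H655
print(soundex("Hermann"))  # H655 -- the double 'n' collapses, so: same code
print(soundex("Smith"), soundex("Smythe"))  # S530 S530
```

This also answers the question on the previous slide: hermann generates the same code as Herman, since the repeated 'n' digits collapse in step 4.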
Soundex Example
• An example of the use of Soundex is the search function of a customer
database.
• When performing a text search for the surname "Smith", people with the
name "Smythe" would not be found.
• However, as the Soundex code for both surnames is "S530", a phonetic
Soundex-based search would find both customers.
• The codes and data could also be used to ask the user,
• "Did you mean Smythe?".

111

Problem 1
• Find Soundex code for the following
• Matt and Matthew
• Catherine and Katherine
• Johnathan and Jonathan
• Teresa and Theresa
• Smith and Smyth
• Robert and Rupert

112
Soundex
• Soundex is the classic algorithm, provided not just by IR system but by most
databases (Oracle, Microsoft, …)
• How useful is Soundex?
• Not very – for information retrieval
• Because it lowers precision
• The web contains many different terms, and many of them map to the same Soundex code, which is not
very meaningful.

• Okay for applications where recall is very important.


• There are other phonetic-matching algorithms that perform much better in the
context of IR

113

What queries can we process?


• We have
• Positional inverted index with skip pointers
• Wild-card index
• Spell-correction
• Soundex
• Queries such as
(SPELL(moriset) /3 toron*to) OR SOUNDEX(chaikofski)

114
Assignment V
• Draw yourself a diagram showing the various indexes in a search engine
incorporating all the functionality we have talked about

• Identify some of the key design choices in the index pipeline:


• Does stemming happen before the Soundex index?
• What about n-grams?

• Given a query, how would you parse and dispatch sub-queries to the various
indexes?

115
