
Lec 9

The document discusses context-sensitive spell correction techniques, emphasizing the need for surrounding context and various methods to suggest corrections for misspelled phrases. It also introduces the Soundex algorithm for phonetic matching, detailing its process and limitations in information retrieval. Additionally, the document covers index construction strategies and hardware considerations for effective data management in information retrieval systems.

Uploaded by

menaahmed15200

Sec. 3.3.5

Context-sensitive spell correction


• Text: I flew from Heathrow to Narita.
• Consider the phrase query “flew form Heathrow”
• We’d like to respond
Did you mean “flew from Heathrow”?
because no docs matched the query phrase.

Sec. 3.3.5

Context-sensitive correction
• Need surrounding context to catch this.
• First idea: retrieve dictionary terms close (in weighted edit distance) to each query term
• Now try all possible resulting phrases with one word “fixed” at a time
– flew from heathrow
– fled form heathrow
– flea form heathrow
• Hit-based spelling correction: Suggest the alternative that has lots of hits in query logs.
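
The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses plain unit-cost Levenshtein distance instead of a weighted edit distance, and the toy `DICTIONARY` and the hit counts in `QUERY_LOG_HITS` are invented for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance (a simplification of weighted edit distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


# Hypothetical toy dictionary and query-log hit counts (made-up numbers).
DICTIONARY = {"flew", "fled", "flea", "form", "from", "heathrow"}
QUERY_LOG_HITS = {"flew from heathrow": 120, "fled form heathrow": 1}


def one_word_fixes(query: str, max_dist: int = 2) -> list[str]:
    """All phrases obtained by replacing exactly one query term with a close dictionary term."""
    tokens = query.lower().split()
    phrases = []
    for i, tok in enumerate(tokens):
        for term in sorted(DICTIONARY):
            if term != tok and edit_distance(tok, term) <= max_dist:
                phrases.append(" ".join(tokens[:i] + [term] + tokens[i + 1:]))
    return phrases


def suggest(query: str):
    """Hit-based choice: the candidate phrase with the most query-log hits."""
    return max(one_word_fixes(query),
               key=lambda p: QUERY_LOG_HITS.get(p, 0),
               default=None)


print(one_word_fixes("flew form Heathrow"))
print(suggest("flew form Heathrow"))
```

With this toy data the candidates are the three phrases listed on the slide, and the hit counts select “flew from heathrow”.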
Sec. 3.3.5

Another approach
• Break phrase query into a conjunction of biwords.
• Look for biwords that need only one term corrected.
• Enumerate only phrases containing “common” biwords.
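
A minimal sketch of the biword idea, assuming we already have a set of “common” biwords and a list of candidate corrections per term. Both `COMMON_BIWORDS` and `CANDIDATES` are hypothetical toy data, not taken from a real index or query log.

```python
# Hypothetical set of biwords considered "common" (e.g. frequent in the collection).
COMMON_BIWORDS = {("flew", "from"), ("from", "heathrow"), ("fled", "from")}

# Hypothetical candidate corrections per term (e.g. from an edit-distance lookup).
CANDIDATES = {"flew": ["fled", "flea"], "form": ["from"]}


def biwords(tokens: list[str]) -> list[tuple[str, str]]:
    """Break a token sequence into its consecutive word pairs."""
    return list(zip(tokens, tokens[1:]))


def corrected_phrases(query: str) -> list[str]:
    """Correct one term at a time; keep only phrases whose biwords are all common."""
    tokens = query.lower().split()
    results = []
    for i, tok in enumerate(tokens):
        for alt in CANDIDATES.get(tok, []):
            variant = tokens[:i] + [alt] + tokens[i + 1:]
            if all(bw in COMMON_BIWORDS for bw in biwords(variant)):
                results.append(" ".join(variant))
    return results


print(biwords(["flew", "form", "heathrow"]))
print(corrected_phrases("flew form Heathrow"))
```

Filtering on common biwords prunes candidates such as “fled form heathrow” before any phrase query has to be run.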

SOUNDEX

Sec. 3.4

Soundex
• Class of heuristics to expand a query into phonetic equivalents
– Language specific – mainly for names
– E.g., chebyshev → tchebycheff
• Invented for the U.S. census … in 1918

Sec. 3.4

Soundex – typical algorithm


• Turn every token to be indexed into a 4-character reduced form
• Do the same with query terms
• Build a soundex index on the reduced forms
– (when the query calls for a soundex match, search the soundex index)

Sec. 3.4

Soundex – typical algorithm


1. Retain the first letter of the word.
2. Change all occurrences of the following letters to '0' (zero): 'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.
3. Change letters to digits as follows:
• B, F, P, V → 1
• C, G, J, K, Q, S, X, Z → 2
• D, T → 3
• L → 4
• M, N → 5
• R → 6
Sec. 3.4

Soundex continued
4. Repeatedly remove one out of each pair of consecutive identical digits.
5. Remove all zeros from the resulting string.
6. Pad the resulting string with trailing zeros and return the first four positions, which will be of the form <uppercase letter> <digit> <digit> <digit>.

E.g., Herman becomes H655.

Will Hermann generate the same code?

What codes are generated for “Osama”, “Osmaa”, “Mehmed”, and “Mohamed”?
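
The six steps above can be sketched in Python as follows. This follows the variant described on these slides (first letter kept as-is, H/W/Y treated like vowels), which is not identical to every database's Soundex implementation. The names `CODE` and `soundex` are chosen here for illustration.

```python
# Letter-to-digit table from steps 2 and 3 ('0' marks letters dropped in step 5).
_GROUPS = {
    "0": "AEIOUHWY",
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}
CODE = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(word: str) -> str:
    """Return the 4-character code: one uppercase letter followed by three digits."""
    letters = [c for c in word.upper() if c in CODE]  # ignore non-letters
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]             # step 1: retain first letter
    digits = [CODE[c] for c in rest]                  # steps 2 and 3
    collapsed = []
    for d in digits:                                  # step 4: collapse repeated digits
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    kept = [d for d in collapsed if d != "0"]         # step 5: drop zeros
    return (first + "".join(kept) + "000")[:4]        # step 6: pad and truncate


for name in ["Herman", "Hermann", "Osama", "Osmaa", "Mehmed", "Mohamed"]:
    print(name, soundex(name))
```

Running it on the names from the slide lets you check the exercise answers; `soundex("Herman")` returns `H655`, as stated above.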
Sec. 3.4

Soundex
• Soundex is the classic algorithm, provided by most databases (Oracle, Microsoft, …)
• How useful is soundex?
• Not very – for information retrieval
• Okay for “high recall” tasks (e.g., Interpol), though biased to names of certain nationalities
• Zobel and Dart (1996) show that other algorithms for phonetic matching perform much better in the context of IR

Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan

Lecture 4: Index Construction


Ch. 4

Index construction
• How do we construct an index?
– Indexing: the process of constructing an index.
– Indexer: the machine that performs indexing.
• What strategies can we use with limited main memory? (Real data is too big to fit into RAM.)
Sec. 4.1

Hardware basics
• Many design decisions in information retrieval (algorithms and techniques) are based on the characteristics of hardware
• We begin by reviewing hardware basics
Sec. 4.1

Hardware basics
• Access to data in memory is much faster than access to data on disk.
• Memory hierarchy (figure): Processor → Register → Cache → Main Memory → Hard disk
– Levels closer to the processor are smaller, faster, and more expensive.
– Each level holds a subset of the data in the level below it.
Hardware basics
• Disk seeks: No data is transferred from disk while the disk head is being positioned.
– Seek time: time for the disk head to reach the right track.
– Rotational delay: time for the disk to rotate until the required spot is directly under the head.
– Transfer time: time to transfer one block from disk to memory.
• Therefore: Transferring one large chunk of data from disk to memory is faster than transferring many small chunks.
• Disk I/O is block-based: Reading and writing of entire blocks (as opposed to smaller chunks).
• Block sizes: 8 KB to 256 KB.
Sec. 4.1

Hardware basics
• Buffer: the part of main memory into which a block of data is read from disk, or from which it is written to disk.
• Servers (machines) used in IR systems now typically have several GB of main memory, sometimes tens of GB.
• Available disk space is several (2–3) orders of magnitude larger.
• Fault tolerance (machines that don’t fail) is very expensive: It’s much cheaper to use many regular machines (as distributed machines) rather than one fault-tolerant machine. If one of the distributed machines fails, its task is reassigned to another working machine.
Sec. 4.1

Hardware assumptions for this lecture


symbol  statistic                                          value
s       average seek time                                  5 ms = 5 × 10⁻³ s
b       transfer time per byte                             0.02 μs = 2 × 10⁻⁸ s
        processor’s clock rate (access a byte from RAM)    10⁹ s⁻¹
        size of main memory                                several GB
        size of disk space                                 1 TB or more
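
A short back-of-the-envelope calculation with the seek time s and transfer time b above shows why one large chunk beats many small ones. The 10 MB size and the 100-chunk split are illustrative choices, and the model ignores rotational delay.

```python
SEEK_TIME_S = 5e-3           # s: average seek time (5 ms)
TRANSFER_S_PER_BYTE = 2e-8   # b: transfer time per byte (0.02 microseconds)


def read_time(total_bytes: int, num_chunks: int) -> float:
    """Estimated seconds to read total_bytes stored as num_chunks separate
    chunks: one seek per chunk plus the raw transfer time."""
    return num_chunks * SEEK_TIME_S + total_bytes * TRANSFER_S_PER_BYTE


ten_mb = 10_000_000
one_chunk = read_time(ten_mb, 1)       # 1 seek   + 0.2 s transfer
many_chunks = read_time(ten_mb, 100)   # 100 seeks + 0.2 s transfer

print(f"1 chunk:    {one_chunk:.3f} s")
print(f"100 chunks: {many_chunks:.3f} s")
```

The transfer cost is the same in both cases; the difference comes entirely from the 99 extra seeks.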
Sec. 4.2

RCV1: Our collection for this lecture


• Shakespeare’s collected works definitely aren’t large enough for demonstrating many of the points in this course.
• The collection we’ll use isn’t really large enough either, but it’s publicly available and is at least a more plausible example.
• As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection.
• This is one year of Reuters newswire (part of 1995 and 1996)
Sec. 4.2

A Reuters RCV1 document (figure)
Sec. 4.2

Reuters RCV1 statistics


symbol  statistic                                          value
N       documents                                          800,000
L       avg. # tokens per doc                              200
M       terms (distinct)                                   400,000
        avg. # bytes per token (incl. spaces/punct.)       6
        avg. # bytes per token (without spaces/punct.)     4.5
        avg. # bytes per term                              7.5
        non-positional postings                            100,000,000

4.5 bytes per word token vs. 7.5 bytes per word term: why?
Sec. 4.2

Recall IIR 1 index construction

• Documents are parsed to extract words and these are saved with the Document ID.
• Sorted by term (primary key) and, if a word is duplicated, by doc ID (secondary key).
• Sorting step was done in main memory (chapter 1).

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Term       Doc #
I          1
did        1
enact      1
julius     1
caesar     1
I          1
was        1
killed     1
i'         1
the        1
capitol    1
brutus     1
killed     1
me         1
so         2
let        2
it         2
be         2
with       2
caesar     2
the        2
noble      2
brutus     2
hath       2
told       2
you        2
caesar     2
was        2
ambitious  2
Sec. 4.2

Key step

• After all documents have been parsed, the inverted file is sorted by terms.

We focus on this sort step.
We have 100M items to sort.

Before sorting        After sorting
Term       Doc #      Term       Doc #
I          1          ambitious  2
did        1          be         2
enact      1          brutus     1
julius     1          brutus     2
caesar     1          caesar     1
I          1          caesar     2
was        1          caesar     2
killed     1          capitol    1
i'         1          did        1
the        1          enact      1
capitol    1          hath       2
brutus     1          I          1
killed     1          I          1
me         1          i'         1
so         2          it         2
let        2          julius     1
it         2          killed     1
be         2          killed     1
with       2          let        2
caesar     2          me         1
the        2          noble      2
noble      2          so         2
brutus     2          the        1
hath       2          the        2
told       2          told       2
you        2          was        1
caesar     2          was        2
was        2          with       2
ambitious  2          you        2
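
The parse-then-sort step can be sketched for the two example documents as follows. This is a toy, in-memory version: the regular-expression tokenizer and the lowercasing of every token are simplifying assumptions, and sorting everything in RAM is precisely what stops working at 100M items.

```python
import re

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}


def parse(documents: dict[int, str]) -> list[tuple[str, int]]:
    """Extract (term, docID) pairs in the order the tokens appear."""
    pairs = []
    for doc_id, text in documents.items():
        for token in re.findall(r"[A-Za-z']+", text):
            pairs.append((token.lower(), doc_id))
    return pairs


pairs = parse(docs)
pairs.sort()  # tuples compare by term first (primary key), then docID (secondary key)

# Merge the sorted pairs into postings lists without duplicate docIDs.
index: dict[str, list[int]] = {}
for term, doc_id in pairs:
    postings = index.setdefault(term, [])
    if not postings or postings[-1] != doc_id:
        postings.append(doc_id)

print(pairs[:8])
print(index["caesar"], index["capitol"])
```

Because the pairs arrive sorted, each postings list can be built with a single pass and a check against only its last entry.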
