Lec 9
Sec. 3.3.5
Context-sensitive correction
• A query such as “flew form heathrow” contains only valid dictionary
words, so we need the surrounding context to catch the error.
• First idea: retrieve dictionary terms close (in
weighted edit distance) to each query term
• Now try all possible resulting phrases with one
word “fixed” at a time
– flew from heathrow
– fled form heathrow
– flea form heathrow
• Hit-based spelling correction: Suggest the
alternative that has lots of hits in query logs.
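The enumeration and hit-based choice above can be sketched as follows. This is a minimal sketch, not the lecture's implementation: the function names, the candidate lists and the query-log counts are all made up for illustration.

```python
def one_word_fixes(query_terms, candidates):
    """Enumerate phrases that differ from the query in exactly one term."""
    query_terms = list(query_terms)
    phrases = []
    for i, term in enumerate(query_terms):
        for alt in candidates.get(term, []):
            if alt != term:
                phrases.append(tuple(query_terms[:i] + [alt] + query_terms[i + 1:]))
    return phrases


def best_by_hits(phrases, hit_counts):
    """Pick the phrase with the most hits in a (hypothetical) query log."""
    return max(phrases, key=lambda p: hit_counts.get(p, 0))


# Hypothetical dictionary neighbours (close in edit distance) for each query term.
candidates = {"flew": ["fled", "flea"], "form": ["from"], "heathrow": []}
# Hypothetical query-log counts, invented for this example.
hit_counts = {("flew", "from", "heathrow"): 120, ("fled", "form", "heathrow"): 1}

phrases = one_word_fixes(["flew", "form", "heathrow"], candidates)
suggestion = best_by_hits(phrases, hit_counts)
```

With these toy counts the suggestion is “flew from heathrow”. The number of phrases grows with the number of candidates per term, which is why the next slide looks for a cheaper enumeration.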
Sec. 3.3.5
Another approach
• Break phrase query into a conjunction of
biwords.
• Look for biwords that need only one term
corrected.
• Enumerate only phrases containing “common”
biwords.
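One possible reading of the biword approach, as a sketch: split each candidate phrase into adjacent word pairs and keep only the phrases whose pairs all appear in a set of “common” biwords. The helper names and the contents of the common-biword set are assumptions for illustration.

```python
def biwords(terms):
    """Return the adjacent word pairs of a phrase."""
    terms = list(terms)
    return list(zip(terms, terms[1:]))


def phrases_with_common_biwords(phrases, common_biwords):
    """Keep phrases in which every biword is in the 'common' set."""
    return [p for p in phrases if all(b in common_biwords for b in biwords(p))]


# Hypothetical set of biwords that occur often in the collection or query log.
common = {("flew", "from"), ("from", "heathrow")}
kept = phrases_with_common_biwords(
    [("flew", "from", "heathrow"), ("fled", "form", "heathrow")], common
)
```

Here only “flew from heathrow” survives, because “fled form” and “form heathrow” are not in the assumed common set.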
SOUNDEX
Sec. 3.4
Soundex
• Class of heuristics to expand a query into
phonetic equivalents
– Language specific – mainly for names
– E.g., chebyshev → tchebycheff
• Invented for the U.S. census … in 1918
Sec. 3.4
Soundex continued
4. Repeatedly remove one out of each pair of consecutive
identical digits.
5. Remove all zeros from the resulting string.
6. Pad the resulting string with trailing zeros and
return the first four positions, which will be
of the form <uppercase letter> <digit> <digit>
<digit>.
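A minimal sketch of the whole procedure in Python. Steps 1 to 3 are not shown on the slides above, so the sketch assumes the usual textbook version: keep the first letter, map A, E, I, O, U, H, W, Y to 0, and map the remaining letters to the digits 1 to 6 as in the table in the code. Step 4 is read as collapsing runs of identical digits into one. The function name soundex is chosen for illustration.

```python
# Assumed letter-to-digit table (textbook Soundex); not given on the slides above.
SOUNDEX_CODES = {}
for letters, digit in (
    ("AEIOUHWY", "0"),
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def soundex(term):
    """Return a 4-character code: <uppercase letter><digit><digit><digit>."""
    letters = [c for c in term.upper() if c in SOUNDEX_CODES]
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]
    digits = [SOUNDEX_CODES[c] for c in rest]  # steps 2 and 3
    collapsed = []
    for d in digits:  # step 4: keep one digit per run of identical digits
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    code = first + "".join(d for d in collapsed if d != "0")  # step 5
    return (code + "000")[:4]  # step 6: pad and truncate
```

For example, soundex("Hermann") and soundex("Herman") both give H655, so the two spellings would be treated as phonetic equivalents.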
Soundex
• Soundex is the classic algorithm, provided by
most databases (Oracle, Microsoft, …)
• How useful is soundex?
• Not very – for information retrieval
• Okay for “high recall” tasks (e.g., Interpol),
though biased to names of certain nationalities
• Zobel and Dart (1996) show that other algorithms
for phonetic matching perform much better in
the context of IR
Introduction to
Information Retrieval
CS276: Information Retrieval and Web
Search
Pandu Nayak and Prabhakar Raghavan
Index construction
• How do we construct an index?
-Indexing: the process of constructing an index.
-Indexer: the machine that performs indexing.
• What strategies can we use with limited main
memory? (Real collections are too large to fit
into RAM.)
Sec. 4.1
Hardware basics
• Many design decisions in information
retrieval (algorithms and techniques) are
based on the characteristics of hardware
• We begin by reviewing hardware basics
Sec. 4.1
Hardware basics
• Access to data in memory is much faster than
access to data on disk.
[Figure: Memory hierarchy. From the processor downward: registers, cache,
main memory, hard disk. Levels closer to the processor are smaller, faster
and more expensive, and hold a subset of the data in the level below.]
Hardware basics
• Disk seeks: No data is transferred from disk while
the disk head is being positioned.
-Seek time: time for the disk head to reach the right track.
-Rotational delay: time for the disk to rotate until the target spot is directly under the head.
-Transfer time: time to transfer one block from disk to memory.
• Therefore: Transferring one large chunk of data
from disk to memory is faster than transferring
many small chunks.
• Disk I/O is block-based: Reading and writing of
entire blocks (as opposed to smaller chunks).
• Block sizes: 8KB to 256 KB.
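A rough cost model shows why one large transfer beats many small ones: every chunk pays the positioning cost once, while the transfer cost depends only on the total number of bytes. The 5 ms positioning time and 0.02 microseconds per byte used below are illustrative assumptions, not figures from these slides, and read_time_s is a hypothetical helper.

```python
def read_time_s(total_bytes, n_chunks, seek_s=5e-3, transfer_s_per_byte=2e-8):
    """Estimated time to read total_bytes split into n_chunks separate reads.

    seek_s lumps seek time and rotational delay into one per-chunk cost
    (assumed value); transfer_s_per_byte is an assumed transfer rate.
    """
    return n_chunks * seek_s + total_bytes * transfer_s_per_byte


one_chunk = read_time_s(10_000_000, n_chunks=1)      # one contiguous 10 MB read
many_chunks = read_time_s(10_000_000, n_chunks=100)  # same data in 100 pieces
```

Under these assumptions the single read takes about 0.2 s and the 100-chunk read about 0.7 s; the difference is entirely positioning overhead.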
Sec. 4.1
Hardware basics
• Buffer: the part of main memory that holds a
block of data being read from or written to disk.
• Servers (machines) used in IR systems now
typically have several GB of main memory,
sometimes tens of GB.
• Available disk space is several (2–3) orders of
magnitude larger.
• Fault tolerance (machines that do not fail) is very expensive: it is much
cheaper to use many regular machines (as a distributed system) than
one fault-tolerant machine. If one of the distributed machines
fails, its task is reassigned to another working machine.
Sec. 4.1
Key step
• After all documents have been parsed, the inverted file is
sorted by terms.
We focus on this sort step.
We have 100M items to sort.

Parsed order          Sorted by term
Term       Doc #      Term       Doc #
I          1          ambitious  2
did        1          be         2
enact      1          brutus     1
julius     1          brutus     2
caesar     1          caesar     1
I          1          caesar     2
was        1          caesar     2
killed     1          capitol    1
i'         1          did        1
the        1          enact      1
capitol    1          hath       2
brutus     1          I          1
killed     1          I          1
me         1          i'         1
so         2          it         2
let        2          julius     1
it         2          killed     1
be         2          killed     1
with       2          let        2
caesar     2          me         1
the        2          noble      2
noble      2          so         2
brutus     2          the        1
hath       2          the        2
told       2          told       2
you        2          was        1
caesar     2          was        2
was        2          with       2
ambitious  2          you        2
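The sort step on this small example can be sketched in Python. The two document strings are reconstructed from the term lists above; parse and sort_pairs are hypothetical helper names, and sorting case-insensitively by term and then by doc # is an assumption about the intended order. An in-memory sort like this only works while all pairs fit in RAM, which is exactly what fails at 100M items.

```python
doc1 = "I did enact julius caesar I was killed i' the capitol brutus killed me"
doc2 = "so let it be with caesar the noble brutus hath told you caesar was ambitious"


def parse(docs):
    """Emit (term, doc_id) pairs in the order the terms are read."""
    pairs = []
    for doc_id, text in docs:
        for term in text.split():
            pairs.append((term, doc_id))
    return pairs


def sort_pairs(pairs):
    """Sort by term (ignoring case), then by doc_id."""
    return sorted(pairs, key=lambda p: (p[0].lower(), p[1]))


pairs = parse([(1, doc1), (2, doc2)])
sorted_pairs = sort_pairs(pairs)
```

After sorting, all pairs for the same term are adjacent, so the postings list for each term can be built in a single pass over sorted_pairs.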