Lec 9

The document discusses context-sensitive spell correction techniques, emphasizing the need for surrounding context and various methods to suggest corrections for misspelled phrases. It also introduces the Soundex algorithm for phonetic matching, detailing its process and limitations in information retrieval. Additionally, the document covers index construction strategies and hardware considerations for effective data management in information retrieval systems.

Uploaded by

menaahmed15200

Sec. 3.3.

Context-sensitive spell correction

• Text: I flew from Heathrow to Narita.
• Consider the phrase query “flew form Heathrow”
• We’d like to respond
  Did you mean “flew from Heathrow”?
  because no docs matched the query phrase.

Sec. 3.3.5

Context-sensitive correction
• Need surrounding context to catch this.
• First idea: retrieve dictionary terms close (in weighted edit distance) to each query term
• Now try all possible resulting phrases with one word “fixed” at a time
– flew from heathrow
– fled form heathrow
– flea form heathrow
• Hit-based spelling correction: Suggest the alternative with the most hits in the query logs.
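
The idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the candidate lists and the hit counts are made-up toy data, and the names `CANDIDATES`, `PHRASE_HITS`, `one_word_fixes` and `suggest` are hypothetical, not part of the lecture.

```python
# Toy data (assumed for illustration): dictionary terms close to each query term,
# and how often each full phrase was seen in a query log.
CANDIDATES = {
    "flew": ["flew", "fled", "flea"],
    "form": ["form", "from"],
    "heathrow": ["heathrow"],
}
PHRASE_HITS = {
    "flew from heathrow": 120,
    "fled form heathrow": 1,
    "flea form heathrow": 0,
}


def one_word_fixes(terms, candidates):
    """Return every phrase obtained by replacing exactly one term with a nearby term."""
    phrases = []
    for i, term in enumerate(terms):
        for alt in candidates.get(term, []):
            if alt != term:
                phrases.append(terms[:i] + [alt] + terms[i + 1:])
    return phrases


def suggest(query, candidates, hits):
    """Pick the one-word fix with the most hits (None if there is no alternative)."""
    terms = query.lower().split()
    alternatives = [" ".join(p) for p in one_word_fixes(terms, candidates)]
    return max(alternatives, key=lambda p: hits.get(p, 0), default=None)


print(suggest("flew form Heathrow", CANDIDATES, PHRASE_HITS))
```

With this toy data the suggestion is “flew from heathrow”, because it is the only alternative with a large hit count.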
Sec. 3.3.5

Another approach
• Break phrase query into a conjunction of biwords.
• Look for biwords that need only one term corrected.
• Enumerate only phrases containing “common” biwords.
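
A minimal sketch of the biword filter, assuming a small hand-made set of “common” biwords. The set `COMMON_BIWORDS` and the function names are hypothetical, and “containing common biwords” is read here as “every biword of the phrase is common”, which is one possible interpretation.

```python
# Assumed toy set of biwords that occur often in the collection.
COMMON_BIWORDS = {("flew", "from"), ("from", "heathrow")}


def biwords(terms):
    """Break a phrase (list of terms) into its consecutive term pairs."""
    return list(zip(terms, terms[1:]))


def keep_common(phrases, common):
    """Keep only the candidate phrases whose biwords are all common."""
    return [p for p in phrases if all(bw in common for bw in biwords(p))]


candidate_phrases = [
    ["flew", "from", "heathrow"],
    ["fled", "form", "heathrow"],
    ["flea", "form", "heathrow"],
]
print(keep_common(candidate_phrases, COMMON_BIWORDS))
```

Only the phrases that survive this filter need to be checked against the index or the query log, which keeps the enumeration small.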

SOUNDEX

Sec. 3.4

Soundex
• Class of heuristics to expand a query into phonetic equivalents
– Language specific – mainly for names
– E.g., chebyshev → tchebycheff
• Invented for the U.S. census … in 1918

Sec. 3.4

Soundex – typical algorithm

• Turn every token to be indexed into a 4-character reduced form
• Do the same with query terms
• Build a Soundex index on the reduced forms
– (when the query calls for a Soundex match, search the Soundex index)

Sec. 3.4

Soundex – typical algorithm

1. Retain the first letter of the word.
2. Change all occurrences of the following letters to '0' (zero):
'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.
3. Change letters to digits as follows:
• B, F, P, V → 1
• C, G, J, K, Q, S, X, Z → 2
• D, T → 3
• L → 4
• M, N → 5
• R → 6
Sec. 3.4

Soundex continued
4. Repeatedly remove one out of each pair of consecutive identical digits.
5. Remove all zeros from the resulting string.
6. Pad the resulting string with trailing zeros and return the first four positions, which will be of the form <uppercase letter> <digit> <digit> <digit>.

E.g., Herman becomes H655.
Will hermann generate the same code?
What codes are generated for “Osama”, “Osmaa”, “Mehmed”, and “Mohamed”?
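
The six steps can be tried out with a short sketch. Two assumptions are made here: step 4 is read as collapsing each run of identical digits to a single digit, and only the letters after the first are turned into digits. The names `SOUNDEX_CODES` and `soundex` are illustrative.

```python
# Letter-to-digit table from steps 2 and 3.
SOUNDEX_CODES = {}
for letters, digit in [
    ("AEIOUHWY", "0"),
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
]:
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def soundex(term):
    """Return the 4-character reduced form of a term ('' if it has no letters A-Z)."""
    letters = [ch for ch in term.upper() if ch in SOUNDEX_CODES]
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]          # step 1: retain the first letter
    digits = [SOUNDEX_CODES[ch] for ch in rest]    # steps 2 and 3
    collapsed = []
    for d in digits:                               # step 4: collapse repeated digits
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    code = "".join(d for d in collapsed if d != "0")  # step 5: remove zeros
    return (first + code + "000")[:4]              # step 6: pad and truncate


for name in ["Herman", "Hermann", "Osama", "Osmaa", "Mehmed", "Mohamed"]:
    print(name, soundex(name))
```

Running the loop is a quick way to check hand-computed answers to the questions above.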
Sec. 3.4

Soundex
• Soundex is the classic algorithm, provided by
most databases (Oracle, Microsoft, …)
• How useful is soundex?
• Not very – for information retrieval
• Okay for “high recall” tasks (e.g., Interpol),
though biased to names of certain nationalities
• Zobel and Dart (1996) show that other algorithms
for phonetic matching perform much better in
the context of IR

Introduction to
Information Retrieval
CS276: Information Retrieval and Web
Search
Pandu Nayak and Prabhakar Raghavan

Lecture 4: Index Construction

Ch. 4

Index construction
• How do we construct an index?
– Indexing: the process of constructing an index.
– Indexer: the machine that performs indexing.
• What strategies can we use with limited main memory? (Real collections are too big to fit into RAM.)
Sec. 4.1

Hardware basics
• Many design decisions in information retrieval (algorithms and techniques) are based on the characteristics of hardware
• We begin by reviewing hardware basics
Sec. 4.1

Hardware basics
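• Access to data in memory is much faster than access to data on disk.
• Memory hierarchy (fastest to slowest): processor registers, cache, main memory, hard disk.
– Levels closer to the processor are smaller, faster, and more expensive.
– Each level holds a subset of the data in the level below it.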
Hardware basics
• Disk seeks: No data is transferred from disk while the disk head is being positioned.
– Seek time: time for the disk head to reach the right track.
– Rotational delay: time for the disk to rotate until the desired sector is directly under the head.
– Transfer time: time to transfer one block from disk to memory.
• Therefore: Transferring one large chunk of data from disk to memory is faster than transferring many small chunks.
• Disk I/O is block-based: Reading and writing of entire blocks (as opposed to smaller chunks).
• Block sizes: 8 KB to 256 KB.
Sec. 4.1

Hardware basics
• Buffer: the part of main memory that holds a block of data being read from or written to disk.
• Servers (machines) used in IR systems now typically have several GB of main memory, sometimes tens of GB.
• Available disk space is several (2–3) orders of magnitude larger.
• Fault tolerance (machines that do not fail) is very expensive: it’s much cheaper to use many regular machines (as a distributed system) than one fault-tolerant machine. If one of the distributed machines fails, its task is reassigned to another working machine.
Sec. 4.1

Hardware assumptions for this lecture

symbol  statistic                                        value
s       average seek time                                5 ms = 5 × 10^−3 s
b       transfer time per byte                           0.02 μs = 2 × 10^−8 s
        processor’s clock rate (access a byte from RAM)  10^9 s^−1
        size of main memory                              several GB
        size of disk space                               1 TB or more
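
As a rough worked example of why one large transfer beats many small ones, the seek time s and the per-byte transfer time b from the table can be plugged into a simple cost model. The model is an assumption made for illustration: one seek per chunk, rotational delay ignored, and a 10 MB read chosen arbitrarily.

```python
SEEK_TIME_S = 5e-3               # s: average seek time from the table
TRANSFER_TIME_PER_BYTE_S = 2e-8  # b: transfer time per byte from the table


def read_time_seconds(total_bytes, num_chunks):
    """Estimated time to read total_bytes split into num_chunks separate chunks."""
    return num_chunks * SEEK_TIME_S + total_bytes * TRANSFER_TIME_PER_BYTE_S


ten_mb = 10_000_000
one_chunk = read_time_seconds(ten_mb, 1)
many_chunks = read_time_seconds(ten_mb, 1000)
print(f"1 chunk: {one_chunk:.3f} s, 1000 chunks: {many_chunks:.3f} s")
```

Under this model the single contiguous read takes about 0.205 s, while the same data in 1000 scattered chunks takes about 5.2 s, almost all of it seek time.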
Sec. 4.2

RCV1: Our collection for this lecture

• Shakespeare’s collected works definitely aren’t
large enough for demonstrating many of the
points in this course.
• The collection we’ll use isn’t really large enough
either, but it’s publicly available and is at least a
more plausible example.
• As an example for applying scalable index
construction algorithms, we will use the Reuters
RCV1 collection.
• This is one year of Reuters newswire (part of 1996 and 1997)
Sec. 4.2

A Reuters RCV1 document
Sec. 4.2

Reuters RCV1 statistics

symbol  statistic                                       value
N       documents                                       800,000
L       avg. # tokens per doc                           200
M       terms (distinct)                                400,000
        avg. # bytes per token (incl. spaces/punct.)    6
        avg. # bytes per token (without spaces/punct.)  4.5
        avg. # bytes per term                           7.5
        non-positional postings                         100,000,000

4.5 bytes per word token vs. 7.5 bytes per word term: why?
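
One plausible explanation is that very frequent words tend to be short: they are counted many times in the token average but only once in the term average. The toy sentence below is made up purely to show that effect and is not taken from RCV1.

```python
# Assumed toy text: the short word "the" repeats, the long word occurs once.
text = "the cat sat on the mat with the extraordinary hat"

tokens = text.split()   # every occurrence counts
terms = set(tokens)     # each distinct word counts once

avg_token_len = sum(len(t) for t in tokens) / len(tokens)
avg_term_len = sum(len(t) for t in terms) / len(terms)

print(f"avg. bytes per token: {avg_token_len:.2f}")
print(f"avg. bytes per term:  {avg_term_len:.2f}")
```

Here the average is 4.00 bytes per token and 4.25 bytes per term; in a real collection with many repeated short words the gap is much wider.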
Sec. 4.2

Recall IIR 1 index construction

• Documents are parsed to extract words and these are saved with the Document ID.
• Sorted by term (primary key) and, if a word is duplicated, by doc ID (secondary key)
• Sorting step was done in main memory (chapter 1)

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Term       Doc #
I          1
did        1
enact      1
julius     1
caesar     1
I          1
was        1
killed     1
i'         1
the        1
capitol    1
brutus     1
killed     1
me         1
so         2
let        2
it         2
be         2
with       2
caesar     2
the        2
noble      2
brutus     2
hath       2
told       2
you        2
caesar     2
was        2
ambitious  2
Sec. 4.2

Key step

• After all documents have been parsed, the inverted file is sorted by terms.

We focus on this sort step.
We have 100M items to sort.

Before sorting        After sorting
Term       Doc #      Term       Doc #
I          1          ambitious  2
did        1          be         2
enact      1          brutus     1
julius     1          brutus     2
caesar     1          caesar     1
I          1          caesar     2
was        1          caesar     2
killed     1          capitol    1
i'         1          did        1
the        1          enact      1
capitol    1          hath       2
brutus     1          I          1
killed     1          I          1
me         1          i'         1
so         2          it         2
let        2          julius     1
it         2          killed     1
be         2          killed     1
with       2          let        2
caesar     2          me         1
the        2          noble      2
noble      2          so         2
brutus     2          the        1
hath       2          the        2
told       2          told       2
you        2          was        1
caesar     2          was        2
was        2          with       2
ambitious  2          you        2
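
The parse-then-sort step can be reproduced in memory for the two example documents. This is a minimal sketch: the regular-expression tokenizer and the lowercasing of every token are simplifying assumptions, and the names `DOCS` and `parse` are illustrative.

```python
import re

DOCS = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}


def parse(doc_id, text):
    """Extract (term, doc ID) pairs from one document, lowercasing each token."""
    return [(token.lower(), doc_id) for token in re.findall(r"[A-Za-z']+", text)]


# Collect the pairs in document order, then sort by term (primary key)
# and doc ID (secondary key); tuple comparison gives exactly that order.
pairs = [pair for doc_id, text in DOCS.items() for pair in parse(doc_id, text)]
pairs.sort()

for term, doc_id in pairs:
    print(term, doc_id)
```

Sorting 29 pairs in memory is trivial; the point of the lecture is that the same step on 100M pairs does not fit comfortably in RAM and needs a disk-aware strategy.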
