IR Chapter 2 Text Operations

Chapter Two discusses statistical properties of text, focusing on word frequency distribution and its implications for information retrieval systems. It introduces concepts such as Zipf's Law, Luhn's Ideas on word significance, and Heaps' Law regarding vocabulary size growth. The chapter emphasizes the importance of text preprocessing and tokenization in improving retrieval performance by filtering out non-significant words.

Uploaded by

Dawit Sebhat

Chapter Two

Text Operations

Statistical Properties of Text
• How is the frequency of different words distributed?
• How fast does vocabulary size grow with the size of a corpus?
◦ Such factors affect the performance of an IR system and can be used to select suitable term weights and other aspects of the system.
• A few words are very common.
◦ The two most frequent words (e.g. “the”, “of”) can account for about 10% of word occurrences.
• Most words are very rare.
◦ About half of the distinct words in a corpus appear only once; these are called hapax legomena (Greek for “read only once”).
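The two observations above can be checked on any text. The following is a minimal sketch, assuming a crude alphabetic tokenizer; the function name `frequency_profile` and the sample sentence are illustrative and not part of the slides.

```python
from collections import Counter
import re

def frequency_profile(text, top_k=2):
    """Return (share of occurrences covered by the top_k words,
    share of distinct words that occur exactly once)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    top_share = sum(c for _, c in counts.most_common(top_k)) / len(tokens)
    hapax_share = sum(1 for c in counts.values() if c == 1) / len(counts)
    return top_share, hapax_share

# Toy input; a real corpus is needed to see figures close to 10% and 50%.
sample = "the cat sat on the mat and the dog sat on the log"
top_share, hapax_share = frequency_profile(sample)
print(round(top_share, 3), hapax_share)
```

On a sentence this short the shares are far from the corpus-level figures quoted above; the sketch only shows how the two quantities are computed.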

Sample Word Frequency Data
[Table: sample word frequency data; contents not recoverable]

Word distribution: Zipf's Law
• Zipf's Law, named after the Harvard linguistics professor George Kingsley Zipf (1902-1950),
◦ attempts to capture the distribution of the frequencies (number of occurrences) of the words within a text.
• Zipf's Law states that when the distinct words in a text are ranked by frequency from most frequent to least frequent, the product of rank and frequency is approximately constant.
Frequency * Rank = constant

That is, if the words w in a collection are ranked by their frequency f, with r denoting the rank of each word, they roughly fit the relation:
r * f = c
◦ Different collections have different constants c.
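The relation r * f = c can be inspected by tabulating rank times frequency. This is a sketch under stated assumptions: the function name `rank_frequency` is hypothetical, and the token list is synthetic, built so that f = 60 / r holds exactly. Real collections only approximate a constant product.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    return [(r, w, f, r * f)
            for r, (w, f) in enumerate(counts.most_common(), start=1)]

# Synthetic frequencies chosen to follow f = 60 / r exactly (an assumption
# made for illustration, not data from the slides).
tokens = ["the"] * 60 + ["of"] * 30 + ["and"] * 20 + ["to"] * 15 + ["in"] * 12
rows = rank_frequency(tokens)
for rank, word, freq, product in rows:
    print(rank, word, freq, product)
```

Running the same function on a real corpus shows how far the product drifts at the highest and lowest ranks.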

Zipf’s distributions
Rank Frequency Distribution
For all the words in a collection of documents, for each word w:
• f is the frequency with which w appears
• r is the rank of w in order of frequency (the most commonly occurring word has rank 1, etc.)
[Figure: distribution of sorted word frequencies according to Zipf’s law; a word w has rank r and frequency f]

Example: Zipf's Law
• The table shows the most frequently occurring words from a collection of 336,310 documents containing 125,720,891 total words, of which 508,209 are unique.

Methods that Build on Zipf's Law
• Stop lists: Ignore the most frequent words (upper cut-off). Used by almost all systems.
• Significant words: Take words in between the most frequent (upper cut-off) and least frequent words (lower cut-off).
• Term weighting: Give differing weights to terms based on their frequency, with the most frequent words weighted less. Used by almost all ranking methods.
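One way to give the most frequent words less weight is a logarithmic inverse-frequency score. The slides do not state a formula, so the weighting log(N / f) below is an assumption chosen for illustration, and the name `inverse_frequency_weights` is hypothetical.

```python
import math
from collections import Counter

def inverse_frequency_weights(tokens):
    """Weight each word by log(N / f), where N is the total number of
    tokens and f is the word's count (an illustrative choice of weight)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: math.log(total / f) for w, f in counts.items()}

tokens = "the cat and the dog and the bird".split()
weights = inverse_frequency_weights(tokens)
for word in sorted(weights, key=weights.get):
    print(word, round(weights[word], 3))
```

The most frequent word gets the smallest weight and the words seen once get the largest, which matches the ordering the bullet describes.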

Zipf’s Law Impact on IR
◦ Good News: Stopwords account for a large fraction of text, so eliminating them greatly reduces inverted-index storage costs.
◦ Bad News: Most words are extremely rare, so gathering sufficient data for meaningful statistical analysis (e.g. correlation analysis for query expansion) is difficult.

Word significance: Luhn’s Ideas
• Luhn's idea (1958): the frequency of word occurrence in a text furnishes a useful measurement of word significance.
• Luhn suggested that both extremely common and extremely uncommon words are not very useful for indexing.
• For this, Luhn specified two cut-off points, an upper and a lower cut-off, on the basis of which non-significant words are excluded.
• Words exceeding the upper cut-off were considered to be common.
• Words below the lower cut-off were considered to be rare.
• Hence neither group contributes significantly to the content of the text.
• The ability of words to discriminate content reaches a peak at a rank-order position halfway between the two cut-offs.
• Let f be the frequency of occurrence of words in a text, and r their rank in decreasing order of word frequency; then a plot relating f and r yields a roughly hyperbolic curve, as the relation r * f = c implies.

Luhn’s Ideas

Luhn (1958) suggested that both extremely common and extremely uncommon words were not very useful for document representation and indexing.
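Luhn's two cut-offs can be sketched as a simple frequency filter. The cut-off values below are arbitrary assumptions (the slides give none), and `significant_words` is a hypothetical helper name.

```python
from collections import Counter

def significant_words(tokens, lower_cutoff, upper_cutoff):
    """Keep words whose frequency f satisfies lower_cutoff <= f <= upper_cutoff."""
    counts = Counter(tokens)
    return sorted(w for w, f in counts.items()
                  if lower_cutoff <= f <= upper_cutoff)

# Toy text: "the" and "of" are too common, "hyperbola" is too rare.
tokens = ("the " * 8 + "of " * 6 + "retrieval " * 3
          + "index " * 3 + "zipf " * 2 + "hyperbola").split()
kept = significant_words(tokens, lower_cutoff=2, upper_cutoff=5)
print(kept)  # ['index', 'retrieval', 'zipf']
```

In practice the cut-offs have to be tuned per collection, since a fixed count means something different in a short abstract than in a large corpus.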

Vocabulary size: Heaps’ Law
• How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus?
◦ This determines how the size of the inverted index will scale with the size of the corpus.
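Vocabulary growth can be measured directly by counting distinct words as tokens are read. Heaps' Law is commonly written V = K * n^β, where n is the number of tokens and V the vocabulary size; the slide does not give this formula or any constants, so K and β in the sketch below are parameters that would have to be fitted per corpus. The function names are hypothetical.

```python
import re

def vocabulary_growth(text, step):
    """Return (tokens_seen, distinct_words_seen) pairs sampled every `step` tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    seen, points = set(), []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0 or i == len(tokens):
            points.append((i, len(seen)))
    return points

def heaps_estimate(n, k, beta):
    """Heaps' Law estimate V = k * n**beta; k and beta are corpus-specific."""
    return k * n ** beta

text = "a rose is a rose is a rose and a daisy is a daisy"
print(vocabulary_growth(text, step=4))
```

Plotting the sampled points for a real corpus and fitting k and beta to them gives an estimate of how the dictionary of an inverted index will grow.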

Text Operations
• Not all words in a document are equally significant in representing its contents or meaning
◦ Some words carry more meaning than others
◦ Nouns are the most representative of a document's content
• Therefore, the text of each document in a collection needs to be preprocessed before its words are used as index terms
• Preprocessing is the process of controlling the size of the vocabulary, i.e. the number of distinct words used as index terms
◦ Preprocessing will lead to an improvement in information retrieval performance
• However, some search engines on the Web omit preprocessing
◦ Every word in the document is an index term

• Text operations transform text into logical representations
• The main operations for selecting index terms are:
◦ Lexical analysis/tokenization of the text: handling digits, hyphens, punctuation marks, and the case of letters
◦ Elimination of stop words: filter out words that are not useful in the retrieval process
◦ Stemming: remove affixes (prefixes and suffixes)
◦ Construction of term categorization structures, such as a thesaurus or word list, to capture relationships that allow the original query to be expanded with related terms

Generating Document Representatives
• Text Processing System
◦ Input text: full text, abstract or title
◦ Output: a document representative adequate for use in an automatic retrieval system
[Figure: documents → tokenization → stop-word removal → stemming → thesaurus → index terms]
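The first three stages of the pipeline can be sketched end to end. Everything here is an assumption made for illustration: the stop list is a tiny hand-picked set, the stemmer is a toy suffix stripper (not the Porter algorithm), and the thesaurus stage is left out.

```python
import re

# Illustrative stop list, not a standard one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}
# Toy suffix stripper, NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def tokenize(text):
    """Lowercase the text and keep unbroken alphabetic strings."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """documents -> tokenization -> stop-word removal -> stemming -> index terms"""
    return sorted({stem(t) for t in remove_stop_words(tokenize(text))})

terms = index_terms("The indexing of documents and the ranking of indexed documents")
print(terms)  # ['document', 'index', 'rank']
```

Ten input words collapse to three index terms, which is the vocabulary reduction that preprocessing is meant to achieve.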

Lexical Analysis/Tokenization of Text
• Convert the text of the documents into words to be adopted as index terms
• Objective: identify the words in the text
◦ Issues to handle: digits, hyphens, punctuation marks, case of letters
◦ Numbers (like 1910, 1999) are usually not good index terms, but a number such as 510 B.C. is distinctive enough to keep
• Hyphens: break up hyphenated words (e.g. state-of-the-art = state of the art), but some words (e.g. gilt-edged, B-49) are unique words that require their hyphens
• Punctuation marks: remove entirely unless significant, e.g. in program code, where x.exe and xexe are different
• Case of letters: usually not important, so all text can be converted to upper or lower case
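The three rules above can be sketched as a small normalizer. The exception set `KEEP_HYPHENATED` is a hypothetical list built from the slide's own examples, and stripping only surrounding punctuation is an assumption made so that an internal dot, as in x.exe, survives.

```python
KEEP_HYPHENATED = {"gilt-edged", "b-49"}  # hypothetical exception list

def normalize(text):
    """Lowercase, strip surrounding punctuation, split most hyphenated words."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'()")  # surrounding punctuation only
        if not word:
            continue
        if "-" in word and word not in KEEP_HYPHENATED:
            tokens.extend(part for part in word.split("-") if part)
        else:
            tokens.append(word)
    return tokens

tokens = normalize("State-of-the-art, gilt-edged bonds; run x.exe.")
print(tokens)
# ['state', 'of', 'the', 'art', 'gilt-edged', 'bonds', 'run', 'x.exe']
```

A fixed exception list does not scale; it only shows that hyphen handling needs per-term decisions rather than one global rule.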

Tokenization
• Analyze text into a sequence of discrete tokens (words).
• Input: “Friends, Romans and Countrymen”
• Output: tokens (a token is an instance of a sequence of characters that are grouped together as a useful semantic unit for processing)
◦ Friends, Romans, and, Countrymen
• Each such token is now a candidate for an index entry, after further processing
• But what are valid tokens to emit?
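The input/output example above can be reproduced with a one-line splitter. This is a minimal sketch that assumes every run of non-alphanumeric characters is a token boundary and that case is preserved; the function name is illustrative.

```python
import re

def tokenize(text):
    """Split on anything that is not a letter or digit; keep original case."""
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", text) if tok]

tokens = tokenize("Friends, Romans and Countrymen")
print(tokens)  # ['Friends', 'Romans', 'and', 'Countrymen']
```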

Issues in Tokenization
• One word or multiple: how do you decide whether a sequence is one token, or two or more?
◦ Hewlett-Packard → Hewlett and Packard as two tokens?
▪ state-of-the-art: break up the hyphenated sequence.
▪ San Francisco, Los Angeles
▪ Addis Ababa, Bahir Dar
◦ lowercase, lower-case, lower case?
▪ data base, database, data-base
• Numbers:
▪ dates (3/12/91 vs. Mar. 12, 1991);
▪ phone numbers,
▪ IP addresses (100.2.86.144)
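Some of the cases listed above can be kept intact with an ordered regular expression, where more specific patterns are tried before generic ones. The patterns are assumptions written for the slide's examples only; they do not validate real dates or addresses, and multiword names such as San Francisco still come out as two tokens.

```python
import re

# Ordered alternatives: the first pattern that matches at a position wins.
TOKEN_PATTERN = re.compile(
    r"""
      \d{1,3}(?:\.\d{1,3}){3}      # IPv4-like address, e.g. 100.2.86.144
    | \d{1,2}/\d{1,2}/\d{2,4}      # numeric date, e.g. 3/12/91
    | [A-Za-z]+(?:-[A-Za-z]+)+     # hyphenated word, e.g. Hewlett-Packard
    | [A-Za-z]+                    # plain word
    | \d+                          # plain number
    """,
    re.VERBOSE,
)

def tokenize(text):
    """Return tokens, keeping IP addresses, numeric dates and hyphenated words whole."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("Hewlett-Packard shipped to 100.2.86.144 on 3/12/91 from San Francisco")
print(tokens)
```

Whether Hewlett-Packard should stay whole or be split is a design decision; this sketch keeps it whole, and a phrase list would be needed to treat San Francisco as one token.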
• How should special cases involving apostrophes, hyphens, etc. be handled? C++, C#, URLs, emails, …
◦ Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token.
◦ However, frequently they are not.
• The simplest approach is to ignore all numbers and punctuation and use only case-insensitive unbroken strings of alphabetic characters as tokens.
◦ Generally, numbers are not indexed as text, even though they are often very useful. Systems often index “meta-data”, including creation date, format, etc., separately.
• Issues of tokenization are language-specific
◦ Tokenization requires the language to be known
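The simplest approach fits in one regular expression, and running it on the earlier examples shows what it discards. This is a sketch for English text only; the function name `simple_tokens` is illustrative.

```python
import re

def simple_tokens(text):
    """Case-insensitive unbroken strings of alphabetic characters."""
    return re.findall(r"[a-z]+", text.lower())

tokens = simple_tokens("Republican e-mail about C++ in 1999")
print(tokens)  # ['republican', 'e', 'mail', 'about', 'c', 'in']
```

The year is dropped, C++ becomes c, e-mail splits in two, and Republican can no longer be told apart from republican, which is the trade-off described in the bullets above.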

Similarity Measure
• A similarity measure is a function that computes the degree of similarity between two vectors.
• Using a similarity measure between the query and each document:
◦ It is possible to rank the retrieved documents in the
order of presumed relevance.
◦ It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.
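Ranking and thresholding can be sketched with cosine similarity over raw term counts. The slides do not name a specific measure, so cosine is an assumption here, as are the function names, the toy documents and the threshold value.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank(query, documents, threshold=0.0):
    """Score each document against the query, drop scores at or below
    the threshold, and return (score, document) pairs, best first."""
    q = Counter(query.lower().split())
    scored = [(cosine_similarity(q, Counter(d.lower().split())), d)
              for d in documents]
    return sorted([(s, d) for s, d in scored if s > threshold], reverse=True)

docs = ["zipf law word frequency", "heaps law vocabulary growth", "stop word removal"]
results = rank("word frequency", docs, threshold=0.1)
for score, doc in results:
    print(round(score, 3), doc)
```

The document that shares no terms with the query scores 0 and is removed by the threshold; the other two come back ordered by score.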