
Chapter Two

Text Operations

Statistical Properties of Text
 How is the frequency of different words distributed?
 How fast does vocabulary size grow with the size of a corpus?
◦ Such factors affect the performance of an IR system & can be used to select
suitable term weights & other aspects of the system.
 A few words are very common.
◦ The 2 most frequent words (e.g. “the”, “of”) can account for about 10% of word
occurrences.
Statistical Properties…
 Most words are very rare.
◦ Half the words in a corpus appear only once; such words are
called hapax legomena (Greek for “read only once”)
Sample Word Frequency Data

Word distribution: Zipf's Law
 Zipf's Law, named after the Harvard linguistics professor
George Kingsley Zipf (1902-1950),
◦ attempts to capture the distribution of the frequencies
(number of occurrences) of the words within a text.
 Zipf's Law states that when the distinct words in a text
are ranked by frequency from most frequent to least
frequent, the product of rank and frequency is a constant.
Zipf's Law...
Frequency * Rank = constant
 That is, if the words, w, in a collection are ranked, r,
by their frequency, f, they roughly fit the relation:
r * f = c
◦ Different collections have different constants c.
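As a quick illustrative check (not part of the original slides), the relation r * f ≈ c can be inspected on any word-frequency list; the toy corpus below is made up purely for the sketch:

```python
from collections import Counter

# Toy corpus; any tokenized collection would do.
tokens = "the cat sat on the mat while the dog sat on the rug".split()

# Rank words from most to least frequent (rank 1 = most frequent).
ranked = Counter(tokens).most_common()

for rank, (word, freq) in enumerate(ranked, start=1):
    # Under Zipf's law, rank * freq stays roughly constant.
    print(rank, word, freq, rank * freq)
```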

Zipf’s distributions
Rank-Frequency Distribution
For all the words in a collection of documents, for each word w:
• f : the frequency with which w appears
• r : the rank of w in order of frequency (the most commonly occurring word has rank 1,
etc.)
[Figure: distribution of sorted word frequencies according to Zipf’s law; a word w with rank r has frequency f]
Example: Zipf's Law
 The table shows the most frequently occurring words
from a 336,310-document collection containing 125,720,891
total words, of which 508,209 are unique words
Methods that Build on Zipf's Law
• Stop lists: Ignore the most frequent words
(upper cut-off). Used by almost all systems.
• Significant words: Take words in between the
most frequent (upper cut-off) and least frequent
words (lower cut-off).
• Term weighting: Give differing weights to terms
based on their frequency, with the most frequent
words weighted less. Used by almost all ranking
methods.
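A minimal sketch of how such frequency-based cut-offs might be applied; the thresholds below are arbitrary illustrative choices, not values prescribed by the slides:

```python
from collections import Counter

def significant_terms(tokens, upper_cutoff=2, lower_cutoff=1):
    """Drop the `upper_cutoff` most frequent words (stop-list candidates)
    and words occurring at most `lower_cutoff` times (too rare),
    keeping the 'significant' words in between."""
    counts = Counter(tokens)
    stop_words = {w for w, _ in counts.most_common(upper_cutoff)}
    return {w: f for w, f in counts.items()
            if w not in stop_words and f > lower_cutoff}

tokens = "the cat sat on the mat and the dog sat on the rug".split()
print(significant_terms(tokens))   # e.g. {'on': 2} with these thresholds
```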
Zipf’s Law Impact on IR
◦ Good News: Stopwords will account for a large fraction
of text, so eliminating them greatly reduces inverted-index
storage costs.
◦ Bad News: For most words, gathering sufficient data for
meaningful statistical analysis (e.g. for correlation analysis
for query expansion) is difficult since they are extremely
rare.
Word significance: Luhn’s Ideas
 Luhn’s idea (1958): the frequency of word occurrence in a text
furnishes a useful measurement of word significance.
 Luhn suggested that both extremely common and extremely
uncommon words were not very useful for indexing.
 For this, Luhn specifies two cut-off points: an upper and a
lower cut-off, based on which non-significant words are
excluded
Word significance: Luhn’s Ideas
 The words exceeding the upper cut-off were considered to be
common
 The words below the lower cut-off were considered to be rare
 Hence they are not contributing significantly to the content of the
text
 The ability of words to discriminate content reaches a peak at a
rank-order position halfway between the two cut-offs
 Let f be the frequency of occurrence of words in a text, and r their
rank in decreasing order of word frequency; then a plot relating f
and r shows the significant words lying between the two cut-offs
Luhn’s Ideas
 Luhn (1958) suggested that both extremely common and
extremely uncommon words were not very useful for document
representation & indexing.
Vocabulary size: Heaps’ Law
 How does the size of the overall vocabulary (number of
unique words) grow with the size of the corpus?
◦ This determines how the size of the inverted index will
scale with the size of the corpus.
Vocabulary Growth: Heaps’ Law
 Heaps’ law estimates the size of the vocabulary in a
given corpus
◦ The vocabulary size grows as O(n^β), where β is a constant
between 0 and 1.
◦ If V is the size of the vocabulary and n is the length of the corpus
in words, Heaps’ law provides the following equation:
V = K n^β
 where the constants are typically:
◦ K ≈ 10−100
◦ β ≈ 0.4−0.6 (approx. square-root)
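As an illustrative calculation (values chosen from the ranges above, not taken from the slides): with K = 50 and β = 0.5, a corpus of n = 1,000,000 words gives V = 50 · 1,000,000^0.5 = 50 · 1,000 = 50,000 distinct words.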
Heaps’ distributions
• Distribution of the size of the vocabulary: on a log-log plot there is a
linear relationship between vocabulary size and the number of
tokens
 Example: from 1,000,000,000 documents, there
may be about 1,000,000 distinct words. Can you agree?
Example
 We want to estimate the size of the vocabulary
for a corpus of 1,000,000 words. However, we
only know statistics computed on smaller
corpus sizes:
◦ For 100,000 words, there are 50,000 unique words
◦ For 500,000 words, there are 150,000 unique words
◦ Estimate the vocabulary size for the 1,000,000-word
corpus.
◦ How about for a corpus of 1,000,000,000 words?
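One possible way to work the exercise (a sketch, not an official solution): the two known corpus sizes give two equations of the form V = K·n^β, which can be solved for β and K and then extrapolated:

```python
import math

# Known (corpus size, vocabulary size) pairs from the exercise.
n1, v1 = 100_000, 50_000
n2, v2 = 500_000, 150_000

# V = K * n**beta  =>  beta = log(v2/v1) / log(n2/n1),  K = v1 / n1**beta
beta = math.log(v2 / v1) / math.log(n2 / n1)
K = v1 / n1 ** beta

for n in (1_000_000, 1_000_000_000):
    print(n, round(K * n ** beta))   # estimated vocabulary size
```

With these two data points the fit gives β ≈ 0.68, and the extrapolation comes out at roughly 240,000 distinct words for the 1,000,000-word corpus and about 27 million for the 1,000,000,000-word corpus.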
Text Operations
 Not all words in a document are equally significant to
represent the contents/meaning of a document
◦ Some words carry more meaning than others
◦ Nouns are the most representative of a
document’s content
 Therefore, one needs to preprocess the text of each
document in a collection to obtain the terms to be used as
index terms
Text Operations…
 Preprocessing is the process of controlling the size of the
vocabulary, i.e. the number of distinct words used as index terms
◦ Preprocessing will lead to an improvement in information
retrieval performance
 However, some search engines on the Web omit preprocessing
◦ Every word in the document is an index term
 Text operations are the process of transforming text into a logical
representation
 The main operations for selecting index terms are:
 Lexical analysis/tokenization of the text - handling digits, hyphens, punctuation marks, and the
case of letters
 Elimination of stop words - filter out words which are not useful in the retrieval
process
 Stemming words - remove affixes (prefixes and suffixes)
 Construction of term categorization structures, such as a thesaurus/wordlist, to capture
relationships allowing the expansion of the original query with related terms
Generating Document Representatives
 Text Processing System
◦ Input text – full text, abstract or title
◦ Output – a document representative adequate for use in an
automatic retrieval system
 Pipeline: documents → Tokenization → stop-word removal → stemming → Thesaurus → Index terms
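A toy end-to-end sketch of this pipeline (the stop-word list and the crude suffix-stripping "stemmer" below are simplistic stand-ins chosen for illustration, not the components a production system would use):

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "were"}  # tiny illustrative list

def tokenize(text):
    # Lexical analysis: lowercase and keep unbroken alphabetic strings.
    return re.findall(r"[a-z]+", text.lower())

def stem(word):
    # Crude suffix stripping, standing in for a real stemmer (e.g. Porter).
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(document):
    tokens = tokenize(document)                          # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [stem(t) for t in tokens]                     # stemming

print(index_terms("The cats were sleeping in the living room."))
# -> ['cat', 'sleep', 'liv', 'room']
```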
Lexical Analysis/Tokenization of Text
 Change the text of the documents into words to be adopted
as index terms
 Objective – identify words in the text
◦ Digits, hyphens, punctuation marks, case of letters
◦ Numbers are usually not good index terms (like 1910, 1999);
but some, like 510 B.C., are unique
Lexical Analysis…
 Hyphens – break up hyphenated words (e.g. state-of-the-art = state of
the art), but some words, e.g. gilt-edged, B-49, are unique words
which require hyphens
 Punctuation marks – remove totally unless significant,
e.g. in program code: x.exe versus xexe
 Case of letters – usually not important; can convert all to
upper or lower case
Tokenization
 Analyze text into a sequence of discrete tokens (words).
 Input: “Friends, Romans and Countrymen”
 Output: tokens (a token is an instance of a sequence of characters that are
grouped together as a useful semantic unit for processing)
◦ Friends, Romans, and, Countrymen
 Each such token is now a candidate for an index entry,
after further processing
 But what are valid tokens to emit?


Issues in Tokenization
 One word or multiple: How do you decide whether it is one token or
two or more?
◦ Hewlett-Packard → Hewlett and Packard as two tokens?
 state-of-the-art: break up the hyphenated sequence?
 San Francisco, Los Angeles
 Addis Ababa, Bahir Dar
◦ lowercase, lower-case, lower case?
 data base, database, data-base
• Numbers:
 dates (3/12/91 vs. Mar. 12, 1991)
 phone numbers
 IP addresses (100.2.86.144)
Issues in Tokenization
 How to handle special cases involving apostrophes, hyphens,
etc.? C++, C#, URLs, emails, …
◦ Sometimes punctuation (e-mail), numbers (1999), and case
(Republican vs. republican) can be a meaningful part of a
token.
◦ However, frequently they are not.
Issues in Tokenization
 The simplest approach is to ignore all numbers and punctuation and
use only case-insensitive unbroken strings of alphabetic
characters as tokens.
◦ Generally, don’t index numbers as text, but they are often very useful. Systems will often
index “meta-data”, including creation date, format, etc., separately
 Issues of tokenization are language-specific
◦ This requires the language to be known
Exercise: Tokenization
 The cat slept peacefully in the living room. It’s a
very old cat.
 Mr. O’Neill thinks that the boys’ stories about
Chile’s capital aren’t amusing.
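A naive regex-based tokenizer applied to the exercise sentences (one possible answer sketch; real tokenizers treat apostrophes, abbreviations such as "Mr.", and case far more carefully):

```python
import re

sentences = [
    "The cat slept peacefully in the living room. It's a very old cat.",
    "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.",
]

for s in sentences:
    # Keep runs of letters and apostrophes; everything else ends a token.
    print(re.findall(r"[A-Za-z']+", s))
```

Note how tokens such as It's, O'Neill and boys' come out of this naive rule; that is exactly the apostrophe-handling issue the previous slides flag.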
Term Weights: Term Frequency
 More frequent terms in a document are more
important, i.e. more indicative of the topic.
fij = frequency of term i in document j
 May want to normalize term frequency (tf) by
dividing by the frequency of the most common
term in the document:
tfij = fij / maxi{fij}
Term Weights: Inverse Document Frequency
 Terms that appear in many different documents are
less indicative of the overall topic.
dfi = document frequency of term i
= number of documents containing term i
idfi = inverse document frequency of term i
= log2 (N / dfi)
(N: total number of documents)
 An indication of a term’s discrimination power.
 Log used to dampen the effect relative to tf.
TF-IDF Weighting
 A typical combined term-importance indicator
is tf-idf weighting:
wij = tfij · idfi = tfij · log2 (N / dfi)
 A term occurring frequently in the document
but rarely in the rest of the collection is given
high weight.
 Many other ways of determining term weights
have been proposed.
 Experimentally, tf-idf has been found to work
well.
Computing TF-IDF -- An Example
Given a document containing terms with given frequencies:
A(3), B(2), C(1)
Assume the collection contains 10,000 documents and the
document frequencies of these terms are:
A(50), B(1300), C(250)
Then:
A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6
B: tf = 2/3; idf = log2(10000/1300) = 2.9; tf-idf = 2.0
C: tf = 1/3; idf = log2(10000/250) = 5.3; tf-idf = 1.8
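The numbers above can be reproduced with a short script following the tf, idf, and tf-idf formulas from the previous slides (a minimal sketch):

```python
import math

N = 10_000                              # total documents in the collection
freqs = {"A": 3, "B": 2, "C": 1}        # term frequencies in the document
dfs = {"A": 50, "B": 1300, "C": 250}    # document frequencies of the terms

max_f = max(freqs.values())
for term, f in freqs.items():
    tf = f / max_f                      # normalized term frequency
    idf = math.log2(N / dfs[term])      # inverse document frequency
    print(term, round(tf, 2), round(idf, 1), round(tf * idf, 1))
```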
Similarity Measure
 A similarity measure is a function that computes
the degree of similarity between two vectors.
 Using a similarity measure between the query
and each document:
◦ It is possible to rank the retrieved documents in the
order of presumed relevance.
◦ It is possible to enforce a certain threshold so that
the size of the retrieved set can be controlled.
Similarity Measure - Inner Product
 Similarity between vectors for the document dj and query q can be
computed as the vector inner product (a.k.a. dot product):
sim(dj, q) = dj • q = Σi wij · wiq
where wij is the weight of term i in document j, wiq is the weight of term i in
the query, and the sum runs over all terms i
 For binary vectors, the inner product is the number of matched
query terms in the document (size of intersection).
 For weighted term vectors, it is the sum of the products of the
weights of the matched terms.
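A small sketch of the inner-product similarity for both weighted and binary term vectors (representing vectors as Python dicts keyed by term is an assumption made for the illustration):

```python
def inner_product(doc_weights, query_weights):
    # Sum of products of weights over the terms the two vectors share.
    return sum(w * query_weights[t]
               for t, w in doc_weights.items() if t in query_weights)

# Weighted term vectors (e.g. tf-idf weights).
d = {"retrieval": 0.8, "information": 0.5, "text": 0.3}
q = {"information": 1.0, "retrieval": 1.0}
print(inner_product(d, q))          # 0.8*1.0 + 0.5*1.0 = 1.3

# Binary vectors: the score is just the number of matched query terms.
d_bin = {"retrieval": 1, "information": 1, "text": 1}
q_bin = {"information": 1, "zipf": 1}
print(inner_product(d_bin, q_bin))  # 1 matched term
```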

Properties of Inner Product
 The inner product is unbounded.
 Favors long documents with a large number
of unique terms.
 Measures how many terms matched but not
how many terms are not matched.
