
Chapter Two

Text Operations

Statistical Properties of Text
 How is the frequency of different words distributed?
 How fast does vocabulary size grow with the size of a corpus?
◦ Such factors affect the performance of an IR system & can be used to select
suitable term weights & other aspects of the system.
 A few words are very common.
◦ The 2 most frequent words (e.g. “the”, “of”) can account for about 10% of word
occurrences.
Statistical Properties…
 Most words are very rare.
◦ Half the words in a corpus appear only once; such words are
called hapax legomena (Greek for “read only once”)
Sample Word Frequency Data

Word distribution: Zipf's Law
 Zipf's Law, named after the Harvard linguistics professor
George Kingsley Zipf (1902-1950),
◦ attempts to capture the distribution of the frequencies
(number of occurrences) of the words within a text.
 Zipf's Law states that when the distinct words in a text
are ranked by frequency from most frequent to least
frequent, the product of rank and frequency is a constant.
Zipf's Law...
Frequency * Rank = constant
 That is, if the words, w, in a collection are ranked, r,
by their frequency, f, they roughly fit the relation:
r * f = c
◦ Different collections have different constants c.
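As a quick illustrative check (not part of the original slides), the relation r * f ≈ c can be inspected on any word-frequency list; the toy corpus below is made up purely for the sketch:

```python
from collections import Counter

# Toy corpus; any tokenized collection would do.
tokens = "the cat sat on the mat while the dog sat on the rug".split()

# Rank words from most to least frequent (rank 1 = most frequent).
ranked = Counter(tokens).most_common()

for rank, (word, freq) in enumerate(ranked, start=1):
    # Under Zipf's law, rank * freq stays roughly constant.
    print(rank, word, freq, rank * freq)
```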

Zipf’s distributions
Rank-Frequency Distribution
For all the words in a collection of documents, for each word w:
• f : the frequency with which w appears
• r : the rank of w in order of frequency (the most commonly occurring word has rank 1,
etc.)
[Figure: distribution of sorted word frequencies according to Zipf’s law; a word w with rank r has frequency f]
Example: Zipf's Law
 The table shows the most frequently occurring words
from a 336,310-document collection containing 125,720,891
total words, of which 508,209 are unique words
Methods that Build on Zipf's Law
• Stop lists: Ignore the most frequent words
(upper cut-off). Used by almost all systems.
• Significant words: Take words in between the
most frequent (upper cut-off) and least frequent
words (lower cut-off).
• Term weighting: Give differing weights to terms
based on their frequency, with the most frequent
words weighted less. Used by almost all ranking
methods.
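A minimal sketch of how such frequency-based cut-offs might be applied; the thresholds below are arbitrary illustrative choices, not values prescribed by the slides:

```python
from collections import Counter

def significant_terms(tokens, upper_cutoff=2, lower_cutoff=1):
    """Drop the `upper_cutoff` most frequent words (stop-list candidates)
    and words occurring at most `lower_cutoff` times (too rare),
    keeping the 'significant' words in between."""
    counts = Counter(tokens)
    stop_words = {w for w, _ in counts.most_common(upper_cutoff)}
    return {w: f for w, f in counts.items()
            if w not in stop_words and f > lower_cutoff}

tokens = "the cat sat on the mat and the dog sat on the rug".split()
print(significant_terms(tokens))   # e.g. {'on': 2} with these thresholds
```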
Zipf’s Law Impact on IR
◦ Good News: Stopwords will account for a large fraction
of text, so eliminating them greatly reduces inverted-index
storage costs.
◦ Bad News: For most words, gathering sufficient data for
meaningful statistical analysis (e.g. for correlation analysis
for query expansion) is difficult since they are extremely
rare.
Word significance: Luhn’s Ideas
 Luhn’s idea (1958): the frequency of word occurrence in a text
furnishes a useful measurement of word significance.
 Luhn suggested that both extremely common and extremely
uncommon words were not very useful for indexing.
 For this, Luhn specifies two cut-off points: an upper and a
lower cut-off, based on which non-significant words are
excluded
Word significance: Luhn’s Ideas
 The words exceeding the upper cut-off were considered to be
common
 The words below the lower cut-off were considered to be rare
 Hence they are not contributing significantly to the content of the
text
 The ability of words to discriminate content reaches a peak at a
rank-order position halfway between the two cut-offs
 Let f be the frequency of occurrence of words in a text, and r their
rank in decreasing order of word frequency; then a plot relating f
and r shows the significant words lying between the two cut-offs
Luhn’s Ideas
 Luhn (1958) suggested that both extremely common and
extremely uncommon words were not very useful for document
representation & indexing.
Vocabulary size: Heaps’ Law
 How does the size of the overall vocabulary (number of
unique words) grow with the size of the corpus?
◦ This determines how the size of the inverted index will
scale with the size of the corpus.
Vocabulary Growth: Heaps’ Law
 Heaps’ law estimates the size of the vocabulary in a
given corpus
◦ The vocabulary size grows as O(n^β), where β is a constant
between 0 and 1.
◦ If V is the size of the vocabulary and n is the length of the corpus
in words, Heaps’ law provides the following equation:
V = K n^β
 where the constants are typically:
◦ K ≈ 10−100
◦ β ≈ 0.4−0.6 (approx. square-root)
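As an illustrative calculation (values chosen from the ranges above, not taken from the slides): with K = 50 and β = 0.5, a corpus of n = 1,000,000 words gives V = 50 · 1,000,000^0.5 = 50 · 1,000 = 50,000 distinct words.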
Heaps’ distributions
• Distribution of the size of the vocabulary: on a log-log plot there is a
linear relationship between vocabulary size and the number of
tokens
 Example: from 1,000,000,000 documents, there
may be about 1,000,000 distinct words. Can you agree?
Example
 We want to estimate the size of the vocabulary
for a corpus of 1,000,000 words. However, we
only know statistics computed on smaller
corpus sizes:
◦ For 100,000 words, there are 50,000 unique words
◦ For 500,000 words, there are 150,000 unique words
◦ Estimate the vocabulary size for the 1,000,000-word
corpus.
◦ How about for a corpus of 1,000,000,000 words?
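One possible way to work the exercise (a sketch, not an official solution): the two known corpus sizes give two equations of the form V = K·n^β, which can be solved for β and K and then extrapolated:

```python
import math

# Known (corpus size, vocabulary size) pairs from the exercise.
n1, v1 = 100_000, 50_000
n2, v2 = 500_000, 150_000

# V = K * n**beta  =>  beta = log(v2/v1) / log(n2/n1),  K = v1 / n1**beta
beta = math.log(v2 / v1) / math.log(n2 / n1)
K = v1 / n1 ** beta

for n in (1_000_000, 1_000_000_000):
    print(n, round(K * n ** beta))   # estimated vocabulary size
```

With these two data points the fit gives β ≈ 0.68, and the extrapolation comes out at roughly 240,000 distinct words for the 1,000,000-word corpus and about 27 million for the 1,000,000,000-word corpus.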
Text Operations
 Not all words in a document are equally significant to
represent the contents/meaning of a document
◦ Some words carry more meaning than others
◦ Nouns are the most representative of a
document’s content
 Therefore, one needs to preprocess the text of each
document in a collection to obtain the terms to be used as
index terms
Text Operations…
 Preprocessing is the process of controlling the size of the
vocabulary, i.e. the number of distinct words used as index terms
◦ Preprocessing will lead to an improvement in information
retrieval performance
 However, some search engines on the Web omit preprocessing
◦ Every word in the document is an index term
 Text operations are the process of transforming text into a logical
representation
 The main operations for selecting index terms are:
 Lexical analysis/tokenization of the text - handling digits, hyphens, punctuation marks, and the
case of letters
 Elimination of stop words - filter out words which are not useful in the retrieval
process
 Stemming words - remove affixes (prefixes and suffixes)
 Construction of term categorization structures, such as a thesaurus/wordlist, to capture
relationships allowing the expansion of the original query with related terms
Generating Document Representatives
 Text Processing System
◦ Input text – full text, abstract or title
◦ Output – a document representative adequate for use in an
automatic retrieval system
 Pipeline: documents → Tokenization → stop-word removal → stemming → Thesaurus → Index terms
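A toy end-to-end sketch of this pipeline (the stop-word list and the crude suffix-stripping "stemmer" below are simplistic stand-ins chosen for illustration, not the components a production system would use):

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "were"}  # tiny illustrative list

def tokenize(text):
    # Lexical analysis: lowercase and keep unbroken alphabetic strings.
    return re.findall(r"[a-z]+", text.lower())

def stem(word):
    # Crude suffix stripping, standing in for a real stemmer (e.g. Porter).
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(document):
    tokens = tokenize(document)                          # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [stem(t) for t in tokens]                     # stemming

print(index_terms("The cats were sleeping in the living room."))
# -> ['cat', 'sleep', 'liv', 'room']
```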
Lexical Analysis/Tokenization of Text
 Change the text of the documents into words to be adopted
as index terms
 Objective – identify words in the text
◦ Digits, hyphens, punctuation marks, case of letters
◦ Numbers are usually not good index terms (like 1910, 1999);
but some, like 510 B.C., are unique
Lexical Analysis…
 Hyphens – break up hyphenated words (e.g. state-of-the-art = state of
the art), but some words, e.g. gilt-edged, B-49, are unique words
which require hyphens
 Punctuation marks – remove totally unless significant,
e.g. in program code: x.exe versus xexe
 Case of letters – usually not important; can convert all to
upper or lower case
Tokenization
 Analyze text into a sequence of discrete tokens (words).
 Input: “Friends, Romans and Countrymen”
 Output: tokens (a token is an instance of a sequence of characters that are
grouped together as a useful semantic unit for processing)
◦ Friends, Romans, and, Countrymen
 Each such token is now a candidate for an index entry,
after further processing
 But what are valid tokens to emit?


Issues in Tokenization
 One word or multiple: How do you decide whether it is one token or
two or more?
◦ Hewlett-Packard → Hewlett and Packard as two tokens?
 state-of-the-art: break up the hyphenated sequence?
 San Francisco, Los Angeles
 Addis Ababa, Bahir Dar
◦ lowercase, lower-case, lower case?
 data base, database, data-base
• Numbers:
 dates (3/12/91 vs. Mar. 12, 1991)
 phone numbers
 IP addresses (100.2.86.144)
Issues in Tokenization
 How to handle special cases involving apostrophes, hyphens,
etc.? C++, C#, URLs, emails, …
◦ Sometimes punctuation (e-mail), numbers (1999), and case
(Republican vs. republican) can be a meaningful part of a
token.
◦ However, frequently they are not.
Issues in Tokenization
 The simplest approach is to ignore all numbers and punctuation and
use only case-insensitive unbroken strings of alphabetic
characters as tokens.
◦ Generally, don’t index numbers as text, but they are often very useful. Systems will often
index “meta-data”, including creation date, format, etc., separately
 Issues of tokenization are language-specific
◦ This requires the language to be known
Exercise: Tokenization
 The cat slept peacefully in the living room. It’s a
very old cat.
 Mr. O’Neill thinks that the boys’ stories about
Chile’s capital aren’t amusing.
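A naive regex-based tokenizer applied to the exercise sentences (one possible answer sketch; real tokenizers treat apostrophes, abbreviations such as "Mr.", and case far more carefully):

```python
import re

sentences = [
    "The cat slept peacefully in the living room. It's a very old cat.",
    "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.",
]

for s in sentences:
    # Keep runs of letters and apostrophes; everything else ends a token.
    print(re.findall(r"[A-Za-z']+", s))
```

Note how tokens such as It's, O'Neill and boys' come out of this naive rule; that is exactly the apostrophe-handling issue the previous slides flag.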
Term Weights: Term Frequency
 More frequent terms in a document are more
important, i.e. more indicative of the topic.
fij = frequency of term i in document j
 May want to normalize term frequency (tf) by
dividing by the frequency of the most common
term in the document:
tfij = fij / maxi{fij}
Term Weights: Inverse Document Frequency
 Terms that appear in many different documents are
less indicative of the overall topic.
dfi = document frequency of term i
= number of documents containing term i
idfi = inverse document frequency of term i
= log2 (N / dfi)
(N: total number of documents)
 An indication of a term’s discrimination power.
 Log used to dampen the effect relative to tf.
TF-IDF Weighting
 A typical combined term-importance indicator
is tf-idf weighting:
wij = tfij · idfi = tfij · log2 (N / dfi)
 A term occurring frequently in the document
but rarely in the rest of the collection is given
high weight.
 Many other ways of determining term weights
have been proposed.
 Experimentally, tf-idf has been found to work
well.
Computing TF-IDF -- An Example
Given a document containing terms with given frequencies:
A(3), B(2), C(1)
Assume the collection contains 10,000 documents and the
document frequencies of these terms are:
A(50), B(1300), C(250)
Then:
A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6
B: tf = 2/3; idf = log2(10000/1300) = 2.9; tf-idf = 2.0
C: tf = 1/3; idf = log2(10000/250) = 5.3; tf-idf = 1.8
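The numbers above can be reproduced with a short script following the tf, idf, and tf-idf formulas from the previous slides (a minimal sketch):

```python
import math

N = 10_000                              # total documents in the collection
freqs = {"A": 3, "B": 2, "C": 1}        # term frequencies in the document
dfs = {"A": 50, "B": 1300, "C": 250}    # document frequencies of the terms

max_f = max(freqs.values())
for term, f in freqs.items():
    tf = f / max_f                      # normalized term frequency
    idf = math.log2(N / dfs[term])      # inverse document frequency
    print(term, round(tf, 2), round(idf, 1), round(tf * idf, 1))
```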
Similarity Measure
 A similarity measure is a function that computes
the degree of similarity between two vectors.
 Using a similarity measure between the query
and each document:
◦ It is possible to rank the retrieved documents in the
order of presumed relevance.
◦ It is possible to enforce a certain threshold so that
the size of the retrieved set can be controlled.
Similarity Measure - Inner Product
 Similarity between vectors for the document dj and query q can be
computed as the vector inner product (a.k.a. dot product):
sim(dj, q) = dj • q = Σi wij · wiq
where wij is the weight of term i in document j, wiq is the weight of term i in
the query, and the sum runs over all terms i
 For binary vectors, the inner product is the number of matched
query terms in the document (size of intersection).
 For weighted term vectors, it is the sum of the products of the
weights of the matched terms.
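A small sketch of the inner-product similarity for both weighted and binary term vectors (representing vectors as Python dicts keyed by term is an assumption made for the illustration):

```python
def inner_product(doc_weights, query_weights):
    # Sum of products of weights over the terms the two vectors share.
    return sum(w * query_weights[t]
               for t, w in doc_weights.items() if t in query_weights)

# Weighted term vectors (e.g. tf-idf weights).
d = {"retrieval": 0.8, "information": 0.5, "text": 0.3}
q = {"information": 1.0, "retrieval": 1.0}
print(inner_product(d, q))          # 0.8*1.0 + 0.5*1.0 = 1.3

# Binary vectors: the score is just the number of matched query terms.
d_bin = {"retrieval": 1, "information": 1, "text": 1}
q_bin = {"information": 1, "zipf": 1}
print(inner_product(d_bin, q_bin))  # 1 matched term
```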

Properties of Inner Product
 The inner product is unbounded.
 Favors long documents with a large number
of unique terms.
 Measures how many terms matched but not
how many terms are not matched.
