IR Chapter 2 Text Operations
Text Operations
Statistical Properties of Text
How is the frequency of different words distributed?
How fast does vocabulary size grow with the size of a corpus?
◦ These properties affect the performance of an IR system and can be used to select
suitable term weights and to tune other aspects of the system.
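The second question above can be explored empirically. The following is a minimal sketch, assuming whitespace-separated, case-folded tokens; the function name `vocabulary_growth` and the sample sentence are illustrative and not part of the original slides.

```python
def vocabulary_growth(tokens, step=4):
    """Return (tokens_seen, distinct_words) pairs, sampled every `step` tokens."""
    seen = set()      # distinct words encountered so far
    points = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok.lower())
        # Record a sample point every `step` tokens and at the very end.
        if i % step == 0 or i == len(tokens):
            points.append((i, len(seen)))
    return points


# Illustrative toy corpus (an assumption, not data from the slides).
sample = "the cat sat on the mat and the dog sat on the log".split()
growth = vocabulary_growth(sample, step=4)
for n_tokens, n_distinct in growth:
    print(n_tokens, n_distinct)
```

Plotting these pairs for a real corpus shows how quickly new words keep appearing as more text is read.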
Sample Word Frequency Data
Word distribution: Zipf's Law
Zipf's Law, named after the Harvard linguistics professor
George Kingsley Zipf (1902-1950),
◦ attempts to capture the distribution of the frequencies
(number of occurrences) of the words within a text.
Zipf’s distributions
Rank Frequency Distribution
For all the words in a collection of documents, for each word w
• f : the frequency with which w appears
• r : the rank of w in order of frequency (the most commonly occurring word has rank 1,
etc.)
[Figure: distribution of sorted word frequencies (f), according to Zipf's law]
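A rank-frequency table of the kind described above can be built in a few lines. This is a minimal sketch, assuming case-folded, whitespace-separated tokens; `rank_frequency` and the sample sentence are illustrative names and data, not taken from the slides. Zipf's law predicts that f is roughly proportional to 1/r, so the product r * f should stay roughly constant on a large corpus; a toy sample like this one is far too small to show that.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) tuples, most frequent first."""
    counts = Counter(t.lower() for t in tokens)
    ranked = counts.most_common()  # sorted by frequency, highest first
    return [(r, w, f, r * f) for r, (w, f) in enumerate(ranked, start=1)]


# Illustrative toy corpus (an assumption, not data from the slides).
sample = "the cat and the dog and the bird".split()
table = rank_frequency(sample)
for rank, word, freq, product in table:
    print(rank, word, freq, product)
```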
Example: Zipf's Law
Text Operations
Not all words in a document are equally significant for
representing the contents/meaning of the document
◦ Some words carry more meaning than others
◦ Nouns are the most representative of a
document's content
Text operations are the process of transforming text into logical
representations
Elimination of stop words - filter out words which are not useful in the retrieval
process
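Stop word elimination can be sketched as a simple filter over a token list. The stop list below is a small hypothetical example chosen for illustration; real systems use longer, collection-specific lists, and the name `remove_stop_words` is likewise an assumption.

```python
# Hypothetical stop list for illustration only; not taken from the slides.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens whose lowercase form is not in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]


filtered = remove_stop_words(["Friends", "Romans", "and", "Countrymen"])
print(filtered)
```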
Index terms
Lexical Analysis/Tokenization of Text
Convert the text of the documents into words to be adopted
as index terms
Tokenization Input: “Friends, Romans and Countrymen”
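One simple way to tokenize the input above is to split on anything that is not a letter or digit. This is a minimal sketch under that assumption; real tokenizers also have to decide how to treat hyphens, apostrophes, numbers, and case, and the function name `tokenize` is illustrative.

```python
import re


def tokenize(text):
    """Return the maximal runs of ASCII letters and digits, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)


tokens = tokenize("Friends, Romans and Countrymen")
print(tokens)
```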
Similarity Measure
A similarity measure is a function that computes
the degree of similarity between two vectors.
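As one common example of such a function, the sketch below computes cosine similarity between two equal-length numeric vectors. The choice of cosine similarity is an assumption made for illustration; the slides do not say which measure is intended, and the function name is hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; returns 0.0 if either vector is all zeros."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Toy term-weight vectors (illustrative values).
print(cosine_similarity([1, 2, 3], [2, 4, 6]))
print(cosine_similarity([1, 0], [0, 1]))
```

Vectors pointing in the same direction score close to 1, and vectors with no terms in common score 0.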