Inverted File Assignment
By:
Name: Biniam Worku
ID No.: GSE/6722/13
Inverted Index
An inverted index, also known as an inverted file, is an index data structure
that stores a mapping from content, such as words or numbers, to its locations in
a document or a set of documents. Building and maintaining an inverted index is
relatively inexpensive: for a text of n words, an inverted index can be built in
O(n) time. An inverted file has two components: the vocabulary (the list of
terms) and the occurrences (the locations and frequencies of terms in the
document collection).
The occurrences file contains one record per term. Each record lists the
frequency of the term in each document and the locations of the term in the text.
The vocabulary file (word list) stores all of the distinct terms (keywords) that
appear in any of the documents, in lexicographical order, and for each term a
pointer to the posting file.
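The mapping from terms to locations can be sketched in a few lines of Python. This is a minimal illustration, not the assignment's method: the two sample documents and the function name build_positional_index are hypothetical, and the linear build time relies on the assumption that the vocabulary is a hash table with expected constant-time insertion.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to a list of (doc_id, position) pairs in a single pass."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        # Each word is visited exactly once, so the work grows linearly with n.
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return dict(index)

# Hypothetical two-document collection, used only for illustration.
docs = {1: "home sales rise", 2: "new home sales"}
index = build_positional_index(docs)

for term in sorted(index):
    print(term, index[term])
```

Printing the terms in sorted order mirrors the lexicographical ordering of the vocabulary file, while each list of pairs plays the role of the occurrences record.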
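In the given assignment, the first step is to tokenize the documents. For
reference, the documents given in the assignment, as recoverable from the token
list that follows, are:
Doc 1: New home to home sales forecasts
Doc 2: Rise in home sales in july
Doc 3: Home sales rise in july for new homes
Doc 4: July new home sales rise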
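The tokenization step can be sketched as follows. The document texts here are an assumption, read back from the (term, document number) entries of the token table, and splitting on whitespace is the simplest possible tokenizer rather than a general one.

```python
# Document texts are an assumption, reconstructed from the token table.
docs = {
    1: "New home to home sales forecasts",
    2: "Rise in home sales in july",
    3: "Home sales rise in july for new homes",
    4: "July new home sales rise",
}

def tokenize(docs):
    """Return (term, doc_id) pairs in order of appearance."""
    pairs = []
    for doc_id in sorted(docs):
        # Whitespace splitting is enough here because the texts have no punctuation.
        for token in docs[doc_id].split():
            pairs.append((token, doc_id))
    return pairs

pairs = tokenize(docs)
for term, doc_id in pairs:
    print(term, doc_id)
```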
After tokenization, the next step is to sort the inverted file by term. The left
pair of columns lists the tokens in order of appearance; the right pair lists the
same entries sorted by term.

Tokenized           Sorted
Term       Doc#     Term       Doc#
New        1        for        3
home       1        forecasts  1
to         1        home       1
home       1        home       1
sales      1        home       2
forecasts  1        home       3
Rise       2        home       4
in         2        homes      3
home       2        in         2
sales      2        in         2
in         2        in         3
july       2        july       2
Home       3        july       3
sales      3        july       4
rise       3        new        1
in         3        new        3
july       3        new        4
for        3        rise       2
new        3        rise       3
homes      3        rise       4
July       4        sales      1
new        4        sales      2
home       4        sales      3
sales      4        sales      4
rise       4        to         1

Next, the stop words (to, in and for) are removed. Stemming then removes the
suffix 's' from terms such as forecasts, sales and homes. Finally, all terms are
converted to lowercase for normalization.
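These steps can be sketched together in Python. Two assumptions are worth stating: the stemmer below only strips a single trailing 's', which is enough for this collection but is not a real stemming algorithm, and lowercasing is applied first (rather than last, as in the description above) so that stop-word matching does not depend on capitalization. The names STOP_WORDS, stem and normalize are hypothetical.

```python
STOP_WORDS = {"to", "in", "for"}  # the stop words named in the text

def stem(term):
    """Naive stemmer (assumption): strip one trailing 's'."""
    return term[:-1] if len(term) > 1 and term.endswith("s") else term

def normalize(pairs):
    """Lowercase, drop stop words, stem, then sort by (term, doc_id)."""
    result = []
    for term, doc_id in pairs:
        term = term.lower()
        if term in STOP_WORDS:
            continue
        result.append((stem(term), doc_id))
    return sorted(result)

# The 25 tokenized entries, in order of appearance.
raw_pairs = [
    ("New", 1), ("home", 1), ("to", 1), ("home", 1), ("sales", 1), ("forecasts", 1),
    ("Rise", 2), ("in", 2), ("home", 2), ("sales", 2), ("in", 2), ("july", 2),
    ("Home", 3), ("sales", 3), ("rise", 3), ("in", 3), ("july", 3), ("for", 3),
    ("new", 3), ("homes", 3),
    ("July", 4), ("new", 4), ("home", 4), ("sales", 4), ("rise", 4),
]

terms = normalize(raw_pairs)
for term, doc_id in terms:
    print(term, doc_id)
```

Removing the five stop-word entries leaves 20 entries over six distinct terms.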
Multiple entries of a term within a single document are merged, and term
frequency (TF) information is added.

Term      Doc#  TF
forecast  1     1
home      1     2
home      2     1
home      3     2
home      4     1
july      2     1
july      3     1
july      4     1
new       1     1
new       3     1
new       4     1
rise      2     1
rise      3     1
rise      4     1
sale      1     1
sale      2     1
sale      3     1
sale      4     1

Collection Frequency (CF) and Document Frequency (DF) are calculated from the
document numbers and the term frequencies: DF is the number of documents in which
a term appears, and CF is the total number of occurrences of the term in the
collection. The result is shown in the following table.

Term      DF  CF
forecast  1   1
home      4   6
july      3   3
new       3   3
rise      3   3
sale      4   4
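The merge and the two frequency counts can be checked with a short Python sketch. It assumes the 20 normalized (term, document number) entries as input and uses collections.Counter; the variable names tf, df and cf are chosen for this example only.

```python
from collections import Counter

# The 20 entries that remain after stop-word removal, stemming and lowercasing.
pairs = [
    ("forecast", 1),
    ("home", 1), ("home", 1), ("home", 2), ("home", 3), ("home", 3), ("home", 4),
    ("july", 2), ("july", 3), ("july", 4),
    ("new", 1), ("new", 3), ("new", 4),
    ("rise", 2), ("rise", 3), ("rise", 4),
    ("sale", 1), ("sale", 2), ("sale", 3), ("sale", 4),
]

# TF: merge duplicate (term, doc) entries and count them.
tf = Counter(pairs)
# DF: number of distinct documents per term (one per merged entry).
df = Counter(term for term, _doc in tf)
# CF: total number of occurrences per term across the collection.
cf = Counter(term for term, _doc in pairs)

for term in sorted(df):
    print(term, df[term], cf[term])
```

For home, the merged entries give TF values of 2, 1, 2 and 1, so DF is 4 and CF is 6, in agreement with the table.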
The final step is to separate the inverted file into a vocabulary and a posting
file.
Vocabulary: only the word list is needed to start a search. Because the space
required for the vocabulary is small, it can be kept in memory at search time.
Posting file: requires much more space. For each word appearing in the text, it
keeps the statistical information about the occurrences of that word in the
documents.
Vocabulary                    Posting file
Term      DF  CF  Pointer     Doc#  TF
forecast  1   1   -------->   1     1
home      4   6   -------->   1     2
                              2     1
                              3     2
                              4     1
july      3   3   -------->   2     1
                              3     1
                              4     1
new       3   3   -------->   1     1
                              3     1
                              4     1
rise      3   3   -------->   2     1
                              3     1
                              4     1
sale      4   4   -------->   1     1
                              2     1
                              3     1
                              4     1
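The separation into vocabulary and posting file can be sketched as follows. This is an illustration under stated assumptions: the posting file is modeled as a flat in-memory list rather than a file on disk, a pointer is simply an offset into that list, and the function names split_index and lookup are hypothetical.

```python
def split_index(tf):
    """Split {(term, doc_id): tf} into a vocabulary and a flat posting list."""
    vocabulary = {}   # term -> (DF, CF, pointer into postings)
    postings = []     # flat list of (doc_id, tf) records
    for term in sorted({t for t, _ in tf}):
        entries = sorted((d, f) for (t, d), f in tf.items() if t == term)
        document_freq = len(entries)
        collection_freq = sum(f for _, f in entries)
        vocabulary[term] = (document_freq, collection_freq, len(postings))
        postings.extend(entries)
    return vocabulary, postings

def lookup(term, vocabulary, postings):
    """Follow the pointer of a term and read its DF posting records."""
    if term not in vocabulary:
        return []
    document_freq, _cf, pointer = vocabulary[term]
    return postings[pointer:pointer + document_freq]

# Merged term frequencies from the TF table.
tf = {
    ("forecast", 1): 1,
    ("home", 1): 2, ("home", 2): 1, ("home", 3): 2, ("home", 4): 1,
    ("july", 2): 1, ("july", 3): 1, ("july", 4): 1,
    ("new", 1): 1, ("new", 3): 1, ("new", 4): 1,
    ("rise", 2): 1, ("rise", 3): 1, ("rise", 4): 1,
    ("sale", 1): 1, ("sale", 2): 1, ("sale", 3): 1, ("sale", 4): 1,
}

vocabulary, postings = split_index(tf)
print(vocabulary["home"])
print(lookup("home", vocabulary, postings))
```

Storing DF next to the pointer lets a search read exactly the right number of posting records without scanning the rest of the posting file.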
References
1. Modern Information Retrieval, lecture notes, Addis Ababa University, School of
Information Science, 2021.
2. Christopher D. Manning, Hinrich Schütze, and Prabhakar Raghavan (2007).
Introduction to Information Retrieval. Cambridge University Press, Cambridge,
England.