0% found this document useful (0 votes)
93 views6 pages

Inverted File Assignment

1) The document discusses constructing an inverted index to summarize text documents. It involves tokenizing documents, sorting terms, calculating term frequency and document frequency, and separating the index into a vocabulary file and posting files. 2) An inverted index maps words to their locations in documents, allowing fast full-text searches. It contains a vocabulary listing all unique terms and posting files listing frequency and location for each term across documents. 3) The example shows tokenizing example documents, processing the terms, calculating statistics, and structuring the final inverted index files.

Uploaded by

Bini Teflon Ankh
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
93 views6 pages

Inverted File Assignment

1) The document discusses constructing an inverted index to summarize text documents. It involves tokenizing documents, sorting terms, calculating term frequency and document frequency, and separating the index into a vocabulary file and posting files. 2) An inverted index maps words to their locations in documents, allowing fast full-text searches. It contains a vocabulary listing all unique terms and posting files listing frequency and location for each term across documents. 3) The example shows tokenizing example documents, processing the terms, calculating statistics, and structuring the final inverted index files.

Uploaded by

Bini Teflon Ankh
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 6

Addis Ababa University

School of Information Science


Department of Information Science
IR
Assignment II
constructing inverted file

By:
Name ID No.
Biniam Worku GSE/6722/13

Submission Date: 29/9/2021

Inverted Index
Inverted index also known as Inverted file, an index data structure storing a
mapping from content, such as words or numbers, to its locations in a document or
a set of documents. Building and maintaining an inverted index is a relatively low-
cost risk. On a text of n words an inverted index can be built in O(n) time, n is
number of terms. The vocabulary (List of terms) and the occurrence (Location and
frequency of terms in a document collection) are the two contents of an inverted
file.

The occurrence contains one record per term and lists frequency of each term in a
document, also shows locations of words in the text.

A vocabulary file (Word list): stores all of the distinct terms (keywords) that
appear in any of the documents (in lexicographical order) and for each word a
pointer to posting file.

In the given assignment the first thing to do is tokenizing the documents. For a
reference, the documents given in the assignment are:

Doc 1  :  New home to home sales forecasts


Doc 2  : Rise in home sales in July
Doc 3  :  Home sales rise in July for new homes
Doc 4  :  July new home sales rise

After tokenization, the next thing to do is sorting the inverted file by terms.
Term Doc# Term Doc#
New 1 for 3
home 1 forecasts 1
to 1 home 1
home 1 home 1
sales 1 home 2
forecast home 3
1
s
home 4
Rise 2
homes 3
in 2
in 2
home 2
in 2
sales 2
in 3
in 2
july 2
july 2
Home 3 july 3
sales 3 july 4
rise 3 new 1
in 3 Next stop words (to, in and new 3 for) are
july 3 removed. new 4
for 3 rise 2
new 3 rise 3
By stemming the suffix ‘s’ is removed
homes 3 rise 4
from terms like, forecasts, sales and
July 4 sales 1
new 4
homes. sales 2
home 4 sales 3
sales 4 Then all the terms caps sales 4 changed
rise 4 to small caps for the to 1
normalizing purposes.

Multiple term entries in a single document are merged and frequency information
added.

Term frequency (TF) then calculated by counting number of occurrence of


terms in the collections.
Term Doc#
forecas
1
t Term Doc# TF
home 1 forecas
1
home 1 t 1
home 2 home 1 2
home 3 home 2 1
home 4 home 3 2
home 3 home 4 1
july 2 july 2 1
july 3 july 3 1
july 4 july 4 1
new 1 new 1 1
new 3 new 3 1
new 4 new 4 1
rise 2 rise 2 1
rise 3 rise 3 1
rise 4 rise 4 1
sale 1 sale 1 1
sale 2 sale 2 1
sale 3 sale 3 1
sale 4 sale 4 1

Term Doc# TF
forecas
1
t 1
home 1 2
home 2 1
home 3 2 Content Frequency (CF) and Document
home 4 1 Frequency(DF) are calculated by using Document
july 2 1 numbers and frequency of terms appear in the
july 3 1
documents . The result is shown on the following table .
july 4 1
new 1 1 Term DF CF
new 3 1 forecas
new 4 1 t 1 1
rise 2 1 home 4 6
rise 3 1 july 3 3
rise 4 1 new 3 3
sale 1 1 rise 3 3
sale 2 1 sale 4 4
sale 3 1
sale 4 1
The final step is Separation of inverted file into vocabulary and posting file.
Vocabulary: For searching purpose we need only word list. This allows the
vocabulary to be kept in memory at search time since the space required for the
vocabulary is small.

Posting file : requires much more space. For each word appearing in the text we
are keeping statistical information related to word occurrence in documents.

vocabulary Doc#
1 posting
TF
1
1 2
2 1
Term DF CF 3 2
forecast 1 1 4 1
home 4 6 2 1
july 3 3 3 1
new 3 3 4 1
rise 3 3 1 1
sale 3 1
4 4
4 1
2 1
3 1
4 1
1 1
2 1
3 1
4 1
Pointer
s

References
1. Modern information retrieval lecture note Addis Ababa university , school of
information science , 2021
2. Christopher D. Manning, Hinrich Schütze, and Prabhakar Raghavan (2007)
Introduction to information retrieval,cabridge university press, Cambridge
,England

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy