3-Index Construction

Chapter 3 discusses the construction of an indexing subsystem for document retrieval, detailing processes like tokenization, stop word removal, stemming, and term weighting. It explains how current search engines index documents using web crawlers and outlines the major steps in index construction, including the creation of inverted files and vocabulary files. The chapter also covers the evaluation metrics for index files, complexities of indexing, and text compression methods such as Huffman coding.


Chapter 3: Index Construction
Adama Science and Technology University
School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2024)
Indexing Subsystem

Documents → Assign document identifier → documents + document IDs
         → Tokenization → tokens
         → Stop word removal → non-stop-list tokens
         → Stemming & normalization → stemmed terms
         → Term weighting → weighted index terms
         → Index file
Indexing: Basic Concepts

v Indexing is used to speed up access to desired information from a document collection as per a user's query, such that:
v It enhances efficiency in terms of retrieval time: relevant documents are searched and retrieved quickly.
v Example: an author catalog in a library.
v An index file consists of records, called index entries.
v Index files are much smaller than the original file.
v Remember Heaps' law: in a 1 GB text collection, the size of the vocabulary is only about 5 MB (Baeza-Yates and Ribeiro-Neto, 2005).
v This size may be further reduced by linguistic pre-processing (such as stemming and other normalization methods).
v The usual unit for indexing is the word.
v Index terms are used to look up records in a file.
How Current Search Engines Index

v Indexes are built using a web crawler, which retrieves each page on the Web for indexing.
v After indexing, the local copy of each page is discarded, unless it is stored in a cache.
v Some search engines index automatically.
v Such search engines include Google, AltaVista, Excite, HotBot, InfoSeek and Lycos.
v Other search engines index semi-automatically.
v These are partially human-indexed and hierarchically organized.
v Such search engines include Yahoo, Magellan, Galaxy and the WWW Virtual Library.
v A common feature: they all allow Boolean searches.
Major Steps in Index Construction

v Source file: collection of text documents.
v A document can be described by a set of representative keywords called index terms.
v Index term selection:
v Tokenize: identify the words in a document, so that each document is represented by a list of keywords or attributes.
v Stop words: remove high-frequency words.
v A stop list of words is used for comparison against the input text.
v Word stemming and normalization: reduce words with similar meaning to their stem/root word. Suffix stripping is the common method.
v Term relevance weighting: different index terms have varying relevance when used to describe document contents.
v This effect is captured through the assignment of numerical weights to each index term. There are different weighting methods: TF, TF*IDF, ...
v Output: a set of index terms (vocabulary) to be used for indexing the documents in which each term occurs.
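The index-term selection steps above can be sketched in Python. This is a minimal illustration, not a production analyzer: the stop list and the suffix-stripping rules below are assumptions standing in for a real stop list and a full stemmer (e.g., Porter's algorithm).

```python
import re

# Illustrative stop list (assumption; real systems use much larger lists).
STOP_WORDS = {"the", "a", "an", "in", "to", "is", "was", "for", "of", "and"}
# Naive suffix stripping (assumption; not a full Porter stemmer).
SUFFIXES = ("ing", "s")

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """Tokenize, remove stop words, and stem: the pipeline above."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

print(index_terms("New home sales in July"))  # ['new', 'home', 'sale', 'july']
```

Note how each step narrows the token stream: tokenization lowercases and splits, the stop list drops "in", and suffix stripping maps "sales" to "sale".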
Basic Indexing Process

Documents to be indexed:     Friends, Romans, countrymen.
          | Tokenizer
Token stream:                Friends  Romans  countrymen
          | Linguistic preprocessing
Modified tokens:             friend  roman  countryman
          | Indexer
Index file (inverted file):  friend     -> 2, 4
                             roman      -> 1, 2
                             countryman -> 13, 16
Building Index File

v An index file of a document collection is a file consisting of a list of index terms and, for each term, a link to one or more documents that contain it.
v A good index file maps each keyword Ki to a set of documents Di that contain the keyword.
v An index file usually holds its index terms in sorted order.
v The sort order of the terms in the index file provides an order on the physical file.
Building Index File

v An index file is a list of search terms that are organized for associative look-up, i.e., to answer a user's query:
v In which documents does a specified search term appear?
v Where within each document does each term appear? (There may be several occurrences.)
v For organizing an index file for a collection of documents, various options are available:
v Decide what data structure and/or file structure to use.
v Is it a sequential file, inverted file, suffix array, signature file, etc.?
Index File Evaluation Metrics

v Running time:
v Indexing time;
v Access/search time;
v Update time (insertion time, deletion time, modification time, ...).
v Space overhead:
v Computer storage space consumed.
v Access types supported efficiently:
v Does the indexing structure allow access to:
v Records with a specified term, or
v Records with terms falling in a specified range of values?
Sequential File

v The sequential file is the most primitive file structure.
v It has no vocabulary and no linking pointers.
v The records are generally arranged serially, one after another, but in lexicographic order on the value of some key field.
v A particular attribute is chosen as the primary key, whose value determines the order of the records.
v When the first key fails to discriminate among records, a second key is chosen to give an order.
Example:

v Given a collection of documents, they are parsed to extract words, and these are saved with the document ID.

Doc 1: I did enact Julius Caesar; I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus has told you Caesar was ambitious.
Sorting the Vocabulary

v After all documents have been tokenized, stopwords are removed, and normalization and stemming are applied to generate the index terms (sequential file).
v These index terms in the sequential file are sorted in alphabetical order.
Sequential File

v Its main advantages are:
v Easy to implement;
v Provides fast access to the next record using lexicographic order;
v Instead of a linear search, one can search in logarithmic time using binary search.
v Its disadvantages:
v Difficult to update: the index must be rebuilt if a new term is added. Inserting a new record may require moving a large proportion of the file;
v Random access is extremely slow.
v The update problem can be solved:
v By ordering records by date of acquisition rather than by key value; the newest entries are then added at the end of the file and therefore pose no difficulty for updating.
v But searching then becomes very tough: it requires linear time.
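The logarithmic search mentioned above can be sketched with binary search over a sorted sequential file. The (term, doc ID) record layout below is an illustrative assumption:

```python
import bisect

# A sequential file: records sorted on the primary key (the term).
records = sorted([
    ("ambitious", 2), ("brutus", 1), ("brutus", 2), ("caesar", 1),
    ("caesar", 2), ("capitol", 1), ("enact", 1), ("julius", 1),
    ("kill", 1), ("noble", 2),
])

def lookup(term):
    """Binary search: O(log n) to find the run of records for `term`."""
    lo = bisect.bisect_left(records, (term,))                # first record for term
    hi = bisect.bisect_right(records, (term, float("inf")))  # just past the last one
    return [doc_id for _, doc_id in records[lo:hi]]

print(lookup("brutus"))  # [1, 2]
```

Contrast this with a date-ordered file, where the same lookup would have to scan every record.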
Inverted File

v A word-oriented indexing mechanism based on a sorted list of keywords, with each keyword having links to the documents containing it.
v Building and maintaining an inverted index is relatively low-cost and low-risk.
v On a text of n words, an inverted index can be built in O(n) time.
v Content of the inverted file:
v Data to be held in the inverted file includes:
v The vocabulary (list of terms), and
v The occurrences (location and frequency of terms in the document collection).
Inverted File

v The occurrences: one record per term, listing:
v The frequency of each term in each document, i.e., a count of the number of occurrences of the keyword in the document:
• TFij: number of occurrences of term tj in document di
• DFj: number of documents containing tj
• maxi: maximum frequency of any term in di
• N: total number of documents in the collection
• CFj: collection frequency of tj, i.e., total occurrences of tj in the collection
• ...
v The locations/positions of the words in the text.
Inverted File

v Why the vocabulary?
v Having information about the vocabulary (the list of terms) speeds up searching for relevant documents.
v Why locations?
v Having information about the location of each term within the document helps with:
v User interface design: highlighting the location of search terms;
v Proximity-based ranking: adjacency and NEAR operators (in Boolean searching).
v Why frequencies?
v Having frequency information is used for:
v Calculating term weights (like TF, TF*IDF, ...);
v Optimizing query processing.
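As a worked illustration of frequency-based weighting, the sketch below computes TF*IDF as tf_ij * log(N / df_j). The tiny two-document collection and the unnormalized, natural-log variant of the formula are assumptions; textbooks differ on normalization and log base.

```python
import math

# Toy collection: doc ID -> list of index terms (assumed for illustration).
docs = {
    1: ["new", "home", "home", "sale"],
    2: ["rise", "home", "sale", "july"],
}
N = len(docs)  # total number of documents

def tf_idf(term, doc_id):
    """Weight = TF(term in doc) * log(N / DF(term))."""
    tf = docs[doc_id].count(term)
    df = sum(1 for terms in docs.values() if term in terms)
    return tf * math.log(N / df) if df else 0.0

print(tf_idf("new", 1))   # 1 * log(2/1): a rare term gets a positive weight
print(tf_idf("home", 1))  # 2 * log(2/2) = 0: a term in every doc carries no weight
```

The second call shows why raw TF alone is not enough: "home" is frequent in document 1 but occurs everywhere, so its IDF factor cancels it out.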
Inverted File

v Documents are organized by the terms/words they contain. This is called an index file. Text operations are performed before building the index.

Term    CF   Doc ID   TF   Locations
auto     3        2    1   66
                 19    1   213
                 29    1   45
bus      4        3    1   94
                 19    2   7, 212
                 22    1   56
taxi     1        5    1   43
train    3       11    2   3, 70
                 34    1   40
Construction of Inverted File

v An inverted index consists of two files:
v Vocabulary file
v Posting file
v Advantage of dividing the inverted file:
v Keeping a pointer in the vocabulary to the list in the posting file allows:
v The vocabulary to be kept in memory at search time, even for a large text collection, and
v The posting file to be kept on disk for accessing the documents.
Inverted Index Storage

v Separating the inverted file into a vocabulary and a posting file is a good idea.
v Vocabulary: for searching purposes we need only the word list.
v This allows the vocabulary to be kept in memory at search time, since the space required for the vocabulary is small.
v The vocabulary grows as O(n^β), where β is a constant between 0 and 1.
v Example: from 1,000,000,000 documents, there may be 1,000,000 distinct words. Hence the vocabulary index is only about 100 MB, which can easily be held in the memory of a dedicated computer.
v The posting file requires much more space.
v For each word appearing in the text we keep statistical information related to its occurrences in documents.
v The postings pointers to the documents require extra space of O(n).
v How can access to the inverted file be sped up?
Vocabulary File

v A vocabulary file (word list):
v Stores all of the distinct terms (keywords) that appear in any of the documents, in lexicographical order, and
v For each word, a pointer into the posting file.
v The record kept for each term j in the word list contains the following:
v Term j
v Number of documents in which term j occurs (DFj)
v Total frequency of term j (CFj)
v Pointer to the postings (inverted) list for term j
Postings File (Inverted List)

v For each distinct term in the vocabulary, it stores a list of pointers to the documents that contain that term.
v Each element in an inverted list is called a posting, i.e., one occurrence of a term in a document.
v It is stored as a separate inverted list for each term, i.e., a list corresponding to each term in the index file.
v Each list consists of one or more individual postings holding the document ID, TF and location information for a given term i.
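A minimal sketch of the vocabulary/postings split described above: the in-memory vocabulary stores DF, CF and a pointer per term, and each posting holds (doc ID, TF, positions). Using a Python list index as the "pointer" is an assumption for illustration; on disk it would be a byte offset into the postings file.

```python
postings = []    # stands in for the on-disk postings file
vocabulary = {}  # in-memory: term -> (DF, CF, pointer into postings)

def add_term(term, occurrences):
    """occurrences: list of (doc_id, tf, positions) postings for `term`."""
    df = len(occurrences)                       # documents containing the term
    cf = sum(tf for _, tf, _ in occurrences)    # collection frequency
    vocabulary[term] = (df, cf, len(postings))  # pointer = next free slot
    postings.append(occurrences)

# Entries taken from the inverted-file table above.
add_term("auto", [(2, 1, [66]), (19, 1, [213]), (29, 1, [45])])
add_term("bus", [(3, 1, [94]), (19, 2, [7, 212]), (22, 1, [56])])

df, cf, ptr = vocabulary["bus"]
print(df, cf, postings[ptr])  # 3 4 [(3, 1, [94]), (19, 2, [7, 212]), (22, 1, [56])]
```

A search touches the small vocabulary first and follows one pointer, so only the matching term's inverted list needs to be read from disk.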
Organization of Index File

Vocabulary (word list)                  Postings (inverted lists)   Actual documents

Term    No of Docs   Tot Freq   Pointer to posting
act          3           3      -> inverted list
bus          3           4      -> inverted list
pen          1           1      -> inverted list
total        2           3      -> inverted list
Example: Indexing

v Given a collection of documents, they are parsed to extract words, and these are saved with the document ID.

Doc 1: I did enact Julius Caesar; I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus has told you Caesar was ambitious.
Sorting the Vocabulary

v After all documents have been tokenized, the inverted file is sorted by terms.
Remove Stopwords, Apply Stemming & Compute Term Frequency

v Multiple entries for a term within a single document are merged, and frequency information is added.
v Counting the number of occurrences of terms in the collection helps to compute TF.
Vocabulary and Postings File

v The file is commonly split into a dictionary (vocabulary) and a posting file, connected by pointers.
Complexity Analysis

v The inverted index can be built in O(n) time, where n is the number of words in the collection.
v Since the terms in the vocabulary file are sorted, searching for a term takes logarithmic time.
v To update the inverted index, incremental indexing can be applied, which requires O(k) time, where k is the number of new index terms.
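A one-pass build over the two-document Caesar example above illustrates the O(n) construction (one dictionary operation per token) and the sorted vocabulary that enables logarithmic term lookup. Lowercasing with no stopword removal or stemming is an assumption, to keep the sketch short.

```python
from collections import defaultdict

docs = {
    1: "i did enact julius caesar i was killed i the capitol brutus killed me",
    2: "so let it be with caesar the noble brutus has told you caesar was ambitious",
}

# One pass over every token: term -> {doc_id: [positions]}.
index = defaultdict(dict)
for doc_id, text in docs.items():
    for pos, token in enumerate(text.split()):
        index[token].setdefault(doc_id, []).append(pos)

print(index["caesar"])    # {1: [4], 2: [5, 12]} -- DF, TF and positions all fall out
print(sorted(index)[:3])  # ['ambitious', 'be', 'brutus'] -- sorted vocabulary
```

From this structure, DF is the number of doc IDs per term, TF is the length of each position list, and the sorted keys form the vocabulary file.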
Exercises

v Construct the inverted index for the following document collection.
v Doc 1: New home to home sales forecasts
v Doc 2: Rise in home sales in July
v Doc 3: Home sales rise in July for new homes
v Doc 4: July new home sales rise
Implementation Issues

v Storage of text:
v The need for text compression: to reduce storage space.
v Indexing text.
v Storage of indexes:
v Is compression required? Do we store in memory or on disk?
v Accessing text.
v Accessing indexes:
v How do we access the indexes? What data/file structure should be used?
v Processing indexes:
v How do we search for a given query in the index? How do we update the index?
v Accessing documents.
Text Compression

v Text compression is about finding ways to represent the text in fewer bits or bytes. Advantages:
v Saves storage space.
v Speeds up document transmission.
v Takes less time to search the compressed text.
v Common compression methods:
v Statistical methods: require statistical information about the frequency of occurrence of symbols in the document.
v E.g., Huffman coding:
v Estimate the probabilities of symbols and code one symbol at a time, with shorter codes for higher probabilities.
v Adaptive methods: construct the dictionary in the course of compression.
v E.g., Ziv-Lempel compression:
v Replaces words or symbols with a pointer to dictionary entries.
Huffman Coding

v Developed in the 1950s by David Huffman; widely used for text compression, multimedia codecs and message transmission.
v The problem: given a set of n symbols and their weights (or frequencies), construct a tree structure (a binary tree for binary codes) with the objective of reducing memory space and decoding time per symbol.
v Huffman coding is constructed based on the frequency of occurrence of letters in text documents.
v [Figure: example Huffman tree giving the codes D1 = 000, D2 = 001, D3 = 01, D4 = 1.]
How to Construct Huffman Coding

v Step 1: Create a forest of trees, one for each symbol t1, t2, ... tn.
v Step 2: Sort the forest of trees according to falling probabilities of symbol occurrence.
v Step 3: WHILE more than one tree exists DO:
v Merge the two trees t1 and t2 with the least probabilities p1 and p2;
v Label their root with the sum p1 + p2;
v Associate the binary code: 1 with the right branch and 0 with the left branch.
v Step 4: Create a unique codeword for each symbol by traversing the tree from the root to the leaf.
v Concatenate all 0s and 1s encountered during the traversal.
v The resulting tree has a probability of 1 at its root and the symbols at its leaf nodes.
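Steps 1-4 can be sketched with a priority queue: repeatedly merge the two lowest-probability trees. Tie-breaking among equal probabilities is an implementation choice, so the exact codes may differ from a hand-drawn tree, but every valid Huffman tree achieves the same minimal expected code length.

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a Huffman tree over {symbol: probability} and return its codes."""
    tick = count()  # tie-breaker so heapq never has to compare tree nodes
    heap = [(p, next(tick), sym) for sym, p in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:                      # Step 3: merge the two least trees
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tick), (left, right)))
    codes = {}
    def walk(node, prefix):                   # Step 4: root-to-leaf traversal
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")       # 0 on the left branch
            walk(node[1], prefix + "1")       # 1 on the right branch
        else:
            codes[node] = prefix or "0"       # lone-symbol edge case
    walk(heap[0][2], "")
    return codes

freqs = {"a": 0.05, "b": 0.05, "c": 0.1, "d": 0.2, "e": 0.3, "f": 0.2, "g": 0.1}
codes = huffman_codes(freqs)
print(codes)
print(sum(p * len(codes[s]) for s, p in freqs.items()))  # expected length: 2.6 bits/symbol
```

The resulting code is prefix-free (no codeword is a prefix of another), which is what makes left-to-right decoding unambiguous.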
Example

v Consider the 7-symbol alphabet given in the following table and construct the Huffman coding.

Symbol   Probability
a        0.05
b        0.05
c        0.1
d        0.2
e        0.3
f        0.2
g        0.1

v The Huffman encoding algorithm each time picks the two symbols with the smallest frequencies to combine.
Huffman Code Tree

v [Figure: the Huffman tree for the 7-symbol alphabet. The root (probability 1) has a 0.4 subtree containing d and f, and a 0.6 subtree containing e above a 0.3 subtree with g, c, a and b. Reading off the tree, d, f and e receive 2-bit codes, g a 3-bit code, c a 4-bit code, and a and b 5-bit codes.]
v Using the Huffman coding, a table can be constructed by working down the tree, left to right. This gives the binary equivalent for each symbol in terms of 1s and 0s.
v What is the Huffman binary representation for 'café'?
Exercise

v 1. Given the following, apply the Huffman algorithm to find an optimal binary code:

Character:  a   b   c   d   e   t
Frequency: 16   5  12  17  10  25

v 2. Given the text: "for each rose, a rose is a rose"
v Construct the Huffman coding.
Ziv-Lempel Compression

v The problem with Huffman coding is that it requires knowledge about the data before encoding takes place:
v Huffman coding requires the frequencies of symbol occurrence before codewords can be assigned to symbols.
v Ziv-Lempel compression:
v Does not rely on previous knowledge about the data;
v Rather, it builds this knowledge in the course of data transmission/data storage.
v The Ziv-Lempel algorithm (called LZ) uses a table of codewords created during data transmission;
v Each time, it replaces strings of characters with a reference to a previous occurrence of the string.
Lempel-Ziv Compression Algorithm

v The multi-symbol patterns are of the form C0C1 ... Cn-1Cn. The prefix of a pattern consists of all the pattern symbols except the last: C0C1 ... Cn-1.
v Lempel-Ziv output: there are three options for assigning a code to each symbol in the list:
v If a one-symbol pattern is not in the dictionary, assign (0, symbol).
v If a multi-symbol pattern is not in the dictionary, assign (dictionaryPrefixIndex, lastPatternSymbol).
v If the last input symbol or the last pattern is in the dictionary, assign (dictionaryPrefixIndex, ).
Example: LZ Compression

v Encode (i.e., compress) the string ABBCBCABABCAABCAAB using the LZ algorithm.
v The compressed message is: 0A0B2C3A2A4A6B

Example: Decompression

v Decode (i.e., decompress) the sequence 0A0B2C3A2A4A6B.
v The decompressed message is: ABBCBCABABCAABCAAB
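The three output rules and the worked example above can be sketched as an encoder/decoder pair. The compact "0A0B2C..." rendering matches the example; a real implementation would emit binary codes for the indices, and the decoder below assumes symbols are non-digit characters.

```python
import re

def lz78_encode(text):
    """Emit (prefixIndex, symbol) pairs; index 0 is the empty pattern."""
    dictionary = {}  # pattern -> index (1-based)
    out, pattern = [], ""
    for symbol in text:
        if pattern + symbol in dictionary:
            pattern += symbol                     # keep extending the match
        else:
            out.append(f"{dictionary.get(pattern, 0)}{symbol}")
            dictionary[pattern + symbol] = len(dictionary) + 1
            pattern = ""
    if pattern:                                   # input ended on a known pattern
        out.append(str(dictionary[pattern]))      # rule 3: (prefixIndex, )
    return "".join(out)

def lz78_decode(code):
    """Rebuild the dictionary while reading (index, symbol) pairs back."""
    dictionary = {0: ""}
    out = []
    for idx, sym in re.findall(r"(\d+)(\D?)", code):
        entry = dictionary[int(idx)] + sym        # previous pattern + new symbol
        dictionary[len(dictionary)] = entry
        out.append(entry)
    return "".join(out)

print(lz78_encode("ABBCBCABABCAABCAAB"))  # 0A0B2C3A2A4A6B
print(lz78_decode("0A0B2C3A2A4A6B"))      # ABBCBCABABCAABCAAB
```

Note that the decoder needs no side information: the dictionary is reconstructed from the code stream itself, which is exactly the adaptive property contrasted with Huffman coding above.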
Exercise

v Encode (i.e., compress) the following strings using the Lempel-


Ziv algorithm.
1. Mississippi
2. ABBCBCABABCAABCAAB
3. SATATASACITASA.
Indexing Structures: Assignment

v Discuss in detail the theoretical and algorithmic concepts (including construction, various operations, complexity, etc.) of the following commonly used data structures:
1. Data structure vs. file structure
2. Arrays (fixed and dynamic arrays), sorted arrays
3. Records and linked lists
4. Trees (AVL tree, binary tree): balanced vs. unbalanced trees
5. B-tree and its variants (B+ tree, B++ tree, B* tree)
6. Hierarchical trees (like the quad tree and its variants)
7. PAT tree and its variants
8. Disjoint trees: balanced and degenerate trees
9. Graphs
10. Hashing
11. Trie and its variants
Question & Answer

3/27/2025 42
Thank You !!!

