3-Index Construction
Adama Science and Technology University
School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2024)
Indexing Subsystem
Documents pass through the following stages (output of each stage in parentheses):
1. Assign document identifier (document IDs)
2. Tokenization (tokens)
3. Stop word removal (non-stop list tokens)
4. Stemming & Normalization (stemmed terms)
5. Term weighting
Documents to be indexed: Friends, Romans, countrymen.
  ↓ Tokenizer
Token stream: Friends | Romans | countrymen
• Running time:
    • Indexing time;
    • Access/search time;
    • Update time (insertion time, deletion time, modification time, ...)
• Space overhead:
    • Computer storage space consumed.
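As a rough illustration of how these criteria can be measured, the sketch below times index construction and a single lookup, and estimates the memory taken by the index. The toy collection and the `build_index` helper are assumptions made for the example; `sys.getsizeof` reports shallow sizes only, so the space figure is a lower bound.

```python
import sys
import time
from collections import defaultdict


def build_index(docs):
    """Build a simple term -> set of document IDs index."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


# Toy collection, made up for illustration.
docs = {1: "caesar was ambitious", 2: "the noble brutus"}

start = time.perf_counter()
index = build_index(docs)
indexing_time = time.perf_counter() - start  # indexing time in seconds

start = time.perf_counter()
hits = index.get("brutus", set())
search_time = time.perf_counter() - start  # access/search time in seconds

# Shallow size of the dictionary plus its posting sets (excludes the term strings).
space = sys.getsizeof(index) + sum(sys.getsizeof(p) for p in index.values())

print(f"indexing: {indexing_time:.6f}s, search: {search_time:.6f}s, space: {space} bytes")
```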
• Why vocabulary?
    • Information about the vocabulary (the list of terms) speeds up searching for relevant documents.
• Why location?
    • Information about the location of each term within the document helps with:
        • User interface design: highlighting the location of the search term.
        • Proximity-based ranking: adjacency and near operators (in Boolean searching).
• Why frequencies?
    • Frequency information is used for:
        • Calculating term weights (such as TF, TF*IDF, ...).
        • Optimizing query processing.
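To make the role of location and frequency concrete, here is a minimal sketch of a positional index with a near operator. The function names `build_positional_index` and `near` are hypothetical, and the document text is the slide's Doc 2 with punctuation removed. The length of a term's position list is its term frequency in that document.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions where the term occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index


def near(index, t1, t2, k):
    """Return the doc IDs in which t1 and t2 occur within k positions of each other."""
    result = set()
    p1, p2 = index.get(t1, {}), index.get(t2, {})
    for doc_id in p1.keys() & p2.keys():
        if any(abs(a - b) <= k for a in p1[doc_id] for b in p2[doc_id]):
            result.add(doc_id)
    return result


docs = {2: "so let it be with caesar the noble brutus has told you caesar was ambitious"}
index = build_positional_index(docs)

print(index["caesar"][2])                  # positions of 'caesar' in Doc 2
print(len(index["caesar"][2]))             # term frequency of 'caesar' in Doc 2
print(near(index, "brutus", "caesar", 3))  # documents where the terms are close
```

The stored positions are also what a user interface would use to highlight the search term in the displayed document.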
Inverted File
• The record kept for each term j in the word list contains the following:
    • Term j
    • Number of documents in which term j occurs (DFj)
    • Total frequency of term j (CFj)
    • Pointer to the postings (inverted) list for term j
Postings File (Inverted List)
Vocabulary (word list) → Postings (inverted list) → Actual documents

Term    No. of docs   Total freq   Pointer to posting
Act     3             3            → inverted list
Bus     3             4            → inverted list
pen     1             1            → inverted list
total   2             3            → inverted list
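One possible in-memory layout for such a vocabulary record is sketched below. The class `VocabularyEntry` and the function `build_inverted_file` are hypothetical names, and a Python list stands in for the pointer to the postings list. The three toy documents are made up so that the resulting counts match the sample word list above (terms are lower-cased here).

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class VocabularyEntry:
    term: str
    df: int = 0  # DFj: number of documents containing the term
    cf: int = 0  # CFj: total frequency of the term in the collection
    # Stands in for the pointer to the postings list: (doc_id, tf) pairs.
    postings: list = field(default_factory=list)


def build_inverted_file(docs):
    """Build the vocabulary with DF, CF and postings for every term."""
    vocab = {}
    for doc_id in sorted(docs):
        counts = Counter(docs[doc_id].lower().split())
        for term, tf in counts.items():
            entry = vocab.setdefault(term, VocabularyEntry(term))
            entry.df += 1
            entry.cf += tf
            entry.postings.append((doc_id, tf))
    return vocab


# Toy documents, invented to reproduce the counts in the sample word list.
docs = {
    1: "act bus pen",
    2: "act bus bus total",
    3: "act bus total total",
}
vocab = build_inverted_file(docs)

for term in sorted(vocab):
    e = vocab[term]
    print(e.term, e.df, e.cf, e.postings)
```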
Example: Indexing
Doc 2: So let it be with Caesar. The noble Brutus has told you Caesar was ambitious.
Sorting the Vocabulary
• After all documents have been tokenized, the inverted file is sorted by terms.
Remove stopwords, Apply Stemming & Compute Term Frequency
• Multiple entries of a term in a single document are merged, and frequency information is added.
• Counting the number of occurrences of terms in the collection helps to compute TF.
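The sort-and-merge steps can be sketched as follows. This is a simplified illustration that skips stop word removal and stemming. The function name `sort_based_index` is hypothetical, Doc 2 is the slide's example with punctuation removed, and Doc 1 is a made-up document added so that sorting across documents is visible.

```python
from itertools import groupby


def sort_based_index(docs):
    """Build term -> [(doc_id, tf), ...] by sorting and merging (term, doc_id) pairs."""
    # Step 1: emit one (term, doc_id) pair per token occurrence.
    pairs = [
        (token, doc_id)
        for doc_id, text in docs.items()
        for token in text.lower().split()
    ]
    # Step 2: sort by term, then by document ID.
    pairs.sort()
    # Step 3: merge identical pairs; the group size is the term frequency.
    index = {}
    for (term, doc_id), group in groupby(pairs):
        index.setdefault(term, []).append((doc_id, len(list(group))))
    return index


docs = {
    1: "brutus killed caesar",  # hypothetical document, for illustration only
    2: "so let it be with caesar the noble brutus has told you caesar was ambitious",
}
index = sort_based_index(docs)

for term, postings in index.items():
    print(term, postings)
```

Because the pairs are sorted before merging, the vocabulary comes out in alphabetical order and each postings list is ordered by document ID.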
Vocabulary and Postings File
[Figure: vocabulary, postings, and the pointers between them]
Complexity Analysis
• Storage of text:
    • The need for text compression: to reduce storage space.
• Indexing text
• Storage of indexes
    • Is compression required? Do we store the index in memory or on disk?
• Accessing text
• Accessing indexes
    • How do we access the indexes? What data/file structure should be used?
• Processing indexes
    • How do we search for a given query in the index? How do we update the index?
• Accessing documents
Text Compression
[Figure: Huffman tree over the symbols a to g; each branch is labelled 0 or 1, and the nodes carry the weights 0.1, 0.2, 0.3, 0.4, 0.6 and 1 (root)]
• Using Huffman coding, a table can be constructed by working down the tree, left to right. This gives the binary equivalent of each symbol in terms of 1s and 0s.
• What is the Huffman binary representation for ‘café’?
Exercise
Character: a b c d e t
Frequency: 16 5 12 17 10 25
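One way to check an answer to this exercise is the sketch below, which builds a Huffman tree with a priority queue and reads the codes off by walking down the tree. The function name `huffman_codes` is hypothetical. The 0/1 labelling of branches is a convention, so the exact bit patterns may differ from a hand-drawn tree, but the code lengths should agree.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Return {symbol: bit string} for a Huffman code over the given frequencies."""
    tiebreak = count()  # unique counter so the heap never has to compare tree nodes
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    # Repeatedly merge the two least frequent nodes into one internal node.
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:  # leaf: a symbol
            codes[node] = prefix or "0"

    walk(heap[0][2], "")
    return codes


freqs = {"a": 16, "b": 5, "c": 12, "d": 17, "e": 10, "t": 25}
codes = huffman_codes(freqs)
total_bits = sum(freqs[s] * len(codes[s]) for s in freqs)

for sym in sorted(codes):
    print(sym, codes[sym])
print("total bits:", total_bits)
```

For these frequencies the symbols a, d and t receive 2-bit codes, c a 3-bit code, and b and e 4-bit codes, for a total of 212 bits, compared with 255 bits for a fixed 3-bit code over the 85 characters.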
3/27/2025 42
Thank You !!!