Module 4
Overview:
The huge amount of information stored in electronic form has placed heavy demands on information retrieval systems, making information retrieval an important research area.
4.1 Introduction
• Information retrieval (IR) deals with the organization, storage, retrieval, and evaluation
of information relevant to a user's query.
• A user in need of information formulates a request in the form of a query written in a
natural language.
• The retrieval system responds by retrieving documents that seem relevant to the query.
“An information retrieval system does not inform (i.e., change the knowledge of) the user on the
subject of their inquiry. It merely informs on the existence (or non-existence) and whereabouts of
documents relating to their request”.
• This chapter focuses on text document retrieval, excluding question answering and data
retrieval systems, which handle precise queries for specific data or answers.
• In contrast, IR systems deal with vague, imprecise queries and aim to retrieve relevant
documents rather than exact answers.
• A commonly used data structure is the inverted index, which maps keywords to the
documents they appear in (a minimal sketch follows this list).
• To further reduce the number of keywords, text operations such as stop word
elimination (removing common functional words) and stemming (reducing words to
their root form) are used.
• Zipf’s law can be applied to reduce the index size by filtering out extremely frequent or
rare terms.
• Since not all terms are equally relevant, term weighting assigns numerical values to
keywords to reflect their importance.
• Choosing appropriate index terms and weights is a complex task, and several term-
weighting schemes have been developed to address this challenge.
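As a rough illustration of the inverted index idea, the following Python sketch builds one over a toy corpus; the documents and terms here are made up for illustration:

```python
from collections import defaultdict

# Toy corpus: document id -> text (made-up example data).
docs = {
    1: "design features of information retrieval systems",
    2: "information systems design",
    3: "retrieval models and ranking",
}

# Inverted index: keyword -> set of ids of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

print(sorted(index["information"]))  # [1, 2]
print(sorted(index["retrieval"]))    # [1, 3]
```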
4.2.1 Indexing
In principle, an IR system can access each document to decide its relevance to a query; for a large collection of documents, however, this approach poses practical problems. A collection of raw documents is therefore usually transformed into an easily accessible representation. This process is known as indexing.
• Stemming reduces words to their root form by removing affixes (e.g., "compute,"
"computing," "computes," and "computer" → "compute").
• This helps normalize morphological variants for consistent text representation.
• Stems are used as index terms.
• The Porter Stemmer (1980) is one of the most widely used stemming algorithms.
The stemmed representation of the text 'Design features of information retrieval systems' is:
{design, feature, inform, retrieval, system}
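A minimal sketch of stemming with NLTK's Porter stemmer. Note that the actual Porter stems are truncated forms such as 'featur' and 'retriev', slightly rougher than the idealized set above, but they still group morphological variants consistently:

```python
from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
words = ["design", "features", "information", "retrieval", "systems"]
# Porter produces truncated stems, e.g. 'featur', 'inform', 'retriev',
# which need not be dictionary words.
print(sorted({stemmer.stem(w) for w in words}))
```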
• High-frequency words lack discriminative power and are not useful for indexing.
• The primary goal is to retrieve all relevant documents for a user query.
• Different IR models exist, varying in:
o Representation: e.g., as sets of terms or vectors of weighted terms
o Matching and ranking: e.g., binary matching or numerical similarity scoring
• Some models use binary matching, while others use vector space models with
numerical scoring for ranking results.
These models can be classified as follows:
Classical models of IR
Non-classical models of IR
Alternative models of IR
1. Classical IR models: These include the Boolean, Vector, and Probabilistic models.
Example:
i. Boolean: Query and documents represented as sets of index terms; retrieval based on exact Boolean matching.
ii. Vector: Query and documents represented as vectors; cosine similarity used to rank results.
2. Non-classical IR models: Examples include the Information Logic, Situation Theory, and Interaction models.
3. Alternative IR models:
Examples include the Cluster model, Fuzzy model, and Latent Semantic Indexing (LSI).
The Boolean Model
• Introduced in the 1950s – Oldest of the three classical information retrieval models.
• Based on Boolean logic and set theory – Uses binary logic (true/false) operations.
• Document representation – Documents are represented as sets of keywords.
• Uses inverted files – A data structure listing keywords and the documents they appear in.
• Query formulation – Users must write queries using Boolean operators (AND, OR, NOT).
• Retrieval method – Documents are retrieved based on the presence or absence of query
terms.
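Since Boolean retrieval treats documents as sets of keywords, the operators map directly onto set operations over an inverted index. A minimal sketch, with made-up postings lists:

```python
# Made-up postings lists: keyword -> set of documents containing it.
index = {
    "information": {1, 2},
    "retrieval":   {1, 3},
    "design":      {1, 2},
}
all_docs = {1, 2, 3}

# Boolean operators map directly onto set operations:
print(index["information"] & index["retrieval"])  # AND -> intersection: {1}
print(index["information"] | index["retrieval"])  # OR  -> union: {1, 2, 3}
print(all_docs - index["design"])                 # NOT -> complement: {3}
```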
Advantages:
They are simple and efficient, easy to implement, and perform well in terms of recall and precision if the query is well formulated.
Drawbacks:
• The Boolean model retrieves only fully matching documents; it cannot handle documents
that are partially relevant to a query (No partial relevance).
• It does not rank the retrieved documents by relevance—documents either match or don’t
(No ranking of results).
• Users must formulate queries using strict Boolean expressions, which is unnatural and
difficult for most users (Strict query format).
4.3 Vector Space Model
• Representation:
o Documents and queries are represented as vectors of features (terms).
o Each vector exists in a multi-dimensional space, with each dimension corresponding to a unique term in the corpus.
o Terms are assigned numerical weights, often based on their frequency in the document (e.g., tf-idf).
• Similarity computation:
o Ranking algorithms (e.g., cosine similarity) compute the similarity between a document vector and the query vector.
o The similarity score determines how relevant a document is to a given query.
• Retrieval output:
o Documents are ranked based on their similarity scores to the query.
o A ranked list of documents is presented as the retrieval result.
Given a finite set of n documents: D = {d1, d2, ..., dj, ..., dn} and
a finite set of m terms: T = {t1, t2, ..., ti, ..., tm}
Each document dj is represented by a column vector of weights:
dj = (w1j, w2j, ..., wmj)T
where wij is the weight of the term ti in document dj. The document collection as a whole is represented by an m x n term-document matrix W = [wij].
Example:
Consider the documents and terms in the previous section. Let the weights be assigned based on the frequency of the term within the document. Then the associated vectors will be:
d1 = (2, 2, 1)
d2 = (1, 0, 1)
d3 = (0, 1, 1)
The vectors can be represented as points in Euclidean space.
To reduce the importance of the length of document vectors, we normalize them. Normalization changes all vectors to a standard length: we convert document vectors to unit length by dividing each dimension by the overall length of the vector. That is, the elements of each column are divided by the length of the column vector, given by
|dj| = sqrt(w1j^2 + w2j^2 + ... + wmj^2)
Let the query be Q = (1, 1, 0). Ranking the documents by their similarity to Q gives:
1. D1 — 0.951 (retrieved)
2. D2 — 0.504
3. D3 — 0.504
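These scores can be approximated with a small cosine-similarity routine. The sketch below yields 0.943 and 0.5, close to the figures quoted above; the small differences presumably reflect the exact weighting or rounding used in the original example:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

q = (1, 1, 0)
for name, d in [("D1", (2, 2, 1)), ("D2", (1, 0, 1)), ("D3", (0, 1, 1))]:
    print(name, round(cosine(q, d), 3))
# D1 0.943, D2 0.5, D3 0.5
```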
4.4 Term Weighting
• Each selected indexing term distinguishes a document from others in the collection.
• A term that appears more frequently in a document likely represents its content well.
• Terms common across many documents are less useful for distinguishing content.
• Calculated as:
IDF = log(n / ni)
• n = total number of documents
• ni = number of documents containing the term ti
• Note: A term that occurs in all documents gets the lowest IDF (n/ni = 1, hence IDF = 0 after taking the log), while a term that occurs in only one document gets the highest (n/ni = n before taking the log).
4.4.1 TF & IDF:
The aim is to assign higher weight to terms that occur frequently in a particular document but are rare across the corpus.
The tf-idf weighting scheme combines two components to determine the importance of a term:
• Term frequency (tf): A local statistic indicating how often a term appears in a document.
• Inverse document frequency (idf): A global statistic that reflects how rare or specific a
term is across the entire document collection.
• tf-idf is widely used in information retrieval and natural language processing to assess
the relevance of a term in a document relative to a corpus.
Example:
Consider a document represented by the three terms {tornado, swirl, wind} with raw tf values {4, 1, 1} respectively. In a collection of 100 documents, 15 documents contain the term tornado, 20 contain swirl, and 40 contain wind. Then, using a base-10 logarithm,
idf(tornado) = log(100/15) = 0.824
The idf values of the other terms are computed in the same way. The table below shows the weights assigned to the three terms using this approach.

Term      tf    idf = log(100/ni)    tf x idf
tornado   4     0.824                3.296
swirl     1     0.699                0.699
wind      1     0.398                0.398
Note:
Tornado: highest TF-IDF weight (3.296), indicating both high frequency in the document and relatively
low occurrence across all documents.
Swirl: rare but relevant
Wind: least significant
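The table can be reproduced with a few lines of Python; the 3.296 figure implies a base-10 logarithm, which is assumed below:

```python
import math

n = 100                                        # documents in the collection
raw_tf = {"tornado": 4, "swirl": 1, "wind": 1}
df = {"tornado": 15, "swirl": 20, "wind": 40}  # document frequencies

for term, tf in raw_tf.items():
    idf = math.log10(n / df[term])             # base-10 log matches 3.296
    print(term, round(tf * idf, 3))
# tornado 3.296, swirl 0.699, wind 0.398
```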
Most weighting schemes can thus be characterized by the following three factors:
• A within-document frequency component (tf)
• A collection frequency component (idf)
• A document length normalization component
Term weighting in IR has evolved significantly from basic tf-idf. Different combinations of tf,
idf, and normalization strategies form various weighting schemes, each affecting retrieval
performance. Advanced models like BM25 further refine this by incorporating document length
and probabilistic reasoning.
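For illustration, here is a sketch of one standard BM25 formulation. The parameter values k1 = 1.2 and b = 0.75 are conventional defaults, not values from this text, and the example numbers are made up:

```python
import math

def bm25_term_weight(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """BM25 weight of a single term in a single document."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    # Term-frequency saturation, damped by relative document length:
    norm = tf + k1 * (1 - b + b * doc_len / avg_len)
    return idf * tf * (k1 + 1) / norm

# A term with tf = 4 in a 100-word document (average length 120 words),
# occurring in 15 of 100 documents:
print(round(bm25_term_weight(4, 15, 100, 100, 120), 3))
```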
4.4.3 A simple automatic method for obtaining an indexed representation of the documents is as follows.
Step 1: Tokenization. This extracts individual terms from a document, converts all the letters to lower case, and removes punctuation marks.
Step 2: Stop word elimination. This removes common function words that occur very frequently across the document collection and carry little content.
Step 3: Stemming. This reduces the remaining terms to their linguistic root, to obtain the index terms.
Step 4: Term weighting. This assigns weights to terms according to their importance in the document, in the collection, or some combination of both.
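A compact sketch of the four steps, using a toy stop word list and a crude suffix-stripping rule in place of a full stemmer:

```python
import re
from collections import Counter

STOP_WORDS = {"of", "the", "a", "an", "and", "in", "to", "is", "for"}  # toy list

def index_terms(text):
    # Step 1: tokenization - lowercase, keep alphabetic tokens only.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Step 2: stop word elimination.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Step 3: "stemming" - a crude suffix rule stands in for a real stemmer.
    stems = [re.sub(r"(ation|ing|s)$", "", t) for t in tokens]
    # Step 4: term weighting - here, raw term frequency.
    return Counter(stems)

print(index_terms("Design features of information retrieval systems."))
# Counter({'design': 1, 'feature': 1, 'inform': 1, 'retrieval': 1, 'system': 1})
```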
Example:
Sample documents
• Vector Space Model (VSM) represents documents and queries as vectors in a multi-
dimensional space.
• Retrieval is based on measuring the closeness between query and document vectors.
• Documents are ranked according to their numeric similarity to the query.
• Selected documents are those geometrically closest to the query vector.
• The model assumes that similar vectors represent semantically related documents.
• Example in a 2D space using terms ti and tj:
o Document d1: 2 occurrences of ti
o Document d2: 1 occurrence of ti
o Document d3: 1 occurrence each of ti and tj
• Term weights (raw term frequencies) are used as vector coordinates.
• Angles θ1, θ2, θ3 represent direction differences between document vectors and the query.
• Basic similarity measure: counting common terms.
• Commonly used similarity metric: inner product of query and document vectors.
Jaccard's Coefficient:
Computes similarity as the ratio of the inner product to the union (the sum of squared weights minus the inner product):
sim(dj, qk) = (dj . qk) / (|dj|^2 + |qk|^2 - dj . qk)
Cosine Similarity:
Computes the cosine of the angle between the document vector dj and the query vector qk. It gives a similarity score between 0 and 1:
cos(dj, qk) = (dj . qk) / (|dj| |qk|)
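Both measures reduce to a few lines of code; the vectors below are made up for illustration:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def jaccard(d, q):
    return dot(d, q) / (dot(d, d) + dot(q, q) - dot(d, q))

def cosine(d, q):
    return dot(d, q) / (math.sqrt(dot(d, d)) * math.sqrt(dot(q, q)))

d, q = (2, 1), (1, 1)            # made-up 2D term-weight vectors
print(round(jaccard(d, q), 3))   # 0.75
print(round(cosine(d, q), 3))    # 0.949
```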
4.5 Non-classical Models of IR
• Retrieval is seen as an information flow from document to query.
3. Interaction Model
• Documents are interconnected; retrieval emerges from the interaction between query
and documents.
• Implemented using artificial neural networks, where documents and the query are
neurons in a dynamic network.
• Query integration reshapes connections, and the degree of interaction guides retrieval.
4.6 Alternative Models of IR
4.6.1 Cluster Model
• Reduces the number of document comparisons during retrieval by grouping similar documents.
• Suggests that documents with high similarity are likely to be relevant to the same queries.
Cluster Representation
• Each cluster Ck is represented by a vector rk = (a1k, a2k, ..., amk), where each element is the average of the corresponding term weights in the documents of that cluster.
• The query is compared with the cluster representatives; a cluster Ck whose similarity Sk exceeds a threshold is returned, and the search proceeds within that cluster.
Example:
Consider 3 documents (d1, d2, d3) and 5 terms (t1 to t5). The term-by-document matrix is:
t/d d1 d2 d3
t1 1 1 0
t2 1 0 0
t3 1 1 1
t4 0 0 1
t5 1 1 0
So, the document vectors are: d1 = (1, 1, 1, 0, 1), d2 = (1, 0, 1, 0, 1), d3 = (0, 0, 1, 1, 0).
Calculating pairwise cosine similarities gives:
      d1    d2    d3
d1   1.0
d2   0.87  1.0
d3   0.35  0.41  1.0
Clustering with threshold 0.7:
• d1 and d2 → sim = 0.87 → Cluster C1
• d3 has low similarity with both → Cluster C2
Clusters:
• C1 = {d1, d2}
• C2 = {d3}
Cluster representatives
• r1 = avg(d1, d2)
= ((1+1)/2, (1+0)/2, (1+1)/2, (0+0)/2, (1+1)/2)
= (1, 0.5, 1, 0, 1)
• r2 = d3 = (0, 0, 1, 1, 0)
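A sketch of the whole procedure, using a greedy single-pass clustering strategy (one of several possibilities; the model itself does not prescribe a specific clustering algorithm):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

docs = {
    "d1": (1, 1, 1, 0, 1),
    "d2": (1, 0, 1, 0, 1),
    "d3": (0, 0, 1, 1, 0),
}
THRESHOLD = 0.7

def representative(cluster):
    # Average the term weights of the cluster's member vectors.
    return [sum(col) / len(cluster) for col in zip(*(v for _, v in cluster))]

# Put each document in the first cluster whose representative it is
# similar enough to; otherwise start a new cluster.
clusters = []
for name, vec in docs.items():
    for cluster in clusters:
        if cosine(vec, representative(cluster)) >= THRESHOLD:
            cluster.append((name, vec))
            break
    else:
        clusters.append([(name, vec)])

for i, cluster in enumerate(clusters, 1):
    print(f"C{i}: {[n for n, _ in cluster]}, rep {representative(cluster)}")
# C1: ['d1', 'd2'], rep [1.0, 0.5, 1.0, 0.0, 1.0]
# C2: ['d3'], rep [0.0, 0.0, 1.0, 1.0, 0.0]
```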
4.6.2 Fuzzy Model
In the fuzzy model of information retrieval, each document is represented as a fuzzy set of
terms, where each term is associated with a membership degree indicating its importance to
the document's content. These weights are typically derived from term frequency within the
document and across the entire collection.
For queries:
• A single-term query returns documents where the term’s weight exceeds a threshold.
This model allows ranking documents by their degree of relevance to the query.
Example:
Documents:
• d1 = {information, retrieval, query}
• d2 = {retrieval, query, model}
• d3 = {information, retrieval}
Term Set:
• T = {t1, t2, t3, t4} = {information, model, query, retrieval}
Query:
• q = t2 ˄ t4 (i.e., model AND retrieval)
In fuzzy logic, the AND operation (˄) is typically interpreted using the minimum of the
memberships.
Step 1: Retrieve the membership degrees of t2 (model) and t4 (retrieval) in each document.
Step 2: Apply the fuzzy AND by taking the minimum of the two memberships. Only documents with a non-zero membership value for the whole query are returned; here, only d2 has a non-zero value (1/3).
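A sketch of this retrieval process, assuming for illustration that a term's membership degree in a document is 1/(number of distinct terms in the document), which reproduces the 1/3 value above:

```python
# Assumed membership weighting: uniform over the document's distinct terms.
docs = {
    "d1": {"information", "retrieval", "query"},
    "d2": {"retrieval", "query", "model"},
    "d3": {"information", "retrieval"},
}

def membership(term, doc_terms):
    return 1 / len(doc_terms) if term in doc_terms else 0.0

# Query: model AND retrieval; fuzzy AND = minimum of the memberships.
for name, terms in docs.items():
    score = min(membership("model", terms), membership("retrieval", terms))
    if score > 0:
        print(name, round(score, 3))   # only d2: 0.333
```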
4.6.3 Latent Semantic Indexing (LSI) Model
Latent Semantic Indexing (LSI) applies Singular Value Decomposition (SVD) to information retrieval, aiming to uncover hidden semantic structures in word usage across documents. Unlike traditional keyword-based methods, LSI captures conceptual similarities between terms and documents, even when there is no exact term match.
The term-by-document matrix W is decomposed via SVD as
W = T S DT
where T contains the term vectors, S is the diagonal matrix of singular values, and D contains the document vectors. Truncating S to the largest singular values projects terms and documents into a reduced latent semantic space.
• Similarity Computation: Documents are ranked using similarity measures (e.g., cosine
similarity) between the query vector and document vectors in the latent space.
Advantages:
• Captures semantic relationships between terms and documents.
• Can retrieve relevant documents even if they don’t share any terms with the query.
• Reduces the impact of synonymy and polysemy.
Example:
The SVD of X is computed to get the three matrices T, S, and D:
X(5x6) = T(5x5) S(5x5) (D(6x5))T
where T contains the term vectors, S the diagonal matrix of singular values, and D the document vectors.
Consider the first two largest singular values of S, and rescale DT(2x6) with the singular values to get the matrix R(2x6) = S(2x2) DT(2x6), as shown in the figure below. R is a reduced-dimensionality representation of the original term-by-document matrix X.
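The same reduction can be sketched with NumPy; the matrix below is a toy term-by-document matrix made up for illustration, not the X from the text:

```python
import numpy as np  # pip install numpy

# Toy 5-term x 6-document matrix (made-up counts).
X = np.array([
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
], dtype=float)

T, s, Dt = np.linalg.svd(X, full_matrices=False)  # X = T @ diag(s) @ Dt
k = 2                                             # keep the 2 largest singular values
R = np.diag(s[:k]) @ Dt[:k, :]                    # reduced representation S_k D_k^T

# Document-document correlations in the reduced space:
N = R / np.linalg.norm(R, axis=0)                 # length-normalize columns
M = N.T @ N
print(np.round(M, 2))
```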
To find out the changes introduced by the reduction, we compute document similarities in the
new space and compare them with the similarities between documents in the original space.
The document-document correlation matrix for the original n-dimensional space is given by the matrix Y = XTX. Here, Y is a square, symmetric n x n matrix. An element yij of this matrix gives the similarity between documents i and j. The correlation matrix for the original document vectors is shown in the figure; it is computed using X, after normalizing the lengths of its columns.
The document-document correlation matrix for the new space is computed analogously using
the reduced representation R. Let N be the matrix R with length-normalized columns. Then, M=
NTN gives the matrix of document correlations in the reduced space. The correlation matrix M
is given in Figure.
The similarity between document d1 and documents d4 (-0.0304) and d6 (-0.2322) is quite low in the new space, because document d1 is not topically similar to documents d4 and d6.
In the original space, the similarity between documents d2 and d3 and between documents d2 and d5 is 0. In the new space, they have high similarity values (0.5557 and 0.8518 respectively), although documents d3 and d5 share no term with document d2. This topical similarity is recognized due to the term co-occurrence patterns in the documents.
4.7 Major Issues in Information Retrieval
1. Vocabulary Mismatch: Users often express queries using terms that differ from those in
relevant documents, leading to retrieval failures.
2. Ambiguity and Polysemy: Words with multiple meanings can cause confusion in
interpreting user intent, affecting retrieval accuracy.
3. Scalability and Performance: As data volumes grow, IR systems must efficiently index
and retrieve information without compromising speed or accuracy.
4. Evaluation Metrics: Determining the relevance and effectiveness of IR systems is
challenging due to the subjective nature of "relevance" and the lack of standardized
evaluation methods.
5. User Behavior Modeling: Understanding and predicting user behavior is essential for
refining search results and improving user satisfaction.
6. Integration with Natural Language Processing (NLP): Incorporating NLP techniques
can enhance IR systems by enabling better understanding of context and semantics, but
it also introduces complexity.
These issues highlight the multifaceted nature of IR and the need for interdisciplinary approaches
to address them effectively.
Part B
LEXICAL RESOURCES
1. Introduction
The chapter provides an overview of freely available tools and lexical resources for natural
language processing (NLP), aimed at assisting researchers—especially newcomers to the field.
It emphasizes the importance of knowing where to find resources, which can significantly reduce
time and effort. The chapter compiles and briefly discusses key tools such as stemmers, taggers,
parsers, and lexical databases like WordNet and FrameNet, along with accessible test corpora,
all of which are available online or through scholarly articles.
2. WORDNET
A comprehensive lexical database for the English language developed at Princeton University
under George A. Miller based on psycholinguistic principles, WordNet is divided into three
databases: nouns, verbs, and a combined one for adjectives and adverbs.
Key features include:
• Synsets: Groups of synonymous words representing a single concept.
• Lexical and semantic relations: These include synonymy, antonymy,
hypernymy/hyponymy (generalization/specialization), meronymy/holonymy
(part/whole), and troponymy (manner-based verb distinctions).
• Multiple senses: Words can belong to multiple synsets and parts of speech, with each
sense given a gloss—a dictionary-style definition with usage examples.
• Hierarchical structure: Nouns and verbs are arranged in taxonomic hierarchies (e.g.,
'river' has a hypernym chain), while adjectives are grouped by antonym sets.
The figure 1 shows the entries for the word 'read'. 'Read' has one sense as a noun and 11 senses as a verb.
Glosses help differentiate meanings. Figures 2, 3, and 4 show some of the relationships that hold between
nouns, verbs, and adjectives and adverbs.
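These lookups can be reproduced programmatically with NLTK's WordNet interface (this assumes the WordNet data has been fetched via nltk.download('wordnet')):

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

# Senses of 'read': each synset has a part of speech and a gloss.
for syn in wn.synsets("read"):
    print(syn.name(), syn.pos(), "-", syn.definition())

# A hypernym chain for 'river', from the root concept down.
river = wn.synsets("river")[0]
print([s.name() for s in river.hypernym_paths()[0]])
```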
Nouns and verbs are organized into hierarchies based on the hypernymy/hyponymy relation, whereas adjectives are organized into clusters based on antonym pairs (or triplets). Figure 5 shows a hypernym chain for 'river' extracted from WordNet. Figure 6 shows the troponym relations for the verb 'laugh'.
• Figure 7 shows the Hindi WordNet entry for the word (aakanksha).
• Hindi WordNet can be obtained from the URL http://www.cfilt.iitb.ac.in/wordnet/webhwn/. CFILT has also developed a Marathi WordNet.
• Figure 8 shows the Marathi WordNet (http://www.cfilt.iitb.ac.in/wordnet/webmwn/wn.php) entry for the word (pau).
Figure 8: WordNet entry for the Marathi word (pau)
Document Summarization:
WordNet aids in generating lexical chains—sequences of semantically related words—that help
identify key topics and coherence in texts.
• Barzilay and Elhadad (1997) used this technique to improve text summarization.
3. FRAMENET
FrameNet is a rich lexical database of semantically annotated English sentences, grounded in frame semantics.
1. Frame Semantics:
Each word (especially verbs, nouns, adjectives) evokes a specific situation or event
known as a frame.
2. Target Word / Predicate:
The word that evokes the frame (e.g., nab in the ARREST frame).
3. Frame Elements (FEs):
These are semantic roles or participants in the frame-specific event (e.g.,
AUTHORITIES, SUSPECT, TIME in the ARREST frame).
o These roles define the predicate-argument structure of the sentence.
4. Annotated Sentences:
Sentences, often drawn from the British National Corpus, are tagged with frame
elements to illustrate how words function in context.
5. Ontology Representation:
FrameNet provides a semantic-level ontology of language, representing not just
grammatical but also contextual and conceptual relationships.
Example:
In the sentence, “The police nabbed the suspect,” the word nab triggers the ARREST frame:
• The police → AUTHORITIES
• The suspect → SUSPECT
The COMMUNICATION frame includes roles like ADDRESSEE, COMMUNICATOR, TOPIC, and
MEDIUM. The JUDGEMENT frame includes JUDGE, EVALUEE, and REASON. Frames can inherit
roles from others; for instance, the STATEMENT frame inherits from COMMUNICATION and includes
roles such as SPEAKER, ADDRESSEE, and MESSAGE.
The following sentences show some of these roles:
[Judge She] [Evaluee blames the police] [Reason for failing to provide enough protection].
[Speaker She] told [Addressee me] [Message 'I’ll return by 7:00 pm today'].
Figure 9 shows the core and non-core frame elements of the COMMUNICATION frame, along with other
details.
4. STEMMERS:
Stemming (or conflation) is the process of reducing inflected or derived words to their base
or root form. The resulting stem doesn't need to be a valid word, as long as related terms map
to the same stem.
Purpose:
• Helps in query expansion, indexing (e.g., in search engines), and various NLP tasks.
Common Stemming Algorithms:
• Porter's Stemmer – Most widely used (Porter, 1980).
• Lovins Stemmer – An earlier approach (Lovins, 1968).
• Paice/Husk Stemmer – A more recent and flexible method (Paice, 1990).
These tools, called stemmers, differ in how aggressively they reduce words but all aim to
improve text processing by grouping word variants.
Figure 10 shows a sample text and output produced using these stemmers.
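NLTK implements both the Porter stemmer and the Lancaster stemmer (an implementation of the Paice/Husk approach), so their behavior can be compared directly; the word list below is made up for illustration:

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()   # an implementation of the Paice/Husk approach

for word in ["connection", "running", "generalization", "organizer"]:
    print(word, "->", porter.stem(word), "/", lancaster.stem(word))
# Lancaster typically strips more aggressively than Porter.
```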
o Ramanathan and Rao (2003): Used handcrafted suffix lists for Hindi.
o Majumder et al. (2007): Used a cluster-based approach, evaluated using
Bengali data, and found that stemming improves recall.
• CFILT, IIT Bombay has developed stemmers for Indian languages:
o http://www.cfilt.iitb.ac.in
4.3 Stemming Applications:
• Widely used in search engines and IR systems:
o Reduces word variants to a common form, improving recall and reducing index
size.
o Example: "astronaut" and "astronauts" are treated as the same term.
o However, for English, stemming may not always improve precision.
• Also applied in:
o Text summarization
o Text categorization
o Term frequency analysis, by consolidating word forms into stems.
5. PART-OF-SPEECH TAGGER
Part-of-speech tagging is a crucial early-stage NLP technique used in applications like speech
synthesis, machine translation, information retrieval (IR), and information extraction. In
IR, it helps with indexing, phrase extraction, and word sense disambiguation.
• Performance:
o Outperforms unidirectional methods.
o Comparable to top algorithms like kernel SVMs.
• Reference: Tsuruoka and Tsujii (2005)
Table 12.1 shows tagged text of document #93 of the CACM collection.
5.6 Tree-Tagger
• Type: Probabilistic (uses decision trees).
• Strengths:
o Effective with sparse data.
o Automatically selects optimal context size.
• Accuracy: Above 96% on Penn Treebank.
• The tagger is available at http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
Glasgow University, UK, maintains a list of freely available IR test collections. The table lists the sources of those and a few more IR test collections. LETOR (learning to rank) is a package of benchmark data sets released by Microsoft Research Asia. It consists of two datasets: OHSUMED and TREC (TD2003 and TD2004).
LETOR is packaged with extracted features for each query-document pair in the collection, baseline results of several state-of-the-art learning-to-rank algorithms on the data, and evaluation tools. The data set is aimed at supporting future research in the area of learning ranking functions for information retrieval.
Evaluating a text summarization system requires the existence of 'gold' summaries. DUC provides document collections with known extracts and abstracts, which are used for evaluating the performance of summarization systems submitted at TREC conferences. Figure 11 shows a sample document and its extract from the DUC 2002 summarization data.
Open Mind Word Expert attempts to create a very large sense-tagged corpus. It collects word sense taggings from the general public over the Web.
6.4 Asian Language Corpora
The EMILLE (Enabling Minority Language Engineering) corpus is a multilingual resource
developed at Lancaster University, UK, aimed at supporting natural language processing (NLP)
for South Asian languages. The project, in collaboration with the Central Institute for Indian
Languages (CIIL) in India, provides extensive data and tools for various Indian languages. The
corpus includes monolingual written and spoken corpora, parallel corpora, and annotated data.
The monolingual written corpus covers 14 South Asian languages, while the spoken data,
sourced from BBC Asia radio broadcasts, includes five languages: Hindi, Bengali, Gujarati,
Punjabi, and Urdu. The parallel corpus consists of English texts and their translations into five
languages, featuring materials like UK government advice leaflets, aligned at the sentence level.
The annotated section includes part-of-speech tagging for Urdu and annotations of demonstrative
usage in Hindi.
The EMILLE/CIIL corpus is available free of charge for research purposes at elda.org, with
further details provided in the manual at emille.lancs.ac.uk. This resource is particularly valuable
for research in statistical machine translation and other NLP applications involving Indian
languages, despite challenges posed by the limited availability of electronic text repositories in
these languages.
• HLT-NAACL: Sponsored by the North American chapter of ACL; covers human
language technologies.
Notable Journals:
• Journal of Computational Linguistics: Focuses on theoretical and linguistic aspects.
• Natural Language Engineering Journal: Focuses on practical NLP applications.
• Information Retrieval (Kluwer), Information Processing and Management (Elsevier),
ACM TOIS (Transactions on Information Systems), Journal of the American Society for
Information Science.
Other Relevant Journals:
• International Journal of Information Technology and Decision Making (World Scientific)
• Journal of Digital Information Management
• Journal of Information Systems
AI Journals Reporting NLP Work:
• Artificial Intelligence
• Computational Intelligence
• IEEE Transactions on Intelligent Systems
• Journal of AI Research