NLP QBS Module 4 & 5
i) Cluster Model
The Cluster Model is a technique in Information Retrieval (IR) that organizes
documents into clusters based on their similarity. Instead of retrieving individual
documents from an extensive dataset, the retrieval system first identifies relevant
clusters and then searches within them, making retrieval faster and more efficient.
Example
Imagine a digital library containing 200,000 research papers on Artificial
Intelligence, Data Science, and Natural Language Processing. Instead of searching
through all papers, the clustering algorithm groups documents into categories like:
- Machine Learning
- Neural Networks
- Reinforcement Learning
- Deep Learning
When a user searches for "Transformer Models," the system first identifies the
cluster containing topics related to "Deep Learning" before retrieving documents
from this specific group. This reduces search time and improves relevance.
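A minimal sketch of this idea, assuming a toy corpus and scikit-learn's TfidfVectorizer and KMeans (the corpus, number of clusters, and query below are illustrative assumptions, not part of the original example):

```python
# Cluster-based retrieval sketch: group documents with K-Means over TF-IDF
# vectors, route the query to the nearest cluster, and rank only that
# cluster's documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "transformer models for deep learning",
    "convolutional neural networks for image recognition",
    "reinforcement learning with reward signals",
    "q-learning and policy gradient methods",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

query_vec = vectorizer.transform(["Transformer Models"])
best_cluster = km.predict(query_vec)[0]                       # route query to one cluster
in_cluster = [i for i, c in enumerate(km.labels_) if c == best_cluster]
scores = cosine_similarity(query_vec, X[in_cluster]).ravel()  # rank only within that cluster
for i, s in sorted(zip(in_cluster, scores), key=lambda p: -p[1]):
    print(round(s, 2), docs[i])
```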
Applications
- Search Engines: Google and Bing use clustering to organize web pages by topic.
- Recommendation Systems: Netflix and Amazon group movies/products for
better recommendations.
- Text Categorization: Helps automate document classification.
Advantages
- Improves Retrieval Efficiency: Cluster-based searches are faster than scanning
the entire dataset.
- Enhances Organization: Categorization helps users find relevant information.
- Scalable for Large Datasets: Works well with vast collections of documents.
Limitations
- Pre-Clustering Overhead: Requires computational power to initially group
documents.
- Cluster Accuracy: Performance depends on choosing the right algorithm.
ii) Fuzzy Model
The Fuzzy Model ranks documents by degrees of relevance rather than by strict yes/no matching: each document receives a membership value between 0 and 1 that reflects how well it satisfies the query, so partially matching documents can still be retrieved and ranked.
Example
Imagine a user searching for “best laptops for programming” on an e-commerce
website. The fuzzy IR model assigns relevance scores to laptops based on
parameters like:
- Processor Speed
- RAM Size
- Battery Life
- User Reviews
A high-end gaming laptop might score 0.9 relevance, whereas a budget laptop
might score 0.6 relevance. Both laptops appear in the search results but are ranked
differently based on partial matching.
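A toy sketch of how such graded relevance scores could be computed; the membership functions, attribute ranges, and weights below are illustrative assumptions rather than a standard formulation:

```python
# Fuzzy relevance sketch: each attribute is mapped to a membership value in
# [0, 1] and the values are combined into a single relevance score.
def membership(value, low, high):
    """Linear membership: 0 at or below `low`, 1 at or above `high`."""
    if value <= low:
        return 0.0
    if value >= high:
        return 1.0
    return (value - low) / (high - low)

def relevance(laptop, weights):
    scores = {
        "cpu": membership(laptop["ghz"], 2.0, 4.5),
        "ram": membership(laptop["ram_gb"], 8, 32),
        "battery": membership(laptop["battery_h"], 4, 12),
        "reviews": laptop["review_score"] / 5.0,   # review score assumed to be on a 0-5 scale
    }
    return sum(weights[k] * scores[k] for k in scores)

weights = {"cpu": 0.35, "ram": 0.25, "battery": 0.2, "reviews": 0.2}
gaming = {"ghz": 4.4, "ram_gb": 32, "battery_h": 6, "review_score": 4.6}
budget = {"ghz": 2.6, "ram_gb": 16, "battery_h": 9, "review_score": 4.0}
print(round(relevance(gaming, weights), 2))   # higher partial match
print(round(relevance(budget, weights), 2))   # lower, but still retrieved and ranked
```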
Applications
- Semantic Search Engines: Handles queries with vague language.
- Recommendation Systems: Suggests products based on partial matches.
- Medical Databases: Retrieves research papers with similar diagnoses.
Advantages
- Handles Ambiguity: Can retrieve relevant documents even with unclear queries.
- Better User Experience: Improves search results quality.
- Increased Search Flexibility: Works well in real-world applications.
Limitations
- Complex Implementation: Requires defining membership functions for
relevance.
- Higher Computation Costs: More resource-intensive than Boolean models.
2. WordNet and Its Applications
WordNet is a lexical database that organizes words based on their meanings and
relationships. Developed at Princeton University, WordNet plays a crucial role in
Natural Language Processing (NLP) and Information Retrieval (IR).
Structure
WordNet classifies words into synsets—groups of synonyms that represent a
single concept. Each synset contains:
- Definition (Gloss): A dictionary-style explanation.
- Example Usage: Sentences demonstrating meaning.
- Lexical Relations: Synonymy, hypernymy, hyponymy, antonymy, meronymy,
etc.
Example
For the word "read," WordNet lists:
- Synset: {read, interpret, scan}
- Gloss: "Interpret something written or printed."
- Example: "Can you read Greek?"
Applications
1. Search Engines: Enhances relevance by understanding synonyms.
2. Text Classification: Groups documents based on meaning.
3. Machine Translation: Improves language translation quality.
4. Sentiment Analysis: Helps AI understand emotions in text.
5. Word Sense Disambiguation: Distinguishes meanings in different contexts.
3. Latent Semantic Indexing (LSI) Model
LSI maps terms and documents into a lower-dimensional latent "concept" space (obtained by singular value decomposition of the term-document matrix), so documents can be retrieved by conceptual similarity rather than exact keyword overlap.
Example
If a user searches for “climate change”, LSI recognizes underlying concepts and
retrieves related documents discussing global warming, CO2 emissions, and
deforestation.
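A minimal sketch of LSI-style retrieval, assuming TF-IDF vectors reduced with truncated SVD in scikit-learn (the corpus, number of latent dimensions, and query are toy assumptions):

```python
# LSI sketch: project TF-IDF vectors into a low-rank "concept" space with
# truncated SVD, then match the query in that latent space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "global warming raises average temperatures",
    "CO2 emissions strengthen the greenhouse effect",
    "deforestation accelerates climate change",
    "the stock market fell sharply today",
]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

svd = TruncatedSVD(n_components=2, random_state=0)   # latent "concept" dimensions
X_lsi = svd.fit_transform(X)

q_lsi = svd.transform(tfidf.transform(["climate change"]))
for doc, s in sorted(zip(docs, cosine_similarity(q_lsi, X_lsi).ravel()),
                     key=lambda p: -p[1]):
    print(round(s, 2), doc)
```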
Applications
1. Search Engines: Enhances relevance in information retrieval.
2. Document Clustering: Improves organization of research papers.
3. Recommendation Systems: Suggests articles based on conceptual similarity.
Advantages
- Handles Synonymy and Polysemy: Improves topic-based retrieval.
- Enhances Search Accuracy: Retrieves documents even if exact keywords are
missing.
- Improves Document Classification: Organizes texts based on meaning.
Limitations
- Computationally Intensive: Requires significant processing power.
- May Retrieve Irrelevant Documents: Needs fine-tuning to enhance precision.
4. Explain FrameNet with its applications.
• Each word evokes a particular situation with particular participants.
FrameNet aims at capturing these situations through case-frame
representation of words (verbs, adjectives, and nouns).
• The word that invokes a frame is called the target word or predicate, and the
participant entities are defined using semantic roles, called frame elements.
• The FrameNet ontology is a semantic-level representation of predicate-argument
structure.
• Sentences from the British National Corpus are tagged with these frame
elements.
Each frame contains a main lexical item as a predicate and associated frame-
specific semantic roles, such as AUTHORITIES, TIME, SUSPECT (in the
ARREST frame), called frame elements.
[Judge She] blames [Evaluee the police] [Reason for failing to provide enough
protection].
[Speaker She] told [Addressee me] [Message "I'll return by 7:00 pm today"].
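For illustration, frames and their frame elements can be browsed with NLTK's FrameNet reader (this assumes the FrameNet data has been downloaded via nltk.download('framenet_v17'); the frame element and lexical unit names come from the FrameNet database itself):

```python
# Look up the ARREST frame and list its frame elements and some lexical units.
from nltk.corpus import framenet as fn

frame = fn.frame("Arrest")
print(frame.name)
print("Frame elements :", sorted(frame.FE.keys()))        # e.g., Authorities, Suspect, Time, ...
print("Lexical units  :", sorted(frame.lexUnit.keys())[:5])
```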
Applications:
1. Gildea and Jurafsky (2002) and Kwon et al. (2004) used FrameNet data for
automatic semantic parsing.
2. It is useful for information extraction through shallow semantic role
labeling.
3. FrameNet also helps in thematic role identification. For example, it helps
identify that the theme role played by “match” is the same in the sentences:
“The umpire stopped the match.”
“The match stopped due to bad weather.”
In the first, “match” is the object; in the second, it is the subject.
4. FrameNet supports question-answering systems by defining frames such as
TRANSFER with roles like SENDER, RECIPIENT, and GOODS.
For example: “Khushbu received a packet from the examination cell.”
This enables answering questions like “Who sent the packet to Khushbu?”
5. Other applications of FrameNet include Information Retrieval (IR),
machine translation, text summarization, and word sense disambiguation.
A wide variety of NLP research corpora have been developed to serve different
research purposes. For example, for the Information Retrieval (IR) task, test
collections like the ones shown below have been used:
IR Test Collections
- LISA
- CACM
- CISI
- MEDLINE
- CRANFIELD
- TIME
- ADI
These are available at:
http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/
Text Summarization
Document:
> "Hurricane Gilbert — A monster hurricane was on a collision course with the
Caribbean... (long article)"
Summary (Gold Summary):
> "Hurricane Gilbert grew into the most powerful storm ever recorded in the
Western Hemisphere... Gilbert’s path moved west toward Mexico."
The system’s output summary is compared with this gold summary to measure
performance.
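As a toy illustration of such a comparison, the sketch below computes the unigram recall of a system summary against the gold summary; the source does not name a specific metric, so this ROUGE-1-style overlap is only an illustrative choice:

```python
# Toy summary evaluation: fraction of gold-summary unigrams that also appear
# in the system summary (clipped by count).
from collections import Counter

def unigram_recall(system, gold):
    sys_counts = Counter(system.lower().split())
    gold_counts = Counter(gold.lower().split())
    overlap = sum((sys_counts & gold_counts).values())
    return overlap / sum(gold_counts.values())

gold = "Hurricane Gilbert grew into the most powerful storm ever recorded"
system = "Gilbert became the most powerful storm recorded in the hemisphere"
print(round(unigram_recall(system, gold), 2))   # 0.6 on this toy pair
```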
Word Sense Disambiguation
• For Word Sense Disambiguation (WSD), the SEMCOR corpus is used.
SEMCOR is a corpus where every content word is tagged with a WordNet
sense.
• Another important initiative is the Open Mind Word Expert project, which provides
a large-scale, sense-tagged dataset via crowdsourcing.
• These datasets are used to train and evaluate systems on selecting the
correct word sense given a context.
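As a small illustration of choosing a WordNet sense from context, the sketch below uses NLTK's implementation of the Lesk algorithm; Lesk is not mentioned above and is shown here only as a simple, readily available baseline (requires nltk.download('wordnet') and nltk.download('punkt')):

```python
# Simplified WSD: pick the WordNet sense of "bank" whose gloss best overlaps
# with the surrounding context words (Lesk's dictionary-overlap heuristic).
from nltk import word_tokenize
from nltk.wsd import lesk

context = word_tokenize("He sat on the bank of the river and watched the water")
sense = lesk(context, "bank", pos="n")        # returns a WordNet Synset (or None)
print(sense)
print(sense.definition() if sense else "no sense selected")
```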
PART-OF-SPEECH TAGGER
• Part-of-speech tagging is used at an early stage of text processing in
many NLP applications such as speech synthesis, machine translation,
IR, and information extraction.
• In IR, part-of-speech tagging can be used in indexing (for identifying
useful tokens like nouns), extracting phrases and for disambiguating
word senses.
• The rest of this section presents a number of part-of-speech taggers that
are already in place.
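Before turning to the individual taggers, here is a minimal tagging sketch using NLTK's default tagger, including the IR-style step of keeping only nouns as index terms (the example sentence and the required resource downloads, punkt and the averaged perceptron tagger, are assumptions):

```python
# Tag a sentence and keep only noun tokens as candidate index terms.
import nltk

tokens = nltk.word_tokenize("The umpire stopped the match due to bad weather")
tagged = nltk.pos_tag(tokens)                                # e.g., [('The', 'DT'), ('umpire', 'NN'), ...]
index_terms = [w for w, t in tagged if t.startswith("NN")]   # nouns only
print(tagged)
print(index_terms)
```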
Stanford Log-linear Part-of-Speech (POS) Tagger
• This POS Tagger is based on maximum entropy Markov models. The
key features of the tagger are as follows:
(i) It makes explicit use of both the preceding and following tag
contexts via a dependency network representation.
(ii) It uses a broad range of lexical features.
(iii) It utilizes priors in conditional log-linear models.
The reported accuracy of this tagger on the Penn Treebank WSJ is 97.24%, which
amounts to an error reduction of 4.4% on the best previous single automatically
learned tagging result (Toutanova et al. 2003).
TnT tagger
TnT (Trigrams'n'Tags, Brants 2000) is an efficient statistical tagger based on
second-order (trigram) Markov models. A sample of its tagged output consists of
word/tag pairs such as:
A/DT, Simple/JJ, Technique/NN, algebraic/JJ, For/IN, a/DT, The/DT, address/NN,
To/TO, Translate/VB
Brill Tagger
Brill (1992) described a trainable rule-based tagger that obtained performance
comparable to that of stochastic taggers.
It uses transformation-based learning to automatically induce rules. A number
of extensions to this rule-based tagger have been proposed by Brill (1994). He
describes a method for exploring lexical relations in tagging that stochastic
taggers are currently unable to express.
CLAWS Tagger
CLAWS (the Constituent Likelihood Automatic Word-tagging System) has been
designed so that it can be easily adapted to different types of text in
different input formats. CLAWS has achieved 96–97% accuracy. The precise
degree of accuracy varies according to the type of text. For more information
on the CLAWS tagger, see Garside (1987), Leech, Garside, and Bryant (1994),
Garside (1996), and Garside and Smith (1997).
Tree-Tagger
Tree-Tagger (Schmid 1994) is a probabilistic tagging method. It avoids
problems faced by the Markov model methods when estimating transition
probabilities from sparse data, by using a decision tree to estimate transition
probabilities.
The decision tree automatically detects the appropriate size of the context to be
used in estimation. The reported accuracy of Tree-Tagger is above 96% on the
Penn Treebank WSJ corpus.
STEMMERS
Input Text:
Such an analysis can reveal features that are not easily visible from the
variations in the individual genes and can lead to a picture of expression
that is more biologically transparent and accessible to interpretation.
Output:
• Lovins stemmer:
such an analys can rev featur that are not as eas vis from the vari in the
individu gen and can lead to a pictur of express that is mor biolog transpar
and access to interpret.
• Porter’s stemmer:
such an analysi can reveal featur that are not easili visibl from the variat in
the individu gene and can lead to a pictur of express that is more biolog
transpar and access to interpret.
• Paice stemmer:
Such an analys can rev feat that are not easy vis from the vary in the individ
gen and can lead to a pict of express that is mor biolog transp and access to
interpret.
There are many stemmers available for English and other languages.
Snowball presents stemmers for English, Russian, and a number of other
European languages, including French, Spanish, Portuguese, Hungarian,
Italian, German, Dutch, Swedish, Norwegian, Danish, and Finnish.
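For illustration, the sketch below applies NLTK's Porter and Snowball (English) stemmers to the same sample sentence; the outputs are produced by the library and may differ slightly in detail from the outputs listed above:

```python
# Compare Porter and Snowball stemmer outputs on the sample text.
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

text = ("such an analysis can reveal features that are not easily visible from "
        "the variations in the individual genes").split()

porter = PorterStemmer()
snowball = SnowballStemmer("english")
print(" ".join(porter.stem(w) for w in text))
print(" ".join(snowball.stem(w) for w in text))
```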
Stemmers for Indian Languages
Standard stemmers are not yet available for Hindi and other Indian
languages. The major research on Hindi stemming has been accomplished
by Ramanathan and Rao (2003) and Majumder et al. (2007).
Ramanathan and Rao (2003) based their work on the use of handcrafted
suffix lists. Majumder et al. (2007) used a cluster-based approach to find
classes of root words and their morphological variants. They used a task-
based evaluation of their approach and reported that stemming improves
recall for Indian languages.
Translation in Low-Resource Situations
• For some languages, and especially for English, online resources are
widely available.
• There are many large parallel corpora that contain translations between
English and many languages.
• But, the vast majority of the world’s languages do not have large parallel
training texts available.
Data Augmentation
• Statistical technique for dealing with insufficient training data.
• Adding new synthetic data that is generated from the current natural data.
• Backtranslation is a common data augmentation technique in machine
translation (MT) using monolingual target-language data (text written
only in the target language) to create synthetic parallel data.
• It addresses the scarcity of parallel corpora by generating synthetic
bitexts from abundant monolingual data.
• The process involves training a target-to-source MT model using
available parallel data, then using it to translate monolingual target-
language text into the source language.
• The resulting synthetic source-target pairs are added to the training data
to improve the original source-to-target MT model.
• Backtranslation includes configurable parameters like decoding methods
(greedy, beam search, sampling) and data ratio settings (e.g., upsampling
real bitext).
• It is highly effective—studies suggest it provides about two-thirds the
benefit of training on real bitext.
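A schematic sketch of this loop; the train and translate callables are hypothetical placeholders standing in for an MT toolkit, not real APIs:

```python
# Backtranslation sketch:
# 1. train a target->source model on the small real bitext,
# 2. translate monolingual target-language text into synthetic source text,
# 3. train the final source->target model on real + synthetic pairs.
def backtranslate(real_bitext, mono_target, train, translate, upsample_real=1):
    """real_bitext: list of (src, tgt) pairs; mono_target: list of tgt sentences.
    `train` and `translate` are placeholders for an MT training/decoding API."""
    t2s_model = train([(tgt, src) for src, tgt in real_bitext])       # reverse-direction model
    synthetic = [(translate(t2s_model, tgt), tgt) for tgt in mono_target]
    augmented = real_bitext * upsample_real + synthetic               # optional upsampling of real data
    return train(augmented)                                           # final source->target model
```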
Multilingual models
• The models we’ve described so far are for bilingual translation: one
source language, one target language.
• It’s also possible to build a multilingual translator. In a multilingual
translator, we train the system by giving it parallel sentences in many
different pairs of languages.
• We tell the system which language is which by adding a special token ls to
the encoder specifying the source language we're translating from, and
a special token lt to the decoder telling it the target language we'd like
to translate into.
One advantage of a multilingual model is that it can improve the translation
of lower-resourced languages by drawing on information from a similar
language in the training data.
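A minimal sketch of marking the source and target languages with special tokens before training; the token formats used here are illustrative assumptions, not a particular toolkit's convention:

```python
# Prefix each training pair with language tokens so one model can handle
# many language pairs.
def tag_pair(src_sent, tgt_sent, src_lang, tgt_lang):
    encoder_input = f"<from_{src_lang}> {src_sent}"   # tells the encoder the source language
    decoder_input = f"<to_{tgt_lang}> {tgt_sent}"     # tells the decoder the target language
    return encoder_input, decoder_input

enc, dec = tag_pair("the cat sleeps", "le chat dort", "en", "fr")
print(enc)   # <from_en> the cat sleeps
print(dec)   # <to_fr> le chat dort
```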
Sociotechnical issues
• Limited native speaker involvement: Many low-resource language
projects lack participation from native speakers in content curation,
technology development, and evaluation.
• Low data quality: many multilingual datasets are not of acceptable quality,
often containing errors and repeated content.
• Efforts to broaden coverage: Recent large-scale projects aim to support
MT across hundreds of languages, expanding beyond English-centric
training.
• Participatory design: Researchers advocate for involving native
speakers in all stages of MT development.
• Improved evaluation methods: Instead of direct evaluation, post-editing
MT outputs helps reduce bias.
3. Summarize Language Divergences and Typology.
1. Word Order Typology
Two languages that share their basic word order type often have other
similarities. For example, VO languages generally have prepositions, whereas
OV languages generally have postpositions.
VO → in an English sentence such as "He wrote a letter to a friend," the verb
wrote is followed by its object a letter and by the prepositional phrase
to a friend, in which the preposition to is followed by its argument a friend.
2. Lexical Divergences
• Lexical gap: English does not have a single word that corresponds neatly to
concepts such as filial piety (a loving child, a good son/daughter).
• Verb-framed languages (Spanish, French, Japanese)
o Verb shows the direction or path of movement.
o The manner (how something moves) is optional or added
separately.
Examples:
o Entrar = to enter
o Salir = to go out
o Subir = to go up
3. Morphological Typology
Languages differ morphologically along two main dimensions:
1. Number of Morphemes per Word:
o Isolating languages (e.g., Vietnamese, Cantonese): one morpheme per
word; words are simple and not combined.
o Polysynthetic languages (e.g., Siberian Yupik): many morphemes in one
word; one word can express a full sentence.
2. Segmentability of Morphemes:
o Agglutinative languages (e.g., Turkish): morphemes are clearly
separable; each morpheme represents one meaning or function.
o Fusional languages (e.g., Russian): morphemes blend together; one
morpheme may carry multiple meanings (e.g., case, number, gender).
4. Referential density
• Measures how often a language uses explicit pronouns.
• High referential density ("hot" languages): pronouns are used more
frequently, which makes comprehension easier for the listener (e.g., English).
• Low referential density ("cold" languages): pronouns are often omitted,
so the listener must infer more (e.g., Chinese, Japanese).
Automatic Evaluation
• The simplest and most robust metric for MT evaluation is called chrF,
which stands for character F-score.
• A good machine translation will tend to contain characters and words
that occur in a human translation of the same sentence.
• chrP: the percentage of character 1-grams, 2-grams, ..., k-grams in the
hypothesis that occur in the reference, averaged over the n-gram orders.
• chrR: the percentage of character 1-grams, 2-grams, ..., k-grams in the
reference that occur in the hypothesis, averaged over the n-gram orders.
• The metric then computes an F-score by combining chrP and chrR
using a weighting parameter β. It is common to set β = 2, thus
weighing recall twice as much as precision:
chrFβ = (1 + β²) · chrP · chrR / (β² · chrP + chrR)
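A small, self-contained sketch of this computation (whitespace handling and other details are simplified relative to standard chrF implementations):

```python
# Toy chrF: average character n-gram precision and recall for n = 1..k,
# combined into an F-score that weights recall beta^2 times as much.
from collections import Counter

def char_ngrams(text, n):
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, k=4, beta=2.0):
    precisions, recalls = [], []
    for n in range(1, k + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())                  # clipped n-gram matches
        precisions.append(overlap / max(sum(hyp.values()), 1))
        recalls.append(overlap / max(sum(ref.values()), 1))
    chr_p, chr_r = sum(precisions) / k, sum(recalls) / k
    if chr_p + chr_r == 0:
        return 0.0
    return (1 + beta**2) * chr_p * chr_r / (beta**2 * chr_p + chr_r)

print(round(chrf("the cat sat on the mat", "the cat is on the mat"), 3))
```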