
NLP QBS Module 4 & 5

The document discusses various alternative models of Information Retrieval (IR) including the Cluster Model, Fuzzy Model, WordNet, Latent Semantic Indexing (LSI), FrameNet, and research corpora. Each model has its own concepts, applications, advantages, and limitations, providing insights into how information can be efficiently retrieved and processed. Additionally, it outlines different parts of speech taggers used in NLP applications, highlighting their methodologies and performance.

Uploaded by

Shreya Prakash

MODULE-4

1. Alternative Models of Information Retrieval (IR)

i) Cluster Model
The Cluster Model is a technique in Information Retrieval (IR) that organizes
documents into clusters based on their similarity. Instead of retrieving individual
documents from an extensive dataset, the retrieval system first identifies relevant
clusters and then searches within them, making retrieval faster and more efficient.

Concepts and Functionality


Cluster-based retrieval operates under the assumption that if a document is
relevant to a query, then documents within the same cluster will also likely be
relevant.
The clustering process involves several key steps:
1. Document Representation: Each document is represented as a vector using
techniques like TF-IDF (Term Frequency-Inverse Document Frequency), Word
Embeddings, or Bag-of-Words.
2. Similarity Measurement: The system calculates document similarity using
mathematical functions such as Cosine Similarity, Euclidean Distance, or Jaccard
Similarity.
3. Clustering Algorithm: Documents are grouped based on similarity using
clustering techniques like:
- Hierarchical Clustering – Forms a tree-like hierarchy of document clusters.
- K-Means Clustering – Divides documents into a predefined number of clusters
based on their proximity.
- DBSCAN – Identifies clusters based on data density, useful for noisy datasets.
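The retrieval side of these steps can be sketched as follows. This is a minimal illustration, not a production design: it assumes the documents have already been grouped offline (the `clusters` dictionary is hypothetical), uses raw term counts in place of TF-IDF, and picks the cluster whose centroid is closest to the query by cosine similarity before ranking only that cluster's documents.

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector (a simple stand-in for TF-IDF)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(vectors):
    """Average of the document vectors in one cluster."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({t: c / len(vectors) for t, c in total.items()})

# Hypothetical, already-clustered toy collection.
clusters = {
    "deep_learning": ["transformer attention neural network",
                      "convolutional neural network image"],
    "reinforcement_learning": ["agent reward policy environment",
                               "policy gradient reward"],
}

def cluster_search(query, clusters):
    q = tf_vector(query)
    centroids = {name: centroid([tf_vector(d) for d in docs])
                 for name, docs in clusters.items()}
    # Step 1: pick the cluster whose centroid is most similar to the query.
    best = max(centroids, key=lambda name: cosine(q, centroids[name]))
    # Step 2: rank only the documents inside that cluster.
    ranked = sorted(clusters[best],
                    key=lambda d: cosine(q, tf_vector(d)), reverse=True)
    return best, ranked

best, ranked = cluster_search("transformer neural network", clusters)
```

Only the documents of the selected cluster are scored against the query, which is where the efficiency gain comes from.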

Example
Imagine a digital library containing 200,000 research papers on Artificial
Intelligence, Data Science, and Natural Language Processing. Instead of searching
through all papers, the clustering algorithm groups documents into categories like:
- Machine Learning
- Neural Networks
- Reinforcement Learning
- Deep Learning

When a user searches for "Transformer Models," the system first identifies the
cluster containing topics related to "Deep Learning" before retrieving documents
from this specific group. This reduces search time and improves relevance.
Applications
- Search Engines: Clustering can be used to organize web pages or search results by topic.
- Recommendation Systems: Similar movies or products can be grouped to support
better recommendations.
- Text Categorization: Helps automate document classification.

Advantages
- Improves Retrieval Efficiency: Cluster-based searches are faster than scanning
the entire dataset.
- Enhances Organization: Categorization helps users find relevant information.
- Scalable for Large Datasets: Works well with vast collections of documents.

Limitations
- Pre-Clustering Overhead: Requires computational power to initially group
documents.
- Cluster Accuracy: Retrieval quality depends on the clustering algorithm, the
similarity measure, and the number of clusters chosen.

ii) Fuzzy Model


The Fuzzy Model in IR is based on fuzzy logic, allowing for partial relevance
instead of strict binary classifications ("relevant" vs. "not relevant"). This model
enhances retrieval performance for vague or imprecise queries.

Concepts and Functionality


Unlike Boolean models, where a document must exactly match a query to be
retrieved, fuzzy models work by assigning degrees of membership to documents
based on how closely they match a query.
Key concepts include:
1. Fuzzy Set Theory: Uses membership functions to determine relevance scores.
2. Partial Matching: Instead of requiring an exact keyword match, documents can
be partially relevant.
3. Weighted Terms: Some words in a query hold more importance than others.
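A minimal sketch of how degrees of membership can produce a ranking is given below. It assumes the standard fuzzy-set operators (minimum for AND, maximum for OR); the `membership` table and its values are invented for illustration and would in practice come from a membership function over term weights.

```python
# Hypothetical membership degrees of each document in the fuzzy set of each term.
membership = {
    "doc1": {"laptop": 0.9, "programming": 0.8},
    "doc2": {"laptop": 0.7, "programming": 0.2},
    "doc3": {"laptop": 0.1, "programming": 0.9},
}

def fuzzy_and(doc, terms):
    """Degree to which the document matches ALL terms (fuzzy intersection)."""
    return min(membership[doc].get(t, 0.0) for t in terms)

def fuzzy_or(doc, terms):
    """Degree to which the document matches ANY term (fuzzy union)."""
    return max(membership[doc].get(t, 0.0) for t in terms)

def rank(terms, combine):
    """Score every document and sort by decreasing relevance degree."""
    scores = {doc: combine(doc, terms) for doc in membership}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank(["laptop", "programming"], fuzzy_and)
```

Every document receives a score between 0 and 1, so partially matching documents are ranked lower rather than excluded.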

Example
Imagine a user searching for “best laptops for programming” on an e-commerce
website. The fuzzy IR model assigns relevance scores to laptops based on
parameters like:
- Processor Speed
- RAM Size
- Battery Life
- User Reviews
A high-end gaming laptop might score 0.9 relevance, whereas a budget laptop
might score 0.6 relevance. Both laptops appear in the search results but are ranked
differently based on their degree of match.

Applications
- Semantic Search Engines: Handles queries with vague language.
- Recommendation Systems: Suggests products based on partial matches.
- Medical Databases: Retrieves research papers with similar diagnoses.

Advantages
- Handles Ambiguity: Can retrieve relevant documents even with unclear queries.
- Better User Experience: Improves search results quality.
- Increased Search Flexibility: Works well in real-world applications.

Limitations
- Complex Implementation: Requires defining membership functions for
relevance.
- Higher Computation Costs: More resource-intensive than Boolean models.
2. WordNet and Its Applications

WordNet is a lexical database that organizes words based on their meanings and
relationships. Developed at Princeton University, WordNet plays a crucial role in
Natural Language Processing (NLP) and Information Retrieval (IR).

Structure
WordNet classifies words into synsets—groups of synonyms that represent a
single concept. Each synset contains:
- Definition (Gloss): A dictionary-style explanation.
- Example Usage: Sentences demonstrating meaning.
- Lexical Relations: Synonymy, hypernymy, hyponymy, antonymy, meronymy,
etc.

Example
For the word "read," WordNet lists:
- Synset: {read, interpret, scan}
- Gloss: "Interpret something written or printed."
- Example: "Can you read Greek?"
Applications
1. Search Engines: Enhances relevance by understanding synonyms.
2. Text Classification: Groups documents based on meaning.
3. Machine Translation: Improves language translation quality.
4. Sentiment Analysis: Helps AI understand emotions in text.
5. Word Sense Disambiguation: Distinguishes meanings in different contexts.
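
As a rough illustration of the search-engine use case, the sketch below expands a query with synonyms from the same synset. The `SYNSETS` table is a hand-written, hypothetical stand-in; a real system would look synsets up in WordNet itself rather than hard-code them.

```python
# Hypothetical, hand-written synset table standing in for a WordNet lookup.
SYNSETS = [
    {"read", "interpret", "scan"},
    {"car", "automobile", "auto"},
]

def expand_query(terms):
    """Add the other members of each matching synset after the original term."""
    expanded = []
    for term in terms:
        expanded.append(term)
        for synset in SYNSETS:
            if term in synset:
                expanded.extend(sorted(synset - {term}))
    return expanded

expanded = expand_query(["read", "greek"])
```

Expanding without first disambiguating the word sense can add unrelated terms, which is one reason word sense disambiguation appears in the list above.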
3. Latent Semantic Indexing (LSI) Model

Latent Semantic Indexing (LSI) is a mathematical approach in IR that uncovers
hidden relationships between words in documents. It addresses challenges posed
by synonymy (words with similar meanings) and polysemy (words with multiple
meanings).

Concepts and Functionality


1. Term-Document Matrix: Represents documents numerically.
2. Singular Value Decomposition (SVD): Reduces dimensionality and identifies
semantic relationships.
3. Concept-Based Retrieval: Queries are matched to underlying concepts instead
of keywords.
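
The three steps can be sketched with NumPy on a toy collection. This is only an illustration under simple assumptions: raw term counts, k = 2 latent concepts, and the usual "folding-in" projection for queries. The documents and vocabulary are invented.

```python
import numpy as np

terms = ["car", "automobile", "engine", "flower", "garden"]
docs = ["car engine", "automobile engine", "flower garden"]

# 1. Term-document matrix: rows are terms, columns are documents (raw counts).
A = np.array([[doc.split().count(t) for doc in docs] for t in terms], dtype=float)

# 2. SVD, keeping only the k largest singular values (k chosen arbitrarily here).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k = U[:, :k], s[:k]
docs_k = Vt[:k, :].T  # one k-dimensional concept vector per document

def fold_in(query):
    """Project a query into the k-dimensional concept space."""
    q = np.array([query.split().count(t) for t in terms], dtype=float)
    return (q @ U_k) / s_k

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# 3. Concept-based retrieval: compare the query with documents in concept space.
q_k = fold_in("car")
scores = [cosine(q_k, d) for d in docs_k]
```

The query "car" shares no keyword with the second document ("automobile engine"), yet the two end up close in concept space because both co-occur with "engine"; the gardening document stays unrelated.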

Example
If a user searches for “climate change”, LSI recognizes underlying concepts and
retrieves related documents discussing global warming, CO2 emissions, and
deforestation.

Applications
1. Search Engines: Enhances relevance in information retrieval.
2. Document Clustering: Improves organization of research papers.
3. Recommendation Systems: Suggests articles based on conceptual similarity.

Advantages
- Handles Synonymy and, to a Lesser Degree, Polysemy: Improves topic-based retrieval.
- Enhances Search Accuracy: Retrieves documents even if exact keywords are
missing.
- Improves Document Classification: Organizes texts based on meaning.

Limitations
- Computationally Intensive: Requires significant processing power.
- May Retrieve Irrelevant Documents: Needs fine-tuning to enhance precision.
4. Explain FrameNet with its applications.

FrameNet is a large database of semantically annotated English sentences. It is
based on the principles of frame semantics. It defines a tagset of semantic roles
called frame elements.

• Sentences from the British National Corpus are tagged with these frame
elements.
• Each word evokes a particular situation with particular participants.
FrameNet aims at capturing these situations through case-frame
representation of words (verbs, adjectives, and nouns).
• The word that invokes a frame is called the target word or predicate, and the
participant entities are defined using semantic roles, called frame elements.
• The FrameNet ontology is a semantic level representation of predicate
argument structure.

Each frame contains a main lexical item as a predicate and associated frame-
specific semantic roles, such as AUTHORITIES, TIME, SUSPECT (in the
ARREST frame), called frame elements.

• In the example sentence, “[Authorities the police] nabbed [Suspect the
snatcher],” the verb “nab” is the target word and belongs to the ARREST
frame.
• In the COMMUNICATION frame, semantic roles include ADDRESSEE,
COMMUNICATOR, TOPIC, and MEDIUM.
• A JUDGEMENT frame may include the roles JUDGE, EVALUEE, and
REASON.
• The STATEMENT frame may inherit from the COMMUNICATION
frame. It contains roles such as SPEAKER, ADDRESSEE, and MESSAGE.

Example sentences include:

[Judge She] blames [Evaluee the police] [Reason for failing to provide enough
protection].
[Speaker She] told [Addressee me] [Message "I'll return by 7:00 pm today"].
Applications:

1. Gildea and Jurafsky (2002) and Kwon et al. (2004) used FrameNet data for
automatic semantic parsing.
2. It is useful for information extraction through shallow semantic role
labeling.
3. FrameNet also helps in thematic role identification. For example, it helps
identify that the theme role played by “match” is the same in the sentences:
“The umpire stopped the match.”
“The match stopped due to bad weather.”
In the first, “match” is the object; in the second, it is the subject.
4. FrameNet supports question-answering systems by defining frames such as
TRANSFER with roles like SENDER, RECIPIENT, and GOODS.
For example: “Khushbu received a packet from the examination cell.”
This enables answering questions like “Who sent the packet to Khushbu?”
5. Other applications of FrameNet include Information Retrieval (IR),
machine translation, text summarization, and word sense disambiguation.
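
The question-answering use in point 4 can be sketched as a lookup over a frame annotation. The `FrameAnnotation` class and the `answer` helper are hypothetical names introduced for illustration, and the annotation is written by hand rather than produced by a semantic role labeler.

```python
from dataclasses import dataclass

@dataclass
class FrameAnnotation:
    frame: str      # name of the evoked frame
    target: str     # the word that evokes the frame
    elements: dict  # frame element name -> text span

# Hand-annotated version of the TRANSFER example sentence.
annotation = FrameAnnotation(
    frame="TRANSFER",
    target="received",
    elements={
        "RECIPIENT": "Khushbu",
        "GOODS": "a packet",
        "SENDER": "the examination cell",
    },
)

def answer(annotation, frame, role):
    """Return the span filling `role` if the sentence evokes `frame`."""
    if annotation.frame != frame:
        return None
    return annotation.elements.get(role)

# "Who sent the packet to Khushbu?" asks for the SENDER of a TRANSFER frame.
sender = answer(annotation, "TRANSFER", "SENDER")
```

Because the roles are tied to the frame rather than to grammatical position, the same lookup works whether the sentence uses "received" or "sent".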

In the Communication frame, the core elements are:


Addressee [Add]: Receiver of message from the Communicator.
Communicator [Com]: The person conveying (written or spoken) a message to
another person.
Message [Msg]: A proposition or set of propositions that the Communicator wants
the Addressee to believe.
Topic [Top]: The entity that the proposition(s) are about.
Non-core elements include:
Amount_of_information [Amo]: The amount of information exchanged when
communication occurs.
Depictive [Dep-Act]: The Depictive describes the state of the Communicator.
Duration: The length of time during which the communication takes place.
Manner [Manr]: The manner in which the Communicator communicates.
Means [Mns]: The means by which the Communicator communicates.
Medium [Medium]: The physical or abstract setting in which the Message is
conveyed.
Time: The time at which the communication takes place.
Sample predicates include: communicate, indicate, signal, and speech.
5. Summarize on Research corpora with illustration.

A wide variety of NLP research corpora have been developed for a variety of
research purposes. For example, for the Information Retrieval (IR) task, test
collections like the ones shown below have been used:

IR Test Collection

Test collections such as:

LISA
CACM
CISI
MEDLINE
CRANFIELD
TIME
ADI
These are available at:
http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/

• In IR research, especially in learning to rank, benchmark datasets are
commonly used. Example: LETOR (LEarning TO Rank).
• LETOR is a benchmark dataset released by Microsoft Research Asia for research on
learning to rank.
• It includes standard features, relevance judgments, data partitioning,
evaluation tools, and several baseline algorithms.
• LETOR is prepared from the .Gov web page collection, used in TREC
2003 and TREC 2004.
Summarization Data

Summarization data is used in evaluating summaries generated by machines. The
most common approach is to use gold summaries (reference summaries created
by humans).

• The DUC (Document Understanding Conference) datasets are widely used.
• The task setup generally includes:
o A set of documents (input)
o A gold summary (expected output)
• This setup allows comparison between system-generated summaries and the
human-created reference summaries.

Document:

> "Hurricane Gilbert — A monster hurricane was on a collision course with the
Caribbean... (long article)"
Summary (Gold Summary):
> "Hurricane Gilbert grew into the most powerful storm ever recorded in the
Western Hemisphere... Gilbert’s path moved west toward Mexico."
The system’s output summary is compared with this gold summary to measure
performance.
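
One simple way to make this comparison concrete is to count overlapping words. The sketch below computes unigram precision and recall against the gold summary, in the spirit of ROUGE-1; this particular measure is an assumption made for illustration, and the source does not say which metric is used.

```python
from collections import Counter

def unigram_overlap(system, gold):
    """Return (precision, recall) of system-summary words against the gold summary."""
    sys_counts = Counter(system.lower().split())
    gold_counts = Counter(gold.lower().split())
    if not sys_counts or not gold_counts:
        return 0.0, 0.0
    # Each gold word can be matched at most as often as it occurs in the system output.
    overlap = sum(min(count, sys_counts[word]) for word, count in gold_counts.items())
    precision = overlap / sum(sys_counts.values())
    recall = overlap / sum(gold_counts.values())
    return precision, recall

precision, recall = unigram_overlap(
    "hurricane gilbert moved toward mexico",
    "gilbert moved west toward mexico",
)
```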
Word Sense Disambiguation
• For Word Sense Disambiguation (WSD), the SEMCOR corpus is used.
SEMCOR is a corpus where every content word is tagged with a WordNet
sense.
• Another important initiative is the Open Mind Word Expert project, which provides a
large-scale, sense-tagged dataset via crowdsourcing.
• These datasets are used to train and evaluate systems on selecting the
correct word sense given a context.

Asian Language Corpora


• For Asian languages, corpora such as EMILLE (Enabling Minority
Language Engineering) are available.
• EMILLE is a corpus of Indic languages such as Hindi, Bengali, Tamil,
Urdu, etc.
• Developed at Lancaster University in collaboration with the Central
Institute of Indian Languages (CIIL), Mysore.
• It includes:
o Monolingual corpora (written and spoken)
o Parallel corpora (English–Indic language)
o Annotated corpora (morphological and syntactic information)
• Sources of data:
o Government leaflets
o BBC radio broadcasts
o Transcribed conversations
• EMILLE is helpful in machine translation and linguistic research for under-
resourced languages.

6. Outline different Parts of speech Taggers.

PART-OF-SPEECH TAGGER
• Part-of-speech tagging is used at an early stage of text processing in
many NLP applications such as speech synthesis, machine translation,
IR, and information extraction.
• In IR, part-of-speech tagging can be used in indexing (for identifying
useful tokens like nouns), extracting phrases and for disambiguating
word senses.
• The rest of this section presents a number of part-of-speech taggers that
are already in place.
Stanford Log-linear Part-of-Speech (POS) Tagger
• This POS Tagger is based on maximum entropy Markov models. The
key features of the tagger are as follows:
(i) It makes explicit use of both the preceding and following tag
contexts via a dependency network representation.
(ii) It uses a broad range of lexical features.
(iii) It utilizes priors in conditional log-linear models.

The reported accuracy of this tagger on the Penn Treebank WSJ is 97.24%, which
amounts to an error reduction of 4.4% on the best previous single automatically
learned tagging result (Toutanova et al. 2003).

A Part-of-Speech Tagger for English

• This tagger uses a bi-directional inference algorithm for part-of-speech
tagging. It is based on maximum entropy Markov models (MEMM).
• The algorithm can enumerate all possible decomposition structures and
find the highest probability sequence together with the corresponding
decomposition structure in polynomial time.
• Experimental results show that the proposed bi-directional inference
methods consistently outperform unidirectional inference methods, and
that bi-directional MEMMs give performance comparable to that achieved
by state-of-the-art learning algorithms, including kernel support vector
machines (Tsuruoka and Tsujii 2005).

TnT tagger

• Trigrams’n’Tags or TnT (Brants 2000) is an efficient statistical part-of-speech
tagger.
• This tagger is based on hidden Markov models (HMM) and uses some
optimization techniques for smoothing and handling unknown words.
• It performs at least as well as other current approaches, including the
maximum entropy framework. Table 12.1 shows the tagged text of
document #93 of the CACM collection.

Word        Tag    Word            Tag
A           DT     Simple          JJ
Technique   NN     algebraic       JJ
Is          VBZ    formulas        NNS
Shown       VBN    into            IN
For         IN     a               DT
Enabling    VBG    three           CD
The         DT     address         NN
Computer    NN     Computer code
To          TO
Translate   VB
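
To illustrate the HMM idea behind taggers such as TnT, here is a deliberately simplified sketch: a bigram (first-order) HMM decoded with the Viterbi algorithm, with hand-set toy probabilities and a constant floor for unseen events. TnT itself uses trigrams together with its own smoothing and unknown-word handling, none of which is reproduced here.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Most probable tag sequence under a bigram HMM (log-space Viterbi)."""
    def lp(table, key):
        return math.log(table.get(key, floor))

    # best[i][t] = (log-prob of the best path ending in tag t at word i, previous tag)
    best = [{t: (lp(start_p, t) + lp(emit_p, (t, words[0])), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            prev, score = max(
                ((p, best[-1][p][0] + lp(trans_p, (p, t))) for p in tags),
                key=lambda item: item[1],
            )
            column[t] = (score + lp(emit_p, (t, word)), prev)
        best.append(column)

    # Trace back from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))

# Toy parameters, invented for illustration (not estimated from any corpus).
tags = ["DT", "NN", "VBZ"]
start_p = {"DT": 0.8, "NN": 0.1, "VBZ": 0.1}
trans_p = {("DT", "NN"): 0.9, ("NN", "VBZ"): 0.6, ("NN", "NN"): 0.3}
emit_p = {("DT", "the"): 0.9, ("NN", "computer"): 0.8, ("VBZ", "is"): 0.9}

tagged = viterbi(["the", "computer", "is"], tags, start_p, trans_p, emit_p)
```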

Brill Tagger
Brill (1992) described a trainable rule-based tagger that obtained performance
comparable to that of stochastic taggers.
It uses transformation-based learning to automatically induce rules. A number
of extensions to this rule-based tagger have been proposed by Brill (1994). He
describes a method for exploring lexical relations in tagging that stochastic
taggers are currently unable to express.

It implements a rule-based approach to tagging unknown words. It
demonstrates how the tagger can be extended into a k-best tagger, where
multiple tags can be assigned to words in some cases of uncertainty.
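
The way a transformation-based tagger applies its rules can be sketched as follows. The rule learning step is omitted; the lexicon and the single rule are hypothetical, and the rule format (from-tag, to-tag, required previous tag) is just one simple template chosen for illustration.

```python
def initial_tags(words, lexicon, default="NN"):
    """Start by giving every word its most frequent tag (unknown words get `default`)."""
    return [lexicon.get(w.lower(), default) for w in words]

def apply_rules(tags, rules):
    """Apply transformation rules in order; each rewrites a tag based on the previous tag."""
    tags = list(tags)
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags

# Hypothetical most-frequent-tag lexicon and one learned-style rule.
lexicon = {"the": "DT", "to": "TO", "race": "NN", "horse": "NN"}
rules = [("NN", "VB", "TO")]  # change NN to VB when the previous tag is TO

tagged = apply_rules(initial_tags(["to", "race"], lexicon), rules)
```

Here "race" starts as NN, its most frequent tag, and is corrected to VB only in the context "to race"; in "the race" it stays NN.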

CLAWS Part-of-Speech Tagger for English


Constituent likelihood automatic word-tagging system (CLAWS) is one of the
earliest probabilistic taggers for English.
It was developed at the University of Lancaster
(http://ucrel.lancs.ac.uk/claws/). The latest version of the tagger, CLAWS4,
can be considered a hybrid tagger as it involves both probabilistic and rule-
based elements.

It has been designed so that it can be easily adapted to different types of text in
different input formats. CLAWS has achieved 96–97% accuracy. The precise
degree of accuracy varies according to the type of text. For more information
on the CLAWS tagger, see Garside (1987), Leech, Garside, and Bryant (1994),
Garside (1996), and Garside and Smith (1997).

Tree-Tagger
Tree-Tagger (Schmid 1994) is a probabilistic tagging method. It avoids
problems faced by the Markov model methods when estimating transition
probabilities from sparse data, by using a decision tree to estimate transition
probabilities.

The decision tree automatically detects the appropriate size of the context to be
used in estimation. The reported accuracy for TreeTagger is above 96% on the
Penn-Treebank WSJ corpus.

ACOPOST: A Collection of POS Taggers


ACOPOST is a set of freely available POS taggers. The taggers in the set are
based on different frameworks. The programs are written in C. ACOPOST
currently consists of the following four taggers:

Maximum Entropy Tagger (MET)


This tagger is based on a framework suggested by Ratnaparkhi (1997). It uses
an iterative procedure to successively improve parameters for a set of features
that help to distinguish between relevant contexts.

Trigram Tagger (T3)


This tagger is based on HMM. The states in the model are tag pairs that emit
words. The technique has been suggested by Rabiner (1990) and the
implementation is influenced by Brants (2000).

Error-driven Transformation-based Tagger (TBT)


This tagger is based on the transformation-based tagging approach proposed by
Brill (1993). It uses annotated corpora to learn transformation rules, which are
then used to change the assigned tag using contextual information.

Example-based Tagger (ET)


The underlying assumption of example-based models (also called memory-based,
instance-based or distance-based models) is that cognitive behaviour
can be achieved by looking at past experiences that match the current problem,
instead of learning and applying abstract rules.

POS Tagger for Indian Languages


The automatic text processing of Hindi and other Indian languages is
heavily constrained by the lack of basic tools and large annotated corpora.
Research groups are now focusing on removing these bottlenecks.
The work on the development of tools, techniques, and corpora is going on at
several places such as CDAC, IIT Bombay, IIIT Hyderabad, University of
Hyderabad, CIIL Mysore, and University of Lancaster. IIT Bombay is involved in
the development of morphology analysers and part-of-speech taggers for Hindi
and Marathi. Both these languages have rich morphological structures. Their
approach is based on bootstrapping on a small corpus tagged by a rule-based
tagger and then applying statistical techniques to train a machine. More
information can be found at http://ltrc.iiit.net and www.cse.iitb.ac.in. Work on
Urdu part-of-speech taggers has been reported by Hardie (2003) and Baker et al.
(2004).

7. Explain Stemmers with its applications.

STEMMERS

• Stemming, often called conflation, is the process of reducing inflected
(or sometimes derived) words to their base or root form.
• The stem need not be identical to the morphological base of the word; it
is usually sufficient that related words map to the same stem, even if this
stem is not in itself a valid root.
• Stemming is useful in search engines for query expansion or indexing
and other NLP problems. Stemming programs are commonly referred to
as stemmers. The most common algorithm for stemming English is
Porter’s algorithm (Porter 1980).
• Other existing stemmers include Lovins' stemmer (Lovins 1968) and a
more recent one called the Paice/Husk stemmer (Paice 1990).

Input Text:

Such an analysis can reveal features that are not easily visible from the
variations in the individual genes and can lead to a picture of expression
that is more biologically transparent and accessible to interpretation.

Output:

• Lovins stemmer:
such an analys can rev featur that are not as eas vis from the vari in the
individu gen and can lead to a pictur of express that is mor biolog transpar
and access to interpret.
• Porter’s stemmer:
such an analysi can reveal featur that are not easili visibl from the variat in
the individu gene and can lead to a pictur of express that is more biolog
transpar and access to interpret.
• Paice stemmer:
Such an analys can rev feat that are not easy vis from the vary in the individ
gen and can lead to a pict of express that is mor biolog transp and access to
interpret.

Stemmers for European Languages

There are many stemmers available for English and other languages.
Snowball presents stemmers for English, Russian, and a number of other
European languages, including French, Spanish, Portuguese, Hungarian,
Italian, German, Dutch, Swedish, Norwegian, Danish, and Finnish.
Stemmers for Indian Languages
Standard stemmers are not yet available for Hindi and other Indian
languages. The major research on Hindi stemming has been accomplished
by Ramanathan and Rao (2003) and Majumder et al. (2007).
Ramanathan and Rao (2003) based their work on the use of handcrafted
suffix lists. Majumder et al. (2007) used a cluster-based approach to find
classes of root words and their morphological variants. They used a task-
based evaluation of their approach and reported that stemming improves
recall for Indian languages.

Stemming Applications

• Stemmers are common elements in search and retrieval systems such as
Web search engines.
• Stemming reduces the variants of a word to the same stem. This reduces the
size of the index and also helps retrieve documents that contain variants
of a query term.
For example, a user issuing a query for documents on ‘astronauts’
would like documents on ‘astronaut’ as well. Stemming permits this
by reducing both versions of the word to the same stem. However,
the effectiveness of stemming for English query systems is not too
great, and in some cases may even reduce precision.
• Text summarization and text categorization also involve term frequency
analysis to find features. In this analysis, stemming is used to transform
various morphological forms of words into their stems.
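
The effect on indexing and retrieval can be sketched with a deliberately naive suffix-stripping stemmer. This is not Porter's algorithm: the suffix list and minimum stem length are assumptions chosen for illustration, and the function names are hypothetical.

```python
from collections import defaultdict

SUFFIXES = ("ing", "ed", "es", "s")  # illustrative list, far cruder than Porter's rules

def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs):
    """Inverted index keyed by stem, so word variants share one entry."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[naive_stem(token)].add(doc_id)
    return index

def search(index, query_term):
    """Stem the query term the same way before looking it up."""
    return index.get(naive_stem(query_term), set())

docs = {1: "astronauts train daily", 2: "an astronaut landed", 3: "training schedules"}
index = build_index(docs)
hits = search(index, "astronauts")
```

A query for "astronauts" now also finds the document that only contains "astronaut", and both forms occupy a single index entry.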
MODULE-5
1. Explain Machine Translation using Encoder and Decoder.

• Translate each sentence independently:
o source language → target language.
o The green witch arrived (English) → Llegó la bruja verde (Spanish).
• MT uses supervised machine learning:
o System is given a large set of parallel sentences.
o Learns to map source sentences into target sentences.
o Split the sentences into a sequence of subword tokens.
o The systems are then trained to maximize the probability of the
sequence of tokens in the target language $y_1, \ldots, y_m$ given the sequence
of tokens in the source language $x_1, \ldots, x_n$:
$P(y_1, \ldots, y_m \mid x_1, \ldots, x_n)$
• Rather than mapping the input tokens directly to output tokens, MT systems
use the encoder-decoder architecture.
o The encoder takes the input words $x = [x_1, \ldots, x_n]$ and produces an
intermediate context h.
o The decoder takes h and, word by word, generates the output y.
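
The word-by-word generation can be written out as a factorization of the training objective. The block below is a sketch under the usual assumption that the context h summarizes the whole source sentence and that each output token is conditioned on h and on the tokens generated so far; the function names encoder and decoder are placeholders rather than a specific model.

```latex
\begin{align*}
h &= \mathrm{encoder}(x_1, \ldots, x_n) \\
P(y_1, \ldots, y_m \mid x_1, \ldots, x_n)
  &= \prod_{t=1}^{m} P(y_t \mid y_1, \ldots, y_{t-1}, h) \\
y_t &\sim \mathrm{decoder}(h, y_1, \ldots, y_{t-1})
\end{align*}
```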

2. Outline Translating in Low resource situations.

• For some languages, and especially for English, online resources are
widely available.
• There are many large parallel corpora that contain translations between
English and many languages.
• But, the vast majority of the world’s languages do not have large parallel
training texts available.

Two commonly used approaches: data augmentation (backtranslation) and
multilingual models.

Data Augmentation
• Statistical technique for dealing with insufficient training data.
• Adding new synthetic data that is generated from the current natural data.
• Backtranslation is a common data augmentation technique in machine
translation (MT) using monolingual target-language data (text written
only in the target language) to create synthetic parallel data.
• It addresses the scarcity of parallel corpora by generating synthetic
bitexts from abundant monolingual data.
• The process involves training a target-to-source MT model using
available parallel data, then using it to translate monolingual target-
language text into the source language.
• The resulting synthetic source-target pairs are added to the training data
to improve the original source-to-target MT model.
• Backtranslation includes configurable parameters like decoding methods
(greedy, beam search, sampling) and data ratio settings (e.g., upsampling
real bitext).
• It is highly effective—studies suggest it provides about two-thirds the
benefit of training on real bitext.
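
The backtranslation loop can be sketched end to end with a toy stand-in for the MT model. Here "training" just learns a word-for-word table from position-aligned sentence pairs, which is an assumption made purely to keep the example runnable; the function names are hypothetical and a real system would train a neural target-to-source model instead.

```python
def train_dictionary_model(pairs):
    """Toy stand-in for MT training: learn a word-for-word table from aligned pairs."""
    table = {}
    for src, tgt in pairs:
        for s, t in zip(src.split(), tgt.split()):
            table.setdefault(s, t)
    return table

def translate(table, sentence):
    """Translate word by word, copying words the table does not know."""
    return " ".join(table.get(w, w) for w in sentence.split())

def backtranslate(bitext, monolingual_target):
    # 1. Train a target-to-source model on the available parallel data.
    reverse_model = train_dictionary_model([(tgt, src) for src, tgt in bitext])
    # 2. Translate monolingual target-language text into the source language.
    synthetic = [(translate(reverse_model, t), t) for t in monolingual_target]
    # 3. Add the synthetic (source, target) pairs to the real bitext.
    return bitext + synthetic

bitext = [("the witch arrived", "la bruja llegó"), ("the house", "la casa")]
monolingual_target = ["la casa llegó"]
augmented = backtranslate(bitext, monolingual_target)
```

The augmented list is what the original source-to-target model would then be retrained on; note that the target side of every synthetic pair is genuine text, while only the source side is machine-generated.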

Multilingual models
• The models we’ve described so far are for bilingual translation: one
source language, one target language.
• It’s also possible to build a multilingual translator. In a multilingual
translator, we train the system by giving it parallel sentences in many
different pairs of languages.
• We tell the system which language is which by adding a special token $l_s$
to the encoder specifying the source language we’re translating from, and
a special token $l_t$ to the decoder telling it the target language we’d like
to translate into.
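
A minimal sketch of how training examples might be tagged is shown below. The `<en>`-style token format and the function name are assumptions made for illustration; actual systems define their own special tokens.

```python
def tag_example(source_tokens, target_tokens, src_lang, tgt_lang):
    """Prepend language tokens to the encoder and decoder inputs."""
    encoder_input = [f"<{src_lang}>"] + source_tokens
    decoder_input = [f"<{tgt_lang}>"] + target_tokens
    return encoder_input, decoder_input

encoder_input, decoder_input = tag_example(
    ["the", "witch"], ["la", "bruja"], src_lang="en", tgt_lang="es"
)
```

At inference time, changing only the decoder's language token asks the same model for a different target language.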

One advantage of a multilingual model is that it can improve the translation
of lower-resourced languages by drawing on information from a similar
language in the training data.

Sociotechnical issues
• Limited native speaker involvement: Many low-resource language
projects lack participation from native speakers in content curation,
technology development, and evaluation.
• Low data quality: Many multilingual datasets fall short of acceptable quality,
often containing errors and repeated content.
• Efforts to broaden coverage: Recent large-scale projects aim to support
MT across hundreds of languages, expanding beyond English-centric
training.
• Participatory design: Researchers advocate for involving native
speakers in all stages of MT development.
• Improved evaluation method: Instead of direct evaluation, post-editing
MT outputs helps reduce bias.
3. Summarize on Language Divergences and Typology.

Different types of Language Divergences and Typology:


1. Word Order Typology
2. Lexical Divergences
3. Morphological Typology
4. Referential density

1. Word Order Typology


• Languages differ in the basic word order of verbs, subjects, and
objects in simple declarative clauses.
• German, French, English, and Mandarin are all SVO (Subject-Verb-Object)
languages.
• Hindi and Japanese, by contrast, are SOV languages: the verb
tends to come at the end of basic clauses.
• Irish and Arabic are VSO languages.
• Arabic, with its VSO order, places the verb before the object and
uses prepositions.
• Other kinds of ordering preferences vary idiosyncratically from
language to language.

Two languages that share their basic word order type often have other
similarities. For example, VO languages generally have prepositions, whereas
OV languages generally have postpositions.

For example, in a VO language such as English, the verb wrote is followed by its
object a letter and the prepositional phrase to a friend, in which the preposition
to is followed by its argument a friend.

2. Lexical Divergences

Translating individual words from one language to another is not always
straightforward, because the appropriate word can vary depending on the context.

• Bass: the English source word can appear in Spanish as the fish lubina or
as the musical instrument bajo.
• A wall – English; Wand (walls inside a building), and Mauer (walls
outside a building) – in German.
• Brother - English uses the word brother for any male sibling, Chinese
and many other languages have distinct words for older brother and
younger brother (Mandarin gege and didi, respectively)
• One language places more grammatical constraints on word choice
than another.

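o English marks nouns for whether they are singular or plural;
Mandarin doesn’t. French and Spanish, in turn, mark grammatical gender
on adjectives, which English doesn’t.

• Lexical gap: one language may have no word or phrase that expresses the
exact meaning of a word in another. For example, English has no word that
corresponds neatly to Mandarin xiào or Japanese oyakōkō, and has to make do
with phrases like filial piety, loving child, or good son/daughter.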
• Verb-framed languages (Spanish, French, Japanese)
o Verb shows the direction or path of movement.
o The manner (how something moves) is optional or added
separately.
Examples:
o Entrar = to enter
o Salir = to go out
o Subir = to go up

• Satellite-framed languages (English, German, Russian)


o The verb shows the manner of movement.
o The direction or path is shown in a satellite (like a preposition or
particle).
Examples:
o Run out = run (manner) + out (path)
o Jump over = jump (manner) + over (path)
o Slide down = slide (manner) + down (path)

3. Morphological Typology
Languages differ morphologically along two main dimensions:
1. Number of Morphemes per Word:
o Isolating languages (e.g., Vietnamese, Cantonese):
- One morpheme per word.
- Words are simple and not combined.
o Polysynthetic languages (e.g., Siberian Yupik):
- Many morphemes in one word.
- One word can express a full sentence.
2. Segmentability of Morphemes:
o Agglutinative languages (e.g., Turkish):
- Morphemes are clearly separable.
- Each morpheme represents one meaning or function.
o Fusional languages (e.g., Russian):
- Morphemes blend together.
- One morpheme may carry multiple meanings (e.g., case, number, gender).
4. Referential density
• Measures how often a language uses explicit pronouns.
• High referential density (hot languages):
o Use pronouns more frequently.
o Easier for the listener (e.g., English).
• Low referential density (cold languages):
o Omit pronouns often.
o Listener must infer more (e.g., Chinese, Japanese).

Hot vs. Cold Languages:


• Hot languages = more explicit, easier for the listener.
• Cold languages = less explicit, require more inference.
4. Explain MT Evaluation.

Translations are evaluated along two dimensions:


I. Adequacy: how well the translation captures the exact meaning of the
source sentence. Sometimes called faithfulness or fidelity.
II. Fluency: how fluent the translation is in the target language (is it
grammatical, clear, readable, natural).

Using Human Raters to Evaluate MT


• Human evaluation is the most accurate method for assessing machine
translation (MT) quality, focusing on two main dimensions: fluency
and adequacy.
• Raters, often crowd workers, assign scores on a scale (e.g., 1–5 or
1–100) for each.
• Bilingual raters compare source and translation directly for adequacy,
while monolingual raters compare MT output with a human reference.
• Proper training is crucial, as raters often struggle to distinguish
fluency from adequacy.
• To ensure consistency, outliers are removed and ratings are
normalized.
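
The normalization step can be illustrated with a small sketch. The notes do not say which normalization is used; per-rater z-scoring is assumed here as one common choice, and the function name and rater labels are hypothetical.

```python
from statistics import mean, pstdev

def z_normalize(ratings_by_rater):
    """Rescale each rater's raw scores to mean 0 and standard deviation 1.

    Assumption: per-rater z-scoring; the source only says ratings are "normalized".
    """
    normalized = {}
    for rater, scores in ratings_by_rater.items():
        mu = mean(scores)
        sigma = pstdev(scores)
        # A rater who gives every item the same score provides no ranking signal.
        normalized[rater] = [0.0 if sigma == 0 else (s - mu) / sigma for s in scores]
    return normalized

# Two hypothetical raters who rank the items identically but use the scale differently.
raw = {"rater_a": [80, 90, 100], "rater_b": [20, 30, 40]}
scaled = z_normalize(raw)
print(scaled)
```

After normalization the harsh rater and the generous rater produce the same scores, so their judgments can be averaged fairly.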

Automatic Evaluation
• The simplest and most robust metric for MT evaluation is called chrF,
which stands for character F-score.
• A good machine translation will tend to contain characters and words
that occur in a human translation of the same sentence.
• chrP: percentage of character 1-grams, 2-grams, ..., k-grams in the
hypothesis that occur in the reference, averaged.
• chrR: percentage of character 1-grams, 2-grams, ..., k-grams in the
reference that occur in the hypothesis, averaged.
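• The metric then computes an F-score by combining chrP and chrR
using a weighting parameter β. It is common to set β = 2, thus
weighing recall twice as much as precision:
chrFβ = (1 + β²) · chrP · chrR / (β² · chrP + chrR)
With β = 2 this gives chrF2 = 5 · chrP · chrR / (4 · chrP + chrR).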
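
The chrF computation described in the bullets above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: stripping spaces before counting and using k = 6 as the maximum n-gram length are assumptions, and the function names are hypothetical.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count the character n-grams of a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Character F-score: averaged n-gram precision and recall combined with weight beta."""
    # Assumption: spaces are removed before counting character n-grams.
    hypothesis = hypothesis.replace(" ", "")
    reference = reference.replace(" ", "")
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is shorter than n characters
        overlap = sum((hyp & ref).values())  # matches, clipped by reference counts
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    chr_p = sum(precisions) / len(precisions)
    chr_r = sum(recalls) / len(recalls)
    if chr_p + chr_r == 0:
        return 0.0
    return (1 + beta ** 2) * chr_p * chr_r / (beta ** 2 * chr_p + chr_r)

print(chrf("witness for the past", "witness of the past"))
```

A hypothesis identical to the reference scores 1.0, and one sharing no characters with it scores 0.0.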

Alternative overlap metric: BLEU


• BLEU is a traditional metric used to evaluate machine translation
quality.
• It is precision-based, not based on recall.
• It uses n-gram precision, meaning it checks how many n-grams (1 to 4
words in a row) in the translation match the reference.
• It calculates a geometric mean of unigram to 4-gram precision scores.
• BLEU includes a brevity penalty to avoid favoring translations that
are too short.
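
The BLEU ingredients listed above (n-gram precision, geometric mean, brevity penalty) can be sketched as follows. This is a simplified sentence-level version with a single reference and no smoothing; BLEU is normally computed over a whole test corpus, and the function names here are hypothetical.

```python
import math
from collections import Counter

def word_ngrams(tokens, n):
    """Count the word n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Simplified single-reference, sentence-level BLEU without smoothing."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = word_ngrams(hyp, n)
        ref_counts = word_ngrams(ref, n)
        matched = sum((hyp_counts & ref_counts).values())  # clipped matches
        total = sum(hyp_counts.values())
        if matched == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision zeroes the score
        log_precision_sum += math.log(matched / total)
    geometric_mean = math.exp(log_precision_sum / max_n)
    # Brevity penalty: hypotheses shorter than the reference are scaled down.
    if len(hyp) > len(ref):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(ref) / len(hyp))
    return penalty * geometric_mean

print(bleu("the cat sat on", "the cat sat on the mat"))
```

In the usage line every n-gram of the short hypothesis appears in the reference, so precision is perfect and the score is lowered only by the brevity penalty.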

Statistical Significance Testing for MT evals


• chrF and BLEU are overlap-based metrics used to compare machine
translation (MT) systems.
• They help answer: Did the new translation system improve over the
old one?
• To check if a difference in scores is statistically significant, we use
tests like the paired bootstrap test or randomization test.
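
A paired bootstrap test can be sketched as below. The sketch assumes each system already has one score per test sentence and compares summed scores on each resampled test set; for corpus-level metrics such as BLEU the metric would instead be recomputed on every resample. The function name and parameters are hypothetical.

```python
import random

def paired_bootstrap(scores_a, scores_b, num_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A outscores system B."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(num_samples):
        # Resample sentence indices with replacement, keeping A and B paired.
        sample = [rng.randrange(n) for _ in range(n)]
        total_a = sum(scores_a[i] for i in sample)
        total_b = sum(scores_b[i] for i in sample)
        if total_a > total_b:
            wins += 1
    return wins / num_samples

# Hypothetical per-sentence scores for a new system (A) and an old system (B).
new_system = [0.62, 0.71, 0.55, 0.80, 0.67, 0.59]
old_system = [0.60, 0.65, 0.57, 0.72, 0.61, 0.58]
print(paired_bootstrap(new_system, old_system))
```

If A wins on, say, at least 95% of the resampled sets, the improvement is usually treated as significant at the corresponding level.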

5. Interpret in detail Encoder and Decoder Model.

The encoder-decoder transformer architecture for machine translation.


• Fig. shows the intuition of the architecture at a high level.
• The encoder-decoder architecture is made up of two transformers: an
encoder, which is the same as the basic transformers, and a decoder,
which is augmented with a special new layer called the cross-attention
layer.
• The encoder takes the source language input word tokens X = x1, …,
xn and maps them to an output representation Henc = h1, …, hn via a
stack of encoder blocks.
• The decoder attends to the encoder representation and generates the
target words one by one.
o At each timestep it conditions on the source sentence and the
previously generated target language words to generate a token.
• In order to attend to the source language, the transformer blocks in the
decoder have an extra cross-attention layer.
• Each decoder block therefore contains a self-attention layer that
attends to the input from the previous layer, followed by layer norm, the
cross-attention layer, another layer norm, a feed-forward layer, and a
final layer norm.

Each encoder block consists of:


1. Multi-Head Self-Attention
o Allows the model to focus on different positions in the input
sequence simultaneously.
o Uses scaled dot-product attention in multiple “heads.”
2. Add & Layer Normalization
o Adds the input and the output of the attention layer (residual
connection).
o Applies layer normalization.
3. Feed-Forward Neural Network (FFN)
o Two linear layers with a ReLU (or GELU) in between.
o Applies to each position independently.
4. Add & Layer Normalization (again)
o Adds the input and output of the FFN and normalizes.
Each decoder block adds an additional attention layer:
1. Masked Multi-Head Self-Attention
o Prevents attending to future tokens (important during training).
2. Add & Norm
3. Encoder-Decoder Attention (Cross attention)
o The decoder attends to encoder outputs.
o Helps generate context-aware outputs.
4. Add & Norm
5. Feed-Forward Neural Network
6. Add & Norm
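
The difference between masked self-attention and cross-attention can be sketched with plain scaled dot-product attention. This is a simplified single-head illustration under stated assumptions: it omits the learned query/key/value projections, multiple heads, residual connections and layer norm, and the vectors and names are made up for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=False):
    """Single-head scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        if causal:
            # Mask out future positions so position i only sees positions <= i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        width = len(values[0])
        outputs.append([sum(w * v[c] for w, v in zip(weights, values)) for c in range(width)])
    return outputs

# Toy vectors: three encoder outputs (source side), two decoder states (target side).
encoder_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_states = [[1.0, 0.0], [0.0, 1.0]]

# Masked self-attention: queries, keys and values all come from the decoder.
masked = attention(decoder_states, decoder_states, decoder_states, causal=True)
# Cross-attention: queries come from the decoder, keys and values from the encoder.
cross = attention(decoder_states, encoder_out, encoder_out)
print(masked)
print(cross)
```

The only change between the two calls is where the keys and values come from, which is what lets the decoder look at the source sentence.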

6. Briefly extend on Bias and Ethical Issues.

• MT systems can show gender bias, especially when translating from
languages that do not mark gender on pronouns (like Hungarian) or that
often drop pronouns (like Spanish) into languages where pronouns are
obligatory and gendered (like English).
• They may default to male pronouns due to lack of gender info or cultural
stereotypes.
• Example: Hungarian gender-neutral pronoun "ő" becomes:
o “she” if the job is nurse
o “he” if the job is CEO
• These biases reflect and reinforce gender stereotypes, which is a serious
ethical concern.
• The WinoMT dataset tests MT systems on sentences involving non-
stereotypical gender roles.
• MT systems often perform worse on such sentences.
• Example: A system may mistranslate “The doctor asked the nurse to help
her” if it expects the doctor to be male.
