
Master on Data Science Universitat Politècnica de Catalunya

Mining Unstructured Data - MUD

Final Exam, June 17th, 2024

——– PART A ———

Exercise 1. (3 points)
Given the following morphologically analyzed sentence,
I        saw      bats     yesterday
PRP      VBD      NNS      NN
NN       NN       JJ       ADV
         VBZ
and an HMM model partially represented by the following matrices,
A (transition probabilities; columns PRP, JJ, NN, NNS, VBZ, VBD, ADV; the non-empty entries of each
row are listed in column order):

*     0.4   0.1   0.3   0.1
PRP   0.2   0.3   0.1
JJ    0.8   0.2
NN    0.2   0.3   0.4   0.1
NNS   0.1   0.5
VBZ   0.2   0.2   0.3   0.2
VBD   0.1   0.4   0.2   0.2
ADV   0.1   0.1   0.1   0.2   0.3

B (emission probabilities):

        I      saw    bats   yesterday
PRP     1
JJ                    0.2
NN      0.1    0.4           0.1
NNS                   0.5
VBZ            0.3
VBD            0.5
ADV                          1

a) Apply the Viterbi algorithm to get the best POS-tag sequence. Provide the whole dynamic table with
all the information required to obtain the resulting POS-tag sequence.
b) What is the resulting best POS-tag sequence and what is its probability? The answer must be justified
by means of the information in the dynamic table; otherwise, it will be considered wrong.
c) Is the resulting POS-tag sequence correct? Briefly justify your answer.

Solution
a) The table:

      I              saw            bats            yesterday

PRP   0.4*1
      δ=0.4

JJ                                  0.06*0.1*0.2
                                    δ=0.0012
                                    φ=VBD

NN                                                  max(0.0012*0.8*0.1, 0.006*0.1*0.1)
                                                    δ=max(0.000096, 0.00006)=0.000096
                                                    φ=JJ

NNS                                 0.06*0.2*0.5
                                    δ=0.006
                                    φ=VBD

VBZ

VBD                  0.4*0.3*0.5
                     δ=0.06
                     φ=PRP

ADV

b) Result: PRP VBD JJ NN, with probability 0.000096. It is obtained by taking the best δ in the last
column of the table (NN, δ=0.000096) and following the back-pointers φ: NN ← JJ ← VBD ← PRP.
c) No. "Yesterday" is an adverb (ADV) in the context of the sentence. Note that "saw bats yesterday"
refers to a set of animals seen in a period of time.
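As a complement to the hand computation, the following is a minimal Python sketch of the Viterbi recursion for this exercise. It only includes the HMM entries that actually appear in the solution (the remaining entries of A are not needed here and are treated as 0), so it illustrates the computation rather than reproducing the full exam model.

```python
# Minimal Viterbi sketch for Exercise 1. Missing transition entries are treated
# as probability 0; the data structures and names are illustrative.

words = ["I", "saw", "bats", "yesterday"]
tags = ["PRP", "JJ", "NN", "NNS", "VBZ", "VBD", "ADV"]

# Transition probabilities A[prev][next] (only the entries used in the solution).
A = {
    "*":   {"PRP": 0.4},
    "PRP": {"VBD": 0.3},
    "VBD": {"JJ": 0.1, "NNS": 0.2},
    "JJ":  {"NN": 0.8},
    "NNS": {"NN": 0.1},
}

# Emission probabilities B[tag][word], as given in the exam statement.
B = {
    "PRP": {"I": 1.0},
    "JJ":  {"bats": 0.2},
    "NN":  {"I": 0.1, "saw": 0.4, "yesterday": 0.1},
    "NNS": {"bats": 0.5},
    "VBZ": {"saw": 0.3},
    "VBD": {"saw": 0.5},
    "ADV": {"yesterday": 1.0},
}

def viterbi(words, tags, A, B):
    delta = [{} for _ in words]   # delta[t][tag] = best probability of reaching tag at t
    phi = [{} for _ in words]     # phi[t][tag]   = best previous tag (back-pointer)
    for tag in tags:              # initialisation with the start symbol *
        p = A.get("*", {}).get(tag, 0.0) * B.get(tag, {}).get(words[0], 0.0)
        if p > 0:
            delta[0][tag] = p
    for t in range(1, len(words)):    # recursion
        for tag in tags:
            emit = B.get(tag, {}).get(words[t], 0.0)
            if emit == 0.0:
                continue
            best_prev, best_p = None, 0.0
            for prev, p_prev in delta[t - 1].items():
                p = p_prev * A.get(prev, {}).get(tag, 0.0) * emit
                if p > best_p:
                    best_prev, best_p = prev, p
            if best_prev is not None:
                delta[t][tag] = best_p
                phi[t][tag] = best_prev
    last = max(delta[-1], key=delta[-1].get)   # backtrace from the best final state
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(phi[t][path[-1]])
    return list(reversed(path)), delta[-1][last]

print(viterbi(words, tags, A, B))   # (['PRP', 'VBD', 'JJ', 'NN'], ~9.6e-05)
```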

Exercise 2. (2 points)
Given the following sentence with the result of a POS tagger:

John saw my brother playing with his glasses


NNP VBD PRP$ NN VBG IN PRP$ NNS

NNP: proper noun; VBD: verb, past; PRP$: possessive pronoun; NN/NNS: singular/plural noun; VBG: verb, gerund; IN: preposition

a) We want to learn a CRF model able to recognize noun-phrase chunks. Design one correct feature
template useful for recognizing more than one of the noun-phrase chunks occurring in the sentence.
Derive one correct and useful feature function.

b) Draw the parse trees derived by the following PCFG for the sentence. Which is the best parse tree?
What is the result of using the CKY algorithm with this grammar? Briefly justify your answers.

S  → NP VP (1.0)         NNP  → John (1.0)
NP → PRP$ NN (0.5)       NN   → brother (1.0)
NP → PRP$ NNS (0.3)      NNS  → glasses (1.0)
NP → NNP (0.1)           PRP$ → my (0.6)
NP → NP AP (0.1)         PRP$ → his (0.4)
PP → IN NP (1.0)         IN   → with (1.0)
AP → VBG PP (1.0)        VBD  → saw (0.7)
VP → VBD NP (0.4)        VBG  → playing (0.3)
VP → VP AP (0.6)

Solution
a) A template is correct if (i) it is defined only in terms of its parameters and metaparameters, and
(ii) it necessarily involves the current state, x_t. Obviously, the observations must be words of the
modeled language and the states must take values from the BIO labels. A template is considered a
priori useful if it makes sense for the specific task, even if the associated λ-value turns out to be zero
after learning the model.
A possible correct and a priori useful template could be the following:

f_{a,b}(x_{t-1}, x_t, W, t) = 1 if x_t = a and pos(w_t) = b ; 0 otherwise

Note that the current state, x_t, is defined, and the template is correct if the values for meta-
parameter a are BIO labels. Also note that the template is a priori useful for recognizing 2 noun-
phrase chunks (as required) because we can derive feature functions like the following one, with
which we define that a noun-phrase chunk can start with a particular PoS tag (PRP$):

f_{B,PRP$}(x_{t-1}, x_t, W, t) = 1 if x_t = B and pos(w_t) = PRP$ ; 0 otherwise

This feature function is useful to identify "my" and "his" as starting words of the noun-phrase
chunks "my brother" and "his glasses", because both words are labeled with the PoS tag PRP$. With
the Viterbi algorithm, the combination of this feature with others would be optimized to achieve the
recognition of optimal BIO sequences.
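For concreteness, here is a small Python sketch (names and data structures are illustrative, not tied to any particular CRF library) that instantiates the template for a = B and b = PRP$ and shows where it fires on the exam sentence:

```python
# Sketch of the feature template f_{a,b}(x_{t-1}, x_t, W, t) from the solution.

def make_feature(a, b):
    """Return f_{a,b}: 1 if the current label is a and the current PoS tag is b."""
    def f(x_prev, x_t, W, t):
        _, pos = W[t]          # W is the observed sentence as (word, PoS-tag) pairs
        return 1 if (x_t == a and pos == b) else 0
    return f

f_B_PRPS = make_feature("B", "PRP$")

# "John saw my brother playing with his glasses" with the given PoS tags.
W = [("John", "NNP"), ("saw", "VBD"), ("my", "PRP$"), ("brother", "NN"),
     ("playing", "VBG"), ("with", "IN"), ("his", "PRP$"), ("glasses", "NNS")]

# The feature fires at t=2 ("my") and t=6 ("his"), i.e. at the first tokens of
# the two noun-phrase chunks "my brother" and "his glasses".
print([t for t in range(len(W)) if f_B_PRPS(None, "B", W, t) == 1])   # [2, 6]
```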

b) The parse trees that can be derived from the grammar are the following:

The first tree attaches the AP inside the object noun phrase (rule NP → NP AP):

(S (NP(0.1) (NNP John))
   (VP(0.4) (VBD saw)
            (NP(0.1) (NP(0.5) (PRP$ my(0.6)) (NN brother))
                     (AP (VBG playing)
                         (PP (IN with)
                             (NP(0.3) (PRP$ his(0.4)) (NNS glasses)))))))

The second tree attaches the AP to the verb phrase (rule VP → VP AP):

(S (NP(0.1) (NNP John))
   (VP(0.6) (VP(0.4) (VBD saw)
                     (NP(0.5) (PRP$ my(0.6)) (NN brother)))
            (AP (VBG playing)
                (PP (IN with)
                    (NP(0.3) (PRP$ his(0.4)) (NNS glasses))))))
Their probabilities are 1.44e-4 and 8.64e-4, respectively, so the best tree is the second one.
Note that the grammar is not in CNF, so CKY cannot be applied. It is acceptable to answer that
CKY cannot be applied or that CKY returns an ERROR. It is not acceptable to answer that the
grammar can be transformed into CNF and CKY then applied, because the transformed grammar
is a different grammar from the one given.
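The CNF remark can be checked mechanically; below is a minimal Python sketch (rule tuples written ad hoc for this grammar) that tests whether every rule has the form A → B C or A → w. The unit production NP → NNP is the one that already violates CNF.

```python
# Check whether the exam PCFG is in Chomsky Normal Form (CNF).
# Rules are written as (lhs, rhs) tuples; terminals are the words of the sentence.

rules = [
    ("S", ("NP", "VP")), ("NP", ("PRP$", "NN")), ("NP", ("PRP$", "NNS")),
    ("NP", ("NNP",)), ("NP", ("NP", "AP")), ("PP", ("IN", "NP")),
    ("AP", ("VBG", "PP")), ("VP", ("VBD", "NP")), ("VP", ("VP", "AP")),
    ("NNP", ("John",)), ("NN", ("brother",)), ("NNS", ("glasses",)),
    ("PRP$", ("my",)), ("PRP$", ("his",)), ("IN", ("with",)),
    ("VBD", ("saw",)), ("VBG", ("playing",)),
]

nonterminals = {lhs for lhs, _ in rules}

def in_cnf(lhs, rhs):
    # CNF allows A -> B C (two nonterminals) or A -> w (a single terminal).
    if len(rhs) == 2:
        return all(symbol in nonterminals for symbol in rhs)
    return len(rhs) == 1 and rhs[0] not in nonterminals

violations = [(lhs, rhs) for lhs, rhs in rules if not in_cnf(lhs, rhs)]
print(violations)   # [('NP', ('NNP',))] -- the unit production blocks plain CKY
```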

——– PART B ———

Exercise 3. (3 points)
You are evaluating a set of word embedding models for various NLP tasks. You are using extrinsic eval-
uation methods to assess the performance of these models. Your analysis reveals that the performance
of the models varies significantly depending on the training parameters, the specific NLP task, and the
nature of the text data.

Explain the following observations you made while evaluating the different models, providing a
theoretical justification for each. You can provide examples to illustrate your points.
(a) Word embeddings trained with a larger window size tend to perform better in semantic similarity
tasks, while those trained with a smaller window size excel in syntactic analogy tasks.

(b) Word embeddings based on TF-IDF work better than Word2Vec for author classification in poems.
(c) FastText works better for topic extraction in tweets whereas Word2Vec works better for topic extrac-
tion in paper abstracts.
(d) PPMI-based word embeddings outperform Word2Vec for identifying the semantic similarity of rare
words in a corpus.
(e) Sentence embeddings generated by averaging GloVe word embeddings outperform BERT embed-
dings for identifying duplicate questions in a community forum.
(f) Contextual embeddings from the same BERT model but obtained from different layers are suitable
for different tasks: embeddings from earlier layers are better at part-of-speech tagging, while those
from later layers excel at sentence classification.

Solution
(a) Word embeddings trained with a bigger window aggregate more information regarding the semantic
context of each word, but lose positional information within a sentence. Smaller windows will
capture the function of a word within a sentence, making them suitable for syntactic analogy tasks
where the relative position of words is key. For example, a model trained on a large window might
learn that ”king” and ”queen” are semantically similar due to their frequent co-occurrence in royal
contexts. However, a model trained on a smaller window would be better at recognising that ”king”
is to ”rule” as ”chef” is to ”cook”.
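One way to observe this effect in practice is to train two Word2Vec models that differ only in window size. The sketch below assumes gensim 4.x and uses a tiny toy corpus just so it runs end to end; real conclusions would require a large corpus.

```python
# Compare a "semantic" (large-window) and a "syntactic" (small-window) Word2Vec model.
from gensim.models import Word2Vec

# Toy corpus, for illustration only.
sentences = [
    ["the", "king", "ruled", "the", "kingdom", "with", "the", "queen"],
    ["the", "queen", "ruled", "the", "kingdom", "with", "the", "king"],
    ["the", "chef", "cooked", "dinner", "in", "the", "kitchen"],
    ["the", "chef", "cooked", "lunch", "in", "the", "kitchen"],
]

semantic_model  = Word2Vec(sentences, vector_size=50, window=10, min_count=1, sg=1, epochs=50, seed=1)
syntactic_model = Word2Vec(sentences, vector_size=50, window=2,  min_count=1, sg=1, epochs=50, seed=1)

# On a large corpus, the large-window model tends to rank topically related words
# higher, while the small-window model favours words with the same syntactic role.
print(semantic_model.wv.most_similar("king", topn=3))
print(syntactic_model.wv.most_similar("king", topn=3))
```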
(b) TF-IDF based word embeddings capture the importance of words specific to a particular document
or author in a corpus. This makes them suitable for tasks like author classification in poems, where
stylistic choices and unique vocabulary are strong indicators of authorship. Word2Vec, on the other
hand, learns embeddings based on the co-occurrence of words across the corpus and might not
be as effective in capturing individual writing styles present in a limited set of poems.
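A hedged sketch of the TF-IDF route, assuming scikit-learn and a tiny hypothetical set of labeled poem fragments (neither is part of the exam):

```python
# TF-IDF features + a linear classifier for authorship classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data: each poem fragment paired with its author label.
poems = [
    "Shall I compare thee to a summer's day",
    "Because I could not stop for Death",
    "Thou art more lovely and more temperate",
    "He kindly stopped for me",
]
labels = ["shakespeare", "dickinson", "shakespeare", "dickinson"]

# TF-IDF gives high weight to author-specific vocabulary (e.g. "thee", "thou"),
# which is exactly the stylistic signal useful for authorship classification.
classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
classifier.fit(poems, labels)
print(classifier.predict(["Thou art a summer's day"]))
```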
(c) FastText considers character n-grams within words, while Word2Vec works at the level of whole
words. This makes FastText more robust to noisy text such as tweets, which often include
misspellings and informal language, as it can still capture meaning from partial word representations.
This is necessary for topic extraction in tweets. In contrast, Word2Vec’s reliance on full-word con-
texts makes it more suitable for topic extraction in paper abstracts, which are generally written using
formal language and consistent terminology.
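The subword point can be illustrated with a minimal gensim sketch (toy corpus, illustrative only): FastText still produces a vector for a misspelled, out-of-vocabulary word from its character n-grams.

```python
from gensim.models import FastText

# Toy "tweets"; real topic extraction would use a much larger corpus.
tweets = [
    ["loving", "the", "new", "phone", "battery"],
    ["this", "phone", "battery", "is", "amazing"],
    ["terrible", "battery", "life", "on", "my", "phone"],
]

model = FastText(tweets, vector_size=50, window=3, min_count=1, epochs=20)

# "batery" never appears in the corpus, but FastText builds a vector for it
# from the character n-grams it shares with "battery".
print("batery" in model.wv.key_to_index)            # False: out of vocabulary
print(model.wv.similarity("batery", "battery"))     # still computable from shared n-grams
```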

(d) PPMI-based word embeddings address the issue of rare words by normalizing the joint probability of
a word-context pair by the product of the individual probabilities, giving more weight to statistically
significant relationships. This makes
them particularly effective for identifying the semantic similarity of rare words, which might not
co-occur frequently enough in a corpus for Word2Vec to learn accurate representations. For exam-
ple, PPMI would be more likely to identify the similarity between ”serendipitous” and ”fortuitous”
even if they appear infrequently, as their co-occurrence is statistically significant compared to their
individual occurrences.
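The PPMI weighting itself is easy to make concrete. Below is a minimal numpy sketch over a small hypothetical word-context co-occurrence matrix (the counts are invented for illustration):

```python
import numpy as np

# Hypothetical co-occurrence counts (rows: target words, columns: context words).
words    = ["serendipitous", "fortuitous", "cat"]
contexts = ["discovery", "encounter", "meow"]
counts = np.array([
    [2.0, 1.0, 0.0],   # serendipitous
    [1.0, 2.0, 0.0],   # fortuitous
    [0.0, 0.0, 5.0],   # cat
])

total = counts.sum()
p_wc = counts / total                     # joint probabilities P(w, c)
p_w  = p_wc.sum(axis=1, keepdims=True)    # marginals P(w)
p_c  = p_wc.sum(axis=0, keepdims=True)    # marginals P(c)

with np.errstate(divide="ignore"):
    pmi = np.log2(p_wc / (p_w * p_c))     # PMI(w, c) = log2 P(w,c) / (P(w) P(c))
ppmi = np.maximum(pmi, 0.0)               # PPMI keeps only the positive part

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Rare words that systematically co-occur with the same contexts end up with
# similar PPMI rows, hence a high cosine similarity, unlike the unrelated "cat".
print(np.round(ppmi, 2))
print(cosine(ppmi[0], ppmi[1]), cosine(ppmi[0], ppmi[2]))
```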
(e) Averaging GloVe word embeddings creates a simple sentence representation that captures the overall
semantic content. This is suitable for identifying duplicate questions in a community forum, where
the focus is on semantic equivalence rather than subtle nuances in meaning or word order. BERT
embeddings, while powerful, can be sensitive to word order and context, potentially overfitting to
small variations in duplicate questions. For example, ”How do I bake a cake?” and ”What is the recipe
for a cake?” are semantically similar but have different structures that BERT might overemphasize.

An alternative and also valid answer to this question comes from the fact that BERT needs to be
fine-tuned for semantic similarity tasks, so a general pre-trained MLM BERT model might not be
suitable for the task, whereas GloVe word embeddings should work out of the box.
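A minimal sketch of the averaging approach, assuming gensim is installed (api.load downloads the pre-trained "glove-wiki-gigaword-50" vectors on first use):

```python
# Sentence embeddings by averaging pre-trained GloVe vectors, with cosine
# similarity as a duplicate-question score.
import numpy as np
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")

def sentence_embedding(sentence):
    # Average the vectors of the in-vocabulary tokens (out-of-vocabulary tokens are ignored).
    vectors = [glove[w] for w in sentence.lower().split() if w in glove]
    return np.mean(vectors, axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

q1 = "how do i bake a cake"
q2 = "what is the recipe for a cake"
q3 = "how do i change a flat tire"

print(cosine(sentence_embedding(q1), sentence_embedding(q2)))   # expected: higher (near-duplicates)
print(cosine(sentence_embedding(q1), sentence_embedding(q3)))   # expected: lower (different topic)
```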
(f) BERT’s layered architecture allows it to capture different levels of linguistic information. Earlier
layers tend to encode more syntactic information, as they are closer to the word-level input. As the
information propagates through the layers, it becomes more abstract and semantically rich. As a
result, later layers are better suited for tasks requiring sentence-level understanding, such as sentiment
analysis or sentence classification. For example, earlier layers might be good at identifying the part
of speech of ”running” in ”I am running late”, while later layers would be better at understanding
the overall meaning of the sentence conveying lateness.
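This can be inspected directly with the transformers library (sketch below; it assumes the library is installed and downloads bert-base-uncased on first use):

```python
# Extract token representations from an early and a late encoder layer of BERT.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("I am running late", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple: the embedding layer plus one tensor per encoder
# layer (13 entries for BERT-base), each of shape (batch, tokens, hidden_size).
hidden_states = outputs.hidden_states
early = hidden_states[2]     # lower layer: more surface/syntactic information (e.g. PoS)
late  = hidden_states[12]    # top layer: more abstract, sentence-level information
print(len(hidden_states), early.shape, late.shape)
```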

Exercise 4. (2 points)
You are fine-tuning a large language model (LLM) pre-trained on a large, general-purpose dataset to
develop a specialized home assistant chatbot. During the fine-tuning process, you observe that while the
model’s performance on the home assistant tasks initially improves, it then begins to degrade rapidly.
Furthermore, you notice a significant drop in the model’s performance on the original, general-purpose
tasks it was initially trained on. This phenomenon is known as catastrophic forgetting.
(a) Explain two potential causes for this catastrophic forgetting in your fine-tuned LLM.
(b) Propose two strategies to mitigate or prevent catastrophic forgetting and preserve the LLM’s perfor-
mance on both the original and new tasks.
(c) You decide to improve your home assistant chatbot by incorporating user feedback. How does
Reinforcement Learning from Human Feedback (RLHF) differ from traditional fine-tuning in this
context?

Solution
(a) Potential Causes for Catastrophic Forgetting (only two are requested):
• Overwriting of Shared Representations: Fine-tuning on a specialized dataset can overwrite the
general knowledge representations learned during pre-training, especially if the new data is
significantly different in domain or style. This is because both pre-training and fine-tuning
tasks often share the same underlying model parameters, especially in the lower layers.
• Dataset bias: The home assistant dataset likely has a different data distribution compared to
the general-purpose dataset. This bias can cause the model to overfit to the specific patterns in
the home assistant data, degrading its ability to generalize to the context of the original tasks.
• Insufficient Training Data: If the fine-tuning dataset for the home assistant is relatively small,
the model may not have enough examples to learn the new task effectively without sacrificing
its previously acquired knowledge.
• Aggressive Optimization: Using a high learning rate or training for too many epochs during
fine-tuning can lead to significant changes in the model’s parameters, potentially overwriting
the more subtle representations learned during pre-training.
(b) Strategies to Mitigate Catastrophic Forgetting (only two are requested):

• Parameter Freezing: Instead of fine-tuning all the model parameters, we could freeze the lower
layers responsible for general language understanding and only train the upper layers on the
new dataset. This preserves the pre-trained knowledge while allowing specialization for the
new task. A minimal code sketch of this strategy is given after this list.
• Adaptation techniques: Instead of updating all parameters of a layer, only fine-tune a subset
of them, or add additional weights. This includes techniques such as bias tuning, adapter
modules/matrices and Low-rank adaptation (LoRA).

• Regularization Techniques: Applying regularization methods like L2 regularization or dropout
during fine-tuning can prevent the model from overfitting the new task and hence deviating too
much from the original knowledge. Other more specialized regularization techniques like Elas-
tic Weight Consolidation (EWC) or Synaptic Intelligence (SI) can also be used. These discourage
the model from drastically changing the weights important for the original tasks.
• Multi-Task Learning: Train the LLM on both the original and new tasks simultaneously. This
can be done by interleaving data from both datasets during training.
• Proximal Policy Optimization (PPO): PPO can be applied to mitigate catastrophic forgetting.
PPO limits the update size of the model parameters during fine-tuning, ensuring that the new
knowledge is integrated gradually without drastically deviating from the original distribution.
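As announced above, here is a minimal sketch of the parameter-freezing strategy, using GPT-2 as a stand-in for the pre-trained LLM (the exercise does not name a specific model) and assuming the transformers library is installed:

```python
# Freeze the lower transformer blocks before fine-tuning on the home-assistant data.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")   # 12 transformer blocks

# The lower blocks carry most of the general-purpose language knowledge, so we
# freeze them; only the upper blocks keep being updated during fine-tuning.
for block in model.transformer.h[:6]:
    for param in block.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable} / {total}")
```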

(c) RLHF vs. Traditional Fine-Tuning:

• Traditional fine-tuning relies on a fixed dataset with predefined labels/sequences to adjust the
model parameters and improve performance on the specific task. In our context (home assis-
tant), this involves training the model on a dataset of user queries paired with expected chatbot
responses.
• RLHF incorporates human feedback directly into the training loop. Instead of relying only
on predefined labels/sequences, RLHF utilizes human evaluation to assess the quality of the
chatbot’s responses. This feedback, usually given as rankings or preferences between different
responses, is used as a reward signal to train a reward model. The reward model then guides the
chatbot’s learning process through reinforcement learning, encouraging it to generate responses
that align better with human preferences.
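As a sketch of the reward-model step described above, the snippet below implements a pairwise preference loss that pushes the reward of the human-preferred response above that of the rejected one. The reward model is abstracted as a toy module over fixed-size encodings; all names are illustrative.

```python
# Toy reward model trained with a pairwise ranking loss: -log sigmoid(r_chosen - r_rejected).
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Maps a (prompt, response) encoding to a scalar reward."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, encoding):                  # encoding: (batch, hidden_size)
        return self.score(encoding).squeeze(-1)   # one scalar reward per example

reward_model = RewardHead()

# Placeholder encodings of (prompt, chosen response) and (prompt, rejected response);
# in practice these would come from the chatbot's own encoder.
chosen_enc   = torch.randn(4, 768)
rejected_enc = torch.randn(4, 768)

r_chosen   = reward_model(chosen_enc)
r_rejected = reward_model(rejected_enc)

loss = -nn.functional.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()   # gradients would then be used to update the reward model
print(float(loss))
```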
