Text For Chapter 4
Keywords: Emotion Detection, Large Language Model tuning, Tourism Reviews, Crowdsourcing, Natural Language Processing
——————–
Some text here was deleted to focus more on the text that
needs to be included in Chapter 4.
——————–
analysis, sarcasm detection, and emotion detection.

2. EmoNet²: a fine-grained emotion detection tool based on Gated Recurrent Neural Networks (GRNNs), implemented by Abdul-Mageed and Ungar (2017). GRNNs are a type of neural network well suited to tasks such as emotion detection because they are able to learn long-term dependencies in text data. EmoNet is trained on a massive dataset of labelled text examples, including over a million examples from a variety of sources such as social media, news articles and movie reviews. This diversity of data allows EmoNet to learn the nuances of human emotion and to accurately detect a wide range of emotions, even in complex and challenging contexts.
3. pysentimiento³: a Python toolkit that aims to provide access to state-of-the-art large language models for sentiment analysis and social NLP tasks (Pérez et al., 2021). Multiple language models were tested on a multilingual emotion dataset labelled with six basic emotions: “anger”, “disgust”, “fear”, “joy”, “sadness” and “surprise”. (A minimal usage sketch is given after this list.)
4. ETT⁴: Poth et al. (2021) propose a method for efficiently selecting intermediate tasks that can improve the performance of a variety of NLP tasks. Their method is based on the observation that embedding-based methods, which rely solely on the respective datasets, outperform computationally expensive few-shot fine-tuning approaches. They evaluated their method on a diverse set of 42 intermediate and 11 target English classification, multiple choice, question answering and sequence tagging tasks. Emotion detection is one of the wide range of tasks they cover. The model was trained on six emotions: “anger”, “love”, “fear”, “joy”, “sadness” and “surprise”.
The tools described above were evaluated using TORCE as the test dataset, with the widely used F-score as the evaluation metric. Because the TORCE dataset is imbalanced in terms of the proportion of data from each emotion category, our evaluation uses a weighted average F-score:

    F_weighted = Σ_i (support_i × F_i) / Σ_i support_i

where support_i is the number of reviews in class i and F_i is the F-score for class i. By employing this weighted F-score, the emotion classes with a higher number of reviews in the dataset have a higher influence on the overall F-score.
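For concreteness, this weighted average F-score corresponds to scikit-learn’s average="weighted" option; a minimal sketch with invented toy labels (not TORCE data) is shown below.

```python
# Minimal sketch: weighted average F-score as defined above,
# i.e. sum_i(support_i * F_i) / sum_i(support_i).
# The gold and predicted labels are invented toy data, not TORCE.
from sklearn.metrics import f1_score

gold = ["joy", "joy", "anger", "sadness", "joy", "surprise", "anger"]
pred = ["joy", "anger", "anger", "joy", "joy", "surprise", "anger"]

# "weighted" averages the per-class F-scores, weighting each class by its support
weighted_f = f1_score(gold, pred, average="weighted")
print(f"Weighted average F-score: {weighted_f:.2f}")
```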
Table 1 shows the weight of each emotion class in our TORCE dataset.

Emotion class    Weight
Anger            25%
Sadness          10%
Joy              49%
Surprise         10%
Anticipation      6%

Table 1: The weight of each emotion class in the dataset.

Table 2 shows the weighted average F-score for each of the tools under evaluation. As shown, pysentimiento has the best overall performance on the TORCE dataset. Figure 2 shows the F-score for each detected emotion class. It is worth mentioning that pysentimiento and ETT do not support the “anticipation” class. We can observe that “joy” is the easiest class to detect; in contrast, the other classes seem to be harder to detect in tourism review data.

Emotion Detection Tool    F-score
NRC                       0.47
EmoNet                    0.52
pysentimiento             0.67
ETT                       0.59

Table 2: Weighted average F-score.

Figure 2: F-score for each emotion class.

² https://github.com/UBC-NLP/EmoNet
³ https://github.com/pysentimiento/pysentimiento
⁴ https://github.com/adapter-hub/efficient-task-transfer
2. Tuning Language Models for Tourism Review Emotion Detection
Machine learning models are sensitive to the quality and quantity of the training data. If the labelled data contains errors or biases, the model can learn and perpetuate these mistakes. It is also difficult for machine learning models to deal with imbalanced data. For example, a model trained on an unbalanced emotion dataset is more likely to make mistakes on the minority emotion classes, because it has seen fewer examples of those classes during training and may not have learned to identify them effectively. To address this issue, several techniques can be used to mitigate the negative impact of an unbalanced training dataset, including oversampling, undersampling and weighted learning. By using these techniques, the performance of machine learning models on unbalanced datasets can be improved.
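As a small illustration of the weighted-learning option mentioned above (not the approach ultimately used in this study), one can weight the loss by inverse class frequency; the class counts below are invented toy numbers, not the TORCE distribution.

```python
# Minimal sketch of "weighted learning": minority emotion classes get a larger
# weight in the loss, so errors on them cost more during training.
# The class counts are invented toy numbers, not the TORCE distribution.
import torch
import torch.nn as nn

# e.g. counts for anger, sadness, joy, surprise, anticipation
class_counts = torch.tensor([250.0, 100.0, 490.0, 100.0, 60.0])
class_weights = class_counts.sum() / (len(class_counts) * class_counts)  # inverse-frequency weights

# Class-weighted cross-entropy; plug into any classifier training loop.
criterion = nn.CrossEntropyLoss(weight=class_weights)
```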
As shown in Figure 1, our tourism review dataset suffers from the scarcity of some emotion classes. Therefore, we wanted to test different techniques to oversample the scarce emotion classes in order to bring them closer to the majority classes in the training dataset. For this purpose, we chose an augmentation approach based on the method suggested by Wei and Zou (2019), explained below (a brief implementation sketch follows the list):
1. Random Insertion (RI): Based on the context of the review and using the BERT language model, up to two words are added to each review.

2. Random Deletion (RD): Randomly delete up to two words per review.

3. Random Swapping (RS): Randomly swap two neighbouring words. This process is applied up to two times for each review.

4. Synonym Replacement (SR): Randomly replace up to two words with their synonyms using the BERT language model.

5. All augmentation methods combined: mixed use of the above-mentioned augmentation methods.
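The sketch below is one plausible way to implement the random insertion, deletion and swapping operations, using a BERT fill-mask pipeline so that inserted words fit the context of the review; it is a simplified illustration under our own assumptions, not the exact implementation used in this study.

```python
# Simplified sketch of three of the augmentation operations described above.
# Random Insertion uses a BERT fill-mask pipeline so the inserted word fits the
# review's context; this is an illustrative approximation, not the paper's code.
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def random_insertion(review: str, n_insertions: int = 2) -> str:
    words = review.split()
    for _ in range(n_insertions):
        pos = random.randint(0, len(words))
        masked = " ".join(words[:pos] + [fill_mask.tokenizer.mask_token] + words[pos:])
        best = fill_mask(masked)[0]  # top BERT prediction for the masked slot
        words.insert(pos, best["token_str"].strip())
    return " ".join(words)

def random_deletion(review: str, n_deletions: int = 2) -> str:
    words = review.split()
    for _ in range(min(n_deletions, max(len(words) - 1, 0))):
        words.pop(random.randrange(len(words)))
    return " ".join(words)

def random_swapping(review: str, n_swaps: int = 2) -> str:
    words = review.split()
    for _ in range(n_swaps):
        if len(words) < 2:
            break
        i = random.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]  # swap two neighbouring words
    return " ".join(words)

print(random_insertion("The view from the hotel balcony was stunning"))
```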
In our experiment on tuning Large Language Models (LLMs), we tested three models: 1) BERT (Devlin et al., 2019) base uncased, trained on two datasets, English Wikipedia and BookCorpus (Zhu et al., 2015); 2) DistilBert (Sanh et al., 2020) base uncased, trained on the same datasets as BERT but smaller and faster because it has fewer parameters; and 3) Roberta (Liu et al., 2019) base, trained on five datasets weighing 160GB of text combined. The models were selected from the Transformers package⁵.

Before tuning these models, we tested the original models for emotion classification, and they produced poor F-scores, as shown in Table 3. Such results were expected because the models are trained on large generic data and need to be tuned for specific tasks such as text classification (Devlin et al., 2019).

Language model    F-score
BERT              0.11
DistilBert        0.31
Roberta           0.32

Table 3: Weighted average F-score for the language models without fine-tuning.

With regard to tuning LLMs, a critical factor is choosing an optimal learning rate. The learning rate is a hyperparameter that controls the speed at which a machine learning model updates its parameters. If the learning rate is too high, an LLM may not fully learn the patterns in the training data and may have difficulty generalizing to new data. On the other hand, with a too low learning rate an LLM may take a very long time to train and may not achieve the desired performance metrics. It is therefore important to run trials with different learning rates to find the optimal value. We tested a set of learning rates for each language model, namely 2e-5, 1e-5, 5e-6 and 1e-6. The batch size was set to 16, and each model was trained for up to 20 epochs. Table 4 shows the weighted average F-score of emotion classification on our tourism reviews dataset for the different LLMs, learning rates and augmentation techniques.
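A rough sketch of this tuning setup (one model, one learning rate, batch size 16, up to 20 epochs) using the Hugging Face Trainer is shown below; the CSV file names and the “text”/“label” column names are assumptions for illustration, not the actual TORCE files.

```python
# Schematic fine-tuning sketch: one model, one learning rate, batch size 16,
# up to 20 epochs. File names and column names ("text", "label") are
# assumptions for illustration, not the study's actual data files.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "roberta-base"  # also tested: "bert-base-uncased", "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)

# Hypothetical CSV files holding (augmented) reviews and integer emotion labels.
dataset = load_dataset("csv", data_files={"train": "torce_train_augmented.csv",
                                          "test": "torce_test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="torce-emotion",
    learning_rate=1e-5,              # tested values: 2e-5, 1e-5, 5e-6, 1e-6
    per_device_train_batch_size=16,
    num_train_epochs=20,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"], eval_dataset=dataset["test"])
trainer.train()
```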
⁵ https://github.com/huggingface/transformers
As shown in Table 4, each of BERT, DistilBert and Roberta was tuned with the four augmentation techniques and their combination, and four learning rates were tested for each model and tuning method. The F-scores in blue and red indicate the best results for each learning rate.

As a result, the Roberta model with the random insertion tuning method and a 1e-5 learning rate produced the best F-score of 0.80 (the red-coloured score). Overall, the Roberta model produced consistently higher F-scores than the other two models, with promising results for all augmentation methods. Figure 3 shows the F-scores for the four learning rates across tuning methods, and Figure 4 shows the F-scores for the five emotion classes produced by this model. In Figure 3, the codes “RI”, “RD”, “RS”, “SR” and “All Combine” indicate “Random Insertion”, “Random Deletion”, “Random Swapping”, “Synonym Replacement” and all augmentation methods combined, respectively.

Lang. Model   Learning Rate   Random Insertion   Random Deletion   Random Swapping   Synonym Replacement   All combined
BERT          2e-5            0.76               0.74              0.78              0.75                  0.78
BERT          1e-5            0.77               0.74              0.77              0.75                  0.77
BERT          5e-6            0.77               0.74              0.76              0.75                  0.77
BERT          1e-6            0.73               0.72              0.72              0.74                  0.72
DistilBert    2e-5            0.72               0.73              0.74              0.73                  0.74
DistilBert    1e-5            0.74               0.72              0.74              0.73                  0.72
DistilBert    5e-6            0.73               0.72              0.74              0.71                  0.72
DistilBert    1e-6            0.65               0.64              0.65              0.66                  0.65
Roberta       2e-5            0.78               0.79              0.77              0.79                  0.77
Roberta       1e-5            0.80               0.77              0.79              0.79                  0.79
Roberta       5e-6            0.77               0.77              0.77              0.79                  0.77
Roberta       1e-6            0.76               0.75              0.77              0.76                  0.75

Table 4: Weighted average F-score for each language model, learning rate and augmentation method.
3. Conclusion

In this paper, we reported on a study exploring the performance of existing emotion detection tools and large language models (LLMs) on tourism reviews from social media. In particular, we examined the impact of language model tuning using augmentation techniques on the performance of emotion classification. The experiments are based on a new tourism emotion corpus named TORCE, which was compiled by collecting tourists’ reviews from the tourism website TripAdvisor and manually annotating them with emotion information. Our experimental results demonstrate that LLM tuning has a positive impact on automatic emotion classification, and that with suitable tuning methods we can expect a significant improvement in the performance of LLM-based tools. In future work, we will extend the study to a larger test dataset and more LLMs.
4. Bibliographical References

Jason Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks.

Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books.